Summary:
The abundance of various cell types can vary significantly among patients with varying phenotypes and even those with the same phenotype. Recent scientific advancements provide mounting evidence that other clinical variables, such as age, gender, and lifestyle habits, can also influence the abundance of certain cell types. However, current methods for integrating single-cell-level omics data with clinical variables are inadequate. In this study, we propose a regularized Bayesian Dirichlet-multinomial regression framework to investigate the relationship between single-cell RNA sequencing data and patient-level clinical data. Additionally, the model employs a novel hierarchical tree structure to identify such relationships at different cell-type levels. Our model successfully uncovers significant associations between specific cell types and clinical variables across three distinct diseases: pulmonary fibrosis, COVID-19, and non-small cell lung cancer. This integrative analysis provides biological insights and could potentially inform clinical interventions for various diseases.
Keywords: Dirichlet-multinomial regression models, spike-and-slab priors, hierarchical tree, integrative analysis, single-cell RNA sequencing
1. Introduction
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool for discerning cell types within complex tissues and elucidating their functional roles (Kolodziejczyk et al., 2015; Papalexi and Satija, 2018; Luecken and Theis, 2019). However, translating cell type abundance into phenotypic associations is increasingly recognized as contingent upon multifaceted clinical variables such as age, gender, race, and ethnicity categories, among others. Recent scientific breakthroughs underscore the profound influence of these factors on modulating the abundance of specific cell types (Newman et al., 2019), yet existing methodologies for integrating scRNA-seq data with clinical variables are still inadequate. Therefore, there is an increasing need for innovative statistical approaches that can effectively integrate single-cell-level omics data with diverse clinical variables, thereby enhancing our understanding of the intricate relationships between cellular composition and phenotypic traits in scRNA-seq studies.
The integration of biological profiles (e.g., microarray data, bulk RNA-seq data, metagenomics data, etc) and clinical data has long been of interest. For instance, Gevaert et al. (2006) proposed a Bayesian network to integrate the microarray and clinical data for predicting the prognosis of breast cancer. Zhu et al. (2017) integrated clinical and multiple omics data for prognostic assessment across human cancers with the help of a kernel learning method. Li et al. (2017) developed a multivariate zero-inflated logistic normal model to quantify the associations between microbiome abundances and multiple factors based on microbiome compositional data instead of the count data. The analysis of sequence count data, particularly when integrating clinical variables, has been addressed through various well-established statistical methods. For instance, Wadsworth et al. (2017) introduced an integrative Bayesian Dirichlet-multinomial (DM) regression model tailored for microbiome sequencing reads. Jiang et al. (2021) developed a zero-inflated negative binomial (NB) regression model for similar datasets, utilizing the paired taxonomic tree structure to enhance the integrative analysis. The renowned DESeq2 package (Love et al., 2014) employs the NB distribution for analyzing bulk RNA-seq data. These models typically rely on Poisson or NB distributions for count data, with the DM distribution gaining popularity due to its effectiveness in characterizing the compositionality of some sequence count data (e.g., microbiome data). Zero-inflated distributions have been employed to address the prevalent issue of data sparsity, such as in microbiome and scRNA-seq datasets. However, the specific investigation of the association between clinical variables and scRNA-seq data, as well as its profound implications for certain diseases and biological significance, remains underexplored.
In this study, we employ a DM log-linear regression model to analyze cell type abundance based on scRNA-seq data alongside relevant clinical variables. Drawing upon the previous works (Wadsworth et al., 2017; Jiang et al., 2021), our approach facilitates the examination of associations between cell types and various clinical variables. We adopt spike-and-slab priors for the regression coefficients to selectively identify significant relationships. The efficacy of our model is demonstrated through a simulation study and further validated using three real datasets from distinct diseases. To better elucidate the connections between cell type abundance derived from scRNA-seq data and the related clinical variables, we construct a hierarchical tree approach that highlights the impact of these clinical variables on different levels of cell types. Exploring the association between clinical variables and cell types offers valuable insights into the role of cell abundance in defining phenotypes and the mechanisms through which these variables affect cellular dynamics. This understanding is crucial for deciphering the complex interplay between cellular composition and external factors, thereby advancing our knowledge of cellular function and its impact on broader biological systems.
The article is organized as follows. Section 2 introduces the data preprocessing steps that generate the cell-type abundance data and outlines the data notations. Section 3 describes the structure of the Bayesian DM log-linear regression model, incorporating spike-and-slab priors to enhance model robustness. Section 4 details the Markov chain Monte Carlo (MCMC) algorithm and the posterior inference of the key model parameters. In Section 5, we evaluate the model’s performance through a simulation study and present the results from three case studies involving pulmonary fibrosis, COVID-19, and non-small cell lung cancer. Our conclusions are presented in Section 6.
2. Data
2.1. Data preparation
The raw scRNA-seq data, consisting of short reads from the transcript 3′ end with unique molecular identifiers (UMIs) for each cell type, are processed using the Cell Ranger pipeline (v6.1.1) for sample demultiplexing, barcode processing, and generation of the single-cell gene count matrix. Raw reads in FASTQ files are aligned to the human hg38 reference genome. The Seurat package (v4.1.1) (Hao et al., 2021) is used for clustering and uniform manifold approximation and projection (UMAP) analysis. Differentially expressed genes (DEGs) are found using the Wilcoxon rank-sum test with a p-value threshold of ⩽ 0.05, and p-values are adjusted based on Bonferroni correction for multiple comparisons. Cell-type-specific marker genes are identified for cell type assignment with the FindConservedMarkers function in the Seurat package, and DEGs are then identified with DESeq2. Data are then normalized and scaled in the Seurat package to account for technical variations. Next, we perform dimension reduction with the UMAP algorithm to visualize the data in a lower-dimensional space. The Louvain clustering algorithm is applied to group cells with similar expression profiles into clusters. Each cluster is examined for DEGs, known as marker genes, which are characteristic of specific cell types. By comparing these marker genes to known gene expression profiles from established cell type databases or literature, we are able to assign each cluster to a specific cell type.
2.2. Data notation
After the cell-type information has been obtained from the above steps, we generate an $n \times p$ cell-type abundance count matrix $\mathbf{Y}$, where each row indexed by $i$ represents a patient and each column indexed by $j$ corresponds to a specific cell type. Each row $\mathbf{y}_i$ indicates the cell-type abundance of patient $i$, with $y_{ij}$ being the total number of cells of type $j$ found in the scRNA-seq data of patient $i$. In addition, we summarize the paired clinical data as an $n \times R$ matrix denoted by $\mathbf{X}$, with the $i$-th row $\mathbf{x}_i$ representing the measurements of all the clinical variables from patient $i$.
3. Model
To identify significant associations between a range of clinical variables and cell types or grouped cell types, we introduce a hierarchical Bayesian framework that combines the analysis of cell-type abundance count data with clinical information. In this framework, cell type abundance from a patient is assumed to be drawn from a DM distribution. Clinical variables are seamlessly incorporated into the model by parameterizing the DM distribution’s parameters through a log-linear regression approach. This methodology enables a direct and integrated examination of how clinical variables influence cell-type distributions.
3.1. Dirichlet-multinomial (DM) level
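We start by modeling each row of the cell-type abundance data with a multinomial distribution
$$\mathbf{y}_i \mid \boldsymbol{\phi}_i \sim \text{Multinomial}(y_{i\cdot}, \boldsymbol{\phi}_i), \tag{1}$$
with $y_{i\cdot} = \sum_{j=1}^{p} y_{ij}$ being the summation of all cell-type counts in vector $\mathbf{y}_i$, and the $p$-dimensional vector $\boldsymbol{\phi}_i = (\phi_{i1}, \ldots, \phi_{ip})$ is defined on a $(p-1)$-dimensional simplex
$$S^{p-1} = \left\{ (\phi_{i1}, \ldots, \phi_{ip}) : \phi_{ij} \ge 0 \ \forall j, \ \sum_{j=1}^{p} \phi_{ij} = 1 \right\}. \tag{2}$$
We further impose a conjugate Dirichlet prior on the parameter $\boldsymbol{\phi}_i$ to allow for over-dispersed distributions, that is $\boldsymbol{\phi}_i \sim \text{Dirichlet}(\boldsymbol{\gamma}_i)$, where each element of the $p$-dimensional vector $\boldsymbol{\gamma}_i = (\gamma_{i1}, \ldots, \gamma_{ip})$ is strictly positive. By integrating $\boldsymbol{\phi}_i$ out, we get the resulting DM model, $\mathbf{y}_i \sim \text{DM}(\boldsymbol{\gamma}_i)$, where the corresponding probability mass function is
$$f(\mathbf{y}_i \mid \boldsymbol{\gamma}_i) = \frac{\Gamma(y_{i\cdot} + 1)\,\Gamma(\gamma_{i\cdot})}{\Gamma(y_{i\cdot} + \gamma_{i\cdot})} \prod_{j=1}^{p} \frac{\Gamma(y_{ij} + \gamma_{ij})}{\Gamma(y_{ij} + 1)\,\Gamma(\gamma_{ij})}, \tag{3}$$
where $\gamma_{i\cdot} = \sum_{j=1}^{p} \gamma_{ij}$. Compared with the multinomial distribution, this setting allows for over-dispersed distributions by inducing an increase in the variance by a factor of $(y_{i\cdot} + \gamma_{i\cdot}) / (1 + \gamma_{i\cdot})$, which is greater than 1 whenever $y_{i\cdot} > 1$.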
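As a numerical illustration of the DM probability mass function and the variance inflation factor described above, the following minimal sketch evaluates the log-density with the standard library only. The function names `dm_log_pmf` and `overdispersion_factor` are hypothetical helpers introduced here for illustration and are not part of the original study.

```python
import math
from itertools import product


def dm_log_pmf(y, gamma):
    """Log-probability of the count vector y under a Dirichlet-multinomial
    distribution with strictly positive parameters gamma."""
    if len(y) != len(gamma):
        raise ValueError("y and gamma must have the same length")
    if any(g <= 0 for g in gamma):
        raise ValueError("all gamma values must be strictly positive")
    y_tot, g_tot = sum(y), sum(gamma)
    # Normalizing term that depends only on the totals.
    out = math.lgamma(y_tot + 1) + math.lgamma(g_tot) - math.lgamma(y_tot + g_tot)
    # One term per cell type.
    for y_j, g_j in zip(y, gamma):
        out += math.lgamma(y_j + g_j) - math.lgamma(g_j) - math.lgamma(y_j + 1)
    return out


def overdispersion_factor(y_tot, gamma):
    """Variance inflation of the DM relative to a multinomial with the same mean."""
    g_tot = sum(gamma)
    return (y_tot + g_tot) / (1.0 + g_tot)


if __name__ == "__main__":
    gamma = [0.5, 1.5, 3.0]  # illustrative values only
    # The probabilities of all count vectors with a fixed total should sum to one.
    total = sum(
        math.exp(dm_log_pmf(list(y), gamma))
        for y in product(range(5), repeat=3)
        if sum(y) == 4
    )
    print(total, overdispersion_factor(4, gamma))
```

Working on the log scale with `math.lgamma` avoids overflow of the gamma function when a patient contributes thousands of cells.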
3.2. Log-linear regression level
The covariate matrix $\mathbf{X}$ is then incorporated into the model by a log-linear regression framework, where the parameter $\gamma_{ij}$ of the DM distribution is linked to the covariates by specifying
$$\log \gamma_{ij} = \alpha_j + \mathbf{x}_i \boldsymbol{\beta}_j = \alpha_j + \sum_{r=1}^{R} x_{ir} \beta_{rj}, \tag{4}$$
where $\boldsymbol{\beta}_j = (\beta_{1j}, \ldots, \beta_{Rj})^\top$ is an $R$-dimensional vector, with each element $\beta_{rj}$ modeling the effect of the $r$-th covariate on the $j$-th cell type. The intercept term $\alpha_j$ serves as the log baseline parameter for the cell type $j$.
Identifying the significant associations between cell type abundance and clinical variables is equivalent to finding the non-zero $\beta_{rj}$. In practice, not all of the clinical variables are associated with the abundance of each cell type; therefore, we specify a spike-and-slab prior as
$$\beta_{rj} \mid \delta_{rj}, \sigma_j^2 \sim \delta_{rj}\, \text{N}(0, \sigma_j^2) + (1 - \delta_{rj})\, I(\beta_{rj} = 0), \tag{5}$$
where $\delta_{rj} = 1$ indicates the $r$-th covariate is associated with the abundance of the $j$-th cell type, and $\delta_{rj} = 0$ otherwise. Here $I(\cdot)$ is an indicator function. The latent binary variables $\delta_{rj}$ serve as the indicators of which pairs of cell type and clinical variable have a significant association. We further complete the model by setting the prior of $\sigma_j^2$ to be an inverse-gamma distribution $\text{IG}(a, b)$. A common way of choosing the value of $a$ and $b$ is to set and , which suggests a flat prior of variance that encourages the selection of relatively large effects.
The model is completed by setting and . Practically, we set , and the choice of hyperparameters and reflects the prior belief that a proportion of the cell type and covariate associations would be selected as discriminating among all pairs. For most cases, a value of corresponds to assuming a priori that 10% to 20% of the covariates will be selected. Finally, we further complete the model by setting as a relatively large value (e.g., 10), since such a choice suggests a flat prior distribution on the location of the coefficients. We denote and , where , , and .
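The log-linear link of Section 3.2 can be illustrated with a short sketch that maps covariates to DM parameters and then to expected cell-type proportions. The data layout (a list of patient rows `X`, a list of intercepts `alpha`, and a covariate-by-cell-type coefficient table `B`) and the function names are assumptions made for this illustration. The expected proportion of a cell type under a DM distribution is its parameter divided by the sum of all parameters.

```python
import math


def dm_parameters(X, alpha, B):
    """Return gamma[i][j] = exp(alpha[j] + sum_r X[i][r] * B[r][j]).

    X is n-by-R (patients by covariates), B is R-by-p (covariates by cell
    types), and alpha holds the p log-baseline intercepts.
    """
    gamma = []
    for x_i in X:
        row = []
        for j in range(len(alpha)):
            eta = alpha[j] + sum(x_ir * b_r[j] for x_ir, b_r in zip(x_i, B))
            row.append(math.exp(eta))
        gamma.append(row)
    return gamma


def expected_proportions(gamma_row):
    """Mean cell-type proportions implied by one patient's DM parameters."""
    total = sum(gamma_row)
    return [g / total for g in gamma_row]


if __name__ == "__main__":
    # Two patients, two covariates, two cell types (illustrative numbers).
    X = [[0.0, 0.0], [1.0, 0.0]]
    alpha = [0.0, math.log(2.0)]
    B = [[math.log(3.0), 0.0],  # covariate 1 raises cell type 1 only
         [0.0, 0.0]]            # covariate 2 has no effect (all-zero row)
    for gamma_row in dm_parameters(X, alpha, B):
        print(expected_proportions(gamma_row))
```

In this toy example, a zero row of `B` plays the role of a covariate whose inclusion indicators are all zero, so it leaves every cell-type proportion unchanged.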
3.3. Incorporating cell-type tree structure
Cell-type abundance data can be summarized at different hierarchical levels based on their similarities. Given all the cell types (leaf nodes) at the bottom level, we can build a binary tree from bottom to top by the “height” and “merge” attributes of each parent node (Figure 1a). For example, the two leaves (cell types) with the smallest height merge first, yielding the first parent node, denoted by “node 1”, and the corresponding “layer 1”. Here $p$ is the total number of cell types, as in the count matrix $\mathbf{Y}$. From now on, “node” only refers to the parent nodes, not the leaves.
By merging the nodes in this manner, we end up having $p - 1$ parent nodes from bottom to top. By cutting the tree horizontally from the bottom, for each merge we get a new cut that represents a new layer, and we have in total $p - 1$ layers (Figure 1b).
Now consider a matrix where each row is a vector of sets representing the composition of nodes in layer , . indicates the bottom level that contains only the leaves. , is a set of integers representing all the leaves in a certain node. Taking Figure 1c as an example, for layer , leaf 1 and 2 are in the same parent node denoted by , and leaf 4, 5, and 6 are in the node denoted by .
We can aggregate the counts into any upper layer q based on the way the nodes merge, denoted by an matrix . Notice that . More explicitly, let be a function of vectors that can extract all the unique values in the input vector, then returns all the unique sets in , and by Figure 1a and 1c we can observe that there should be unique sets in . Then is obtained in the following manner: for row in , each element of it is calculated by
(6) |
where is the th unique set in , and is the count in row and column of . In Figure 1c, for layer 3 we have .
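The layer-wise aggregation described above can be sketched as follows: each merge joins two groups of leaves, each layer is the resulting partition of the leaves, and the aggregated count of a group is the sum of the leaf counts it contains. The merge list format (pairs of leaf indices, zero-based) and the function names are hypothetical choices for this illustration, and the example tree is made up rather than taken from Figure 1.

```python
def build_layers(n_leaves, merges):
    """Return one partition of the leaves per layer.

    Layer 0 holds singletons. Each entry (a, b) of `merges` joins the group
    containing leaf a with the group containing leaf b, creating a new layer.
    """
    groups = [frozenset([j]) for j in range(n_leaves)]
    layers = [list(groups)]
    for a, b in merges:
        group_a = next(g for g in groups if a in g)
        group_b = next(g for g in groups if b in g)
        if group_a == group_b:
            raise ValueError("leaves are already in the same node")
        groups = [g for g in groups if g not in (group_a, group_b)]
        groups.append(group_a | group_b)
        groups.sort(key=min)  # keep a stable, readable column order
        layers.append(list(groups))
    return layers


def aggregate_counts(Y, groups):
    """Sum the leaf-level counts of every patient within each group."""
    return [[sum(row[j] for j in group) for group in groups] for row in Y]


if __name__ == "__main__":
    Y = [[1, 2, 3, 4, 5, 6]]                  # one patient, six cell types
    layers = build_layers(6, [(0, 1), (3, 4), (3, 5)])
    print([sorted(g) for g in layers[3]])     # groups present after three merges
    print(aggregate_counts(Y, layers[3]))     # counts aggregated to that layer
```

Because every merge reduces the number of groups by one, a layer reached after q merges contains the number of leaves minus q columns, and layer 0 reproduces the original count matrix.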
Correspondingly, assume , we link the parameter with the covariates by
(7) |
where is an -dimensional vector, with each element , modeling the effect of the -th covariate on the node specified by set . Similarly, we specify a spike-and-slab prior as .
For a specific node defined by set , the corresponding matrix of indicators and regression coefficients are calculated as
(8) |
and
(9) |
where indicates the total number of appearance of node in all layers, and is the regression coefficient for layer that is associated with the th covariate and node .
4. Model Fitting
4.1. The MCMC algorithm
We implement the MCMC algorithm for posterior inference by updating each step with random walk Metropolis-Hastings (RWMH) sampling. The details of the MCMC algorithm are in Algorithm 1.
According to the model described in Section 3, the full data likelihood is given as follows.
(10) |
where , , and .
Then, we update the parameters in each iteration following the steps below:
Jointly update and : We perform a between-model step first using an add-delete algorithm. For each and each , we change the value of . For the add case, i.e. , we propose from . For the delete case, i.e. , we set . Matrices and are identical to and , except for the elements and , which are replaced by and , respectively. We then accept the proposed new with probability , where
(11) |
Update when : A within-model step is followed to further update each where . We first propose a new from with RWMH algorithm. is identical to , except that the element is replaced by . Then we accept the proposed value with probability , where
(12) |
Update : We update each , sequentially using RWMH algorithm. We first propose a new from . is identical to , except that the element is replaced by . Then we accept the proposed value with probability , where
(13) |
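To make the random walk Metropolis-Hastings mechanics concrete, the sketch below performs one sweep over the intercepts while holding the regression coefficients fixed. It is a simplified illustration, not the full algorithm of this section: the add-delete step is omitted, a zero-mean normal prior with standard deviation `prior_sd` on each intercept is assumed, and the function names and arguments are hypothetical. Because the normal proposal is symmetric, the acceptance probability reduces to the ratio of unnormalized posteriors.

```python
import math
import random


def dm_log_pmf(y, gamma):
    """Dirichlet-multinomial log-probability of counts y given parameters gamma."""
    y_tot, g_tot = sum(y), sum(gamma)
    out = math.lgamma(y_tot + 1) + math.lgamma(g_tot) - math.lgamma(y_tot + g_tot)
    for y_j, g_j in zip(y, gamma):
        out += math.lgamma(y_j + g_j) - math.lgamma(g_j) - math.lgamma(y_j + 1)
    return out


def log_target(alpha, B, X, Y, prior_sd):
    """Unnormalized log-posterior of the intercepts with B held fixed."""
    value = sum(-0.5 * (a / prior_sd) ** 2 for a in alpha)  # assumed normal prior
    for x_i, y_i in zip(X, Y):
        gamma_i = [
            math.exp(alpha[j] + sum(x * b_r[j] for x, b_r in zip(x_i, B)))
            for j in range(len(alpha))
        ]
        value += dm_log_pmf(y_i, gamma_i)
    return value


def rwmh_sweep(alpha, B, X, Y, prior_sd, step_sd, rng):
    """Update each intercept once; return the new vector and the acceptance count."""
    alpha = list(alpha)
    current = log_target(alpha, B, X, Y, prior_sd)
    n_accepted = 0
    for j in range(len(alpha)):
        proposal = list(alpha)
        proposal[j] = alpha[j] + rng.gauss(0.0, step_sd)  # symmetric random walk
        candidate = log_target(proposal, B, X, Y, prior_sd)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            alpha, current = proposal, candidate
            n_accepted += 1
    return alpha, n_accepted


if __name__ == "__main__":
    rng = random.Random(1)
    X = [[0.0], [1.0], [0.0], [1.0]]               # one binary covariate
    Y = [[30, 10], [12, 28], [33, 7], [9, 31]]     # toy counts, two cell types
    B = [[0.0, 1.0]]
    alpha = [0.0, 0.0]
    for _ in range(200):
        alpha, _ = rwmh_sweep(alpha, B, X, Y, prior_sd=10.0, step_sd=0.3, rng=rng)
    print(alpha)
```

The same accept-or-reject pattern applies to the within-model update of a selected coefficient, with the slab prior replacing the intercept prior in the target.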
4.2. Posterior inference
For the posterior inference, our aim is to identify the significant associations between cell type abundance and covariates by selecting over the indicator matrix $\boldsymbol{\Delta} = [\delta_{rj}]$ and the corresponding coefficient matrix $\mathbf{B} = [\beta_{rj}]$. One way to summarize the posterior inference of the latent variable $\delta_{rj}$ is via the estimates of the marginal posterior probabilities of inclusion (PPI). Suppose that we have in total $T$ iterations and the burn-in rate is 50%, then the PPI of each single $\delta_{rj}$ is calculated by $\text{PPI}_{rj} = \frac{2}{T} \sum_{t = T/2 + 1}^{T} \delta_{rj}^{(t)}$, where $\delta_{rj}^{(t)}$ is the accepted proposal of $\delta_{rj}$ in the $t$-th iteration. In this way, we are able to select the significant associations by specifying a threshold on PPIs. One choice of threshold is to assign it a fixed value, e.g., 0.5. Another more popular way is to choose a threshold that controls the Bayesian false discovery rate (FDR), which is calculated as
$$\text{FDR}(c) = \frac{\sum_{r=1}^{R} \sum_{j=1}^{p} (1 - \text{PPI}_{rj})\, I(\text{PPI}_{rj} \ge c)}{\sum_{r=1}^{R} \sum_{j=1}^{p} I(\text{PPI}_{rj} \ge c)}, \tag{14}$$
where $c$ is the threshold and $I(\cdot)$ is the indicator function. An optimal choice of $c$ can be found for a certain error rate $\eta$ by choosing $c$ such that $\text{FDR}(c) < \eta$, and a common setting of the error rate is .
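The posterior summaries described in this subsection can be sketched in a few lines: the PPI is the average of the sampled indicators after burn-in, and the Bayesian FDR at a threshold is the mean of one minus the PPI over the selected pairs. The function names and the flattened layout of the indicator draws are assumptions for illustration, and the target error rate of 0.05 in the usage example is only an illustrative value.

```python
def posterior_inclusion_probabilities(delta_draws, burn_in=0.5):
    """Average the 0/1 indicator draws after discarding the burn-in fraction.

    delta_draws is a list with one entry per MCMC iteration; each entry is a
    flat list of indicators, one per covariate and cell-type pair.
    """
    kept = delta_draws[int(len(delta_draws) * burn_in):]
    n_pairs = len(kept[0])
    return [sum(draw[k] for draw in kept) / len(kept) for k in range(n_pairs)]


def bayesian_fdr(ppi, c):
    """Expected proportion of false discoveries among pairs with PPI >= c."""
    selected = [p for p in ppi if p >= c]
    if not selected:
        return 0.0
    return sum(1.0 - p for p in selected) / len(selected)


def fdr_threshold(ppi, target):
    """Smallest observed PPI value whose Bayesian FDR is below the target.

    Returns None when no threshold among the observed PPIs meets the target,
    in which case no association would be selected.
    """
    for c in sorted(set(ppi)):
        if bayesian_fdr(ppi, c) < target:
            return c
    return None


if __name__ == "__main__":
    draws = [[1, 0], [1, 0], [1, 1], [0, 1]]
    print(posterior_inclusion_probabilities(draws))
    ppi = [0.99, 0.97, 0.60, 0.10]
    print(bayesian_fdr(ppi, 0.5), fdr_threshold(ppi, target=0.05))
```

Scanning the observed PPI values in increasing order works because raising the threshold only removes the least certain selections, so the Bayesian FDR cannot increase.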
5. Results
To illustrate the capability of our model to estimate associations between cell type abundance and clinical variables, we applied it to a simulated dataset. We conducted comparative analyses with established classical methods. The superior performance of our model underscored its efficacy in addressing integration challenges. Then we applied the model to three real datasets. Compared to the results in the original studies, our model effectively highlighted relationships between cell types and variables. Additionally, it shed light on new findings not identified by the original studies, further indicating our model’s effectiveness.
5.1. Simulation study
We first evaluated the proposed model with a simulated dataset. To mimic the real-world scenario, we utilized the covariate matrix from the first real dataset of our study (the pulmonary fibrosis dataset). In this dataset, and , which indicates that there are 25 patients and four covariates under consideration. We set the number of cell types to be and simulated the in the following manner. We first sampled all the entries of matrix from Bernoulli(0.2). When the corresponding , was set to zero, and when the corresponding , we sampled by first sampling a from a uniform distribution Uniform(1, 5), and a corresponding parameter from Bernoulli(0.5). If , , and if , , which means has a 50% probability of being negative. The vector was sampled from a truncated normal distribution Normal(0, 5) within the limit [−3, 3]. Each row of , denoted by , was simulated from , where . can be calculated by . , representing the total number of cells from each patient.
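The data-generating steps described above can be sketched with the standard library as follows. Several details are assumptions for illustration only: the covariates are random placeholders rather than the pulmonary fibrosis covariates, the number of cell types `p` and the per-patient total `total_cells` are arbitrary, the 5 in Normal(0, 5) is read as a variance, and all function names are hypothetical.

```python
import math
import random


def truncated_normal(rng, mean, sd, low, high):
    """Rejection sampler: redraw until the value falls inside [low, high]."""
    while True:
        value = rng.gauss(mean, sd)
        if low <= value <= high:
            return value


def draw_dm_counts(rng, total, gamma):
    """One DM draw: Dirichlet weights via gamma variates, then multinomial counts."""
    weights = [rng.gammavariate(g, 1.0) for g in gamma]
    if sum(weights) <= 0.0:  # guard against underflow when every gamma is tiny
        weights = [1.0 if g == max(gamma) else 0.0 for g in gamma]
    counts = [0] * len(gamma)
    for j in rng.choices(range(len(gamma)), weights=weights, k=total):
        counts[j] += 1
    return counts


def simulate_dataset(X, p, total_cells, rng, prob_assoc=0.2, alpha_sd=math.sqrt(5.0)):
    """Simulate indicators, coefficients, intercepts, and DM counts."""
    R = len(X[0])
    delta = [[1 if rng.random() < prob_assoc else 0 for _ in range(p)] for _ in range(R)]
    B = [[0.0] * p for _ in range(R)]
    for r in range(R):
        for j in range(p):
            if delta[r][j]:
                sign = 1.0 if rng.random() < 0.5 else -1.0  # negative half the time
                B[r][j] = sign * rng.uniform(1.0, 5.0)
    alpha = [truncated_normal(rng, 0.0, alpha_sd, -3.0, 3.0) for _ in range(p)]
    Y = []
    for x_i in X:
        gamma_i = [
            math.exp(alpha[j] + sum(x_i[r] * B[r][j] for r in range(R)))
            for j in range(p)
        ]
        Y.append(draw_dm_counts(rng, total_cells, gamma_i))
    return delta, B, alpha, Y


if __name__ == "__main__":
    rng = random.Random(2024)
    X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(25)]  # placeholder covariates
    delta, B, alpha, Y = simulate_dataset(X, p=10, total_cells=2000, rng=rng)
    print(sum(map(sum, delta)), Y[0])
```

The sketch assumes standardized covariates, since coefficients of magnitude up to 5 combined with covariates on a raw scale such as age in years would overflow the exponential link.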
We implemented the proposed Bayesian model by setting , , which corresponds to assuming a priori that 20% of the covariates would be selected. We set , , which suggests a flat prior distribution of that encourages the selection of relatively large effects. We further set to allow a flat prior on . The variance of the proposal distribution of the MCMC algorithm was set as . We set the total number of MCMC iterations as , and the burn-in rate as 50%. The receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC) between the true values and the PPI of served as the metric of this study.
Then we compared our model to two simple alternatives. First, we fit a simple linear regression model on each column of (each column represents the counts of a single cell type among patients), using as the covariate matrix. We then obtained sets of regression coefficients. Subsequently, we constructed the ROC curve and calculated the AUC using the p-values associated with each regression coefficient. Next, we tested the FDR-corrected correlation coefficient of each pair of columns between and (i.e. pair-wise correlation tests) and saved the adjusted p-values for the calculation of ROC and AUC. The comparison results are shown in Figure 2. Our model reached an AUC of 0.995, while the correlation test and linear regression obtained AUC of 0.832 and 0.808, respectively.
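For readers who wish to reproduce this type of comparison, the AUC can be computed directly from scores and true labels through its rank interpretation: it is the probability that a randomly chosen true association receives a higher score than a randomly chosen null pair, with ties counted as one half. The function name `auc` is a hypothetical helper. For the p-value-based alternatives, one minus the p-value can serve as the score so that larger values indicate stronger evidence.

```python
def auc(scores, labels):
    """Area under the ROC curve from the pairwise (Mann-Whitney) formulation."""
    positives = [s for s, lab in zip(scores, labels) if lab == 1]
    negatives = [s for s, lab in zip(scores, labels) if lab == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for s_pos in positives:
        for s_neg in negatives:
            if s_pos > s_neg:
                wins += 1.0
            elif s_pos == s_neg:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    true_delta = [1, 1, 0, 0]          # illustrative ground truth
    ppi = [0.9, 0.2, 0.8, 0.1]         # illustrative PPIs
    p_values = [0.01, 0.40, 0.03, 0.70]
    print(auc(ppi, true_delta))
    print(auc([1.0 - pv for pv in p_values], true_delta))
```

The pairwise formulation is quadratic in the number of pairs, which is adequate at the scale of a covariate-by-cell-type grid.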
5.2. Application to pulmonary fibrosis dataset
Pulmonary fibrosis (PF) is a chronic lung condition characterized by aberrant epithelial restructuring and the accumulation of extracellular matrix (ECM), marking a significant pathology within the pulmonary system. We applied our method to a scRNA-seq dataset from 30 patients, among whom 20 were diagnosed with PF, while the remaining ten, who did not have PF, served as the control group (Habermann et al., 2020). The authors have previously defined cell types in this PF scRNA-seq data and also provided clinical information on these 30 patients, including age, sex, smoking history, and disease status. The patients were between the ages of 17 and 74. Among these 30 patients, 17 (56.7%) were male, and 16 (53.3%) had ever smoked in their lives. After removing the patients with missing values, we kept 25 patients to proceed with our study.
Figure 3a depicts the associations between the abundance of 30 distinct cell types and four clinical parameters, namely age, sex, smoking history, and disease status (i.e., with PF or not). The analysis revealed positive correlations between certain cell types [e.g., Epi(KRT5-/KRT17+) and Epi(basal)] and the presence of PF. Conversely, we observed no significant correlations between these cell types and other clinical factors, such as age, sex, or smoking history. Figure 3b presents the posterior estimates of the regression coefficients between disease status and the aforementioned cell types. The significant coefficients corroborate the associations highlighted in Figure 3a, emphasizing the relationship between changes in disease status and alterations in cell type abundances. Figure 3c presents the 95% posterior credible intervals for between the disease status and the cell types Epi(KRT5-/KRT17+) and Epi(basal), respectively. These exclusively positive intervals further substantiate their strong positive association with PF. Figure 3d visualizes the 95% posterior credible intervals for significant regression coefficients within the hierarchical tree structure. These intervals underscore the significance of these coefficients in elucidating the impact of PF on cell type distributions. Figure 3e highlights elevated levels of Epi(SCGB1A1+), Epi(SCGB3A2+), and Epi(basal) cell types in patients diagnosed with PF. This suggests a significant increase in the abundance of these cell types among individuals with the disease, as discerned through the hierarchical model. Finally, Figure 3f displays the predicted proportions of Epi(basal) and Epi(KRT5-/KRT17+) alongside actual proportions in patients. The close alignment between the predicted and actual proportions underscores the model’s accuracy and emphasizes the heightened presence of these cell types in PF patients.
It is worth noting that our predicted associations are consistent with multiple known biological observations: Basal cell hyperplasia is indeed a characteristic feature observed in epithelial remodeling associated with fibrotic lung disease and other chronic lung conditions. This phenomenon involves an increase in the number of basal cells in the epithelium, which is the layer of cells lining the respiratory tract (Beppu et al., 2023; Ortiz-Zapater et al., 2022). The aberrant expansion of KRT5-/KRT17+ epithelial cells in pulmonary fibrosis lungs is important in understanding the pathogenesis of fibrotic lung diseases. This observation highlights changes in the epithelial cell populations that contribute to the progression of these diseases (Habermann et al., 2020; Valenzi et al., 2021). Previous papers have validated the positive correlation between Epi (basal) and Epi (KRT5-/KRT17+) abundance with pulmonary fibrosis, supporting the power of DM regression in revealing the genuine biological relationship between cell type abundance and disease.
5.3. Application to COVID-19 dataset
Immune dysregulation in patients with coronavirus disease 2019 (COVID-19) significantly influences symptoms and mortality rates. To provide a comprehensive landscape of relevant immune cell dynamics in COVID-19 patients, a recent study leveraged scRNA-seq to analyze 284 samples from 196 individuals, including COVID-19 patients and controls, thereby elucidating a comprehensive immune landscape that comprises 1.46 million cells across distinct immune cell types: B, CD4, CD8, dendritic cells (DC), epithelial cells (Epi), macrophages (Macro), mast cells (Mast), megakaryocytes (Mega), monocytes (Mono), neutrophils (Neu), natural killer cells (NK), and plasma (Ren et al., 2021). We analyzed five clinical variables: age, sex, scRNA-seq platform, disease symptoms, and disease stages. By combining the disease symptom and stage into a single variable, we ran our model with clinical variables. After preprocessing the data by removing samples with missing values, 283 samples were kept in our study, and 177 (62.5%) were from male patients. The ages range from 6 to 92, and two different sequencing platforms, 10x 3′ and 10x 5′, were utilized to generate the scRNA-seq data. Of these samples, 28 (9.9%) served as the control group, 121 (42.8%) exhibited mild to moderate symptoms, and 134 (47.3%) had severe symptoms. Additionally, 139 (49.1%) samples were collected from patients at the convalescence stage, and 116 (41.0%) during the progression stage.
As shown in Figure 4a, among these immune cell types, NK, Mono, DC, CD8, and CD4 were found to have a higher abundance level within patients exhibiting disease progression and severe symptoms. Notably, Figures 4b and 4c underscore a significant observation: age displays a negative correlation with the abundance of CD4 and CD8 T cells, particularly evident in patients undergoing disease progression and exhibiting moderate symptoms. Figure 4b (left) delineates a significant increase in the cell types comprising Mast, Macro, DC, and Mono among patients in advancing disease stages. Conversely, Figure 4b (right) demonstrates that all hierarchical cell types are more abundant in patients with severe symptoms, highlighting the impact of disease severity on immune cell dynamics. Further elaborating on these associations, Figure 4c confirms the inverse relationship between age and the levels of CD4 and CD8 expression, aligning with the observed heightened response in critical conditions. During severe disease progression, increased levels of NK, Mono, DC, and both CD8 and CD4 T cells are evident, indicating an intensified immune response. Figure 4d illustrates the 95% posterior credible intervals for the significant regression coefficients between the disease status, symptom severity, and the aforementioned immune cells. The intervals substantially exceeding zero for CD4, CD8, Mono, and NK suggest a robust positive association with advanced disease states, highlighting their pivotal roles in the immune system’s response to disease progression. Moreover, Figure 4e presents the 95% posterior credible intervals for the significant of the hierarchical tree structure, further delineating the statistical significance of these relationships within the clustered groups of immune cells. Finally, Figure 4f shows the predicted proportions of CD4, CD8, Mono, and NK under various clinical scenarios, with the actual proportions observed in patients overlaid as black dots. 
This visualization not only confirms the model’s accuracy but also emphasizes the prevalence of these immune cells in patients undergoing severe disease progression.
Furthermore, it is noteworthy that our model predictions are consistent with biological facts: age-related thymic involution and the accumulation of memory T cells contribute to the decline of CD4+ and CD8+ T cells, which are essential components of the adaptive immune response. This phenomenon is termed immunosenescence and is associated with decreased immune function in older adults (Zhang et al., 2021; Ramasubramanian et al., 2022; Li et al., 2019). The negative correlation between CD8 and CD4 cell abundance with age is consistent with the biological truth.
5.4. Application to lung cancer dataset
Non-small cell lung cancer (NSCLC) is well-known for being a highly aggressive and heterogeneous disease with diverse histological subtypes. A comprehensive understanding of the immune and stroma cell types across various NSCLC patient subgroups with distinct phenotypes is still largely lacking. A recent study has applied scRNA-seq to delineate distinct cell types sourced from patients (Salcher et al., 2022). In our study, covariates were examined, including age, sex, smoking history, and tumor stage. Of the 179 patients, 83 (46.4%) were male and 96 (53.6%) were female. Additionally, 101 (56.4%) patients had a history of smoking. Out of the 179 patients, 66 (36.9%) did not have NSCLC and served as the control group, 38 (21.2%) were in an advanced tumor stage, and 75 (41.9%) were in the early tumor stage.
The analysis depicted in Figures 5a and 5c reveals distinct patterns of cellular abundance that correlate with tumor progression and lifestyle factors. Specifically, patients in the early stages of tumor development exhibit increased levels of regulatory T cells (T(reg)), cytotoxic T cells (T(CD8+)), and helper T cells (T(CD4+)), alongside decreased abundances of monocytes (Mono) and alveolar macrophages (Macro(alv)). Conversely, patients with advanced tumors show reduced levels of Mono and Macro(alv), but an elevated presence of neutrophils. Additionally, a notable increase in plasmacytoid dendritic cells (Den(plas)), natural killer cells (NK), and B cells is observed among patients with a history of smoking, highlighting the impact of lifestyle factors on immune cell dynamics. Our hierarchical analysis, as demonstrated in Figure 5b, shows that the cell types comprising non-specific monocytes (Mono(non)), alveolar macrophages (Macro(alv)), macrophages (Macro), and dendritic cells expressing CD1c (Den(CD1c+)) are less abundant at both early and advanced stages of the disease. In contrast, early-stage tumors are characterized by a higher prevalence of T(reg), T(CD4+), T(CD8+), NK, Mast, Den(plas), B, conventional dendritic cells (Den(conv)), and Plasma, suggesting a more active immune response at this stage of tumor development. Figure 5d illustrates the 95% posterior credible intervals for significant regression coefficients between disease covariates and cell types, establishing strong statistical evidence of their associations. Figure 5e further extends these findings by showing the credible intervals for significant values within the hierarchical tree structure, providing insights into the interconnected nature of cell type dynamics and disease characteristics. In Figure 5f, we compared the estimated proportions with the observed proportions of these cell types, depicted through underlying barplots and overlaid black spots, respectively. 
This comparison reveals that Macro(alv) and Mono(non) are predominantly observed in non-cancer patients, whereas T cells, specifically T(CD4+), T(CD8+), and T(reg), are significantly more abundant in patients with early-stage lung cancer, underscoring their potential roles in early immune surveillance and response to tumor presence.
Furthermore, the predicted associations are consistent with published biological observations: regulatory T cells (T(reg)), cytotoxic T cells (T(CD8+)), and helper T cells (T(CD4+)) are crucial for anti-tumor immunity. In the early stage of lung cancer, regulatory T cells (Tregs) exhibit significant plasticity and functional diversity across various tumors within the tumor microenvironment and have been found to increase (Principe et al., 2021; Liang et al., 2022; Li et al., 2024). Several studies have demonstrated that tumors with a high quantity of FoxP3+ Tregs also exhibit a substantial presence of other immune cells, including CD8+ cytotoxic T cells and proliferative immune cells (Koll et al., 2023; Oshi et al., 2022). Additionally, an increased count of CD4+ cells was observed in non-small cell lung cancer tumor-infiltrating lymphocytes (Woo et al., 2001). Studies reported that as lung cancer progresses, the proportion of alveolar macrophages (Macro(alv)) gradually decreases (Wang et al., 2023), which supports the negative coefficient of the abundance of Macro(alv) in both the early and advanced stages of NSCLC. Neutrophil expansion is associated with changes in the inflammatory milieu of NSCLC patients who have resectable disease (Mitchell et al., 2020; Kargl et al., 2017; Horvath et al., 2024), which is reflected by the positive coefficient from the DM regression. Epidemiologic studies have shown that cigarette smoking leads to an increased prevalence of class-switched memory B cells in peripheral blood and memory IgG+ B cells in the lungs (Brandsma et al., 2009, 2012; Qiu et al., 2017). This is consistent with the positive relationship between the abundance of B cells and smoking in the DM regression.
6. Conclusion
In this study, we introduce and validate a regularized Bayesian Dirichlet-multinomial regression model for the integrative analysis of patient clinical information and scRNA-seq data. Our findings underscore the model’s robustness in elucidating complex interactions between cell type abundance derived from scRNA-seq data and various clinical covariates across multiple disease states. Notably, our model identified key associations between specific cell types and clinical variables in pulmonary fibrosis, COVID-19, and lung cancer. These relationships, some of which were not previously documented, highlight the potential of single-cell technologies coupled with advanced computational methods to deepen our understanding of disease mechanisms. For instance, the model’s ability to link specific epithelial and immune cell dynamics with clinical outcomes in pulmonary fibrosis and COVID-19 provides new insights into their roles in disease progression and immune response, respectively. The hierarchical tree structure used in our analysis further refines the understanding of cellular interactions by mapping how groups of cells are influenced by clinical features. This approach not only facilitates a more granular analysis but also highlights potential cellular targets for therapeutic intervention. While the model demonstrates significant promise, it also presents challenges, primarily in computational demand and data requirements. Future work will focus on reducing this computational cost and extending the model to dynamic analyses of chronic conditions, where understanding temporal changes in cell-type abundance could be particularly informative.
Overall, our study advances the field of integrative analysis by providing a powerful tool for uncovering the nuanced relationships between cellular composition and clinical characteristics, thereby aiding in the development of targeted therapies and improving our understanding of complex diseases.
Funding
This work was supported by the following funding: the National Science Foundation [2210912, 2113674] and the National Institutes of Health [1R01GM141519] (to QL); the Rally Foundation, Children’s Cancer Fund (Dallas), the Cancer Prevention and Research Institute of Texas (RP180319, RP200103, RP220032, RP170152 and RP180805), and the National Institutes of Health (R01DK127037, R01CA263079, R21CA259771, UM1HG011996, and R01HL144969) (to LX).
Footnotes
Conflict of Interest
The authors have no conflicts of interest to declare.
Data Availability
The code and data of this study are accessible through the GitHub repository at https://github.com/yg2485/Bayesian-DM-Regression.
References
- Beppu A. K., Zhao J., Yao C., Carraro G., Israely E., Coelho A. L., Drake K., Hogaboam C. M., Parks W. C., Kolls J. K., et al. (2023). Epithelial plasticity and innate immune activation promote lung tissue remodeling following respiratory viral infection. Nature Communications 14, 5814.
- Brandsma C.-A., Hylkema M. N., Geerlings M., van Geffen W. H., Postma D. S., Timens W., and Kerstjens H. A. (2009). Increased levels of (class switched) memory B cells in peripheral blood of current smokers. Respiratory Research 10, 1–11.
- Brandsma C.-A., Kerstjens H. A., van Geffen W. H., Geerlings M., Postma D. S., Hylkema M. N., and Timens W. (2012). Differential switching to IgG and IgA in active smoking COPD patients and healthy controls. European Respiratory Journal 40, 313–321.
- Gevaert O., Smet F. D., Timmerman D., Moreau Y., and Moor B. D. (2006). Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics 22, e184–e190.
- Habermann A. C., Gutierrez A. J., Bui L. T., Yahn S. L., Winters N. I., Calvi C. L., Peter L., Chung M.-I., Taylor C. J., Jetter C., et al. (2020). Single-cell RNA sequencing reveals profibrotic roles of distinct epithelial and mesenchymal lineages in pulmonary fibrosis. Science Advances 6, eaba1972.
- Hao Y., Hao S., Andersen-Nissen E., Mauck W. M., Zheng S., Butler A., Lee M. J., Wilk A. J., Darby C., Zager M., et al. (2021). Integrated analysis of multimodal single-cell data. Cell 184, 3573–3587.e29.
- Horvath L., Puschmann C., Scheiber A., Martowicz A., Sturm G., Trajanoski Z., Wolf D., Pircher A., and Salcher S. (2024). Beyond binary: bridging neutrophil diversity to new therapeutic approaches in NSCLC. Trends in Cancer.
- Jiang S., Xiao G., Koh A. Y., Li J. K. Q., and Zhan X. (2021). A Bayesian zero-inflated negative binomial regression model for the integrative analysis of microbiome data. Biostatistics 22, 522–540.
- Kargl J., Busch S. E., Yang G. H., Kim K.-H., Hanke M. L., Metz H. E., Hubbard J. J., Lee S. M., Madtes D. K., McIntosh M. W., et al. (2017). Neutrophils dominate the immune cell composition in non-small cell lung cancer. Nature Communications 8, 14381.
- Koll F. J., Banek S., Kluth L., Köllermann J., Bankov K., Chun F. K.-H., Wild P. J., Weigert A., and Reis H. (2023). Tumor-associated macrophages and Tregs influence and represent immune cell infiltration of muscle-invasive bladder cancer and predict prognosis. Journal of Translational Medicine 21.
- Kolodziejczyk A. A., Kim J. K., Svensson V., Marioni J. C., and Teichmann S. A. (2015). The technology and biology of single-cell RNA sequencing. Molecular Cell 58, 610–620.
- Li M., Yao D., Zeng X., Kasakovski D., Zhang Y., Chen S., Zha X., Li Y., and Xu L. (2019). Age related human T cell subset evolution and senescence. Immunity & Ageing 16, 1–7.
- Li Q., Guindani M., Reich B. J., Bondell H. D., and Vannucci M. (2017). A Bayesian mixture model for clustering and selection of feature occurrence rates under mean constraints. Statistical Analysis and Data Mining: The ASA Data Science Journal 10, 393–409.
- Li Y., Zhang C., Jiang A., Lin A., Liu Z., Cheng X., Wang W., Cheng Q., Zhang J., Wei T., et al. (2024). Potential anti-tumor effects of regulatory T cells in the tumor microenvironment: a review. Journal of Translational Medicine 22.
- Liang J., Bi G., Shan G., Jin X., Bian Y., and Wang Q. (2022). Tumor-associated regulatory T cells in non-small-cell lung cancer: Current advances and future perspectives. Journal of Immunology Research 2022, 1–8.
- Love M. I., Huber W., and Anders S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 1–21.
- Luecken M. D. and Theis F. J. (2019). Current best practices in single-cell RNA-seq analysis: a tutorial. Molecular Systems Biology 15, e8746.
- Mitchell K. G., Diao L., Karpinets T., Negrao M. V., Tran H. T., Parra E. R., Corsini E. M., Reuben A., Federico L., Bernatchez C., et al. (2020). Neutrophil expansion defines an immunoinhibitory peripheral and intratumoral inflammatory milieu in resected non-small cell lung cancer: a descriptive analysis of a prospectively immunoprofiled cohort. Journal for ImmunoTherapy of Cancer 8.
- Newman A. M., Steen C. B., Liu C. L., Gentles A. J., Chaudhuri A. A., Scherer F., Khodadoust M. S., Esfahani M. S., Luca B. A., Steiner D., et al. (2019). Determining cell type abundance and expression from bulk tissues with digital cytometry. Nature Biotechnology 37, 773–782.
- Ortiz-Zapater E., Signes-Costa J., Montero P., and Roger I. (2022). Lung fibrosis and fibrosis in the lungs: Is it all about myofibroblasts? Biomedicines 10, 1423.
- Oshi M., Sarkar J., Wu R., Tokumaru Y., Yan L., Nakagawa K., Ishibe A., Matsuyama R., Endo I., and Takabe K. (2022). Intratumoral density of regulatory T cells is a predictor of host immune response and chemotherapy response in colorectal cancer. American Journal of Cancer Research 12, 490–503.
- Papalexi E. and Satija R. (2018). Single-cell RNA sequencing to explore immune cell heterogeneity. Nature Reviews Immunology 18, 35–45.
- Principe D. R., Chiec L., Mohindra N. A., and Munshi H. G. (2021). Regulatory T-cells as an emerging barrier to immune checkpoint inhibition in lung cancer. Frontiers in Oncology 11.
- Qiu F., Liang C.-L., Liu H., Zeng Y.-Q., Hou S., Huang S., Lai X., and Dai Z. (2017). Impacts of cigarette smoking on immune responsiveness: Up and down or upside down? Oncotarget 8, 268.
- Ramasubramanian R., Meier H. C., Vivek S., Klopack E., Crimmins E. M., Faul J., Nikolich-Žugich J., and Thyagarajan B. (2022). Evaluation of T-cell aging-related immune phenotypes in the context of biological aging and multimorbidity in the Health and Retirement Study. Immunity & Ageing 19, 33.
- Ren X., Wen W., Fan X., Hou W., Su B., Cai P., Li J., Liu Y., Tang F., Zhang F., et al. (2021). COVID-19 immune features revealed by a large-scale single-cell transcriptome atlas. Cell 184, 1895–1913.
- Salcher S., Sturm G., Horvath L., Untergasser G., Kuempers C., Fotakis G., Panizzolo E., Martowicz A., Trebo M., Pall G., et al. (2022). High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer Cell 40, 1503–1520.
- Valenzi E., Tabib T., Papazoglou A., Sembrat J., Trejo Bittar H. E., Rojas M., and Lafyatis R. (2021). Disparate interferon signaling and shared aberrant basaloid cells in single-cell profiling of idiopathic pulmonary fibrosis and systemic sclerosis-associated interstitial lung disease. Frontiers in Immunology 12, 595811.
- Wadsworth W. D., Argiento R., Guindani M., Galloway-Pena J., Shelburne S. A., and Vannucci M. (2017). An integrative Bayesian Dirichlet-multinomial regression model for the analysis of taxonomic abundances in microbiome data. BMC Bioinformatics 18, 1–12.
- Wang J., Wu W., Xia J., Chen L., Liu D., Wang G., Wang L., and Zheng Q. (2023). Dynamic changes in macrophage subtypes during lung cancer progression and metastasis at single-cell resolution. Journal of Thoracic Disease 15, 4456.
- Woo E. Y., Chu C. S., Goletz T. J., Schlienger K., Yeh H., Coukos G., Rubin S. C., Kaiser L. R., and June C. H. (2001). Regulatory CD4+ CD25+ T cells in tumors from patients with early-stage non-small cell lung cancer and late-stage ovarian cancer. Cancer Research 61, 4766–4772.
- Zhang H., Weyand C. M., and Goronzy J. J. (2021). Hallmarks of the aging T-cell system. The FEBS Journal 288, 7123–7142.
- Zhu B., Song N., Shen R., Arora A., Machiela M. J., Song L., Landi M. T., Ghosh D., Chatterjee N., Baladandayuthapani V., et al. (2017). Integrating clinical and multiple omics data for prognostic assessment across human cancers. Scientific Reports 7, 16954.