Abstract
Deconvoluting cell-state abundances from bulk RNA-sequencing data can add considerable value to existing data, but achieving fine-resolution and high-accuracy deconvolution remains a challenge. Here we introduce MeDuSA, a mixed model-based method that leverages single-cell RNA-sequencing data as a reference to estimate cell-state abundances along a one-dimensional trajectory in bulk RNA-sequencing data. The advantage of MeDuSA lies primarily in estimating cell abundance in each state while fitting the remaining cells of the same type individually as random effects. Extensive simulations and real-data benchmark analyses demonstrate that MeDuSA greatly improves the estimation accuracy over existing methods for one-dimensional trajectories. Applying MeDuSA to cohort-level RNA-sequencing datasets reveals associations of cell-state abundances with disease or treatment conditions and cell-state-dependent genetic control of transcription. Our study provides a high-accuracy and fine-resolution method for cell-state deconvolution along a one-dimensional trajectory and demonstrates its utility in characterizing the dynamics of cell states in various biological processes.
Subject terms: Statistical methods, Software, Data integration, Gene expression, Computational models
MeDuSA, a mixed model-based method, leverages single-cell RNA-sequencing data for high-accuracy, fine-resolution cellular deconvolution in bulk RNA-sequencing data. It improves deconvolution accuracy over existing methods, revealing cell-state dynamics in various biological processes.
Main
Cellular deconvolution is a computational technique aimed to estimate cellular compositions from tissue-level ‘bulk’ omics data1,2. With the increasing availability of bulk RNA-sequencing (RNA-seq) data, cellular deconvolution has become a pivotal approach for estimating cell-type compositions in a tissue of interest. This methodological advance has greatly facilitated research to understand the roles of different cell types in dynamic disease processes (for example, quantifying immune cell infiltrations in solid tumors)3–5, probe genetic regulatory mechanisms at the cellular level (for example, cell-type-specific expression quantitative trait locus analysis)6–9 and adjust biases caused by cell-type compositions in association analyses (for example, using cell-type compositions for covariate adjustment)7,10,11.
Over the past decade, many cellular deconvolution methods have been developed and benchmarked1,2, including BayesPrism12, CIBERSORT13 and MuSiC14 among others. Most of them share a typical workflow, that is, generating cell-type-specific gene expression profiles (GEPs) from a reference, such as bulk RNA-seq data from individual cell subsets (for example, CIBERSORT13) or single-cell RNA-seq (scRNA-seq) data (for example, MuSiC14), and utilizing the reference GEPs to compute cell-type compositions in bulk RNA-seq data. Nevertheless, cells of the same type are not homogeneous but distributed across multiple states in a biological process that arises in a context-dependent manner, for example, activation15, differentiation16 or degeneration17. This distribution can vary among different environments, disease conditions and genetically distinct individuals. In this regard, further opportunities and challenges of cellular deconvolution lie in estimating the abundances of cells at different states (referred to as cell-state abundance hereinafter) in bulk RNA-seq data.
Single-cell RNA-seq offers a snapshot of the transcriptome of thousands of diverse cells, providing an avenue for studying cell states in various biological processes18,19. In scRNA-seq data, cells at different states can be computationally ordered to infer cell-state trajectories (for example, pseudotime)19. Cell population mapping (CPM)20 is a cellular deconvolution method specifically designed to exploit ‘cell-state space’ inferred from reference scRNA-seq data to estimate cell-state abundances in bulk RNA-seq data. CPM partitions the cell-state space into several grids, constructs a GEP by randomly sampling a cell from each grid and combines the estimated abundances across thousands of repeats to obtain a single abundance for each cell. While CPM has considerably improved the deconvolution resolution, the accuracy of the estimated cell-state abundance can still be improved, largely because it focuses on only a small number of cells in each sampling repeat without accounting for the remaining cells.
In this Article, we introduce MeDuSA (mixed model-based deconvolution of cell-state abundances), a high-accuracy and fine-resolution cellular deconvolution method that leverages scRNA-seq data as a reference to estimate cell-state abundances along a one-dimensional trajectory in bulk RNA-seq data. MeDuSA features the use of a linear mixed model (LMM) to fit a cell state in question (either a single cell or the mean of multiple cells, referred to as the focal state hereinafter) as a fixed effect and the remaining cells of the same cell type individually as random effects accounting for correlations between cells. This model improves the deconvolution accuracy because the random-effect component allows each cell to have a specific weight on bulk gene expression, resulting in a better capturing of variance in bulk gene expression, and ameliorates the collinearity problem between the cell(s) at the focal state (fitted as a fixed effect) and those at adjacent states (fitted as random effects). We show by extensive simulations and real-data benchmark analyses that MeDuSA is substantially more accurate than existing methods when assessed with one-dimensional trajectories. We also demonstrate the utility of MeDuSA by applying it to cohort-level bulk RNA-seq data to reveal associations of cell-state abundances with disease or treatment conditions and cell-state-dependent genetic control of transcription.
Results
Simulation study
The MeDuSA method is described in Methods, with a schematic summary shown in Fig. 1 and the technical details presented in sections 1 and 2 of the Supplementary Note. Briefly, MeDuSA utilizes scRNA-seq data as a reference to estimate the abundance of cells at various states along a one-dimensional trajectory in bulk RNA-seq data. This is done using an LMM in which the focal state is fitted as a fixed effect and cells at the other states are fitted individually as random effects, with the other cell types fitted as fixed covariates. We performed a series of simulations to assess the performance of MeDuSA and evaluate the robustness of MeDuSA to the choice of parameters (‘Simulation strategy’ in Methods). Our simulations were based on 17 scRNA-seq datasets generated from different species and sequencing platforms with varying number of cells captured (Supplementary Table 1). The cell types and cell-state trajectories of these datasets were annotated and validated previously or in this study (Methods). We split each scRNA-seq dataset into two portions, randomly assigning one portion as simulation source data and the other portion as deconvolution analysis reference data. Synthetic bulk RNA-seq data were generated as mixtures of scRNA-seq profiles, according to four pre-designed cell-state distribution patterns (Fig. 2a). We compared MeDuSA with CPM20, along with cell-type deconvolution and gene enrichment-based methods, including BayesPrism12, MuSiC14, CIBERSORT13, Scaden21, TAPE22 and ssGSEA23, that can be repurposed for cell-state deconvolution by dividing the cell-state trajectory into cell bins (section 3 of the Supplementary Note). The deconvolution accuracy was measured by the concordance correlation coefficient (CCC), Pearson’s correlation (R) and the root mean square deviation (RMSD) between the estimated cell-state abundance and the ground truth. We used CCC as the primary measure of deconvolution accuracy, as it is less sensitive to overweighted outliers than R and more interpretable than RMSD.
The results showed that MeDuSA outperformed the compared methods by a considerable margin for one-dimensional trajectories, especially when the cell-state abundance distribution was non-monotonic (Fig. 2b and Supplementary Figs. 1 and 2). For instance, when the distribution was bimodal, the deconvolution accuracy of MeDuSA (CCC) was 0.85, 3.4-fold higher than the best-performing methods among CPM (−0.05), BayesPrism (0.15), MuSiC (0.25), CIBERSORT (0.13), Scaden (0.03), TAPE(0.007) and ssGSEA (0.23).
We performed a series of sensitivity analyses to investigate the factors that influence the performance of MeDuSA (or cellular deconvolution in general). First, smoothing slightly improved the deconvolution accuracy of MeDuSA (from 0.76 to 0.86), despite that MeDuSA without smoothing (MeDuSA-NS) performed considerably better than the other methods including CPM with smoothing (Supplementary Fig. 2). As smoothing can mask the effects of the other factors, we performed the sensitivity analyses below without the smoothing step. Second, the random-effect component was nominally significant (P < 0.05) in all simulations (maximum P = 1.35 × 10−10), and the significance level decreased with fewer cells fitted (Supplementary Fig. 3). The accuracy of MeDuSA-NS also decreased with fewer cells fitted in the random-effect component (Extended Data Fig. 1), suggesting the benefit of fitting all cells at non-focal states as random effects to reduce residual variance. Third, instead of fitting the non-focal cells each as a random effect, we grouped them into bins along the cell-state trajectory and fitted the mean of each bin as a random effect; the accuracy decreased dramatically from 0.76 to 0.33 (Supplementary Fig. 4), demonstrating the benefit of allowing each cell to have a specific weight on bulk gene expression. Fourth, the accuracy decreased to 0.17 when we fitted the bins each as a fixed effect (Supplementary Fig. 5), showing the benefit of fitting the non-focal states as random effects (to ameliorate the collinearity problem between the focal and non-focal states), as further evidenced by the increased difference between the fixed- and random-effect models with the level of collinearity (Supplementary Fig. 6). Fifth, ignoring the correlations between cells in the random-effect component resulted in decreased deconvolution accuracy (from 0.75 to 0.63), especially when the underlying cell-state abundance distribution is complex (Supplementary Fig. 7). Sixth, we showed that the accuracy of MeDuSA was generally robust when the number of cell states varied from 50 to 1,000 (a larger number of cell states representing a higher deconvolution resolution) (Supplementary Figs. 8 and 9). Finally, we demonstrated the confounding effects of other cell types, which could largely be corrected by fitting the mean expression of each of them as a fixed-effect covariate (Supplementary Fig. 10).
While the use of the LMM improves the deconvolution accuracy, it introduces a much higher level of computational complexity than the models used in CPM and other cell-type deconvolution methods. We improved the computational efficiency of MeDuSA through coding the core algorithm with C++ and applying an appropriate approximation algorithm (‘Computational speed-up’ in Methods). On a unified computing platform with one central processing unit, the runtime of MeDuSA to deconvolute a bulk RNA-seq dataset using a Smart-seq2 or 10X Genomics scRNA-seq dataset as the reference (10,000 cells in both datasets) was 17.1 min or 5.6 min, respectively, 5.3-fold or 3.3-fold faster than CPM (Extended Data Fig. 2).
Benchmark analysis with real bulk RNA-seq data
We then benchmarked the performance of the deconvolution methods with real bulk RNA-seq data. Four sample-matched scRNA-seq and bulk RNA-seq datasets from human esophagi (n = 15), human bone marrows (n = 8), induced pluripotent stem cells (iPSCs; n = 6) and human embryonic stem cells (hPSCs; n = 6) were used in this analysis (Supplementary Fig. 11). In real data, the true cell-state abundances are unknown and need to be estimated. In each dataset, we inferred the cell-state trajectory and estimated the corresponding cell-state abundances from scRNA-seq data. We then compared the estimated cell-state abundances with those obtained from bulk RNA-seq data using the deconvolution methods (Fig. 3a,b). We performed cross-validation where applicable, that is, the samples used for cell-state trajectory inference were excluded from the deconvolution analysis. The results again showed that MeDuSA substantially outperformed the compared methods (Fig. 3c and Supplementary Fig. 12). The mean deconvolution accuracy (CCC) of MeDuSA was 0.70, 2.2-fold higher than the best-performing method among CPM (0.31), BayesPrism (0.19), MuSiC (0.07), CIBERSORT (0.08), Scaden (0.06), TAPE (0.07) and ssGSEA (0.17) (Fig. 3c). The conclusion remained mostly consistent at higher deconvolution resolutions (Extended Data Fig. 3). It is noteworthy that the performances of the deconvolution methods in this real-data benchmark analysis were generally lower than those in the simulation study, probably because of the discrepancies between the bulk RNA-seq and scRNA-seq (Supplementary Fig. 13) and the uncertainty in estimating cell-state abundances from scRNA-seq data19.
We further compared the cell-state abundances of epithelia estimated from the esophagus to those estimated from other tissues without the keratinization process, which can be regarded as negative controls. We applied MeDuSA to bulk RNA-seq data from esophagus mucosa (n = 555), blood (n = 929), heart (n = 861), liver (n = 226), spleen (n = 241), colon (n = 779) and small intestine (n = 187) in the Genotype-Tissue Expression (GTEx), using the fresh esophageal scRNA-seq data above as the reference. Compared with the abundance of epithelium estimated from esophagus, the abundance of epithelium estimated from the non-esophageal tissues was small (Fig. 3d).
Case studies
We next applied MeDuSA in four case studies to demonstrate how a cell-state abundance deconvolution method with substantially improved accuracy can give rise to deeper insights into disease etiology and biological mechanisms.
Application to esophageal carcinoma
We applied MeDuSA to conduct cell-state abundance deconvolution analyses in 109 human esophagus bulk RNA-seq data from The Cancer Genome Atlas (TCGA), of which 98 samples were collected from the esophageal squamous-cell carcinoma (ESCC) tumor tissue, and 11 samples were collected from the adjacent normal esophageal tissue, using the scRNA-seq data from the normal fresh esophageal tissue above as the reference. In this reference data, cell types were annotated according to the marker genes, and the keratinization trajectory of epithelial cells was inferred using Slingshot24 (Fig. 4a). The keratinization trajectory profiles the cytodifferentiation process of epithelial cells, proceeding from the post-germinative state (that is, the basal layer of the epithelium) to the finally cuticularized state (that is, the upper layer of the epithelium) (Fig. 4b,c). ESCC arises from the basal layer of the esophagus epithelium, resulting in a thicker basal layer than that in normal esophagi25,26. Hence, the abundance of epithelial cells in the basal layer (that is, in the earlier stage of the keratinization trajectory) in tumor is expected to be higher than that in normal esophagi. Such an expected histological change can be detected by MeDuSA, as evidenced by the significant difference in the abundance distribution of epithelial cells over the keratinization trajectory between ESCC and normal esophagi (permutation F-test, P = 0.012; Fig. 4d and ‘Testing for differences in cell-state abundances among groups’ in Methods). Considering that the difference was only marginally significant, probably due to the small sample size of normal esophagi in TCGA (n = 11), we combined TCGA data with the data to increase the sample size of normal esophagi to 664. After adjusting for batch effects between TCGA and GTEx27 (Supplementary Fig. 14), we observed similar result as above that relative to normal esophagi, abundance of epithelial cells inferred from tumor tissues shifted toward the basal layer (permutation F-test, P < 1 × 10−4, with the P value capped by the number of permutations; Fig. 4e). An accordant result was obtained in another independent esophagus bulk RNA-seq dataset (n = 46, permutation F-test, P = 3.2 × 10−4; Fig. 4f), using another independent scRNA-seq dataset as the reference (Supplementary Fig. 15).
Application to COVID-19
We applied MeDuSA to RNA-seq data from patients with coronavirus disease 2019 (COVID-19), with the aim to portray the dynamics of CD8+ T cells during the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. A COVID-19 peripheral blood mononuclear cell (PBMC) scRNA-seq dataset from 6 healthy and 7 SARS-CoV-2-infected donors was used as the reference for the deconvolution analyses (Extended Data Fig. 4a). Altogether, we retrieved 6,762 CD8+ T cells, which were then classified into three subtypes, including naive T cells (Tn), effector-memory T cells (Tem) and exhaustion-like T cells (Tex). The gamma delta T cells were excluded as their development process is disjoint from the other subtypes of CD8+ T cells. Diffusion map and RNA velocity analyses suggested that the CD8+ T cells developed from the naive state (Tn) to the exhaustion state (Tex) (Extended Data Fig. 4b and Supplementary Fig. 16), as validated by the expression pattern of the marker genes (Extended Data Fig. 4c), consistent with previous studies28,29.
Using the reference scRNA-seq above, we deconvoluted a PBMC bulk RNA-seq dataset consisting of 17 healthy donors and 17 patients with COVID-19. We observed a significant difference in the abundance distribution of CD8+ T cells over the development trajectory between healthy donors and patients with COVID-19 (Extended Data Fig. 4d). Compared with healthy donors, CD8+ T cells from patients with COVID-19 were enriched in the exhaustion state (permutation F-test, P = 1.4 × 10−3), in line with previous studies28. These results were replicated in another independent COVID-19 bulk RNA-seq dataset, comprising 10 healthy donors and 44 patients with COVID-19 (permutation F-test, P = 2.6 × 10−4; Extended Data Fig. 4e).
To further investigate the variation of CD8+ T cells among patients with COVID-19 under different clinical conditions, we applied MeDuSA to another PBMC bulk RNA-seq dataset containing 100 patients with relevant clinical indicators. After grouping patients into tertiles according to their blood C-reactive protein (CRP) levels, we found that CD8+ T cells from patients with higher CRP levels showed higher enrichment in the exhaustion state (permutation F-test, P = 0.038; Extended Data Fig. 4f), supporting the hypothesis that inflammation-associated stress may contribute to the dysregulation of CD8+ T cells in patients with COVID-1929. We further analyzed a bulk RNA-seq COVID-19 dataset from patients under different World Health Organization scored clinical phases (13 patients with COVID-19 and 14 healthy donors) (Extended Data Fig. 4h). The result showed a clear trend that patients with COVID-19 at convalescence stages (that is, clinical phases 6 and 7) had similar abundance distribution of CD8+ T cells over the development trajectory to healthy donors (clinical phase 0); in contrast, patients at disease stages (that is, clinical phases 1–5) tended to aggregate together, showing enrichment of CD8+ T cells in the high-exhaustion state (Extended Data Fig. 4g). In summary, our results revealed the dynamics of cell-state abundance of CD8+ T cells over the development trajectory during the SARS-CoV-2 infection, suggesting that CD8+ T cells in patients with COVID-19 were enriched in the inflammation-associated exhaustion state.
Application to skin melanoma
A previous scRNA-seq study of skin melanoma shows that low-exhaustion CD8+ T cells are depleted in T-cell receptor (TCR) expanded clusters but enriched in TCR non-expanded clusters30. In other words, TCR clonal expansions might be positively correlated with the exhaustion state of CD8+ T cells. Using this melanoma scRNA-seq dataset as the reference (Fig. 5a), we applied MeDuSA to TCGA melanoma bulk RNA-seq data (n = 430). The primary aim of this analysis was to understand the association of the exhaustion state of CD8+ T cells with the TCR clonality in a large dataset. Due to the sparseness of CD8+ T cells in the reference melanoma scRNA-seq data, the CD8+ T-cell exhaustion trajectory was annotated using the exhaustion score30 rather than any of the trajectory inference methods and validated by the expression pattern of the marker genes (Fig. 5b). Quantifying the exhaustion scores with two other independent gene sets gave rise to similar results, confirming the robustness of such an annotation (Supplementary Fig. 17).
We grouped 430 TCGA melanoma patients into tertiles according to the TCR expansion levels evaluated by MiXCR31. The MeDuSA deconvolution result showed an enrichment of CD8+ T cells at the exhaustion state, which increased with the TCR expansion level (permutation F-test, P < 1 × 10−4, with the P value capped by the number of permutations; Fig. 5c). In the terminal exhaustion state (that is, time 3, 66–100% of the cell trajectory), the correlation between TCR expansion level and CD8+ T-cell abundance was 0.55 (P = 0.0025) (Supplementary Fig. 18), suggesting that the exhaustion state of CD8+ T cells was positively associated with TCR expansion level in melanoma.
The second aim of this analysis was to investigate the clinical relevance of the exhausted CD8+ T cells. We first sought to examine the association of the exhaustion-state abundance of CD8+ T cells with patients’ overall survival. At each tertile of the exhaustion-state trajectory (time 1, 0–33% of the pseudotime; time 2, 33–66% of the pseudotime; time 3, 66–100% of the pseudotime), we divided TCGA melanoma patients into low and high groups (median cut-off) based on the average abundance of CD8+ T cells. The result showed that only the abundance of CD8+ T cells in the terminal exhaustion state was significantly associated with survival (time 3, log-rank-test, Hazard Ratio (HR) = 2.12, P = 8.2 × 10−7; Fig. 5d). We next examined the association of exhaustion-state abundance of CD8+ T cells with patients’ response to immune-checkpoint blockade (ICB). We collected a melanoma bulk RNA-seq dataset from anti-programmed cell death protein 1 (anti-PD1) pretreatment tumor tissues of 70 patients with metastatic skin melanoma. The MeDuSA deconvolution result suggested that the abundance of CD8+ T cells at the terminal exhaustion state (time 3) was higher in anti-PD1 responders than that in anti-PD1 progressors (P = 0.0069) (Fig. 5e). Collectively, our results suggest that the abundance of CD8+ T cells at the high-exhaustion state was positively correlated with TCR expansion level in melanoma and associated with patient’s overall survival and response to anti-PD1 ICB.
Cell-state-dependent genetic regulation of gene expression
Finally, we applied MeDuSA to deconvolute cell-state abundances in an expression quantitative trait locus (eQTL) dataset (that is, a cohort with both single nucleotide polymorphism (SNP) genotype and bulk RNA-seq data) for detecting cell-state-dependent eQTLs (csd-eQTLs). Note that csd-eQTL mapping has been achieved only recently with cohort-level scRNA-seq data32–35. Using the esophagus scRNA-seq dataset above as the reference, we estimated the cell-state abundances along the epithelial differentiation trajectory in bulk RNA-seq data from 497 GTEx esophagus mucosa samples and computed the abundance of cells in each quartile of the epithelial differentiation trajectory for each sample (Fig. 6a). A csd-eQTL was claimed if the effect an SNP on bulk gene expression depended on cell-state abundance in any of the quartiles (‘Mapping the cell-state-dependent eQTLs’ in Methods). In total, we identified 162 genes with at least one csd-eQTL (defined as csd-eGenes) at 5% false-discovery rate (FDR) (Fig. 6b). The csd-eGenes were enriched in differentially expressed genes (DEGs) along the cell-state trajectory (fold enrichment = 2.12, 95% confidence interval (CI) 1.73–2.52; Fig. 6c,d), which could be replicated using the epithelial differentiation trajectory annotated by another independent esophagus scRNA-seq dataset (Supplementary Fig. 19). We next annotated the epithelial differentiation trajectory using an independent esophagus single-cell assay for transposase-accessible chromatin (scATAC-seq) dataset (Fig. 6e–g) and tested for associations of the epithelial chromatin peaks with this trajectory (‘Annotating the cell-state-dependent chromatin accessibility peaks’ in Methods). We refer to the genomic regions with epithelial chromatin peaks associated with the differentiation trajectory (annotated by the scATAC-seq data) as cell-state-dependent open chromatin regions (csd-OCRs). We found that the lead csd-eQTLs were highly enriched (fold enrichment = 3.30, 95% CI 2.70–3.90) in the csd-OCRs, and the strength of enrichment increased with the significance level used to identify the csd-eQTLs (Fig. 6h). Taken together, we achieved csd-eQTL mapping in a conventional eQTL mapping dataset, and the identified csd-eQTLs were enriched in csd-OCRs and associated with genes enriched with cell-state specific expression (Fig. 6i).
Discussion
In this study, we developed a cellular deconvolution method, MeDuSA, to estimate cell-state abundance over a one-dimensional trajectory in bulk RNA-seq data. Compared with other methods, the substantially increased deconvolution accuracy of MeDuSA is mainly because of fitting the cells at the focal state as a fixed effect and the remaining cells individually as random effects. On average across the RNA-seq datasets used in this study, this approach explains an additional 10–40% of variance in bulk gene expression compared with the binning strategy (Supplementary Figs. 21 and 22).
MeDuSA is well-suited for biological scenarios where the underlying mechanisms involve continuous transitions of cellular states, such as cell development, differentiation or degeneration. In four case studies covering a broad range of research domains, we discovered that cell-state abundance was associated with disease conditions, clinical outcomes, mechanisms of pathogenicity and treatment exposures. These results recapitulated changes in cellular functions under different biological conditions, facilitating our understanding of cellular roles in disease etiology. Further, we showed that MeDuSA can be used to detect csd-eQTLs in bulk RNA-seq data. These results inform future studies to map csd-eQTLs in large cohorts and integrate the csd-eQTLs with data from genome-wide association studies to identify disease-relevant cell states and reveal the biological mechanisms underlying genetic associations for complex traits and diseases.
There are several caveats when applying MeDuSA in practice. First, the cell-state trajectory in reference scRNA-seq data needs to be pre-annotated. Although we have used different methods, including the diffusion map-based method (Slingshot), the RNA velocity-based method (scVelo) and the score-based method (CytoTRACE), for cell-trajectory inference, showing the compatibility of MeDuSA, a biased cell-state-trajectory annotation might result in biased cell-state abundance estimation. Second, the sequencing technology used to generate reference scRNA-seq data is another source of bias for deconvolution analyses. One of the greatest sources of bias in scRNA-seq is dropout events36–38, especially for short-length methods such as those implemented by 10X Genomics. We corrected for this potential bias by filtering out genes expressed in less than 10% cells and averaging gene expression profiles of cells in the focal cell-state (Methods). It is of note that we have covered most common scRNA-seq platforms in simulations and applications including 10X Genomics, Drop-Seq, Seq-Well, C1, inDrop and Smart-seq2 (Supplementary Table 1). Imputing the reference scRNA-seq data by SAVER39 did not improve the performance of MeDuSA in real-data applications (Extended Data Fig. 5). Third, the cell-state trajectory modeled in the current version of MeDuSA is a one-dimensional vector, which may not fully portray the complexity of cellular transitions, particularly in cases of multiple cell trajectories40. More work is warranted in the future to extend MeDuSA to model cell states on a multi-dimensional space. Fourth, a growing number of spatial transcriptomics studies have shown that cellular heterogeneity at spatial coordinates might be associated with unknown biological mechanisms41–43. In this regard, recovering spatial structures of bulk tissues using spatial transcriptomics data as a reference will be another interesting future direction to extend MeDuSA.
Methods
Ethical approval
This study was approved by the Ethics Committee of Westlake University (approval no. 20200722YJ001).
The MeDuSA model
For a cell type of interest (that is, the focal cell type) in a tissue or cell line, the relative abundances of cells at different states (that is, cell-state abundances) can be estimated using a cell-trajectory analysis with scRNA-seq data. For a sample without scRNA-seq but with bulk RNA-seq data available, cell-state abundance can be estimated by projecting the RNA-seq data onto the cell-state trajectory derived from a reference scRNA-seq dataset20. More specifically, cells of the focal cell type in the reference scRNA-seq data are ranked by the cell-state trajectory, and a cell state is defined as a window on this trajectory. The window size can be customized, varying from a single cell to multiple neighboring cells at similar states. Given a specific window size, the cell trajectory in the reference can be subdivided into m consecutive states. When the ith state is regarded as the focal state, the abundance of this state in the bulk RNA-seq data can, in principle, be estimated by the following model: y = xiβi + e, where y is an n × 1 vector comprising expression levels of a list of n signature genes (selected to be associated with cell-state trajectory; section 1 of the Supplementary Note and Supplementary Fig. 30) in the bulk RNA-seq data, xi is an n × 1 vector comprising expression levels of the signature genes in cells at the focal state in the reference, with βi being the cell-state abundance to be estimated, and e is an n × 1 vector of residuals, with . If there are multiple cells at the focal state, the expression level of each gene is averaged across the cells.
A critical limitation of the above model is that the variance in y explained by xi is likely to be minor, leaving a sizable residual variance and thereby considerable uncertainty in the estimated cell-state abundance . We propose to reduce the residual variance by fitting the focal state, together with the remaining cells of the focal cell type and the other cell types in the following LMM:
1 |
where y, xi, βi and e have the same definitions as above; C is matrix of gene expression levels, with each row representing a signature gene and each column representing the mean of each of the other cell types, and γ is a vector of the corresponding effects; Z is also a matrix of gene expression levels, with each row representing a signature gene and each column representing each of the remaining cells of the focal cell type, and α is a vector of the corresponding effects. In this model, βi and γ are treated as fixed effects, whereas α are treated as random effects, with , because the size of α (k cells) is often larger than the size of y (n signature genes). Under this model parameterization, we have .
Compared with the strategy of binning cells by cell-state trajectory and fitting the mean of each bin in a regression model1,2, this LMM has two distinct advantages. As cells, even those of the same type, are biologically heterogeneous, fitting the remaining cells of the focal cell type individually as random effects allows each cell to have a specific weight on bulk gene expression, resulting in a better capturing of the variance in bulk gene expression and thereby a more precise estimate of the focal state in the fixed-effect term (that is, improved deconvolution accuracy). Second, the LMM ameliorates the collinearity problem between cells at the focal state (fitted as a fixed effect) and those at adjacent states (fitted as random effects) because of the shrinkage of random effects. For the other cell types, we fit the mean expression level of a whole cell type as a fixed-effect covariate rather than fitting individual cells as random effects for two reasons. First, the signature genes are selected to be associated with the cell-state trajectory in the focal cell type so that the associations of the signature genes with the other cell types are often weak. Second, fitting multiple random-effect components with near-zero variance often causes convergence problems in estimating the variance components.
In many LMM applications, random effects are assumed to be independent and identically distributed. However, the abundances of cells at adjacent states are likely to be correlated. To accommodate such correlations, we follow the previous work44–46 to model the relationship between the abundance of cell i (strictly speaking, the abundance of cells at a state represented by cell i) and those of the other cells as
2 |
where αi is the abundance of cell i, θ is a scaling factor, wij is the weight between cells i and j, and εi is an error term with . Because cells at closer states tend to have higher correlations, we model the weight between cells i and j as wij = exp(), with dij being the Euclidian distance between the states on the cell-state trajectory46,47. Let W = {wij} be a k × k symmetric zero-diagonal matrix for all cell pairs and D be a diagonal matrix, with each diagonal element being the corresponding row sum of W. We divide each wij by Dii so that the sum of each row of W is unity. To ensure var(α) to be symmetric, we set with λ being a scalar48. Following the Brook’s factorization45,49, we have var(α) = (D − θW)−1λ2. The distribution of y then becomes
3 |
where . We can fit this model iteratively for i from 1 to m to estimate βi for each focal state. Details of the derivation and parameter estimation of equation (3) are provided in the section 2 of the Supplementary Note. It should be noted the βi parameters represent the fractional abundances of different cell states within the focal cell type, which are bound between 0 and 1 and sum up to unity. To ensure unbiased estimation, the estimates of the βi parameters from the MeDuSA models are not constrained. However, for ease of interpretation, one can rescale the raw estimates to range from 0 to 1 and sum up to unity.
Computational speed-up
Running the whole process above is time-consuming, largely because of the rate-limiting step of estimating V (strictly speaking, estimating the parameters to compute ), which needs to be done repeatedly for each focal state. Considering the minimal contribution of a focal cell state to the bulk gene expression level, we speed up the process by estimating V only once under the null model (that is, dropping the focal state from the fixed-effect terms and fitting all cells of the focal cell type in the random-effect term) and plug it in the generalized least squares50 equation to compute for each of the alternative models. This approximation has been widely used in LMM-based genetic association test51–55, and has been proved to be accurate in our benchmark analyses (Supplementary Figs. 27 and 28).
Smoothing
After estimating the cell-state abundances from the process above, we smooth the estimates over the cell-state trajectory by the locally estimated scatterplot smoothing (LOESS) or averaging the nearest neighbors. This smoothing process often leads to improved deconvolution accuracy due to reduced sampling variance of the estimates using the neighboring information.
Simulation strategy
To make the simulation as close to reality as possible, we performed simulations using 17 real scRNA-seq datasets from different sequencing platforms and species. Each dataset was randomly split into two portions, one as the simulation source data and the other as the deconvolution reference data. The synthetic bulk RNA-seq data were generated as mixtures of scRNA-seq profiles based on the simulation source data. We grouped cells into L uniformly distributed states and assigned an abundance (al) to each state (l) according to the pre-designed cell abundance distribution over the cell-state trajectory. To mimic the sampling variances in real bulk RNA-seq data, we randomly selected a certain number of cells (with replacement) from each state based on the assigned cell abundance. The pseudo bulk expression level was obtained by averaging the expression profiles of the selected individual cells. Specifically, the expression level of gene g in the pseudo bulk RNA-seq data was generated as:
where is the expression level of gene g of cell i randomly selected from state l, and n is the total number of selected cells. We set n as to ensure at least one selected cell for the non-empty states and rounded to an integer number. The cell-state abundance (al) was generated as a nonlinear function of the cell trajectory: al = f(tl) with f being the shape mapping function and tl being the median trajectory rank of cells at state l. We designed four cell-state abundance distributions including:
monotonically increasing:
monotonically decreasing:
unimodal:
bimodal:
with k being the curvature of the distribution. The generated cell-state abundances were then normalized so that they sum to unity. To further mimic differences in batch effects between scRNA-seq and bulk RNA-seq data, we added log-normally distributed noises to the pseudo bulk RNA-seq data12. The performances of MeDuSA and other methods under different levels of noises were shown in Supplementary Fig. 29.
Testing for differences in cell-state abundances among groups
We propose an approach, MANOVA-Pro, that combines multiple analysis of variance (MANOVA) with polynomial regression to detect differences in cell-state abundance among groups (for example, case group versus control group). We first utilize the polynomial regression to model the distribution of cell-state abundance along the cell-state trajectory as
5 |
where βj is an m × 1 vector of the estimated cell-state abundances of individual j with m being the number of states along the cell-state trajectory; is an m × k polynomial matrix with t being an m × 1 cell-state vector and (k − 1) being the polynomial degree; bj is a k × 1 vector of the regression coefficients corresponding to T; e is a vector of the residuals, . The polynomial regression coefficients can be estimated as . We next perform an MANOVA analysis to test if there is a difference in any of the polynomial regression coefficients among groups (for example, case versus control) based on the following model:
6 |
where is a k × k matrix with g being the number of groups, bji being a k × 1 vector of the regression coefficients of individual i in the group j, and nj being the number of individuals in the group j; is the hypothesis sum of squares and cross products matrix; is the error sum of squares and cross products matrix. We can use the following F statistic to test against the null hypothesis ,
7 |
where Λ is the Pillai trace with , , and . Under the null hypothesis, this F statistic follows an F distribution with and degrees of freedom.
Correcting for inflation in association test
An important application of the estimated cell-state abundance is to test its association with a phenotype, for example, case-control status, across individuals. Such an analysis can be performed using the MANOVA-PRo method above that tests the association of cell-state abundance with a categorical phenotype, accounting for the relationship between the cell-state abundance and cell trajectory. Alternatively, if the interest is to test whether the abundance of cells in a specific state (or a bin of states without concerning the relationship between the cell-state abundance and cell trajectory within the bin), then the association test can be performed under the linear regression model framework. However, we have observed from simulations that all the association tests mentioned above can suffer from inflation because the estimated cell-state abundance is correlated across individuals, owing to the correlation of gene expression, and such correlation can be group dependent. For example, we observed in multiple bulk RNA-seq datasets that the mean correlation of gene expression was higher within the case or control group than that between groups (Supplementary Fig. 23). One extreme example was that the difference in estimated cell-state abundance between the case and control groups was statistically significant even if the reference scRNA-seq data were randomly generated (Supplementary Fig. 24). Such inflation also probably exists in cell-type deconvolution analyses, as demonstrated in our simulations (Supplementary Fig. 25). To account for such correlation-induced inflation, we propose to assess the significance level of the association by permutation test. In each permutation, we randomly shuffle the signature genes, and re-run the cellular deconvolution analysis and the subsequent association analysis. We repeat the permutation 1,000 times (or 10,000 times when necessary) and compute an empirical P value by comparing the observed association test statistic with the test statistics obtained from permutations. We have demonstrated by simulations under various conditions that the permutation-based test was well calibrated under the null of no association (Supplementary Fig. 26).
Processing the scRNA-seq, bulk RNA-seq and scATAC-seq data
We used 24 scRNA-seq datasets from the public domain (see Supplementary Table 1 for the information about species, sample size, sequencing platform and data access). Among them, 17 scRNA-seq datasets were used in simulation analyses, with the cell-state trajectory annotated previously or in this study using CytoTRACE56. In addition, we used 21 bulk RNA-seq datasets, with relevant tissue, sample size and data access information compiled in Supplementary Table 2. We also utilized scATAC-seq data from three human esophagus samples. The procedures for processing the scRNA-seq, bulk RNA-seq and scATAC-seq data are details in sections 4–6 of the Supplementary Note.
Mapping the cell-state-dependent eQTLs
We used SNP genotype data of 497 GTEx samples. Following the GTEx pipeline, we performed a standard quality control process of the genotype data using PLINK257, with the parameters ‘–geno 0.01–maf 0.05–hwe 0.000001–mind 0.01′. The workflow for mapping the cell-state-dependent eQTLs (csd-eQTLs) is illustrated in Supplementary Fig. 35. The csd-eQTLs were mapped using a linear regression model with an interaction term between SNP genotype and the estimated cell-state abundance: , where yi is the gene expression level of the ith individual, xi is the genotype variable of an SNP, si is the overall abundance of cells at a range of states (for example, one of the quartiles of the cell-state trajectory), xisi is the interaction term, cij represents the jth covariate, and ei is the residual. Following the standard eQTL mapping pipeline58, we used age, sex, the top-five genotype principal components (to correct for population stratification), and 60 PEER59 factors (to correct for biological/technical confounding factors) as the covariates. For each gene, only SNPs within the cis window (that is, ±1 Mb) of the transcription start site were tested. To avoid outlier effects, we performed the rank-based inverse normal transformation of the TMM (i.e., trimmed mean of m-values) normalized gene expression values and the cell-state abundances. We filtered out SNPs with minor allele frequency <0.05. For each of the SNPs retained, we tested the significance of the interaction term for csd-eQTL detection. Following the pipeline of mapping cell-type-dependent eQTLs60, we used eigenMT61 to correct for multiple testing in each cis window. We then computed the Benjamin–Hochberg FDR values based on the eigenMT adjusted P values to determine the experimental-wise significance threshold. The above csd-eQTL mapping process was conducted using the software tensorQTL62.
Annotating the cell-state-dependent chromatin accessibility peaks
We performed dimension-reduction analysis for epithelia in the scATAC-seq data using the same pipeline described above. We used Slingshot24 to annotate the epithelial keratinization (differentiation) trajectory based on the top two eigenvectors (Supplementary Fig. 20). To avoid potential outlier effects, we eliminated chromatin accessibility peaks that were available in less than 5% of the epithelial cells. For each of the remaining peaks, we utilized the generalized additive model implemented in the R package ‘mgcv’ to identify epithelial differentiation-dependent accessible chromatin peaks: y ≈ s(x) + C, where y is a vector of chromatin accessibility peaks across cells, x is a vector of pseudotime values of the epithelial keratinization trajectory, s is the smoothing spline representing the linear combination of cubic basis functions and C is the matrix of covariates. We added the total number of fragments of each cell to account for the variation in sequencing depth63. We used the total number of fragments and donor of the cell as covariates. Following previous studies63,64, we assumed that the chromatin accessibility peaks follow a negative binomial distribution. The strength of association between chromatin accessibility and epithelial differentiation trajectory was quantified by the χ2 value of the smoothing spline.
Enrichment of eQTLs for chromatin accessibility
We assigned cell-state-dependent chromatin accessibility χ2 values obtained above to the SNPs included in the csd-eQTL analysis. SNPs located in regions without chromatin accessibility data were excluded. To avoid ascertainment bias, we randomly sampled control SNPs from null SNPs, ensuring that their number and minor allele frequency distribution matched with those of the SNPs in query. The sampling process was repeated 1,000 times. The fold enrichment was calculated by dividing the mean χ2 value of the SNPs in query by the mean of mean χ2 values across the 1,000 sets of control SNPs. We employed the delta method65,66 to compute the sampling variance of the fold enrichment. Specifically, let x be the mean χ2 value of the SNPs in query and with yi being the mean χ2 value of ith set of control SNPs. The fold enrichment was estimate as , with being the mean across m replicates (m = 1,000 in this case). The variance of is expressed as: . Assuming that , and , with being the observed variance of across replicates, the sampling variance of the fold enrichment estimate can be computed as: .
Enrichment of eGenes in DEGs
We allocated the χ2 values of the DEGs to the genes involved in the csd-eQTL analysis. Employing a similar method as previously mentioned, we computed the fold enrichment of eGenes by dividing the average χ2 value of the eGenes in query by the mean of mean χ2 values obtained from 1,000 sets of randomly chosen control genes. The delta method was used to calculate the sampling variance of the fold enrichment.
Statistics and reproducibility
The P values to test for differences in cell-state abundances among groups were calculated using the permutation-based MANOVA-Pro method. For the survival analysis, P values were computed using a two-sided log-rank test. Csd-eQTLs P values were computed using a one-sided chi-squared test. Enrichment P values for the csd-eGenes and csd-eQTLs were derived through permutations. The sample size for each analysis was determined by the maximum number of eligible samples available in the respective datasets. The study design did not require randomization or blinding. To reproduce the primary results of this research, refer to the analytical pipeline available at https://github.com/LeonSong1995/MeDuSA_Analysis.
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Supplementary information
Source data
Acknowledgements
This research was supported by the Leading Innovative and Entrepreneur Team Introduction Program of Zhejiang (2021R01013), ‘Pioneer’ and ‘Leading Goose’ R&D Program of Zhejiang (2022SDXHDX0001), Research Program of Westlake Laboratory of Life Sciences and Biomedicine (202208013) and Research Center for industries of the Future (RCIF) at Westlake University. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank F. Cheng and T. Xu for helpful discussions and the Westlake University High-Performance Computing Center for assistance in computing. This study used the data from the GTEx (dbGaP accession phs000178) and the TCGA (dbGaP accession phs000424).
Extended data
Author contributions
J.Y. and L.S. conceived the study. J.Y., L.S., T.Q. and X.S. designed the experiment. L.S. and J.Y. developed the methods with input from X.S. L.S. developed the software tool, curated the data and conducted all analyses with the assistance and guidance from J.Y., T.Q. and X.S. J.Y. supervised the project. L.S. and J.Y. wrote the paper with the participation of all authors. All authors reviewed and approved the final paper.
Peer review
Peer review information
Nature Computational Science thanks Carlos Talavera-Lópezand the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Jie Pan, in collaboration with the Nature Computational Science team.
Data availability
All the scRNA-seq, scATAC-seq and bulk RNA-seq data used in this study are available in the public domain with the relevant information summarized in Supplementary Tables 1 and 2. The GTEx genotype data is available at https://gtexportal.org/home/protectedDataAccess. The GTEx eQTLs summary data is available at https://gtexportal.org/home/datasets. The csd-eQTLs summary data is available at 10.5281/zenodo.8018006 ref. 67. The GRCh38 genome is available at https://www.ncbi.nlm.nih.gov/projects/genome/guide/human. The GENECODE-v38 transcriptome reference is available at https://www.gencodegenes.org/human. Source data for Figs. 2–6 and Extended Data Figs. 1–5 are available with this paper.
Code availability
The source code of MeDuSA is available at https://github.com/LeonSong1995/MeDuSA ref. 68.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
The online version contains supplementary material available at 10.1038/s43588-023-00487-2.
References
- 1.Avila Cobos F, Alquicira-Hernandez J, Powell JE, Mestdagh P, De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat. Commun. 2020;11:5650. doi: 10.1038/s41467-020-19015-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Jin H, Liu Z. A benchmark for RNA-seq deconvolution analysis under dynamic testing environments. Genome Biol. 2021;22:102. doi: 10.1186/s13059-021-02290-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Thorsson V, et al. The immune landscape of cancer. Immunity. 2018;48:812–830 e14. doi: 10.1016/j.immuni.2018.03.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Sayaman RW, et al. Germline genetic contribution to the immune landscape of cancer. Immunity. 2021;54:367–386.e8. doi: 10.1016/j.immuni.2021.01.011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Li B, et al. Comprehensive analyses of tumor immunity: implications for cancer immunotherapy. Genome Biol. 2016;17:174. doi: 10.1186/s13059-016-1028-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Kim-Hellmuth S, et al. Cell type-specific genetic regulation of gene expression across human tissues. Science. 2020;369:eaaz8528. doi: 10.1126/science.aaz8528. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Donovan MKR, D’Antonio-Chronowska A, D’Antonio M, Frazer KA. Cellular deconvolution of GTEx tissues powers discovery of disease and cell-type associated regulatory variants. Nat. Commun. 2020;11:955. doi: 10.1038/s41467-020-14561-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Westra H-J, et al. Cell specific eQTL analysis without sorting cells. PLoS Genet. 2015;11:e1005223. doi: 10.1371/journal.pgen.1005223. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Glastonbury CA, Couto Alves A, El-Sayed Moustafa JS, Small KS. Cell-type heterogeneity in adipose tissue is associated with complex traits and reveals disease-relevant cell-specific eQTLs. Am. J. Hum. Genet. 2019;104:1013–1024. doi: 10.1016/j.ajhg.2019.03.025. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 2014;15:R31. doi: 10.1186/gb-2014-15-2-r31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Teschendorff AE, Zheng SC. Cell-type deconvolution in epigenome-wide association studies: a review and recommendations. Epigenomics. 2017;9:757–768. doi: 10.2217/epi-2016-0153. [DOI] [PubMed] [Google Scholar]
- 12.Chu T, Wang Z, Pe’er D, Danko CG. Cell type and gene expression deconvolution with BayesPrism enables Bayesian integrative analysis across bulk and single-cell RNA sequencing in oncology. Nat. Cancer. 2022;3:505–517. doi: 10.1038/s43018-022-00356-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Newman AM, et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods. 2015;12:453–457. doi: 10.1038/nmeth.3337. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Wang X, Park J, Susztak K, Zhang NR, Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. 2019;10:380. doi: 10.1038/s41467-018-08023-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Smith-Garvin JE, Koretzky GA, Jordan MS. T cell activation. Annu. Rev. Immunol. 2009;27:591–619. doi: 10.1146/annurev.immunol.021908.132706. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Sánchez Alvarado A, Yamanaka S. Rethinking differentiation: stem cells, regeneration, and plasticity. Cell. 2014;157:110–119. doi: 10.1016/j.cell.2014.02.041. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Fricker M, Tolkovsky AM, Borutaite V, Coleman M, Brown GC. Neuronal cell death. Physiol. Rev. 2018;98:813–880. doi: 10.1152/physrev.00011.2017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Wagner A, Regev A, Yosef N. Revealing the vectors of cellular identity with single-cell genomics. Nat. Biotechnol. 2016;34:1145–1160. doi: 10.1038/nbt.3711. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Saelens W, Cannoodt R, Todorov H, Saeys Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. 2019;37:547–554. doi: 10.1038/s41587-019-0071-9. [DOI] [PubMed] [Google Scholar]
- 20.Frishberg A, et al. Cell composition analysis of bulk genomics using single-cell data. Nat. Methods. 2019;16:327–332. doi: 10.1038/s41592-019-0355-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Menden K, et al. Deep learning-based cell composition analysis from tissue expression profiles. Sci. Adv. 2020;6:eaba2619. doi: 10.1126/sciadv.aba2619. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Chen Y, et al. Deep autoencoder for interpretable tissue-adaptive deconvolution and cell-type-specific gene analysis. Nat. Commun. 2022;13:6735. doi: 10.1038/s41467-022-34550-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA. 2005;102:15545–15550. doi: 10.1073/pnas.0506580102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Street K, et al. Slingshot: cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom. 2018;19:477. doi: 10.1186/s12864-018-4772-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Jain S, Dhingra S. Pathology of esophageal cancer and Barrett’s esophagus. Ann. Cardiothorac. Surg. 2017;6:99–109. doi: 10.21037/acs.2017.03.06. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Sánchez-Danés A, Blanpain C. Deciphering the cells of origin of squamous cell carcinomas. Nat. Rev. Cancer. 2018;18:549–561. doi: 10.1038/s41568-018-0024-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Zhang Y, Parmigiani G, Johnson WE. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom. Bioinform. 2020;2:lqaa078. doi: 10.1093/nargab/lqaa078. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Wauters E, et al. Discriminating mild from critical COVID-19 by innate and adaptive immune single-cell profiling of bronchoalveolar lavages. Cell Res. 2021;31:272–290. doi: 10.1038/s41422-020-00455-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 29.Zhang J-Y, et al. Single-cell landscape of immunological responses in patients with COVID-19. Nat. Immunol. 2020;21:1107–1118. doi: 10.1038/s41590-020-0762-x. [DOI] [PubMed] [Google Scholar]
- 30.Tirosh I, et al. Dissecting the multicellular ecosystem of metastatic melanoma by single-cell RNA-seq. Science. 2016;352:189–196. doi: 10.1126/science.aad0501. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Bolotin DA, et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods. 2015;12:380–381. doi: 10.1038/nmeth.3364. [DOI] [PubMed] [Google Scholar]
- 32.Nathan A, et al. Single-cell eQTL models reveal dynamic T cell state dependence of disease loci. Nature. 2022;606:120–128. doi: 10.1038/s41586-022-04713-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Soskic B, et al. Immune disease risk variants regulate gene expression dynamics during CD4+ T cell activation. Nat. Genet. 2022;54:817–826. doi: 10.1038/s41588-022-01066-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Yazar S, et al. Single-cell eQTL mapping identifies cell type-specific genetic control of autoimmune disease. Science. 2022;376:eabf3041. doi: 10.1126/science.abf3041. [DOI] [PubMed] [Google Scholar]
- 35.Perez RK, et al. Single-cell RNA-seq reveals cell type-specific molecular and genetic associations to lupus. Science. 2022;376:eabf1970. doi: 10.1126/science.abf1970. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Qiu P. Embracing the dropouts in single-cell RNA-seq analysis. Nat. Commun. 2020;11:1169. doi: 10.1038/s41467-020-14976-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37.Kim TH, Zhou X, Chen M. Demystifying ‘drop-outs’ in single-cell UMI data. Genome Biol. 2020;21:196. doi: 10.1186/s13059-020-02096-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38.Wang X, He Y, Zhang Q, Ren X, Zhang Z. Direct comparative analyses of 10X Genomics chromium and smart-seq2. Genom. Proteom. Bioinform. 2021;19:253–266. doi: 10.1016/j.gpb.2020.02.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Huang M, et al. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods. 2018;15:539–542. doi: 10.1038/s41592-018-0033-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Barkley D, et al. Cancer cell states recur across tumor types and form specific interactions with the tumor microenvironment. Nat. Genet. 2022;54:1192–1201. doi: 10.1038/s41588-022-01141-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Longo SK, Guo MG, Ji AL, Khavari PA. Integrating single-cell and spatial transcriptomics to elucidate intercellular tissue dynamics. Nat. Rev. Genet. 2021;22:627–644. doi: 10.1038/s41576-021-00370-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Berglund E, et al. Spatial maps of prostate cancer transcriptomes reveal an unexplored landscape of heterogeneity. Nat. Commun. 2018;9:2419. doi: 10.1038/s41467-018-04724-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Ji AL, et al. Multimodal analysis of composition and spatial architecture in human squamous cell carcinoma. Cell. 2020;182:497–514.e22. doi: 10.1016/j.cell.2020.05.039. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Banerjee, S., Carlin, B. P. & Gelfand, A. E. Hierarchical Modeling and Analysis for Spatial Data 2nd edn (Chapman and Hall/CRC, 2014).
- 45.Besag J. Spatial interaction and the statistical analysis of lattice systems. J. R. Stat. Soc. Ser. B. 1974;36:192–236. [Google Scholar]
- 46.Ma, Y. & Zhou, X. Spatially informed cell-type deconvolution for spatial transcriptomics. Nat. Biotechnol.40, 1349–1359 (2022). [DOI] [PMC free article] [PubMed]
- 47.Sun S, Zhu J, Zhou X. Statistical analysis of spatial expression patterns for spatially resolved transcriptomic studies. Nat. Methods. 2020;17:193–200. doi: 10.1038/s41592-019-0701-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Cressie, N. Statistics for spatial data. Wiley, (1993).
- 49.Brook D. On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbour systems. Biometrika. 1964;51:481–483. [Google Scholar]
- 50.Aitken AC. IV.—On least squares and linear combination of observations. Proc. R. Soc. Edinb. 1936;55:42–48. [Google Scholar]
- 51.Yu J, et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 2006;38:203–208. doi: 10.1038/ng1702. [DOI] [PubMed] [Google Scholar]
- 52.Yang J, Zaitlen NA, Goddard ME, Visscher PM, Price AL. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 2014;46:100–106. doi: 10.1038/ng.2876. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53.Svishcheva GR, Axenovich TI, Belonogova NM, van Duijn CM, Aulchenko YS. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 2012;44:1166–1170. doi: 10.1038/ng.2410. [DOI] [PubMed] [Google Scholar]
- 54.Jiang L, et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 2019;51:1749–1755. doi: 10.1038/s41588-019-0530-8. [DOI] [PubMed] [Google Scholar]
- 55.Listgarten J, et al. Improved linear mixed models for genome-wide association studies. Nat. Methods. 2012;9:525–526. doi: 10.1038/nmeth.2037. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56.Gulati GS, et al. Single-cell transcriptional diversity is a hallmark of developmental potential. Science. 2020;367:405–411. doi: 10.1126/science.aax0249. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 57.Chang CC, et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience. 2015;4:7. doi: 10.1186/s13742-015-0047-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58.The GTEx Consortium The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science369, 1318–1330 (2020). [DOI] [PMC free article] [PubMed]
- 59.Stegle O, Parts L, Piipari M, Winn J, Durbin R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 2012;7:500–507. doi: 10.1038/nprot.2011.457. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 60.Kim-Hellmuth S, et al. Cell type-specific genetic regulation of gene expression across human tissues. Science. 2020;369:eaaz8528. doi: 10.1126/science.aaz8528. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 61.Davis JR, et al. An efficient multiple-testing adjustment for eQTL studies that accounts for linkage disequilibrium between variants. Am. J. Hum. Genet. 2016;98:216–224. doi: 10.1016/j.ajhg.2015.11.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 62.Taylor-Weiner A, et al. Scaling computational genomics to millions of individuals with GPUs. Genome Biol. 2019;20:228. doi: 10.1186/s13059-019-1836-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 63.Stuart T, Srivastava A, Madad S, Lareau CA, Satija R. Single-cell chromatin state analysis with Signac. Nat. Methods. 2021;18:1333–1341. doi: 10.1038/s41592-021-01282-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 64.Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550. doi: 10.1186/s13059-014-0550-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 65.Leal, S. M. Genetics and analysis of quantitative traits. Am. J. Hum. Genet.68, 548–549 (2001).
- 66.Qi T, et al. Genetic control of RNA splicing and its distinct role in complex trait variation. Nat. Genet. 2022;54:1355–1363. doi: 10.1038/s41588-022-01154-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 67.Song, L., Sun, X., Qi, T. & Yang, J. Mixed model-based deconvolution of cell-state abundances along a one-dimensional trajectory [csd-eQTL]. Zenodo10.5281/zenodo.8018006 (2023). [DOI] [PMC free article] [PubMed]
- 68.Song, L., Sun, X., Qi, T. & Yang, J. Mixed model-based deconvolution of cell-state abundances along a one-dimensional trajectory [code]. Code Ocean10.24433/CO.8176953.v1 (2023). [DOI] [PMC free article] [PubMed]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All the scRNA-seq, scATAC-seq and bulk RNA-seq data used in this study are available in the public domain with the relevant information summarized in Supplementary Tables 1 and 2. The GTEx genotype data is available at https://gtexportal.org/home/protectedDataAccess. The GTEx eQTLs summary data is available at https://gtexportal.org/home/datasets. The csd-eQTLs summary data is available at 10.5281/zenodo.8018006 ref. 67. The GRCh38 genome is available at https://www.ncbi.nlm.nih.gov/projects/genome/guide/human. The GENECODE-v38 transcriptome reference is available at https://www.gencodegenes.org/human. Source data for Figs. 2–6 and Extended Data Figs. 1–5 are available with this paper.
The source code of MeDuSA is available at https://github.com/LeonSong1995/MeDuSA ref. 68.