NAR Genomics and Bioinformatics. 2021 Jun 22;3(2):lqab056. doi: 10.1093/nargab/lqab056

A computational method for direct imputation of cell type-specific expression profiles and cellular compositions from bulk-tissue RNA-Seq in brain disorders

Abolfazl Doostparast Torshizi 1,2, Jubao Duan 3,4, Kai Wang 5,6
PMCID: PMC8219045  PMID: 34169279

Abstract

The importance of cell type-specific gene expression in disease-relevant tissues is increasingly recognized in genetic studies of complex diseases. However, most gene expression studies are conducted on bulk tissues, without examining cell type-specific expression profiles. Several computational methods are available for cell type deconvolution (i.e. inference of cellular composition) from bulk RNA-Seq data, but few of them impute cell type-specific expression profiles. We hypothesize that with external prior information such as single cell RNA-seq and population-wide expression profiles, it can be computationally tractable to estimate both cellular composition and cell type-specific expression from bulk RNA-Seq data. Here we introduce CellR, which addresses cross-individual gene expression variations to adjust the weights of cell-specific gene markers. It then transforms the deconvolution problem into a linear programming model while taking into account inter/intra cellular correlations and uses a multi-variate stochastic search algorithm to estimate the cell type-specific expression profiles. Analyses on several complex diseases such as schizophrenia, Alzheimer’s disease, Huntington’s disease and type 2 diabetes validated the efficiency of CellR, while revealing how specific cell types contribute to different diseases. In summary, CellR compares favorably against competing approaches, enabling cell type-specific re-analysis of gene expression data on bulk tissues in complex diseases.

INTRODUCTION

Bulk-tissue RNA sequencing (RNA-seq) yields an average gene expression profile across a collection of heterogeneous cell types, but it does not reveal the gene expression profiles of the specific cell populations of interest. Since not all cell types are equally involved in disease progression (1), gene expression analysis of the cell types most relevant to a disease may reveal more biological insight than analysis of bulk tissue. For example, developmental processes of organisms, including morphogenesis, embryogenesis and cell differentiation, are directly affected by the relative composition of cell types (2). Likewise, the presence or absence of a particular cell type explains the etiology of many diseases (3,4). As an example, Alzheimer’s disease is characterized by changes in the glial populations in the brain (5), while the composition of white blood cells can be an indicator of acute cellular rejection of transplanted kidneys (6). Cell type composition has also been shown to play a critical role in tumorigenesis, in which heterogeneity of tumor cells is implicated in cancer metastasis (7). Recent advances in single cell RNA-seq (scRNA-seq) technologies have made it clear how specific cell types affect disease mechanisms. Remarkable findings in autism spectrum disorders (8), schizophrenia (1,9,10), retinal tissue (11) and the anatomy of human kidneys (12) all demonstrated how specific cell types are most relevant to the pathogenesis of different diseases.

The emergence of scRNA-seq technologies has enabled researchers to formalize the classification of the inherent heterogeneity of cell populations. However, such technologies are more expensive and analytically challenging than bulk RNA-seq assays, limiting their use in population-scale studies. Despite the prevalence of experimental approaches to enumerate cells, such as laser-capture microdissection and cell sorting, in silico deconvolution is gaining popularity. Broadly speaking, computational deconvolution methods can be categorized into two groups (13): ‘partial’ and ‘complete’ approaches. In the former category, only cellular proportions can be estimated from bulk data, while in the latter, cellular proportions and cell-type reference profiles are directly deconvolved from bulk expression data. ‘Complete’ deconvolution approaches can be further split into semi-supervised and unsupervised. Most of the computational methods fall in the semi-supervised category, where a set of marker genes for each given cell/tissue type is available (14,15). Another potential classification scheme for in silico cell type deconvolution is based on the type of transcription data: whether the method is designed for microarray or RNA-seq (16). It is unclear whether and how methods exclusively designed for microarray platforms can be effectively adopted for next-generation sequencing (NGS) data, given the improved linear associations between true RNA abundance and sequence reads over microarrays (17,18). Some researchers, such as Liebner et al. (19), emphasize developing RNA-seq-specific statistical models. In a recent benchmark study, Cobos et al. (20) compared some of the available methods, which offers insights into the characteristics of the existing methods.

Given reference scRNA-seq data from tissues of interest, estimating the cellular composition of bulk RNA-seq data as well as estimating cell-specific expression profiles is an important yet challenging computational problem. Multiple methods have been proposed over the past few years, such as CIBERSORT (3), CIBERSORTx (21), ABIS (22), MuSiC (23), Deconf (24), DC3 (25), lsFit (4) and BSEQ-sc (26). While some of these methods, such as Deconf or lsFit, can be used in various contexts, others, such as ABIS or CIBERSORT, were primarily developed for certain diseases such as cancers, for enumerating immune cell-types and tumor cells. A common feature shared by many of these approaches is their reliance on markers known a priori (i.e. users need to provide a list of ‘marker’ genes for each cell type) as well as their limited use outside specialized and well-studied cell types. CIBERSORTx, however, provides an additional signature extraction module to generate gene markers to be used during the deconvolution process, as well as a module to estimate cell-specific gene expression profiles. More recently, efforts have been made to further apply computational models in exploring cell-specificity in transcriptomics studies. Sokolowski et al. (27) introduced scMappR to study which specific cell-types mainly drive the dysregulation of genes in bulk RNA-seq data. They demonstrated the capabilities of the method by assigning differentially expressed genes to cell-types involved in kidney regeneration, including a small population of immune cells. Moreover, Jaakkola and Elo (28) introduced a robust linear regression-based approach aimed at estimating cell-specific expression profiles.

To improve the accuracy of inferring cell type-specific expression profiles, other factors need to be taken into account, such as cross-individual genetic variations that may result in different magnitudes of variation of ‘marker’ genes in bulk samples from a specific individual. To overcome these limitations and to maximize the accuracy of cell-type deconvolution in a data-driven fashion, we introduce CellR (Figure 1; https://github.com/adoostparast/CellR), a computational method to deconvolve bulk-tissue RNA-Seq data and infer the cellular compositions as well as cell type-specific gene expression values, using an external scRNA-Seq data set as a reference. CellR incorporates cross-individual gene expression variations during the deconvolution process, assigning different weights to the identified cell markers that reflect variations across individuals in a population. Moreover, given the estimated cellular composition of bulk samples, CellR is capable of imputing expression profiles for each cell type, thus significantly extending the practical utility of the tool beyond cell type deconvolution. Indeed, estimation of cell type-specific gene expression will open new doors to re-analyze gene expression data on bulk tissues in population cohorts on complex diseases, by focusing on comparative analysis of specific cell types. We illustrate with a few case studies how such cell type-specific analysis can generate biological insights beyond traditional bulk tissue-based analysis.

Figure 1.


Schematic of the CellR pipeline. (A) Cell deconvolution module: CellR receives the reference scRNA-seq data and pre-processes it to remove unwanted artifacts. CellR finds sets of cell types and then extracts their corresponding markers. In order to account for genetic variations that may modify gene expression, CellR receives the TPM matrix from GTEx for the genes from the tissue under study and calculates the weights of the identified gene markers to create the scMM. The scMM and the bulk RNA-seq data, after pre-processing, are fed to the developed linear programming model, and the cellular composition of each bulk sample is output. (B) Cell-specific gene expression profiling module: CellR receives bulk RNA-seq libraries, infers cellular proportions and cell-type counts within each library, processes each library via a newly developed meta-heuristic search optimization algorithm to specify the distribution parameters of each gene within each cell population, and outputs separate transcriptional profiles for distinct cell-types.

MATERIALS AND METHODS

CellR is a data-driven method to recover the cellular composition of bulk RNA-seq samples given an scRNA-seq dataset (usually generated on a different sample but from the same tissue of interest) as a reference. In the following, the stages of CellR depicted in Figure 1 are described in detail.

Model structure

CellR has two main modules including: (i) cellular enumeration module aimed at estimating the cellular proportions within bulk RNA-seq samples; (ii) cell-specific gene expression estimation module that infers the gene expression profile for each independent cell type of bulk RNA-seq libraries.

In the cellular enumeration module, given reference scRNA-seq data from the tissue under study, CellR partitions the cell types and obtains cell-specific genes that are significantly upregulated in each cell type compared to all others, using the Wilcoxon rank sum test. CellR creates a matrix called the single cell marker matrix (scMM) describing the expression of the data-derived markers across the sequenced cells, using the cellular annotations provided by the user. Next, using the available data from the GTEx project (29), CellR receives the cross-individual gene expression from specific human tissues and weights the extracted markers so that stable markers, which are less prone to inter-individual variations, rank higher. After applying the obtained weights to the scMM and pre-processing the bulk RNA-seq data to normalize for library size, CellR creates a linear programming (LP) model penalized over the contribution of every single cell in the reference data. Two penalty modes are considered: (i) Lasso mode, where the contributions of transcription-wise correlated cells, i.e. most of the cells belonging to the same cell type, are shrunk to zero and only the most informative cells are used in the model; (ii) Ridge mode, in which the contributions of clustered cells are pulled together so that the overall objective function is minimized. After solving the optimization model, CellR reports the cellular proportions of the identified cell types in the bulk tissue RNA-Seq data. Additionally, using the cellular proportions output by CellR, one can generate the predicted gene expression profiles for each cell type, given the bulk tissue RNA-Seq of the sample.

The cell-specific gene expression estimation module receives the cellular proportions in a bulk RNA-seq sample, generated either by CellR or by similar approaches. It consists of a meta-heuristic multivariate search mechanism that optimizes the distribution parameters of each gene within each independent cell population, which can later be used for downstream analysis. This module outputs the overall expression profiles across certain cell populations, in a format similar to bulk RNA-seq data, which contain a mixture of expression profiles from multiple cell types.

Optimization model

Let $f(\beta)$ be the objective function of the proposed model, minimized over $\beta$, as follows:

$$f(\beta) = \frac{1}{2M}\sum_{j=1}^{N}\sum_{i=1}^{M}\Big(y_{ij} - \sum_{c=1}^{C} s_{ci}\,\beta_{cj}\Big)^{2} + \lambda P_{\alpha}(\beta) \qquad (1)$$

where $Y = [y_{ij}] \in \mathbb{R}^{M \times N}$ represents the gene expression levels of the total number of bulk samples ($N$) such that $M$ denotes the number of marker genes, $y_{ij}$ represents the expression of marker gene $i$ in the sample $j$, $\beta = [\beta_{cj}] \in \mathbb{R}^{C \times N}$ represents the proportion of the total number of cells ($C$) in the bulk sample, in which $\beta_{cj}$ is the proportion of the cell $c$ in the bulk sample $j$, $S = [s_{ci}] \in \mathbb{R}^{C \times M}$ represents the created single-cell marker matrix (scMM), $\lambda$ is the complexity factor, and $P_{\alpha}(\beta)$ is the elastic net penalty described in what follows. Extending Equation (1), the penalty term will be as follows:

$$P_{\alpha}(\beta) = \frac{1-\alpha}{2}\,\|\beta\|_{2}^{2} + \alpha\,\|\beta\|_{1} \qquad (2)$$

where $0 \le \alpha \le 1$. $\alpha = 0$ equates to ridge mode and $\alpha = 1$ denotes lasso mode. In Equation (2), $\|\cdot\|_{1}$ and $\|\cdot\|_{2}$ denote the $\ell_1$ and $\ell_2$ norms of the $\beta$ matrix.

In the current version, CellR internally adopts the glmnet software package (30) (v. 2.0–16) to solve the optimization problem and uses edgeR (31) (v. 3.22.5) for normalizing the bulk RNA-seq data. glmnet employs cyclical coordinate descent, successively optimizing the objective function over each parameter while keeping the others fixed, and repeats the cycle until convergence. The standard procedure recommended by the edgeR developers was used to normalize the raw bulk RNA-seq counts. CellR annotates the identified clusters using the cell annotations provided by the user as an input. After solving the optimization model, CellR reports the cellular proportions of the identified cell types in the bulk tissue RNA-Seq data.
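
As a rough illustration of the kind of penalized regression described here, the following Python sketch fits an elastic-net model for a single bulk sample by cyclical coordinate descent. It is not CellR's code (CellR calls glmnet in R); the function names, the non-negativity clamp, the rescaling of coefficients into proportions and the toy data are all assumptions made for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_cd(design, y, lam, alpha, n_iter=500, tol=1e-10, nonneg=True):
    """Minimize (1/2m)*||y - design@b||^2 + lam*((1-alpha)/2*||b||_2^2 + alpha*||b||_1).

    design has m rows (marker genes) and c columns (reference cells or cell types).
    alpha = 1 gives the lasso penalty, alpha = 0 the ridge penalty.
    """
    m, c = len(design), len(design[0])
    b = [0.0] * c
    residual = list(y)  # equals y - design@b because b starts at zero
    col_sq = [sum(design[i][k] ** 2 for i in range(m)) / m for k in range(c)]
    for _ in range(n_iter):
        max_change = 0.0
        for k in range(c):
            if col_sq[k] == 0.0:
                continue
            # Covariance of column k with the partial residual (b[k] excluded).
            rho = sum(design[i][k] * residual[i] for i in range(m)) / m + col_sq[k] * b[k]
            new_bk = soft_threshold(rho, lam * alpha) / (col_sq[k] + lam * (1.0 - alpha))
            if nonneg:  # assumption: proportions cannot be negative
                new_bk = max(0.0, new_bk)
            delta = new_bk - b[k]
            if delta != 0.0:
                for i in range(m):
                    residual[i] -= design[i][k] * delta
                b[k] = new_bk
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return b


def to_proportions(coefs):
    """Rescale non-negative coefficients so that they sum to one."""
    total = sum(coefs)
    return [v / total for v in coefs] if total > 0 else list(coefs)


# Toy example: 4 marker genes, 2 cell types, true mixing weights 0.7 and 0.3.
toy_design = [[10.0, 0.0], [8.0, 1.0], [0.0, 6.0], [1.0, 9.0]]
toy_y = [7.0, 5.9, 1.8, 3.4]
toy_props = to_proportions(elastic_net_cd(toy_design, toy_y, lam=1e-9, alpha=1.0))
```

With a negligible penalty the toy mixture is recovered almost exactly, while a sufficiently large `lam` in lasso mode drives every coefficient to zero.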

Obtaining expression stability of genes using GTEx data

Let $X \in \mathbb{R}^{C \times M}$ be the matrix of $M$ extracted markers from the reference scRNA-seq data across the entire number of cells $C$ (CellR internally employs some modules from Seurat (32) for marker extraction). Then, scMM can be obtained as follows:

$$s_{c} = w \odot x_{c}, \quad c = 1, \dots, C \qquad (3)$$

where $w$ is the obtained weight vector from Equation (4), $x_{c}$ is the $c$-th row of the matrix $X$ belonging to cell $c$, and $\odot$ represents the element-wise product of the two vectors. Row-wise concatenation of all $s_{c}$ vectors will create the scMM $S$. In order to obtain the weight vector denoted in Equation (3), let $G = [g_{il}]$ be the TPM (transcripts per million) matrix from the Genotype-Tissue Expression (GTEx) database (29), where $i$ denotes the genes and $l$ denotes the individuals in the GTEx data. $g_{il}$ denotes the expression of gene $i$ for the individual $l$ in the consortium. The GTEx project is a comprehensive public resource to study tissue-specific gene expression and regulation. Let $g_{i}$ be the expression of gene $i$ across all individuals in the GTEx data. We obtain the gene weight vector $w$ as follows:

$$w_{i} = \frac{1}{\sigma_{i} + \varepsilon}, \quad i = 1, \dots, M \qquad (4)$$

where $w_{i}$ denotes the weight of the gene $i$, $\sigma_{i}$ denotes the standard deviation of the expression of the gene $i$ across all individuals in the GTEx data, $\varepsilon$ is a very small positive real number to avoid having a zero in the denominator, and $M$ denotes the total number of marker genes in scMM.
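
The weighting step can be sketched in a few lines of Python. This is a minimal illustration, not the CellR implementation: it assumes an inverse-standard-deviation weight (the standard deviation plus a small constant in the denominator, as the description suggests), uses the population standard deviation, and the function names, gene labels and numbers are hypothetical.

```python
import statistics


def marker_weights(gtex_tpm, eps=1e-6):
    """Weight each gene by 1 / (standard deviation across individuals + eps).

    gtex_tpm maps a gene name to its TPM values across individuals.
    Genes that vary little between individuals receive larger weights.
    """
    return {gene: 1.0 / (statistics.pstdev(values) + eps)
            for gene, values in gtex_tpm.items()}


def weighted_marker_matrix(marker_matrix, genes, weights):
    """Multiply every cell's marker profile element-wise by the weight vector.

    marker_matrix has one row per cell and one column per gene in `genes`.
    """
    w = [weights[g] for g in genes]
    return [[w_j * x for w_j, x in zip(w, row)] for row in marker_matrix]


# Hypothetical toy input: GENE_A is stable across individuals, GENE_B is not.
toy_tpm = {"GENE_A": [9.0, 10.0, 11.0, 10.0], "GENE_B": [2.0, 18.0, 5.0, 15.0]}
toy_weights = marker_weights(toy_tpm)
toy_scmm = weighted_marker_matrix([[4.0, 1.0], [0.0, 7.0]],
                                  ["GENE_A", "GENE_B"], toy_weights)
```

In this toy input the stable gene ends up with the larger weight, so its column contributes more to the weighted marker matrix.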

Creating artificial bulk RNA-seq data

Let $Z = [z_{ci}] \in \mathbb{R}^{C \times Q}$ be an scRNA-seq matrix containing $C$ cells and $Q$ genes. The artificial bulk data $b = (b_{1}, \dots, b_{Q})$ can be obtained by summing up the raw counts of each gene across the entire cell population.

$$b_{i} = \sum_{c=1}^{C} z_{ci}, \quad i = 1, \dots, Q \qquad (5)$$
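
A minimal Python sketch of this construction is given below. The function names and the toy counts are hypothetical; the second helper simply records the known cell-type proportions that serve as ground truth when the artificial bulk sample is deconvolved.

```python
from collections import Counter


def make_artificial_bulk(sc_counts):
    """Sum the raw counts of each gene over all cells.

    sc_counts has one row per cell and one column per gene.
    """
    n_genes = len(sc_counts[0])
    return [sum(cell[g] for cell in sc_counts) for g in range(n_genes)]


def true_proportions(cell_labels):
    """Fraction of cells carrying each cell-type label (the known truth)."""
    counts = Counter(cell_labels)
    total = len(cell_labels)
    return {label: n / total for label, n in counts.items()}


# Toy example: three cells, three genes.
toy_bulk = make_artificial_bulk([[1, 0, 3], [2, 5, 0], [0, 1, 1]])
toy_truth = true_proportions(["Gran", "Gran", "Purk", "Ast"])
```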

Competing methods

Four methods were used for comparing the efficiency and accuracy of CellR: CIBERSORT (3) v. 1.06 (https://cibersort.stanford.edu/), Deconf (24), CIBERSORTx (21) and lsFit (4). We used the CellMix 1.6 software package (13) (http://web.cbio.uct.ac.za/∼renaud/CRAN/web/CellMix/) in R to run Deconf and lsFit. For all of these methods, standard running procedures were applied. To create the signature marker list in CIBERSORTx, we used the reference scRNA-seq data as well as the phenotype class files and followed the CIBERSORTx standard procedure to create the signature gene file. We used Seurat (32) to extract the marker genes for the identified cell clusters and used these genes as the input markers for Deconf and lsFit.

Cross-validation and re-sampling strategies

A 10-fold cross validation strategy was used to compare the accuracy of CellR against the other methods. First, we split the created artificial bulk RNA-seq datasets into 10 different subsets. In each iteration of cross validation, CellR was applied to each subset and the corresponding root-mean square error (RMSE) was calculated. Next, the obtained RMSE values in each iteration were averaged and reported on each artificial bulk data. To compare stability of the output of the competing methods, we employed a uniform distribution re-sampling with replacement comprising 30% and 10% of the cells in the reference scRNA-seq datasets and trained the models. We iterated the re-sampling for 1,000 times and reported the average RMSEs and their variance in the paper.
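
The re-sampling strategy can be sketched as follows. This is an illustrative Python outline rather than the code used in the study: `estimator` is a hypothetical callable standing in for any deconvolution method, and the toy estimator below merely returns label frequencies so that the example stays self-contained.

```python
import math
import random
import statistics


def rmse(estimated, truth):
    """Root mean square error between estimated and true proportions."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated, truth)) / len(truth))


def resample_cells(cells, fraction, rng):
    """Draw round(fraction * len(cells)) cells uniformly, with replacement."""
    k = max(1, round(fraction * len(cells)))
    return rng.choices(cells, k=k)


def resampling_stability(cells, fraction, estimator, truth, n_iter=1000, seed=0):
    """Mean and variance of the RMSE over repeated re-sampling of the reference."""
    rng = random.Random(seed)
    errors = [rmse(estimator(resample_cells(cells, fraction, rng)), truth)
              for _ in range(n_iter)]
    return statistics.mean(errors), statistics.variance(errors)


def label_frequencies(sample, labels=("Gran", "Purk", "Ast")):
    """Toy stand-in for a deconvolution method: observed label frequencies."""
    return [sample.count(label) / len(sample) for label in labels]


# Toy reference of 100 labelled cells, re-sampled at 30% and 10%.
toy_cells = ["Gran"] * 60 + ["Purk"] * 30 + ["Ast"] * 10
toy_truth = [0.6, 0.3, 0.1]
mean_30, var_30 = resampling_stability(toy_cells, 0.30, label_frequencies, toy_truth, n_iter=200)
mean_10, var_10 = resampling_stability(toy_cells, 0.10, label_frequencies, toy_truth, n_iter=200)
```

Fixing the seed makes the comparison between re-sampling rates reproducible; in practice the same re-sampled references would be passed to every benchmarked method.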

Estimating cell type-specific gene expression

We generated cell type-specific gene expression levels by estimating the prior distributions of each cell-type within the samples in the bulk data. We used the following equation to model the expression of a gene in an individual:

$$y_{ij} = \sum_{k=1}^{K} p_{jk}\, e_{ijk}$$

where $y_{ij}$ denotes the bulk expression of gene $i$ in individual $j$, $p_{jk}$ denotes the proportion of cell type $k$ in individual $j$, $e_{ijk}$ represents the expression of gene $i$ in cell type $k$ for individual $j$, and $K$ denotes the total number of cell types. Suppose $e_{ijk}$ follows a negative binomial distribution of the form $e_{ijk} \sim \mathrm{NB}(\mu_{ik}, \theta_{ik})$. We estimate the parameters of the distributions for every gene $i$ across every cell type $k$ through a simulated annealing (SA) optimization process (please see the pseudocode in the following). We used the SA structure from our other study (33). The algorithm starts with random initial parameters for the expression of each gene in distinct cell types. To reduce the risk of falling into local optima, we used the mean of the bulk counts for each gene as the mean parameter of the prior distribution and a randomly generated number between 0 and 1 as the dispersion parameter. During each iteration of the algorithm, using the estimated parameters, sample cell-specific expression values are generated, followed by obtaining an estimate of the bulk expression levels as follows:

$$\hat{y}_{ij} = \sum_{k=1}^{K} p_{jk}\, \hat{e}_{ijk}$$

in which $\hat{y}_{ij}$ denotes the estimated bulk expression of gene $i$ in individual $j$ and $\hat{e}_{ijk}$ represents the estimated expression of gene $i$ in cell type $k$ for individual $j$, which has been sampled from the simulated prior distribution. Then, in each iteration, the root mean square error ($\mathrm{RMSE}_{i}$) is calculated for each gene $i$ as follows:

$$\mathrm{RMSE}_{i} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(y_{ij} - \hat{y}_{ij}\right)^{2}}$$

where $N$ represents the total number of individual samples in the bulk data. The convergent set of parameters with the lowest RMSE will then be kept as the prior parameters of cell-specific gene expression levels in the bulk data. We should note that during each iteration of the algorithm for each gene $i$, new parameters are generated by one of two perturbation mechanisms, selected at random: (i) current parameters ± rand(−0.5, 0.5) × current parameters; (ii) new mean parameter ($\mu_{\mathrm{new}}$): current mean parameter ($\mu_{\mathrm{current}}$) ± standard deviation of the gene $i$ across the bulk data, new dispersion parameter ($\theta_{\mathrm{new}}$): rand[0, 1]. Another major feature incorporated in this algorithm is its capability to escape from local optimum regions by accepting, with a restricted probability, parameters with a worse $\mathrm{RMSE}_{i}$ in another domain of the search space, to ensure that the entire search space is scanned for potential global optima (see the pseudocode below). Simulated annealing has been proven to converge to a near-optimal solution (34).

In the following, we have represented the pseudocode of the developed algorithm:

  • Input: bulk libraries (b), reference scRNA-seq (r), cellular proportions (p), cell-type counts (f), cell-type of interest (c)
  • Output: gene expression profiles for the cell-type c within each bulk library.
  • Set: temperature (t), maximum temperature (t_max), minimum temperature (t_min), rate factor (α)
  • For every gene in b
  •   Calculate mean expression of the gene as the search start point (μ)
  •   Create negative binomial distribution NB(μ, θ) where θ is a random real number from [0, 1]
  •   Generate b random values from the created distribution
  •   While t > t_min
  •     For 1 to 200
  •       Generate new parameters using two mechanisms, each randomly chosen
  •       Calculate RMSE
  •       If RMSE gets lower, keep the new parameters
  •       Else keep the new parameters if 1 − exp(t × RMSE) < 0.01
  •       End If
  •     End For
  •     Set t = t_max × (1 − α)^iteration
  •   End While
  • End For
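
The search loop above can be sketched in Python for a single gene. This is a simplified illustration under stated assumptions, not the CellR implementation: it substitutes the standard Metropolis acceptance rule for the acceptance test in the pseudocode, samples the negative binomial as a gamma-Poisson mixture with an approximate Poisson sampler, clamps the dispersion to (0, 1], and all function names, default settings and toy data are hypothetical.

```python
import math
import random


def sample_poisson(lam, rng):
    """Poisson draw: Knuth's method for small means, normal approximation otherwise."""
    if lam <= 0.0:
        return 0
    if lam > 30.0:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def sample_nb(mu, theta, rng):
    """Negative binomial draw with mean mu and dispersion theta (gamma-Poisson mixture)."""
    if mu <= 0.0:
        return 0
    return sample_poisson(rng.gammavariate(theta, mu / theta), rng)


def gene_rmse(bulk, props, params, rng):
    """RMSE between observed bulk values and proportion-weighted sampled expression."""
    total = 0.0
    for y, p in zip(bulk, props):
        y_hat = sum(p_k * sample_nb(mu, th, rng) for p_k, (mu, th) in zip(p, params))
        total += (y - y_hat) ** 2
    return math.sqrt(total / len(bulk))


def perturb(params, bulk_sd, rng):
    """Apply one of the two perturbation mechanisms, chosen at random."""
    use_first = rng.random() < 0.5
    new = []
    for mu, th in params:
        if use_first:  # (i) current +/- rand(-0.5, 0.5) x current
            mu += rng.uniform(-0.5, 0.5) * mu
            th += rng.uniform(-0.5, 0.5) * th
        else:          # (ii) mean +/- bulk standard deviation, fresh dispersion
            mu += rng.choice((-1.0, 1.0)) * bulk_sd
            th = rng.random()
        new.append((max(mu, 1e-8), min(max(th, 1e-3), 1.0)))
    return new


def anneal_gene(bulk, props, t_max=1.0, t_min=1e-3, rate=0.1, inner=200, seed=0):
    """Search NB parameters (mean, dispersion) per cell type for one gene."""
    rng = random.Random(seed)
    mean = sum(bulk) / len(bulk)
    sd = math.sqrt(sum((y - mean) ** 2 for y in bulk) / len(bulk))
    current = [(max(mean, 1e-8), max(rng.random(), 1e-3)) for _ in props[0]]
    current_err = gene_rmse(bulk, props, current, rng)
    best, best_err = current, current_err
    t, iteration = t_max, 0
    while t > t_min:
        for _ in range(inner):
            candidate = perturb(current, sd, rng)
            err = gene_rmse(bulk, props, candidate, rng)
            # Metropolis rule (an assumption): always accept improvements and
            # accept worse candidates with probability exp(-increase / t).
            if err < current_err or rng.random() < math.exp(-(err - current_err) / t):
                current, current_err = candidate, err
                if err < best_err:
                    best, best_err = candidate, err
        iteration += 1
        t = t_max * (1.0 - rate) ** iteration  # geometric cooling schedule
    return best, best_err


# Toy data: four bulk samples, two cell types (proportions sum to 1 per sample).
toy_props = [[0.8, 0.2], [0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]
toy_bulk = [12.0, 14.0, 15.0, 17.0]
toy_params, toy_err = anneal_gene(toy_bulk, toy_props, inner=20)
```

Because the objective is evaluated on freshly sampled expression values, the RMSE is itself noisy; the sketch keeps the best parameter set seen so far rather than the final state of the chain.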

Measuring similarity of estimated cell-specific expression profiles with the bulk data

In order to measure the similarity of each cell-specific estimate of the expression profiles with the original bulk data, we used the cosine similarity measure in the text2vec R package. The cosine similarity between two vectors $a$ and $b$ is defined as follows (35):

$$\cos(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$$

where $\|a\|$ and $\|b\|$ denote the Frobenius norm of the two vectors $a$ and $b$, respectively. We calculated the similarity of the expression profile of each gene in the cell-specific estimated data versus its expression in the bulk data and averaged the similarity values over all of the genes in distinct cell-types.
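
For illustration, the per-gene cosine similarity and its average can be written in plain Python as below. The study used the text2vec R package; this sketch only mirrors the definition, and the function names, the convention of returning 0 for an all-zero vector, and the toy values are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors (assumption)
    return dot / (norm_a * norm_b)


def mean_gene_similarity(cell_specific, bulk):
    """Average per-gene cosine similarity over the genes present in both inputs.

    Both arguments map a gene name to its expression values across samples.
    """
    genes = [g for g in cell_specific if g in bulk]
    return sum(cosine_similarity(cell_specific[g], bulk[g]) for g in genes) / len(genes)


# Toy example: G1 matches the bulk profile exactly, G2 is orthogonal to it.
toy_similarity = mean_gene_similarity(
    {"G1": [1.0, 0.0], "G2": [1.0, 0.0]},
    {"G1": [1.0, 0.0], "G2": [0.0, 1.0]},
)
```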

Processing of scRNA-seq data

To pre-process the raw scRNA-seq count data, CellR internally employs the Seurat software (32) (v. 2.3) in R. During the pre-processing stage, the percentage of mitochondrial gene counts is first computed. Then, to normalize the gene expression measurements for each cell, global-scaling normalization is applied, followed by multiplying the counts by a scale factor of 10,000 to keep the linear assumption made in this paper. Next, the data are scaled by regressing out the percentage of mitochondrial gene content. Principal component analysis (PCA) is then performed on the pre-processed data. The number of principal components to be used in clustering for finding the cluster markers can be determined by a resampling test inspired by the jackStraw procedure (36).
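
The first two pre-processing steps can be illustrated with a short Python sketch. It is not Seurat code: the function names are hypothetical, the "MT-" prefix for human mitochondrial genes is an assumed naming convention, and no log transform is applied here because the text describes only the scaling step.

```python
def percent_mito(cell_counts, gene_names, prefix="MT-"):
    """Percentage of a cell's counts that come from mitochondrial genes."""
    total = sum(cell_counts)
    mito = sum(c for c, g in zip(cell_counts, gene_names) if g.startswith(prefix))
    return 100.0 * mito / total if total else 0.0


def global_scale(cell_counts, scale_factor=10000):
    """Divide counts by the cell's total and multiply by the scale factor."""
    total = sum(cell_counts)
    if not total:
        return [0.0] * len(cell_counts)
    return [scale_factor * c / total for c in cell_counts]


# Toy cell with one mitochondrial gene and two nuclear genes.
toy_genes = ["MT-CO1", "GAPDH", "ACTB"]
toy_counts = [10, 30, 60]
toy_pct = percent_mito(toy_counts, toy_genes)
toy_scaled = global_scale(toy_counts)
```

After scaling, every cell's values sum to the scale factor, which keeps the relationship between counts and expression linear across cells with different sequencing depths.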

RESULTS

Numerical experiments on simulation data

To test the efficiency of CellR, we first created two artificial bulk RNA-seq datasets (see the Materials and Methods section) using two independent scRNA-seq datasets from Lake et al. (37) and Segerstolpe et al. (38) on cerebellum and pancreas, respectively. We used the procedure recommended by Wang et al. (23) to create the artificial bulk data. The main advantage of such an approach is that the correct proportions of the available cell types are already known, so that different computational approaches can be evaluated against the known truth. The data on cerebellum contain >5,600 cells including neuronal cell types such as granular cells (Gran, Percentage = 58.8%) and Purkinje cells (Purk, Percentage = 17.8%) as well as non-neuronal cells including endothelial cells (End, Percentage = 1.2%), astrocytes (Ast, Percentage = 9.9%), oligodendrocytes (Oli, Percentage = 3.4%), pericytes (Per, Percentage = 0.77%), oligodendrocyte precursor cells (OPCs, Percentage = 5.13%) and microglia (Mic, Percentage = 3%). The data from pancreas are a less heterogeneous set of endocrine cells comprising five cell types called α (Percentage = 60.1%), β (Percentage = 18.3%), δ (Percentage = 7.72%), ε (Percentage = 0.48%) and γ (Percentage = 13.4%). We ran CellR using two modes (lasso and ridge) and compared its accuracy with a few existing methods including CIBERSORT, CIBERSORTx, MuSiC, Deconf and lsFit. To test the accuracy, we split the counts of each gene equally into 10 subsets and adopted the cross validation strategy, such that we trained each method using 9 subsets and ran the model on the remaining subset. During each iteration, accuracy was measured using root mean square error (RMSE) and at the end, the average RMSE was reported (Figure 2A and B). We observed that CellR in lasso mode, CIBERSORT and CIBERSORTx outperform the other methods, while CellR in ridge mode does not yield the best performance as measured by average RMSE. MuSiC in both cases performs better than CIBERSORT and CIBERSORTx as well as CellR in ridge mode, while showing a slight increase in RMSE compared to CellR in lasso mode. lsFit and Deconf underestimated the proportion of abundant cell types, such as Gran in cerebellum and α cells in pancreas.

Figure 2.


Comparative analysis of CellR and four other competing approaches. (A) Output of the compared methods using the artificial bulk RNA-seq data on cerebellum. (B) Output of the compared methods using the artificial bulk RNA-seq data on pancreas. (C) Average RMSE of re-sampling from the reference scRNA-seq data from cerebellum to compare stability of each method at 10% and 30% number of cells re-sampled. (D) Average RMSE of re-sampling from the reference scRNA-seq data from pancreas to compare stability of each method at 10% and 30% number of cells re-sampled. (E and F) Number of the identified cell-specific markers in brain and pancreas, respectively. (G and H) The effects of removing GTEx information from CellR on the accuracy of the results on cerebellum and pancreas data, respectively. (I) Average RMSE on the independent simulated data. (J) Comparison results of the competing methods compared to the ground truth data in human brain.

Additionally, it is known that computational methods for estimating cellular composition may be unstable when the number of cells is small. To compare the stability of the outputs of each method, we re-sampled the reference scRNA-seq data, including 30% of all cells in each iteration, performed the experiment 1,000 times and compared the average RMSEs (Figure 2C and D). CellR in lasso mode yields more stable numbers with less variation compared to the competing methods. As depicted in Figure 2C and D, CellR leads to lower RMSEs. The bar representing the average RMSE of each method includes an error bar denoting the variability of the RMSE across iterations, which shows that CellR has a reasonable degree of stability compared to the other models. We were also interested in how CellR and the competing methods perform when the re-sampling rate is decreased to 10%. We repeated the re-sampling procedure explained above at a rate of 10% and calculated the RMSE for all of the benchmarked methods. As shown in Figure 2C and D, the RMSE of all methods increased, which is a natural outcome of under-sampling the less-populated cell-types. However, we did not see significant changes in the variation of the RMSEs over 1,000 iterations, and the order of the RMSEs remains the same as at the 30% re-sampling rate. To conduct an independent analysis using simulated bulk data from another study, we used the procedure introduced by Jaakkola and Elo (28), where artificial data were used encompassing five artificial cell-types A, B, C, D and E with different proportions across 40 samples. We then ran CellR on these data and calculated the proportions of these five cell types, followed by computing the RMSEs over the generated artificial population (Figure 2I). We observed that CellR in lasso mode performs best among the benchmarked methods, while CIBERSORTx yields the second lowest RMSE. CellR in ridge mode yields a value similar to that of CIBERSORT. This additional analysis on independent simulated data indicates the reliable performance of CellR.

An advantage of CellR lies in its ability to robustly characterize cellular composition without prior biological knowledge of the markers representing cell types (however, we acknowledge that a prior clustering analysis of the scRNA-Seq data needs to be performed to define cellular clusters, which represent cell types). CIBERSORT, on the other hand, requires the user to provide cellular markers, which makes it difficult to use in scenarios where insufficient information about the underlying molecular signatures of various cell types is known. However, CIBERSORTx provides an automated module to extract gene signatures to be used during the deconvolution process. Details on the identified markers on cerebellum and pancreas are reported in Supplementary Tables S1–S2, respectively. In cerebellum, CellR revealed 1,292 gene markers, while 3,814 genes were identified in the pancreas data. We used the same markers in CIBERSORT. The number of markers per cell-type is provided in Figure 2E and F.

Another added value of CellR is to consider cross-individual gene expression variations during cell-type deconvolution. To demonstrate this, we repeated the re-sampling procedure described above on cerebellum and pancreas data when cross-individual gene expression variations from GTEx are not used in CellR (Figure 2G and H). We observed ∼9% increase in RMSE for both lasso and ridge modes when GTEx information is not included in the model compared to the cases where GTEx information is available. This stems from uncertainties induced in the linear programming model used by CellR that leads to destabilized outcomes in the optimization stage. Moreover, compared to the other benchmarked methods, it is clear that ignoring the GTEx information decreased the stability of CellR and resulted in a dramatic decrease in the overall accuracy of the method.

An important measure to check the accuracy of the proposed method is to evaluate its performance on sample data where ground truth single cell information is available on the same sample. To this end, we obtained a set of single nucleus RNA-seq data from human cerebral cortex (39) as well as bulk RNA-seq data from the same individual. The data contain 13 cell-types including 8 excitatory (Exc) and 5 inhibitory (In) neurons. We ran CellR and the other competing methods on the bulk data and compared the outcomes with the known number of available cell-types in the bulk library (Figure 2J). In the majority of the cell-types, CellR yields the most accurate proportions compared to the rest of the methods. In Exc-A set, CIBERSORT and CIBERSORTx perform better, while CIBERSORTx and MuSiC show a close performance in the other cell-types compared to the other methods. Deconf and lsFit demonstrate the poorest performance across the board. Overall, CellR was shown to have a high accuracy in the majority of the profiled cell-types.

Deconvolution of bulk RNA-Seq data in tissues that are relevant in several diseases

In real experimental situations, reference scRNA-seq and bulk RNA-seq data from the same individual may not always be available. Hence, cell type deconvolution methods should be able to accurately characterize the cellular composition of bulk data coming from individuals other than the source of the scRNA-seq data. To evaluate the performance of our method on real bulk tissue RNA-Seq data sets, using scRNA-Seq data generated on unrelated tissue samples, we obtained two sets of bulk data on postmortem human frontal cortex brain tissues. The first set, provided by Allen et al. (40), comprises 278 subjects with the following pathological diagnoses: Alzheimer’s disease (AD), N = 84; progressive supranuclear palsy (PSP), N = 84; pathologic aging (PA), N = 30; control, N = 80. The second set was obtained from a study by Labadorf et al. (41) on Huntington’s disease (HD) and was generated from human prefrontal cortex, including 20 HD subjects and 49 neuropathologically normal controls. We used the reference scRNA-seq data from reference (37) on human frontal cortex. We ran CellR as well as four other methods on the two aforementioned datasets. Cellular proportions are reported in Figure 3A to D. Eight cell-types have been enumerated: excitatory and inhibitory neurons, endothelial cells, oligodendrocyte progenitor cells (OPCs), microglia, oligodendrocytes, astrocytes and pericytes. The enumerated proportions on AD and HD are represented in Figure 3A,B and Figure 3C,D, respectively. CellR in the ridge mode yields correlated proportions while dispersion of proportions in the lasso mode is relatively higher in astrocytes, inhibitory neurons and OPCs. CIBERSORT overestimates most of the analyzed cell types both in AD and HD samples. For instance, the proportion of astrocytes given by CellR in AD samples is ∼8–21% while the proportion is ∼0–58% by CIBERSORT. This is also the case for lsFit and Deconf, whose output proportions are overestimated in pericytes. For example, both of these methods report a proportion of over 25% while the real proportion of pericytes in the reference data is <1%. Upon making comparisons, we observed that CIBERSORTx tends to yield less dispersed cellular proportions compared to CIBERSORT. In addition, in most cases, the mean cellular proportions by CIBERSORTx are closer to CellR than to CIBERSORT, including endothelial cells, pericytes and excitatory neurons in AD as well as microglia, excitatory neurons, astrocytes and oligodendrocytes in HD. To gain a deeper insight into the number of the identified gene markers within the brain and pancreas scRNA-seq datasets, the number of cell-specific markers is shown in Figure 2E and F. We observed that the number of differentially expressed (DE) genes is larger in cell types which predominantly constitute the overall number of sequenced cells. Our tests indicate that on the simulation data, CellR in lasso mode yields better accuracy while outputting slightly higher dispersion in the proportion of cell types on real data (Figure 3). As a result, for less heterogeneous data, similar to the simulation data here, we recommend using lasso mode, whereas the ridge mode may have some advantages for more complex real data.

Figure 3.

Cellular proportions of AD and HD cohorts. (A) Output of the compared methods using the bulk data from AD samples generated from human brain tissues for Exc, End, Mic and Oli cells. (B) Output of the compared methods using the bulk data from AD samples generated from human brain tissues for Ast, In, OPC and Per cells. (C) Output of the compared methods using the bulk data from HD and normal (N) samples generated from human brain tissues for Exc, End, Mic and Oli cells. (D) Output of the compared methods using the bulk data from HD and normal (N) samples generated from human brain tissues for Ast, In, OPC and Per cells. N: normal healthy controls; PSP: progressive supranuclear palsy, PA: pathologic aging, Exc: excitatory neurons, In: inhibitory neurons, Ast: astrocytes, OPC: oligodendrocyte progenitor cells, Per: pericytes, End: endothelial cells, Mic: microglia, Oli: oligodendrocytes.

In addition, we analyzed a bulk RNA-Seq dataset from Fadista et al. (42) on type 2 diabetes (T2D), owing to the availability of scRNA-Seq data on pancreas, the tissue that is directly relevant to T2D. We used CellR to analyze the putative association between the proportion of beta cells and the HbA1c level, a measure of long-term glycemia (HbA1c ≥ 6.5% in T2D; HbA1c ≤ 6% in healthy individuals with normal glucose tolerance). Only CellR successfully captured the negative correlation between the beta cell proportion and HbA1c levels (correlation coefficient = -0.41, P-value = 0.003827). We also noticed that a recently published study (23) that re-analyzed the same data came to a similar conclusion, with correlation coefficient = ∼-0.31 and P-value = 0.00126 (see Supplementary Figure S1).
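The association test described here can be sketched as a Pearson correlation between estimated beta cell proportions and HbA1c levels. The example below is only an illustration under stated assumptions: the data are invented toy values, and the P-value is obtained with a simple permutation test, which is a stand-in chosen to keep the sketch dependency-free and is not necessarily the test used in the study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation P-value for the Pearson correlation."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Toy values only: HbA1c (%) and estimated beta cell proportion per donor.
hba1c = [5.2, 5.5, 5.8, 6.0, 6.4, 6.8, 7.1, 7.5]
beta_proportion = [0.30, 0.28, 0.29, 0.25, 0.22, 0.20, 0.21, 0.15]

r = pearson(hba1c, beta_proportion)
p = permutation_p_value(hba1c, beta_proportion)
```

A negative `r` with a small `p` would correspond to the pattern reported above, in which higher HbA1c accompanies a lower estimated beta cell proportion.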

Estimating cell-specific gene expression profiles

A major application of CellR is to estimate cell type-specific gene expression profiles in distinct cellular populations within a heterogeneous bulk data. We have developed a meta-heuristic optimization-based search mechanism that enables estimating the distribution parameters for distinct cell populations and generates a transcriptomic profile of the cellular constituents of a bulk RNA-seq library (see Methods and Materials section). CellR receives bulk data, reference scRNA-seq data from the same tissue as well as estimated cellular proportions and counts (whether estimated by CellR or other methods) and generates a starting solution per gene for each cell type. Next, through several layers of search, it estimates near optimal distribution parameters for each gene and generates expression profiles for homogeneous cell populations separately. To evaluate the efficiency of the developed method, we conducted multiple experiments including simulation tests and real-world experiments on schizophrenia.

Initially, we created 50 pseudo-bulk RNA-seq samples by simulating scRNA-seq data on human cerebellum. For this, we used the cerebellum data from Lake et al. (37) and simulated 50 scRNA-seq datasets from it (Figure 4A) using Splatter (43). We turned each simulated dataset into a pseudo-bulk sample, which gave us a ground truth for the transcriptome-wide distribution of genes in separate cell-types. We then ran CellR on each sample and compared the inferred average expression of the genes with the known expression levels. First, we calculated the similarity of the estimated gene profiles between pairs of cell-types (Figure 4B) using the cosine similarity measure (see Methods and Materials section). We profiled expression levels across nine cell-types: Gran, End, Ast, Oli, Per, OPC and Mic, as well as two Purkinje cell subpopulations, Purk1 and Purk2. We observed strong similarities between the average expressions of the inferred profiles between the same cell-types (Figure 4B and Supplementary Figure S2). For example, inferred profiles of Gran cells indicate a strong similarity with the Gran cells in the ground truth while showing elevated levels of dissimilarity with the other cell-types. Notably, we were able to show how the subpopulations of Purk cells, i.e. Purk1 and Purk2, demonstrate similar patterns to each other while showing marked differences from the other cell populations. We also made a second round of comparisons on the basis of our simulations. We calculated the Pearson correlations (Figure 4C) between the inferred and ground truth expression levels between pairs of cell-types and showed that there are strong correlations between the same cell-types, suggesting the reliability of the inferred expression profiles. We acknowledge that the developed method may not be error-free given the limitations of scRNA-seq data, such as low library depth and dropout effects.
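The evaluation logic of this simulation can be sketched as follows: single-cell counts are summed into a pseudo-bulk library, the per-cell-type mean profiles serve as ground truth, and inferred profiles are compared with them by cosine similarity. This is a toy sketch only. The count matrices, the "inferred" profiles and all function names are illustrative assumptions; the actual simulation used Splatter and the actual inference used CellR, neither of which is reproduced here.

```python
import math


def pseudo_bulk(cells):
    """Sum per-gene counts over all cells to mimic a bulk library."""
    n_genes = len(cells[0])
    return [sum(cell[g] for cell in cells) for g in range(n_genes)]


def mean_profile(cells):
    """Average expression per gene across the cells of one cell type."""
    return [total / len(cells) for total in pseudo_bulk(cells)]


def cosine(u, v):
    """Cosine similarity between two expression vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy simulated cells: rows are cells, columns are four genes.
simulated = {
    "Gran": [[9, 1, 0, 2], [11, 0, 1, 2]],
    "Ast": [[1, 8, 7, 0], [0, 10, 9, 1]],
}

# Ground truth: the known mean profile of each simulated cell type.
truth = {ct: mean_profile(cells) for ct, cells in simulated.items()}

# Pseudo-bulk sample: all cells pooled regardless of cell type.
bulk = pseudo_bulk([cell for cells in simulated.values() for cell in cells])

# Hypothetical profiles, standing in for the output of a deconvolution method.
inferred = {"Gran": [9.5, 0.8, 0.6, 2.1], "Ast": [0.7, 8.6, 8.3, 0.4]}

# Similarity of every inferred profile against every ground-truth profile.
similarity = {
    (i, t): cosine(inferred[i], truth[t]) for i in inferred for t in truth
}
```

High values on the matching pairs, such as `("Gran", "Gran")`, and low values on the mismatched pairs correspond to the diagonal pattern expected in a similarity heatmap. Replacing `cosine` with a Pearson correlation gives the second comparison described in the text.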

Figure 4.

Cell-specific gene expression profiling by CellR. (A) the workflow of simulating RNA-seq libraries to test the efficiency of CellR; (B) similarity heatmap of the inferred gene expression profiles compared to the simulated data on human cerebellum; (C) correlation levels between the inferred gene expression profiles and the simulated data on human cerebellum; (D) number of DE genes within distinct cell populations in the CMC data on schizophrenia; (E) average cellular proportions in the CMC samples by CellR; (F) average of similarity values of the estimated expression of each gene in each cell-type compared with the bulk data in log10 scale; (G) enrichment degree of the TCF4 targets being disrupted in the cell-specific expression profiles estimated by CellR; (H) correlation of expression profiles of the real and inferred cell-specific expression in melanoma; (I) correlation of expression profiles of the real and inferred cell-specific expression in rheumatoid arthritis. Tcm: central memory T cells, Tem: effector memory T cells, Tn: naïve T cells.

We were interested to apply CellR on real transcriptome data on schizophrenia. We used CommonMind Consortium (CMC) study data for this analysis (44). CMC study is currently the largest repertoire of schizophrenia bulk RNA-seq data on human postmortem dorsolateral prefrontal cortex from a population of 258 schizophrenia individuals and 279 control subjects. To delineate how transcriptional patterns across distinct cellular populations differ among schizophrenia and normal individuals, we used CellR to create cell-specific expression profiles on the entire samples in the CMC data on eight cell-types including Ex, Ast, End, In, Mic, Oli, OPC and Per. We ran CellR and looked for DE genes within each cell-type (Supplementary Table S3). Overall, we observed 589 DE genes to be dysregulated in at least one cell-type while 693 genes are DE in the bulk data (Figure 4D). All of these DE genes were among the DE genes reported in the CMC study. Excitatory neurons were found to have the largest number of DE genes (∼71% of the total DE genes) while In, End, and Mic cell-types showed an almost identical number of DE genes (∼57% of the total DE genes, each). This is consistent with the observations made by Skene et al. (1) where Ast and Mic are found to be less relevant to the disease while Ex and In neurons share the highest number of susceptibility genes in schizophrenia.

In order to compare the performance of CellR in inferring cell-specific expression profiles, we ran CellR on two RNA-seq datasets on melanoma (45) and rheumatoid arthritis (46,47) and compared it with Rodeo (28). Rodeo is a novel method showing superior performance against some existing methods including cd-qprog (48), LRCDE (49), CDSeq (50) and Deblender (47). We calculated the cell-specific expression profiles on the constituent cell-types being indicated in (28) and obtained the correlation coefficient between the real and estimated expression profiles (Figure 4H and I). Our findings indicate that CellR predominantly leads to higher correlation values compared to Rodeo, suggesting its superior performance on inferring cell-specific expression profiles.

Particular cell-types are more relevant to schizophrenia

A study by Skene et al. (1) on how common genetic variants in schizophrenia can be mapped to brain cell types has demonstrated the importance of considering cell-types in studying genetic susceptibility to brain diseases. They showed that schizophrenia common variants are predominantly enriched in pyramidal cells, medium spiny neurons (MSNs) and certain interneurons (1), and concluded that schizophrenia variants map far less to progenitor, embryonic and glial cells. A clear picture of susceptibility genes and their corresponding cell-types in schizophrenia can be achieved by CellR. Therefore, we were interested in evaluating whether any of the schizophrenia DE genes can be mapped to certain cell-types. To do so, we obtained the list of 693 DE genes in schizophrenia from the CMC study (44). Then we used the scRNA-seq reference data by Lake et al. (37) and obtained the gene markers by CellR. For each cell-type, we looked for the genes shared between its CellR markers and the list of DE genes in the CMC data, in order to detect potential enrichment of DE genes in any of the extracted cell types. We found two cell types, granular cells (P-value = 5 × 10^-3, fold enrichment = 2) and Purkinje cells (P-value = 9 × 10^-3, fold enrichment = 2.4), to be enriched for schizophrenia DE genes. In addition to DE genes, we sought to evaluate whether schizophrenia common variants are enriched in any of the cell-types within the brain. We collected the genome-wide association study (GWAS) (51) hits from the CLOZUK study (52) and the Psychiatric Genomic Consortium study (PGC2) (53), which correspond to 417 protein-coding genes that are close to the risk loci (only a fraction of these 417 genes may be associated with schizophrenia, though, as GWAS only examine proxy markers of causal variants). We found the same cell-types to be enriched for schizophrenia GWAS hits, including granular cells (P-value = 0.022, fold enrichment = 2.2) and Purkinje cells (P-value = 0.012, fold enrichment = 1.8). The rest of the cell types did not pass the significance threshold. Enrichment of schizophrenia risk factors in certain neuronal cells is in line with the findings of Skene et al. (1), where schizophrenia risk loci were mapped only to neuronal cells. To provide further evidence, we used the CIBERSORT and CIBERSORTx signature creation modules and characterized the list of markers they use for deconvolution. Since both methods share the exact same approach for marker genes, we obtained the same set of marker genes. Similar to the analysis mentioned above, we computed the enrichment of schizophrenia DE genes in the cell-types annotated by Lake et al. (37). We found that granular cells (P-value = 3.4 × 10^-4, fold enrichment = 1.9) and Purkinje cells (P-value = 4 × 10^-3, fold enrichment = 2) share the highest enrichment scores, similar to our observation using CellR. Moreover, we repeated the same analysis on GWAS hits and found relatively close significance scores on granular cells (P-value = 0.01) and Purkinje cells (P-value = 0.008). These observations suggest the accuracy of CellR in extracting marker genes from reference scRNA-seq data and demonstrate how genetic signals in schizophrenia originate from neuronal cells.
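An overlap enrichment of this kind is commonly summarized by a fold enrichment (observed overlap divided by the overlap expected by chance) and a one-sided hypergeometric P-value. The sketch below shows that calculation under stated assumptions: only the 693 DE genes come from the text, while the background size, the marker count and the overlap are hypothetical numbers, and the hypergeometric test is an assumed choice that may differ from the exact test used in the study.

```python
from math import comb


def enrichment(n_background, n_markers, n_disease, n_overlap):
    """Fold enrichment and one-sided hypergeometric P-value for an overlap.

    n_background: number of genes considered overall
    n_markers:    marker genes of one cell type
    n_disease:    disease gene list (e.g. DE genes or GWAS genes)
    n_overlap:    genes present in both lists
    """
    expected = n_markers * n_disease / n_background
    fold = n_overlap / expected
    upper = min(n_markers, n_disease)
    total = comb(n_background, n_disease)
    # P(overlap >= n_overlap) when the disease list is drawn at random.
    tail = sum(
        comb(n_markers, k) * comb(n_background - n_markers, n_disease - k)
        for k in range(n_overlap, upper + 1)
    )
    return fold, tail / total


# 693 DE genes as in the text; the other three numbers are hypothetical.
fold, p_value = enrichment(
    n_background=20000, n_markers=400, n_disease=693, n_overlap=28
)
```

With these toy inputs the expected overlap is 13.86 genes, so an observed overlap of 28 gives a fold enrichment of about 2. Exact integer arithmetic through `math.comb` avoids the floating-point underflow that can occur with very small tail probabilities.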

A critical application of CellR is to numerically estimate the proportions of cell-types in bulk samples without conducting costly scRNA-seq experiments. As a proof of concept, using the scRNA-seq reference data on the frontal cortex by Lake et al. (37), we obtained the cellular proportions of the samples in the CMC dataset. Average proportions across the entire cohort are represented in Figure 4E. We clearly see that neuronal cells, including excitatory (Ex) and inhibitory (In) neurons, account for ∼70% of the cellular proportions within each sample. Therefore, we expect that transcriptional signals in these samples predominantly originate from these cell-types. This important observation motivated us to follow how network gene complexes being targeted by schizophrenia transcriptional master regulators (MRs) are expressed in distinct cell-types. In a recent study (54), we identified TCF4 as a schizophrenia MR through re-analyzing the CMC bulk RNA-seq data and experimentally showed how disrupting the expression of this gene can control a large basket of target genes in human induced pluripotent stem cell (hiPSC)-derived neurons. For this, we re-generated cell-specific gene expression levels for each individual in the CMC data (see Methods and Materials section). To do this, we used CellR to estimate the prior distributions of cell-specific gene expression levels. These cell types include excitatory neurons (Ex), inhibitory neurons (In), astrocytes (Ast), oligodendrocyte progenitor cells (OPC), pericytes (Per), endothelial cells (End), microglia (Mic) and oligodendrocytes (Oli). Then for each distinct cell-type, we generated cell-specific expression levels across all individuals in the CMC data, which led to creating eight cell-specific gene expression datasets. Next, for each of these datasets, we created the regulatory networks using the same tools used in our study (54) and obtained the targets of TCF4. Finally, we looked for the overlapping targets of TCF4 generated from the bulk sample versus cell-specific expression data. Only for TCF4 targets in the Ex data did we observe a significant overlap (P-value = 4.6 × 10^-38, fold enrichment ratio = 119, Figure 4G). No significant overlap between the TCF4 targets in the original bulk data versus other cell-specific expression data was observed. We sought to analyze the similarities between cell-specific estimated gene expression profiles of TCF4 targets and the bulk expression levels. Upon obtaining cell-specific profiles, we calculated the similarities between the estimated expression of each gene in distinct cell-types compared to its expression in the bulk data and averaged the similarity values of all TCF4 targets in the various cell-types (Figure 4F, see Methods and Materials section). Average similarity of TCF4 targets in Ex is almost 10-fold higher than in other cell-types, signifying that the transcriptional signals captured in the bulk RNA-seq data predominantly originate from excitatory neurons. In addition, for each estimated cell-specific expression profile, we obtained DE genes between schizophrenia cases and normal controls and compared them with the list of DE genes in the bulk CMC data. We observed significant overlap between the DE genes from Ex-specific expression profiles and the bulk data (P-value = 2.3 × 10^-38, fold enrichment ratio = 24), while no significant overlap was observed for the rest of the cell-types. All these findings validate the accuracy of CellR in estimating the cellular proportions of bulk RNA-seq data and illustrate the importance of taking into account the cellular heterogeneity of bulk RNA-seq data to boost the signals and reduce biological noise.

Differentially expressed genes are highly enriched in granular and Purkinje cells in Alzheimer’s and Huntington’s diseases

We sought to evaluate whether the DE genes in the bulk data can be traced back to the cell-specific molecular signatures, with the hypothesis that DE genes in specific cell types may be the major contributor to the overall DE genes identified from bulk RNA-Seq data. To do this, we obtained the list of DE genes between HD samples and negative controls. About 5480 genes have been reported by Labadorf et al. (41) to be DE. We intersected the list of DE genes with the marker genes identified by CellR, using scRNA-seq data by Lake et al. (37), and found 316 genes shared by the two groups (Fisher's exact test (FET) P-value = 0.007, Figure 5A). Next, we annotated the shared genes to their corresponding cell-types in the reference scRNA-seq data. About 50% of these genes were annotated to Purk and Gran cells, which are classified as neuronal cells, whereas the rest of the genes were annotated to five other cell-types. Notably, Purk cells accounted for ∼33% of the shared set, i.e. the genes common to the CellR markers and the DE genes reported by Labadorf et al. (41). These cells have been reported to be compromised in aggressive mouse models of HD and their dysfunction is shown to be correlated with HD’s pathology (55). Our observations indicate that a large fraction of DE genes in bulk tissue samples are in fact markers of specific cell-types. In other words, the statistical signals being picked up in bulk transcriptomic analysis originate from only a fraction of the cellular constituents of the samples, further highlighting the specificity of cell-types in distinct diseases.
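The annotation step described above amounts to intersecting a DE gene list with per-cell-type marker sets and reporting the share of the overlap that falls in each cell type. The sketch below illustrates this with made-up gene identifiers and marker sets. The function name is hypothetical, and the fractions are only meaningful under the assumption that each marker belongs to a single cell type.

```python
from collections import Counter


def annotate_shared_genes(de_genes, markers_by_cell_type):
    """Count DE genes that are also markers, grouped by cell type.

    Assumes each marker gene is assigned to exactly one cell type, so the
    per-cell-type counts add up to the size of the shared gene set.
    """
    de = set(de_genes)
    counts = Counter()
    for cell_type, markers in markers_by_cell_type.items():
        counts[cell_type] = len(de & set(markers))
    total = sum(counts.values())
    fractions = {
        ct: (n / total if total else 0.0) for ct, n in counts.items()
    }
    return counts, fractions


# Made-up gene identifiers and marker sets, for illustration only.
markers = {
    "Purk": ["G1", "G2", "G3", "G4"],
    "Gran": ["G5", "G6", "G7"],
    "Ast": ["G8", "G9"],
}
de_genes = ["G1", "G2", "G5", "G8", "G10", "G11"]

counts, fractions = annotate_shared_genes(de_genes, markers)
```

Here four of the six toy DE genes are markers, and half of those four belong to the hypothetical Purk set, so `fractions["Purk"]` is 0.5.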

Figure 5.

Cell-specific enrichment of dysregulated genes in Huntington’s disease and Alzheimer’s disease. (A) Number of shared dysregulated genes in Huntington’s disease and cell-specific gene markers in human brain; (B) Number of shared dysregulated genes in Alzheimer’s disease and cell-specific gene markers in human brain.

Next, we performed a similar analysis on AD, where we obtained DE genes (40) for three different pairs: AD-control, PA-control and PSP-control. No DE genes were observed between PA and normal control samples. We observed 707 marker genes to be DE in the AD-control pair and 17 marker genes to be DE in the PSP-control pair (Figure 5B). We observed ∼54% of the DE genes to be enriched in Gran and Purk cell-types (FET P-value = 6.26 × 10^-32). These observations suggest a similar conclusion: much of the signal captured from bulk samples is attributable to a limited number of disease-relevant cell-types. Although single-cell sequencing is an effective means to investigate this issue and identify the disease-relevant cell types, it is not cost-effective to scale to a very large number of samples; in comparison, CellR circumvents this problem and allows the use of bulk RNA-Seq data to investigate cell type-specific contributions.

DISCUSSION

Heterogeneous cell populations in many genetically driven diseases contribute differently to disease onset and progression. Such differences cannot be captured by bulk RNA-seq. However, computational deconvolution of bulk mixtures can reveal the proportion of constituent cell-types within the samples. We introduced CellR, a data-driven approach that eliminates the need for prior biological knowledge of representative gene markers of cell-types (though a prior clustering of the scRNA-Seq data is required, where each cluster represents a separate cell type), while correcting for potential rare and common genetic variations in the populations that may introduce confounding expression artifacts. As a proof of concept, we made exploratory tests on multiple complex diseases including schizophrenia, Alzheimer’s disease, Huntington’s disease and type 2 diabetes. We showed how CellR can be effectively employed to yield biological insights into the cellular mechanisms of complex diseases.

Compared to other computational approaches, we demonstrated several unique aspects of CellR in the study: First, CellR outputs more stable proportion values for different samples in the same study. Second, the improved accuracy and lower variation in the identified cell-proportions from CellR demonstrated that we can infer novel biological insights from bulk RNA-Seq samples, as demonstrated in several disease-relevant data sets in our study. Third, with the exception of MuSiC, which considers person-to-person gene expression variations at the single-cell reference level not at bulk resolution, existing methods ignore the variations of gene expressions that differ across individuals, which is a critical factor in elucidating true cell proportions in bulk transcriptomic data. In comparison, the CellR tool makes use of existing knowledge in GTEx knowledge portal to account for cross-individual genetic variations leading to fluctuations in gene expression. We believe that assigning the same weights to the gene signatures during the deconvolution process is the main reason leading to higher variations in the enumerated cell proportions by CIBERSORT and CIBERSORTx. This mainly stems from the gene signatures used in these methods, where classification accuracy (based on support vector machine) is the priority in deconvolution while population heterogeneity is not considered. In addition, using the identified cellular proportions on schizophrenia bulk RNA-seq data, we adjusted the gene expression values for distinct cell-types and showed how a significant portion of biological signals in bulk transcriptional signals originate from only excitatory neurons, signifying the importance of taking into account the heterogeneity of data when conducting transcriptome studies. Moreover, CellR is designed to estimate near optimal cell-specific gene expression profiles from RNA-seq libraries. 
Through rigorous numerical experiments, we showed how CellR can specify transcriptional dysregulations within distinct cell populations, which conventional RNA-seq technologies are not able to distinguish. An important factor in deconvolution is to consider batch effects within the bulk and single cell data. While we acknowledge the importance of batch effects and their potential influence on deconvolution outcomes, our results indicate that such effects do not lead to radical negative implications. This is mainly because CellR uses reference single cell data solely for characterizing the markers of each cell-type and does not try to correlate the bulk data and the reference data. Even when batch effects exist, they do not change the representative markers of cell-types, so CellR is unlikely to be severely affected by them.

We recognize that there are several areas of future improvement that can be incorporated into CellR. First, although our method does not require prior information on specific gene markers, it is possible that well-validated and well-characterized prior information can improve performance. Therefore, we will explore different weighting schemes that allow CellR to take into account the contribution of user-defined gene markers in data analysis. Indeed, some software tools have already compiled such lists of gene markers for specific cell types, and we may be able to directly use these as prior knowledge to improve CellR’s performance. Second, as noted by Kong et al. (56), the addition of cell-type proportions as covariates can affect the number of DE genes in bulk data; we envision taking such latent knowledge into account to further reveal, in silico, the role of cell-specific signals that contribute to disease progression. Moreover, we note that CellR assumes that the bulk RNA-seq samples and the reference single cell data are generated from the same tissue, and any inconsistency between the two may lead to incorrect outcomes.

In conclusion, we developed CellR, a novel computational method to deconvolve bulk-tissue RNA-Seq data, infer the cellular compositions and estimate cell type-specific gene expression profiles. Through analysis on simulated data sets and several real data sets on various diseases, our observations corroborate how transcriptional signatures of complex diseases such as schizophrenia, Alzheimer’s disease, Huntington’s disease and type 2 diabetes are enriched in specific cell-types identified by CellR. Comparative analysis demonstrated better performance of CellR against competing approaches that rely on a few known cell-specific gene markers. We acknowledge that CellR, given its clustering-based nature, can be influenced by the accuracy of the clustering analysis and therefore is not guaranteed to yield a perfect partitioning, particularly in highly complex datasets. We expect that CellR can be used to re-analyze many previously published bulk RNA-Seq data sets and infer more refined biological insights into the cell type-specific contribution of gene expression to disease phenotypes.

DATA AVAILABILITY

The bulk RNA-seq data on HD were generated by Labadorf et al. (41) and are available in the Gene Expression Omnibus (GEO) under accession number GSE64810. The scRNA-seq data on cerebellum were generated by Lake et al. (37) and were downloaded from GEO under accession number GSE97942. The scRNA-seq data on pancreas by Segerstolpe et al. (38) were downloaded from ArrayExpress (EBI, https://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-5061. The RNA-seq data on AD and other neurological disorders were downloaded from the AMP-AD knowledge portal under Synapse ID syn3163039. Study data were provided by the following sources: The Mayo Clinic Alzheimer’s Disease Genetic Studies, led by Dr. Nilufer Taner and Dr. Steven G. Younkin, Mayo Clinic, Jacksonville, FL, using samples from the Mayo Clinic Study of Aging, the Mayo Clinic Alzheimer’s Disease Research Center, and the Mayo Clinic Brain Bank. Data collection was supported through funding by NIA grants P50 AG016574, R01 AG032990, U01 AG046139, R01 AG018023, U01 AG006576, U01 AG006786, R01 AG025711, R01 AG017216, R01 AG003949, NINDS grant R01 NS080820, CurePSP Foundation, and support from Mayo Foundation. Study data include samples collected through the Sun Health Research Institute Brain and Body Donation Program of Sun City, Arizona. The Brain and Body Donation Program is supported by the National Institute of Neurological Disorders and Stroke (U24 NS072026 National Brain and Tissue Resource for Parkinson’s Disease and Related Disorders), the National Institute on Aging (P30 AG19610 Arizona Alzheimer’s Disease Core Center), the Arizona Department of Health Services (contract 211002, Arizona Alzheimer’s Research Center), the Arizona Biomedical Research Commission (contracts 4001, 0011, 05–901 and 1001 to the Arizona Parkinson’s Disease Consortium) and the Michael J. Fox Foundation for Parkinson’s Research.

A detailed description of the method along with a step-by-step execution procedure on an example data set is provided in https://github.com/adoostparast/CellR.

ACKNOWLEDGEMENTS

The authors would like to thank Drs. Kun Zhang and Blue B. Lake of the University of California, San Diego for generously sharing the RNA-seq data on human brain. We would also like to thank two anonymous reviewers for their insightful comments and suggestions for additional computational experiments.

Contributor Information

Abolfazl Doostparast Torshizi, Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA; Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.

Jubao Duan, Center for Psychiatric Genetics, NorthShore University HealthSystem, Evanston, IL 60201, USA; Department of Psychiatry and Behavioral Neurosciences, The University of Chicago, Chicago, IL 60015, USA.

Kai Wang, Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA; Department of Pathology and Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NARGAB Online.

FUNDING

NIH [MH108728]; CHOP Research Institute (to K.W.); Alavi-Dabiri Postdoctoral Fellowship Award (to A.D.T.).

Conflict of interest statement. None declared.

REFERENCES

  • 1. Skene N.G., Bryois J., Bakken T.E., Breen G., Crowley J.J., Gaspar H.A., Giusti-Rodriguez P., Hodge R.D., Miller J.A., Munoz-Manchado A.B. et al. Genetic identification of brain cell types underlying schizophrenia. Nat. Genet. 2018; 50:825–833.
  • 2. Lu P., Nakorchevskiy A., Marcotte E.M. Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations. Proc. Natl. Acad. Sci. USA. 2003; 100:10370–10375.
  • 3. Newman A.M., Liu C.L., Green M.R., Gentles A.J., Feng W., Xu Y., Hoang C.D., Diehn M., Alizadeh A.A. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods. 2015; 12:453–457.
  • 4. Abbas A.R., Wolslegel K., Seshasayee D., Modrusan Z., Clark H.F. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS One. 2009; 4:e6098.
  • 5. Mohammadi S., Zuckerman N., Goldsmith A., Grama A. A critical survey of deconvolution methods for separating cell types in complex tissues. Proc. IEEE. 2017; 105:340–366.
  • 6. Shen-Orr S.S., Tibshirani R., Khatri P., Bodian D.L., Staedtler F., Perry N.M., Hastie T., Sarwal M.M., Davis M.M., Butte A.J. Cell type-specific gene expression differences in complex tissues. Nat. Methods. 2010; 7:287–289.
  • 7. Marusyk A., Polyak K. Tumor heterogeneity: causes and consequences. Biochim. Biophys. Acta. 2010; 1805:105–117.
  • 8. Velmeshev D., Schirmer L., Jung D., Haeussler M., Perez Y., Mayer S., Bhaduri A., Goyal N., Rowitch D.H., Kriegstein A.R. Single-cell genomics identifies cell type-specific molecular changes in autism. Science. 2019; 364:685–689.
  • 9. Doostparast Torshizi A., Ionita-Laza I., Wang K. Cell type-specific annotation and fine mapping of variants associated with brain disorders. Front. Genet. 2020; 11:575928.
  • 10. Doostparast Torshizi A., Duan J., Wang K. Cell-type-specific proteogenomic signal diffusion for integrating multi-omics data predicts novel schizophrenia risk genes. Patterns. 2020; 1:100091.
  • 11. Liang Q., Dharmat R., Owen L., Shakoor A., Li Y., Kim S., Vitale A., Kim I., Morgan D., Liang S. et al. Single-nuclei RNA-seq on human retinal tissue provides improved transcriptome profiling. Nat. Commun. 2019; 10:5743.
  • 12. Lake B.B., Chen S., Hoshi M., Plongthongkum N., Salamon D., Knoten A., Vijayan A., Venkatesh R., Kim E.H., Gao D. et al. A single-nucleus RNA-sequencing pipeline to decipher the molecular anatomy and pathophysiology of human kidneys. Nat. Commun. 2019; 10:2832.
  • 13. Gaujoux R., Seoighe C. CellMix: a comprehensive toolbox for gene expression deconvolution. Bioinformatics. 2013; 29:2211–2212.
  • 14. Zhong Y., Wan Y.W., Pang K., Chow L.M., Liu Z. Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC Bioinform. 2013; 14:89.
  • 15. Gaujoux R., Seoighe C. Semi-supervised nonnegative matrix factorization for gene expression deconvolution: a case study. Infect. Genet. Evol. 2012; 12:913–921.
  • 16. Yadav V.K., De S. An assessment of computational methods for estimating purity and clonality using genomic data derived from heterogeneous tumor tissue samples. Brief. Bioinform. 2015; 16:232–241.
  • 17. Rao M.S., Van Vleet T.R., Ciurlionis R., Buck W.R., Mittelstadt S.W., Blomme E.A.G., Liguori M.J. Comparison of RNA-Seq and microarray gene expression platforms for the toxicogenomic evaluation of liver from short-term rat toxicity studies. Front. Genet. 2018; 9:636.
  • 18. Rai M.F., Tycksen E.D., Sandell L.J., Brophy R.H. Advantages of RNA-seq compared to RNA microarrays for transcriptome profiling of anterior cruciate ligament tears. J. Orthop. Res. 2018; 36:484–497.
  • 19. Liebner D.A., Huang K., Parvin J.D. MMAD: microarray microdissection with analysis of differences is a computational tool for deconvoluting cell type-specific contributions from tissue samples. Bioinformatics. 2014; 30:682–689.
  • 20. Avila Cobos F., Alquicira-Hernandez J., Powell J.E., Mestdagh P., De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat. Commun. 2020; 11:5650.
  • 21. Newman A.M., Steen C.B., Liu C.L., Gentles A.J., Chaudhuri A.A., Scherer F., Khodadoust M.S., Esfahani M.S., Luca B.A., Steiner D. et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 2019; 37:773–782.
  • 22. Monaco G., Lee B., Xu W., Mustafah S., Hwang Y.Y., Carre C., Burdin N., Visan L., Ceccarelli M., Poidinger M. et al. RNA-Seq signatures normalized by mRNA abundance allow absolute deconvolution of human immune cell types. Cell Rep. 2019; 26:1627–1640.
  • 23. Wang X., Park J., Susztak K., Zhang N.R., Li M. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun. 2019; 10:380.
  • 24. Repsilber D., Kern S., Telaar A., Walzl G., Black G.F., Selbig J., Parida S.K., Kaufmann S.H., Jacobsen M. Biomarker discovery in heterogeneous tissue samples - taking the in-silico deconfounding approach. BMC Bioinform. 2010; 11:27.
  • 25. Zeng W., Chen X., Duren Z., Wang Y., Jiang R., Wong W.H. DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data. Nat. Commun. 2019; 10:4613.
  • 26. Baron M., Veres A., Wolock S.L., Faust A.L., Gaujoux R., Vetere A., Ryu J.H., Wagner B.K., Shen-Orr S.S., Klein A.M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 2016; 3:346–360.
  • 27. Sokolowski D.J., Faykoo-Martinez M., Erdman L., Hou H., Chan C., Zhu H., Holmes M.M., Goldenberg A., Wilson M.D. Single-cell mapper (scMappR): using scRNA-seq to infer the cell-type specificities of differentially expressed genes. NAR Genom. Bioinform. 2021; 3:lqab011.
  • 28. Jaakkola M.K., Elo L.L. Computational deconvolution to estimate cell type-specific gene expression from bulk data. NAR Genom. Bioinform. 2021; 3:lqaa110.
  • 29. Ardlie K.G., DeLuca D.S., Segre A.V., Sullivan T.J., Young T.R., Gelfand E.T., Trowbridge C.A., Maller J.B., Tukiainen T., Lek M. et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015; 348:648–660.
  • 30. Friedman J., Hastie T., Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010; 33:1–22.
  • 31. Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–140.
  • 32. Butler A., Hoffman P., Smibert P., Papalexi E., Satija R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 2018; 36:411–420.
  • 33. Doostparast Torshizi A., Fazel Zarandi M.H. Alpha-plane based automatic general type-2 fuzzy clustering based on simulated annealing meta-heuristic algorithm for analyzing gene expression data. Comput. Biol. Med. 2015; 64:347–359.
  • 34. Yang R.L. Convergence of the simulated annealing algorithm for continuous global optimization. J. Optim. Theory Appl. 2000; 104:691–716.
  • 35. Nguyen H.V., Bai L. Proceedings of the 10th Asian Conference on Computer Vision - Volume Part II. 2011; Queenstown: Springer-Verlag; 709–720.
  • 36. Macosko E.Z., Basu A., Satija R., Nemesh J., Shekhar K., Goldman M., Tirosh I., Bialas A.R., Kamitaki N., Martersteck E.M. et al. Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets. Cell. 2015; 161:1202–1214.
  • 37. Lake B.B., Chen S., Sos B.C., Fan J., Kaeser G.E., Yung Y.C., Duong T.E., Gao D., Chun J., Kharchenko P.V. et al. Integrative single-cell analysis of transcriptional and epigenetic states in the human adult brain. Nat. Biotechnol. 2018; 36:70–80.
  • 38. Segerstolpe A., Palasantza A., Eliasson P., Andersson E.M., Andreasson A.C., Sun X., Picelli S., Sabirsh A., Clausen M., Bjursell M.K. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 2016; 24:593–607.
  • 39. Lake B.B., Ai R., Kaeser G.E., Salathia N.S., Yung Y.C., Liu R., Wildberg A., Gao D., Fung H.L., Chen S. et al. Neuronal subtypes and diversity revealed by single-nucleus RNA sequencing of the human brain. Science. 2016; 352:1586–1590.
  • 40. Allen M., Carrasquillo M.M., Funk C., Heavner B.D., Zou F., Younkin C.S., Burgess J.D., Chai H.S., Crook J., Eddy J.A. et al. Human whole genome genotype and transcriptome data for Alzheimer's and other neurodegenerative diseases. Sci. Data. 2016; 3:160089.
  • 41. Labadorf A., Hoss A.G., Lagomarsino V., Latourelle J.C., Hadzi T.C., Bregu J., MacDonald M.E., Gusella J.F., Chen J.F., Akbarian S. et al. RNA sequence analysis of human Huntington disease brain reveals an extensive increase in inflammatory and developmental gene expression. PLoS One. 2015; 10:e0143563.
  • 42. Fadista J., Vikman P., Laakso E.O., Mollet I.G., Esguerra J.L., Taneera J., Storm P., Osmark P., Ladenvall C., Prasad R.B. et al. Global genomic and transcriptomic analysis of human pancreatic islets reveals novel genes influencing glucose metabolism. Proc. Natl. Acad. Sci. USA. 2014; 111:13924–13929.
  • 43. Zappia L., Phipson B., Oshlack A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 2017; 18:174.
  • 44. Fromer M., Roussos P., Sieberts S.K., Johnson J.S., Kavanagh D.H., Perumal T.M., Ruderfer D.M., Oh E.C., Topol A., Shah H.R. et al. Gene expression elucidates functional impact of polygenic risk for schizophrenia. Nat. Neurosci. 2016; 19:1442–1453.
  • 45. Linsley P.S., Speake C., Whalen E., Chaussabel D. Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. PLoS One. 2014; 9:e109760.
  • 46. Takeshita M., Suzuki K., Kondo Y., Morita R., Okuzono Y., Koga K., Kassai Y., Gamo K., Takiguchi M., Kurisu R. et al. Multi-dimensional analysis identified rheumatoid arthritis-driving pathway in human T cell. Ann. Rheum. Dis. 2019; 78:1346–1356.
  • 47. Dimitrakopoulou K., Wik E., Akslen L.A., Jonassen I. Deblender: a semi-/unsupervised multi-operational computational method for complete deconvolution of expression data from heterogeneous samples. BMC Bioinform. 2018; 19:408.
  • 48. Gong T., Hartmann N., Kohane I.S., Brinkmann V., Staedtler F., Letzkus M., Bongiovanni S., Szustakowski J.D. Optimal deconvolution of transcriptional profiling data using quadratic programming with application to complex clinical blood samples. PLoS One. 2011; 6:e27156.
  • 49. Glass E.R., Dozmorov M.G. Improving sensitivity of linear regression-based cell type-specific differential expression deconvolution with per-gene vs. global significance threshold. BMC Bioinform. 2016; 17:334.
  • 50. Kang K., Meng Q., Shats I., Umbach D.M., Li M., Li Y., Li X., Li L. CDSeq: a novel complete deconvolution method for dissecting heterogeneous samples using gene expression data. PLoS Comput. Biol. 2019; 15:e1007510.
  • 51. Levinson D.F., Shi J., Wang K., Oh S., Riley B., Pulver A.E., Wildenauer D.B., Laurent C., Mowry B.J., Gejman P.V. et al. Genome-wide association study of multiplex schizophrenia pedigrees. Am. J. Psychiatr. 2012; 169:963–973.
  • 52. Pardinas A.F., Holmans P., Pocklington A.J., Escott-Price V., Ripke S., Carrera N., Legge S.E., Bishop S., Cameron D., Hamshere M.L. et al. Common schizophrenia alleles are enriched in mutation-intolerant genes and in regions under strong background selection. Nat. Genet. 2018; 50:381–389.
  • 53. Schizophrenia Working Group of the Psychiatric Genomics Consortium. Biological insights from 108 schizophrenia-associated genetic loci. Nature. 2014; 511:421–427.
  • 54. Doostparast Torshizi A., Armoskus C., Zhang H., Forrest M.P., Zhang S., Souaiaia T., Evgrafov O.V., Knowles J.A., Duan J., Wang K. Deconvolution of transcriptional networks identifies TCF4 as a master regulator in schizophrenia. Sci. Adv. 2019; 5:eaau4139.
  • 55. Dougherty S.E., Reeves J.L., Lesort M., Detloff P.J., Cowell R.M. Purkinje cell dysfunction and loss in a knock-in mouse model of Huntington disease. Exp. Neurol. 2013; 240:96–102.
  • 56. Kong Y., Rastogi D., Seoighe C., Greally J.M., Suzuki M. Insights from deconvolution of cell subtype proportions enhance the interpretation of functional genomic data. PLoS One. 2019; 14:e0215987.

Data Availability Statement

The bulk RNA-seq data on HD were generated by Labadorf et al. (41) and are available in the Gene Expression Omnibus (GEO) under accession number GSE64810. The scRNA-seq data on cerebellum were generated by Lake et al. (37) and were downloaded from GEO under accession number GSE97942. The scRNA-seq data on pancreas by Segerstolpe et al. (38) were downloaded from ArrayExpress (EBI, https://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-5061. The RNA-seq data on AD and other neurological disorders were downloaded from the AMP-AD Knowledge Portal under Synapse ID syn3163039.

Study data were provided by the following sources: The Mayo Clinic Alzheimer’s Disease Genetic Studies, led by Dr. Nilufer Taner and Dr. Steven G. Younkin, Mayo Clinic, Jacksonville, FL using samples from the Mayo Clinic Study of Aging, the Mayo Clinic Alzheimer’s Disease Research Center, and the Mayo Clinic Brain Bank. Data collection was supported through funding by NIA grants P50 AG016574, R01 AG032990, U01 AG046139, R01 AG018023, U01 AG006576, U01 AG006786, R01 AG025711, R01 AG017216, R01 AG003949, NINDS grant R01 NS080820, CurePSP Foundation, and support from Mayo Foundation. Study data include samples collected through the Sun Health Research Institute Brain and Body Donation Program of Sun City, Arizona. The Brain and Body Donation Program is supported by the National Institute of Neurological Disorders and Stroke (U24 NS072026 National Brain and Tissue Resource for Parkinson’s Disease and Related Disorders), the National Institute on Aging (P30 AG19610 Arizona Alzheimer’s Disease Core Center), the Arizona Department of Health Services (contract 211002, Arizona Alzheimer’s Research Center), the Arizona Biomedical Research Commission (contracts 4001, 0011, 05–901 and 1001 to the Arizona Parkinson’s Disease Consortium) and the Michael J. Fox Foundation for Parkinson’s Research.

A detailed description of the method, along with a step-by-step execution procedure on an example data set, is provided at https://github.com/adoostparast/CellR.


Articles from NAR Genomics and Bioinformatics are provided here courtesy of Oxford University Press
