Abstract
Motivation
The inference of cellular compositions from bulk and spatial transcriptomics data increasingly complements data analyses. Multiple computational approaches were suggested and recently, machine learning techniques were developed to systematically improve estimates. Such approaches allow to infer additional, less abundant cell types. However, they rely on training data which do not capture the full biological diversity encountered in transcriptomics analyses; data can contain cellular contributions not seen in the training data and as such, analyses can be biased or blurred. Thus, computational approaches have to deal with unknown, hidden contributions. Moreover, most methods are based on cellular archetypes which serve as a reference; e.g. a generic T-cell profile is used to infer the proportion of T-cells. It is well known that cells adapt their molecular phenotype to the environment and that pre-specified cell archetypes can distort the inference of cellular compositions.
Results
We propose Adaptive Digital Tissue Deconvolution (ADTD) to estimate cellular proportions of pre-selected cell types together with possibly unknown and hidden background contributions. Moreover, ADTD adapts prototypic reference profiles to the molecular environment of the cells, which further resolves cell-type specific gene regulation from bulk transcriptomics data. We verify this in simulation studies and demonstrate that ADTD improves existing approaches in estimating cellular compositions. In an application to bulk transcriptomics data from breast cancer patients, we demonstrate that ADTD provides insights into cell-type specific molecular differences between breast cancer subtypes.
Availability and implementation
A python implementation of ADTD and a tutorial are available at Gitlab and zenodo (doi:10.5281/zenodo.7548362).
1 Introduction
Bulk transcriptomics profiles are a complex linear combination of diverse molecular contributions of multiple individual cells. Many analyses, such as the screening for differentially expressed genes, can be confounded by the underlying cellular compositions. Consequently, if compositions are unknown, we cannot resolve the source of differential gene expression. Multiple approaches were suggested to estimate cellular proportions from bulk transcriptomics data, ranging from traditional, regression-based approaches to machine-learning techniques, including generalized regression (Altboum et al. 2014, Chen et al. 2018, Du et al. 2019) as well as deep learning approaches (Menden et al. 2020, Lin et al. 2022). In recent years, the advent of spatial transcriptomics has generated an additional need for reliable cell-type deconvolution (Dong and Yuan 2021, Ma and Zhou 2022).
Single-cell RNA sequencing data can substantially improve estimates of cellular proportions via improved reference matrices as well as collections thereof, and via optimized gene selection or weighting, as demonstrated by work of us and others (Tsoucas et al. 2019, Wang et al. 2019, Görtler et al. 2020, Jew et al. 2020, Dong et al. 2021). The latter can be facilitated via artificial mixtures of known cellular composition generated from single-cell data. For instance, MuSiC (Wang et al. 2019) and Digital Tissue Deconvolution (DTD) (Görtler et al. 2020) learn a gene weighting, and Bisque (Jew et al. 2020) learns gene-specific bulk expression transformations.
A recent comprehensive benchmarking of different methods for cell-type deconvolution resolved the marker gene selection as a major determinant of model performance and it was pointed out that if cell types are missing in the reference profiles results become substantially worse (Avila Cobos et al. 2020). The latter statement is in line with the observation that MuSiC, Non-Negative Least Squares (NNLS), and CIBERSORT (Chen et al. 2018) did not produce accurate results in a validation study using artificial bulk data of six pancreatic cell types if one cell type was removed from the reference single-cell expression dataset (Wang et al. 2019). Consequently, all cell types which are contained in the bulk mixtures should ideally also be included in the reference matrix (Avila Cobos et al. 2020). This, however, represents a major issue in practical applications, since bulk profiles likely contain signals of cell types not originally considered for model development.
Another, related issue is the origin of reference profiles. It is well known that the cell’s molecular phenotype depends on its environment. This also affects the results of digital tissue dissection, as shown by (Racle et al. 2017, Schelker et al. 2017), where it was pointed out that the references should be selected from niches similar to the investigated bulks.
Although many different approaches were meanwhile suggested for cell-type deconvolution, they rarely account for hidden cell type contributions and cellular environmental effects. Here, we systematically address both issues within a single approach named Adaptive Digital Tissue Deconvolution (ADTD). First, ADTD builds on our previous work on DTD and uses gene weights for optimized deconvolution. Second, it augments the deconvolution by estimates of a background profile together with corresponding cellular proportions across a set of investigated bulks. Third, ADTD adapts its reference profiles to the investigated bulks. Intriguingly, the latter also resolves cell-type specific regulation from bulk transcriptomics data. We verified ADTD in simulations based on single-cell RNA sequencing data, illustrating that ADTD out-competes state-of-the-art baselines. This particularly holds true if models are transferred to another domain, as shown for the transfer from healthy breast tissue to breast cancer. We finally illustrate how ADTD complements routine data analysis in an application to bulk transcriptomics data from The Cancer Genome Atlas (TCGA; https://www.cancer.gov/tcg) revealing patterns of cell-type specific gene regulation in breast cancer.
2 Methods
Adaptive Digital Tissue Deconvolution (ADTD) builds on loss-function learning for DTD introduced previously by Görtler et al. (2020) and extends it by both accounting for unknown background contributions and adaptive reference profiles. We will first recapitulate DTD and will then introduce ADTD.
2.1 Loss-function learning for digital tissue deconvolution
Bulk gene expression profiles can be modeled as a linear combination of cell-type-specific reference profiles,
(1) |
where is a matrix with n bulk gene expression profiles and a matrix with q cell-type specific reference profiles in their columns, both containing expression values of p genes in their rows. The columns of contain cellular weights corresponding to the columns (bulks) of Y. The values Cki can be interpreted as semi-quantitative measurements of the number of cells of type k in bulk profile i. The residuals between the observed bulks Y and the fitted bulks are given by .
One possible way to estimate C is by
(2) |
where is the Frobenius norm and the estimate of C for given X and Y. Equation (2) can be solved for each column (bulk) in Y, individually. In Görtler et al. (2020) and Schön et al. (2020), Equation (2) was augmented by gene weights , yielding
(3) |
where the weights gi can be chosen to improve estimates of C. They can be either specified by prior knowledge or can be learned from cellular mixtures with known cellular contributions, such as annotated bulk profiles or artificial mixtures generated from single-cell RNA sequencing data (Görtler et al. 2020; Schön et al. 2020). To optimize the gene weights, we defined the outer loss function
(4) |
where is the jth row (cell type) of C and its corresponding estimate. Thus, minimizing loss function Equation (4) with respect to the gis, maximizes the Pearson’s correlation between the ground truths and its estimates for all cell types of interest j, simultaneously. In contrast to the original implementation of loss-function learning (Görtler et al. 2020), we now additionally restrict the optimization problem to non-negative (physical) solutions . The corresponding algorithm uses the PyTorch machine learning library (Paszke et al. 2019).
2.2 Estimates of background contributions
One major drawback of reference-based cell-type deconvolution methods is that they do not necessarily capture all cellular contributions present in complex biospecimens. For instance, let X consist of a set of immune cell reference profiles and let Y be the bulk profile of a tumor specimen. Then, although immune cell contributions can be estimated, their values might be biased or blurred by the contributions from cancer cells to Y. In seminal work by Racle et al. (2017), this issue was addressed by a method called EPIC (Estimating the Proportion of Immune and Cancer cells), as outlined in the following. Let be an individual bulk profile corresponding to the ith column of Y. Then, we have
(5) |
where we made the contribution from a potential background explicit by adding the term . Here, represents the “hidden” background profile and ci its corresponding scalar weight. We further assume that the bulk profiles, the reference profiles, and the background profile are normalized by
(6) |
which gives with . This implies and motivates that ci can be directly estimated as
(7) |
Thus, EPIC intrinsically assumes that C can be estimated independently from the background profile x and that all ϵi are small.
2.3 Adaptive digital tissue deconvolution
2.3.1 Background estimation
ADTD relaxes EPIC’s assumptions and directly takes into account that the inference of C also depends on the background profile x. To achieve this, we replace ci in the residuals of Equation (5) by (assuming that the ϵis are small), and minimize their squared sum with respect to C and x. The corresponding optimization problem taking into account the gene weights determined via loss-function learning and non-negativity constraints becomes
(8) |
subject to and , where , C0 is the estimate of C using an ordinary NNLS estimate weighted by , and denotes a matrix of ones of dimension a × b. The regularization parameter λ1 calibrates between two solutions. For , the naive solution is retrieved, where the hidden cell proportions are given by Equation (7). For , x and C are estimated simultaneously. With growing λ1 values, the solutions are more strongly biased toward the naive solution . Note that the naive solution can be determined for each cellular mixture individually, while the latter solution requires a sufficient amount of bulk samples, however, providing the advantage that a consensus background is learned across a set of bulk specimens. We solved the optimization problem Equation (8) by iteratively optimizing with respect to C and x using quadratic programming, until convergence is reached.
2.3.2 Adaptive reference profiles
Reference profiles can be generated using external data from purified cellular mixtures or from single-cell data. However, cells adapt their molecular phenotype to their environment and as a consequence, global and static reference profiles might be inappropriate for most applications. ADTD dynamically adapts reference profiles to the specific application to improve inferred cellular contributions and to reveal potential regulatory effects on cell populations. This is achieved by replacing the reference matrix X by , where Δ is a p × q matrix of rescaling factors which are applied component-wise to X ( is the Hadamard product). The full ADTD loss function is defined as
(9) |
which is optimized with respect to C, x, and Δ, subject to the constraints
(10) |
In Equation (9), we replaced by as derived in Supplementary Section 1.1. The regularization term controls overfitting with respect to the parameters Δjk. For , we get . Thus, the limit corresponds to ADTD without cellular adaptation. Further C0 is determined via DTD with non-negativity constraint. Thus, the DTD limit is retrieved for .
2.3.3 Implementation
The ADTD optimization problem was solved by iteratively optimizing the loss with respect to C, x, and Δ until convergence is reached. The minimization with respect to C and x was addressed by quadratic programming (see Supplementary Subsections 1.2 and 1.3). The optimization with respect to Δ was performed iteratively for the rows (Supplementary Subsection 1.4). The full algorithm is summarized below (Algorithm 1).
Algorithm 1.
1: procedure ADTD()
2: Normalize X and Y such that
3: Initialize C = C0, , and x = x0, and
4: for do
5: update C (see Supplementary Subsection 1.2)
6: update x (see Supplementary Subsection 1.3)
7: for do
8: update (see Supplementary Subsection 1.4)
9: end for
10: Terminate when desired precision is achieved
11: end for
12: end procedure
2.4 Data processing, simulation studies, and analyses
2.4.1 Artificial bulks from single-cell data of healthy breast tissue—training data
We generated artificial mixtures of known cellular compositions for model training using single-cell data from healthy breast tissue (GSE164898) (Bhat-Nakshatri et al. 2021), representing women of different race, age, parity, menstrual phase, Tyrer-Cuzick score, and Body Mass Index (BMI). We denote these data as “training data” throughout the article. Cell types were labeled using Seurat (Hao et al. 2021), following the workflow of (Wu et al. 2021) to remain consistent with the test data described below. We averaged across all respective single-cell profiles of individual patients to derive archetypic reference profiles of B-cells, endothelial cells, myeloid cells, epithelial cells, perivascular-like cells (PVL), and T-cells, yielding a reference matrix for each patient. These were subsequently averaged to yield a global reference matrix X. Cells which were not labeled as one of these six cell types were removed from the training set. We further reduced the gene space to in total 1000, as follows. We first included the top 30 genes of every Seurat main cluster. Of the main clusters, we further investigated those containing predominantly endothelial cells and mesenchymal cells. These were re-clustered to yield 18 subclusters and of each of those, we further included the top 10 genes. This yielded in total 524 unique genes (genes representing multiple cluster were included only once). We further ranked all genes according to their variance calculated with respect to X and selected the 476 most variable genes to yield in total 1000 selected genes. We generated individual mixtures (n = 5000) by randomly drawing 100 single-cell profiles each and by averaging their expression.
2.4.2 Artificial bulks from single-cell data of healthy breast tissue—validation data
For a first validation of ADTD, we generated validation mixtures from the same single-cell profiles as previously, but now also using the hold-out single-cell profiles; the mixtures now include hidden background contributions from cells not assigned to one of the six major cell types ( of all cells). To also study ADTD’s capability to reconstruct cell-type specific regulation from bulk transcriptomics data, we further artificially modified gene-expression levels in a cell-type specific way. For this, we modified the expression of gene j in mixture i by (1) drawing a modification factor ai from the set , followed by (2) adding to the original expression Yij (values Yij < 0 were set to zero), where k corresponds to the cell profile. In each simulation run, we modified 60 genes; 30 genes randomly drawn from the 20 most specific genes for each cell type (in total 120 genes, selected via ) and 30 genes randomly drawn from the 120 most unspecific genes (selected according to minimal values of ). Thus, we manipulated both genes which are particularly relevant for a given cell type and genes which are rather unspecific for cell types. We refer to the so generated data as “validation data” throughout the article.
2.4.3 Artificial bulks from single-cell data of breast cancer tissue—test data
Artificial test mixtures of known cellular compositions were generated analogously to the training data, but now using single-cell data of breast cancer specimens (Wu et al. 2021). Thus, these data specifically allow us to study the domain transfer to completely unseen data, generated within a different experiment, and capturing a substantially different tissue as a consequence of cancer cell admixtures. In total, single-cell data of 26 different tumors were available with corresponding cell-type labels. We extracted B-cells, endothelial cells, myeloid cells, epithelial cells, PVL cells, and T-cells, as described above, and additionally cancer epithelial cells. These profiles were used to generate n = 5000 artificial bulk mixtures by randomly drawing 100 single-cell profiles each. In this process, we further added cell-type specific gene regulation, as previously done for the healthy validation data.
2.4.4. TCGA breast cancer analysis
We retrieved the BRCA bulk transcriptomics data generated by the TCGA Research Network (The Cancer Genome Atlas, https://www.cancer.gov/tcga), containing, in total, 1083 samples: 190 triple-negative breast cancers (TNBC; ER–, PR–, HER2–), 82 human epidermal growth factor receptor 2-positive breast cancer (HER2+), 562 luminal A (LumA; ER+, PR+, HER2–, and low levels of Ki-67), 209 luminal B (LumB; ER+, PR, and HER2+), and 40 breast cancer free tissue samples (normal). To compensate systematic differences between the measurement technologies, we further normalized the TCGA data gene-wise to the single-cell training data. Gene-wise re-scaling factors were calculated by dividing the respective mean expression in the single cell pseudo bulks of the training set by those derived for the TCGA controls.
2.4.5 Competing methods
We compared ADTD to several competing methods. First, we compared it to two different implementations of EPIC (Racle et al. 2017), one using our own reference matrix X generated from the training data (EPIC) and the originally suggested version using the EPIC built-in reference matrix (EPIC). EPIC returns only estimates for B-cells, endothelial cells, macrophages, PVLs, and T-cells. Therefore, we associated estimated macrophage contributions of EPIC with the ground truth of myeloid lineage cells. Second, we compared it to CIBERSORTx (Newman et al. 2019), where we used our own reference matrix, since the built-in matrix did not resemble the herein investigated cell types sufficiently. Third, we compared it to Scaden, which is a deep-learning based approach to estimate cellular proportions. Thus, similar as we trained the DTD gene weighting, we learned Scaden to disentangle the training data. For the latter, healthy single-cell profiles of our training data were provided for model development.
3 Results
3.1 Model training and validation
3.1.1 Model development and competing methods
It was repeatedly shown that gene selection strongly affects the performance of bulk deconvolution methods (Racle et al. 2017, Görtler et al. 2020), and the importance of a stringent marker gene selection was thoroughly demonstrated (Avila Cobos et al. 2020). ADTD involves a two step procedure. In a first step, it uses DTD (Görtler et al. 2020) on the training data to optimize cell-type deconvolution via gene weights gi in Equation (3). Thus, DTD does not select marker genes but weighs them. Here, DTD was trained on the training data to disentangle B-cells, endothelial cells, myeloid cells, normal epithelial cells, PVLs, and T-cells. Thus, to this point, the training process did not see any cellular background and all cells in the mixtures are also represented by the reference matrix X. Supplementary Table S1 summarizes the training performance for ADTD and the competing methods, where we evaluated the models by drawing n = 1000 samples. The results show that both ADTD and Scaden perform substantially better than the competitors. However, both were optimized to fit the data while the others were not.
3.1.1.1 ADTD hyper-parameters have little effect on training performance
ADTD requires specification of the hyper-parameters λ1 and λ2. While those do not affect the DTD prior learning, they affect predictions. We evaluated a comprehensive parameter grid consisting of all combinations of with , showing that ADTD is little sensitive to the explicit choice of the regularization terms, including the solutions with C = C0 for and for , which also performed remarkably well (Supplementary Fig. S1). The only exception is when both regularization parameters are extremely small , simultaneously, or when . The ADTD results in Supplementary Table S1 correspond to and . This explicit choice was motivated using the validation data and subsequently verified on the independent breast cancer test data, as outlined in the following.
3.1.2 Validation performance
In a next step, we verified ADTD’s and the competitors’ performance on the artificial validation data. Those were generated from the same measurement batch (healthy tissue) but additionally included contributions from single-cell profiles which could not be assigned to any of the major cell clusters and as such represent a cellular background (see methods). Deconvolution algorithms have to deal with such unknown cellular contributions and among the investigated methods, only ADTD and EPIC are able to estimate background contributions, while CIBERSORTx and Scaden are not. Nevertheless, the latter two were included as additional state-of-the-art baselines for comparison.
Table 1 gives the performance of ADTD and the competitors in estimating the ground truth cellular compositions for each of the cell types included in the reference matrix for n = 1000 mixtures. We observe that ADTD translates best to these first validation data with minimal performance losses compared to the training data (mean Pearson’s correlation over all cell types; ). ADTD is followed by Scaden (), and both perform substantially better than CIBERSORTx (), EPIC (), and EPIC (), respectively.
Table 1.
ADTD | EPIC1 | EPIC2 | CIBERSORTx | Scaden | |
---|---|---|---|---|---|
B-cells | 0.744 ± 0.015 | 0.094 ± 0.034 | 0.419 ± 0.028 | 0.126 ± 0.037 | 0.551 ± 0.020 |
Endo.b | 0.908 ± 0.003 | 0.839 ± 0.008 | 0.712 ± 0.017 | 0.824 ± 0.009 | 0.830 ± 0.009 |
Myel.c | 0.929 ± 0.004 | 0.825 ± 0.006 | 0.678 ± 0.018 | 0.835 ± 0.018 | 0.801 ± 0.013 |
Epith.d | 0.906 ± 0.007 | 0.647 ± 0.010 | - | 0.760 ± 0.019 | 0.781 ± 0.019 |
PVLe | 0.773 ± 0.016 | 0.530 ± 0.026 | - | 0.569 ± 0.013 | 0.693 ± 0.020 |
T-cells | 0.925 ± 0.003 | 0.782 ± 0.011 | 0.284 ± 0.032 | 0.768 ± 0.012 | 0.759 ± 0.015 |
Mean | 0.864 ± 0.004 | 0.620 ± 0.008 | 0.523 ± 0.014 | 0.647 ± 0.010 | 0.736 ± 0.006 |
Observed Pearson’s correlations obtained by comparing estimated with true cellular proportions for artificial mixtures generated from single-cell data of healthy breast tissue (see Methods). The error bars correspond to ±1 SD obtained over 10 simulation runs. Correlations > 0.8 are highlighted in bold.
Endo., endothelial cells.
Myel., myeloid cells.
Epith., epithelial cells;
PVL, perivascular-like cells.
We further evaluated ADTD and EPIC with respect to their performance in inferring the hidden cellular background (Table 2). Respective cellular proportions were estimated with a Pearson’s correlation of for ADTD and for EPIC. Here, EPIC was not evaluated since it captures a different set of cell types via X, which also affects the reconstructed backgrounds. As such, a fair comparison would not be possible. Additionally, ADTD provides estimates of the hidden-cell profile x, which could be also reliably reconstructed ().
Table 2.
ADTD | EPIC1 | |
---|---|---|
Hidden cell type | 0.840 ± 0.010 | 0.262 ± 0.014 |
Hidden profile | 0.754 ± 0.007 | - |
Performance in terms of Pearson’s correlation for the estimated hidden proportions (hidden cell type) and the hidden profile. Correlations > 0.8 are highlighted in bold.
3.1.2.1 ADTD hyper-parameter dependency
As for the training data, we systematically evaluated ADTD’s performance for different choices of λ1 and λ2 with results shown in Supplementary Fig. S2a and b, providing the average performance for estimating cellular proportions of the included cell types (Supplementary Fig. S2a) and the hidden cells (Supplementary Fig. S2b). We confirmed a negligible dependence of the ADTD’s prediction performance with respect to λ1 and λ2. As previously on the training data, the only exception is the case when both λ1 and λ2 are very small, simultaneously.
3.1.2.2 Reverse engineering of cell-type specific gene regulation and hyper-parameter dependency
In contrast to the competing methods, ADTD adapts the reference matrix via re-scaling factors encoded by Δ. Our artificial validation mixtures were generated to contain such effects, resembling potential cell-type specific gene-regulatory contributions to gene expression. We verified that ADTD is able to resolve cell-type specific gene regulation by evaluating areas under the Receiver Operating Characteristic (AUC) curves. As previously, we evaluated ADTD for different choices of λ1 and λ2 (Supplementary Fig. S2c). In contrast to the previous prediction tasks, an appropriate choice of the hyper-parameters turns out to be much more important in this context. We observed that for large values of λ2 the AUCs approach 0.5, corresponding to random predictions. This might be expected, since for . For smaller values of λ2, ADTD increasingly shows its capability to resolve cell-type specific gene expression, yielding AUCs up to approximately 0.9. Importantly, performance is little dependent on the explicit choice of λ2, providing reasonable performance in the full range from (AUC) to (AUC) given that . Again, if both are very small, performance is compromised, indicating potential issues with over-fitting. This is also supported by Supplementary Figs S3 and S4, where performance with respect to λ2 is studied in detail for the two explicit choices and .
3.1.2.3 Hyper-parameter selection
Given the outlined performance evaluations, we decided to fix and , since for this choice validation performance is good with respect to all investigated measures. The performance for alternative choices is given in Supplementary Figs S2a–c (validation data) and S2d–f (test data, and see below).
3.1.3 Domain transfer to breast cancer
Next, we systematically studied the domain transfer to breast cancer tissue. For this purpose, we generated artificial cellular mixtures using single-cell data from breast cancer specimens (Wu et al. 2021). Thus, the test mixtures are generated from a different single-cell experiment and address a different tissue context, namely breast tissue infiltrated by cancer cells. Thus, the artificial mixtures contain a substantial proportion of cells not seen in the training data.
3.1.3.1 Performance comparison to competing methods
Table 3 summarizes the results for ADTD and the competing methods. We observe a substantially better performance of ADTD compared to the competitors for each individual cell type, with the only exception of normal epithelial cells. The latter were slightly better estimated by Scaden ( for ADTD versus for Scaden). ADTD’s average performance was , followed by Scaden (), EPIC2 (), EPIC1 (), and CIBERSORTx (). Considering estimates of hidden cellular contributions (see Table 4), ADTD performs clearly better than EPIC ( for ADTD versus for EPIC). For the outlined reasons, a comparison to other approaches was not possible here. One exemplary scatter plot contrasting ADTD predictions with the corresponding ground truth is shown in Supplementary Fig. S5. The comparatively low performance for estimating epithelial and cancer epithelial contributions might be explained by the fact that both belong to a continuum of cells, consisting of cancer cells, epithelial cells, but also a number of boundary cases which resemble molecular characteristics of both. To determine the ground truth, we used the labelling provided by (Wu et al. 2021) based on a scoring system to annotate cancerous cells (the genomic instability score).
Table 3.
ADTD | EPIC1 | EPIC2 | CIBERSORTx | Scaden | |
---|---|---|---|---|---|
B-cells | 0.824 ± 0.012 | 0.035 ± 0.022 | 0.717 ± 0.013 | 0.315 ± 0.089 | 0.657 ± 0.018 |
Endo. | 0.832 ± 0.008 | 0.631 ± 0.011 | 0.783 ± 0.010 | 0.633 ± 0.060 | 0.859 ± 0.010 |
Myel.b | 0.928 ± 0.003 | 0.867 ± 0.005 | 0.748 ± 0.012 | 0.798 ± 0.044 | 0.794 ± 0.009 |
Epith.c | 0.349 ± 0.022 | 0.281 ± 0.023 | - | 0.201 ± 0.092 | 0.378 ± 0.030 |
PVLd | 0.907 ± 0.005 | 0.573 ± 0.014 | - | 0.670 ± 0.062 | 0.838 ± 0.007 |
T-cells | 0.882 ± 0.007 | 0.732 ± 0.018 | 0.419 ± 0.035 | 0.427 ± 0.096 | 0.732 ± 0.015 |
Mean | 0.787 ± 0.006 | 0.520 ± 0.003 | 0.666 ± 0.010 | 0.508 ± 0.035 | 0.710 ± 0.005 |
Observed Pearson’s correlations obtained by comparing the estimated cellular proportions of the cell types captured in the reference matrix X with the ground truth for artificial cellular mixtures of the breast cancer test dataset, where solely the cancer epithelial cells were hidden in the mixtures and not represented as reference profiles (see also Table 4). The error bars correspond to ±1 SD obtained over 10 simulation runs and correlations > 0.8 are highlighted in bold.
Myel., myeloid cells.
Epith., epithelial cells.
PVL, perivascular-like cells.
Table 4.
ADTD | EPIC1 | |
---|---|---|
Cancer Epith. | 0.449 ± 0.031 | 0.24 ± 0.017 |
Hidden profile | 0.712 ± 0.004 | - |
Performance in terms of Pearson’s correlation for the estimated hidden proportions (cancer Epith.) and the hidden profile.
3.1.3.2 Sample-size dependency
ADTD solves an optimization problem which simultaneously estimates the hidden reference profile x and the cellular proportions. Thus, an accurate estimate of x requires accurate estimates of C and vice versa. The hidden background profile x can be considered as an average background which is observed among different mixtures. Thus, to obtain reliable estimates for x and c, one might need to consider multiple mixtures. To study the impact of sample size n on ADTD test performance, we repeated the performance evaluation for different numbers of samples with results shown in Fig. 1, where the observed Pearson’s correlation is plotted versus sample size n for the cell types contained in the reference matrix (Fig. 1a–f), and for the estimated hidden proportions (Fig. 1g). Additionally, Fig. 1h shows the correlations comparing the estimated background profile x with the ground truth. Strikingly, ADTD achieves already remarkable stable performance for sample sizes as low as n = 50.
3.1.3.3 Reconstruction of cellular regulation
We further assessed ADTD’s performance to reverse engineer cell-type specific gene regulation in our breast cancer test mixtures. Figure 2a provides respective performance evaluation in terms of AUCs for different values of λ2 for a fixed value of , containing also the finally selected model (). ADTD remains highly predictive for the whole range of λ2 values between and , with best results observed for , closely followed by . Supplementary Fig. S6 shows a similar behaviour for . We further tested how performance depends on sample size with results shown in Supplementary Fig. S7, where we observed remarkably stable and still reasonably high performance for mixtures with sample sizes of only n = 125. This was further substantiated for one alternative model () in Supplementary Fig. S8.
Finally, we tested how the recovery of cell-type specific regulation depends on the abundance of cell types. To study this aspect, we repeated the generation of the breast cancer test mixtures but now with (1) modified T-cells only, and (2) modified PVLs only. T cells are highly abundant, contributing at average approximately 39.1% of cells to the mixtures, while PVLs are rare with an average contribution of 6.0%. Respective results are summarized in Supplementary Fig. S9 for T cells and in Supplementary Fig. S10 for PVLs, suggesting that ADTD infers cellular regulation irrespective of the cellular abundance.
3.1.3.4 Hyper-parameter dependency
While we did not utilize the test data to refine the hyper-parameters, examining their impact on the results is still insightful. We, therefore, evaluated the performance of ADTD for estimating the known cellular contributions, the hidden ones, and the cellular regulation for the full grid of hyper-parameters investigated previously (Supplementary Fig. S2d–f). As before, considering estimates of cellular contributions, the results were stable and are compromised only if both λ1 and λ2 are very small. Moreover, estimates of cellular regulation were reasonable for a broad range of λ2 values. Thus, our results suggest a highly competitive performance of ADTD largely independent of the explicit choice of hyper-parameters, given that λ2 is in a reasonable range and that both are not simultaneously very small.
It is also worth emphasizing an analogy to other regularized regression approaches such as ordinary Lasso or ridge regression (Hoerl and Kennard 1970, Tibshirani 1996). Here, larger regularization parameters correspond to models which only recover the most dominant effects. As such, the user might decide whether it is important to detect the strongest regulatory effects with high confidence or the majority of regulatory effects with low confidence. Here, the former would suggest comparatively high values of λ2 and the latter comparatively low ones, corresponding to a more conservative and a more liberal choice, respectively.
3.2 ADTD breast cancer subtype analysis using bulk transcriptomics data from TCGA
To illustrate how ADTD can complement routine data analysis, we performed an exemplary analysis considering breast cancer specimens from TCGA (see Methods). We evaluated ADTD separately for triple-negative breast cancer (TNBC), human epidermal growth factor receptor 2-positive breast cancer (HER2+), luminal A (LumA), and luminal B (LumB) breast cancer to study common and complementary cell-type specific gene regulation. Our hyper-parameters where selected as and , where ni equals the number of samples of breast-cancer subtype i. The additional subtype-specific rescaling factor is necessary to make the estimated cell-type specific regulation comparable. The overall scale was motivated by the fact that values between and still yielded reasonable performance on the validation data, while being considerably more conservative (more regularized). We extracted the matrices Δ and denote them as for breast cancer subtype in the following, where the entry contains the respective re-scaling factor for gene j in cell type k. For instance, a ground truth corresponds to a gene j being up-regulated by a factor of 2 in TNBC compared to healthy breast tissue. Possible cross-platform issues between the single-cell derived reference matrix X and the TCGA bulk profiles were addressed by gene-wise rescalings (see Methods).
First, we ranked the rows (genes) of Δ from highest to lowest deviations from one defined by . Thus, we identified the genes with strongest regulation according to ADTD for each of the four cancer subtypes across all cell types with included reference profiles. We then extracted the top ten genes for each subtype and explored the overlap (Fig. 3). We observed a remarkable overlap of five genes (of a total of p = 957 genes), which were selected in all four subtypes, namely B2M, TMSB4X, FTH1, LTB, and RPS27. We next verified if these five genes are consistently regulated among the four subtypes. B2M shows its highest regulation in myeloid cells across all subtypes, with the only exception of HER2+ breast cancers, where myeloid cells were also upregulated but slightly outcompeted by T-cells ( versus ). B2M encodes beta-2-microglobulin, a component of the class I major histocompatibility complex (MHC) which is present for all nucleated cells and which serves for antigen presentation. Myeloid cells comprise professional antigen presenting cells such as dendritic cells and tissue macrophages (Bassler et al. 2019). In line with our results, beta-2-microglobulin (β2-M) protein expression was significantly higher in breast cancer compared to benign breast tumors (Li et al. 2014). Moreover, β2-M protein expression was shown to be significantly different in the four breast cancer molecular subtypes (Li et al. 2014). ADTD adds to this observation the cell-type specific regulation of B2M, also indicating quantitative differences between TNBC, HER2+ and Luminal A/B (Supplementary Fig. S11a). TMSB4X encodes the protein Thymosin beta-4 and FTH1 encodes the heavy subunit of ferritin. In our analysis, both TMSB4X and FTH1 were consistently upregulated in myeloid cells across all subtypes (Supplementary Fig. S11e and f). Thymosin beta-4 was repeatedly studied in the context of breast cancer (Zhang et al. 2017), and in line with our findings, macrophages showed intense reactivity for Thymosin beta-4 antibodies in breast cancer, while cancer cells showed a much more variable reactivity (Larsson and Holck 2007). Also FTH1 was repeatedly discussed in the context of breast cancer and it was shown that ferritin stimulates breast cancer cells through an iron-independent mechanism and is localized within tumor-associated macrophages (Alkhateeb et al. 2013). LTB encodes lymphotoxin-beta (LT-beta), also known as tumor necrosis factor C (TNF-C). LTB was consistently upregulated in B-cells (Supplementary Fig. S11d). B cells can produce lymphotoxin, which induces angiogenesis and thus promotes tumor growth (Yuen et al. 2016). Finally, RPS27 encodes ribosomal protein S27. ADTD derived the strongest regulation in normal epithelial cells for all subtypes except HER2+. In the latter, RPS27 was slightly stronger regulated in myeloid cells ( versus ), see also Supplementary Fig. S11c. To identify genes with a diverse regulation pattern derived by ADTD, we globally searched the scaling matrices for largest deviations from 1. This yielded MFGE8 as top hit, observed in TNBC only. MFGE8 shows a substantial regulation of in PVLs, indicating an almost 3-fold increase while the other cell types are basically not affected (see also Supplementary Fig. S11b). MFGE8 encodes Milk fat globule-EGF factor 8 and its reduction was shown to inhibit triple-negative breast cancer cell viability and migration (Yang et al. 2019).
4 Conclusion and discussion
Multiple methods for digital tissue dissection were suggested in recent years, based on different losses such as squared residuals, l1 or the support vector regression (Altboum et al. 2014; Chen et al. 2018). Moreover, machine learning techniques such as deep learning were proposed as an alternative and promise improved performance, although adherence to strict Standard Operating Procedures (SOPs) might be necessary to achieve such a performance gain (Menden et al. 2020; Lin et al. 2022). Here, we approached computational tissue dissection from the perspective that strict SOPs are difficult (if not impossible) to guarantee and that methods should intrinsically account for between-dataset variability. As such, we proposed Adaptive Digital Tissue Deconvolution (ADTD) as an adaptive method for digital tissue dissection, which incorporates two sources of between-dataset variability. First, ADTD accounts for possibly hidden cellular background contributions, which are not represented by archetypical reference profiles. This allowed us to achieve improved out-of-sample performance as demonstrated by artificial test mixtures generated from single-cell RNA profiles of breast cancer specimens. In fact, both the proportions of background contributions and their representation as a reference profile could be estimated with confidence even for small sample sizes of . Second, ADTD accounts for environmental effects; depending on the tissue context, cellular phenotypes might be altered (Racle et al. 2017, Schelker et al. 2017), and therefore, ADTD adapts its reference matrix to the specific application. As demonstrated in comprehensive simulation studies, this allows to resolve cell-type specific regulation from bulk transcriptomics data.
It is worth emphasizing that approaching the former two issues requires appropriate strategies to deal with overfitting. Here, a common technique from high-dimensional statistics was used, namely l2 regularization (Hoerl and Kennard 1970). Here, regularization is not the only possible choice and was made for computational reasons, since ADTD Equation (9) can be minimized by iteratively solving subproblems using quadratic programming. In principle, regularization could be an interesting alternative to enforce sparseness in the cell-regulation parameters, potentially improving the interpretability of the results. However, to further explore this option would require appropriate implementation strategies. It is also important to point out that tuning the strength of the two l2-regularization terms is achieved by adjusting two hyper-parameters λ1 and λ2. Typically, hyper-parameters are selected by lowest generalization error. This strategy, however, fails for cell-type deconvolution; we are not interested in a low generalization error for the prediction of the bulk profiles Y, but in low errors of the underlying parameters C, c, and x. Therefore, we decided to perform hyper-parameter selection via a controlled validation study where the ground truth is available. Surprisingly, results were rather insensitive to the explicit choice of λ1 and λ2, under the assumption that both are not too small simultaneously. This means that a priori parameter choices might be appropriate for most applications. Alternatively, one might select hyper-paramters as follows: (A) start out with comparatively high regularization parameters. This typically yields reliable results for estimating cellular proportions, while the cell-type specific gene regulation cannot be resolved. Then, (B) systematically lower λ2 to resolve gene regulation. While doing so, respective estimates for cellular proportions should be controlled for deviations from the previous results. In our studies, the improvements from further tuning λ1 were negligible and thus it could remain fixed.
It is noteworthy that recent deep-learning based approaches can address more complex nonlinear relationships. An example is Scaden, which, although it performed substantially better than CIBERSORTx and EPIC, could not outperform ADTD. The underlying reason could be that bulk profiles are inherently linear, as they correspond to a linear combination of individual cells’ contributions. Thus, a bias towards linearity could be beneficial in this context. Finally, cell-type deconvolution is designed to disentangle molecularly diverse cells and performance is typically compromised in scenarios where many cell types are disentangled. This is particularly the case if related cell types are considered. We demonstrated that the adapted deconvolution weights g in DTD, which are also included in ADTD, improve on this issues (Görtler et al. 2020). However, still it is insufficiently addressed given the fact that molecularly similar cells can fulfill highly diverse tasks.
In summary, ADTD represents a substantial step in the development of digital tissue dissection methods, as it is to the best of our knowledge the first method which tackles two big intrinsic model uncertainties simultaneously, namely hidden background contributions and cellular environmental factors. These two aspects might not just be relevant for the analysis of bulk transcriptomics data, but also for new data data sources, such as spatial transcriptomics data. Also there, accounting for domain transfer and hidden variables, such as cellular background contributions, might be key for reliable inference.
Supplementary Material
Contributor Information
Franziska Görtler, Computational Biology Unit, Department of Biological Sciences, University of Bergen, N-5008 Bergen, Norway; Department of Oncology and Medical Physics, Haukeland University Hospital, 5021 Bergen, Norway.
Malte Mensching-Buhr, Department of Medical Bioinformatics, University Medical Center Göttingen, 37075 Göttingen, Germany.
Ørjan Skaar, Department of Informatics, Computational Biology Unit, University of Bergen, N-5008 Bergen, Norway.
Stefan Schrod, Department of Medical Bioinformatics, University Medical Center Göttingen, 37075 Göttingen, Germany.
Thomas Sterr, Institute of Theoretical Physics, University of Regensburg, 93053 Regensburg, Germany.
Andreas Schäfer, Institute of Theoretical Physics, University of Regensburg, 93053 Regensburg, Germany.
Tim Beißbarth, Department of Medical Bioinformatics, University Medical Center Göttingen, 37075 Göttingen, Germany.
Anagha Joshi, Department of Clinical Science, Computational Biology Unit, University of Bergen, N-5008 Bergen, Norway.
Helena U Zacharias, Peter L. Reichertz Institute for Medical Informatics of TU Braunschweig and Hannover Medical School, Hannover Medical School, 30625 Hannover, Germany.
Sushma Nagaraja Grellscheid, Computational Biology Unit, Department of Biological Sciences, University of Bergen, N-5008 Bergen, Norway.
Michael Altenbuchinger, Department of Medical Bioinformatics, University Medical Center Göttingen, 37075 Göttingen, Germany.
Supplementary data
Supplementary data are available at Bioinformatics online.
Conflict of interest
None declared.
Funding
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) [AL 2355/1-1 “Digital Tissue Deconvolution—Aus Einzelzelldaten lernen”, project number 420069742]. S.N.G. and F.G. acknowledge European Union’s Horizon 2020 research and innovation programme (MESI-STRAT) [754688] and Trond Mohn Stiftelse [BFS2017TMT01]. H.U.Z. was supported by the German Federal Ministry of Education and Research (BMBF) within the framework of the e:Med research and funding concept [01ZX1912A]. T.B. was supported by the BMBF projects FAIrPaCT [01KD2208A], PerMiCCion [01KD2101C] and MATCH [01KU1910A] and DFG TRR274 [408885537] and KFO5002 [426671079].
Data availability
All data used in this article are publicly available. The Gene Expression Omnibus accession numbers to obtain the raw datasets are: scRNA-Seq of healthy breast tissues are available from Bhat-Nakshatri et al. (2021) (GSE164898) and breast cancer scRNA-Seq from Wu et al. (2021) (GSE176078). The BRCA bulk transcriptomics data were generated by the TCGA Research Network (The Cancer Genome Atlas, TCGA). A python implementation of ADTD and a tutorial are available at Gitlab and zenodo (doi:10.5281/zenodo.7548362).
References
- Alkhateeb AA, Han B, Connor JR. Ferritin stimulates breast cancer cells through an iron-independent mechanism and is localized within tumor-associated macrophages. Breast Cancer Res Treat 2013;137: 733–44. [DOI] [PubMed] [Google Scholar]
- Altboum Z, Steuerma Y, David E et al. Digital cell quantification identifies global immune cell dynamics during influenza infection. Mol Syst Biol 2014;10: 720. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Avila Cobos F, Alquicira-Hernandez J, Powel JE et al. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nature Commun 2020;11:1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bassler K, Schulte-Schrepping J, Warnat-Herresthal S et al. The myeloid cell compartment-cell by cell. Annu Rev Immunol 2019;37: 269–93. [DOI] [PubMed] [Google Scholar]
- Bhat-Nakshatri P, Gao H, Sheng L et al. A single-cell atlas of the healthy breast tissues reveals clinically relevant clusters of breast epithelial cells. Cell Rep Med 2021;2:100219. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen B, Khodadoust MS, Liu CL et al. Profiling tumor infiltrating immune cells with CIBERSORT. In: von Stechow, L. (eds) Cancer Systems Biology. Methods in Molecular Biology. Vol. 1711. New York, NY: Humana Press, 2018, 243–59. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dong M, Thennavan A, Urrutia E et al. SCDC: bulk gene expression deconvolution by multiple single-cell RNA sequencing references. Brief Bioinf 2021;22: 416–27. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dong R, Yuan G.-C. SpatialDWLS: accurate deconvolution of spatial transcriptomic data. Genome Biol 2021;22:1–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Du R, Carey V, Weiss ST.. deconvSeq: deconvolution of cell mixture distribution in sequencing data. Bioinformatics 2019;35: 5095–102. [DOI] [PubMed] [Google Scholar]
- Görtler F, Schön M, Simeth J et al. Loss-function learning for digital tissue deconvolution. J Comput Biol 2020;27: 342–55. [DOI] [PubMed] [Google Scholar]
- Hao Y, Hao S, Andersen-Nissen E et al. Integrated analysis of multimodal single-cell data. Cell 2021;184:3573–3587.e29. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hoerl AE, Kennard RW. Ridge regression: applications to nonorthogonal problems. Technometrics 1970;12:69–82. [Google Scholar]
- Jew B, Alvarez M, Rahmani E et al. Accurate estimation of cell composition in bulk expression through robust integration of single-cell information. Nat Commun 2020;11:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Larsson L-I, Holck S. Localization of thymosin beta-4 in tumors. Ann New York Acad Sci 2007;1112:317–25. [DOI] [PubMed] [Google Scholar]
- Li K, Du H, Lian X et al. Characterization of β2-microglobulin expression in different types of breast cancer. BMC Cancer 2014;14:750. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin Y, Li H, Xiao X et al. DAISM-DNNXMBD: highly accurate cell type proportion estimation with in silico data augmentation and deep neural networks. Patterns 2022;3:100440. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ma Y, Zhou X. Spatially informed cell-type deconvolution for spatial transcriptomics. Nat Biotechnol 2022;40:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Menden K, Marouf M, Oller S et al. Deep learning–based cell composition analysis from tissue expression profiles. Sci Adv 2020;6:eaba2619. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Newman AM, Steen CB, Liu CL et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat Biotechn 2019;37:773–82. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Paszke A, Gross S, Massa F et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inform Process Syst 2019;32:8024–35. [Google Scholar]
- Racle J, de Jonge K, Baumgaertner P et al. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. Elife 2017;6:e26476. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schelker M, Feau S, Du J et al. Estimation of immune cell content in tumour tissue using single-cell RNA-seq data. Nat Commun 2017;8:1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schön M, Simeth J, Heinrich P, et al. DTD: an R package for digital tissue deconvolution. J Comput Biol 2020;27:386–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc: Ser B (Methodol) 1996;58:267–88. [Google Scholar]
- Tsoucas D, Dong R, Chen H et al. Accurate estimation of cell-type composition from gene expression data. Nat Commun 2019;10:1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wang X, Park J, Susztak K et al. Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat Commun 2019;10:1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wu SZ, Al-Eryani G, Lee Roden D. et al. A single-cell and spatially resolved atlas of human breast cancers. Nat Genet 2021;53:1334–47. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yang Y, Li J, Song O et al. Reduction in milk fat globule-EGF factor 8 inhibits triple-negative breast cancer cell viability and migration. Oncol Lett 2019;17:3457–65. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yuen GJ, Demissie E, Pillai S. B lymphocytes and cancer: a love-hate relationship. Trends Cancer 2016;2:747–57. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang X, Ren D, Guo L, et al. Thymosin beta 10 is a key regulator of tumorigenesis and metastasis and a novel serum marker in breast cancer. Breast Cancer Res 2017;19:1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All data used in this article are publicly available. The Gene Expression Omnibus accession numbers to obtain the raw datasets are: scRNA-Seq of healthy breast tissues are available from Bhat-Nakshatri et al. (2021) (GSE164898) and breast cancer scRNA-Seq from Wu et al. (2021) (GSE176078). The BRCA bulk transcriptomics data were generated by the TCGA Research Network (The Cancer Genome Atlas, TCGA). A python implementation of ADTD and a tutorial are available at Gitlab and zenodo (doi:10.5281/zenodo.7548362).