Decomprolute is a benchmarking platform designed for multiomics-based tumor deconvolution

Song Feng; Anna Calinawan; Pietro Pugliese; Pei Wang; Michele Ceccarelli; Francesca Petralia; Sara JC Gosline

2024 Feb 26;4(2):100708. doi: 10.1016/j.crmeth.2024.100708

PMCID: PMC10921018 PMID: 38412834

Summary

Tumor deconvolution enables the identification of diverse cell types that comprise solid tumors. To date, however, both the algorithms developed to deconvolve tumor samples and the gold-standard datasets used to assess the algorithms are geared toward the analysis of gene expression (e.g., RNA sequencing) rather than protein levels. Despite the popularity of gene expression datasets, protein levels often provide a more accurate view of rare cell types. To facilitate the use, development, and reproducibility of multiomic deconvolution algorithms, we introduce Decomprolute, a Common Workflow Language framework that leverages containerization to compare tumor deconvolution algorithms across multiomic datasets. Decomprolute incorporates the large-scale multiomic datasets produced by the Clinical Proteomic Tumor Analysis Consortium (CPTAC), which include matched mRNA expression and proteomic data from thousands of tumors across multiple cancer types, to build a fully open-source, containerized proteogenomic tumor deconvolution benchmarking platform. Decomprolute is available at http://pnnl-compbio.github.io/decomprolute.

Keywords: cancer, deconvolution, proteomics, proteogenomics, CPTAC, CWL

Highlights

• Decomprolute enables benchmarking of proteomic deconvolution algorithms
• Framework incorporates proteogenomic tumor data from over 1,000 patient samples
• Common Workflow Language (CWL) automates multiple tests across deconvolution algorithms
• Extendable framework is designed to incorporate additional algorithm development

Motivation

Our goal is to provide a comprehensive platform for algorithm developers and researchers to benchmark and run tumor deconvolution algorithms on multiomic data. We designed Decomprolute to be a modular tool that can be used to evaluate a selection of existing deconvolution algorithms on a cancer dataset of interest and to assess the quality of any new methods that may be developed.

Feng et al. describe Decomprolute, a framework to benchmark algorithms that deconvolve bulk gene and protein expression measurements using cell-specific markers. Decomprolute uses the Common Workflow Language to automate a series of benchmarks that assess the performance of algorithms on proteomic data from over 1,000 cancer samples.

Introduction

Tumor growth and metastasis rely on the exchanges between the tumor cells and additional components constituting the tumor microenvironment.¹ Understanding the interactions between the tumor cells and surrounding non-malignant cells, including stromal, endothelial, and immune cells, is essential to model the mechanisms underlying tumor survival and spreading.² In particular, identifying the degree and nature of immune cell and other microenvironmental infiltration can assist in predicting a tumor's responsiveness to specific immunotherapeutic regimens.³^,⁴ Hence, new technologies, such as mass cytometry⁵^,⁶^,⁷^,⁸ and single-cell RNA sequencing (RNA-seq),⁹^,¹⁰^,¹¹^,¹² have also been applied to study the tumor microenvironment, together with ad hoc computational algorithms to deconvolve cell types from bulk molecular measurements.¹³^,¹⁴^,¹⁵^,¹⁶^,¹⁷

Algorithmic tumor deconvolution is based upon the knowledge that specific genes are expressed at distinct levels within specific cell types.¹⁸ Given the prior knowledge of the specific combinations of gene expression levels to expect in a specific cell type, numerous existing computational algorithms can provide estimates of the relative abundance of cell types present in the profiled tissue. One such algorithm, Microenvironment Cell Populations-counter (MCP-counter), provides tumor deconvolution predictions from bulk RNA-seq data using a gene signature matrix. This signature matrix is derived from previously published gene expression datasets, which are analyzed to provide an MCP-counter score, computed as the logarithm of the geometric mean of the marker genes for each cell type.¹⁷ CIBERSORT and CIBERSORTx employ a linear modeling approach to estimate cell-type composition from a signature matrix and bulk gene expression matrix.¹⁴^,¹⁹ EPIC (estimating the proportions of immune and cancer cells) is an algorithm that uses a similar approach to CIBERSORT but normalizes the gene expression values to account for proportions of healthy vs. malignant cells.¹⁶^,²⁰ xCell expands upon these existing numeric approaches by leveraging gene set enrichment statistics to characterize cell types.¹⁵ These methods, in addition to many others not explicitly mentioned here,²¹ showcase the great need for tumor deconvolution from bulk measurements.
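
As a minimal sketch of the marker-based scoring idea described above, the snippet below computes, for each cell type, the log2 of the geometric mean of its marker genes in a single bulk sample. The marker lists, expression values, and the function name `marker_score` are hypothetical and chosen only for illustration; this is not the MCP-counter implementation itself.

```python
import math

def marker_score(expression, markers):
    """Return log2 of the geometric mean of the marker genes found in `expression`.

    expression: dict mapping gene symbol -> linear-scale expression value (> 0)
    markers: list of marker gene symbols for one cell type
    """
    values = [expression[g] for g in markers if g in expression]
    if not values:
        return float("nan")  # no marker gene was measured for this cell type
    # log2 of the geometric mean equals the arithmetic mean of the log2 values
    return sum(math.log2(v) for v in values) / len(values)

# Hypothetical toy data: one bulk sample and two illustrative marker sets
sample = {"CD3E": 8.0, "CD8A": 2.0, "CD19": 4.0, "MS4A1": 16.0}
signature = {"T cells": ["CD3E", "CD8A"], "B cells": ["CD19", "MS4A1"]}

scores = {cell: marker_score(sample, genes) for cell, genes in signature.items()}
print(scores)
```

Because the score is an average over whichever markers are present, it remains defined when some markers are unmeasured, which is one reason marker-based scores may tolerate sparse data differently from regression-based approaches.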

While tumor deconvolution algorithms are highly effective at using gene expression data, their performance on proteomic data is unexplored. However, the increasing number of cancer-specific proteomic datasets,²²^,²³^,²⁴^,²⁵^,²⁶ together with the established fact that protein levels do not always correlate with mRNA,²⁷^,²⁸^,²⁹^,³⁰ suggests that algorithmic deconvolution could be more effective if a proteomic signature matrix is utilized. Recent work by Rieckmann et al.³¹ has created a dataset that enables the definition of immune cell types based on proteomic data. However, there is still no available gold-standard dataset to evaluate the ability of an algorithm using these proteomic-defined immune cell types to deconvolve tumor data. In contrast, for mRNA-based deconvolution, there are numerous single-cell datasets as well as sorted cell experiments that can be used for such benchmarks,³²^,³³^,³⁴ which are missing for proteomic data.

Here, we introduce Decomprolute, a containerized set of scientific workflows that enable the community to compare the performance of existing or novel deconvolution algorithms across various omics datasets. We demonstrate the utility of Decomprolute using a subset of published deconvolution algorithms on the CPTAC3 cancer datasets²⁶^,³⁵ for direct comparison with mRNA-based algorithms, simulated data, and pan-cancer immune subtypes. The framework comprises four existing deconvolution algorithms but can easily work with any new algorithm that is able to function in a Docker container. Our system is flexible enough to accept additional signature matrices, deconvolution algorithms, and datasets, both as input and for validation, as we hope that it will inspire future development in the tumor deconvolution space.

Results

Modular workflow framework enables flexible comparison of deconvolution results across signatures, cancer types, and algorithms

The goal of Decomprolute is to encourage rapid development and benchmarking of novel deconvolution algorithms and cell-type signature matrices. As such, the underlying architecture is structured around a modular framework that allows additional algorithms, datasets, or cell-type signatures to be easily added for comparison. The platform enables users to run and generate figures for experiments in a reproducible fashion, and its modularity allows it to be expanded to run additional statistical tests as needed.

The overall software architecture is described in Figure 1. The modules that comprise Decomprolute are (1) prot-data, which accesses data from published cancer proteomic resources³⁶; (2) mrna-data, which accesses matched gene expression data from the same patients; (3) signature-matrices, which returns specific signature matrices to evaluate with existing algorithms¹⁵^,¹⁶^,¹⁷^,¹⁹^,³⁷; (4) tumor-deconv-algs, which evaluates a combination of gene and/or protein expression data and signature matrix on an algorithm of interest; and (5) metrics, which compares the performance of the algorithm on a set of benchmarks we define. Each module takes a standard set of inputs and outputs and therefore can be interchanged and appended. A full list of parameters is described in Table 1. This modular design enables users to plug in their own data or algorithms (via Docker) or create a new evaluation metric by which they can compare data.

Figure 1. Overview of the Decomprolute modular architecture, describing four primary subdirectories of Decomprolute

The mRNA-data and prot-data modules both pull from the CPTAC pan-cancer resource. The signature matrix module contains genes that represent different cell types, and the tumor-deconv-algs module contains each of the algorithms we implemented, while the metrics module contains all the modules used to measure performance.

Table 1.

Overview of Decomprolute modular arguments and description

Module	Inputs	Outputs
prot-data	cancer type (e.g., LUAD, BRCA)	a single file of protein expression across patients
mRNA-data	cancer type (e.g., LUAD, BRCA)	a single file of gene expression across patients
signature-matrices	signature name; subsample	a single file of transcriptomic or proteomic profile chosen for cell types
tumor-deconv-algs	cancer type; signature name; algorithm name	a single matrix where rows are cell types, columns are samples, and each value is the estimated fraction of that cell type for that sample
metrics	specific parameters based on the type of metrics	figures and tables summarizing cross-algorithm analysis

The platform is designed to maximize flexibility and extensibility and therefore includes built-in data access scripts, signature matrices, and deconvolution algorithms. We also implemented four publicly available algorithms for deconvolution to provide examples of how these can be used and compared in practice. We include published signature matrices derived from mRNA immune cell expression profiles¹⁴^,¹⁹ and markers of stromal cells, and we generated new ones from sorted proteomic data,³¹ as described in the STAR Methods. Lastly, we developed three metrics that enable users to compare and contrast various aspects of tumor deconvolution on proteomic data: (1) performance on simulated data in the face of sparse data, (2) agreement between mRNA and protein, and (3) comparison with immune subtypes.³⁸ Each of these is demonstrated below.

Matched proteogenomic pan-cancer resource enables facile benchmarking of deconvolution across data modalities

We built the deconvolution framework leveraging the publicly available CPTAC data across 10 cancer cohorts, depicted in Figure 2, as these data comprise matched proteomic and transcriptomic measurements across ∼1,000 patient samples, making it one of the largest proteogenomic datasets of human tissue available. The data matrices procured from the CPTAC pan-cancer resource Python package³⁵ have undergone processing through the CPTAC Common Data Analysis Pipeline. This comprehensive pipeline ensures uniformity in measurement techniques and standardization of normalization such that every matrix that is used as input into the tumor deconvolution module is log2 normalized with standard HGNC gene symbols as row names and patient sample identifiers as column names. The employment of this pipeline yields files that can be directly analyzed and cross-compared, specifically in relation to proteins and genes across various cancer types. By building a Docker image for both mRNA and protein data, the Decomprolute framework enables robust comparability across cancer datasets in standardized file formats. This not only mitigates the potential for batch effects but also curbs other experimental inconsistencies that could arise during the benchmarking of different algorithms. It also allows the user to evaluate performance across different tumor types. We focused on the samples for which there were matched proteomic and transcriptomic data (red bars, Figure 2) for our comparison of proteomic- and transcriptomic-based deconvolution but leveraged all data to evaluate missingness in the data (see below).

Figure 2. Summary of CPTAC3-combined dataset leveraged by Decomprolute

The x axis represents cancer type measured, and the y axis represents number of samples that have transcriptomics, proteomics, or both.

Expression of marker genes varies across omic datasets and signature matrices

Deconvolution frameworks comprise two distinct components: the algorithms that perform the deconvolution and the signature matrices that are used to identify specific cell-type populations. Therefore, in order to build a framework that compares algorithms, we also needed to identify specific cell signatures that can be leveraged and compared across mRNA and protein-derived datasets. To this end, we collected five individual matrices: (1) the LM22 signature matrix, identifying 22 distinct cell types used in the first CIBERSORT publication¹⁴; (2) a peripheral blood mononuclear cell (PBMC)-derived signature from single-cell RNA-seq (as single-cell data are often used to deconvolve bulk data), called PBMC in our framework; (3) a signature of 9 cell types from a sorted proteomics experiment,³¹ called LM9; (4) a compressed version of the LM9 signature, called LM7c, that combines granulocytes into a single category; and (5) the protein reference list from the Matrisome,³⁷ a resource of extracellular matrix proteins that can be used to identify non-immune cell types.

To set a baseline measurement for tumor deconvolution algorithms, we first assessed what fraction of the genes in each signature matrix was expressed in our benchmark proteogenomic cancer dataset. Here, we measured, for each predicted cell type in each signature matrix, what fraction of the marker genes was missing in each transcriptomic and proteomic dataset. The average fractions across each signature matrix are depicted in Figure 3. As would be expected, the individual transcriptomic datasets were generally missing fewer than 25% of marker genes, and this value was consistent across cell types. Proteomic datasets, however, exhibited a much higher and more variable fraction of missing genes, ranging from a median around 40% to median values as high as ∼75%. Interestingly, the distribution of missing values did not vary across immune cell markers of mRNA (LM22, PBMC; Figures 3A and 3B) and protein (LM9, LM7c; Figures 3C and 3D). However, the Matrisome markers (Figure 3E) were much more highly expressed in the proteomic data than other signatures.

Figure 3. Fraction of cell-type signatures that have missing data across the proteomic and transcriptomic datasets

Numbers represent per-sample averages for both transcriptomics and proteomics datasets. Boxplots represent the range between first and third quartiles, and whiskers extend beyond the box to the largest value no further than 1.5 times the inter-quartile range.
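
The missingness baseline described above can be sketched as follows. This is a simplified illustration rather than the Decomprolute code: it assumes a marker counts as missing when its gene symbol is absent from a dataset's row names, and the signature and gene lists are made-up toy values. The function name `missing_fraction` is hypothetical.

```python
def missing_fraction(signature, measured_genes):
    """For each cell type, return the fraction of marker genes absent from a dataset."""
    measured = set(measured_genes)
    result = {}
    for cell_type, markers in signature.items():
        marker_set = set(markers)
        missing = marker_set - measured
        result[cell_type] = len(missing) / len(marker_set) if marker_set else float("nan")
    return result

# Hypothetical signature: cell type -> marker gene symbols
signature = {
    "B cells": ["CD19", "MS4A1", "CD79A", "CD79B"],
    "Monocytes": ["CD14", "LYZ"],
}

# Hypothetical row names of a transcriptomic and a proteomic matrix
rna_genes = ["CD19", "MS4A1", "CD79A", "CD79B", "CD14", "LYZ"]
protein_genes = ["CD19", "CD14", "LYZ"]

rna_missing = missing_fraction(signature, rna_genes)
prot_missing = missing_fraction(signature, protein_genes)
print(rna_missing)
print(prot_missing)
```

A per-sample variant would apply the same calculation to the set of genes with non-missing values in each sample column and then average across samples.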

Sampling approach assesses algorithm performance with reduced coverage

Given the large amount of missingness in the signature matrices, as well as the general differences in coverage of mRNA and protein data, one of the first tests for tumor deconvolution algorithms and signature matrices on proteomic data is how well they perform with missing data. Therefore, we generated two simulated datasets, one from single-cell RNA-seq data³⁹ and one from sorted proteomic data³¹ (as described in STAR Methods), and evaluated the performance of the various tumor deconvolution algorithms in the presence of missing gene markers.

The sampling workflow implemented in Decomprolute enables selecting a fraction of the genes of a signature matrix and evaluating how well the cell types predicted from the signature matrix correlate with those in the simulated datasets. We compared RNA-based signatures (LM22, PBMC) to data simulated from RNA-seq data and protein-based signatures (LM9, LM7c) to data simulated from the proteomic data. While the simulation framework can be customized, we initially subsampled each signature matrix to 10%, 20%, 40%, 60%, 80%, and 100% of its genes five times and evaluated over 5 or 10 simulated datasets. The correlation values of each subsampled prediction and the simulated data are depicted in Figure 4.

Figure 4. Behavior of deconvolution algorithms across sparse datasets

Correlation of proteomic-derived matrices with data simulated from the (A) LM9 and (B) LM7c signature matrices. The x axis depicts predicted cell type, and the y axis is Spearman rank correlation across 10 simulated datasets between algorithm prediction (color indicated on right) and known fractions of cells.
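
The subsampling step can be illustrated with a short sketch. It assumes that subsampling means drawing a random subset of signature genes (rows) without replacement, uses the fractions and five replicates mentioned in the text, and relies on a toy signature matrix; the function name `subsample_signature` and the seeding scheme are hypothetical.

```python
import random

def subsample_signature(signature, fraction, seed):
    """Keep a random `fraction` of the genes (rows) in a signature matrix.

    signature: dict mapping gene -> {cell type: reference expression}
    """
    genes = sorted(signature)
    n_keep = max(1, round(fraction * len(genes)))
    rng = random.Random(seed)  # seeded so each replicate is reproducible
    kept = rng.sample(genes, n_keep)
    return {g: signature[g] for g in kept}

fractions = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
n_replicates = 5

# Hypothetical 20-gene signature with two cell types
toy_signature = {
    f"GENE{i}": {"B cells": float(i), "T cells": float(20 - i)} for i in range(20)
}

# One subsampled matrix per (fraction, replicate) pair
subsampled = {
    (frac, rep): subsample_signature(toy_signature, frac, seed=rep)
    for frac in fractions
    for rep in range(n_replicates)
}
sizes = {frac: len(subsampled[(frac, 0)]) for frac in fractions}
print(sizes)
```

Each subsampled matrix would then be passed to a deconvolution algorithm, and the predicted cell-type fractions compared against the known fractions of the simulated mixtures.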

The results of the subsampling experiment show that most combinations of algorithm and signature matrix fail to correlate with simulated gold-standard values when missing up to 80%–90% of the marker genes. For example, LM22 can identify numerous cell types with high accuracy using CIBERSORT, but this performance drops off when more than 50% of the marker genes are not expressed (top left panel, Figure 4), and this performance is worse for CD8⁺ T cells. The PBMC matrix seems to be most robust to missing data, particularly when used with the xCell algorithm. In summary, this experiment shows that, when running deconvolution algorithms on reduced coverage data, such as proteomics, it is important to factor in the signature matrix and how robust it is to missingness.

Assessing algorithmic agreement between protein- and mRNA-based deconvolution

As we described earlier, mRNA-based tumor deconvolution algorithms¹⁴^,¹⁵^,¹⁶^,¹⁷ have demonstrated success when compared to gold standards. These datasets contain known mixtures of individual cell types or paired single-cell measurements, alongside bulk RNA-seq data. Hence, it is possible to compare different algorithms to see which method deconvolves better having a ground truth as reference. However, because no such dataset exists for tumors measured via bulk proteomics, we use the bulk mRNA measurements matched to bulk proteomic measurements across 10 different cancer types from the CPTAC 3 pan-cancer efforts to identify which algorithm-signature matrix combination gives the best results on proteomic data when compared to mRNA-based predictions.

Like our other metrics, this test is also implemented as a single workflow that produces numerous tables and figures for further analysis. Figure 5 depicts a subset of the results of this analysis, measuring the average concordance between mRNA- and protein-based deconvolution using the Spearman rank correlation (see STAR Methods). Across the ten cancer types and four signature matrices, the xCell algorithm showed the highest amount of agreement for its predictions on mRNA and protein data, using both transcriptomic- and proteomic-derived signature matrices, as depicted by high average correlation in Figure 5. Furthermore, the xCell results were also highly correlated with MCP-counter, supporting the findings made by each. While the EPIC algorithm had generally poorer performance when measuring correlation between matched mRNA and proteomic datasets, it exhibited high degrees of correlation when using the Matrisome signature matrix.

Figure 5. Correlation of deconvolution algorithms between mRNA and protein datasets

Average Spearman rank correlation between deconvolution results run on mRNA data (x axis) and proteomic data (y axis). Individual values are divided across signature matrices and cancer types.

There are two possible explanations for the differences between these four algorithms. One is that CIBERSORT and EPIC do not rely on gene signatures but on gene expression values, which can be less flexible across mRNA and protein datasets (given that protein levels and mRNA levels are not highly correlated). The second explanation is that CIBERSORT and EPIC yield absolute cell fractions relative to all cells in a given sample, which can make it difficult to compare between samples. In contrast, the outputs of MCP-counter and xCell are typically scores that are comparable between samples. We addressed the second issue by normalizing all deconvoluted results by the sum of fraction/score values so that they range between 0 and 1 and then using the Spearman rank correlation score to compare values. As such, the scoring metrics should be robust to the type of output produced by the deconvolution algorithms.
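
The normalization and rank-correlation step can be sketched in plain Python. This is an illustration under assumptions, not the workflow's own code: it assumes non-negative outputs, normalizes within one sample, and correlates across cell types; the input values and helper names are hypothetical.

```python
def normalize_by_sum(scores):
    """Rescale non-negative scores for one sample so they sum to 1."""
    total = sum(scores)
    return [s / total for s in scores] if total else list(scores)

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical outputs for one sample across four cell types
mrna_fractions = normalize_by_sum([0.10, 0.40, 0.30, 0.20])  # fraction-like output
protein_scores = normalize_by_sum([2.0, 9.0, 5.0, 4.0])      # score-like output

rho = spearman(mrna_fractions, protein_scores)
print(round(rho, 3))
```

Because the Spearman correlation depends only on ranks, it is unaffected by the differing scales of fraction-like and score-like outputs, which is the property the comparison relies on.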

Cell-type-specific variation in signature matrix performance

Due to biases in the algorithms and signature matrices, we also provided the ability to compare mRNA and protein deconvolution results across algorithms for specific cell types. Specifically, we measured, for each cell type, algorithm, and signature matrix, the correlation between mRNA and protein across patient samples. The results are depicted in Figure 6. Here, we see that the selection of signature matrix can determine how accurate proteomic-based cell-type predictions are. As we see in Figure 6A, correlation across mRNA and proteomic datasets, even for the same algorithm, is generally lower for CD4⁺ T cells, CD8⁺ T cells, natural killer (NK) cells, and NK T cells, suggesting that this signature matrix, derived from single-cell RNA-seq, might not be accurate for proteomic data regardless of algorithm. Using the LM7c matrix in Figure 6B, however, shows improved correlation for CD8⁺ T cells and CD4⁺ T cells for most algorithms. These visualizations could be helpful when selecting the deconvolution approach for a single cell type.

Figure 6. Correlation of deconvolution between mRNA and protein data by cell type

Correlation per disease type across predicted cell types across tumors by algorithm for the (A) PBMC signature matrix and (B) LM7c signature matrix. Columns represent algorithms run on mRNA data, and rows represent algorithms run on protein data. The correlations for each cancer type and cell type are shown as bars colored by cancer type.

xCell captures immune subtypes in proteomic-derived cell-type composition

For a third benchmarking metric, we again leveraged CPTAC 3 proteomic datasets. Here, we utilized the classification of each tumor sample into one of the six immune subtypes³⁸ predicted using the mRNA data published as part of this resource.⁴⁰ Specifically, we compared the deconvolution results of the four algorithms on tumor samples of all 10 tumor types with respect to the immune subtype classifications. The values from each algorithm and for each cell type were transformed to Z scores. We used the PBMC signature on RNA-seq data and the LM7c signature on proteomic data, with the median as the summary statistic, since we are interested in how many samples have a Z score above or below zero. We focused on immune subtypes with more than 100 samples assigned (Figure 7).
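
The Z-score transformation and per-subtype median can be sketched as below. The sample values and subtype labels are invented for illustration, the helper names are hypothetical, and the use of the population standard deviation is an assumption rather than a detail stated in the text.

```python
from statistics import mean, median, pstdev

def z_scores(values):
    """Center and scale a list of values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def median_by_subtype(values, subtypes):
    """Return the median Z score of one cell type within each immune subtype."""
    z = z_scores(values)
    grouped = {}
    for score, subtype in zip(z, subtypes):
        grouped.setdefault(subtype, []).append(score)
    return {subtype: median(scores) for subtype, scores in grouped.items()}

# Hypothetical CD8+ T cell estimates for six samples and their subtype labels
cd8_estimates = [0.30, 0.25, 0.35, 0.05, 0.10, 0.15]
subtype_labels = ["C2", "C2", "C2", "C4", "C4", "C4"]

medians = median_by_subtype(cd8_estimates, subtype_labels)
print(medians)
```

A median above zero indicates that more than half of the samples in that subtype score above the cohort average for the cell type, which is the quantity the comparison is interested in.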

Figure 7. Predicted cell types across tumors by immune subtype and algorithm

(A) Depicts distribution of cell types (rows) as predicted by various algorithms. Color of density plots describes the algorithm used to score the subtypes.

(B) Depicts the same distribution of values, but the columns represent algorithms, and the color of the density plots represents the subtypes. Dashed bars represent a Z score of 0, while the colored bars represent the median of each histogram.

We then demonstrate how we use the cross-section of immune signatures with cell type to evaluate how accurately the algorithms can predict immune activity. For example, the C1 (wound healing) immune subtype, characterized by a high proliferation rate, shows no enrichment for any cell type for both types of data and for all algorithms (Figure 7A). The C2 (interferon [IFN]-γ dominant) subtype, defined by an abundance of CD8⁺ T cells and M1 macrophages, has the highest number of samples assigned (505). Interestingly, xCell predicts, for the samples assigned to this cluster, an enrichment for most immune cells regardless of the type of data used, while the other algorithms show an opposite result (Figure 7B). CIBERSORT found enrichment of CD8⁺ T cells in the C2 samples. These samples serve as a good benchmark of immune activity because they can be seen as “immune-hot”: the IFN-γ response that characterizes this immune subtype causes the activation of both innate and adaptive immune systems. The C3 subtype, defined as inflammatory, shows enrichment for lymphocytes for both types of data with xCell, whereas the C4 subtype, defined as lymphocyte depleted, shows a minimal enrichment for both types of data for CD4⁺ T cells with CIBERSORT. Overall, xCell is more accurately able to capture the immune activation in this subtype compared to the other algorithms.

Discussion

Here, we introduced a benchmarking platform to assess the performance of tumor deconvolution algorithms on proteomic data. It is comprised of four modules, each of which can be altered to allow for additional (1) algorithms, (2) proteomic datasets, (3) signature matrices, and (4) evaluation metrics. We showcase each of the evaluation metrics to determine how well individual signature matrices perform across datasets and how well the algorithms leverage the signature matrices to assess proteomic measurements. First, we show how the algorithms can be run on simulated data using both proteomic and mRNA expression levels. We then compare mRNA-based deconvolution to proteomics to determine how well the algorithms agree. Lastly, we use mRNA-derived immune subtypes to evaluate how proteomics-based tumor deconvolution algorithms identify relative cell fractions within each subtype.

The need for such a system emerges from the absence of a protein-native tumor deconvolution gold standard that can be used to evaluate the performance of existing tumor deconvolution algorithms on proteomic data. In the absence of such a dataset, we employ these three metrics (data simulation, correlation, and immune analysis) to enable the measurement of existing algorithms. Each metric provides the ability for users to make informed decisions about the methods and gene signatures they use to analyze bulk proteomic data. The simulated data experiments enable the identification of specific signature matrices and algorithms that are still efficacious when specific genes/proteins are missing. In cases where there are matched datasets with known cell-type numbers, the correlation experiments enable a head-to-head measurement of proteomic measurements with a gold standard. Lastly, the immune subtyping experiment shows how deconvolution results can be compared to orthogonal measurements of individual samples of expected cell-type activity. We hope that such a platform will be used by the community to further develop tumor deconvolution algorithms based on proteomic data so that we can get more insights from the inference of cell phenotypes using these data. While the development of new tools might not always be necessary, comparing existing methods will be an essential first step.

The development of deconvolution algorithms leverages existing gene and protein markers to understand cell-type populations in an evolving space due to increased coverage of single-cell omics,¹¹ spatially resolved transcriptomics,⁴¹ proteomics,⁴²^,⁴³ and other high-throughput assays. As such, it is becoming increasingly important to be able to identify cells in samples with fewer gene markers. Here, we found that xCell was the most robust in these scenarios, given its flexibility across signature matrices and behavior in our simulation experiments.

As we learn more about the value of proteomic measurements in cancer studies,²²^,²³^,²⁴^,²⁶^,⁴⁴ understanding the nuances of proteomic data in tumor deconvolution is highly valuable. This framework will facilitate the analysis and potential development of proteomic and proteogenomic tumor deconvolution algorithms by providing an easy way to compare newly developed approaches to those that already exist. We believe this platform is robust to additional datasets, algorithms, and signature matrices and will be broadly used by the tumor proteomics community.

Limitations and future directions

One primary limitation of Decomprolute and its ability to accurately assess tumor deconvolution algorithms is the absence of a gold-standard experimentally derived dataset. One such example of a dataset would be a collection of experiments in which cell types were mixed at known quantities before being profiled by global proteomic measurements, along with pure measurements of the cell types, or where cell types were measured via a different technology such as a Coulter counter. Equivalent datasets exist for RNA-based tumor deconvolution to some extent.⁴⁵^,⁴⁶ While we were able to recreate this experiment using simulated data, it should be noted that these are still only estimates, preventing the ability to ask more precise questions about how well each algorithm can detect less common cell types.

Moving forward, we hope both to extend our coverage of algorithms that can be shared via open-source methods and to carry out the experimental measurements required to evaluate the true efficacy of proteomic tumor deconvolution on a gold-standard dataset.

STAR★Methods

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Deposited data
CPTAC 3 Pancancer data	Li et al.²⁶	https://doi.org/10.1016/j.ccell.2023.06.009
Software and algorithms
Decomprolute code repository	This paper	https://doi.org/10.5281/zenodo.10515695

Resource availability

Lead contact

Requests for further information should be directed to and will be fulfilled by the lead contact, Sara Gosline, PhD (sara.gosline@pnnl.gov).

Materials availability

This study analyzes publicly available data and did not generate unique reagents.

Data and code availability

• All data used for the analyses in this paper are published and freely available as part of the CPTAC data resource paper. Links are included in the key resources table.
• All source code is publicly available via GitHub at https://github.com/pnnl-compbio/decomprolute. An archival version is provided in the key resources table. In addition to the underlying software for executing and assessing the performance of the various algorithms, this repository includes signature matrix files, dummy test data, sample inputs, and the CI/CD configuration file. The CWL workflows to execute the pipeline are organized under a single directory. From this directory, users can execute deconvolution and performance comparison. If executing the code on a local machine, output is saved directly in this directory. Using a Docker image will save the output in the corresponding container directory, and the user can transfer the file to their local computer with a mounted directory or with the `docker cp` command.
• Any additional information required to reanalyze the data in this paper is available from the lead contact upon request.

Methods details

Cancer transcriptomic and proteomic data

We provide a flexible framework that enables both the mRNA and proteomics data to be handled in individual modules to make it easier to upgrade and replace these modules with updated data as additional proteomics datasets are released. Specifically, we rely on the CPTAC Python package³⁵ in an attempt to build a framework that would be flexible with respect to incoming data. Therefore, Decomprolute can be used with this package or replaced with other similar packages or data files. Figure 2 shows the sample numbers available at the time of publication.

Tumor deconvolution algorithm modules

Within the tumor-deconv-algs module, we have currently implemented four distinct algorithms from the community: CIBERSORT,¹⁴ MCP-counter,¹⁷ xCell,¹⁵ and EPIC.¹⁶ Additional algorithms can be added provided they take a tab-delimited file as input (rows are gene names, columns are sample identifiers) and produce a tab-delimited file as output.
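
To illustrate the input and output contract described above, the following sketch reads a tab-delimited expression matrix (gene names as rows, sample identifiers as columns) and writes a tab-delimited matrix of cell types by samples. The header labels, the four-decimal formatting, and the placeholder `toy_deconvolve` function are assumptions made for this example; a real module would substitute an actual algorithm and follow the argument conventions of the repository.

```python
import csv
import io

def read_expression(handle):
    """Read a tab-delimited matrix: first column gene names, header row sample IDs."""
    reader = csv.reader(handle, delimiter="\t")
    samples = next(reader)[1:]
    matrix = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, matrix

def write_fractions(handle, samples, fractions):
    """Write a tab-delimited matrix: rows are cell types, columns are samples."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["cell_type"] + samples)
    for cell_type, values in fractions.items():
        writer.writerow([cell_type] + [f"{v:.4f}" for v in values])

def toy_deconvolve(matrix, markers, n_samples):
    """Placeholder algorithm: mean marker expression per cell type, scaled to sum to 1."""
    raw = {}
    for cell_type, genes in markers.items():
        present = [matrix[g] for g in genes if g in matrix]
        if present:
            raw[cell_type] = [sum(col) / len(col) for col in zip(*present)]
        else:
            raw[cell_type] = [0.0] * n_samples
    totals = [sum(raw[c][i] for c in raw) for i in range(n_samples)]
    return {c: [v / t if t else 0.0 for v, t in zip(vals, totals)] for c, vals in raw.items()}

# Hypothetical in-memory input standing in for a file on disk
tsv_in = "gene\tS1\tS2\nCD19\t1.0\t3.0\nCD3E\t3.0\t1.0\n"
samples, matrix = read_expression(io.StringIO(tsv_in))
fractions = toy_deconvolve(matrix, {"B cells": ["CD19"], "T cells": ["CD3E"]}, len(samples))

out = io.StringIO()
write_fractions(out, samples, fractions)
print(out.getvalue())
```

Keeping file parsing separate from the algorithm itself means a new method only needs to replace the placeholder function to fit into the same workflow.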

Signature matrices

The signature-matrices module implements four signature matrices: two derived from transcriptomics and two derived from proteomics measurements.

The mRNA-derived matrices are called LM22 and PBMC. The LM22 matrix was originally published in the CIBERSORT manuscript¹⁴ and contains expression values derived from microarray data for a group of filtered genes across 22 immune cell types and subtypes for a total of 547 genes. The second published matrix PBMC (peripheral blood mononuclear cells) was derived from single-cell RNA sequencing (3′ sequencing) data in the CIBERSORTx manuscript¹⁹ and comprises 8 immune cell types represented by 1675 genes.

We also generated two additional signature matrices from a published proteomic dataset of flow cytometry-sorted PBMC.³¹ Briefly, 28 distinct human hematopoietic cell types and subtypes from peripheral blood of healthy donors were sorted by flow cytometry. Erythrocytes and platelets were excluded from subsequent analyses. Cellular proteomes were analyzed in single runs by high-resolution MS using a quadrupole Orbitrap instrument. Each cell phenotype proteome was measured from four donors. The proteomic dataset included 10,134 proteins and 104 steady state samples. For LM9 we grouped the 26 phenotypes into 9 cell types: B cells, basophils, CD4⁺ T cells, CD8⁺ T cells, dendritic cells, eosinophils, monocytes, natural killer cells (NKs), neutrophils. For LM7c, basophils, eosinophils and neutrophils were grouped together as granulocytes. We took imputed values from Table S3 of the Rieckmann et al. paper³¹ to generate the two signature matrices, with samples first scaled to have zero mean and unit variance for LM9, using CIBERSORTx¹⁹ with these parameters: kappa = 999; q-value = 0.01; number of barcode genes = 300 to 500; disable quantile normalization = TRUE; filter non-hematopoietic genes = TRUE. The resulting matrices were 814 and 377 genes, for LM7c and LM9 respectively.

Common Workflow Language deconvolution pipeline

We used the Common Workflow Language (CWL), following the syntax specified in CWL v1.2,⁴⁷ to link the individual Docker images described above. Separate CWL script files were written for each step of data downloading, analysis, and visualization. These individual script files were integrated as ordered workflow steps in a single workflow file. The workflow has been tested primarily with cwltool, the reference implementation of a CWL runner, though it can be run with other CWL execution engines. The order of workflow steps was determined by the dependencies between the output of each step (e.g., data produced, file generated) and the input of the next step. The “scatter” feature was applied to facilitate parallel execution and accelerate the evaluations in each step. Essential results and log data were saved so that intermediate output files can be retrieved or reanalyzed. The specification file for the workflow pipeline is written in YAML Ain’t Markup Language (YAML). The YAML files specify the inputs and any other parameters and/or arguments necessary for the pipeline.
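The idea of ordering steps by their input and output dependencies can be sketched in a few lines of Python. This is only an illustration of dependency-driven ordering, not how a CWL runner is implemented, and the step names below are hypothetical rather than the names used in the Decomprolute workflow files.

```python
from graphlib import TopologicalSorter

# Hypothetical step names. Each step maps to the set of steps whose
# outputs it consumes as inputs.
dependencies = {
    "download_data": set(),
    "get_signature_matrix": set(),
    "run_deconvolution": {"download_data", "get_signature_matrix"},
    "compare_results": {"run_deconvolution"},
    "plot_figures": {"compare_results"},
}

# A valid execution order places every step after the steps it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Steps with no dependency on each other (here, the two input-preparation steps) may appear in either order, which is also what makes them candidates for parallel execution.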

Docker image building

Each CWL file leverages a local Docker runtime to execute the underlying algorithm scripts. All individual steps are built into separate Docker images, which makes the pipeline reproducible and resolves the complexity of package management or issues arising from differing operating systems. The Docker images required to run the pipeline are included in the public image repository Docker Hub, at https://hub.docker.com/u/tumordeconv. When executed, each CWL step performs a "pull action" and automatically downloads or updates the specified image it requires to complete its task. Docker images were built automatically with each code commit pushed to the GitHub repository, using continuous integration and continuous deployment (CI/CD) practices, to avoid conflicts that can arise with manually built or outdated images. Each commit push triggered a series of end-to-end tests on the CI/CD platform CircleCI, where the entire workflow is executed on a virtual machine. If the tests were successful, indicating that pipeline integrity was maintained with the code change, any associated Docker images were rebuilt and published to the repository.

Data simulation

Pseudo-bulk data were simulated in a similar fashion to Petralia et al. (2022).⁴⁸ Our simulation framework relied on two published datasets. First, we considered the proteomic profiling from Rieckmann et al.³¹ This study includes proteomic profiles of 26 immune cell subtypes, which were collapsed to $K = 9$ different cell types: neutrophils, eosinophils, basophils, B cells, CD4 T cells, CD8 T cells, monocytes, natural killer (NK) cells, and dendritic cells. For each cell type $k$, 4 different proteomic profiles were provided, i.e., $μ_{1, k}$, $μ_{2, k}$, $μ_{3, k}$, $μ_{4, k}$. For each sample $i$, the weights of the different immune cells (i.e., $π_{i, 1}, π_{i, 2}, \ldots, π_{i, K}$) were randomly sampled from a Dirichlet distribution with parameter 0.5. Then, for each sample, the mixed proteomic profile was derived as the weighted average of the proteomic profiles of the different cell types as follows:

y_{i} = π_{i, 1} β_{i, 1} + π_{i, 2} β_{i, 2} + \ldots + π_{i, K} β_{i, K}

with $β_{i, k}$ being one of the proteomic profiles available for the $k$-th cell type, randomly sampled from $μ_{1, k}$, $μ_{2, k}$, $μ_{3, k}$, $μ_{4, k}$. Next, we considered data from Linsley et al.,³⁹ which contain transcriptomic profiles of 6 immune cell types: B cells, CD4 T cells, CD8 T cells, monocytes, neutrophils, and natural killer cells. For each cell type, these data contained 20 different transcriptomic profiles. Mixed transcriptomic data were generated in the same way as the mixed proteomic data.
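The mixing procedure described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the function name `simulate_pseudobulk`, the toy reference profiles, and the use of normalized Gamma draws to obtain Dirichlet weights are our choices.

```python
import random


def simulate_pseudobulk(profiles, n_samples, alpha=0.5, seed=0):
    """Simulate mixed profiles as weighted averages of cell type profiles.

    profiles: dict mapping cell type -> list of reference profiles,
              each profile being a list of feature values.
    Returns (mixtures, weights), one entry per simulated sample.
    """
    rng = random.Random(seed)
    cell_types = sorted(profiles)
    n_features = len(profiles[cell_types[0]][0])
    mixtures, weights = [], []
    for _ in range(n_samples):
        # Dirichlet(alpha, ..., alpha) draw via normalized Gamma variates.
        gammas = [rng.gammavariate(alpha, 1.0) for _ in cell_types]
        total = sum(gammas)
        pi = [g / total for g in gammas]
        # Pick one available reference profile per cell type at random.
        beta = [rng.choice(profiles[ct]) for ct in cell_types]
        # Weighted average of the chosen profiles, feature by feature.
        y = [
            sum(pi[k] * beta[k][f] for k in range(len(cell_types)))
            for f in range(n_features)
        ]
        mixtures.append(y)
        weights.append(dict(zip(cell_types, pi)))
    return mixtures, weights


# Toy reference with two cell types and two features (made-up values).
toy_profiles = {
    "B cells": [[1.0, 0.0], [1.0, 0.0]],
    "T cells": [[0.0, 1.0], [0.0, 1.0]],
}
mixtures, weights = simulate_pseudobulk(toy_profiles, n_samples=5, seed=1)
```

The returned weights serve as the known proportions against which deconvolution estimates on the simulated mixtures can be compared.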

Quantification and statistical analysis

Algorithm metrics

We use two types of metrics for comparing the deconvoluted results to either simulated data or mRNA data from the same sample: a correlation-based metric and a distance-based metric. The deconvoluted results form a matrix in which columns are the samples and rows are the cell types, with each entry giving the cell type proportion calculated by the deconvolution algorithm. To compare any two deconvoluted matrices, we can calculate either the correlation or the distance between the corresponding vectors of cell type proportions. Given any two matrices $A$ and $B$, we can get the cell type proportions $a_{* j} = \{a_{1 j}, \ldots, a_{i j}, \ldots, a_{N j}\}$ and $b_{* j} = \{b_{1 j}, \ldots, b_{i j}, \ldots, b_{N j}\}$ for patient $j$, where $N$ is the number of cell types, and the distributions across all patients $a_{i *} = \{a_{i 1}, \ldots, a_{i j}, \ldots, a_{i M}\}$ and $b_{i *} = \{b_{i 1}, \ldots, b_{i j}, \ldots, b_{i M}\}$ for cell type $i$, where $M$ is the number of patients. We can then calculate the correlations and distances as follows.

Correlation based comparison

In this comparison, the deconvoluted results are compared by calculating the Pearson or Spearman correlation for each sample or for each cell type. The average correlation was calculated by averaging the correlation values across patients. The Pearson correlation for cell type proportions is calculated as:

r_{a b} = \frac{\sum_{i = 1}^{N} (a_{i} - \bar{a}) (b_{i} - \bar{b})}{\sqrt{\sum_{i = 1}^{N} {(a_{i} - \bar{a})}^{2}} \sqrt{\sum_{i = 1}^{N} {(b_{i} - \bar{b})}^{2}}}

and the Spearman correlation for cell type proportions is calculated as:

r_{R S} = \frac{\sum_{i = 1}^{N} (R_{i} - \bar{R}) (S_{i} - \bar{S})}{\sqrt{\sum_{i = 1}^{N} {(R_{i} - \bar{R})}^{2}} \sqrt{\sum_{i = 1}^{N} {(S_{i} - \bar{S})}^{2}}}

where $R_{i}$ and $S_{i}$ are the ranks of $a_{i}$ and $b_{i}$, and a bar denotes the mean over $i$. For correlations between distributions across patients, we replace $N$ with $M$ in the equations above.
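The two correlations can be written out in plain Python as a minimal sketch. The function names are ours, and assigning tied values the average of their ranks is an assumption, since the text does not state how ties are handled.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length vectors.

    Raises ZeroDivisionError if either vector is constant.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))
```

Applying either function to matching columns of two deconvolution result matrices gives the per-patient correlations, and applying it to matching rows gives the per-cell-type correlations.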

Distance based comparison

In this comparison, we provide three different distance metrics: Euclidean distance, Jensen-Shannon distance, and Kolmogorov-Smirnov distance. For the distance metrics, we only calculate the distances between cell type proportions for each patient. An average distance was calculated by averaging the distance values across patients. The Euclidean distance is calculated as:

d_{E u} (A, B) = \frac{1}{M} \sum_{j = 1}^{M} \sqrt{\sum_{i = 1}^{N} {(a_{i j} - b_{i j})}^{2}}

The Jensen-Shannon distance is calculated with:

d_{J S} (A, B) = \frac{1}{M} \sum_{j = 1}^{M} \sqrt{\frac{D_{K L} (a_{* j} ‖ b_{* j}) + D_{K L} (b_{* j} ‖ a_{* j})}{2}}

where $D_{K L} (a_{* j} ‖ b_{* j})$ and $D_{K L} (b_{* j} ‖ a_{* j})$ are the Kullback-Leibler (KL) divergences calculated by:

D_{K L} (a_{* j} ‖ b_{* j}) = \sum_{i = 1}^{N} P (a_{i j}) \log \frac{P (a_{i j})}{P (b_{i j})}

and $P (a_{i j})$ is the proportion of cell type $i$ in patient sample $j$ in deconvoluted matrix $A$, and similarly for $P (b_{i j})$ in deconvoluted matrix $B$. For the Kolmogorov-Smirnov (KS) distance, we used the following equation:

d_{K S} (A, B) = \frac{1}{M} \sum_{j = 1}^{M} \sup | F (a_{* j}) - F (b_{* j}) |

where $F (a_{* j})$ and $F (b_{* j})$ are the cumulative distribution functions of $a_{* j}$ and $b_{* j}$.
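The three distances can be sketched in plain Python as below. Several points are assumptions on our part rather than details from the Decomprolute code: the function names, the small pseudocount `eps` used to avoid taking the logarithm of zero, and the reading of $F$ as the empirical cumulative distribution function of the values in each proportion vector. The second function follows the formula as written above, which averages the two KL terms directly, whereas the textbook Jensen-Shannon divergence compares each distribution with their average.

```python
import bisect
import math


def euclidean(a, b):
    """Euclidean distance between two cell type proportion vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _smooth(p, eps):
    # Assumption: add a pseudocount and renormalize so log(0) never occurs.
    total = sum(p) + eps * len(p)
    return [(x + eps) / total for x in p]


def _kl(p, q):
    """Kullback-Leibler divergence of strictly positive distributions."""
    return sum(x * math.log(x / y) for x, y in zip(p, q))


def symmetric_kl_distance(a, b, eps=1e-9):
    """Square root of the mean of the two KL terms, as in the text."""
    p, q = _smooth(a, eps), _smooth(b, eps)
    # max() guards against tiny negative values from rounding.
    return math.sqrt(max(0.0, (_kl(p, q) + _kl(q, p)) / 2))


def ks_distance(a, b):
    """Largest gap between the empirical CDFs of the values in a and b."""
    sorted_a, sorted_b = sorted(a), sorted(b)

    def ecdf(sorted_values, x):
        return bisect.bisect_right(sorted_values, x) / len(sorted_values)

    points = sorted(set(sorted_a) | set(sorted_b))
    return max(abs(ecdf(sorted_a, x) - ecdf(sorted_b, x)) for x in points)


def mean_distance(A, B, metric):
    """Average a per-patient metric over matching patient vectors."""
    return sum(metric(a, b) for a, b in zip(A, B)) / len(A)


# Toy example: two patients, two cell types (one list per patient).
A = [[0.5, 0.5], [1.0, 0.0]]
B = [[0.5, 0.5], [0.0, 1.0]]
print(mean_distance(A, B, euclidean))
print(mean_distance(A, B, ks_distance))
```

Under this reading of the KS distance, the second toy patient has a distance of zero because both vectors contain the same set of values, even though they are assigned to different cell types; the Euclidean and KL-based distances do register that difference.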

Acknowledgments

This work was done in collaboration with the US National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium (CPTAC). We would like to acknowledge the CPTAC 3 PanCancer Immune working group for their valuable feedback. PNNL is operated for the DOE by Battelle Memorial Institute under contract DE-AC05-76RL01830.

Author contributions

S.F. was the lead developer on Docker containers and workflows and wrote the manuscript. A.C. tested scripts, supported documentation, and wrote the manuscript. F.P. provided guidance and wrote the manuscript. P.W. provided guidance and wrote the manuscript. P.P. evaluated tumor immunity and wrote the manuscript. S.J.C.G. led project development and wrote the manuscript.

Declaration of interests

The authors declare no competing interests.

Published: February 26, 2024

References

1. Fridman W.H., Zitvogel L., Sautès-Fridman C., Kroemer G. The immune contexture in cancer prognosis and treatment. Nat. Rev. Clin. Oncol. 2017;14:717–734. doi: 10.1038/nrclinonc.2017.101.
2. Hanahan D., Weinberg R.A. Hallmarks of cancer: the next generation. Cell. 2011;144:646–674. doi: 10.1016/j.cell.2011.02.013.
3. Baghban R., Roshangar L., Jahanban-Esfahlan R., Seidi K., Ebrahimi-Kalan A., Jaymand M., Kolahian S., Javaheri T., Zare P. Tumor microenvironment complexity and therapeutic implications at a glance. Cell Commun. Signal. 2020;18:59. doi: 10.1186/s12964-020-0530-4.
4. Gun S.Y., Lee S.W.L., Sieow J.L., Wong S.C. Targeting immune cells for cancer therapy. Redox Biol. 2019;25 doi: 10.1016/j.redox.2019.101174.
5. Ali H.R., Jackson H.W., Zanotelli V.R.T., Danenberg E., Fischer J.R., Bardwell H., Provenzano E., CRUK IMAXT Grand Challenge Team. Rueda O.M., Chin S.F., et al. Imaging mass cytometry and multiplatform genomics define the phenogenomic landscape of breast cancer. Nat. Can. (Ott.) 2020;1:163–175. doi: 10.1038/s43018-020-0026-6.
6. Bodenmiller B. Multiplexed Epitope-Based Tissue Imaging for Discovery and Healthcare Applications. Cell Syst. 2016;2:225–238. doi: 10.1016/j.cels.2016.03.008.
7. Lun X.-K., Szklarczyk D., Gábor A., Dobberstein N., Zanotelli V.R.T., Saez-Rodriguez J., von Mering C., Bodenmiller B. Analysis of the Human Kinome and Phosphatome by Mass Cytometry Reveals Overexpression-Induced Effects on Cancer-Related Signaling. Mol. Cell. 2019;74:1086–1102.e5. doi: 10.1016/j.molcel.2019.04.021.
8. Simoni Y., Chng M.H.Y., Li S., Fehlings M., Newell E.W. Mass cytometry: a powerful tool for dissecting the immune landscape. Curr. Opin. Immunol. 2018;51:187–196. doi: 10.1016/j.coi.2018.03.023.
9. Pliner H.A., Shendure J., Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods. 2019;16:983–986. doi: 10.1038/s41592-019-0535-3.
10. Liu X., Gosline S.J.C., Pflieger L.T., Wallet P., Iyer A., Guinney J., Bild A.H., Chang J.T. Knowledge-based classification of fine-grained immune cell types in single-cell RNA-Seq data. Briefings Bioinf. 2021;22:bbab039. doi: 10.1093/bib/bbab039.
11. Azizi E., Carr A.J., Plitas G., Cornish A.E., Konopacki C., Prabhakaran S., Nainys J., Wu K., Kiseliovas V., Setty M., et al. Single-Cell Map of Diverse Immune Phenotypes in the Breast Tumor Microenvironment. Cell. 2018;174:1293–1308.e36. doi: 10.1016/j.cell.2018.05.060.
12. Sokolowski D.J., Faykoo-Martinez M., Erdman L., Hou H., Chan C., Zhu H., Holmes M.M., Goldenberg A., Wilson M.D. Single-cell mapper (scMappR): using scRNA-seq to infer the cell-type specificities of differentially expressed genes. NAR Genom. Bioinform. 2021;3 doi: 10.1093/nargab/lqab011.
13. Finotello F., Trajanoski Z. Quantifying tumor-infiltrating immune cells from transcriptomics data. Cancer Immunol. Immunother. 2018;67:1031–1040. doi: 10.1007/s00262-018-2150-z.
14. Chen B., Khodadoust M.S., Liu C.L., Newman A.M., Alizadeh A.A. Profiling tumor infiltrating immune cells with CIBERSORT. Methods Mol. Biol. 2018;1711:243–259. doi: 10.1007/978-1-4939-7493-1_12.
15. Aran D., Hu Z., Butte A.J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017;18:220. doi: 10.1186/s13059-017-1349-1.
16. Racle J., Gfeller D. In: Bioinformatics for Cancer Immunotherapy: Methods and Protocols Methods in Molecular Biology. Boegel S., editor. Springer US; 2020. EPIC: A Tool to Estimate the Proportions of Different Cell Types from Bulk Gene Expression Data; pp. 233–248.
17. Becht E., Giraldo N.A., Lacroix L., Buttard B., Elarouci N., Petitprez F., Selves J., Laurent-Puig P., Sautès-Fridman C., Fridman W.H., de Reyniès A. Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. Genome Biol. 2016;17:218. doi: 10.1186/s13059-016-1070-5.
18. Liu C.C., Steen C.B., Newman A.M. Computational approaches for characterizing the tumor immune microenvironment. Immunology. 2019;158:70–84. doi: 10.1111/imm.13101.
19. Newman A.M., Steen C.B., Liu C.L., Gentles A.J., Chaudhuri A.A., Scherer F., Khodadoust M.S., Esfahani M.S., Luca B.A., Steiner D., et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 2019;37:773–782. doi: 10.1038/s41587-019-0114-2.
20. Racle J., de Jonge K., Baumgaertner P., Speiser D.E., Gfeller D. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. Elife. 2017;6 doi: 10.7554/eLife.26476.
21. Sturm G., Finotello F., List M. In: Bioinformatics for Cancer Immunotherapy: Methods and Protocols Methods in Molecular Biology. Boegel S., editor. Springer US; 2020. Immunedeconv: An R Package for Unified Access to Computational Methods for Estimating Immune Cell Fractions from Bulk RNA-Sequencing Data; pp. 223–232.
22. Clark D.J., Dhanasekaran S.M., Petralia F., Pan J., Song X., Hu Y., da Veiga Leprevost F., Reva B., Lih T.-S.M., Chang H.-Y., et al. Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma. Cell. 2019;179:964–983.e31. doi: 10.1016/j.cell.2019.10.007.
23. Dou Y., Kawaler E.A., Cui Zhou D., Gritsenko M.A., Huang C., Blumenberg L., Karpova A., Petyuk V.A., Savage S.R., Satpathy S., et al. Proteogenomic Characterization of Endometrial Carcinoma. Cell. 2020;180:729–748.e26. doi: 10.1016/j.cell.2020.01.026.
24. Gillette M.A., Satpathy S., Cao S., Dhanasekaran S.M., Vasaikar S.V., Krug K., Petralia F., Li Y., Liang W.-W., Reva B., et al. Proteogenomic Characterization Reveals Therapeutic Vulnerabilities in Lung Adenocarcinoma. Cell. 2020;182:200–225.e35. doi: 10.1016/j.cell.2020.06.013.
25. Zhang H., Liu T., Zhang Z., Payne S.H., Zhang B., McDermott J.E., Zhou J.-Y., Petyuk V.A., Chen L., Ray D., et al. Integrated proteogenomic characterization of human high grade serous ovarian cancer. Cell. 2016;166:755–765. doi: 10.1016/j.cell.2016.05.069.
26. Li Y., Dou Y., Da Veiga Leprevost F., Geffen Y., Calinawan A.P., Aguet F., Akiyama Y., Anand S., Birger C., Cao S., et al. Proteogenomic data and resources for pan-cancer analysis. Cancer Cell. 2023;41:1397–1406. doi: 10.1016/j.ccell.2023.06.009.
27. Fortelny N., Overall C.M., Pavlidis P., Freue G.V.C. Can we predict protein from mRNA levels? Nature. 2017;547:E19–E20. doi: 10.1038/nature22293.
28. McManus J., Cheng Z., Vogel C. Next-generation analysis of gene expression regulation--comparing the roles of synthesis and degradation. Mol. Biosyst. 2015;11:2680–2689. doi: 10.1039/c5mb00310e.
29. Payne S.H. The utility of protein and mRNA correlation. Trends Biochem. Sci. 2015;40:1–3. doi: 10.1016/j.tibs.2014.10.010.
30. Nagaraj N., Wisniewski J.R., Geiger T., Cox J., Kircher M., Kelso J., Pääbo S., Mann M. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 2011;7:548. doi: 10.1038/msb.2011.81.
31. Rieckmann J.C., Geiger R., Hornburg D., Wolf T., Kveler K., Jarrossay D., Sallusto F., Shen-Orr S.S., Lanzavecchia A., Mann M., Meissner F. Social network architecture of human immune cells unveiled by quantitative proteomics. Nat. Immunol. 2017;18:583–593. doi: 10.1038/ni.3693.
32. Avila Cobos F., Alquicira-Hernandez J., Powell J.E., Mestdagh P., De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat. Commun. 2020;11:5650. doi: 10.1038/s41467-020-19015-1.
33. Decamps C., Arnaud A., Petitprez F., Ayadi M., Baurès A., Armenoult L., HADACA consortium. Arnaud A., Guyon I., Nicolle R., Escalera S. DECONbench: a benchmarking platform dedicated to deconvolution methods for tumor heterogeneity quantification. BMC Bioinf. 2021;22:473. doi: 10.1186/s12859-021-04381-4.
34. Jin H., Liu Z. A benchmark for RNA-seq deconvolution analysis under dynamic testing environments. Genome Biol. 2021;22:102. doi: 10.1186/s13059-021-02290-6.
35. Lindgren C.M., Adams D.W., Kimball B., Boekweg H., Tayler S., Pugh S.L., Payne S.H. Simplified and Unified Access to Cancer Proteogenomic Data. J. Proteome Res. 2021;20:1902–1910. doi: 10.1021/acs.jproteome.0c00919.
36. Li Y., Dou Y., da Veiga Leprevost F., Geffen Y., Calinawan A.P., Aguet F., Akiyama Y., Ding L., Nesvizhskii A., Wang P., et al. Proteogenomic data and resources for pan-cancer analysis. Cancer Cell. 2023;41:1397–1406. doi: 10.1016/j.ccell.2023.06.009.
37. Naba A., Clauser K.R., Hoersch S., Liu H., Carr S.A., Hynes R.O. The matrisome: in silico definition and in vivo characterization by proteomics of normal and tumor extracellular matrices. Mol. Cell. Proteomics. 2012;11 doi: 10.1074/mcp.M111.014647.
38. Thorsson V., Gibbs D.L., Brown S.D., Wolf D., Bortone D.S., Ou Yang T.-H., Porta-Pardo E., Gao G.F., Plaisier C.L., Eddy J.A., et al. The Immune Landscape of Cancer. Immunity. 2018;48:812–830.e14. doi: 10.1016/j.immuni.2018.03.023.
39. Linsley P.S., Speake C., Whalen E., Chaussabel D. Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. PLoS One. 2014;9 doi: 10.1371/journal.pone.0109760.
40. Gibbs D.L. Robust classification of Immune Subtypes in Cancer. Preprint at bioRxiv. 2020. doi: 10.1101/2020.01.17.910950.
41. Anderson A.C., Yanai I., Yates L.R., Wang L., Swarbrick A., Sorger P., Santagata S., Fridman W.H., Gao Q., Jerby L., et al. Spatial transcriptomics. Cancer Cell. 2022;40:895–900. doi: 10.1016/j.ccell.2022.08.021.
42. Bhatia H.S., Brunner A.-D., Öztürk F., Kapoor S., Rong Z., Mai H., Thielert M., Ali M., Al-Maskari R., Paetzold J.C., et al. Spatial proteomics in three-dimensional intact specimens. Cell. 2022;185:5040–5058.e19. doi: 10.1016/j.cell.2022.11.021.
43. Guilliams M., Bonnardel J., Haest B., Vanderborght B., Wagner C., Remmerie A., Bujko A., Martens L., Thoné T., Browaeys R., et al. Spatial proteogenomics reveals distinct and evolutionarily conserved hepatic macrophage niches. Cell. 2022;185:379–396.e38. doi: 10.1016/j.cell.2021.12.018.
44. Zhang H., Liu T., Zhang Z., Payne S.H., Zhang B., McDermott J.E., Zhou J.-Y., Petyuk V.A., Chen L., Ray D., et al. Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer. Cell. 2016;166:755–765. doi: 10.1016/j.cell.2016.05.069.
45. Shen-Orr S.S., Tibshirani R., Khatri P., Bodian D.L., Staedtler F., Perry N.M., Hastie T., Sarwal M.M., Davis M.M., Butte A.J. Cell type–specific gene expression differences in complex tissues. Nat. Methods. 2010;7:287–289. doi: 10.1038/nmeth.1439.
46. Newman A.M., Liu C.L., Green M.R., Gentles A.J., Feng W., Xu Y., Hoang C.D., Diehn M., Alizadeh A.A. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods. 2015;12:453–457. doi: 10.1038/nmeth.3337.
47. Crusoe M.R., Abeln S., Iosup A., Amstutz P., Chilton J., Tijanić N., Ménager H., Soiland-Reyes S., Gavrilović B., Goble C., Community T.C. Methods included: standardizing computational reuse and portability with the Common Workflow Language. Commun. ACM. 2022;65:54–63. doi: 10.1145/3486897.
48. Petralia F., Krek A., Calinawan A.P., Feng S., Gosline S., Pugliese P., Ceccarelli M., Wang P. BayesDeBulk: A Flexible Bayesian Algorithm for the Deconvolution of Bulk Tumor Data. Preprint at bioRxiv. 2022. doi: 10.1101/2021.06.25.449763.

[bib1] 1.Fridman W.H., Zitvogel L., Sautès-Fridman C., Kroemer G. The immune contexture in cancer prognosis and treatment. Nat. Rev. Clin. Oncol. 2017;14:717–734. doi: 10.1038/nrclinonc.2017.101. [DOI] [PubMed] [Google Scholar]

[bib2] 2.Hanahan D., Weinberg R.A. Hallmarks of cancer: the next generation. Cell. 2011;144:646–674. doi: 10.1016/j.cell.2011.02.013. [DOI] [PubMed] [Google Scholar]

[bib3] 3.Baghban R., Roshangar L., Jahanban-Esfahlan R., Seidi K., Ebrahimi-Kalan A., Jaymand M., Kolahian S., Javaheri T., Zare P. Tumor microenvironment complexity and therapeutic implications at a glance. Cell Commun. Signal. 2020;18:59. doi: 10.1186/s12964-020-0530-4. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] 4.Gun S.Y., Lee S.W.L., Sieow J.L., Wong S.C. Targeting immune cells for cancer therapy. Redox Biol. 2019;25 doi: 10.1016/j.redox.2019.101174. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] 5.Ali H.R., Jackson H.W., Zanotelli V.R.T., Danenberg E., Fischer J.R., Bardwell H., Provenzano E., CRUK IMAXT Grand Challenge Team. Rueda O.M., Chin S.F., et al. Imaging mass cytometry and multiplatform genomics define the phenogenomic landscape of breast cancer. Nat. Can. (Ott.) 2020;1:163–175. doi: 10.1038/s43018-020-0026-6. [DOI] [PubMed] [Google Scholar]

[bib6] 6.Bodenmiller B. Multiplexed Epitope-Based Tissue Imaging for Discovery and Healthcare Applications. Cell Syst. 2016;2:225–238. doi: 10.1016/j.cels.2016.03.008. [DOI] [PubMed] [Google Scholar]

[bib7] 7.Lun X.-K., Szklarczyk D., Gábor A., Dobberstein N., Zanotelli V.R.T., Saez-Rodriguez J., von Mering C., Bodenmiller B. Analysis of the Human Kinome and Phosphatome by Mass Cytometry Reveals Overexpression-Induced Effects on Cancer-Related Signaling. Mol. Cell. 2019;74:1086–1102.e5. doi: 10.1016/j.molcel.2019.04.021. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib8] 8.Simoni Y., Chng M.H.Y., Li S., Fehlings M., Newell E.W. Mass cytometry: a powerful tool for dissecting the immune landscape. Curr. Opin. Immunol. 2018;51:187–196. doi: 10.1016/j.coi.2018.03.023. [DOI] [PubMed] [Google Scholar]

[bib9] 9.Pliner H.A., Shendure J., Trapnell C. Supervised classification enables rapid annotation of cell atlases. Nat. Methods. 2019;16:983–986. doi: 10.1038/s41592-019-0535-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib10] 10.Liu X., Gosline S.J.C., Pflieger L.T., Wallet P., Iyer A., Guinney J., Bild A.H., Chang J.T. Knowledge-based classification of fine-grained immune cell types in single-cell RNA-Seq data. Briefings Bioinf. 2021;22:bbab039. doi: 10.1093/bib/bbab039. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib11] 11.Azizi E., Carr A.J., Plitas G., Cornish A.E., Konopacki C., Prabhakaran S., Nainys J., Wu K., Kiseliovas V., Setty M., et al. Single-Cell Map of Diverse Immune Phenotypes in the Breast Tumor Microenvironment. Cell. 2018;174:1293–1308.e36. doi: 10.1016/j.cell.2018.05.060. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12.Sokolowski D.J., Faykoo-Martinez M., Erdman L., Hou H., Chan C., Zhu H., Holmes M.M., Goldenberg A., Wilson M.D. Single-cell mapper (scMappR): using scRNA-seq to infer the cell-type specificities of differentially expressed genes. NAR Genom. Bioinform. 2021;3 doi: 10.1093/nargab/lqab011. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] 13.Finotello F., Trajanoski Z. Quantifying tumor-infiltrating immune cells from transcriptomics data. Cancer Immunol. Immunother. 2018;67:1031–1040. doi: 10.1007/s00262-018-2150-z. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib14] 14.Chen B., Khodadoust M.S., Liu C.L., Newman A.M., Alizadeh A.A. Profiling tumor infiltrating immune cells with CIBERSORT. Methods Mol. Biol. 2018;1711:243–259. doi: 10.1007/978-1-4939-7493-1_12. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib15] 15.Aran D., Hu Z., Butte A.J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 2017;18:220. doi: 10.1186/s13059-017-1349-1. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] 16.Racle J., Gfeller D. In: Bioinformatics for Cancer Immunotherapy: Methods and Protocols Methods in Molecular Biology. Boegel S., editor. Springer US; 2020. EPIC: A Tool to Estimate the Proportions of Different Cell Types from Bulk Gene Expression Data; pp. 233–248. [DOI] [PubMed] [Google Scholar]

[bib17] 17.Becht E., Giraldo N.A., Lacroix L., Buttard B., Elarouci N., Petitprez F., Selves J., Laurent-Puig P., Sautès-Fridman C., Fridman W.H., de Reyniès A. Estimating the population abundance of tissue-infiltrating immune and stromal cell populations using gene expression. Genome Biol. 2016;17:218. doi: 10.1186/s13059-016-1070-5. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib18] 18.Liu C.C., Steen C.B., Newman A.M. Computational approaches for characterizing the tumor immune microenvironment. Immunology. 2019;158:70–84. doi: 10.1111/imm.13101. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] 19.Newman A.M., Steen C.B., Liu C.L., Gentles A.J., Chaudhuri A.A., Scherer F., Khodadoust M.S., Esfahani M.S., Luca B.A., Steiner D., et al. Determining cell type abundance and expression from bulk tissues with digital cytometry. Nat. Biotechnol. 2019;37:773–782. doi: 10.1038/s41587-019-0114-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] 20.Racle J., de Jonge K., Baumgaertner P., Speiser D.E., Gfeller D. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. Elife. 2017;6 doi: 10.7554/eLife.26476. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] 21.Sturm G., Finotello F., List M. In: Bioinformatics for Cancer Immunotherapy: Methods and Protocols Methods in Molecular Biology. Boegel S., editor. Springer US; 2020. Immunedeconv: An R Package for Unified Access to Computational Methods for Estimating Immune Cell Fractions from Bulk RNA-Sequencing Data; pp. 223–232. [DOI] [PubMed] [Google Scholar]

[bib22] 22.Clark D.J., Dhanasekaran S.M., Petralia F., Pan J., Song X., Hu Y., da Veiga Leprevost F., Reva B., Lih T.-S.M., Chang H.-Y., et al. Integrated Proteogenomic Characterization of Clear Cell Renal Cell Carcinoma. Cell. 2019;179:964–983.e31. doi: 10.1016/j.cell.2019.10.007. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] 23.Dou Y., Kawaler E.A., Cui Zhou D., Gritsenko M.A., Huang C., Blumenberg L., Karpova A., Petyuk V.A., Savage S.R., Satpathy S., et al. Proteogenomic Characterization of Endometrial Carcinoma. Cell. 2020;180:729–748.e26. doi: 10.1016/j.cell.2020.01.026. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib24] 24.Gillette M.A., Satpathy S., Cao S., Dhanasekaran S.M., Vasaikar S.V., Krug K., Petralia F., Li Y., Liang W.-W., Reva B., et al. Proteogenomic Characterization Reveals Therapeutic Vulnerabilities in Lung Adenocarcinoma. Cell. 2020;182:200–225.e35. doi: 10.1016/j.cell.2020.06.013. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] 25.Zhang H., Liu T., Zhang Z., Payne S.H., Zhang B., McDermott J.E., Zhou J.-Y., Petyuk V.A., Chen L., Ray D., et al. Integrated proteogenomic characterization of human high grade serous ovarian cancer. Cell. 2016;166:755–765. doi: 10.1016/j.cell.2016.05.069. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib26] 26. Li Y., Dou Y., Da Veiga Leprevost F., Geffen Y., Calinawan A.P., Aguet F., Akiyama Y., Anand S., Birger C., Cao S., et al. Proteogenomic data and resources for pan-cancer analysis. Cancer Cell. 2023;41:1397–1406. doi: 10.1016/j.ccell.2023.06.009.

[bib27] 27. Fortelny N., Overall C.M., Pavlidis P., Freue G.V.C. Can we predict protein from mRNA levels? Nature. 2017;547:E19–E20. doi: 10.1038/nature22293.

[bib28] 28. McManus J., Cheng Z., Vogel C. Next-generation analysis of gene expression regulation--comparing the roles of synthesis and degradation. Mol. Biosyst. 2015;11:2680–2689. doi: 10.1039/c5mb00310e.

[bib29] 29. Payne S.H. The utility of protein and mRNA correlation. Trends Biochem. Sci. 2015;40:1–3. doi: 10.1016/j.tibs.2014.10.010.

[bib30] 30. Nagaraj N., Wisniewski J.R., Geiger T., Cox J., Kircher M., Kelso J., Pääbo S., Mann M. Deep proteome and transcriptome mapping of a human cancer cell line. Mol. Syst. Biol. 2011;7:548. doi: 10.1038/msb.2011.81.

[bib31] 31. Rieckmann J.C., Geiger R., Hornburg D., Wolf T., Kveler K., Jarrossay D., Sallusto F., Shen-Orr S.S., Lanzavecchia A., Mann M., Meissner F. Social network architecture of human immune cells unveiled by quantitative proteomics. Nat. Immunol. 2017;18:583–593. doi: 10.1038/ni.3693.

[bib32] 32. Avila Cobos F., Alquicira-Hernandez J., Powell J.E., Mestdagh P., De Preter K. Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat. Commun. 2020;11:5650. doi: 10.1038/s41467-020-19015-1.

[bib33] 33. Decamps C., Arnaud A., Petitprez F., Ayadi M., Baurès A., Armenoult L., HADACA consortium. Arnaud A., Guyon I., Nicolle R., Escalera S. DECONbench: a benchmarking platform dedicated to deconvolution methods for tumor heterogeneity quantification. BMC Bioinf. 2021;22:473. doi: 10.1186/s12859-021-04381-4.

[bib34] 34. Jin H., Liu Z. A benchmark for RNA-seq deconvolution analysis under dynamic testing environments. Genome Biol. 2021;22:102. doi: 10.1186/s13059-021-02290-6.

[bib35] 35. Lindgren C.M., Adams D.W., Kimball B., Boekweg H., Tayler S., Pugh S.L., Payne S.H. Simplified and Unified Access to Cancer Proteogenomic Data. J. Proteome Res. 2021;20:1902–1910. doi: 10.1021/acs.jproteome.0c00919.

[bib36] 36. Li Y., Dou Y., da Veiga Leprevost F., Geffen Y., Calinawan A.P., Aguet F., Akiyama Y., Ding L., Nesvizhskii A., Wang P., et al. Proteogenomic Data and Resources for Pan-Cancer Analysis. Cancer Cell. 2023;41(8):1397–1406. doi: 10.1016/j.ccell.2023.06.009.

[bib37] 37. Naba A., Clauser K.R., Hoersch S., Liu H., Carr S.A., Hynes R.O. The matrisome: in silico definition and in vivo characterization by proteomics of normal and tumor extracellular matrices. Mol. Cell. Proteomics. 2012;11 doi: 10.1074/mcp.M111.014647.

[bib38] 38. Thorsson V., Gibbs D.L., Brown S.D., Wolf D., Bortone D.S., Ou Yang T.-H., Porta-Pardo E., Gao G.F., Plaisier C.L., Eddy J.A., et al. The Immune Landscape of Cancer. Immunity. 2018;48:812–830.e14. doi: 10.1016/j.immuni.2018.03.023.

[bib39] 39. Linsley P.S., Speake C., Whalen E., Chaussabel D. Copy number loss of the interferon gene cluster in melanomas is linked to reduced T cell infiltrate and poor patient prognosis. PLoS One. 2014;9 doi: 10.1371/journal.pone.0109760.

[bib40] 40. Gibbs D.L. Robust classification of Immune Subtypes in Cancer. Preprint at bioRxiv. 2020. doi: 10.1101/2020.01.17.910950.

[bib41] 41. Anderson A.C., Yanai I., Yates L.R., Wang L., Swarbrick A., Sorger P., Santagata S., Fridman W.H., Gao Q., Jerby L., et al. Spatial transcriptomics. Cancer Cell. 2022;40:895–900. doi: 10.1016/j.ccell.2022.08.021.

[bib42] 42. Bhatia H.S., Brunner A.-D., Öztürk F., Kapoor S., Rong Z., Mai H., Thielert M., Ali M., Al-Maskari R., Paetzold J.C., et al. Spatial proteomics in three-dimensional intact specimens. Cell. 2022;185:5040–5058.e19. doi: 10.1016/j.cell.2022.11.021.

[bib43] 43. Guilliams M., Bonnardel J., Haest B., Vanderborght B., Wagner C., Remmerie A., Bujko A., Martens L., Thoné T., Browaeys R., et al. Spatial proteogenomics reveals distinct and evolutionarily conserved hepatic macrophage niches. Cell. 2022;185:379–396.e38. doi: 10.1016/j.cell.2021.12.018.

[bib44] 44. Zhang H., Liu T., Zhang Z., Payne S.H., Zhang B., McDermott J.E., Zhou J.-Y., Petyuk V.A., Chen L., Ray D., et al. Integrated Proteogenomic Characterization of Human High-Grade Serous Ovarian Cancer. Cell. 2016;166:755–765. doi: 10.1016/j.cell.2016.05.069.

[bib45] 45. Shen-Orr S.S., Tibshirani R., Khatri P., Bodian D.L., Staedtler F., Perry N.M., Hastie T., Sarwal M.M., Davis M.M., Butte A.J. Cell type–specific gene expression differences in complex tissues. Nat. Methods. 2010;7:287–289. doi: 10.1038/nmeth.1439.

[bib46] 46. Newman A.M., Liu C.L., Green M.R., Gentles A.J., Feng W., Xu Y., Hoang C.D., Diehn M., Alizadeh A.A. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods. 2015;12:453–457. doi: 10.1038/nmeth.3337.

[bib47] 47. Crusoe M.R., Abeln S., Iosup A., Amstutz P., Chilton J., Tijanić N., Ménager H., Soiland-Reyes S., Gavrilović B., Goble C., The CWL Community. Methods included: standardizing computational reuse and portability with the Common Workflow Language. Commun. ACM. 2022;65:54–63. doi: 10.1145/3486897.

[bib48] 48. Petralia F., Krek A., Calinawan A.P., Feng S., Gosline S., Pugliese P., Ceccarelli M., Wang P. BayesDeBulk: A Flexible Bayesian Algorithm for the Deconvolution of Bulk Tumor Data. Preprint at bioRxiv. 2022. doi: 10.1101/2021.06.25.449763.
