Abstract
Inferring the cell-type composition of bulk samples can provide biological insight. While bulk transcriptomics data has been extensively used for this purpose, the use of proteomics data has remained unexplored until recently. This study evaluates computational approaches for estimating immune cell composition using bulk sample proteomics data. Leveraging defined immune cell populations and simulated mixtures, we assess the impact of preprocessing methods and software tools on cell deconvolution outcomes. Our findings demonstrate the feasibility of using proteomics data for cell-type deconvolution, with Pearson correlations for estimated proportions in simulated sample mixtures above 0.9 when employing optimal missing value imputation and reference matrix generation parameters. We further provide an R package, proteoDeconv, to facilitate the preprocessing of proteomics data for deconvolution and parsing of results. This study highlights the feasibility of using proteomics for analyzing cell-type composition in biological samples.
Keywords: proteomics, cell-type deconvolution, immune cells, LC-MS/MS, immune infiltration


Introduction
The type, location and level of tumor-infiltrating immune cells have been shown to have prognostic value for many cancer types. Aspects of immune infiltration in the tumor microenvironment may be utilized in clinical decision-making, for instance for stratification of patients into treatment groups. Several direct methods for measuring immune cell infiltration exist, for example flow cytometry and immunohistochemistry. Alternatively, immune cell infiltration has been successfully estimated using bulk transcriptomics coupled with deconvolution algorithms. An advantage of such a strategy is that the cell composition of a sample can be estimated from the same transcriptomics sample that is also used for measuring other markers in parallel. However, as cell types can be defined by the levels of many protein markers, deconvolution is complex and the absolute levels of proteins are difficult to estimate based on transcript levels alone. With proteomics measuring the actual end product (proteins) rather than just the transcript, one could potentially get more accurate information about immune cell infiltration than with transcriptomics. However, the use of proteomics coupled with deconvolution algorithms has been largely unexplored until recently.
Available deconvolution tools have mainly been developed for transcriptomics data and can be broadly categorized into two main approaches: marker gene-based and reference-based methods. Marker gene-based approaches rely on specific genes associated with each cell type. Models quantify cell types independently by analyzing the expression of marker genes in heterogeneous samples. Reference-based methods, on the other hand, treat the problem as a system of equations. They describe gene expression in a sample as a weighted sum of expression profiles from different cell types. By solving this inverse problem, cell-type fractions can be inferred (Figure 1). Deconvolution tools can also be classified as partial or complete, depending on whether only the cell types that are included in the signature matrix are quantified (partial deconvolution) or the algorithm attempts to quantify all cell types in the sample (complete deconvolution). Popular algorithms for immune deconvolution include CIBERSORTx and EPIC. They have been extensively evaluated with transcriptomics data. The developments in recent years (so-called second-generation immune deconvolution) have focused on algorithms that generate signatures from single-cell RNA sequencing (scRNA-seq).
Figure 1. Schematic figure showing the principle behind reference-based immune deconvolution.
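The reference-based principle, in which a bulk profile is modeled as a non-negative weighted sum of cell-type profiles, can be illustrated with a minimal sketch. It uses plain projected gradient descent for non-negative least squares, which is a stand-in chosen for brevity and is not the solver used by CIBERSORT, CIBERSORTx, EPIC, or BayesDeBulk. The function name `deconvolve` and the toy signature values are hypothetical.

```python
def deconvolve(signature, bulk, n_iter=5000):
    """Estimate non-negative cell-type fractions f so that signature * f ~ bulk.

    signature: list of rows (one per protein), one value per cell type.
    bulk: list of measured abundances, one per protein.
    """
    n_types = len(signature[0])
    # Step size 1/L, where L (squared Frobenius norm) bounds the largest
    # eigenvalue of signature^T signature, so the iteration cannot diverge.
    step = 1.0 / sum(v * v for row in signature for v in row)
    f = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [sum(s * x for s, x in zip(row, f)) - b
                    for row, b in zip(signature, bulk)]
        grad = [sum(row[k] * r for row, r in zip(signature, residual))
                for k in range(n_types)]
        # Gradient step, then projection onto the non-negative orthant.
        f = [max(0.0, x - step * g) for x, g in zip(f, grad)]
    total = sum(f)
    # Rescale so the reported fractions sum to 1.
    return [x / total for x in f] if total > 0 else f


# Hypothetical toy signature: 4 proteins (rows) x 3 cell types (columns).
signature = [[10.0, 1.0, 0.0],
             [1.0, 8.0, 1.0],
             [0.0, 2.0, 9.0],
             [5.0, 5.0, 5.0]]
true_fractions = [0.2, 0.5, 0.3]
# Noise-free bulk sample built as the weighted sum of the cell-type profiles.
bulk = [sum(s * f for s, f in zip(row, true_fractions)) for row in signature]
estimated = deconvolve(signature, bulk)
```

In this noise-free toy case the estimated fractions recover the mixing weights; with real data, noise and missing cell types make the inverse problem considerably harder.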
Deconvolution tools have also been developed for other types of omics than transcriptomics, such as DNA methylation and chromatin accessibility. Proteomics for deconvolution has received some attention recently. Notably, two different algorithms have been developed: BayesDeBulk and scpDeconv. Neither of these algorithms has been evaluated in independent benchmarks. The only proteome-based assessment of deconvolution algorithms so far is the recent Decomprolute benchmarking platform. The effect of various preprocessing methods has, however, not been investigated with Decomprolute.
In this study, we explore various approaches to immune deconvolution with proteomics data and the impact of different steps. Preprocessing steps, including normalization, imputation, and gene symbol handling, are discussed and recommendations are provided. We evaluate deconvolution performance on both pure immune cell samples and artificial simulated mixtures. Multiple deconvolution algorithms are benchmarked, including BayesDeBulk, CIBERSORTx, CIBERSORT, and EPIC. We also present an R package, proteoDeconv, which facilitates pre- and postprocessing for immune deconvolution with proteomics data.
Experimental Section
Defined Immune Cell Data and Mixtures
Immune Cell Preparation for Proteomic Analysis
PBMCs were isolated with Ficoll–Paque PLUS (Cytiva, Uppsala, Sweden, 17144003) density centrifugation and collection of the lymphocyte layer in PBS-EDTA (Invitrogen, Grand Island, NY, USA, 15575-038). Isolated PBMCs were rested in RPMI media (Cytiva, South Logan, UT, SH30096.01) supplemented with 10% FBS (Gibco, Paisley, UK, 10270-106) and 2 mM L-glutamine (Cytiva, SH30034.01). Isolated cells were counted with 1:9 vol/vol trypan blue (Gibco, Green Island, NY, USA, 15250-061) on a Luna Counter FL system (Logos Biosystems, Anyang-si, Gyeonggi-do, South Korea, L20001), ensuring a cell viability higher than 90%. The PBMC replicates for MS analysis were prepared by transferring 1 μL of the cell suspension (3.92 × 10^7 cells/mL) into 79 μL PBS and splitting into four 20 μL fractions.
Isolated PBMCs were stained with viability staining (BD Horizon Fixable Viability Stain 620, BD Biosciences, 564996), Fc receptors were blocked (ChromPure Mouse IgG, Jackson ImmunoResearch Laboratories INC., Ely, Cambridgeshire, UK 015-000-003) and then subsequently stained with our flow cytometry panel. The antibody panel consists of: CD3-APC (Life Technologies, MHDC0305), CD19-RPE (Dako, R0808), CD56-BV605 (BioLegend, 362537), CD8-PE/Cy7 (BD Pharmingen, 557746), CD14-FITC (Life Technologies, 2300712), CD27-BV510 (BD Horizon, 563092), CD20-PerCP/Cy5.5 (BD Pharmingen, 560736), CD16-BV786 (BD Horizon, 563690), HLA-DR-BV711 (BD Horizon, 563696). Acquisition and sorting of target cell populations were performed on BD FACSAria IIu (BD Biosciences, San Jose, CA) cell sorter. B-cells, naïve B-cells, T-cells, cytotoxic T-cells, monocytes and activated NK cells were sorted, approximately 10,000 cells per target population, with a 100 μm nozzle in 4-way-purity mode. Mixes of 50% cytotoxic T-cells and 50% monocytes were also acquired. 100 μL 2% SDS (pH 7) was added to each cell sample and the samples were heated to 95 °C for 7 min. 100 μL 2% SDS 20 mM CAA 10 mM TCEP 0.1 M Tris was then added and samples were left standing at room temperature before freezing at −80 °C followed by thawing and sonication.
Samples were prepared for mass spectrometry using S-Trap (Protifi, Fairpoint, NY, USA, C02-micro-80), according to the manufacturer’s instructions with overnight LysC and trypsin (1:9) digestion. Peptide eluates were dried using a SpeedVac Concentrator (Thermo Fisher Scientific, Asheville, NC, USA, Savant SPD131DDA).
Liquid Chromatography–Mass Spectrometry Data Acquisition
Evotips (Evosep) were loaded with peptides resuspended in 0.1% formic acid according to the manufacturer’s instructions and were sequentially loaded on an Evosep One liquid chromatography system with a Picofrit 15 cm column of 360 μm OD × 75 μm ID (CoAnn Technologies LLC, Richland, WA, USA, ICT36007515F-50-5) self-packed in-house with 1.9 μm C18 (Dr. Maisch GmbH, Ammerbuch, Germany, r119.aq.0003) coupled to a Q Exactive HF-X (Thermo Fisher Scientific, Waltham, MA, USA) mass spectrometer. The peptides were separated using the standard 58 min Whisper method (Whisper 20 SPD method) with a column temperature at 40 °C and the spectra were acquired in data-independent acquisition (DIA) mode. The DIA method used a normalized collision energy of 27 with automatic injection time. Data acquisition was between 4 and 58 min in positive ion mode. MS1 spectra were collected at a target resolution of 60,000 with automatic gain control (AGC) target value 3 × 10^6 and 55 ms maximum injection time in the scan range 395–1005 m/z in centroid mode. DIA MS2 spectra were acquired with 15,000 resolution and 1 × 10^6 AGC target value and automatic maximum injection time, with 50 loop count and 12 m/z isolation window with a normalized collision energy of 27. The DIA inclusion list contained 101 staggered windows between 400 and 1000 m/z according to the 12 m/z staggered window method suggested by Pino et al.
DIA Data Processing
The proteomics raw data were converted to mzML format using vendor peak-picking and demultiplexing via MSconvert v.3.0.21266-1f16dae8 and were subsequently processed with DIA-NN version 1.8.1. In DIA-NN, library-free mode was employed, utilizing the UniProt human FASTA database from 2022-08-11 with common contaminants added as the input. Precursors with charge states ranging from 1 to 4, peptide lengths between 7 and 30, and peptide m/z values from 300 to 1800 were considered. Cysteine carbamidomethylation was set as a fixed modification, and no additional variable modifications were included. Quantification utilized “robust LC (high precision)” settings, and mass accuracy was automatically set. All proteomics data have been deposited, but the NK cells were excluded from downstream analyses because only two replicates were available.
Immune Cell Reference Proteome
Data from Rieckmann et al. was used as additional reference data for the immune cell proteome. The protein group file was retrieved from ProteomeXchange repository PXD004352. Activated and steady-state cells were grouped together for all analyses, and erythrocytes and thrombocytes were excluded.
Simulated Mixtures
In addition to experimentally generated 50–50 mixtures of two immune cell types (CD8+ T cells and monocytes), hundreds of in silico mixtures were generated by combining randomly selected replicate samples of pure cell-type proteomes in varying fractions from 0 to 1. These mixtures were generated using the immune cell reference proteome samples from Rieckmann et al. Simulations were conducted using the SimBu package, originally developed for transcriptomics data, without applying a scaling factor. Following simulation, the resulting expression matrices were subjected to deconvolution, and the estimated cell-type proportions were compared to the known simulated proportions. Pearson correlation coefficients and Root Mean Squared Errors (RMSE) were calculated for each tested cell type, and their mean values across all cell types are reported to assess deconvolution performance.
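The idea behind such in silico mixtures can be sketched as follows. This is a simplified illustration and not the SimBu implementation: it draws random fractions that sum to 1, picks one replicate per cell type at random, and forms the weighted sum of their linear-scale profiles. The function names and the toy intensities are hypothetical.

```python
import random


def random_fractions(n_types, rng):
    """Draw non-negative fractions that sum to 1."""
    draws = [rng.random() for _ in range(n_types)]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_mixture(pure_profiles, rng):
    """Build one simulated bulk sample.

    pure_profiles: dict mapping cell type -> list of replicate profiles,
    each profile a list of linear-scale intensities (same protein order).
    Returns the known fractions and the mixed profile.
    """
    cell_types = sorted(pure_profiles)
    fractions = random_fractions(len(cell_types), rng)
    # One randomly selected replicate per cell type.
    chosen = [rng.choice(pure_profiles[ct]) for ct in cell_types]
    n_proteins = len(chosen[0])
    mixture = [sum(f * profile[i] for f, profile in zip(fractions, chosen))
               for i in range(n_proteins)]
    return dict(zip(cell_types, fractions)), mixture


# Hypothetical toy data: two cell types, two replicates, three proteins.
pure = {
    "B cells": [[100.0, 5.0, 1.0], [110.0, 4.0, 2.0]],
    "Monocytes": [[2.0, 90.0, 3.0], [1.0, 95.0, 2.0]],
}
rng = random.Random(42)  # fixed seed for reproducibility
known_fractions, mixed_profile = simulate_mixture(pure, rng)
```

The known fractions returned alongside each mixture serve as the ground truth against which the deconvolution estimates are compared.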
Signature Matrices
Custom proteomics-derived signature matrices were generated using immune cell data from Rieckmann et al. as the reference. The original 28 cell types were consolidated into seven broader groups for signature matrix generation: CD8+ T cells, CD4+ T cells, Dendritic cells, Monocytes, B cells, NK cells, and Granulocytes. These custom signature matrices were constructed using the CIBERSORTx Docker image, executed via the proteoDeconv package, with a set of optimized parameters (settings deviating from the defaults are specified in the text): the minimum and maximum numbers of genes per cell type (G.min and G.max) were set to 200 and 400, a stringent q-value threshold of 0.01 was applied for differential expression, while nonhematopoietic genes were retained (filter = FALSE). When comparing signature matrices from scRNA-seq and proteomics, the matrix generation was reduced to the following five cell types: CD8+ T cells, CD4+ T cells, Monocytes, B cells and NK cells, with the scRNA-seq data derived from the CIBERSORTx Web site (data set: NSCLC PBMCs Single Cell RNA-Seq). For BayesDeBulk, markers were identified using limma-based pairwise comparisons, selecting proteins with expression >1000 and at least 3-fold higher expression in the target cell type relative to others, starting from the same protein-based reference matrix as the other methods.
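The two marker thresholds described for BayesDeBulk (expression above 1000 and at least 3-fold higher in the target cell type) can be illustrated with a small sketch. It applies the thresholds to mean intensities for each ordered pair of cell types and omits the limma statistics entirely, so it is an assumption-laden simplification rather than the procedure used in the study. The function name and toy values are hypothetical.

```python
def pairwise_markers(mean_expr, min_expr=1000.0, min_fold=3.0):
    """Select markers for every ordered pair (target, other) of cell types.

    mean_expr: dict mapping protein -> dict of cell type -> mean linear intensity.
    A protein is a marker of `target` versus `other` if its intensity in the
    target exceeds min_expr and is at least min_fold times that in `other`.
    """
    cell_types = sorted({ct for by_type in mean_expr.values() for ct in by_type})
    markers = {}
    for target in cell_types:
        for other in cell_types:
            if other == target:
                continue
            markers[(target, other)] = sorted(
                protein for protein, by_type in mean_expr.items()
                if by_type[target] > min_expr
                and by_type[target] >= min_fold * by_type[other]
            )
    return markers


# Hypothetical mean intensities (linear scale).
mean_expr = {
    "CD14": {"Monocytes": 9000.0, "B cells": 500.0},
    "CD19": {"Monocytes": 200.0, "B cells": 6000.0},
    "ACTB": {"Monocytes": 50000.0, "B cells": 48000.0},  # abundant, not specific
    "LOWX": {"Monocytes": 600.0, "B cells": 100.0},      # specific, too low
}
markers = pairwise_markers(mean_expr)
```

Here only CD14 qualifies as a monocyte marker and only CD19 as a B cell marker; the abundant but unspecific protein and the specific but low-intensity protein are both excluded.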
Data Analysis
Data processing and analysis were conducted in R version 4.4.2 using Posit’s Positron as IDE. When required, missing values were imputed using MsCoreUtils version 1.16.0. HGNC gene symbols were updated using HGNChelper version 0.8.14. Data were normalized with cyclic loess normalization or quantile normalization when applicable using limma version 3.60.0, and with vsn using the vsn package version 3.70.0. Following normalization, data were exponentiated to a linear scale, and each sample was scaled to a total intensity of 1 × 10^6, analogous to the Transcripts Per Million (TPM) approach in RNA-seq pipelines. CIBERSORT, EPIC, BayesDeBulk, and CIBERSORTx were executed via proteoDeconv, which internally calls their respective R packages or the Docker image, as appropriate. All processing steps were implemented in targets for reproducibility, while renv ensured a consistent computational environment. The full pipeline is available on GitHub: https://github.com/ComputationalProteomics/proteoDeconv-manuscript.
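The back-transformation and per-sample scaling step can be sketched as below. The sketch assumes log2-transformed input without missing values; the logarithm base is an assumption, as it is not stated in the text, and the function name is hypothetical.

```python
def to_linear_per_million(log2_intensities, target_total=1e6):
    """Exponentiate one sample's log2 intensities and scale them to a fixed total.

    Assumes base-2 logarithms and no missing values.
    """
    linear = [2.0 ** v for v in log2_intensities]
    total = sum(linear)
    return [x * target_total / total for x in linear]


# Toy sample with three proteins (log2 scale).
sample_log2 = [10.0, 11.0, 12.0]
sample_scaled = to_linear_per_million(sample_log2)
```

Scaling by a single factor per sample preserves the ratios between proteins within that sample while making total intensities comparable across samples.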
proteoDeconv R Package
An R package was developed to facilitate immune deconvolution with proteomics data. The package includes functions for preprocessing data, updating HGNC symbols, imputing missing values, running deconvolution algorithms, simulating data, and generating signature matrices. The R package proteoDeconv is available on GitHub: https://github.com/ComputationalProteomics/proteoDeconv.
Results
Given the potential of bulk proteomics data for cell deconvolution, we aimed to evaluate the feasibility of applying deconvolution methods originally developed for transcriptomics and to investigate how different data processing strategies affect the outcome. Numerous deconvolution algorithms have been developed for transcriptomic data analysis; among the most commonly utilized are CIBERSORT, CIBERSORTx, and EPIC, all of which are tested in our framework. Other algorithms, such as ESTIMATE, ConsensusTME, quanTIseq, MCP-counter, and TIMER, were not tested due to their limited cell type resolution or incompatibility with custom signature matrices. For proteomics-based deconvolution, two algorithms exist: BayesDeBulk (included in our framework) and scpDeconv. However, we could not test scpDeconv due to the absence of suitable single-cell proteomics immune cell data sets.
To investigate the feasibility of using these algorithms on proteomics data, we first evaluated them on a data set of pure immune cells developed by Rieckmann et al., with a signature matrix generated from the same data set. As illustrated in Figure 2, CIBERSORT and CIBERSORTx yield identical, well-performing results. BayesDeBulk also performs well, while EPIC performs poorly. To further test the ability of the algorithms to deconvolute mixtures, we performed tests using simulated mixing of different pure immune cell samples at random proportions in 100 combinations and correlated the fractions estimated by the algorithms to the expected values. Both correlation and RMSE are important metrics in this context: correlation measures how well the estimated fractions follow the trend of the expected values, reflecting the algorithm’s ability to capture relative differences between samples, while RMSE quantifies the average magnitude of the errors, providing insight into the absolute accuracy of the estimates. The results of the simulations are provided in Table S1. Using the simulated data with the different algorithms, CIBERSORT and CIBERSORTx each had a Pearson correlation of about 0.91 and an RMSE of approximately 0.04. BayesDeBulk had a correlation of 0.76 and an RMSE of 0.07. EPIC performed worse in this case, with a correlation of 0.68 and an RMSE of 0.10. Based on these results, we continued to analyze the influence of different parameters on deconvolution results of different data sets using the CIBERSORT algorithm, as it performed well on these data and is well-regarded for transcriptomics.
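The complementary roles of the two metrics can be seen in a small worked example. In the sketch below, every estimate is offset from its expected value by a constant 0.1: the trend is followed perfectly, so the Pearson correlation is 1 (up to floating-point rounding), while the RMSE of 0.1 exposes the systematic error. The values are illustrative and not taken from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(x, y):
    """Root mean squared error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Expected fractions of one cell type across four simulated samples.
expected = [0.1, 0.2, 0.3, 0.4]
# Estimates with a constant bias of 0.1.
estimated_fractions = [v + 0.1 for v in expected]

r = pearson(expected, estimated_fractions)
error = rmse(expected, estimated_fractions)
```

Reporting both values therefore guards against an algorithm that ranks samples correctly but is consistently biased in absolute terms.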
Figure 2. Comparison of BayesDeBulk, CIBERSORT, CIBERSORTx, and EPIC using the same DDA-based immune cell samples that served as reference for the signature matrix derived from the Rieckmann et al. data set. Each bar shows the estimated cell-type proportions for pure samples that contain only one expected cell type.
One reason why proteomics-based deconvolution algorithms have not been developed may be that more data are needed to evaluate them, as gold standard data sets of immune cell mixtures are missing. We therefore acquired new proteomics data for different immune cell types (B-cells, Naïve B-cells, T-cells, Cytotoxic T-cells, Monocytes and activated NK cells) derived from peripheral blood mononuclear cells (PBMCs). We also generated defined mixtures of cells using fluorescence-activated cell sorting that could be used to evaluate deconvolution algorithms. The data were acquired using data-independent acquisition (DIA) to reflect current state-of-the-art, to a depth of about 2000 proteins per cell type. These data were then used along with the Rieckmann data to evaluate the effects of different data processing parameters on deconvolution outcome, before testing the effects of different signature matrices.
Several challenges exist with inputting proteomics data into deconvolution tools originally developed for transcriptomics data. First, there is the problem of how to handle ambiguously identified proteins, that is, protein groups, as the deconvolution algorithms do not accept protein groups. One solution is to simply select the first protein listed in each protein group. Another solution is to let the search engine (for example DIA-NN) produce a protein matrix that does not contain protein groups but single protein or gene identifiers for each entry. The difference between these two approaches on our reference data using CIBERSORT can be seen in Figure S1, which shows that the choice of protein grouping approach had only a small effect on the deconvolution outcome. Furthermore, when performing simulations using either approach, the difference is small, and which method performs best varies between data sets (Table S1).
The second problem is that there may be duplicate occurrences of proteins in proteome data, especially after reducing the protein groups to single proteins. To handle this, the protein with the highest median intensity may be chosen over the other(s) (denoted the slice method). Another possible approach is to merge the intensities of all occurrences of a protein into one (using summation, denoted the merge method). The deconvolution performance resulting from these two approaches is compared in Figure S1. Also in this case, the differences between the two methods were small, and which approach performs best varies between data sets.
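Both handling steps can be sketched in a few lines. The sketch assumes that protein groups are written as semicolon-separated identifiers and that intensities contain no missing values; it illustrates the slice and merge ideas and is not the proteoDeconv implementation. Function names and toy values are hypothetical.

```python
from statistics import median


def first_protein(protein_group, sep=";"):
    """Reduce a protein group to its first listed identifier."""
    return protein_group.split(sep)[0].strip()


def resolve_duplicates(rows, method="slice"):
    """Collapse rows that share an identifier.

    rows: list of (identifier, [intensity per sample]) tuples.
    "slice" keeps the row with the highest median intensity;
    "merge" sums the intensities of all rows sample by sample.
    """
    grouped = {}
    for identifier, values in rows:
        grouped.setdefault(identifier, []).append(values)
    resolved = {}
    for identifier, candidates in grouped.items():
        if method == "slice":
            resolved[identifier] = max(candidates, key=median)
        elif method == "merge":
            resolved[identifier] = [sum(column) for column in zip(*candidates)]
        else:
            raise ValueError("method must be 'slice' or 'merge'")
    return resolved


# Hypothetical protein-group table with three samples per row.
raw_rows = [
    ("P1;P2", [10.0, 20.0, 30.0]),
    ("P1", [100.0, 5.0, 200.0]),
    ("P3", [1.0, 2.0, 3.0]),
]
# Reducing groups to the first protein creates a duplicate entry for P1.
rows = [(first_protein(group), values) for group, values in raw_rows]
sliced = resolve_duplicates(rows, method="slice")
merged = resolve_duplicates(rows, method="merge")
```

With the slice method, P1 keeps the row with the higher median (100 versus 20); with the merge method, the two P1 rows are summed per sample.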
Normalization and Imputation
It has previously been found that deconvolution algorithms perform better with data in linear space, that is, not log-transformed, and that normalizations generally disrupt the deconvolution performance. We tested the effect of applying different normalizations (Figure 3) and found that normalization can indeed have a detrimental effect on the deconvolution performance. With simulations of the Rieckmann data, the Pearson correlation for no normalization is 0.91 and the RMSE is 0.04. When normalizing with cyclic loess normalization, for example, despite subsequently back-transforming the data to a linear scale, the correlation ends up lower at 0.80 and the RMSE higher at 0.08.
Figure 3. Comparison of different normalization strategies in CIBERSORT-based deconvolution of DIA immune cell samples using the Rieckmann-derived signature matrix. The mix samples contained an equal number of CD8+ T cells and monocytes, and PBMC indicates crude PBMCs. For each comparison, the normalization method was applied to both the proteome data used for the signature matrix and to the samples being deconvoluted.
Another potential issue with proteomics data is missing values and how they should be handled. Missing values are common in proteomics data, and none of the deconvolution algorithms tested in this work tolerate any missing values. Therefore, the missing data need to be imputed. Several approaches to imputation exist, including Random Forest imputation (RF), minimum value imputation and standard distribution imputation. Multiple imputation methods were compared, with results indicating that a conservative imputation method with minimum value imputation works better than, for example, RF imputation (Figure 4A). Furthermore, we also evaluated the effect of imputation on the proteomics immune cell data from Rieckmann et al. (Figure 4B). For all cell types, the inferred cell type corresponds better to the actual cell type when using minimum-value imputed data, further highlighting that this conservative strategy was beneficial for the deconvolution of this data set. This finding is also reiterated by simulations of the Rieckmann data: the Pearson correlation and RMSE indicate better performance with lowest-value imputation (0.91 and 0.04, respectively) than with either kNN (0.82 and 0.06) or RF imputation (0.85 and 0.06).
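Minimum-value imputation is simple enough to sketch directly. The version below replaces every missing entry with the smallest observed value in the whole matrix; whether the minimum is taken globally or per sample is an assumption here and may differ from the settings used in the study. The function name and toy matrix are hypothetical.

```python
def impute_minimum(matrix):
    """Replace missing entries (None) with the smallest observed value.

    matrix: list of rows (proteins), each a list of intensities per sample.
    """
    observed = [v for row in matrix for v in row if v is not None]
    if not observed:
        raise ValueError("matrix contains no observed values")
    lowest = min(observed)
    return [[lowest if v is None else v for v in row] for row in matrix]


# Toy intensity matrix: 3 proteins x 3 samples, None marks a missing value.
intensities = [
    [1200.0, None, 1100.0],
    [None, 80.0, 95.0],
    [560.0, 610.0, None],
]
imputed = impute_minimum(intensities)
```

The strategy is conservative in the sense that a protein that was not detected is treated as being present at the lowest measurable level, rather than being assigned a value predicted from other proteins or samples.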
Figure 4. Imputation strategies for missing values in immune cell data using the Rieckmann-derived signature matrix. (A) Comparison using this study’s collected DIA data as data source. The mix samples contained an equal number of CD8+ T cells and monocytes, and PBMC indicates crude PBMCs. (B) Comparison of imputation methods as in A but with the Rieckmann et al. DDA data and the cell types in the signature matrix.
Different Reference Matrices
Transcriptomics-derived signature matrices have been applied for immune deconvolution of proteomics data, but recent findings suggest that proteomics-derived signature matrices may be more appropriate for proteomics data. As has been shown in several studies, the correlation over samples between proteome and transcriptome data is moderate, with median Pearson/Spearman correlations around 0.4–0.5 for high-quality data. We therefore hypothesized that a proteome-derived signature matrix would increase the deconvolution performance. Furthermore, with immune infiltration estimates typically being dependent on protein markers (for example in the case of immunohistochemistry or cell sorting), deconvolution based on proteins will be more directly comparable. As shown in Figure 5, the proteome-derived signature matrix outperforms the single-cell RNA-sequencing-derived signature matrix in deconvolution performance, despite both being constructed from the same five cell types. Also in simulations with the Rieckmann data, the proteome-derived signature performs better with a correlation of 0.96 and RMSE of 0.05, compared to the scRNA-seq signature which yields a correlation of 0.85 and RMSE of 0.12.
Figure 5. Comparison of DIA-based immune cell samples deconvoluted with a proteome-derived signature matrix (derived from the Rieckmann et al. data set) versus a single-cell RNA-sequencing-derived signature matrix. To make a fair comparison, both signatures were made using the same five cell types, with the same signature generation parameters. The mix samples contained an equal number of CD8+ T cells and monocytes.
Parameters for Generating Reference Matrices
With transcriptomics data, the choice of the signature matrix has a substantial effect on deconvolution results, as it defines the cell-type-specific expression profiles used to estimate cellular proportions. In fact, it has been reported that the selection of the signature matrix often influences results more than the choice of the deconvolution algorithm itself. For CIBERSORTx, it is recommended that the signature matrix samples are preprocessed with the same steps as the samples to be deconvoluted. Various parameters in the signature matrix creation can have a considerable impact on the deconvolution performance. The minimum and maximum number of proteins (or genes) per cell type is one such parameter, which is investigated in Figure 6. Notably, setting this threshold too high results in considerably worse deconvolution performance.
Figure 6. Influence of signature matrix creation parameters on deconvolution of DIA-based immune cell samples, using the Rieckmann reference proteome. Labels indicate the minimum and maximum number of proteins per cell type to consider. (A) Estimated cell-type proportions with different ranges of minimum and maximum number of proteins to consider per cell type. (B) The corresponding RMSE and Pearson correlation for simulated sample mixtures generated from the Rieckmann et al. data with the same signature matrices.
Validation of Simulation Methodology
As some of the benchmarks are based on the simulation of samples, we also performed a validation of the accuracy of this methodology. By comparing the deconvolution results from a mixed sample consisting of equal proportions of CD8+ T cells and monocytes (in terms of cell count), we found that our simulation approach results in comparable outcomes to experimentally mixing samples (Figure 7). A shared limitation of both approaches is the presence of spillover between CD8+ and CD4+ T cells, suggesting that it remains difficult to distinguish between these subtypes with proteomics data. This effect has also been found to be widespread with transcriptomics data. The discrepancy between the deconvoluted proportions and the actual proportions can probably be explained by the protein content bias. Monocytes are known to be larger than T cells and therefore contain more protein. For transcriptomics, the mRNA bias of different cell types has been taken into account with some algorithms such as EPIC. This represents one area of improvement for proteomics deconvolution algorithms, as no algorithm takes the protein content bias into account. This issue can potentially be partially mitigated by measuring and normalizing the injected sample amounts when performing the mass spectrometry analysis.
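The protein content bias can be made concrete with a short worked example. The sketch below assumes, purely for illustration, that a monocyte contains twice as much protein as a CD8+ T cell; this ratio is hypothetical and not a measured value. Under that assumption, a 50–50 mixture by cell count contributes one third and two thirds of the total protein, respectively, and dividing by the per-cell protein content before renormalizing recovers the cell-count fractions. This mirrors the kind of per-cell-type scaling that EPIC applies for mRNA content, but it is not an existing feature of any proteomics deconvolution tool.

```python
def protein_fractions(cell_fractions, protein_per_cell):
    """Convert cell-count fractions into fractions of total protein."""
    weighted = {ct: cell_fractions[ct] * protein_per_cell[ct]
                for ct in cell_fractions}
    total = sum(weighted.values())
    return {ct: w / total for ct, w in weighted.items()}


def correct_for_protein_content(estimated, protein_per_cell):
    """Convert protein-based fractions back into cell-count fractions."""
    scaled = {ct: estimated[ct] / protein_per_cell[ct] for ct in estimated}
    total = sum(scaled.values())
    return {ct: s / total for ct, s in scaled.items()}


# Hypothetical relative protein content per cell (not measured values).
protein_per_cell = {"CD8+ T cells": 1.0, "Monocytes": 2.0}
# Mixture with equal cell counts.
cell_mix = {"CD8+ T cells": 0.5, "Monocytes": 0.5}

observed = protein_fractions(cell_mix, protein_per_cell)
corrected = correct_for_protein_content(observed, protein_per_cell)
```

A deconvolution estimate that reflects protein mass rather than cell count would thus be expected to overestimate the larger cell type unless such a correction is applied.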
Figure 7. Validation of the in silico simulation approach against real 50–50 mixtures of CD8+ T cells and monocytes in DIA-based proteomics data. Four replicates of the real mix are compared to 100 simulated mixtures, both deconvoluted using CIBERSORT with the Rieckmann-derived signature matrix. Error bars represent standard deviations across replicates or simulations.
proteoDeconv R Package
We developed the R package proteoDeconv to streamline the use of proteomics data for immune cell deconvolution. To the best of our knowledge, proteoDeconv is the only R package specifically designed for cell-type deconvolution using proteomics data. While there are related tools, such as the benchmarking platform Decomprolute, its focus is primarily on performance evaluation, offering metrics like correlation with transcriptomic data from CPTAC. However, Decomprolute does not assess the impact of different preprocessing strategies, which can significantly influence deconvolution outcomes in proteomics workflows. Other deconvolution frameworks include immunedeconv and omnideconv. However, these packages are not directly compatible with proteomics data, as such data require specific preprocessing steps. The ability to manage these preprocessing requirements is a distinctive feature of proteoDeconv.
Discussion
As the end products of the central dogma, proteins offer a direct representation of cellular function, potentially providing more accurate estimates of cell type composition than transcriptomics-based approaches. In this study, we explored the feasibility of proteomics-based immune cell deconvolution by generating mass spectrometry data from purified immune cell populations, alongside simulated mixtures designed to model complex samples. By comparing estimated cell-type proportions against known compositions, evaluated through metrics like RMSE and correlation coefficients, we aimed to uncover factors that influence deconvolution performance in a proteomics context.
When comparing our results to transcriptomics-based benchmarks, it is notable that the Pearson correlations observed for our simulated proteomics data fall within a similar range (∼0.7–0.9) as those reported for well-matched reference matrices in transcriptomics studies. This suggests that proteomics-based deconvolution can achieve comparable performance, supporting its feasibility as an alternative or complementary approach to transcriptomics for studying immune cell composition.
One of the observations from our analyses is the strong influence of data quality and proteomic depth on deconvolution performance. We found that when the signature matrix and the samples being deconvoluted originate from the same data set (a controlled, albeit unrealistic, scenario), the average Pearson correlation reached as high as 0.94. This reflects an ideal case where technical variability is minimized, but it also highlights the importance of using high-quality data with comprehensive proteomic coverage.
Another important aspect is the handling of missing values, which is a common challenge in proteomics. While imputation is often necessary, particularly in data-dependent acquisition (DDA) workflows, its impact on deconvolution performance can be complex. Our findings suggest that conservative imputation strategies, such as minimum-value imputation, tend to improve deconvolution performance more effectively than methods like Random Forest imputation or k-nearest neighbors (kNN) imputation. This pattern was observed across both our own data and the Rieckmann data set, indicating that conservative approaches may be generally preferable in the context of deconvolution.
The choice of signature matrix also plays an important role in deconvolution outcomes. When comparing proteome-derived and transcriptome-derived signature matrices, we observed that the proteome-derived matrices generally provided better performance when applied to proteomics data. This aligns with findings from previous studies and is consistent with the moderate correlation (∼0.4) typically observed between mRNA and protein abundance.
Regarding the choice of using a reference-based or marker-based deconvolution algorithm, it appears that the reference-based methods can achieve superior performance. As was found by Avila Cobos et al., reference-based methods generally perform better than marker-based methods with transcriptomics data. However, Feng et al. found that the reference-based methods benchmarked in their study resulted in worse deconvolution performance. Petralia et al. also argue that marker-gene-based methods are more suitable for proteomics data since reliable reference data are difficult to find. Theoretically, the choice of deconvolution method, whether it is marker-based or reference-based, has an impact on the deconvolution performance. For proteome data, the reduced depth in terms of number of measured proteins/genes compared to transcriptomics is a factor that may make marker-based methods less appropriate if the specific markers are not detected. A potential problem with reference-based methods is that changes in protein expression can alter the conditions for deconvolution when less specific proteins are used in the signature.
While the motivation for this study is to eventually be able to reliably deconvolute immune cell composition of proteome samples, the samples that have been investigated are derived from PBMCs. The proteome of immune cells in PBMCs and tissue samples can be expected to vary, and including diseased samples in the signature matrix for transcriptome deconvolution has been shown to increase deconvolution accuracy and reduce biological bias. Still, many transcriptomics signatures are based on PBMCs and are very effective also for tumor sample deconvolution.
This work represents an early step toward establishing robust methods for cell-type deconvolution with proteomics data. It is clear that gold-standard data sets, where the immune cell proportions are known, would enable further validation and method development. Furthermore, single-cell proteomics holds potential for refining signature matrix construction, potentially offering increased granularity and enhancing the accuracy of proteomics-based cell-type deconvolution.
Conclusions
In conclusion, our study shows that proteomics is a promising data source for estimating cell-type composition. Based on our analysis of mass spectrometry proteome data and simulated immune cell mixtures, we recommend using high-quality, high-depth proteomic data for both the samples and the signature matrix construction. Additionally, employing conservative imputation methods, specifically minimum-value imputation, is important for improving deconvolution accuracy. A proteome-based reference matrix outperforms a transcriptome-based one, and reference-based algorithms appear to be the most suitable for deconvolution. These insights highlight the feasibility of using proteomics data to determine cellular composition.
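As an illustration of what minimum-value imputation can look like in practice, the following is a minimal sketch rather than the procedure used in this study or in proteoDeconv. It assumes a protein-by-sample matrix stored as nested lists, treats `None` and NaN as missing, and uses the global minimum of the observed values as the replacement; the function name `impute_minimum` is hypothetical, and a per-sample minimum would be an equally plausible variant.

```python
import math


def impute_minimum(matrix):
    """Return a copy of the matrix with missing entries set to the global observed minimum."""

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    observed = [v for row in matrix for v in row if not is_missing(v)]
    if not observed:
        raise ValueError("no observed values to impute from")
    floor = min(observed)
    return [[floor if is_missing(v) else v for v in row] for row in matrix]


# Rows are proteins, columns are samples (hypothetical abundances).
abundances = [
    [12.0, None, 7.5],
    [float("nan"), 3.0, 9.0],
]
imputed = impute_minimum(abundances)
```

The input matrix is left unchanged, so the imputed and original values can be compared afterwards.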
Acknowledgments
This work was supported by the Crafoord Foundation (20220707), the Knut and Alice Wallenberg Foundation (WASPDDLS22-020) and the EU Horizon 2020 Framework Programme for Research and Innovation (EU-H2020-MSCA-COFUND-754299-CanFaster). Aastha Sobti and David Gomez Jimenez are thanked for their input about cell sorting.
The mass spectrometry proteomics data have been deposited to the ProteomeXchange Consortium via the PRIDE partner repository with the data set identifier PXD056050.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.4c00868.
The authors declare no competing financial interest.
References
- Fridman W. H., Zitvogel L., Sautès-Fridman C., Kroemer G. The Immune Contexture in Cancer Prognosis and Treatment. Nat. Rev. Clin. Oncol. 2017;14:717–734. doi: 10.1038/nrclinonc.2017.101.
- Jimenez D. G., Sobti A., Askmyr D., Sakellariou C., Santos S. C., Swoboda S., Forslund O., Greiff L., Lindstedt M. Tonsillar Cancer with High CD8+ T-Cell Infiltration Features Increased Levels of Dendritic Cells and Transcriptional Regulation Associated with an Inflamed Tumor Microenvironment. Cancers. 2021;13:5341. doi: 10.3390/cancers13215341.
- Luca B. A., Steen C. B., Matusiak M., Azizi A., Varma S., Zhu C., Przybyl J., Espín-Pérez A., Diehn M., Alizadeh A. A., van de Rijn M., Gentles A. J., Newman A. M. Atlas of Clinically Distinct Cell States and Ecosystems across Human Solid Tumors. Cell. 2021;184:5482–5496.e28. doi: 10.1016/j.cell.2021.09.014.
- Lapuente-Santana Ó., Van Genderen M., Hilbers P. A., Finotello F., Eduati F. Interpretable Systems Biomarkers Predict Response to Immune-Checkpoint Inhibitors. Patterns. 2021;2:100293. doi: 10.1016/j.patter.2021.100293.
- Mason M., Lapuente-Santana Ó., Halkola A. S., Wang W., Mall R., Xiao X., Kaufman J., Fu J., Pfeil J., Banerjee J., et al. A Community Challenge to Predict Clinical Outcomes after Immune Checkpoint Blockade in Non-Small Cell Lung Cancer. J. Transl Med. 2024;22:190. doi: 10.1186/s12967-023-04705-3.
- Finotello F., Mayer C., Plattner C., Laschober G., Rieder D., Hackl H., Krogsdam A., Loncova Z., Posch W., Wilflingseder D., et al. Molecular and Pharmacological Modulators of the Tumor Immune Contexture Revealed by Deconvolution of RNA-seq Data. Genome Med. 2019;11:34. doi: 10.1186/s13073-019-0638-6.
- Zaitsev A., Chelushkin M., Dyikanov D., Cheremushkin I., Shpak B., Nomie K., Zyrin V., Nuzhdina E., Lozinsky Y., Zotova A., Degryse S., Kotlov N., Baisangurov A., Shatsky V., Afenteva D., Kuznetsov A., Paul S. R., Davies D. L., Reeves P. M., Lanuti M., Goldberg M. F., Tazearslan C., Chasse M., Wang I., Abdou M., Aslanian S. M., Andrewes S., Hsieh J. J., Ramachandran A., Lyu Y., Galkin I., Svekolkin V., Cerchietti L., Poznansky M. C., Ataullakhanov R., Fowler N., Bagaev A. Precise Reconstruction of the TME Using Bulk RNA-seq and a Machine Learning Algorithm Trained on Artificial Transcriptomes. Cancer Cell. 2022;40:879–894.e16. doi: 10.1016/j.ccell.2022.07.006.
- Merotto L., Zopoglou M., Zackl C., Finotello F. Chapter Two - Next-Generation Deconvolution of Transcriptomic Data to Investigate the Tumor Microenvironment. In International Review of Cell and Molecular Biology; Garg A. D., Galluzzi L., Eds.; Immune Checkpoint Biology in Health and Disease; Academic Press, 2024; Vol. 382; pp 103–143.
- Newman A. M., Steen C. B., Liu C. L., Gentles A. J., Chaudhuri A. A., Scherer F., Khodadoust M. S., Esfahani M. S., Luca B. A., Steiner D., Diehn M., Alizadeh A. A. Determining Cell Type Abundance and Expression from Bulk Tissues with Digital Cytometry. Nat. Biotechnol. 2019;37:773–782. doi: 10.1038/s41587-019-0114-2.
- Racle J., de Jonge K., Baumgaertner P., Speiser D. E., Gfeller D. Simultaneous Enumeration of Cancer and Immune Cell Types from Bulk Tumor Gene Expression Data. eLife. 2017;6:e26476. doi: 10.7554/elife.26476.
- Sturm G., Finotello F., Petitprez F., Zhang J. D., Baumbach J., Fridman W. H., List M., Aneichyk T. Comprehensive Evaluation of Transcriptome-Based Cell-Type Quantification Methods for Immuno-Oncology. Bioinformatics. 2019;35:i436–i445. doi: 10.1093/bioinformatics/btz363.
- Avila Cobos F., Alquicira-Hernandez J., Powell J. E., Mestdagh P., De Preter K. Benchmarking of Cell Type Deconvolution Pipelines for Transcriptomics Data. Nat. Commun. 2020;11:5650. doi: 10.1038/s41467-020-19015-1.
- Chakravarthy A., Furness A., Joshi K., Ghorani E., Ford K., Ward M. J., King E. V., Lechner M., Marafioti T., Quezada S. A., Thomas G. J., Feber A., Fenton T. R. Pan-Cancer Deconvolution of Tumour Composition Using DNA Methylation. Nat. Commun. 2018;9:3220. doi: 10.1038/s41467-018-05570-1.
- Teschendorff A. E., Zhu T., Breeze C. E., Beck S. EPISCORE: Cell Type Deconvolution of Bulk Tissue DNA Methylomes from Single-Cell RNA-Seq Data. Genome Biol. 2020;21:221. doi: 10.1186/s13059-020-02126-9.
- Li H., Sharma A., Luo K., Qin Z. S., Sun X., Liu H. DeconPeaker, a Deconvolution Model to Identify Cell Types Based on Chromatin Accessibility in ATAC-Seq Data of Mixture Samples. Front. Genet. 2020;11:392. doi: 10.3389/fgene.2020.00392.
- Gabriel A. A., Racle J., Falquet M., Jandus C., Gfeller D. Robust Estimation of Cancer and Immune Cell-Type Proportions from Bulk Tumor ATAC-Seq Data. eLife. 2024;13:RP94833. doi: 10.7554/elife.94833.
- Busso-Lopes A. F., Neves L. X., Câmara G. A., Granato D. C., Pretti M. A. M., Heberle H., Patroni F. M. S., Sá J., Yokoo S., Rivera C., et al. Connecting Multiple Microenvironment Proteomes Uncovers the Biology in Head and Neck Cancer. Nat. Commun. 2022;13:6725. doi: 10.1038/s41467-022-34407-1.
- Petralia F., Krek A., Calinawan A. P., Charytonowicz D., Sebra R., Feng S., Gosline S., Pugliese P., Paulovich A. G., Kennedy J. J., Ceccarelli M., Wang P. BayesDeBulk: A Flexible Bayesian Algorithm for the Deconvolution of Bulk Tumor Data. bioRxiv. 2023:2021.06.25.449763. doi: 10.1101/2021.06.25.449763.
- Handin N., Yuan D., Ölander M., Wegler C., Karlsson C., Jansson-Löfmark R., Hjelmesæth J., Åsberg A., Lauschke V. M., Artursson P. Proteome Deconvolution of Liver Biopsies Reveals Hepatic Cell Composition as an Important Marker of Fibrosis. Comput. Struct. Biotechnol. J. 2023;21:4361–4369. doi: 10.1016/j.csbj.2023.08.037.
- Feng S., Calinawan A., Pugliese P., Wang P., Ceccarelli M., Petralia F., Gosline S. J. Decomprolute: A Benchmarking Platform Designed for Multiomics-Based Tumor Deconvolution. Cell Reports Methods. 2023;4:100708. doi: 10.1016/j.crmeth.2024.100708.
- Teng P., Schaaf J. P., Abulez T., Hood B. L., Wilson K. N., Litzi T. J., Mitchell D., Conrads K. A., Hunt A. L., Olowu V., Oliver J., Park F. S., Edwards M., Chiang A., Wilkerson M. D., Raj-Kumar P.-K., Tarney C. M., Darcy K. M., Phippen N. T., Maxwell G. L., Conrads T. P., Bateman N. W. ProteoMixture: A Cell Type Deconvolution Tool for Bulk Tissue Proteomic Data. iScience. 2024;27:109198. doi: 10.1016/j.isci.2024.109198.
- Wang F., Yang F., Huang L., Li W., Song J., Gasser R. B., Aebersold R., Wang G., Yao J. Deep Domain Adversarial Neural Network for the Deconvolution of Cell Type Mixtures in Tissue Proteome Profiling. Nature Machine Intelligence. 2023;5:1236–1249. doi: 10.1038/s42256-023-00737-y.
- Newman A. M., Liu C. L., Green M. R., Gentles A. J., Feng W., Xu Y., Hoang C. D., Diehn M., Alizadeh A. A. Robust Enumeration of Cell Subsets from Tissue Expression Profiles. Nat. Methods. 2015;12:453–457. doi: 10.1038/nmeth.3337.
- Racle J., Gfeller D. EPIC: A Tool to Estimate the Proportions of Different Cell Types from Bulk Gene Expression Data. In Bioinformatics for Cancer Immunotherapy: Methods and Protocols; Boegel S., Ed.; Methods in Molecular Biology; Springer US: New York, NY, 2020; pp 233–248.
- Pino L. K., Just S. C., MacCoss M. J., Searle B. C. Acquiring and Analyzing Data Independent Acquisition Proteomics Experiments without Spectrum Libraries. Mol. Cell. Proteomics. 2020;19:1088–1103. doi: 10.1074/mcp.P119.001913.
- Chambers M. C., Maclean B., Burke R., Amodei D., Ruderman D. L., Neumann S., Gatto L., Fischer B., Pratt B., Egertson J., Hoff K., Kessner D., Tasman N., Shulman N., Frewen B., Baker T. A., Brusniak M.-Y., Paulse C., Creasy D., Flashner L., Kani K., Moulding C., Seymour S. L., Nuwaysir L. M., Lefebvre B., Kuhlmann F., Roark J., Rainer P., Detlev S., Hemenway T., Huhmer A., Langridge J., Connolly B., Chadick T., Holly K., Eckels J., Deutsch E. W., Moritz R. L., Katz J. E., Agus D. B., MacCoss M., Tabb D. L., Mallick P. A Cross-Platform Toolkit for Mass Spectrometry and Proteomics. Nat. Biotechnol. 2012;30:918–920. doi: 10.1038/nbt.2377.
- Demichev V., Messner C. B., Vernardis S. I., Lilley K. S., Ralser M. DIA-NN: neural networks and interference correction enable deep proteome coverage in high throughput. Nat. Methods. 2020;17:41–44. doi: 10.1038/s41592-019-0638-x.
- Frankenfield A. M., Ni J., Ahmed M., Hao L. Protein Contaminants Matter: Building Universal Protein Contaminant Libraries for DDA and DIA Proteomics. J. Proteome Res. 2022;21:2104–2113. doi: 10.1021/acs.jproteome.2c00145.
- Rieckmann J. C., Geiger R., Hornburg D., Wolf T., Kveler K., Jarrossay D., Sallusto F., Shen-Orr S. S., Lanzavecchia A., Mann M., Meissner F. Social Network Architecture of Human Immune Cells Unveiled by Quantitative Proteomics. Nat. Immunol. 2017;18:583–593. doi: 10.1038/ni.3693.
- Perez-Riverol Y., Bai J., Bandla C., García-Seisdedos D., Hewapathirana S., Kamatchinathan S., Kundu D. J., Prakash A., Frericks-Zipper A., Eisenacher M., Walzer M., Wang S., Brazma A., Vizcaíno J. A. The PRIDE Database Resources in 2022: A Hub for Mass Spectrometry-Based Proteomics Evidences. Nucleic Acids Res. 2022;50:D543–D552. doi: 10.1093/nar/gkab1038.
- Dietrich A., Sturm G., Merotto L., Marini F., Finotello F., List M. SimBu: bias-aware simulation of bulk RNA-seq data with variable cell-type composition. Bioinformatics. 2022;38:ii141–ii147. doi: 10.1093/bioinformatics/btac499.
- R Core Team. R: A Language and Environment for Statistical Computing. 2024.
- Rainer J., Vicini A., Salzer L., Stanstrup J., Badia J. M., Neumann S., Stravs M. A., Verri Hernandes V., Gatto L., Gibb S., Witting M. A Modular and Expandable Ecosystem for Metabolomics Data Annotation in R. Metabolites. 2022;12:173. doi: 10.3390/metabo12020173.
- Oh S., Abdelnabi J., Al-Dulaimi R., Aggarwal A., Ramos M., Davis S., Riester M., Waldron L. HGNChelper: Identification and Correction of Invalid Gene Symbols for Human and Mouse. F1000 Res. 2022;9:1493. doi: 10.12688/f1000research.28033.2.
- Ritchie M. E., Phipson B., Wu D., Hu Y., Law C. W., Shi W., Smyth G. K. Limma Powers Differential Expression Analyses for RNA-sequencing and Microarray Studies. Nucleic Acids Res. 2015;43:e47. doi: 10.1093/nar/gkv007.
- Huber W., von Heydebreck A., Sültmann H., Poustka A., Vingron M. Variance Stabilization Applied to Microarray Data Calibration and to the Quantification of Differential Expression. Bioinformatics. 2002;18(suppl_1):S96–S104. doi: 10.1093/bioinformatics/18.suppl_1.s96.
- Landau W. M. The Targets R Package: A Dynamic Make-like Function-Oriented Pipeline Toolkit for Reproducibility and High-Performance Computing. J. Open Source Softw. 2021;6:2959. doi: 10.21105/joss.02959.
- Yoshihara K., Shahmoradgoli M., Martínez E., Vegesna R., Kim H., Torres-Garcia W., Treviño V., Shen H., Laird P. W., Levine D. A., Carter S. L., Getz G., Stemke-Hale K., Mills G. B., Verhaak R. G. W. Inferring Tumour Purity and Stromal and Immune Cell Admixture from Expression Data. Nat. Commun. 2013;4:2612. doi: 10.1038/ncomms3612.
- Jiménez-Sánchez A., Cast O., Miller M. L. Comprehensive Benchmarking and Integration of Tumor Microenvironment Cell Estimation Methods. Cancer Res. 2019;79:6238–6246. doi: 10.1158/0008-5472.CAN-18-3560.
- Becht E., Giraldo N. A., Lacroix L., Buttard B., Elarouci N., Petitprez F., Selves J., Laurent-Puig P., Sautès-Fridman C., Fridman W. H., de Reyniès A. Estimating the Population Abundance of Tissue-Infiltrating Immune and Stromal Cell Populations Using Gene Expression. Genome Biol. 2016;17:218. doi: 10.1186/s13059-016-1070-5.
- Li T., Fan J., Wang B., Traugh N., Chen Q., Liu J. S., Li B., Liu X. S. TIMER: A Web Server for Comprehensive Analysis of Tumor-Infiltrating Immune Cells. Cancer Res. 2017;77:e108–e110. doi: 10.1158/0008-5472.can-17-0307.
- Zhong Y., Liu Z. Gene Expression Deconvolution in Linear Space. Nat. Methods. 2012;9:8–9. doi: 10.1038/nmeth.1830.
- Johansson H. J., Socciarelli F., Vacanti N. M., Haugen M. H., Zhu Y., Siavelis I., Fernandez-Woodbridge A., Aure M. R., Sennblad B., Vesterlund M., et al. Breast Cancer Quantitative Proteome and Proteogenomic Landscape. Nat. Commun. 2019;10:1600. doi: 10.1038/s41467-019-09018-y.
- De Marchi T., Pyl P. T., Sjöström M., Klasson S., Sartor H., Tran L., Pekar G., Malmström J., Malmström L., Niméus E. Proteogenomic Workflow Reveals Molecular Phenotypes Related to Breast Cancer Mammographic Appearance. J. Proteome Res. 2021;20:2983–3001. doi: 10.1021/acs.jproteome.1c00243.
- Mosquim Junior S., Siino V., Rydén L., Vallon-Christersson J., Levander F. Choice of High-Throughput Proteomics Method Affects Data Integration with Transcriptomics and the Potential Use in Biomarker Discovery. Cancers. 2022;14:5761. doi: 10.3390/cancers14235761.
- Vallania F., Tam A., Lofgren S., Schaffert S., Azad T. D., Bongen E., Haynes W., Alsup M., Alonso M., Davis M., Engleman E., Khatri P. Leveraging Heterogeneity across Multiple Datasets Increases Cell-Mixture Deconvolution Accuracy and Reduces Biological and Technical Biases. Nat. Commun. 2018;9:4735. doi: 10.1038/s41467-018-07242-6.
- Dietrich A., Merotto L., Pelz K., Eder B., Zackl C., Reinisch K., Edenhofer F., Marini F., Sturm G., List M., Finotello F. Benchmarking Second-Generation Methods for Cell-Type Deconvolution of Transcriptomic Data. bioRxiv. 2024:2024.06.10.598226. doi: 10.1101/2024.06.10.598226.