2024 May 28;23(6):2169–2185. doi: 10.1021/acs.jproteome.4c00131

Statistical Testing for Protein Equivalence Identifies Core Functional Modules Conserved across 360 Cancer Cell Lines and Presents a General Approach to Investigating Biological Systems

Enes K Ergin †,, Junia JK Myung †,, Philipp F Lange †,‡,*
PMCID: PMC11166143  PMID: 38804581

Abstract


Quantitative proteomics has enhanced our ability to study protein dynamics and their involvement in disease, using techniques such as statistical testing to discern significant differences between conditions. While most analyses focus on what differs between conditions, exploring similarities can provide equally valuable insights. However, assessing similarity directly at the analyte level, such as proteins, genes, or metabolites, is not yet standard practice. In this study, we propose QuEStVar (Quantitative Exploration of Stability and Variability through statistical hypothesis testing), a statistical framework that combines differential and equivalence testing to expand the statistical classification of analytes when comparing conditions, enabling the exploration of both quantitative stability and variability of features. We applied our method to an extensive data set of cancer cell lines and revealed a quantitatively stable core proteome across diverse tissues and cancer subtypes. Functional analysis of this protein set highlighted the molecular mechanisms by which cancer cells maintain constant conditions of the tumorigenic environment through biological processes including transcription, translation, and nucleocytoplasmic transport.

Keywords: bioinformatics, statistics, equivalence testing, proteomics, cancer cell lines

1. Introduction

Proteomics has significantly transformed biological analysis by providing quantitative insights into cellular processes.1 Quantitative proteomic techniques, in particular, have been instrumental in advancing our understanding of protein dynamics and their involvement in diseases.2 These techniques enable precise protein abundance measurements to unravel complex biological phenomena.3 Differential testing is commonly used to compare conditions and identify significant differences, providing valuable insights into the underlying biology of diseases such as cancer.4,5

While testing for differences is the traditional approach, exploring similarities can also provide valuable new perspectives on protein conservation, expression stability, and target reliability. Similarity assessment is most commonly performed at the sample level, for example, by determining technical reproducibility between experimental repeats using correlation.6 In proteomics, correlation is commonly used to construct a correlation matrix of all samples to identify patterns within the data set, compare the relative similarity between groups of samples, and compare samples whose quantitative values are not directly comparable.7,8 Although correlation-based methods can provide general insights into sample similarities, they are susceptible to outliers and systematic shifts and are limited in providing statistically rigorous identification of individual stable analytes (i.e., proteins, genes, or transcripts).9,10

Equivalence testing presents an alternative approach for exploring stability at the analyte level. Widely used in various disciplines, from psychology11 to neuroscience,12 equivalence testing is a statistical method employed to compare two or more groups or conditions and determine if they are statistically equivalent or similar.13 This methodology, initially known as bioequivalence testing in pharmacokinetics, has proven valuable in assessing equivalence between groups.14 To our knowledge, equivalence testing had not been applied in quantitative proteomics or gene expression analysis until we recently introduced it to compare patient biopsies with patient-derived xenograft specimens15 and identify stable drug targets across cancer progression.16

To address the need for a more rigorous and specific framework for exploring similarities alongside differences at the analyte level, we propose QuEStVar (Quantitative Exploration of Stability and Variability through statistical hypothesis testing). QuEStVar leverages statistical testing to identify quantitatively and statistically stable and variable proteins in biological comparisons. This framework offers a familiar hypothesis-based approach to deepen our understanding of the roles of stable proteins in complex biological systems. Using QuEStVar, we identify core functional modules conserved across 360 cancer cell lines from 25 tissues, establishing the first stable core proteome of cancer cell lines.

2. Materials and Methods

2.1. Combined Testing Framework with QuEStVar

To facilitate and promote the use of equivalence testing, we developed the QuEStVar framework (Figure 1). QuEStVar is accessible as a set of Python scripts, and accompanying notebooks detail all the capabilities and analyses presented in this paper. These resources are available in the Zenodo repository: https://zenodo.org/records/10694635. The core of QuEStVar consists of four distinct steps: creating comparisons to test; applying missing value and optional coefficient of variation (CV)-based filtering; performing a t test and a two one-sided t test (TOST) to assess statistical difference and equivalence, respectively; and summarizing the data for further insights.

Figure 1.

Simplified operational flow of “QuEStVar”. (A) Three methods for creating input to specify comparisons for QuEStVar analysis are highlighted. (B) Before analysis, analytes (proteins) within each sample are filtered to address missing values and, optionally, to remove features with high intrasample CV. (C) The analysis pipeline for each sample pair begins by selecting proteins present in both samples, followed by executing t tests and TOSTs. In this diagram, the TOST uses an absolute log2 boundary of 0.5, while the t test uses a 0.75 boundary. Based on the conducted tests, the resulting statistical outcomes assign specific statuses to each analyte. Finally, (D) the results of the tests performed on cell line pairs are summarized into pair-specific and protein-specific reports, facilitating subsequent visualization and analysis. Notably, identified sets of proteins of interest can undergo functional enrichment analysis for further biological insight.

2.1.1. Creating Sample Pairs for Testing

The QuEStVar framework offers three distinct approaches to define sample pairs for testing (Figure 1A). The first method uses a structured metadata document containing sample details and labels to facilitate grouping. The second method allows users to input a list of samples formatted as tuples. This list should include the exact sample names required for iteration. The third method enables the generation of a comprehensive list that includes all possible combinations derived from the available samples. These approaches provide flexibility in selecting sample pairs for analysis and contribute to the versatility of the QuEStVar framework.
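The second and third pairing approaches can be sketched with the Python standard library (sample names here are illustrative, and the variable names are ours, not QuEStVar's):

```python
from itertools import combinations

samples = ["A", "B", "C", "D"]  # illustrative sample names

# Second method: the user supplies an explicit list of sample-name tuples.
explicit_pairs = [("A", "B"), ("C", "D")]

# Third method: generate all possible pairwise combinations of the samples.
all_pairs = list(combinations(samples, 2))  # four samples give six unique pairs
```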

2.1.2. Missing Value and CV-Based Filtering

QuEStVar applies missing value and CV-based filtration to each sample prior to testing (Figure 1B). QuEStVar offers two methods for filtering missing values. The first method uses a user-defined threshold, allowing a certain percentage of missing values among the technical replicates of a sample while keeping the protein valid for testing, as long as at least two quantified replicates remain. This option can lead to unequal sample sizes, with some proteins having more replicates than others. The second method, used for the analysis here, excludes a protein if any replicate within a sample is missing. This ensures an equal sample size for all proteins throughout the comparisons.

After removing missing values, a filter based on the CV is applied. This user-defined threshold eliminates proteins with high intrasample variance, which are deemed unreliably quantified. Typically, a CV threshold of 50 to 75% (calculated on a nonlog scale) is set in this step. To disable CV-based filtering, users can set a very high threshold so that no proteins are removed. CV-based filtering aims to reduce the number of statistically insignificant proteins: proteins with high intrasample variation yield less confident p-values and are therefore unlikely to be statistically significant.

Each sample undergoes this filtering process. When two samples are compared using the t test and TOST, only the proteins present in both samples after filtering are tested. The exclusion matrix visualization offers detailed insight into why each protein was excluded from a given sample pair.
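A minimal sketch of the per-sample filter, assuming log2-scale intensities with missing values encoded as NaN and the strict missing-value option described above (the 75% CV threshold follows the analyses in this paper; this is not QuEStVar's actual code):

```python
import numpy as np

def filter_proteins(log2_reps, max_cv=0.75):
    """Return a boolean mask of proteins that pass both filters.

    log2_reps: 2-D array, rows = proteins, columns = technical replicates (log2).
    A protein is kept if no replicate is missing (strict option) and its
    intrasample CV, computed on the nonlog scale, is at most max_cv.
    """
    complete = ~np.isnan(log2_reps).any(axis=1)      # strict missing-value filter
    linear = np.power(2.0, log2_reps)                # CV is computed on nonlog scale
    cv = np.nanstd(linear, axis=1, ddof=1) / np.nanmean(linear, axis=1)
    return complete & (cv <= max_cv)
```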

2.1.3. Applying Two Tests for Difference and Equivalence

Every sample pair undergoes t test and TOST procedures, where proteins still present in both samples are subjected to these tests in a vectorized fashion using SciPy’s masked array implementations of the “ttest_ind” and “ttest_rel” functions.17 After each test, the raw p-values are corrected using multiple test correction methods. QuEStVar provides Bonferroni,18 Holm,19 and the false discovery rate (FDR)-controlling methods of Benjamini & Hochberg (BH or fdr)20 and Benjamini & Yekutieli (BY),21 alongside the omics-centered q-value22 correction (Figure 1C).
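The two-test core can be sketched with SciPy's `ttest_ind` (its `alternative` keyword supplies each one-sided test) plus a minimal Benjamini-Hochberg adjustment; this is a simplified stand-in for QuEStVar's masked-array implementation, not its actual code:

```python
import numpy as np
from scipy import stats

def tost_ind(a, b, bound=0.5):
    """Equivalence p-value from two one-sided t tests on row-wise mean differences.

    a, b: proteins x replicates arrays (log2 scale). H0 is |mean(a) - mean(b)| >= bound;
    the TOST p-value is the larger of the two one-sided p-values.
    """
    p_greater = stats.ttest_ind(a, b - bound, axis=1, alternative="greater").pvalue
    p_less = stats.ttest_ind(a, b + bound, axis=1, alternative="less").pvalue
    return np.maximum(p_greater, p_less)

def bh_adjust(p):
    """Benjamini-Hochberg adjusted p-values (minimal reimplementation)."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(p)
    scaled = p[order] * len(p) / (np.arange(len(p)) + 1)
    adjusted = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty_like(p)
    out[order] = np.minimum(adjusted, 1.0)
    return out

# Toy data: protein 0 is unchanged between samples; protein 1 differs by 2 log2 units.
reps = np.array([0.0, 0.1, -0.1, 0.05, -0.05])
a = np.vstack([reps, reps + 2.0])
b = np.vstack([reps, reps])
p_eq = tost_ind(a, b)                        # small for protein 0, near 1 for protein 1
p_df = stats.ttest_ind(a, b, axis=1).pvalue  # near 1 for protein 0, small for protein 1
```

Because the TOST p-value is the larger of the two one-sided p-values, equivalence is declared only when both one-sided nulls are rejected.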

Upon conducting both tests for each sample pair and obtaining raw and corrected p-values, proteins are categorized into four protein statuses:

Equivalent: peq < a and |LFC| ≤ Beq
Different: pdf < a and |LFC| ≥ Bdf
Unexplained: tested, but neither criterion is met
Excluded: removed by the missing value or CV filters before testing

Log2 fold-change (LFC) is calculated for the protein sample pair. Beq and Bdf are user-defined equivalence and difference boundaries, respectively. peq and pdf are the corrected p-values from the TOST and t test, respectively, for the protein, and a is the user-defined p-value threshold for significance.

The equivalence boundary defines the lower and upper bounds used in the two one-sided t tests. While the TOST uses this boundary directly in testing, the difference boundary is not part of the t test itself; instead, it is applied to the LFC as a secondary criterion when assigning protein status.
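Putting the boundaries and corrected p-values together, the status assignment can be sketched as follows (the handling of values lying exactly on a boundary is our assumption, not specified above):

```python
def protein_status(lfc, p_eq, p_df, b_eq=0.5, b_df=0.75, alpha=0.05):
    """Classify one tested protein from its LFC and corrected TOST / t test p-values.

    b_eq, b_df: equivalence and difference boundaries (absolute LFC);
    alpha: user-defined p-value threshold for significance.
    """
    if p_eq < alpha and abs(lfc) <= b_eq:
        return "equivalent"
    if p_df < alpha and abs(lfc) >= b_df:
        return "different"
    return "unexplained"  # tested, but neither criterion was met
```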

2.1.4. Summarizing the Data for Further Insights

Finally, QuEStVar offers various ways to summarize and report the results. These include detailed testing summaries for individual comparisons, high-level summaries capturing pairwise comparison trends in terms of protein status percentages, and protein-specific highlights suitable for subsequent protein-set enrichment analyses. These diverse summarization options allow users to extract insights tailored to their analytical needs (Figure 1D).

2.2. Analysis of the E. coli Spike-In Data

We used a spike-in data set to demonstrate how QuEStVar enhances statistical explainability by introducing equivalence testing alongside differential testing. The data set used in the analysis is available as "benchmark_diann.zip" from https://zenodo.org/records/7859138; it was originally compiled by Fröhlich et al. in 2022,23 but the version analyzed with DiaNN is from Yu et al.24

The data set consists of 92 single-injection DIA runs across four conditions: “Lymphnode”, “1:25”, “1:12”, and “1:06”, with 23 replicates per condition available for testing. The “Lymphnode” condition contains solely Homo sapiens (H. sapiens) peptides, while the other conditions include mixtures of Escherichia coli (E. coli) and H. sapiens peptides at various ratios. The ratio 1:25 indicates that 1 E. coli peptide is added for every 25 H. sapiens peptides; likewise, the 1:12 and 1:06 conditions add 1 E. coli peptide for every 12 and 6 H. sapiens peptides, respectively. To simplify communication, we refer to these conditions as samples A, B, C, and D, respectively.

2.2.1. Data Preparation

The E. coli spike-in data was prepared for QuEStVar in several steps. First, label-free quantitation values were obtained at the protein level from “protein_maxlfq.tsv”, whose first column provides protein accession IDs. To link these IDs with taxonomic context, organism information was extracted from the FASTA file “2022-02-18-reviewed-UP000005640-UP000000625”.

The data cleanup process included only limited imputation for this data set. If fewer than 25% of the 23 replicates were missing for a given protein and condition, the missing values were replaced with the mean of the quantified replicates. This limited imputation aims to retain more proteins with 23 complete replicates per condition during testing; compared to full imputation, it has less impact on the data while allowing more proteins to be fully retained. It increased the count of fully retained proteins from 2656 to 4252, reducing the number of proteins filtered out during testing in QuEStVar.
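The limited imputation step can be sketched as below, assuming missing values are encoded as NaN (the 25% threshold matches this data set; function and variable names are ours):

```python
import numpy as np

def limited_impute(reps, max_missing_frac=0.25):
    """Mean-impute missing replicate values, but only for proteins whose
    fraction of missing replicates is below max_missing_frac.

    reps: 2-D array, rows = proteins, columns = replicates of one condition.
    Proteins missing more than the threshold are left untouched (and will be
    removed later by the strict missing-value filter).
    """
    out = reps.astype(float).copy()
    missing = np.isnan(out)
    frac = missing.mean(axis=1)
    for i in np.where((frac > 0) & (frac < max_missing_frac))[0]:
        out[i, missing[i]] = np.nanmean(out[i])
    return out
```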

After the cleanup, each condition subset, along with replicate-averaged and cleaned versions of the quantitative data set, was saved to disk for running QuEStVar. Notebook S1 presents in-depth commentary, step-by-step code, and extensive quality-check visualizations for the data preparation stage of the E. coli spike-in data set.

2.2.2. Running QuEStVar

For each sample, proteins with missing values and proteins with an intrasample CV greater than 75% were filtered out. From the four samples, six sample pair combinations were formed. Additionally, a self-comparison of sample A (A vs A) was created as a reference. As a result of these filtrations, up to 21% of proteins were excluded from sample pairs involving sample A, because sample A (Lymphnode) lacks E. coli proteins.

The testing and protein status classification was configured with the following parameters: a difference boundary of 0.75 and an equivalence boundary of 0.50, both measured in absolute LFCs. Statistical testing employed t test and TOST procedures, assuming equal variance and independent testing. A p-value threshold of 0.05 with FDR correction was applied for significance determination.

Additionally, we compared the similarities between three sample pairs using standard correlation coefficient metrics and a new QuEStVar metric called the sample equivalence index (SEI) (eq 1), which summarizes the similarity between two compared samples. The SEI is computed from the pair-specific summary as the fraction of proteins classified as equivalent in a given comparison. The denominator can be all proteins available in the data, all tested proteins, or all significant proteins; here, we used the proteins identified as statistically different or equivalent for calculating both the correlation coefficients and the SEI.

SEI = N(equivalent) / N(total)  (1)

This part of E. coli spike-in data analysis is detailed in Notebook S2.
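With the per-protein statuses in hand, the SEI for one sample pair reduces to a simple ratio; the sketch below uses the statistically different-or-equivalent denominator chosen for this analysis (the function name is illustrative):

```python
from collections import Counter

def sample_equivalence_index(statuses):
    """SEI for one sample pair: fraction of equivalent proteins among those
    classified as statistically equivalent or different (eq 1)."""
    counts = Counter(statuses)
    significant = counts["equivalent"] + counts["different"]
    return counts["equivalent"] / significant if significant else float("nan")
```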

2.3. Analysis of Simulated Scenarios

We conducted four simulated scenarios to test the behaviors of correlation coefficients and the SEI. Notebook S3 details the setup and execution of these scenarios. In the analysis, three correlation coefficients were used to measure the relationships between variables: the Pearson correlation coefficient (r), the Spearman correlation coefficient (ρ), and the Kendall correlation coefficient (τ). These coefficients were used to assess sample similarity and were compared against the SEI.

The main rules for simulating the data were as follows: 20 replicates represent a sample. In some scenarios these replicates were identical, while in others they contained intrasample variation mimicking the CV distribution of the real data. A total of 5289 proteins were used, matching the number of proteins quantified in the 1:06 condition (sample D). Differences were introduced in increments of 250 proteins, from 0 up to 5250 proteins. All metrics were calculated from log2-scaled data. The equivalence boundary was set at −0.5 to 0.5. FDR correction was applied to the p-values, and only FDR < 0.05 was considered significant. The SEI was calculated using the number of tested proteins as the denominator and the number of equivalent proteins as the numerator.

The first simulated scenario involved introducing fixed LFCs to the second sample in one or both directions. Initially, both samples were identical, with 20 replicates each. Then, in 250-protein intervals, fixed LFCs of 0.25, 2, and 8 were added to the second sample.

The second simulated scenario involved introducing a range of LFCs to the second sample in one or both directions. Initially, both samples were identical, with 20 replicates each. Then, in 250-protein intervals, LFCs drawn from the ranges (0.25, 0.75), (1, 3), and (4, 12) were added to the second sample.

The third simulated scenario introduced the change using LFCs drawn from the (1, 3) range. This time, intrasample CV was added to the 20 replicates in both samples to simulate a more realistic scenario with technical noise. The CVs were calculated on log2 values for simplicity, resulting in smaller values that still reflect the CV distribution of real data.

The last simulation involved completely replacing the second sample with random values sampled from the first sample’s min–max value range using a uniform distribution. Three scalers were used to scale the min–max value range: 0.5 (halved the original range), 1 (kept the original range), and 2 (doubled the original range).

For each simulated scenario, we calculated Pearson, Spearman, Kendall, and SEI values at 250-protein intervals. These values were then plotted alongside a reference line representing the observable ground truth, i.e., the percentage of unchanged proteins tested in each comparison (in the spike-in data, the percentage of H. sapiens proteins, which are unchanged between samples and thus expected to be equivalent or similar). This reference line depicts a linear decrease in metric value from 1.0 (when 0% of proteins differ) to 0.0 (when 100% of proteins differ).
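A condensed sketch of one such scenario (the one-direction (1, 3) LFC range) illustrates the setup; it computes only the Pearson correlation of replicate means and the ground-truth line, omitting the full testing and SEI machinery for brevity:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_proteins, n_reps = 5289, 20

# Sample one: identical replicates of random log2 abundances.
base = rng.uniform(10, 25, n_proteins)
s1 = np.repeat(base[:, None], n_reps, axis=1)

# Sample two: copy of sample one with an LFC drawn from the (1, 3) range
# added to the first `changed` proteins, in one direction only.
s2 = s1.copy()
changed = 2500
s2[:changed] += rng.uniform(1, 3, changed)[:, None]

r, _ = stats.pearsonr(s1.mean(axis=1), s2.mean(axis=1))
ground_truth = 1 - changed / n_proteins  # fraction of unchanged proteins
```

Even with nearly half the proteins offset, r stays close to 1 while the ground-truth similarity drops well below it, which is the insensitivity the full simulations quantify.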

2.4. Analysis of Cancer Cell Line Data

To show an application of our equivalence-centered approach, we utilized the PXD030304 data set, a comprehensive collection of proteomic data derived from diverse cancer cell lines.25 It includes 949 distinct cell lines and quantifies 8498 proteins using the DIA-MS workflow. We selected this data set because it covers a wide range of tissue and cancer types, ensuring a representative sample for our study. Additionally, it provides at least 5 replicates for most cell lines, which increases the confidence in our testing compared to using only 2 replicates.

We expanded the metadata to include “Age”, “Tumor Status”, and “System Name” variables. These variables provide context for the analysis. The information was obtained through an automated mapping process using DepMap models26 and further refined through manual curation using the Cellosaurus database27 (Notebook S4).

2.4.1. Data Preparation

We went through several data cleaning and processing steps to prepare the data for QuEStVar. First, we implemented a cell line selection process based on specific criteria. We removed 135 cell lines that contained unknown values in “Sex”, “Age”, or “Tumor Status”, or that were the only subtype of a given cancer. Additionally, we eliminated 42 cell lines that had fewer than 6 replicates. No cell lines were removed for having less than 20% of the total proteins quantified (with at least 60% of replicates quantified). To prevent overrepresentation of certain tissues, cancers, and subtypes in the comparisons, we also selected a maximum of 3 cell lines with the same cancer subtype, age, and tumor status. The cell line selection process ultimately resulted in 360 cell lines used in the analysis.

For data processing, we first implemented limited imputation. If fewer than 40% of the 6 replicates were missing for a given protein and cell line, the missing values were replaced with the mean of the quantified replicates. Limited imputation increased the count of fully retained proteins from 400 to 725 across all cell lines, reducing the number of proteins filtered out during testing in QuEStVar. Then, we used the CV and missing value percentages to identify the best 5 replicates per cell line to keep for testing. We removed proteins not fully quantified in at least two cell lines, ensuring that each retained protein would be tested in at least one comparison. At the end of the process, we retained 7975 of the initial 8498 proteins. Finally, we centered the replicate distributions of the cell lines on the median.

After preparing the data, each cell line subset, as well as replicate-averaged and cleaned versions of the quantitative data set, was saved to disk for QuEStVar execution. Notebook S5 provides in-depth commentary, step-by-step code, and extensive quality-check visualizations for the data preparation stage of the cancer cell line data set.

2.4.2. Running QuEStVar

For each of the 360 cell lines, proteins with missing values and those with an intrasample CV greater than 75% were filtered out. From the 360 cell lines, 64,620 cell line pairs were formed. The testing and protein status classification were configured with the following parameters: a difference boundary of 1 and an equivalence boundary of 0.90, both measured in absolute LFCs. Statistical testing employed t test and TOST procedures, assuming equal variance and independent samples. A p-value threshold of 0.05 with FDR correction was applied for significance determination.

After completing the testing and protein status classification for all 64,620 cell line pairs, we created multiple summary tables to facilitate easier access to structured data. First, we created a grouping table that holds the sample 1 (S1) and sample 2 (S2) information alongside each sample’s metadata, serving as detailed metadata for the comparisons. We also created a secondary metadata table, called combined labels, to hold the combined names for the comparisons. For example, if S1 is a lung-derived cell line and S2 a lymphoid one, this table holds the tissue label “Lung vs Lymphoid”.

The first summary table is the expanded protein-status matrix, a large matrix with all pairs in rows and proteins in columns. Using the protein-status matrix, we created the protein status summary table to store pair-specific summaries, where each row corresponds to a single cell line pair and summarizes the counts of proteins per status, the test status, and the percentages of different, equivalent, and unexplained proteins. These two summary tables are created once all the pairs are run.

The last summary table, the shared protein summary table, was created to store protein-specific summaries and aggregate each protein’s status across all cell line pairs. This summary table is comparison-specific, and its records change based on the subset of pairs included for a given comparison, such as same-tissue-only comparisons or all cell lines. The shared protein summary table records the number of pairs in which a protein was tested and found to be statistically different, equivalent, or unexplained. Using the number of tested pairs as the denominator, the percentages of difference and equivalence are also calculated for the newly introduced relative stability metric (RSM), a value ranging from −100 to +100 that characterizes each protein’s consistent behavior across many comparisons. This metric quantifies the protein’s tendency toward quantitative stability or variability. To augment the reliability of the RSM value, we used the % Tested variable, the percentage of pairings in which a given protein was tested. This secondary metric safeguards against drawing conclusions based on insufficient data.
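Assuming the RSM is the percentage of equivalent pairs minus the percentage of different pairs (an assumption consistent with its stated −100 to +100 range, though the exact formula may differ), the two protein-level metrics can be sketched as:

```python
def relative_stability_metric(n_equivalent, n_different, n_tested):
    """RSM in [-100, 100]: positive values indicate a tendency toward
    quantitative stability, negative values toward variability.
    Assumed form: %equivalent minus %different over all tested pairs.
    """
    return 100.0 * (n_equivalent - n_different) / n_tested

def percent_tested(n_tested, n_pairs):
    """% Tested: share of all pairings in which the protein was tested,
    used to guard against RSM values based on insufficient data."""
    return 100.0 * n_tested / n_pairs
```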

The running of all the cell lines and creating summary tables for the downstream analysis are detailed with in-depth commentary and step-by-step code within Notebook S6.

2.4.3. Downstream Analysis

The downstream analysis involves a question-oriented examination of the testing results in bulk to explore the stable core proteome (Notebook S7). Additionally, it involves looking at subsets of results in a grouped fashion to explore the stability within the same tissue (Notebook S8). For both analyses, apart from using custom visualizations tailored to specific questions about the results, we utilized the Python API of g:Profiler (v.1.0)28 to gather various biological information about the protein sets accessed in the analysis from Reactome29 and KEGG.30 While the enrichment analysis has focused on the pathways from Reactome and KEGG, the Gene Ontology (GO)31 results have been extracted from the g:Profiler and can be found in the Supporting Information tables.

2.5. Technical Details

The QuEStVar method and its analysis were primarily conducted using Python (v.3.9.18).32 Using Numpy (v.1.24.3),33 Pandas (v.1.5.2),34 Polars (v.0.18.1), and Feather-format (v.0.4.1) facilitated data manipulation, analysis, and input/output operations. The SciPy (v.1.10.1)17 package served as the backbone of the testing framework. Matplotlib (v.3.7.2),35 Seaborn (v.0.11.2),36 Upsetplot (v.0.7.0), and PyComplexHeatmap (v.1.6.4)37 were used to generate visualizations. The versions of all the packages used in the analysis are listed in Notebook S9.

The analysis was performed on a PopOS 22.04 operating system, a Debian-based distribution, on a high-performance computing system with a Ryzen 9 3950X processor, 128 GB RAM, and a 1024 GB SSD. The analysis pipeline was completed in under 15 min using 30 threads; the memory-efficient configuration used 20 GB of memory and 80 GB of hard drive space.

3. Results

3.1. Equivalence Testing Expands Statistical Insights into the Relationship between Biological Conditions

We hypothesize that equivalence testing improves the determination of sample similarity at the level of individual proteins rather than overall resemblance or correlation. To assess this, we utilized a two-species spike-in data set.24 This data set has four conditions (A, B, C, D) with increasing spiked-in E. coli proteome levels; sample A contains only H. sapiens proteins and serves as the baseline. We created four comparisons, highlighted in Figure 2A. Since there is no change in H. sapiens proteins, all comparisons show near-zero LFCs for these proteins. The D vs B comparison (Figure 2) exhibits the highest expected LFC (2.05) for E. coli, reflecting the greatest E. coli spike-in difference. In contrast, D vs C (Figure S1) represents an intermediate E. coli spike-in difference, with an expected LFC of 1. D vs A (Figure S2) has no expected LFC for E. coli, as sample A lacks the spike-in. Finally, the A vs A comparison serves as a reference, with no expected LFC for E. coli.

Figure 2.

Application of QuEStVar in spike-in data with lower significance thresholds. (A) Four comparisons are run with QuEStVar, showing the expected LFC values for E. coli and H. sapiens. (B) Scatterplot with organism hue displays the LFC separation among E. coli proteins, with boxplots highlighting the differences in LFC between samples D and B. (C) Raw and adjusted p-value distributions from the testing are shown in a step histogram. (D) Antlers plot, a modified volcano plot, visually distinguishes proteins within equivalence boundaries (−0.5 to 0.5), offering insights into different and equivalent proteins. (E) Pie chart demonstrates the percentage of E. coli or H. sapiens proteins in each protein status category. (F) Protein exclusion matrix gives a detailed overview of the reasons for protein exclusion. Finally, in (G), the grouped bar plot highlights the percentage equivalence and correlation metrics for three data set comparisons, supplemented by ground truth calculations defining perfect cases of protein equivalence or dissimilarity in each comparison.

For these comparisons, QuEStVar was run to examine whether our new framework can clearly represent the expected E. coli and H. sapiens LFCs between comparisons. The D vs B comparison shows that the observed LFCs of E. coli and H. sapiens proteins are close to the expected values (Figure 2B). Because most proteins are of H. sapiens origin, with LFCs around 0, the majority were statistically equivalent; this made the p-value distribution difficult to interpret (Figure 2C). We modified a volcano plot into an “Antlers Plot” (Figure 2D) to visualize protein classification from both tests. The p-value shown for each protein is chosen based on its LFC: the TOST p-value if the LFC falls within the equivalence boundaries, and the t test p-value otherwise. Protein status counts supplement the Antlers plot to show the number of proteins per classification (Figure 2E). This comparison revealed 3919 statistically equivalent, 475 statistically different, and 341 unexplained proteins. An additional 654 proteins were excluded due to missing values or an intrasample CV above 75% (Figure 2F).

Since the degree of sample similarity is commonly evaluated by correlation analysis, we hypothesized that the ability to assess statistical equivalence for each protein would also allow robust assessment of global similarity between samples or conditions. To this end, we compared the SEI and correlation coefficient methods against the “observable ground truth” using three comparisons (D vs B, D vs C, and A vs A). In the first comparison, D vs B, where E. coli proteins differ by 2.05 LFC, the observable ground truth was 0.89 because 11% of proteins come from E. coli. In contrast, although the LFC difference between D and C is smaller, 16% of proteins come from E. coli, so the observable ground truth is lower, at 0.84. The last comparison uses the same sample, A, for both S1 and S2 to establish a reference comparison with 100% similarity. Plotting these three comparisons, we observed that the SEI consistently aligns with the observable ground truth. However, the average correlations computed for each comparison tend to be lower than the observable ground truth, and this discrepancy grows with the magnitude of the differences, resulting in decreased correlation values (Figure 2G).

The integration of equivalence testing enhances our capacity to measure similarity and broadens the statistical insight gained from comparisons. Equivalence testing thus serves as a valuable complement to traditional tests for differences, providing a more complete statistical picture.

3.2. SEI Presents a More Robust Way to Compare Similarity between Samples

In the E. coli spike-in data set, we observed that the SEI, as a sample-level similarity metric, closely aligns with the observable ground truth. We then conducted data simulations to validate its accuracy in describing sample similarity beyond the species spike-in experiment. The simulations aimed to understand how commonly used correlation coefficients (Pearson r, Spearman ρ, Kendall τ) and the SEI respond to varying levels of similarity, magnitudes and frequencies of differences, and types of noise in the data. Each correlation coefficient was calculated for all pairwise comparisons and averaged, while the SEI was calculated as the fraction of equivalent proteins among tested proteins. Statistically equivalent proteins were identified with equivalence boundaries of [−0.5, 0.5], FDR correction, and a p-value threshold of 0.05, using equal-variance, unpaired testing parameters.

In our baseline simulation, we introduced fixed LFCs (0.25, 2, and 8) to an increasing number of proteins in one or both directions (Figure S3). This simplified scenario aimed to demonstrate the direct effect of fixed differences on the similarity metrics. Adding a 0.25 LFC offset to the second sample had minimal impact on all metrics; the SEI remained at 1 because the LFC stayed within the equivalence boundaries. Increasing the LFC of the introduced offset to 2.0 (dashed line) and 8.0 (dotted line) produced a linear drop in the SEI as the percentage of affected proteins increased.
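
This baseline scenario can be sketched as below, assuming identical replicates represented by a single log2 intensity vector; the `add_fixed_offset` helper and the synthetic intensities are hypothetical illustrations, not the manuscript's simulation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_fixed_offset(base, frac_changed, lfc, two_directions=False):
    """Copy a vector of log2 abundances and add a fixed LFC offset to a
    random fraction of proteins, optionally alternating the direction."""
    sample2 = base.copy()
    n_changed = int(len(base) * frac_changed)
    idx = rng.choice(len(base), size=n_changed, replace=False)
    signs = np.ones(n_changed)
    if two_directions:
        signs[1::2] = -1  # alternate up/down shifts
    sample2[idx] += signs * lfc
    return sample2

base = rng.normal(20, 2, size=5000)  # synthetic log2 intensities
for lfc in (0.25, 2.0, 8.0):
    shifted = add_fixed_offset(base, 0.3, lfc)
    # an LFC of 0.25 stays inside the [-0.5, 0.5] boundary (no protein
    # leaves equivalence); LFCs of 2 and 8 push exactly the changed 30%
    # of proteins outside the boundary
    outside = np.mean(np.abs(shifted - base) > 0.5)
```

Sweeping `frac_changed` from 0 to 1 and recomputing the SEI at each step reproduces the linear drop described above.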

In the second scenario, we introduced varying LFC ranges (0.25–0.75, 1–3, and 4–12) in one or both directions (Figure S4). Unlike the fixed-LFC approach, randomly selecting LFCs simulates nonuniform differences of varying magnitudes. This revealed a pattern similar to the first simulation scenario, but with lower overall similarity metrics. Notably, the correlation coefficients failed to capture changes at high LFC ranges, exhibiting a sharp early decline. However, as more proteins changed, the correlation values either increased when the offset was added in one direction (Figure 3A) or plateaued and slowed their descent when the offset was added in both directions (Figure 3B). In the one-direction scenarios, the correlation coefficient increased after more than 50% of the proteins had changed, because the bulk of the proteins had then shifted with the offset. Additionally, we observed that the magnitude of the offset affects correlation far more strongly, whereas the SEI captures similarity more precisely.

Figure 3.

Figure 3

Performance comparison of the SEI and correlation coefficient in simulated scenarios with increasing noise. Line plots demonstrate how the similarity metrics change as the percentage of proteins with introduced noise increases. Noise is simulated as follows: (A) three ranges of LFC added to sample two in one direction, (B) the same LFC ranges added to sample two in both directions, (C) a fixed 1–3 LFC range in sample two, but with three levels of intrasample CV introduced to replicates, and (D) replacement of sample two with random noise generated from the minimum and maximum values of sample one, scaled by factors of 0.5, 1, and 2.

Our third scenario simulated the effect of intrasample CV on the similarity metrics. We introduced a 1–3 LFC range to the second sample while adding intrasample CV to the replicates, deviating from the previously identical replicates. This introduced a more realistic element, as high-dimensional data with technical and biological replicates often exhibit variability. We used three CV distributions with means of 1, 4, and 16. The CV distributions were modeled after sample D of the spike-in data, with CVs calculated from log2 values following a right-skewed distribution (Figure S5). Figure 3C presents the results of this scenario: with the highest variation, the SEI starts around 0.4 with no proteins changed and gradually decreases to 0. In the same high-variation setup, the correlation coefficients start lower, with Pearson (r) and Spearman (ρ) beginning around 0.77 and Kendall (τ) close to 0.6. With lower CV setups, the correlation coefficients start higher but decrease to the same levels across all variation setups.

In the last simulation, we randomly sampled values from sample one, scaled them, and used them to replace an increasing number of proteins in sample two (Figures 3D and S6). Unlike the other simulations, this does not alter the initial state of the samples by adding offsets; it simply replaces values with new ones that can be widely different. In this simulation, the correlation coefficients consistently dropped below the baseline and remained low throughout; when 100% of the proteins in sample two were replaced, they converged at approximately zero correlation. The SEI, in contrast, displayed a near-baseline linear drop but did not reach zero, because some randomly replaced values fell within the [−0.5, 0.5] LFC boundaries and were counted as statistically equivalent.
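
This replacement scheme can be sketched as follows, assuming uniform noise spanning the scaled min–max range of sample one; the exact noise model and helper name are illustrative and may differ from the manuscript's notebooks.

```python
import numpy as np

rng = np.random.default_rng(1)

def replace_with_noise(sample1, frac, scale=1.0):
    """Replace a fraction of values in a copy of sample1 with uniform
    noise spanning the (scaled) min-max range of sample1."""
    sample2 = sample1.copy()
    n = int(len(sample1) * frac)
    idx = rng.choice(len(sample1), size=n, replace=False)
    lo, hi = sample1.min(), sample1.max()
    mid, half = (lo + hi) / 2, (hi - lo) / 2 * scale
    sample2[idx] = rng.uniform(mid - half, mid + half, size=n)
    return sample2

s1 = rng.normal(20, 2, size=5000)  # synthetic log2 intensities
for frac in (0.25, 0.5, 1.0):
    s2 = replace_with_noise(s1, frac)
    # Pearson r drops toward ~0 as the replaced fraction approaches 1,
    # while the SEI would decrease roughly linearly with frac
    r = np.corrcoef(s1, s2)[0, 1]
```

Because some replacement values happen to land within the equivalence boundaries, an SEI computed on this output stays slightly above zero even at full replacement, matching the behavior described above.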

The comparative analysis underscored distinct behaviors between correlation metrics and the SEI.

  • Correlation: Pearson (r) exhibited notable sensitivity to noise frequency, particularly in scenarios involving higher fold changes. In contrast, Spearman (ρ) and Kendall (τ) displayed greater stability across the simulations, showing resilience to variations in noise levels and the directionality of modifications within the proteomic data set. All correlations were affected by the shifting cluster of proteins, especially when the changes were introduced in one direction, and began to increase after more than half of the proteins had changed.

  • SEI demonstrated more consistent behavior throughout all test scenarios. It decreased linearly as the percentage of affected proteins increased, irrespective of the magnitude or direction of the introduced noise. This behavior was consistent across all scenarios, showcasing the metric's robustness in quantifying statistical equivalence based on predefined boundaries. The SEI is, however, affected by very high within-sample variation, which lowers confidence in the testing.

These diverse behaviors highlight the limitations and strengths of correlation metrics versus the SEI in capturing changes induced by fold changes in the proteomic data set. While correlation metrics respond variably to noise magnitude and directionality, the SEI adheres more consistently to the defined statistical boundaries, especially when the precision and magnitude of differences support a robust classification of equivalent and different data points. The SEI's performance is independent of the magnitude of deviation, showing a sensitive, linear response to the fraction of deviating features.

3.3. Quantitatively Stable Core Proteome across 360 Cancer Cell Lines

We then applied QuEStVar to a large data set of cancer cell lines from 25 tissues25 to gain insights into quantitative protein stability and variability across diverse malignancies, subtypes, and tissues of origin. Out of over 900 cell lines, we focused on a subset of 360 selected based on strict criteria, including sample quality benchmarks, robust replication, select cancer subtypes, and a balanced distribution of cancers across different tissues. From these 360 cell lines, we generated 64,620 cell line pairs to be tested using QuEStVar (Figure 4A).

Figure 4.

Figure 4

Exploring stable and variable proteins quantitatively. (A) Overview of QuEStVar analysis on the cancer cell line data set summarizes the methodology. (B) Stable and variable protein identification using the RSM and percent tested metrics. Proteins are classified as stable, variable, or undetermined, with color-coding and cutoffs shown. (C) Heatmap visualizing variable protein expression across cell lines, with Ward clustering (Euclidean distance) applied to rows and columns. Three clusters are delineated based on the cell line systems of origin. (D) Donut chart showing cell line system composition within column clusters. Percentages exceeding 5% are labeled for clarity.

For this data set, we used the following criteria to identify quantitatively stable proteins: an RSM value greater than +35 and a %Tested value greater than 95%. Conversely, quantitatively variable proteins required an RSM value less than −35 and a %Tested value greater than 95%. Applying these criteria, we identified 26 quantitatively variable and 171 quantitatively stable proteins across all cancer cell lines (Figure 4B).
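
The classification step can be sketched as below. Note that the RSM computed here (percent equivalent minus percent different among tested pairs) is an illustrative stand-in consistent with the ±35 cutoffs; the exact Relative Stability Metric is defined in the Methods.

```python
import numpy as np

def classify_proteins(n_equiv, n_diff, n_tested, n_pairs,
                      rsm_cut=35.0, tested_cut=95.0):
    """Classify proteins from per-protein tallies of pairwise tests.
    NOTE: this RSM (percent equivalent minus percent different among
    tested pairs) is an illustrative stand-in for the manuscript's
    Relative Stability Metric."""
    pct_tested = 100.0 * n_tested / n_pairs
    rsm = 100.0 * (n_equiv - n_diff) / np.maximum(n_tested, 1)
    status = np.full(len(rsm), "undetermined", dtype=object)
    status[(rsm > rsm_cut) & (pct_tested > tested_cut)] = "stable"
    status[(rsm < -rsm_cut) & (pct_tested > tested_cut)] = "variable"
    return rsm, pct_tested, status
```

A protein tested in too few pairs stays "undetermined" regardless of its RSM, mirroring the %Tested > 95% requirement above.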

We utilized hierarchical clustering to investigate the behavior of the variable and stable proteins across all cancer cell lines based on z-scored protein abundances. Notably, in a heatmap of only the variable proteins (Figure 4C), we observed that hematologic malignancies drive the identification of variable proteins, apparent in the cell line clusters (Figure 4D). Protein cluster 1 demonstrated lower abundance in hematologic malignancies but higher levels in other types of cancer, while protein cluster 2 exhibited high abundance specifically in hematologic cancers.

To explore the stable core proteome, we delved deeper into the biology of the 171 stable core proteins by examining their functional annotations and protein interactions through KEGG pathways. Of the 171 quantitatively stable proteins, 96 were linked to genetic information processing, primarily transcription and translation. Within the KEGG database, the regulatory network of these proteins primarily involved the spliceosome, nucleocytoplasmic transport, ribosome, mRNA surveillance, coronavirus disease, and amyotrophic lateral sclerosis pathways (Figure 5). Apart from the disease pathways, they predominantly contributed to RNA metabolism. The list of Reactome pathways enriched for the same 171 proteins also highlighted the prevalence of RNA metabolism (Figure S8). Fifty-one of the 171 stable proteins were associated with the spliceosome (Figure 6), which is known as a general cellular "housekeeping" machinery. Furthermore, 45 proteins were implicated in crucial processes such as translational processing, cell growth, and death. These functions include the nucleocytoplasmic transport, ribosome, and mRNA surveillance pathways; within the mRNA surveillance pathway, the stable proteins were involved only in the nonsense-mediated mRNA decay (NMD) branch. GO enrichment yielded over 150 terms related to RNA metabolism; we present the top 30 terms from each category: Biological Process, Cellular Component, and Molecular Function (Figure S9).

Figure 5.

Figure 5

Predominant RNA metabolism-focused stable core proteome in cancer cell lines. Out of the 171 stable proteins, 96 contribute to enriched KEGG pathways. The upset plots illustrate the distribution of stable proteins across each pathway, showing the count of proteins associated with individual pathways within the identified stable core proteome. The bubble plot displays the enrichment levels of the identified stable proteins within pathways. The point size represents the enrichment value, and the x-axis shows the −log10 FDR, providing insight into the significance of the enrichment.

Figure 6.

Figure 6

Detailed spliceosome pathway diagram showing the gene symbols mapped to the proteins in the test summaries of all cancer cell line comparisons. The size of each square indicates the percentage of pairs in which the protein was tested. The color indicates whether the protein is classified as stable, variable, or undetermined. If a gene/protein is indicated in the KEGG pathway but not found in the data, it is shown as a small dark-gray point.

We examined quantitative protein stability in 360 carefully chosen cancer cell lines, revealing 26 variable and 171 stable proteins. These stable proteins predominantly function in genetic information processing, particularly RNA metabolism-related pathways such as the spliceosome and mRNA surveillance.

3.4. Exploring Tissue-Specific Stable Core for Cancer Cell Lines

Next, we investigated quantitative stability within cancer cell lines originating from the same tissue, using the same data set.25 To explore this, we created same-tissue groups by subsetting cancer cell line pairs that share a tissue of origin, yielding 25 same-tissue groups with varying numbers of cell line pairs. Additionally, we characterized the composition of each same-tissue group by determining the percentage of cell line pairs originating from the same cancer subtype, assuming that cell lines from the same subtype would contribute more strongly to the overall same-tissue group similarity (Figure 7A).

Figure 7.

Figure 7

High-level comparison of same-tissue-grouped pairs highlighting similarity. (A) Composition of same-tissue groupings categorized by origination from the same subtype, represented by yellow bars indicating percentages. The numbers at the end of each tissue group denote the total proteins per group, delineating biological similarity or dissimilarity. (B) Stacked bar plots display averaged percentages of protein status from pairs within the same tissue group, providing a comprehensive overview of groupwise similarity. The ordering of tissue groups by mean equivalence percentage also applies to panels A and C through the shared y-axis. (C) Boxplots summarize and group the number of proteins tested within each pair, focusing on same-tissue grouping comparisons and offering insights into the distribution and variability of tested proteins. (D) Comparison between the stable core proteome (171 proteins) and its identification within same-tissue groupings. White denotes proteins that are not stable within a given same-tissue grouping, while dark blue signifies stability.

One approach to summarizing the many comparison results produced by QuEStVar is to examine pair-specific summaries, which we used here to evaluate overall similarity within same-tissue groups. Pair-specific summaries per same-tissue group were calculated by averaging the percentages of equivalent, different, and unexplained proteins across all pairs belonging to a given group; the resulting averages were used to create a stacked bar plot of percentages per same-tissue group (Figure 7B). Notably, the proportion of unexplained proteins varied between 38 and 46% across all same-tissue groups. The groups were ordered by the average percentage of equivalent proteins, with the expectation that groups with fewer pairs and a higher same-subtype composition would show greater similarity. However, this did not hold universally. Surprisingly, despite having more pairs and a lower same-subtype composition, the lymphoid and myeloid tissues showed high within-group similarity. In contrast, tissues such as testis, prostate, and liver showed low within-group similarity despite having few pairs with a high same-subtype composition. The number of proteins tested per pair ranged from 2000 to 3750, with high variability within tissue groups (Figure 7C).
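
A pair-specific summary of this kind reduces to a grouped average, sketched here with pandas; the column names and values are hypothetical placeholders, not the actual data.

```python
import pandas as pd

# toy per-pair test summaries (hypothetical tissues and percentages)
pairs = pd.DataFrame({
    "tissue":     ["lymphoid", "lymphoid", "liver", "liver"],
    "pct_equiv":  [55.0, 60.0, 25.0, 30.0],
    "pct_diff":   [5.0, 3.0, 35.0, 28.0],
    "pct_unexpl": [40.0, 37.0, 40.0, 42.0],
})

# average the per-pair percentages within each same-tissue group and
# order groups by mean equivalence, as in Figure 7B
group_summary = (
    pairs.groupby("tissue")[["pct_equiv", "pct_diff", "pct_unexpl"]]
    .mean()
    .sort_values("pct_equiv", ascending=False)
)
```

Each row of `group_summary` then corresponds to one stacked bar in the groupwise similarity plot.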

Following the comparison of pair-specific similarities among cancer cell lines within the same tissue, we moved on to a protein-specific analysis, mirroring the approach used to explore the stable core proteome across all cancer cell lines, this time to identify tissue-specific stable core proteomes. Identical criteria were used to identify quantitatively stable (RSM > +35 and %Tested > 95%) and variable (RSM < −35 and %Tested > 95%) proteins within each same-tissue group. As a result, we observed groups of 20–375 proteins that were quantitatively stable within a single tissue and groups of 190–1100 proteins that were stable in more than one tissue but not universally. Proteins were identified as quantitatively stable or variable in all same-tissue groups, and some were distinct to a single same-tissue group. Although we initially assumed that high-similarity tissue groups would yield more distinctly stable proteins, more tissue-specific stable proteins were found in low-similarity tissues such as the vagina, prostate, liver, and testis (Figure S10A). The distinctly variable proteins were likewise found primarily in low-similarity tissues, although some high-similarity tissues, such as the pancreas, cervix, and pleura, also showed distinct proteins (Figure S11B); these are generally tissues with few pairs and a high same-subtype composition (Figure 7A).

To better understand the relationship between the stable core proteome identified across all cancer cell lines and the proteomes identified in same-tissue groupings, we created a heatmap of binary values indicating whether a protein is quantitatively stable in a given same-tissue group. Proteins were grouped into 4 clusters and tissues into 3 clusters (Figure 7D). The first two protein clusters align with tissue-enriched stable proteins; for example, the first protein cluster contains stable proteins predominantly identified in tissue cluster 3, which is characterized by same-tissue groups with high similarity. Protein cluster 3 primarily consists of proteins found in the stable core proteome of all cancer cell lines, with the exceptions of the vagina, prostate, testis, and uterus, which do not share those proteins. Across tissues, between 45 and all 171 of the stable core proteins from the comprehensive analysis of all cancer cell lines were recovered. The uterus displayed the lowest count, while the lymphoid, cervix, PNS, and soft tissues exhibited the highest; notably, soft tissue is not in tissue cluster 3, while the others are.

We used the multiquery feature of g:Profiler to extract KEGG and Reactome pathways highlighting the biological functions of the quantitatively stable proteins identified in each group. We set a p-value threshold of 1 to obtain all pathways and then applied a 0.05 FDR threshold to identify significantly enriched pathways. Across the tissues, we found 58 enriched KEGG pathways. Notably, the most enriched pathways, including the spliceosome, ribosome, nucleocytoplasmic transport, and mRNA surveillance pathways, were consistent with those observed in the stable core proteome (Figure S11). Reactome provided a more detailed list of enriched pathways, comprising 570 unique terms enriched across the tissues (Figure S12). To analyze the prevalence of main pathways in the Reactome results, we grouped the enriched pathways by their main parent pathways and found 23 parent pathways with at least one enriched member in a tissue group. We calculated a pathway coverage value, defined as the number of enriched child pathways divided by the total number of child pathways, to determine the prevalence of a parent pathway in a specific tissue group, and used it to create a heatmap of parent pathways versus same-tissue groupings (Figure S12A). "Metabolism of RNA" emerged as the most prevalent parent pathway, enriched in all tissue groups to varying degrees: the liver has the lowest coverage at 23%, whereas the PNS and CNS reach 67%. Within the metabolism of RNA, subpathways such as "Processing of Capped Intron-Containing Pre-mRNA" are highly enriched across all tissue groups, while others, such as "Regulation of mRNA stability by proteins that bind AU-rich elements," show limited, tissue-specific enrichment (Figure S12B). The figure shows pathways with at least one protein identified in one tissue; "mRNA Editing" lacked enrichment across tissues because it is not represented in the data.

Finally, we extracted the enriched GO terms from the same-tissue groups and highlighted the top 30 terms from each category: biological process, cellular component, and molecular function (Figures S14 and S15).
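
The pathway coverage calculation described above reduces to a simple set operation; the child-pathway names below are hypothetical placeholders for illustration.

```python
def pathway_coverage(enriched_children, all_children):
    """Coverage of a parent pathway in a tissue group: number of
    enriched child pathways divided by the total number of children."""
    return len(set(enriched_children) & set(all_children)) / len(all_children)

# hypothetical children of a parent term such as "Metabolism of RNA"
children = ["capped pre-mRNA processing", "NMD",
            "mRNA editing", "tRNA processing"]
enriched_in_liver = ["capped pre-mRNA processing", "NMD"]

coverage = pathway_coverage(enriched_in_liver, children)  # 0.5
```

Computing this value for every parent pathway and tissue group fills the parent-pathway vs same-tissue heatmap described above.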

Our analysis of same-tissue cancer cell line groups revealed surprising variability in quantitative similarity, defying expectations based on subtype composition and group size. Distinctly stable and variable proteins were found across all tissues, with low-similarity tissues showing a higher number of distinctly stable proteins. Despite tissue-specific differences, a shared pattern emerged: the quantitative stability of proteins involved in core cellular functions like RNA metabolism and related pathways appears broadly conserved across different cancer cell lines. The stable proteome from same-tissue groups is consistent with the stable core we identified across all cancer cell lines.

4. Discussion

Exploring quantitative similarities holds promise for investigating hypotheses such as identifying stable targets for effective treatments in rapidly developing individuals and understanding core functionalities across different contexts. Nonsignificance in differential testing does not necessarily imply similarity; additional testing is required to determine whether nonsignificant analytes are statistically equivalent or whether low precision precludes any conclusion. Our framework, QuEStVar, provides researchers with a more complete picture of hypothesis testing by combining equivalence and differential testing to expand the statistical classification of analytes. Testing and summarization can be performed on a few comparisons or on large-scale pairwise comparisons, enabling the inference of pair-specific similarity and group-level protein-specific quantitative stability. Equivalence testing enabled quantitative stability analysis at the analyte level within complex systems.

We showed that analyte-level quantitative stability can be used to quantify the similarity between groups through the SEI, which we demonstrated to be more robust and sensitive than traditional correlation metrics (Pearson r, Spearman ρ, Kendall τ). Unlike correlation coefficients, which are sensitive to the magnitude and directionality of noise, the SEI consistently reflects the fraction of equivalent proteins within predefined boundaries. The SEI's linear response to the percentage of affected proteins offers precise quantification of similarity, even in the presence of substantial technical noise. Furthermore, the SEI is unaffected by shifting protein clusters, a factor that can artificially inflate correlation coefficients in specific scenarios.

Furthermore, QuEStVar highlights the power of harnessing complex variation to capture informative regulatory biological dynamics. Our analysis of quantitative stability and variability across diverse cancer cell lines demonstrated that hematologic malignancies drive the observed quantitative variability. Additionally, the same-tissue analyses revealed that malignancies originating from lymphoid and myeloid tissues exhibited more pronounced similarity in their protein profiles than their solid tumor counterparts, despite greater subtype diversity and a larger pair pool. Interestingly, these hematologic cancers form distinct clusters, as shown in the original publication of the data set, indicating quantitative distinctness.25 Reactome and KEGG analyses also elucidated the biological processes by which tumor cells maintain the tumorigenic environment through proteins that are stable across various cancer cells, including regulatory survival networks that escape cellular regulatory and surveillance mechanisms. The most enriched Reactome and KEGG pathways related to the metabolism of RNA and associated pathways. The spliceosome is the KEGG pathway in which more than half of the involved proteins were found to be quantitatively stable across all cancer cell lines. The spliceosome, a general cellular housekeeping machinery, possesses remarkable flexibility and adaptability in its conformation and composition. This dynamic nature enables the spliceosome to function with precision and versatility in normal cells but could also enhance cancer cells' ability to resist therapy.38 Apart from the spliceosome, the nucleocytoplasmic transport, ribosome, and mRNA surveillance pathways are also stable in cancer cell lines. Nucleocytoplasmic transport of macromolecules and ribosome biogenesis have been shown to be promising cancer treatment targets.39,40 The mRNA surveillance pathway, one of the stable pathways identified, comprises NMD, nonstop mRNA decay, and no-go decay. Interestingly, all of its stable proteins are involved in NMD, a quality-control mechanism that eliminates a subset of mRNAs implicated in the pathophysiology of many human genetic diseases, including cancer. The stable proteins identified in cancer cells suggest that NMD may be differentially promoted or suppressed to benefit survival: NMD plays important roles in eliminating mRNAs encoding tumor suppressors, RNA-binding proteins, splicing factors, and signaling proteins, while its suppression can promote the expression of oncoproteins or other growth-promoting proteins.41 More functional investigation of these proteins is necessary to delineate the core activity of cancer cells. In addition, RNA metabolism and related pathways consistently emerge as predominant features when examining stability within same-tissue groupings, mirroring the stable proteome observed across the cancer cell lines. Notably, 23–67% of the child pathways within the broader metabolism of RNA are enriched across diverse tissue groups in the Reactome database. The pathways not enriched in any tissue group are those for mitochondrial modification and processing; the proteins in these pathways are mostly missing from the data, likely because mitochondrial proteomics requires special enrichment during sample preparation.42

We used t test-based hypothesis testing for differences and the two one-sided t test (TOST) for equivalence in our combined framework. While the t test provides a simple hypothesis-testing framework, more sophisticated methods, such as ANOVA- or Bayesian-based approaches for equivalence testing, are available and could be incorporated into QuEStVar in the future.43,44 Since the t test forms the basis of our statistical framework, its inherent assumptions and considerations extend to our combined testing approach; ensuring that the data adhere to the assumptions of the t test is therefore critical for the validity of the results.45 Additionally, it is crucial to evaluate the statistical power of the testing and to design experiments accordingly.46

The LFC plays a crucial role in determining both statistical equivalence and difference. The LFC defines the boundaries within the TOST procedure for equivalence, while for difference it acts as a secondary criterion following the t tests. The appropriate LFC boundary depends on the sample size, data quality, and biological context or field standards.47,48 Ideally, these boundaries should be established during the experimental design stage.49,50 In this study, our use of public data required a posthoc boundary determination approach. We justified the specific boundaries chosen for both data sets using a data-driven method, selecting the smallest boundary that sets a realistic target for equivalence when examined within the same biological samples (Note S1).

Since QuEStVar is based on the t test, a wide variety of data can be analyzed, including data from genomics, transcriptomics, or metabolomics, provided the data adhere to the assumptions of the t test. In this study, we presented quantitative stability and variability using protein-level data; however, this type of analysis can also be performed at the peptide level, which offers several advantages over the protein level, such as avoiding the skewing introduced by protein inference (leading to more accurate quantification), access to proteoform information, and increased sensitivity to detect changes and infer similarities.51−53 However, the peptide level also presents significant challenges, especially a higher proportion of missing values than the protein level.54 Proteomics-centric tools for differential expression analysis enable testing at the peptide level or use the peptide level to strengthen protein-level results, and they include missing-value handling in their suites.55−57 While the current implementation of QuEStVar is not flexible regarding missing values, we believe it is a valuable approach for examining both statistical differences and equivalence to obtain a more complete picture of comparisons and generate valuable insights. We are confident that more proteomics-centric tools will adopt this strategy and that the increased precision and completeness of data from new and improved methods will enable more robust and comprehensive peptide-centric analyses.

One limitation of the current QuEStVar implementation is its inflexibility regarding missing values. Our approach excludes proteins missing in either test group, leading to significant data loss. We partially addressed this by using limited imputation for proteins quantified in most replicates. While full imputation is theoretically possible, we advise caution, as it requires understanding the root cause of the missingness. Simply assuming a protein is present risks mislabeling proteins as statistically different or equivalent based on imputed data. Imputing with small values drawn from a downshifted distribution58 could be used for full imputation, but this would assume that all missingness is caused by the protein being absent from the sample, which is hard to determine. An alternative approach, inspired by Limma's weighting method, could mitigate the impact of imputed values.59,60 However, since the nature of the missingness in our data was unclear, we opted for limited imputation to recover some proteins while minimizing the risk of misinterpretation.

Another limitation is the quality of the data set used to study pan-cancer cell line stability. Increased variance within replicates poses challenges, which can be addressed to a degree by CV-based filtering for reliably quantified proteins. However, any pre- or postfiltering reduces the pool of tested proteins, diminishing statistical coverage and the ability to characterize the proteome. A high proportion of missing values further curtails coverage, potentially omitting crucial proteins and yielding an incomplete analytical picture. Recognizing these constraints is pivotal, especially as the field progresses toward higher-quality data that promise more reliable and confident analytical conclusions. There is also an imbalance in cancer representation, with some cancers overrepresented; this caused issues when comparing custom groups, such as same-tissue groupings, in downstream analyses. This issue can be avoided by further subsetting to balanced groups or by creating a balanced representation of groups early in the design process.

One of the most promising applications of equivalence testing lies in exploring the stable core proteome. This study explored the stable core proteome of cancer cell lines, but the approach could be extended to other data subsets, yielding specialized stable core proteomes useful across a wide variety of research. Establishing the stable core proteome of a biological subset can be considered a stricter definition of housekeeping proteins, because it employs statistical equivalence-based conditions to call a protein stable, requiring the protein to be found equivalent in most comparisons. Housekeeping proteins, by contrast, were initially defined as those most abundant in most cells;61 later, next-generation sequencing identified a large set of genes expressed in all cells at the transcript level.62 Only recently has this definition been expanded to a set of proteins found in all cells at similar expression levels.63 The housekeeping definition has relied primarily on mRNA expression levels,64,65 which have been shown to reflect protein-level abundance poorly.7,66,67 As proteomic data continue to improve in depth and quality, there will be opportunities to create more comprehensive housekeeping proteomes, and equivalence testing and our proposed method can be invaluable tools in achieving this goal.

In conclusion, we demonstrated how equivalence testing, combined with traditional difference testing, expands protein classification in diverse comparisons within a familiar statistical framework. Measuring the fraction of equivalent proteins in the form of the SEI provides a more robust and precise similarity metric than correlation. Applying QuEStVar, we explored stability and variability in a prominent cancer cell line data set and found that RNA metabolism-related pathways, such as the spliceosome and mRNA surveillance, dominate the stable proteome both across cell lines and within same-tissue groups. Investigating these groups exposed substantial subtype diversity and varying similarity levels within tissues. Equivalence testing has broad applicability, offering a powerful tool to deepen our understanding of quantitative stability and its critical role in cancer biology.

Acknowledgments

We thank all Lange Lab members for their valuable feedback upon testing the framework in various projects and experimental settings.

Glossary

Abbreviations

BH

Benjamini & Hochberg

BY

Benjamini & Yekutieli

CV

coefficient of variation

DIA

data-independent acquisition

DIA–MS

data-independent acquisition mass spectrometry

FDR

false discovery rate

GO

gene ontology

KEGG

Kyoto Encyclopedia of Genes and Genomes

LFC

log2-fold change

LFQ

label-free quantification

MS

mass spectrometry

QuEStVar

quantitative exploration of stability and variability

TOST

two one-sided t tests

RSM

relative stability metric

SEI

sample equivalence index

Data Availability Statement

The data and scripts related to this study are available at 10.5281/zenodo.10694635 and https://github.com/LangeLab/Analysis_of_QuEStVar_Manuscript under an MIT license.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.4c00131.

  • D vs C and D vs A comparisons for the spike-in data; results of four different simulations; heatmap of the z-scored abundances of stable core proteins across 360 cell lines; enrichment of Reactome pathways derived from the stable core proteome; top 30 enriched GO terms from the stable core proteome; stable and variable proteins in same-tissue groups, followed by KEGG and Reactome pathway analyses of these proteins; grouped overview of main-parent pathways, focusing on the “Metabolism of RNA” pathways; and top 30 enriched GO terms from same-tissue groups in boxplot and heatmap form (PDF)

  • Justification of how the equivalence boundary was selected for both data sets post hoc (PDF)

  • Notebooks in HTML format, which cover data preparation for spike-in data, apply QuEStVar on the spike-in data, compare similarity metrics in simulated scenarios, expand the metadata for cancer cell line data, prepare the cancer cell line data for analysis, apply QuEStVar on the cancer cell line data, explore the stable core proteome in cancer cell line data, investigate stability in same-tissue groupings, provide a detailed summary of the packages used in the analysis, and show a comparative analysis of how the boundaries for equivalence and difference can be selected (ZIP)

  • Data relevant for spiked-in analysis (XLSX)

  • Simulation data (XLSX)

  • Data relevant for cancer cell line analysis (XLSX)

  • Stable core proteome in all cancer cell lines (XLSX)

  • Stability explored in same-tissue groups (XLSX)

This work was partially supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC) (#RGPIN-2018–05645), the Michael Cuccione Foundation, and the BC Children’s Hospital Foundation (to P.F.L.). P.F.L. was supported by the Canada Research Chairs program (CRC-RS 950–230867, P.F.L.), the Michael Smith Foundation for Health Research Scholar program (16442, P.F.L.), and the University of British Columbia (E.K.E.).

The authors declare no competing financial interest.

Special Issue

Published as part of Journal of Proteome Research virtual special issue “Canadian Proteomics”.

Supplementary Material

pr4c00131_si_001.pdf (11.1MB, pdf)
pr4c00131_si_002.pdf (1.9MB, pdf)
pr4c00131_si_003.zip (29MB, zip)
pr4c00131_si_004.xlsx (1.1MB, xlsx)
pr4c00131_si_005.xlsx (38.6KB, xlsx)
pr4c00131_si_006.xlsx (1.1MB, xlsx)
pr4c00131_si_007.xlsx (3.6MB, xlsx)
pr4c00131_si_008.xlsx (26.5MB, xlsx)

References

  1. Macklin A.; Khan S.; Kislinger T. Recent advances in mass spectrometry based clinical proteomics: applications to cancer research. Clin. Proteomics 2020, 17, 17. 10.1186/s12014-020-09283-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Miles H. N.; Delafield D. G.; Li L. Recent Developments and Applications of Quantitative Proteomics Strategies for High-Throughput Biomolecular Analyses in Cancer Research. RSC Chem. Biol. 2021, 2, 1050–1072. 10.1039/d1cb00039j. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Ankney J. A.; Muneer A.; Chen X. Relative and Absolute Quantitation in Mass Spectrometry-Based Proteomics. Annu. Rev. Anal. Chem. 2018, 11, 49–77. 10.1146/annurev-anchem-061516-045357. [DOI] [PubMed] [Google Scholar]
  4. Yang X.-L.; Shi Y.; Zhang D.-D.; Xin R.; Deng J.; Wu T.-M.; Wang H.-M.; Wang P.-Y.; Liu J.-B.; Li W.; et al. Quantitative proteomics characterization of cancer biomarkers and treatment. Mol. Ther.--Oncolytics 2021, 21, 255–263. 10.1016/j.omto.2021.04.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Wolski W. E.; Nanni P.; Grossmann J.; d’Errico M.; Schlapbach R.; Panse C. prolfqua: A Comprehensive R-Package for Proteomics Differential Expression Analysis. J. Proteome Res. 2023, 22, 1092–1104. 10.1021/acs.jproteome.2c00441. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Schessner J. P.; Voytik E.; Bludau I. A practical guide to interpreting and generating bottom-up proteomics data visualizations. Proteomics 2022, 22, e2100103 10.1002/pmic.202100103. [DOI] [PubMed] [Google Scholar]
  7. Liu Y.; Beyer A.; Aebersold R. On the Dependency of Cellular Protein Levels on mRNA Abundance. Cell 2016, 165, 535–550. 10.1016/j.cell.2016.03.014. [DOI] [PubMed] [Google Scholar]
  8. Wang D.; Eraslan B.; Wieland T.; Hallström B.; Hopf T.; Zolg D. P.; Zecha J.; Asplund A.; Li L.-H.; Meng C.; et al. A deep proteome and transcriptome abundance atlas of 29 healthy human tissues. Mol. Syst. Biol. 2019, 15, e8503 10.15252/msb.20188503. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Janse R. J.; Hoekstra T.; Jager K. J.; Zoccali C.; Tripepi G.; Dekker F. W.; van Diepen M. Conducting correlation analysis: important limitations and pitfalls. Clin. Kidney J. 2021, 14, 2332–2337. 10.1093/ckj/sfab085. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Saccenti E. What can go wrong when observations are not independently and identically distributed: A cautionary note on calculating correlations on combined data sets from different experiments or conditions. Front. Syst. Biol. 2023, 3, 1042156. 10.3389/fsysb.2023.1042156. [DOI] [Google Scholar]
  11. Lakens D.; Scheel A. M.; Isager P. M. Equivalence testing for psychological research: A tutorial. Adv. Methods Pract. Psychol. Sci. 2018, 1, 259–269. 10.1177/2515245918770963. [DOI] [Google Scholar]
  12. Gerchen M. F.; Kirsch P.; Feld G. B. Brain-wide inferiority and equivalence tests in fMRI group analyses: Selected applications. Hum. Brain Mapp. 2021, 42, 5803–5813. 10.1002/hbm.25664. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Wellek S.Testing Statistical Hypotheses of Equivalence and Noninferiority; Chapman and Hall/CRC, 2010. [Google Scholar]
  14. Endrenyi L.; Tothfalusi L. Bioequivalence for highly variable drugs: regulatory agreements, disagreements, and harmonization. J. Pharmacokinet. Pharmacodyn. 2019, 46, 117–126. 10.1007/s10928-019-09623-w. [DOI] [PubMed] [Google Scholar]
  15. Uzozie A. C.; Ergin E. K.; Rolf N.; Tsui J.; Lorentzian A.; Weng S. S. H.; Nierves L.; Smith T. G.; Lim C. J.; Maxwell C. A.; et al. PDX models reflect the proteome landscape of pediatric acute lymphoblastic leukemia but divert in select pathways. J. Exp. Clin. Cancer Res. 2021, 40, 96. 10.1186/s13046-021-01835-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Lorentzian A. C.; Rever J.; Ergin E. K.; Guo M.; Akella N. M.; Rolf N.; Lim C. J.; Reid G. S. D.; Maxwell C. A.; Lange P. F. Targetable lesions and proteomes predict therapy sensitivity through disease evolution in pediatric acute lymphoblastic leukemia. Nat. Commun. 2023, 14, 7161. 10.1038/s41467-023-42701-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Virtanen P.; Gommers R.; Oliphant T. E.; Haberland M.; Reddy T.; Cournapeau D.; Burovski E.; Peterson P.; Weckesser W.; Bright J.; et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 2020, 17, 261–272. 10.1038/s41592-019-0686-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  18. Bonferroni C. E.Teoria Statistica Delle Classi e Calcolo Delle Probabilità; Seeber, 1936. [Google Scholar]
  19. Holm S. A Simple Sequentially Rejective Multiple Test Procedure. Scand. J. Stat. 1979, 6, 65–70. [Google Scholar]
  20. Benjamini Y.; Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 1995, 57, 289–300. 10.1111/j.2517-6161.1995.tb02031.x. [DOI] [Google Scholar]
  21. Benjamini Y.; Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 2001, 29, 1165–1188. 10.1214/aos/1013699998. [DOI] [Google Scholar]
  22. Storey J. D.; Tibshirani R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 9440–9445. 10.1073/pnas.1530509100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Fröhlich K.; Brombacher E.; Fahrner M.; Vogele D.; Kook L.; Pinter N.; Bronsert P.; Timme-Bronsert S.; Schmidt A.; Bärenfaller K.; et al. Benchmarking of analysis strategies for data-independent acquisition proteomics using a large-scale dataset comprising inter-patient heterogeneity. Nat. Commun. 2022, 13, 2622. 10.1038/s41467-022-30094-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Yu F.; Teo G. C.; Kong A. T.; Fröhlich K.; Li G. X.; Demichev V.; Nesvizhskii A. I. Analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform. Nat. Commun. 2023, 14, 4154. 10.1038/s41467-023-39869-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Gonçalves E.; Poulos R. C.; Cai Z.; Barthorpe S.; Manda S. S.; Lucas N.; Beck A.; Bucio-Noble D.; Dausmann M.; Hall C.; et al. Pan-cancer proteomic map of 949 human cell lines. Cancer Cell 2022, 40, 835–849.e8. 10.1016/j.ccell.2022.06.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Dharia N. V.; Kugener G.; Guenther L. M.; Malone C. F.; Durbin A. D.; Hong A. L.; Howard T. P.; Bandopadhayay P.; Wechsler C. S.; Fung I.; et al. A first-generation pediatric cancer dependency map. Nat. Genet. 2021, 53, 529–538. 10.1038/s41588-021-00819-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Bairoch A. The Cellosaurus, a Cell-Line Knowledge Resource. J. Biomol. Technol. 2018, 29, 25–38. 10.7171/jbt.18-2902-002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Raudvere U.; Kolberg L.; Kuzmin I.; Arak T.; Adler P.; Peterson H.; Vilo J. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 2019, 47, W191–W198. 10.1093/nar/gkz369. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Fabregat A.; Jupe S.; Matthews L.; Sidiropoulos K.; Gillespie M.; Garapati P.; Haw R.; Jassal B.; Korninger F.; May B.; et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 2018, 46, D649–D655. 10.1093/nar/gkx1132. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Kanehisa M.; Sato Y.; Kawashima M. KEGG mapping tools for uncovering hidden features in biological data. Protein Sci. 2022, 31, 47–53. 10.1002/pro.4172. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Aleksander S. A.; Balhoff J.; Carbon S.; Cherry J. M.; Drabkin H. J.; Ebert D.; Feuermann M.; Gaudet P.; Harris N. L.; Hill D. P.; et al. The Gene Ontology knowledgebase in 2023. Genetics 2023, 224. 10.1093/genetics/iyad031. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Van Rossum G.; Drake F. L.. Introduction to Python 3: (Python Documentation Manual Part 1); CreateSpace Independent Publishing Platform, 2009. [Google Scholar]
  33. Harris C. R.; Millman K. J.; van der Walt S. J.; Gommers R.; Virtanen P.; Cournapeau D.; Wieser E.; Taylor J.; Berg S.; Smith N. J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. 10.1038/s41586-020-2649-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. McKinney W.Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference; SciPy, 2010; pp 56–61.
  35. Hunter J. D. Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng. 2007, 9, 90–95. 10.1109/MCSE.2007.55. [DOI] [Google Scholar]
  36. Waskom M. seaborn: statistical data visualization. JOSS 2021, 6, 3021. 10.21105/joss.03021. [DOI] [Google Scholar]
  37. Ding W.; Goldberg D.; Zhou W. PyComplexHeatmap: A Python package to visualize multimodal genomics data. iMeta 2023, 2, e115 10.1002/imt2.115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Will C. L.; Luhrmann R. Spliceosome structure and function. Cold Spring Harb. Perspect. Biol. 2011, 3, a003707. 10.1101/cshperspect.a003707. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Hill R.; Cautain B.; de Pedro N.; Link W. Targeting nucleocytoplasmic transport in cancer therapy. Oncotarget 2014, 5, 11–28. 10.18632/oncotarget.1457. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Zisi A.; Bartek J.; Lindström M. S. Targeting ribosome biogenesis in cancer: lessons learned and way forward. Cancers 2022, 14, 2126. 10.3390/cancers14092126. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Nagar P.; Islam M. R.; Rahman M. A. Nonsense-Mediated mRNA Decay as a Mediator of Tumorigenesis. Genes 2023, 14, 357. 10.3390/genes14020357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Alberio T.; Pieroni L.; Ronci M.; Banfi C.; Bongarzone I.; Bottoni P.; Brioschi M.; Caterino M.; Chinello C.; Cormio A.; et al. Toward the standardization of mitochondrial proteomics: the italian mitochondrial human proteome project initiative. J. Proteome Res. 2017, 16, 4319–4329. 10.1021/acs.jproteome.7b00350. [DOI] [PubMed] [Google Scholar]
  43. Campbell H.; Lakens D. Can we disregard the whole model? Omnibus non-inferiority testing for R2 in multi-variable linear regression and η2 in ANOVA. Br. J. Math. Stat. Psychol. 2021, 74, 64–89. 10.1111/bmsp.12201. [DOI] [PubMed] [Google Scholar]
  44. Kelter R. Bayesian Hodges-Lehmann tests for statistical equivalence in the two-sample setting: Power analysis, type I error rates and equivalence boundary selection in biomedical research. BMC Med. Res. Methodol. 2021, 21, 171. 10.1186/s12874-021-01341-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Schork K.; Podwojski K.; Turewicz M.; Stephan C.; Eisenacher M. Important issues in planning a proteomics experiment: statistical considerations of quantitative proteomic data. Methods Mol. Biol. 2021, 2228, 1–20. 10.1007/978-1-0716-1024-4_1. [DOI] [PubMed] [Google Scholar]
  46. Lever J.; Krzywinski M.; Altman N. Points of significance: Principal component analysis. Nat. Methods 2017, 14, 641–642. 10.1038/nmeth.4346. [DOI] [Google Scholar]
  47. Kammers K.; Cole R. N.; Tiengwe C.; Ruczinski I. Detecting significant changes in protein abundance. EuPa Open Proteomics 2015, 7, 11–19. 10.1016/j.euprot.2015.02.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Gutierrez N. M.; Cribbie R. Effect sizes for equivalence testing: Incorporating the equivalence interval. Methods Psychol. 2023, 9, 100127. 10.1016/j.metip.2023.100127. [DOI] [Google Scholar]
  49. Lakens D. Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses. Soc. Psychol. Personal. Sci. 2017, 8, 355–362. 10.1177/1948550617697177. [DOI] [PMC free article] [PubMed] [Google Scholar]
  50. Campbell H.; Gustafson P.. What to make of equivalence testing with a post-specified margin?. Meta-Psychol. 2021, 5. 10.15626/mp.2020.2506. [DOI] [Google Scholar]
  51. Goeminne L. J. E.; Gevaert K.; Clement L. Peptide-level Robust Ridge Regression Improves Estimation, Sensitivity, and Specificity in Data-dependent Quantitative Label-free Shotgun Proteomics. Mol. Cell. Proteomics 2016, 15, 657–668. 10.1074/mcp.M115.055897. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Bludau I.; Frank M.; Dörig C.; Cai Y.; Heusel M.; Rosenberger G.; Picotti P.; Collins B. C.; Röst H.; Aebersold R. Systematic detection of functional proteoform groups from bottom-up proteomic datasets. Nat. Commun. 2021, 12, 3810. 10.1038/s41467-021-24030-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  53. Bai M.; Deng J.; Dai C.; Pfeuffer J.; Sachsenberg T.; Perez-Riverol Y. LFQ-Based Peptide and Protein Intensity Differential Expression Analysis. J. Proteome Res. 2023, 22, 2114–2123. 10.1021/acs.jproteome.2c00812. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Schwämmle V.; Hagensen C. E.; Rogowska-Wrzesinska A.; Jensen O. N. PolySTest: Robust Statistical Testing of Proteomics Data with Missing Values Improves Detection of Biologically Relevant Features. Mol. Cell. Proteomics 2020, 19, 1396–1408. 10.1074/mcp.RA119.001777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  55. Suomi T.; Elo L. L. Enhanced differential expression statistics for data-independent acquisition proteomics. Sci. Rep. 2017, 7, 5869. 10.1038/s41598-017-05949-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Zhang Y. pepDESC: A Method for the Detection of Differentially Expressed Proteins for Mass Spectrometry-Based Single-Cell Proteomics Using Peptide-level Information. Mol. Cell. Proteomics 2023, 22, 100583. 10.1016/j.mcpro.2023.100583. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Kohler D.; Staniak M.; Tsai T.-H.; Huang T.; Shulman N.; Bernhardt O. M.; MacLean B. X.; Nesvizhskii A. I.; Reiter L.; Sabido E.; et al. MSstats Version 4.0: Statistical Analyses of Quantitative Mass Spectrometry-Based Proteomic Experiments with Chromatography-Based Quantification at Scale. J. Proteome Res. 2023, 22, 1466–1482. 10.1021/acs.jproteome.2c00834. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Tyanova S.; Temu T.; Sinitcyn P.; Carlson A.; Hein M. Y.; Geiger T.; Mann M.; Cox J. The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods 2016, 13, 731–740. 10.1038/nmeth.3901. [DOI] [PubMed] [Google Scholar]
  59. Ritchie M. E.; Phipson B.; Wu D.; Hu Y.; Law C. W.; Shi W.; Smyth G. K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015, 43, e47 10.1093/nar/gkv007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Ergin E. K.; Uzozie A. C.; Chen S.; Su Y.; Lange P. F. SQuAPP—simple quantitative analysis of proteins and PTMs. Bioinformatics 2022, 38, 4956–4958. 10.1093/bioinformatics/btac628. [DOI] [PubMed] [Google Scholar]
  61. Zhang L.; Li W.-H. Mammalian housekeeping genes evolve more slowly than tissue-specific genes. Mol. Biol. Evol. 2004, 21, 236–239. 10.1093/molbev/msh010. [DOI] [PubMed] [Google Scholar]
  62. Uhlén M.; Fagerberg L.; Hallström B. M.; Lindskog C.; Oksvold P.; Mardinoglu A.; Sivertsson Å.; Kampf C.; Sjöstedt E.; Asplund A.; et al. Tissue-based map of the human proteome. Science 2015, 347, 1260419. 10.1126/science.1260419. [DOI] [PubMed] [Google Scholar]
  63. Joshi C. J.; Ke W.; Drangowska-Way A.; O’Rourke E. J.; Lewis N. E. What are housekeeping genes?. PLoS Comput. Biol. 2022, 18, e1010295 10.1371/journal.pcbi.1010295. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Hounkpe B. W.; Chenou F.; de Lima F.; De Paula E. HRT Atlas v1.0 database: redefining human and mouse housekeeping genes and candidate reference transcripts by mining massive RNA-seq datasets. Nucleic Acids Res. 2021, 49, D947–D955. 10.1093/nar/gkaa609. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. Jiang L.; Wang M.; Lin S.; Jian R.; Li X.; Chan J.; Dong G.; Fang H.; Robinson A. E.; Snyder M. P.; et al. A quantitative proteome map of the human body. Cell 2020, 183, 269–283.e19. 10.1016/j.cell.2020.08.036. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Payne S. H. The utility of protein and mRNA correlation. Trends Biochem. Sci. 2015, 40, 1–3. 10.1016/j.tibs.2014.10.010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Takemon Y.; Chick J. M.; Gyuricza I. G.; Skelly D. A.; Devuyst O.; Gygi S. P.; Churchill G. A.; Korstanje R. Proteomic and transcriptomic profiling reveal different aspects of aging in the kidney. Elife 2021, 10, e62585 10.7554/elife.62585. [DOI] [PMC free article] [PubMed] [Google Scholar]



Articles from Journal of Proteome Research are provided here courtesy of American Chemical Society