Abstract
Summary
We present promor, a comprehensive, user-friendly R package that streamlines the analysis of label-free quantification proteomics data and the building of machine learning-based predictive models with top protein candidates.
Availability and implementation
promor is freely available as an open source R package on the Comprehensive R Archive Network (CRAN) (https://CRAN.R-project.org/package=promor) and is distributed under the Lesser General Public License (version 2.1 or later). The development version of promor is maintained on GitHub (https://github.com/caranathunge/promor), and additional documentation and tutorials are provided on the package website (https://caranathunge.github.io/promor/).
Supplementary information
Supplementary data are available at Bioinformatics Advances online.
1 Introduction
Label-free quantification (LFQ) approaches are commonly used in mass spectrometry-based proteomics. One of the most widely used software tools for protein identification and quantification is MaxQuant (Tyanova et al., 2016a). The downstream analysis of MaxQuant output files can be complex and often challenging for those inexperienced in proteomics data analysis. Some tools available for this purpose are implemented as graphical user interface (GUI) applications [e.g. LFQ-Analyst (Shah et al., 2019), ProVision (Gallant et al., 2020), ProteoSign (Efstathiou et al., 2017)], among which one of the most popular is the MaxQuant-associated tool Perseus (Tyanova et al., 2016b). Perseus is an extensive software suite that offers a range of features to analyze several different types of proteomics data. While Perseus is fairly easy to use, the user interface with its wide range of options can be overwhelming at times to new users. Furthermore, the inability to save previously used analytical settings in GUI applications such as Perseus may present challenges to researchers looking to standardize data analysis. Other tools, such as MSstats (Choi et al., 2014), protti (Quast et al., 2022), pmartR (Stratton et al., 2019) and DEP (Zhang et al., 2018), are primarily implemented as R packages and provide greater analytical flexibility and reproducibility to proteomics data analysis workflows. While all of these tools offer the analytical capability to perform the steps in typical proteomics data analysis workflows, users may need additional software to perform tasks specific to their research domain (e.g. clinical applications, biomarker discovery).
In recent years, machine learning (ML) has made its presence felt in the field of proteomics. Particularly in biomarker research, ML is becoming a popular tool to derive candidate biomarker panels from proteomics data (Bader et al., 2020; Virreira Winter et al., 2021). ML algorithms are now being widely employed to build proteomics-based predictive models of disease prognosis and diagnosis (Desaire et al., 2022; Mann et al., 2021). When building a proteomics-based predictive model, choosing a robust panel of protein candidates can greatly improve the accuracy of the model. In this regard, ML-based predictive models could benefit from narrowing down protein features to those that show significant differences in abundance between groups of interest. In the current landscape of proteomics data analytical tools, the capability to seamlessly transition from differential expression analysis to predictive modeling is limited. Recognizing this need, we developed promor, a comprehensive, user-friendly R package that streamlines differential expression analysis and predictive modeling of label-free proteomics data. promor provides an all-in-one reproducible workflow that integrates tools to perform quality control, visualization and differential expression analysis of label-free proteomics data. Furthermore, promor integrates tools to build ML-based predictive models using top protein candidates identified through differential expression analysis, assess model performance, determine feature importance and estimate the predictive power of the models.
2 Overview
2.1 Implementation
promor is implemented in R and relies on packages such as imputeLCMD (Lazar et al., 2016), limma (Ritchie et al., 2015) and caret (Kuhn, 2008) for back-end pre-processing, differential expression analysis and ML-based modeling, respectively. As input, promor requires a user-generated tab-delimited text file containing the experimental design and either a MaxQuant-produced ‘proteinGroups.txt’ file or a standard quantitative table of protein intensities, which could be produced by any proteomic data analysis software. For visualization, promor employs the popular ggplot2 (Wickham et al., 2016) architecture and produces ggplot objects, which allow for further customization (Fig. 1).
Fig. 1.
An overview of suggested promor workflows. (A) Proteomics data analysis workflow includes analytical functions for pre-processing, quality control, missing data imputation, data normalization and differential expression analysis. (B) Modeling workflow includes analytical functions for pre-processing the output of differential expression analysis, model building and model evaluation. (C) Several plotting functions are provided to visualize data and produce publication-ready figures using color blind-friendly palettes
2.2 Proteomics data analysis
promor can be used to analyze any bottom-up label-free proteomics data (e.g. raw, LFQ or iBAQ). Multiple functions are provided for quality control, visualization, missing data imputation, normalization and differential expression analysis (Table 1 and Fig. 1A).
Table 1.
Analytical and visualization functions in promor
Function name | Input | Tasks | Output |
---|---|---|---|
create_df | proteinGroups.txt, experimental design file | Creates a data frame of LFQ protein intensities. Removes contaminant proteins, proteins identified only by site, reverse sequence proteins and proteins identified by two or fewer unique peptides. Converts zeros to missing values. Log2 transforms the values. | raw_df |
aver_techreps | raw_df | If technical replicates are present in the data, computes average intensity across technical replicates for each sample. | raw_df |
filterbygroup_na | raw_df | Filters out proteins with >34% missing values (<66% valid values) in at least one of the groups. | raw_df |
impute_na | raw_df/norm_df | Imputes missing values using the ‘minProb’ method. | imp_df |
normalize_data | raw_df/imp_df | Normalizes the data using the ‘quantile’ method. | norm_df |
find_dep | norm_df/imp_df | Identifies differentially expressed proteins with an absolute log2 fold change >1 at an adjusted P-value <0.05. | fit_df |
pre_process | fit_df, norm_df/imp_df | Extracts protein intensity data for the top 20 differentially expressed proteins, removes proteins that show high pairwise correlation (>0.90) and converts the data into a format suitable for modeling. | model_df |
split_data | model_df | Splits data into training (70%) and test (30%) data sets while preserving the overall class distribution of the data. | split_df |
train_models | split_df | Trains ML models on the training data set using the default list of ML algorithms (‘svmRadial’, ‘glm’, ‘rf’, ‘xgbLinear’, ‘naive_bayes’), performs 10-fold cross validation three times, calculates re-sampling-based performance measures for the models and outputs the best model for each algorithm. | model_list |
test_models | model_list, split_df | Uses the models built using the training data to predict the test data. | probability_list |
corr_plot | raw_df | Generates scatter plots showing the correlation between pairs of technical replicates. | ggplot |
heatmap_na | raw_df | Generates a heatmap to show the missing data distribution in the matrix. | ggplot |
impute_plot | raw_df, imp_df | Generates a global density plot showing the data distribution before and after missing data imputation. | ggplot |
norm_plot | raw_df/imp_df, norm_df | Generates box plots showing the sample data distributions before and after data normalization. | ggplot |
heatmap_de | fit_df, norm_df/imp_df | Generates a heatmap of protein intensities for the top 20 differentially expressed proteins. | ggplot |
volcano_plot | fit_df | Generates a volcano plot highlighting significantly differentially expressed proteins (absolute log2 fold change >1 at an adjusted P-value <0.05). | ggplot |
feature_plot | model_df | Generates box plots showing protein intensity differences among groups (classes). | ggplot |
varimp_plot | model_list | Generates lollipop plots showing the importance of different proteins (features) in the models built. | ggplot |
performance_plot | model_list | Generates box plots showing the performance (accuracy and kappa) of models built using different ML algorithms. | ggplot |
roc_plot | probability_list, split_df | Generates receiver operating characteristic (ROC) curves showing the predictive power of the models built using different ML algorithms. | ggplot |
This table describes the tasks and the output produced by the functions under default settings.
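Two of the defaults listed in Table 1, quantile normalization and the differential expression thresholds (absolute log2 fold change >1, adjusted P-value <0.05), can be illustrated with a minimal base R sketch. This is not promor's or limma's code: the toy matrix and the helper names `quantile_normalize` and `call_de` are hypothetical, a per-protein Welch t-test stands in for limma's moderated t-test, and Benjamini-Hochberg adjustment is an assumption.

```r
# Toy log2 intensities (hypothetical): 200 proteins x 6 samples, 3 per group.
set.seed(1)
group <- rep(c("A", "B"), each = 3)
mat <- matrix(rnorm(200 * 6, mean = 25, sd = 0.3), nrow = 200,
              dimnames = list(paste0("prot", 1:200), paste0(group, 1:3)))
mat[1:20, group == "A"] <- mat[1:20, group == "A"] + 3  # shifted proteins

# Quantile normalization for a complete matrix (no NAs): every sample is
# mapped onto the same reference distribution, the mean of the sorted columns.
quantile_normalize <- function(m) {
  ranks <- apply(m, 2, rank, ties.method = "first")
  target <- unname(rowMeans(apply(m, 2, sort)))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(m)
  out
}

# Flag proteins by fold change and adjusted P-value.
# ASSUMPTION: Welch t-test + BH adjustment, used here in place of limma.
call_de <- function(m, group, lfc = 1, alpha = 0.05) {
  g <- levels(factor(group))
  a <- m[, group == g[1], drop = FALSE]
  b <- m[, group == g[2], drop = FALSE]
  log_fc <- rowMeans(a) - rowMeans(b)
  p <- vapply(seq_len(nrow(m)),
              function(i) t.test(a[i, ], b[i, ])$p.value, numeric(1))
  adj_p <- p.adjust(p, method = "BH")
  data.frame(protein = rownames(m), log_fc = log_fc, adj_p = adj_p,
             significant = abs(log_fc) > lfc & adj_p < alpha)
}

norm_mat <- quantile_normalize(mat)
res <- call_de(norm_mat, group)
table(res$significant)
```

Note that quantile normalization assumes most proteins do not change between samples; when a large fraction of the proteome shifts, it can shrink genuine differences.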
To demonstrate the utility of promor for analyzing label-free proteomics data that do not contain technical replicates, we analyzed a previously published proteome benchmark data set by Cox et al. (2014) (PRIDE ID: PXD000279). The data set consists of LFQ protein intensity data for 6694 proteins quantified from HeLa (H) and Escherichia coli (L) lysates that were mixed at defined ratios. There were six samples in total, with three biological replicates representing each of the two groups. The results from the analysis were visualized at multiple stages (Supplementary Figs S1–S5). First, we pre-processed the data using the create_df function with default settings. The create_df function removed contaminant proteins, proteins identified ‘only-by-site’, reverse sequence proteins and proteins identified by two or fewer unique peptides. To remove proteins with a high proportion of missing values, we used the filterbygroup_na function, setting the highest allowed missing data percentage in either group at 40%. Next, we imputed the missing data in the data frame using the impute_na function with the default ‘minProb’ method, assuming that the missing values are left-censored. Because the data had already been normalized with the MaxLFQ algorithm (Cox et al., 2014) in MaxQuant, we did not further normalize the data in promor. The output of imputation (imp_df object) was used in the differential expression analysis, performed using the default settings in the find_dep function. We identified 1294 significantly differentially expressed proteins between the ‘H’ and ‘L’ groups in the data (Supplementary Table S1 and Supplementary Figs S4 and S5).
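The filtering and imputation steps described above can be sketched in base R on a toy matrix. This is a simplified illustration, not the code used by promor or imputeLCMD: the data, the helper names `filter_by_group_na` and `impute_left_censored`, and the parameters `q` and `sigma_scale` are hypothetical. The imputation draws replacement values from a narrow normal distribution centered on a low quantile of each sample, which is the general idea behind treating missing values as left-censored.

```r
set.seed(42)
# Hypothetical raw intensities: 8 proteins x 6 samples (groups H and L).
group <- rep(c("H", "L"), each = 3)
intensity <- matrix(2^rnorm(48, mean = 25, sd = 2), nrow = 8,
                    dimnames = list(paste0("prot", 1:8),
                                    paste0(group, "_", 1:3)))
intensity[1, 1:2] <- 0   # 2 of 3 values missing in group H
intensity[2, 4]   <- 0   # 1 of 3 values missing in group L
intensity[3, 1:3] <- 0   # no valid value in group H

# Zeros become missing values, then intensities are log2 transformed.
log_int <- intensity
log_int[log_int == 0] <- NA
log_int <- log2(log_int)

# Keep a protein only if no group exceeds the allowed fraction of NAs.
filter_by_group_na <- function(m, group, max_na = 0.40) {
  per_group <- lapply(unique(group), function(g) {
    rowMeans(is.na(m[, group == g, drop = FALSE]))
  })
  worst <- Reduce(pmax, per_group)
  m[worst <= max_na, , drop = FALSE]
}

# Replace NAs in each sample with draws from N(low quantile, shared sd).
# ASSUMPTION: simplified stand-in for the 'minProb' approach.
impute_left_censored <- function(m, q = 0.01, sigma_scale = 1) {
  draw_sd <- sigma_scale * median(apply(m, 1, sd, na.rm = TRUE), na.rm = TRUE)
  for (j in seq_len(ncol(m))) {
    miss <- is.na(m[, j])
    if (any(miss)) {
      low <- unname(quantile(m[, j], probs = q, na.rm = TRUE))
      m[miss, j] <- rnorm(sum(miss), mean = low, sd = draw_sd)
    }
  }
  m
}

filtered <- filter_by_group_na(log_int, group, max_na = 0.40)
imputed  <- impute_left_censored(filtered)
```

With three replicates per group, a 40% limit tolerates one missing value per group (33%) but not two (67%), so `prot1` and `prot3` are dropped while `prot2` is kept and imputed.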
Furthermore, to test the utility of promor for analyzing label-free proteomics data that contain technical replicates, we analyzed previously published data by Ramond et al. (2015) (PRIDE ID: PXD001584). This data set consists of LFQ protein intensity data obtained from two strains (WT, wild type; D8, argP mutant) of Francisella tularensis, a pathogenic bacterium responsible for the zoonotic disease tularemia. The proteinGroups.txt file contained LFQ data for 1265 proteins across 18 samples representing the two conditions (WT and D8), with three biological replicates in each condition and three technical replicates for each biological replicate. A step-by-step tutorial providing a detailed description of the workflow and the implementation choices is provided here: https://caranathunge.github.io/promor/articles/promor_with_techreps.html
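For a design like this one (2 conditions, 3 biological replicates, 3 technical replicates), averaging technical replicates collapses 18 columns into 6. The following base R sketch shows the idea under stated assumptions: the column naming scheme (`condition_biorep_techrep`), the toy values and the helper `average_tech_reps` are hypothetical and do not reproduce promor's aver_techreps implementation.

```r
set.seed(7)
# Hypothetical layout: tech replicate varies fastest, then biological
# replicate, then condition, giving 18 columns in total.
design <- expand.grid(tech = 1:3, bio = 1:3, condition = c("WT", "D8"))
design$sample <- paste(design$condition, design$bio, sep = "_")

lfq <- matrix(rnorm(5 * 18, mean = 24, sd = 1), nrow = 5,
              dimnames = list(paste0("prot", 1:5),
                              paste(design$sample, design$tech, sep = "_")))
lfq[1, 1] <- NA   # one missing technical replicate

# Average technical replicates within each biological sample, ignoring NAs.
# A sample with no valid technical replicate stays NA.
average_tech_reps <- function(m, sample_id) {
  samples <- unique(sample_id)
  out <- sapply(samples, function(s) {
    rowMeans(m[, sample_id == s, drop = FALSE], na.rm = TRUE)
  })
  out[is.nan(out)] <- NA
  out
}

averaged <- average_tech_reps(lfq, design$sample)
dim(averaged)
```

Averaging before differential expression analysis keeps technical replicates from being counted as independent observations.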
2.3 Building predictive models
In promor, multiple functions are provided to build predictive models with differentially expressed proteins and assess model performance (Table 1 and Fig. 1B). Over 200 ML algorithms are made accessible through the caret package (Kuhn, 2008) for building predictive models. For users inexperienced in complex ML algorithms, promor provides a default list of five widely used classification-based algorithms, chosen to represent a variety of ML model types (e.g. random forest, support vector machines, generalized linear models, naive Bayes and gradient boosting). However, while many different algorithms can be applied to proteomics data, it is important to note that not all of them are well-suited to address the problem at hand. The choice of ML algorithms should be carefully decided according to the prediction task, data type, sample size and the number of features (proteins) in the data set.
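Two preparatory steps of the modeling workflow summarized in Table 1, removing highly correlated proteins and making a class-preserving training/test split, can be sketched in base R as follows. The data and the helpers `drop_high_corr` and `stratified_split` are hypothetical; the greedy rule used here (drop a feature if it correlates above the cutoff with any earlier feature) is a simplification and is not claimed to match the rule used by promor or caret.

```r
set.seed(123)
# Hypothetical modeling data: 40 samples, 5 protein features, 2 classes.
n <- 40
cls <- factor(rep(c("severe", "non_severe"), times = c(16, 24)))
x <- matrix(rnorm(n * 5), nrow = n,
            dimnames = list(NULL, paste0("prot", 1:5)))
x[, "prot2"] <- x[, "prot1"] + rnorm(n, sd = 0.05)  # near copy of prot1

# Drop a feature when |r| with any earlier feature exceeds the cutoff.
drop_high_corr <- function(x, cutoff = 0.90) {
  cm <- abs(cor(x))
  cm[upper.tri(cm, diag = TRUE)] <- 0   # keep each pair once
  redundant <- apply(cm, 1, function(r) any(r > cutoff))
  x[, !redundant, drop = FALSE]
}

# Sample the same fraction from every class so that the training set
# preserves the overall class distribution.
stratified_split <- function(y, train_frac = 0.70) {
  idx <- lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = round(train_frac * length(i)))]
  })
  sort(unname(unlist(idx)))
}

x_filtered <- drop_high_corr(x)
train_idx  <- stratified_split(cls)
train_x <- x_filtered[train_idx, , drop = FALSE]
test_x  <- x_filtered[-train_idx, , drop = FALSE]
table(cls[train_idx])
```

Removing near-duplicate features matters most for small sample sizes, where redundant predictors add variance without adding information.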
We tested the use of promor for building predictive models by analyzing a previously published data set by Suvarna et al. (2021) (PRIDE ID: PXD022296). In the original study, the authors built proteomics-based classification models to predict COVID-19 severity in patients. To avoid class imbalance in the data, only a subset of the samples was used from the original proteinGroups.txt file. The steps leading up to differential expression analysis are described in detail here: https://caranathunge.github.io/promor/articles/promor_for_modeling.html. The results from differential expression analysis (fit_df object) and the normalized data frame (norm_df object) were used in the modeling workflow. The fit_df and norm_df objects were pre-processed with the pre_process function to convert the data into a model_df object. Next, we split the data into training and test data sets using the split_data function. The training data set contained 70% of the data (29 samples), while the test data set contained the remaining 30% (6 samples). The train_models function was run on the training data set in the split_df object with four selected ML algorithms: random forest (rf), support vector machine with linear kernel (svmLinear), naive Bayes (naive_bayes) and K-nearest neighbor (knn). The four algorithms were chosen based on their suitability for building models using few features (8 proteins) and samples (35 samples). Furthermore, repeated k-fold cross-validation (k = 10, repeats = 3) was employed to evaluate model performance. The output was used to test the models on the test data set included in the split_df object. The results from the analysis were visualized at multiple levels during the modeling workflow (Supplementary Figs S6–S9). The model built with the ‘naive_bayes’ algorithm performed best in terms of accuracy (85.5%) and Area Under the Curve (AUC = 88.9%) (Supplementary Fig. S9).
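The evaluation quantities mentioned above can be made concrete with a short base R sketch. It shows how fold assignments for repeated k-fold cross-validation might be generated, and how accuracy and AUC are computed from predicted class probabilities. All names and numbers are hypothetical toy values, not results from the study; the folds here are not stratified by class, and AUC is computed with the rank-sum (Mann-Whitney) identity rather than by tracing a ROC curve.

```r
set.seed(2024)

# Fold labels for repeated k-fold cross-validation: one shuffled vector of
# fold ids per repeat (ASSUMPTION: plain random folds, no stratification).
repeated_folds <- function(n, k = 10, repeats = 3) {
  lapply(seq_len(repeats), function(r) sample(rep_len(seq_len(k), n)))
}
folds <- repeated_folds(n = 29, k = 10, repeats = 3)

# Accuracy: share of samples whose thresholded prediction matches the truth.
accuracy <- function(prob, truth, positive, negative, threshold = 0.5) {
  pred <- ifelse(prob >= threshold, positive, negative)
  mean(pred == truth)
}

# AUC: probability that a random positive sample is ranked above a random
# negative one, obtained from the sum of ranks of the positive samples.
auc_rank <- function(prob, truth, positive) {
  pos <- truth == positive
  n_pos <- sum(pos)
  n_neg <- sum(!pos)
  r <- rank(prob)   # average ranks, so ties count as one half
  (sum(r[pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical test-set predictions: probability of the "severe" class.
truth <- rep(c("severe", "non_severe"), each = 4)
prob  <- c(0.9, 0.7, 0.6, 0.3, 0.8, 0.4, 0.2, 0.1)

acc <- accuracy(prob, truth, positive = "severe", negative = "non_severe")
auc <- auc_rank(prob, truth, positive = "severe")
```

Accuracy depends on the chosen probability threshold, whereas AUC summarizes ranking quality over all thresholds, which is why the two are often reported together.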
2.4 Benchmarking
We compared the performance of promor against Perseus using the previously mentioned Cox et al. (2014) (PRIDE ID: PXD000279) data set. The same workflow and parameters as those described in Section 2.2 were used in Perseus. In Perseus, we used the imputeLCMD plugin to implement the ‘minProb’ imputation method, and the limma plugin to implement the moderated t-test. We observed a substantial overlap in the differentially expressed proteins identified by the two programs (98.85%) (Supplementary Tables S1 and S2 and Fig. 2A). The small number of proteins identified by only one program could be attributed to the random sampling during missing value imputation. Furthermore, the calculated log-fold changes and P-values were strongly correlated between the two programs (Fig. 2B and C). R code for the benchmarking analysis is provided on GitHub at https://github.com/caranathunge/promor_bioRxiv_preprint
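A comparison of this kind reduces to set operations on the two hit lists and a correlation of the shared statistics. The base R sketch below uses made-up protein identifiers and fold changes; the overlap is computed as shared hits divided by the union of both lists, which is one possible definition and an assumption here, since the text does not state which denominator was used.

```r
set.seed(99)
# Hypothetical hit lists from two tools.
hits_a <- paste0("P", 1:100)
hits_b <- paste0("P", 3:101)

shared <- intersect(hits_a, hits_b)
only_a <- setdiff(hits_a, hits_b)
only_b <- setdiff(hits_b, hits_a)

# ASSUMPTION: overlap expressed relative to the union of both hit lists.
overlap_pct <- 100 * length(shared) / length(union(hits_a, hits_b))

# Hypothetical log2 fold changes for the shared proteins; tool B differs
# from tool A only by small noise, mimicking imputation randomness.
lfc_a <- setNames(rnorm(length(shared), sd = 2), shared)
lfc_b <- lfc_a + rnorm(length(shared), sd = 0.05)

# Pearson correlation of the two sets of estimates.
lfc_cor <- cor(lfc_a, lfc_b)

c(shared = length(shared), only_a = length(only_a),
  only_b = length(only_b), overlap_pct = overlap_pct, lfc_cor = lfc_cor)
```

Fixing the random seed before imputation in both tools, where that is possible, would help separate differences caused by sampling from differences caused by the implementations.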
Fig. 2.
A comparison between promor and Perseus using the proteome benchmark data set, Cox et al. (2014). (A) A Venn diagram showing the overlap of the significantly differentially expressed proteins identified by promor and Perseus. Scatterplots of the resulting protein log2 fold changes (B) and log10 P-values (C) of differentially expressed proteins as calculated by promor and Perseus
3 Conclusions
We present promor, a user-friendly, comprehensive R package that facilitates seamless transition from differential expression analysis of label-free proteomics data to building predictive models with top protein candidates; a feature that could be particularly useful in clinical and biomarker research.
Acknowledgments
We wish to thank Asitha I. Senanayake for his helpful comments and discussions on software development.
Contributor Information
Chathurani Ranathunge, Eastern Virginia Medical School, School of Health Professions, Norfolk, VA 23501, USA.
Sagar S Patel, Eastern Virginia Medical School, School of Health Professions, Norfolk, VA 23501, USA.
Lubna Pinky, Eastern Virginia Medical School, School of Health Professions, Norfolk, VA 23501, USA.
Vanessa L Correll, The Leroy T. Canoles Jr. Cancer Research Center, Eastern Virginia Medical School, Norfolk, VA 23501, USA.
Shimin Chen, The Leroy T. Canoles Jr. Cancer Research Center, Eastern Virginia Medical School, Norfolk, VA 23501, USA.
O John Semmes, The Leroy T. Canoles Jr. Cancer Research Center, Eastern Virginia Medical School, Norfolk, VA 23501, USA.
Robert K Armstrong, Eastern Virginia Medical School, School of Health Professions, Norfolk, VA 23501, USA; Sentara Center for Simulation and Immersive Learning, Eastern Virginia Medical School, Norfolk, VA 23501, USA.
C Donald Combs, Eastern Virginia Medical School, School of Health Professions, Norfolk, VA 23501, USA.
Julius O Nyalwidhe, The Leroy T. Canoles Jr. Cancer Research Center, Eastern Virginia Medical School, Norfolk, VA 23501, USA.
Author contributions
Chathurani Ranathunge (Conceptualization, Data curation, Formal analysis, Methodology, Software, Validation, Visualization, Writing—original draft, Writing—review & editing [lead]), Sagar S. Patel (Formal analysis, Investigation, Validation, Writing—review & editing [supporting]), Lubna Pinky (Formal analysis, Investigation, Validation, Writing—review & editing [supporting]), Vanessa L. Correll (Formal analysis, Investigation, Validation, Writing—review & editing [supporting]), Shimin Chen (Formal analysis, Investigation, Validation, Writing—review & editing [supporting]), John Semmes (Funding acquisition, Supervision [supporting]), Robert K. Armstrong (Funding acquisition [supporting], Project administration [supporting], Resources [lead], Writing—review & editing [supporting]), C. Donald Combs (Funding acquisition [lead], Project administration [lead], Resources [lead], Writing—review & editing [supporting]), and Julius O. Nyalwidhe (Methodology [supporting], Supervision [lead], Writing—review & editing [supporting])
Funding
This work was supported by The Hampton Roads Biomedical Research Consortium (Digital Patient Project).
Conflict of Interest: none declared.
References
- Bader J.M. et al. (2020) Proteome profiling in cerebrospinal fluid reveals novel biomarkers of Alzheimer’s disease. Mol. Syst. Biol., 16, e9356.
- Choi M. et al. (2014) MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics, 30, 2524–2526.
- Cox J. et al. (2014) Accurate proteome-wide label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ. Mol. Cell. Proteomics, 13, 2513–2526.
- Desaire H. et al. (2022) Advances, obstacles, and opportunities for machine learning in proteomics. Cell Rep. Phys. Sci., 3, 101069.
- Efstathiou G. et al. (2017) ProteoSign: an end-user online differential proteomics statistical analysis platform. Nucleic Acids Res., 45, W300–W306.
- Gallant J.L. et al. (2020) ProVision: a web-based platform for rapid analysis of proteomics data processed by MaxQuant. Bioinformatics, 36, 4965–4967.
- Kuhn M. (2008) Building predictive models in R using the caret package. J. Stat. Software, 28, 1–26.
- Lazar C. et al. (2016) Accounting for the multiple natures of missing values in label-free quantitative proteomics data sets to compare imputation strategies. J. Proteome Res., 15, 1116–1125.
- Mann M. et al. (2021) Artificial intelligence for proteomics and biomarker discovery. Cell Syst., 12, 759–770.
- Quast J.P. et al. (2022) protti: an R package for comprehensive data analysis of peptide- and protein-centric bottom-up proteomics data. Bioinform. Adv., 2, vbab041.
- Ramond E. et al. (2015) Importance of host cell arginine uptake in Francisella phagosomal escape and ribosomal protein amounts. Mol. Cell. Proteomics, 14, 870–881.
- Ritchie M.E. et al. (2015) limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res., 43, e47.
- Shah A.D. et al. (2019) LFQ-Analyst: an easy-to-use interactive web platform to analyze and visualize label-free proteomics data preprocessed with MaxQuant. J. Proteome Res., 19, 204–211.
- Stratton K.G. et al. (2019) pmartR: quality control and statistics for mass spectrometry-based biological data. J. Proteome Res., 18, 1418–1425.
- Suvarna K. et al. (2021) Proteomics and machine learning approaches reveal a set of prognostic markers for COVID-19 severity with drug repurposing potential. Front. Physiol., 12, 432.
- Tyanova S. et al. (2016a) The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nat. Protoc., 11, 2301–2319.
- Tyanova S. et al. (2016b) The Perseus computational platform for comprehensive analysis of (prote)omics data. Nat. Methods, 13, 731–740.
- Virreira Winter S. et al. (2021) Urinary proteome profiling for stratifying patients with familial Parkinson’s disease. EMBO Mol. Med., 13, e13257.
- Wickham H. et al. (2016) Package ‘ggplot2’. Create elegant data visualisations using the grammar of graphics. Version 2, 1–189.
- Zhang X. et al. (2018) Proteome-wide identification of ubiquitin interactions using UbIA-MS. Nat. Protoc., 13, 530–550.