Abstract
Proteogenomics involves the integrative analysis of genomic, transcriptomic, proteomic and post-translational modification data produced by next-generation sequencing and mass spectrometry-based proteomics. Several publications by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and others have highlighted the impact of proteogenomics in enabling deeper insight into the biology of cancer and identification of potential drug targets. In order to encapsulate the complex data processing required for proteogenomics, and provide a simple interface to deploy a range of algorithms developed for data analysis, we have developed PANOPLY—a cloud-based platform for automated and reproducible proteogenomic data analysis. A wide array of algorithms have been implemented, and we highlight the application of PANOPLY to the analysis of cancer proteogenomic data.
Proteogenomics involves the integrative analysis of genomic, transcriptomic, proteomic and post-translational modification (PTM) data produced by next-generation sequencing and mass spectrometry-based proteomics. Several publications by the Clinical Proteomic Tumor Analysis Consortium (CPTAC) and others have highlighted the impact of proteogenomics in enabling deeper insight into the biology of cancer and identification of potential drug targets1–4. In order to encapsulate the complex data processing required for proteogenomics, and provide a simple interface to deploy a range of algorithms developed for data analysis, we have developed PANOPLY—a cloud-based platform for automated and reproducible proteogenomic data analysis. A wide array of algorithms have been implemented, and we highlight the application of PANOPLY to the analysis of cancer proteogenomic data.
PANOPLY uses state-of-the-art statistical and machine learning algorithms to transform multi-omic data from cancer samples into biologically meaningful and interpretable results. PANOPLY provides a comprehensive collection of proteogenomic data analysis methods including sample QC (sample quality evaluation using profile plots and tumor purity scores1, identify sample swaps, etc.), association analysis, RNA and copy number correlation (to proteome), connectivity map (CMAP) analysis5, outlier analysis using BlackSheep6, PTM Signature Enrichment Analysis (PTM-SEA)7, Gene Set Enrichment Analysis (GSEA)8 and single-sample GSEA7, consensus clustering, and multi-omic clustering using non-negative matrix factorization (NMF) (Figure 1A). Most analysis modules include a report generation task that outputs an interactive HTML report summarizing results from the analysis (Figure 1B). A complete list of PANOPLY task modules with documentation can be found at the PANOPLY wiki (https://github.com/broadinstitute/PANOPLY/wiki). The types of liquid chromatography-tandem mass spectrometry-based (LC-MS/MS) proteomics data that are amenable to analysis by PANOPLY includes label-free and isobaric mass tag label-based LC-MS/MS approaches like iTRAQ and TMT profiling of the proteome and multiple PTM-omes including phospho-, acetyl- and ubiquitylomes. PANOPLY can analyze LC-MS/MS data input as (log-transformed) intensities or ratios.
PANOPLY leverages Terra (http://app.terra.bio )—a Google Cloud-based cloud-native platform for extreme-scale data analysis, sharing, and collaboration—to host proteogenomic workflows, and is designed to be flexible, automated, reproducible, scalable, and secure. Using the underlying architecture of Terra, PANOPLY analysis algorithms and methods are implemented as tasks, interchangeably referred to as modules. A series of tasks constitutes a workflow, which represents a pipeline with outputs of one or more tasks feeding into inputs of downstream tasks. Tasks and workflows are defined using the Workflow Description Language (WDL), allowing users to customize pipelines and quickly add their own tasks to the PANOPLY workflow. The central element of PANOPLY is the Terra workspace—a shareable collection of everything needed for a data analysis project, including data located on cloud storage linked with the workspace, workflows encapsulating algorithms and pipelines, analysis parameters, results, and a data model that organizes sample meta-data into collections of participants, samples, and sample sets. With versioned workflows, workspaces enable reproducible research by completely encapsulating input data, analysis methods, settings, and results in a shareable cloud-based space.
PANOPLY v1.0 consists of the following components:
A workspace (https://app.terra.bio/#workspaces/broad-firecloud-cptac/PANOPLY_Production_Pipelines_v1_0) with a preconfigured unified workflow to automatically run all analysis tasks on proteomics (global proteome, phosphoproteome, acetylome, ubiquitylome), transcriptome and copy number data; and an additional workspace (https://app.terra.bio/#workspaces/broad-firecloud-cptac/PANOPLY_Production_Modules_v1_0) that includes separate methods for each analysis component.
An interactive Jupyter notebook that provides step-by-step instructions for uploading data, identifying data types, specifying parameters, and setting up the PANOPLY workspace for analyzing a new dataset.
A GitHub repository (https://github.com/broadinstitute/PANOPLY) with code, documentation and description of algorithms.
A tutorial (https://github.com/broadinstitute/PANOPLY/wiki/PANOPLY-Tutorial) illustrating the application of PANOPLY (https://app.terra.bio/#workspaces/broad-firecloud-cptac/PANOPLY_Tutorial) to a published breast cancer dataset and demonstrating the practical relevance of PANOPLY by regenerating many of the results described in (Mertins et al)1 with minimal effort.
PANOPLY has been used for analyzing data from many CPTAC and other cancer proteogenomic landscape studies, and workspaces with results for published data sets, including CPTAC BRCA1,4 (https://app.terra.bio/#workspaces/broad-firecloud-cptac/PANOPLY_CPTAC_LUAD) and LUAD3 (https://app.terra.bio/#workspaces/broad-firecloud-cptac/PANOPLY_CPTAC_LUAD) are available. While PANOPLY has been developed and presented in the context of cancer proteogenomic studies, it can be applied to any proteogenomics study with appropriate data, vastly simplifying the analysis and integration process. While the current release of PANOPLY contains algorithms for integrative proteogenomic data analysis, future versions will include workflows to perform raw data characterization of genomics and proteomics data. New analysis methods from recent CPTAC proteogenomic studies will also be added in new releases. We envision that use of PANOPLY will provide a quick and comprehensive baseline analysis for proteogenomic studies, leading to many disease specific hypotheses that can be explored further using additional computational and wet-lab experiments.
Supplementary Material
Acknowledgements
This work was supported by grants from the National Cancer Institute (NCI) Clinical Proteomic Tumor Analysis Consortium grants NIH/NCI U24CA210979 (to DRM and GG) and NIH/NCI U24-CA210986 and NIH/NCI U01 CA214125 (to SAC).
Footnotes
Competing Interests
The authors declare no competing interests.
Data and Code Availability
All code and data are accessible online at GitHub (https://github.com/broadinstitute/PANOPLY) and Terra (https://app.terra.bio/). More detailed information is provided in Supplementary Table 1.
References
- 1.Mertins P et al. Proteogenomics connects somatic mutations to signalling in breast cancer. Nature 534, 55–62 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Zhang B et al. Proteogenomic characterization of human colon and rectal cancer. Nature 513, 382–387 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Gillette MA et al. Proteogenomic Characterization Reveals Therapeutic Vulnerabilities in Lung Adenocarcinoma. Cell 182, 200–225.e35 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Krug K et al. Proteogenomic Landscape of Breast Cancer Tumorigenesis and Targeted Therapy. Cell 183, 1–21 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5.Subramanian A et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell 171, 1437–1452.e17 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Blumenberg L et al. BlackSheep: A Bioconductor and Bioconda package for differential extreme value analysis. doi: 10.1101/825067. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Krug K et al. A Curated Resource for Phosphosite-specific Signature Analysis. Mol. Cell. Proteomics 18, 576–593 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Subramanian A et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 102, 15545–15550 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Yoshihara K et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun. 4, 1–11 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Aran D, Hu Z & Butte AJ xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 18, 220 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
All code and data are accessible online at GitHub (https://github.com/broadinstitute/PANOPLY) and Terra (https://app.terra.bio/). More detailed information is provided in Supplementary Table 1.