1. Introduction
Proteogenomics integrates two different research fields: mass spectrometry (MS) based proteomics and next-generation sequencing (NGS) based genomics, transcriptomics or translatomics. At the outset, proteomics data was used to aid genome annotation by refining gene models based on protein-level validation1. To this end, fragmentation spectra were identified using databases compiled from six reading frame translations of the complete genome2–5, gene predictions6, expressed sequence tags (ESTs)5, 7–9, homologous sequences10–12, or exon graph models13, 14, instead of reference protein databases such as UniProtKB15, RefSeq16 or Ensembl17. Currently, NGS techniques are becoming more widespread and inexpensive and are regularly carried out in parallel with matching MS-based proteomics experiments. Adding extra information gathered from genomics, transcriptomics or translatomics data, based respectively on genome sequencing, RNAseq and ribosome profiling (RIBOseq), results in a more comprehensive search space for MS/MS identification. Different types of information can be obtained from these NGS-based technologies to aid and refine the MS-based peptide and protein identification process: transcript abundance, translation efficiency, translation initiation site location, somatic versus germline mutations, splice variation and delineation of novel coding regions 18–21.
To help unravel proteome complexity, search engines scan these custom protein databases to identify alternative proteoforms22. These custom searches can also result in the identification of tumor-specific peptides23, 24 in onco-proteogenomics studies. The field of proteogenomics is rapidly expanding25, mainly because of the advent of new sequencing techniques and the increased sensitivity and throughput of recent MS-based proteomics. An excellent review on proteogenomics concepts and their applications is available26. In this review, we focus on the expanding toolset available for analyzing the merged NGS and MS datasets in proteogenomics experiments.
2. Proteogenomics goals
2.1. Aid Genome Annotation
Nowadays, due to the rapid growth of sequencing technologies, it is becoming straightforward to draft complete genomes of non-model organisms in a cost-effective way, which helps to interpret MS/MS data27. Previously, sequence information (genome and cDNA) was unavailable for these non-model organisms, resulting in incomplete or missing protein sequences. Only homology-based or de novo algorithms could be employed to identify peptides from fragmentation spectra. These homology-based algorithms allow for sequence-similarity searches28–30 against homologous protein sequences. A plethora of proteogenomics studies were successful in plants2, 10–12, 31–33, within blue biotechnology (review by Hartmann et al.34), in prokaryotes35–37, viruses38 and within environmental microbiology (review by Armengaud et al.39) using one or both of the following two rationales: (1) homology searches using the evolutionarily closest sequence (genome, cDNA, EST) template or (2) searches against an NGS-generated species-specific sequence template.
2.2. Unravel Proteome Complexity (Identify Proteoforms)
Extra information gained from sequencing-based technologies can be used to build custom, comprehensive protein databases for MS/MS spectra identification, so that several additional types of peptides can be identified. Figure 1 gives an overview of the different classes of peptides that can be identified in proteogenomics studies. The greater part of enzymatically cleaved peptides (e.g. using trypsin) will map to annotated protein-coding regions. The majority thereof will reside within one exonic region, but a minority can also overlap annotated transcript splice junction sites. In silico translated sequences of alternative splice isoforms, identified based on transcriptomics data (RNAseq), can be added to the protein sequence search database, enabling the identification of novel splice proteoforms40 and cross-junction peptides (covering an annotated exon-intron boundary) using MS-based techniques. Peptides can also map to untranslated (5’UTR and 3’UTR) or intronic regions, or can point to out-of-frame translation products. Peptides starting in a 5’UTR region can point to upstream open reading frame (uORF) translation products or N-terminally extended proteoforms. Peptides in the 3’UTR, on the other hand, could point to read-through events. In some rare cases, reverse-strand peptides can also be identified2.
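As a rough illustration of how enzymatically cleaved peptides are derived in silico from a database entry, the sketch below digests a protein sequence with a simplified trypsin rule (cleavage after K or R, except before P). The function name, the minimum peptide length of six residues and the handling of missed cleavages are assumptions made for illustration; they are not taken from any of the tools discussed in this review.

```python
import re


def tryptic_digest(protein, missed_cleavages=0, min_len=6):
    """Return tryptic peptides of `protein` (simplified rule, illustration only).

    Cleavage is assumed after K or R unless the next residue is P.
    Peptides shorter than `min_len` are dropped (assumed cutoff).
    """
    # Zero-width split: after K/R, but not when followed by P (Python >= 3.7).
    fragments = [f for f in re.split(r"(?<=[KR])(?!P)", protein) if f]
    peptides = []
    for i in range(len(fragments)):
        # Join up to `missed_cleavages` additional neighbouring fragments.
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptide = "".join(fragments[i:j + 1])
            if len(peptide) >= min_len:
                peptides.append(peptide)
    return peptides
```

Applying the same digestion to translated splice isoforms or UTR sequences would yield the candidate peptides that distinguish the classes described above, since only peptides absent from the annotated proteome are informative for novel events.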
Figure 1.
Classes of peptides identified in proteogenomics. A. A division of proteogenomics peptide types can be made based on the genomic region to which they map. The majority of enzymatically cleaved peptides map to coding genic locations (intragenic), whereas a small number also map to non-coding RNA and pseudogenes or intergenic regions. Exceptionally, peptides can point to chimeric proteins (fusion products in, for example, onco-proteogenomics studies) or could lead to gene fusion when gene-fusion peptides are identified. Of the intragenic subclass, the majority will map to one exon and a minority can span exon splice sites (possibly leading to alternative splice isoform identification). Proteogenomics can lead to the identification of novel peptides located in untranslated regions (5’ and 3’UTR) or intronic regions, internal out-of-frame peptides, peptides that reside on the reverse strand or single amino acid variant (SAV) peptides (introduced through genetic variation or RNA editing). Other novel findings can point to exon-intron junctions (cross-junction peptides). B. Another application of proteogenomics is the study of antibody or nanobody peptides in the highly variable regions. Here, sequencing of B-cells is combined with mass spectrometry of the blood antibodies or nanobodies after affinity selection. C. Venomics is another research field in which proteogenomics can be extremely useful. Here, a combination of RNAseq of the venom gland (of, for example, cone snails, spiders or snakes) and matching mass spectrometry boosts the identification rate of the impressively diverse arsenal of toxin peptides, which mostly carry multiple post-translational modifications.
Next to the aforementioned peptide classes that map to known protein-coding gene models, another type of intragenic peptide can be discovered that maps to non-coding regions such as (long intergenic) non-coding RNA (lincRNA) genes41 or pseudogenes6, 42, 43. Moreover, peptides can map to intergenic chromosome positions, located between known genic regions. These peptides can point to translation of small open reading frames (sORFs, see below) or unannotated regular coding sequences. In some rare cases, identified peptides can result in gene joining, by identifying longer full-length open reading frames. These are located within one exon, but can also span annotated splice junction sites44. Additional single nucleotide variation (SNV), insertion or deletion information, obtained from sequencing data, publicly available databases or RNA-editing sites45, 46, can also result in the identification of single amino acid variants (SAVs)47. In onco-proteogenomics studies21, 23, 24, usage of tumor-specific protein sequences including SNV data obtained from genomics and/or transcriptomics data regularly results in the identification of aberrant SAVs21. Peptides that point to fusion proteins (chimeric peptides)48 are also regularly investigated in cancer proteogenomics studies.
Recently, a novel translatomics technique called ribosome profiling was devised49–52. By sequencing ribosome-protected mRNA fragments, this method charts a genome-wide protein synthesis profile. Furthermore, (alternative) translation initiation sites (TIS) can be accurately predicted by exploiting antibiotics such as harringtonine53 or lactimidomycin54, which stall ribosomes at sites of translation initiation. Incorporation of this information into the protein sequence database enables (alternative) TIS validation by means of matching MS-based proteomics studies19, 55–57. This proved to be very successful, especially in combination with N-terminal COFRADIC (COmbined FRActional Diagonal Chromatography)58, 59, which isolates amino-terminal peptides. RIBOseq is also very useful in pinpointing translation of small open reading frames60–62. As already mentioned, these can be located in either intergenic or non-coding genic regions such as (li)ncRNA or in intronic and exonic regions of annotated genes (out-of-frame translation). Notably, previous efforts63, 64 combining peptidomics and massively parallel RNA sequencing have also resulted in the identification of short open reading frame (sORF)-encoded polypeptides (SEPs) in human cells, located both in coding genes (5’UTR, out-of-frame CDS, 3’UTR and antisense) and non-coding RNA.
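To make the notion of a small open reading frame concrete, the following sketch scans a transcript sequence for ATG-initiated ORFs that end at an in-frame stop codon and fall within a length window. The length bounds (10 to 100 codons), the restriction to ATG starts on the forward strand and the function name are assumptions chosen for illustration; ribosome profiling based approaches additionally rely on read coverage and on experimentally supported initiation sites, which are not modelled here.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(transcript, min_codons=10, max_codons=100):
    """Return (start, end) coordinates of candidate sORFs (illustration only).

    Coordinates are 0-based, half-open, and include the stop codon.
    The codon count excludes the stop codon. Only ATG starts on the
    given strand are considered (an assumption of this sketch).
    """
    seq = transcript.upper().replace("U", "T")
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk in-frame codons until the first stop codon.
        for pos in range(start + 3, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if min_codons <= n_codons <= max_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs
```

ORFs without an in-frame stop codon are silently skipped, which is a deliberate simplification for truncated transcripts.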
2.3. Antibody Sequencing
Antibody sequences are produced by seemingly random recombination of variable (V), diversity (D) and joining (J) genes followed by somatic hypermutation. Therefore, the space of potential amino acid sequences of antibodies that can be generated from a given genome is prohibitively large. Because no good sequence database was available, the only viable approach for sequencing antibodies with mass spectrometry was traditionally de novo sequencing, which often involves manual labor, is time consuming, and requires high-quality spectra with no missing information. However, targeted sequencing of the variable part of the antibody transcripts can provide a database that can be searched automatically, which increases the success rate of identification. The combination of next-generation sequencing of the transcripts, which attempts to provide the sequences for the entire antibody repertoire of the individual, and affinity mass spectrometry of the circulating antibodies that bind a specific antigen is a powerful proteogenomic method to sequence antibodies. This method has been applied to characterize circulating human antibodies in response to infection 65, 66, and to produce nanobodies as research reagents based on llama single chain antibodies 67.
2.4. Immunogenic Peptides
The application of proteogenomics approaches in the field of cancer immunology is becoming more widespread68, 69. Several studies describe the application of sequencing (whole exome and/or RNAseq) in combination with mass spectrometry to identify mutated major-histocompatibility complex class 1 (MHC1) presented peptides70, 71. These somatic mutations are contained in immunogenic peptides that can serve as T-cell epitopes if presented on the MHC1 molecules as they are recognized by the adaptive immune system. As more MS-based efforts are focusing on the identification and quantification of human leukocyte antigen (HLA) associated peptides72, 73, the integration with tumor-specific sequencing information becomes achievable.
2.5. Venomics
Another goal of proteogenomics is the identification of venom peptides from glands of different species (e.g. spiders74, cone snails75–79, snakes80, 81). These venoms are mostly studied because of their putative therapeutic application. Characteristic of these peptides is that they often carry disulfide bridges that are highly stable and resistant to proteolytic degradation, ensuring that the peptide toxins remain active after administration. Another characteristic, next to the high level of post-translational modifications (PTMs), is that the venom peptide arsenal (e.g. conopeptides from cone snails) demonstrates an impressive diversity, due to diversification mechanisms at work in the venom gland. To uncover the mechanism(s) that generate this heterogeneous pool of conotoxins with alternative cleavage sites, heterogeneous post-translational modifications and highly variable amino- and carboxy-terminal truncations, an integrated approach of NGS transcriptomics with highly sensitive MS-based proteomics is necessary75. Formerly, de novo identification strategies (see below in the peptide identification section) were mainly applied to identify these venom peptides. Nowadays, with the advent of NGS techniques, it has become possible to construct NGS-derived venomics search databases, enabling the usage of database search algorithms to identify these peptides.
3. Tools, databases and pipelines for proteogenomics studies
Figure 2 sketches a proteogenomic flowchart. Useful tools, databases and pipelines that can be applied throughout the different analysis steps are mentioned. The tools are subdivided based on the role they play within the proteogenomics process. Below we describe the available tools, databases and pipelines in more detail. Moreover, Table 1 gives an overview of the publicly available databases that can be used, and Table 2 lists all available tools and pipelines together with references to related literature and their website.
Figure 2.
Comprehensive overview of the proteogenomics workflow. A typical proteogenomics strategy consists of different steps. Novel peptide identifications are mostly obtained by scanning comprehensive, custom-built protein sequence databases using database search engines. (1. DB Creation): The search database creation step is very important and can hold sequences from annotated protein repositories, translated genomic and/or RNA sequences or processed sequences from sequence read archives. Furthermore, specific protein-related information can be incorporated in the search space. Routinely, the (very large) custom database is filtered to help manage its size. This filtering can be based on different criteria. The upper green box, connected to the arrow, gives an overview of the different proteogenomics tools reported to create such custom databases. (2. MS/MS data): Fragmentation spectra can be experimentally obtained or downloaded from public repositories. PRIDE, PeptideAtlas, MassIVE, ProteomicsDB, Chorus and CPTAC all hold MS/MS data of MS-based proteomics experiments on different cell/tissue types and species. (3. Peptide identification): As mentioned, the identification is mostly obtained using database searching (multiple search algorithms are available). Other, so-called hybrid tag-based or de novo methods have also been successfully applied in many proteogenomics experiments. (4. Validation & interpretation): After the identification step, validation and interpretation of the PSMs, peptides and proteins remains indispensable, using appropriate statistical significance estimation (FDR/PEP calculation). Further global annotation analysis based on gene ontology or protein interaction is also routinely performed. (5. Mapping & visualization): The vast amount of multi-omics data can be further mapped and visualized in a genome-centric way; many tools are available for both the mapping and the visualization step. Further integrative visualization of protein interaction networks based on this cross-omics data is also possible using tools such as Circos and Cytoscape. The lower green box, connected to the arrow, gives an overview of reported pipelines or solutions that combine the different necessary steps in a complete proteogenomics workflow.
Table 1.
Data Resources for Proteogenomics
| Resource | URL | Reference | Description |
|---|---|---|---|
| Sequence resources - General | | | |
| UniProtKB | www.uniprot.org | REF | Comprehensive, high-quality and freely accessible resource of protein sequence and functional information |
| RefSeq | www.ncbi.nlm.nih.gov/refseq | | Comprehensive, integrated, non-redundant, well-annotated set of reference sequences including genomic, transcript, and protein. |
| Ensembl | www.ensembl.org | | Genome databases for vertebrates and other eukaryotic species |
| Sequence resources - Specific | | | |
| TISdb | tisdb.human.cornell.edu | | Source of information for mRNA alternative translation |
| ChimerDB | ercsb.ewha.ac.kr/fusiongene | | Knowledgebase for fusion sequences |
| dbCRID | dbcrid.biolead.org | | Database of chromosomal rearrangements in human diseases |
| ChiTaRS | chitars.bioinfo.cnio.es | | Database of the chimeric transcripts and RNA-seq data |
| CanProVar | bioinfo.vanderbilt.edu/canprovar | PMID: 21389108 | Human Cancer Proteome Variation Database |
| dbSNP | www.ncbi.nlm.nih.gov/SNP | | Database of single nucleotide polymorphisms |
| COSMIC | cancer.sanger.ac.uk | | Catalogue Of Somatic Mutations In Cancer |
| HAltORF | www.roucoulab.com/haltorf | | Resource for the investigation of alternative Open Reading Frames (ORFs) in human mRNA sequences |
| UTRdb | utrdb.ba.itb.cnr.it | PMID: 19880380 | Curated database of 5' and 3' untranslated sequences of eukaryotic mRNAs, derived from several sources of primary data |
| uORFdb | cbdm.mdc-berlin.de/tools/uorfdb | PMID: 24163100 | Comprehensive literature database on eukaryotic uORF biology |
| sORFdb | www.sorfs.org | | Comprehensive repository of sORFs based on ribosome profiling |
| dbRES, DARNED | bioinfo.au.tsinghua.edu.cn/dbRES, darned.ucc.ie | PMID: 17088288, PMID: 23074185 | Databases for RNA Editing sites |
| Animal toxin annotation | www.uniprot.org/program/Toxins | | A systematical annotation of proteins secreted in animal venom |
| LNCiPedia | www.lncipedia.org | PMID: 25378313 | |
| OMIM | www.omim.org | | Catalog for Human Genes and Disorders |
| Public repos for proteomics data | | | |
| PRIDE | www.ebi.ac.uk/pride | | Centralized, standards compliant, public data repository for proteomics data (member of ProteomeXChange, www.proteomeXChange.org) |
| PeptideAtlas | www.peptideatlas.org | | Multi-organism, publicly accessible compendium of peptides identified in a large set of tandem MS proteomics experiments (ProteomeXChange) |
| MassIVE | massive.ucsd.edu | | Community resource to promote the global, free exchange of mass spectrometry data (ProteomeXChange) |
| ProteomicsDB | www.proteomicsdb.org | | Public in-memory database of the provisional human proteome map |
| Chorus | chorusproject.org | | Web application for storing, sharing, visualizing, and analyzing spectrometry files |
| CPTAC | https://cptac-data-portal.georgetown.edu | | Clinical Proteomic Tumor Analysis Consortium data portal |
| Public repos for sequencing data | | | |
| SRA | http://www.ncbi.nlm.nih.gov/sra | PMID: 22009675 | Raw sequencing and alignment information from high-throughput sequencing platforms |
Table 2.
Tools and algorithms for proteogenomics
| Tool | URL | Reference | Description |
|---|---|---|---|
| NGS Tools - Mappers | | | |
| Bowtie | bowtie-bio.sourceforge.net/bowtie2 | PMID: 22388286 | ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences |
| TopHat | ccb.jhu.edu/software/tophat/index.shtml | PMID: 23618408 | fast splice junction mapper for RNA-Seq reads |
| STAR | code.google.com/p/rna-star | PMID: 23104886 | ultrafast universal RNA-seq aligner |
| NGS Tools - Other | | | |
| GATK | https://www.broadinstitute.org/gatk | PMID: 21478889 | a framework for variation discovery |
| samtools mpileup | samtools.sourceforge.net/mpileup.shtml | | Calling SNPs/INDELs with SAMtools |
| Transcript assembly tools | | PMID: 24185837 | Assessment of transcript reconstruction methods for RNA-seq. |
| Proteogenomics Pipelines and Tools | | | |
| Custom database creation | | | |
| PROTEOFORMER | www.biobix.be/proteoformer | PMID: 25510491 | Stand-alone or Galaxy version to convert RIBOseq data to a protein product database for MS-based proteomics identification. |
| SpliceDB | proteomics.ucsd.edu/software-tools/splicedb-splice-graph-proteomics-tools/ | PMID: 25263569, PMID: 23802565 | construction of a compact database that contains all useful information expressed in RNA-seq reads |
| CustomProDB | www.bioconductor.org/packages/release/bioc/html/customProDB.html | PMID: 24058055 | Generate customized protein database from NGS data, with a focus on RNA-Seq data, for proteomics search. |
| Quilts | openslice.fenyolab.org/cgi-bin/quilts_cgi_v2.0.pl | | Creates sample-specific protein sequence databases using genomic and transcriptomic information. |
| Sap-DB/Splice-DB/Reduce-DB | toolshed.g2.bx.psu.edu (“Proteomics” link) | PMID: 25149441 | customized proteomic databases suitable for MS searching (SAP, Splice and Reduced) |
| MSMSpdbb | org.uib.no/prokaryotedb | PMID: 20080508 | Multi-Strain Mass Spectrometry Prokaryotic DataBase Builder |
| Proteomics Informed by Transcriptomics | bessantlab.org/expertise/proteomics-informed-by-transcriptomics | PMID: 23142869 | Use of de novo transcript assemblies from RNA-seq data for the routine creation of sample-specific proteomes (host-pathogen or metaproteomics studies) |
| Complete proteogenomics pipelines | | | |
| Galaxy implementations | http://toolshed.g2.bx.psu.edu, http://galaxy.nbic.nl, https://usegalaxyp.org, http://galaxy.wur.nl | PMID: 25658277 | Large toolbar for MS-based proteomics and NGS integration |
| GenoSuite | https://sourceforge.net/projects/proteogenomic | PMID: 23882027 | Framework developed for proteogenomic analysis (searching, scoring, visualization) |
| Peppy | geneffects.com/peppy | PMID: 23614390 | Software tool designed to perform every necessary task of proteogenomic searches (create genomic DB, track loci, match MS/MS and assign confidence) |
| Enosi | proteomics.ucsd.edu/software-tools/enosi | PMID: 24142994 | Pipeline that analyzes identified peptides; developed to automate the interpretation of spectrum-peptide matches so that they can be understood more intuitively. |
| bacterial-proteogenomic-pipeline | https://github.com/mpc-bioinformatics/bacterial-proteogenomic-pipeline | PMID: 25521444 | The Bacterial Proteogenomic Pipeline consists of several modules, which assist in a proteogenomics analysis. |
| Others | | | |
| sapFinder | bioconductor.org/packages/devel/bioc/html/sapFinder.html | PMID: 25053745 | detection of the variant peptides based on tandem mass spectrometry (MS/MS)-based proteomics data |
| PGP | https://bitbucket.org/andreyto/proteogenomics | PMID: 24470574 | parallel prokaryotic proteogenomics annotation pipeline for MPI clusters based on MS proteomics data |
| GFS | | PMID: 18428684, PMID: 12518051 | peptide-mass fingerprint against the theoretical translation and proteolytic digest of an entire genome (genome fingerprint scanning). |
| Llama Magic | gpmdb.rockefeller.edu/llama-magic-cgi/llama_magic.pl | PMID: 25362362 | Tool for identification of single chain Llama antibodies by combining next generation sequencing data with mass spectrometry based affinity proteomics data |
| Proteogenomics Mapping and Visualization | | | |
| Mapping | | | |
| PMT | www.agbase.msstate.edu/tools/pgm | PMID: 21513508 | Proteogenomic mapping tool: maps peptides back to their source genome |
| PepLine | www.grenoble.prabi.fr/protehome/software/pepline | PMID: 18348511 | maps MS/MS fragmentation spectra of tryptic peptides to genomic DNA sequences (Plants) |
| IggyPep | www.iggypep.org | PMID: 20000637 | Maps peptide sequence tags (PSTs) to the genome |
| PGx | pgx.fenyolab.org | | Takes peptide identities and quantities as input and maps them onto the human genome |
| iPiG | https://sourceforge.net/projects/ipig | PMID: 23226516 | Tool for the integration of peptide identifications from mass spectrometry experiments into existing genome browser visualizations |
| Visualization | | | |
| peptide_to_gff | toolshed.g2.bx.psu.edu/view/galaxyp/peptide_to_gff | | Tool to convert peptide to gff |
| PG Nexus using IGV | https://github.com/IntersectAustralia/ap11_Samifier | PMID: 24152167 | covisualize peptides in the context of genomes or genomic contigs, along with RNA-seq reads (Prokaryotes - Eukaryotes) |
| VESPA | cbb.pnnl.gov/portal/software/vespa.html | PMID: 22480257 | integrates high-throughput proteomics data (peptide-centric) and transcriptomics (probe or RNA-Seq) data into a genomic context (Bacteria) |
| MINOMICS | bamics2.cmbi.ru.nl/websoftware/minomics/minomics_intro.php | PMID: 19008250 | allows facile and in-depth visualization of prokaryotic transcriptomic and proteomic data in conjunction with genomics data (Prokaryotes) |
| Protter | wlab.ethz.ch/protter | PMID: 24162465 | interactive protein feature visualization and integration with experimental proteomic data |
| Tools for peptide identification from MS/MS | | | |
| Database searching | | | |
| SearchGUI | https://code.google.com/p/searchgui | PMID: 21337703 | SearchGUI is a user-friendly open-source graphical user interface for configuring and running proteomics identification search engines, currently supporting X!Tandem, MS-GF+, MS Amanda, MyriMatch, Comet, Tide and OMSSA. |
| PeptideShaker | https://code.google.com/p/peptide-shaker | PMID: 25574629 | PeptideShaker is a search engine independent platform for interpretation of proteomics identification results from multiple search engines, currently supporting X!Tandem, MS-GF+, MS Amanda, OMSSA, MyriMatch, Comet, Tide, Mascot and mzIdentML. |
| Tag-based, hybrid filtering/searching | | | |
| GenoMS | proteomics.ucsd.edu/software-tools/genoms | PMID: 20164058 | Template Proteogenomics: Sequencing Whole Proteins Using an Imperfect Database (uses InsPecT (PMID: 16013882) and PepNovo). |
| Spider, TagRecon | www.bioinfor.com/peaks/features/spider.html, enchurch.mc.vanderbilt.edu/bumbershoot/tagrecon | PMID: 16108090, PMID: 20131910 | Mutation-tolerant search engines. |
| De novo sequencing, homology searching | | | |
| DeNovoGUI, PepNovo+, DirecTag | denovogui.googlecode.com | PMID: 24295440, PMID: 15858974, PMID: 18630943 | User-friendly and lightweight graphical user interface called DeNovoGUI for running parallelized versions of the freely available de novo sequencing software PepNovo+ and DirecTag. |
| UniNovo | proteomics.ucsd.edu/Software/UniNovo.html | PMID: 23766417 | Universal tool for de novo peptide sequencing |
| MS-BLAST/MS-Homology | genetics.bwh.harvard.edu/msblast, prospector.ucsf.edu/prospector/html/instruct/homologyman.htm | REFS | Tools for homology searching |
3.1. Custom Database Creation
Matching fragmentation spectra to peptides contained in custom databases, thereby enabling extra identifications, is an important aspect of proteogenomics. As shown in Figure 2, different types of information can be used to build an increasingly comprehensive protein sequence search database. A standard database search is performed against well-annotated resources of protein sequences, mostly obtained from UniProtKB15, RefSeq16 or Ensembl17. Within traditional proteogenomics studies, databases containing the six reading frame translation of the complete genome2–5 or translation products from EST libraries5, 7–9, 82 were successfully applied. Later, with the advent of next-generation sequencing technologies, this sequencing information was incorporated in the custom protein sequence database. Next to performing matching sequencing experiments (DNAseq, exomeSeq, RNAseq or RIBOseq), many public datasets are also available from the sequence read archives at the National Center for Biotechnology Information (NCBI-SRA)83 and the European Bioinformatics Institute (EBI-ENA)84. Before this sequencing information can be used, the reads that result from such an NGS experiment first need to be mapped to a reference genome. Depending on the type of sequencing, a splice-aware or non-splice-aware aligner is used. Bowtie85 and BWA86 are the most commonly applied mappers for aligning short reads to a long reference genome or transcript sequence base. For RNAseq data, though, splice-aware mappers (TopHat87, STAR88, BWA) can be used to map the reads directly to a reference genome, reporting the splice junctions with above-threshold read coverage. This threshold can be set more stringently for discovery of novel splice junctions or more tolerantly for annotated splice sites. Custom databases compiled from (1) peptides that overlap (novel) exon-exon boundaries or (2) (novel) complete translation sequences derived from transcript reconstruction (e.g. based on tools such as Cufflinks89 or others described in ref. 90) have already proven to be successful in different (onco-)proteogenomics studies23, 24. Sequence variations or indels, which can be extracted from NGS datasets, can give rise to identification of SAVs. Most commonly, tools such as GATK91 and samtools92 mpileup are applied to extract this extra layer of information.
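The six reading frame translation mentioned above can be sketched in a few lines. The example below is a minimal illustration that uses the standard genetic code and translates both strands at all three offsets; the function names and the use of "X" for codons containing ambiguous bases are assumptions of this sketch, and real pipelines would typically also split the translations at stop codons and discard short fragments.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G (third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X' (assumed convention)."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

Because every nucleotide contributes to six candidate protein sequences, this approach inflates the search space considerably, which is one reason why the filtering strategies discussed later in this section matter.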
Knowledge on specific sequence features is growing and is commonly deposited in specialized databases, from which it can be included in the assembly of the protein search database. Information on (alternative) translation initiation sites (TISdb93, based on ribosome profiling), (alternative) ORFs (HAltORF94), untranslated regions (UTRdb95), evidence of translated upstream ORFs (uORFdb96) or small ORFs (www.sORFs.org) can help to include extra layers of proteome complexity into the protein search space and thus help their identification by means of MS-based proteomics or peptidomics experiments. Cancer fusion products (available from ChimerDB97, dbCRID98, ChiTaRS99) can also be incorporated into the protein sequence database in order to detect chimeric peptides in onco-proteogenomic studies23, 48. Germline or somatic nucleotide variants, deposited in the NCBI dbSNP database100 and supplemented with cancer mutations from CanProVar101 or COSMIC102, can give rise to identification of SAVs or peptides with amino acid deletions or insertions. RNA-editing events, either derived from databases such as dbRES46 or DARNED45 or calculated based on DNAseq and/or RNAseq using recently developed tools such as REDItools103 and GIREMI104, can introduce another type of sequence variation into the protein sequence database.
Information on particular classes of sequences can also be obtained for focused searches: (long intergenic) non-coding RNA can be extracted from LNCiPedia105 or NONCODE106 to assess whether coding sORFs can be identified, catalogs of genes related to genetic disorders can be distilled from OMIM107, and protein sequences secreted in animal venoms can be obtained from the UniProt systematical toxin annotation program (www.uniprot.org/program/Toxins).
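How a known variant translates into searchable SAV peptides can be illustrated as follows. The sketch substitutes one residue in a reference protein and reports the tryptic peptides that exist only in the variant sequence. The function names, the 1-based protein coordinate and the simplified trypsin rule (cleavage after K or R, except before P) are assumptions for illustration; mapping a genomic SNV to the affected protein position is not covered here.

```python
import re


def apply_sav(protein, position, ref, alt):
    """Return `protein` with the residue at 1-based `position` changed ref -> alt."""
    if protein[position - 1] != ref:
        raise ValueError(f"expected {ref} at position {position}, found {protein[position - 1]}")
    return protein[:position - 1] + alt + protein[position:]


def _digest(sequence):
    # Simplified trypsin rule, no missed cleavages (assumption of this sketch).
    return {p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p}


def variant_peptides(protein, position, ref, alt):
    """Return sorted tryptic peptides present in the variant but not in the reference."""
    variant = apply_sav(protein, position, ref, alt)
    return sorted(_digest(variant) - _digest(protein))
```

Note that a substitution to or from K or R changes the cleavage pattern itself, so a single variant can produce more than one novel peptide; only these variant-specific peptides need to be appended to the search database.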
As pointed out above, multiple layers of information can be incorporated into the protein database. This sometimes results in an explosion of the database size, introducing difficulties during peptide-to-spectrum matching (PSM). With an increasing database size, the chance that the best scoring match is incorrect also increases. In other words, distinguishing true from false positive matches becomes more difficult. Tools such as Percolator108, 109 or MS2PIP110 are available to help boost the rate of confident peptide identifications from a collection of tandem mass spectra, using semi-supervised machine learning. Another strategy to limit this increase in false positive identifications is to keep the database size within limits. Several ways of filtering or redundancy removal, based on different metrics, have been described. Whereas traditional proteogenomics used overlap with ESTs or gene expression as filters, NGS-based proteogenomics studies use read coverage as a metric for expression and as a cutoff (RIBOseq19, 55–57 or RNAseq18, 21, 111, 112). To manage the size of the search database, Woo et al. 21 proposed a splice graph approach to compile size-limited multiple RNAseq alignment databases. Other studies use annotation information from large-scale studies to filter (e.g. based on GENCODE113) or use a chromosome-centric approach (C-HPP: Chromosome-Centric Human Proteome Project 114–116). Sequence homology11 or sequence tag overlap2, 14, 30, 117 are also often used to help reduce the database size. A recent method (HiRIEF LC-MS) describes the use of high-resolution isoelectric focusing43 to help reduce the protein sequence database, enabling an unbiased proteogenomics search.
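The relation between database size and false positives is usually controlled by estimating a false discovery rate (FDR). One widely used way to do this is the target-decoy strategy, which is assumed here for illustration and is not prescribed by the text above: spectra are searched against real (target) and reversed or shuffled (decoy) sequences, and the number of decoy hits above a score is used to estimate the number of false target hits. The function name, the input format and the simple decoys/targets estimator are assumptions of this sketch.

```python
def score_cutoff_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR is <= max_fdr.

    `psms` is a list of (score, is_decoy) tuples, higher score = better.
    FDR is estimated as decoys / targets among PSMs at or above the cutoff.
    Returns None if no cutoff satisfies the requested FDR.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest position in the ranking that still meets the FDR.
        if targets and decoys / targets <= max_fdr:
            cutoff = score
    return cutoff
```

With a larger search database, decoy matches reach higher scores by chance, so the cutoff returned for the same FDR rises and fewer spectra are confidently identified; this is the effect that database size reduction aims to counter.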
Several automated tools have been proposed in the last decade to compile custom protein sequence databases. Previously, construction of a custom translated genome sequence database, using the MSMSpdbb tool118, was reported to study prokaryote species. In the last few years, automated pipelines to construct custom protein sequence databases based on NGS data have become available (see Figure 2, upper green box). The majority use RNAseq alignments to construct translated amino acid sequences, taking advantage of the multiple types of information that are available from this source of NGS data: mRNA abundance, SNVs, alternative splicing, and novel coding regions. CustomProDB112 and the implementations of Sheynkman et al. (SAP-db, SPLICE-db, REDUCED-db111) build separate databases based on variant, (alternative) splicing and mRNA expression information, respectively. Other tools try to combine all gathered knowledge within one search database using one (Quilts) or multiple (SpliceDB21) RNAseq alignments. PROTEOFORMER19, on the other hand, uses RIBOseq data as input. Next to the aforementioned information (abundance, SNVs, splicing, novel transcripts) derived from RNAseq-based experiments, RIBOseq data also enables the identification of translation initiation sites. This is extremely useful in delineating the true coding sequence and in managing the search database size within proteogenomics studies, since only a one-frame translation is necessary from the RIBOseq-covered transcript.
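The idea of an expression-filtered, non-redundant search database can be sketched as follows. The example keeps only translated transcripts whose expression value reaches a threshold, removes duplicate protein sequences and writes the result as FASTA text. It is a simplified illustration and not the implementation of any of the tools named above; the function name, the tuple-based input and the threshold of 1.0 (in whatever expression unit is used) are assumptions.

```python
def build_reduced_database(entries, min_expression=1.0):
    """Return FASTA text for expressed, non-redundant protein sequences.

    `entries` is an iterable of (identifier, protein_sequence, expression).
    Entries below `min_expression` are dropped; for identical sequences
    only the first identifier encountered is kept.
    """
    seen = set()
    records = []
    for identifier, sequence, expression in entries:
        if expression < min_expression or sequence in seen:
            continue
        seen.add(sequence)
        records.append(f">{identifier}\n{sequence}")
    return "\n".join(records) + ("\n" if records else "")
```

In practice the expression value would come from read coverage of the RNAseq or RIBOseq alignment, and the choice of threshold trades sensitivity for a smaller search space.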
3.2. Peptide identification
Routinely, a database search engine is used to match fragmentation spectra to peptides in the database, resulting in peptide identifications. As in regular proteomics studies, this identification strategy is also the one most commonly applied in the proteogenomics research field. Multiple database search tools are available, and nowadays results from multiple DB searches are combined to boost the identification rate. SearchGUI119 is a user-friendly open-source graphical user interface for configuring and running proteomics identification search engines, currently supporting X!Tandem120, MS-GF+121, MS Amanda122, MyriMatch123, Comet124, Tide125 and OMSSA126. Next to DB searching using different search algorithms, hybrid approaches (e.g. InsPecT117, GenoMS30) have been devised that filter the search database based on peptide sequence tags (PSTs) derived from the actual fragmentation spectrum. This tag-based filtering reduces the search space, which can be very efficient and useful in proteogenomics studies. Next to this PST-based filtering, other tools (e.g. Spider28 and TagRecon29) reconcile these tags against the protein database and moreover enable mutation-tolerant searching to identify unanticipated variations or posttranslational modifications. Finally, true de novo sequencing algorithms do not take advantage of a reference protein database and only use spectral information for the peptide identification. These have been successfully applied in many proteogenomics studies, especially for non-model species where a comprehensive genome annotation is still unavailable or incomplete. Multiple tools to derive these partial sequence stretches or complete de novo sequences are available, e.g. PepNovo+127, DirecTag128 (both bundled in a graphical user interface called DeNovoGUI129) and UniNovo130. A special feature of the latter is that it can cope with all types of fragmentation (CID, ETD and/or HCD). Subsequently, this peptide de novo sequencing information can be used to perform database searches, looking for proteins containing peptides identical or homologous to the sequence (e.g. MS-BLAST131, MS-Homology132).
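The idea of tag-based database filtering can be sketched in a few lines. This is a deliberately simplified illustration and not the algorithm of InsPecT, GenoMS or any other cited tool: real implementations also use the flanking masses of a tag and tolerate sequencing errors, whereas the hypothetical helper below only keeps database entries that contain a tag as an exact substring. Leucine and isoleucine are treated as identical because they have the same mass, and each tag is also tried in reverse since its reading direction may be unknown.

```python
def normalise(seq: str) -> str:
    """Upper-case a sequence and merge the isobaric residues I and L."""
    return seq.upper().replace("I", "L")


def filter_by_tags(proteins: dict[str, str], tags: list[str]) -> dict[str, str]:
    """Keep proteins containing at least one tag (forward or reversed)."""
    norm_tags = {normalise(t) for t in tags}
    norm_tags |= {t[::-1] for t in norm_tags}
    return {
        acc: seq
        for acc, seq in proteins.items()
        if any(tag in normalise(seq) for tag in norm_tags)
    }


if __name__ == "__main__":
    # Hypothetical accessions and sequences, for illustration only.
    database = {"P1": "MKTAYIAKQR", "P2": "GGGSSSPPP"}
    print(filter_by_tags(database, ["AYL"]))
```

Only the retained entries would then be passed on to the (more expensive) spectrum matching step.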
3.3. Peptide validation and interpretation
The increased size of a proteogenomics sequence database in comparison to a regular protein database can make peptide validation very challenging. Either one ensures that the protein database is similar in size to a routinely used reference protein database such as SwissProt, or a multistage search and validation strategy has to be worked out21, 26. Building custom databases from RIBOseq data (where translation is only necessary from one reading frame) proved to be favorable, as can be seen from Posterior Error Probability (PEP) distribution plots of the identified tryptic peptides. The PEP distribution plot (presenting the cumulative valid peptide identifications for increasing PEP values) of the RIBOseq-derived database behaves similarly to that of the SwissProt one19 or of a search database combining RIBOseq-derived and annotated SwissProt protein sequences (after redundancy removal at the peptide level). Several other ways to manage the size of a proteogenomics database have been reviewed in the custom database creation section and are presented in Figure 2 – Customized DB.
Often, stringent validation cut-offs are used to cope with this increased DB size. The flip side of the coin is that minimizing the false positive rate in this way increases the false negative rate. A multi-stage strategy21, 26, 133 has been proposed to tackle this issue. Jagtap et al.133 proposed a two-step method wherein matches derived from a primary search against the large database are used to create a smaller subset database. The second search is performed against a target-decoy version of this subset database merged with a host database. A two-stage false discovery rate (FDR) strategy21, 26, as opposed to a combined FDR calculation, was benchmarked by Woo et al.21 and proved to be advantageous using an RNA-seq derived search space built using the graph-based approach. Because of the difference in database size (between known and NGS-derived sequences), FDR cut-offs for the different PSM groups (known versus novel) need to be determined separately. If a single combined cut-off is used instead, the score threshold applied to known peptides becomes more stringent than necessary, resulting in a reduced number of known identifications, whereas the threshold applied to novel peptides becomes too lenient, resulting in a higher number of false positive novel peptides.
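A class-specific target-decoy FDR calculation can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Woo et al. or of any particular tool: higher scores are assumed to be better, the FDR at a score cut-off is estimated as the number of decoy PSMs divided by the number of target PSMs at or above that cut-off, tied scores are not treated specially, and the `PSM` record and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PSM:
    score: float    # higher is better (assumption)
    is_decoy: bool  # True if matched to a decoy sequence
    is_novel: bool  # True if the peptide only occurs in NGS-derived entries


def score_threshold(psms: list[PSM], max_fdr: float) -> Optional[float]:
    """Lowest target score at which decoys / targets is still <= max_fdr."""
    targets = decoys = 0
    threshold = None
    for psm in sorted(psms, key=lambda p: p.score, reverse=True):
        if psm.is_decoy:
            decoys += 1
            continue
        targets += 1
        if decoys / targets <= max_fdr:
            threshold = psm.score
    return threshold


def separate_fdr(psms: list[PSM], max_fdr: float = 0.01) -> dict[str, Optional[float]]:
    """Determine one score cut-off per PSM class (known versus novel)."""
    known = [p for p in psms if not p.is_novel]
    novel = [p for p in psms if p.is_novel]
    return {
        "known": score_threshold(known, max_fdr),
        "novel": score_threshold(novel, max_fdr),
    }
```

Each class receives its own cut-off, so a class with relatively many decoy hits (typically the novel peptides) ends up with a stricter threshold than the known peptides rather than sharing one combined threshold.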
PeptideShaker134 is a relatively new tool to interpret MS-based proteomics identification results from multiple search engines. The PeptideShaker platform supports the combined FDR/PEP estimation approach and also enables the multi-pass strategy. The tool has a built-in function enabling the export of non-validated PSMs from a first search against, for example, a reference protein sequence database (SwissProt). In a follow-up analysis, the fragmentation spectra can be matched against a database of novel NGS-derived sequences. In fact, a target-decoy strategy for FDR calculation could be performed for each class of novel peptides, as proposed by Nesvizhskii26. Moreover, PeptideShaker allows for further separation of the PSMs based on charge or modification status if statistical significance is ensured, thus allowing a more sophisticated FDR calculation and increasing the sensitivity of the process. Further top-down validation by means of protein interaction networks (STRING135) or gene ontology (DAVID136, QuickGO137) is also possible within the tool. The described multi-pass PSM confidence calculation for the different classes of novel peptides can also be incorporated in other tools such as PeptideProphet138 within the Trans-Proteomic Pipeline139. Next to the commonly used target-decoy strategy described above, novel ways to validate peptides have been described. The NOKOI strategy140 no longer requires a decoy database and is claimed to be very suitable for proteogenomics approaches where very large sequence databases need to be scanned.
We already mentioned that expression information (derived from RNAseq data) can be applied to set a lower detection limit for matching follow-up MS-based proteomics measurements111, thus incorporating transcript abundance measures within the MS database search process. These transcript abundance values have furthermore been used to improve the sensitivity of protein identification. A study by Shanmugam et al. presented a method that uses the Global Proteome Machine Database (GPMDB141) identification frequencies in combination with RNAseq transcript abundances to adjust the confidence score of protein identifications20.
3.4. Peptide mapping and visualization
After the identification and validation of the peptides, the peptides need to be mapped to their genomic location, allowing detailed inspection that aids gene or genome (re)annotation and/or covisualization with aligned RNAseq or RIBOseq data. A handful of tools are available that allow this peptide mapping in a genome-centric fashion. Some of these (e.g. Proteogenomic Mapping Tool142, PepLine143 or IggyPep144) use string-matching algorithms to map the peptide sequence to the six-reading-frame translation of the complete genome, while others (e.g. PGx or the iPiG mapper145) use extra gene annotation information to speed up this mapping process. Other bioinformatics tools enable the visualization of the peptide information in the context of genomes. The Visual Exploration and Statistics to Promote Annotation (VESPA146) and MINOMICS147 tools are stand-alone visualization solutions that focus on the annotation of prokaryotic genomes through the integration of MS-based proteomics and NGS data in a genome-centric way. Other tools, available from the Galaxy web interface148, allow the visualization of both prokaryotic and eukaryotic peptide data and facilitate the integration with other proteogenomics applications: PG Nexus, which uses the Integrative Genomics Viewer (IGV149), or the peptide_to_gff tool, which converts peptide data into generic feature format that can be uploaded and visualized in a genome browser environment. Yet another tool, called Protter150, allows the visualization of proteoforms and interactive integration of annotated and predicted sequence features together with experimental proteomic evidence.
A standard file format holding base-level PSM information and genome mapping coordinates could be very useful for future proteogenomics studies. Previously, several initiatives based on the UCSC BED format were taken to map specific proteogenomics results (e.g. from the ENCODE or CPTAC projects) or PeptideAtlas peptide information. Recently, the proBAM format was introduced and described, holding this base-level PSM information. A Bioconductor R package (proBAMr) has also been made available that converts MS-based proteomics results (in pepXML format) to this proBAM format, enabling genome-centric visualization. Higher-level (e.g. peptide or ‘novel’ event) information can be extracted from this proBAM file and converted to a file format of choice (e.g. BED, BedGraph, WIG).
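How a peptide identification translates into a genome-centric record can be sketched with the BED format mentioned above. The example is a simplified illustration, not the output of proBAMr or any cited tool: it assumes a single-exon coding sequence whose genomic coordinates are known (0-based start, exclusive end, as in BED), and the function name and coordinates are hypothetical. Peptides spanning splice junctions would need the block fields of the 12-column BED variant, which are not handled here.

```python
def peptide_to_bed(chrom: str, cds_start: int, cds_end: int, strand: str,
                   protein: str, peptide: str) -> str:
    """Return a 6-column BED line for a peptide within a single-exon CDS.

    cds_start/cds_end are 0-based, end-exclusive genomic coordinates of
    the coding sequence that translates to `protein`.
    """
    offset = protein.find(peptide)
    if offset < 0:
        raise ValueError("peptide not found in protein")
    nt_len = 3 * len(peptide)
    if strand == "+":
        start = cds_start + 3 * offset
    elif strand == "-":
        # On the reverse strand the protein N-terminus lies at cds_end.
        start = cds_end - 3 * offset - nt_len
    else:
        raise ValueError("strand must be '+' or '-'")
    end = start + nt_len
    return "\t".join([chrom, str(start), str(end), peptide, "0", strand])


if __name__ == "__main__":
    # Hypothetical locus, for illustration only.
    print(peptide_to_bed("chr1", 1000, 1024, "+", "MKTAYIAK", "TAYIAK"))
```

The resulting line can be loaded as a custom track in a genome browser alongside aligned RNAseq or RIBOseq data.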
3.5. Integrative proteogenomics pipelines
A handful of integrative proteogenomics pipelines are available that can execute complete proteogenomics analyses. PEPPY151, for example, starts off by generating a peptide database using the six-reading-frame translation of a known (eukaryotic) genome. Next, the peptide locations are tracked and visualized, and PSMs are obtained using a fixed confidence threshold. All steps (decoy DB generation, DB search and PSM validation) are automatically executed, ensuring a list of identifications at the desired FDR cut-off. Other complete pipelines, applying a similar methodology but focusing on the analysis of prokaryotic genomes, are available (GenoSuite152 and bacterial-proteogenomic-pipeline153). ENOSI, another complete proteogenomics package, initially used the same strategy of creating a genome-derived search database to help genome (re)annotation. Later, a second custom splice-graph database, compactly representing many putative splice junctions, was added to allow the identification of splice peptides. This strategy was successfully applied in a Zea mays proteogenomics study2. Now, ENOSI also enables the construction of a protein database from mapped RNAseq reads, again helping to identify splice peptides. Furthermore, the pipeline still allows for the six-reading-frame (6RF) genomic translation. Peptides are identified using MS-GF+121, matching fragmentation spectra to these custom databases. Clusters of co-located peptides (i.e. events) can give rise to certain discoveries including 'novel genes', 'novel exons', 'novel splice junctions', or 'gene extension'. The most recent addition is PGTools154, a software suite for (onco)proteogenomics data analysis and visualization. This open-source software suite also completes a full proteogenomics circle and comprises several steps: generation of RNAseq-derived search databases (next to other available databases derived from Ensembl), DB searching (currently MS-GF+, X!Tandem and Comet are supported), FDR calculation (based on Percolator), grouping and annotation of proteins, identification of cancer-associated proteins, and visualization based on Circos155, the UCSC genome browser and/or the Integrative Genomics Viewer (IGV156).
ProteoAnnotator157 is another open-source proteogenomics annotation software designed to aid genome (re-)annotation, enabling the integration of MS-based proteomic evidence into genome databases. The goal is to use MS data to confirm that predicted gene models are translated into protein, and to verify the correctness of existing gene models. It can also be used to examine whether supporting evidence for alternative gene predictions at particular loci is present. The pipeline implements the Proteomics Standards Initiative mzIdentML standard for each analysis stage and includes steps to combine multiple search engine databases, to perform peptide-level statistics and to score grouped protein identifications matched to a given genomic locus to validate that updates to the official gene models are statistically sound.
Many tools in the genomics, transcriptomics or proteomics research fields are script-based or have a command line interface. This enables researchers to create custom pipelines and workflows to perform integrative analysis of raw multi-omics data. Several frameworks (also called workflow management systems), such as Taverna159, Galaxy148, KNIME160 or Yabi161, are suitable to perform these tasks and moreover meet several necessary requirements: they are flexible, scalable, shareable, sustainable and complete158.
The Galaxy web platform appears to be the most practical choice for building proteogenomics workflows. Galaxy has been used within the NGS research community for almost a decade and has become the most established workflow solution. A plethora of tools for genomics and transcriptomics data manipulation and analysis has been made available over the last 10 years, e.g. for the processing and alignment of high-throughput sequencing data. Moreover, many proteomics tools have been released in the Galaxy workflow environment (a listing is provided at http://toolshed.g2.bx.psu.edu/ within the 'Proteomics' category). Complete proteogenomics pipelines can thus be built: (i) creation of custom protein databases based on RNAseq data (using the SAP-, SPLICE-, REDUCED-db tools111 or sapFinder162) or RIBOseq data (based on the PROTEOFORMER tool19), followed by (ii) peptide identification and validation using the SearchGUI/PeptideShaker command line interface (CLI) implementation or using the separate search algorithm implementations available. The Galaxy environment offers extensive possibilities for constructing multi-omics workflows, and it has already been successfully applied and reported on for both proteogenomics43, 163 and metaproteogenomics133, 164 studies.
3.6. Antibody sequencing tools
Llama Magic67 was developed as a tool for the identification of single-chain llama antibodies by combining next-generation sequencing data with mass spectrometry-based affinity proteomics data, but it can also be used for antibody sequencing in general. It is based on these steps: (i) creation of a database of unique tryptic peptide sequences from the next-generation sequencing data; (ii) searching the database using X!Tandem; and (iii) mapping the identified peptides back to the antibody sequences and scoring the sequences. The reason for creating a unique peptide sequence database, instead of just translating antibody sequences and searching those, is that some parts of these antibodies are identical and would therefore be highly redundant after digestion (with some peptides represented >10^5 times). Digesting the sequences and searching only the unique peptides makes the search much faster; moreover, most search engines could not handle such peptide redundancy. The scoring of the sequences is based on peptide coverage of the Complementarity Determining Regions (CDRs), the overall coverage and sequence counts.
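Step (i), the construction of a non-redundant tryptic peptide database, can be sketched as follows. This is an illustration of the general idea rather than the Llama Magic code: it assumes cleavage after K or R unless followed by P, no missed cleavages, and a hypothetical minimum peptide length; the function names are ours. The zero-width split relies on the behaviour of `re.split` in Python 3.7 or later.

```python
import re
from collections import Counter


def tryptic_peptides(protein: str, min_len: int = 6) -> list[str]:
    """In-silico tryptic digest: cleave after K/R, but not before P."""
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if len(p) >= min_len]


def unique_peptide_db(antibody_seqs: list[str], min_len: int = 6) -> Counter:
    """Map each unique peptide to the number of sequences containing it."""
    counts: Counter = Counter()
    for seq in antibody_seqs:
        # set() so that a peptide is counted once per antibody sequence.
        counts.update(set(tryptic_peptides(seq, min_len)))
    return counts


if __name__ == "__main__":
    # Toy sequences sharing an identical (framework-like) peptide.
    seqs = ["AAAAAAKCCCCCCR", "AAAAAAKDDDDDDR"]
    db = unique_peptide_db(seqs)
    print(sorted(db))          # unique peptides to be searched
    print(db["AAAAAAK"])       # shared by both input sequences
```

Only the keys of the returned counter need to be searched, while the counts retain how many antibody sequences each peptide maps back to for the later scoring step.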
4. Conclusions and future perspectives
As can be seen from this review, an elaborate list of tools is available for proteogenomics studies. In the current data-driven era, bioinformatics solutions for multi-omics studies are becoming more mainstream and important. The danger is that current integrative solutions will no longer be available in the future if they cease to be maintained, or that novel algorithms and improvements in both the MS-based proteomics and NGS-based sequencing research fields will not find their way into the existing pipelines. Although elegant integrative solutions are available (PEPPY, ENOSI, PGTools), we think that existing workflow solutions, and especially Galaxy, will play an important role. This well-established workflow management solution within the bioinformatics community is ideal to serve as the (default) repository for multi-omics tools and pipelines. As already mentioned, Galaxy has been supported for almost a decade by a large community of researchers in the NGS domain. Over the last few years, there has also been growing interest in implementing MS-based proteomics tools in Galaxy, facilitating the many proteogenomics applications listed in this review. As a large community drives this effort, the accessibility, maintenance and future updates of the implemented tools in a version-based manner can also be better guaranteed. Notably, a lot of integration is still needed to address multi-omics studies combining proteomics, transcriptomics and possibly also metabolomics and cheminformatics information.
Visualization of cross-omics data remains very important and challenging. Bundling proteomics and genomics data in a genome-centric way using different flavors of genome browsers (UCSC, Ensembl, IGV) is possible, allowing gene-level inspection of the different layers of information. This (multiple) gene-level inspection can also be done using Circos155, another elegant visualization tool that allows researchers to present their multilayer data in a circular fashion. However, the main advantage of looking into multi-omics data is that it allows the inspection of interactions within complex networks over multiple layers of information. Combinations of perturbation or regulation of different nodes in an interaction network affecting the same pathway over the several layers of integrated data (e.g. transcriptomics and proteomics) need to be visualized. Cytoscape165 could help in this effort and basic integration with e.g. Galaxy is available through the GenomeSpace (www.genomespace.org) implementation (a tool that brings together diverse computational tools in one place).
A point of much debate in proteogenomics studies is the FDR estimation and the confidence of the (novel) peptide identifications. In an ideal situation, the scoring method can always discriminate correct from incorrect matches, even if the search space reaches the theoretical upper limit of all possible peptide sequences. In a realistic situation, however, this becomes increasingly difficult as more and more sequence material (based on e.g. variation) is included in the search space. A general strategy for FDR analysis has been outlined by Nesvizhskii26, who proposes controlling the FDR separately for peptides matching the reference database and for novel proteogenomics findings. Several methods based on machine learning techniques (e.g. Percolator, MS2PIP) have been devised to perform a rescoring that increases the number of confident identifications. These methods can already aid peptide identification in proteogenomics, but the ideal situation is still far from reached. Until then, a multistage search strategy, searching first the reference protein database and then a custom genomics-derived protein database, can be used to improve the sensitivity of the peptide identification process in a proteogenomics experiment. Furthermore, this strategy imposes a separate FDR estimation for the groups of known and novel peptide identifications. As mentioned before, the PeptideShaker platform can elegantly deal with this multistage strategy, even allowing further separation of the PSMs based on charge or modification status if statistical significance is ensured.
Novel proteogenomics findings, so-called ‘events’ (e.g. alternative TISs, novel proteoforms, insertions/deletions/mutations, translated UTRs, frame-shift or reverse strand translations within a known protein coding region or actual novel protein-coding loci), need to be reported and deposited in existing public protein repositories such as UniProtKB or RefSeq and genome repositories such as Ensembl or UCSC. Of course, this can only be the case for ‘events’ that have been validated using robust statistical significance estimation. Integration and visualization of novel ‘events’ from proteogenomics in genome-centric interfaces can be pursued using the track hubs system, which has also been implemented to integrate the enormous amount of information from the ENCODE project166. It would be very useful, and even necessary, to draft community standards for this NGS-MS integration, describing the minimum amount of information needed to comprehensively annotate this specific proteomics information flow. Integration with existing protein and genome resources needs to be further established. On the other hand, dedicated web browsers based on the C-HPP project are already available to view and analyze human proteogenomics data (The Proteome Browser167, CAPER168, GenomewidePDB169), allowing the detection of novel peptides or the identification of SAVs and exon-skipping events, amongst others.
In summary, proteogenomics is still a relatively new and rapidly evolving field with diverse new methods and tools being proposed, and there is also a lively discussion in the field about best practices. In several research areas a proteogenomics approach is clearly useful, for example in its utility for antibody sequencing and genome (re-)annotation. In cancer immunotherapy (i.e. immunopeptide identification) the use of proteogenomics is becoming more critical, but its usefulness for tumor analysis is still an open question. In the next few years we should see proteogenomics maturing and its application and data analysis will become more standardized.
Abbreviations
- CDR
Complementarity Determining Region
- CDS
Coding Sequence
- CID
Collision Induced Dissociation
- CLI
Command Line Interface
- COFRADIC
COmbined FRActional Diagonal Chromatography
- DB
Database
- EBI-ENA
European Bioinformatics Institute – European Nucleotide Archive
- EST
Expressed Sequence Tag
- ETD
Electron Transfer Dissociation
- FDR
False Discovery Rate
- HARR
Harringtonine
- HCD
Higher-energy Collision Dissociation
- HLA
Human Leukocyte Antigen
- lincRNA
long intergenic non-coding RNA
- LTM
Lactimidomycin
- MS
Mass Spectrometry
- MHC1
Major-Histocompatibility Complex class 1
- NCBI-SRA
National Center for Biotechnology Information – Sequence Read Archive
- NGS
Next-Generation Sequencing
- PEP
Posterior Error Probability
- PSM
Peptide-to-Spectrum Match
- PST
Peptide Sequence Tag
- PTM
Post-Translational Modification
- RF
Reading Frame
- RIBOseq
Ribosome profiling
- SAV
Single Amino acid Variant
- SEP
sORF-encoded polypeptide
- SNV
Single Nucleotide Variant
- sORF
small Open Reading Frame
- TIS
Translation Initiation Site
- uORF
upstream Open Reading Frame
- UTR
UnTranslated Region
Footnotes
Author Contributions: G.M. and D.F. wrote the manuscript. G.M. is a Postdoctoral Fellow of the Research Foundation – Flanders (FWO-Vlaanderen). Funding for open access charge: Research Foundation – Flanders (FWO-Vlaanderen) and Ghent University. This work was supported by National Cancer Institute (NCI) CPTAC award U24CA160035, and by CPTAC contract 13XS068 from Leidos Biomedical Research, Inc.
References
- 1.Jaffe JD, Berg HC, Church GM. Proteogenomic mapping as a complementary method to perform genome annotation. Proteomics. 2004;4:59–77. doi: 10.1002/pmic.200300511. [DOI] [PubMed] [Google Scholar]
- 2.Castellana NE, et al. An automated proteogenomic method uses mass spectrometry to reveal novel genes in Zea mays. Molecular & cellular proteomics : MCP. 2014;13:157–167. doi: 10.1074/mcp.M113.031260. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Baerenfaller K, et al. Genome-scale proteomics reveals Arabidopsis thaliana gene models and proteome dynamics. Science. 2008;320:938–941. doi: 10.1126/science.1157956. [DOI] [PubMed] [Google Scholar]
- 4.Brunner E, et al. A high-quality catalog of the Drosophila melanogaster proteome. Nature biotechnology. 2007;25:576–583. doi: 10.1038/nbt1300. [DOI] [PubMed] [Google Scholar]
- 5.Fermin D, et al. Novel gene and gene model detection using a whole genome open reading frame analysis in proteomics. Genome biology. 2006;7:R35. doi: 10.1186/gb-2006-7-4-r35. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Brosch M, et al. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and "resurrected" pseudogenes in the mouse genome. Genome research. 2011;21:756–767. doi: 10.1101/gr.114272.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7.Nesvizhskii AI, et al. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data: toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Molecular & cellular proteomics : MCP. 2006;5:652–670. doi: 10.1074/mcp.M500319-MCP200. [DOI] [PubMed] [Google Scholar]
- 8.Choudhary JS, Blackstock WP, Creasy DM, Cottrell JS. Interrogating the human genome using uninterpreted mass spectrometry data. Proteomics. 2001;1:651–667. doi: 10.1002/1615-9861(200104)1:5<651::AID-PROT651>3.0.CO;2-N. [DOI] [PubMed] [Google Scholar]
- 9.Liska AJ, Sunyaev S, Shilov IN, Schaeffer DA, Shevchenko A. Error-tolerant EST database searches by tandem mass spectrometry and multiTag software. Proteomics. 2005;5:4118–4122. doi: 10.1002/pmic.200401262. [DOI] [PubMed] [Google Scholar]
- 10.Capriotti AL, et al. Proteome investigation of the non-model plant pomegranate (Punica granatum L.) Analytical and bioanalytical chemistry. 2013;405:9301–9309. doi: 10.1007/s00216-013-7382-3. [DOI] [PubMed] [Google Scholar]
- 11.Junqueira M, et al. Protein identification pipeline for the homology-driven proteomics. Journal of proteomics. 2008;71:346–356. doi: 10.1016/j.jprot.2008.07.003. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Immel F, Renaut J, Masfaraud JF. Physiological response and differential leaf proteome pattern in the European invasive Asteraceae Solidago canadensis colonizing a former cokery soil. Journal of proteomics. 2012;75:1129–1143. doi: 10.1016/j.jprot.2011.10.026. [DOI] [PubMed] [Google Scholar]
- 13.Tanner S, et al. Improving gene annotation using peptide mass spectrometry. Genome research. 2007;17:231–239. doi: 10.1101/gr.5646507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Castellana NE, et al. Discovery and revision of Arabidopsis genes by proteogenomics. Proceedings of the National Academy of Sciences of the United States of America. 2008;105:21034–21038. doi: 10.1073/pnas.0811066106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.UniProt C. Activities at the Universal Protein Resource (UniProt) Nucleic acids research. 2014;42:D191–198. doi: 10.1093/nar/gkt1140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Pruitt KD, et al. RefSeq: an update on mammalian reference sequences. Nucleic acids research. 2014;42:D756–763. doi: 10.1093/nar/gkt1114. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Flicek P, et al. Ensembl 2014. Nucleic acids research. 2014;42:D749–755. doi: 10.1093/nar/gkt1196. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Wang X, Liu Q, Zhang B. Leveraging the complementary nature of RNA-Seq and shotgun proteomics data. Proteomics. 2014;14:2676–2687. doi: 10.1002/pmic.201400184. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Crappe J, et al. PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic acids research. 2015;43:e29. doi: 10.1093/nar/gku1283. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Shanmugam AK, Yocum AK, Nesvizhskii AI. Utility of RNA-seq and GPMDB protein observation frequency for improving the sensitivity of protein identification by tandem MS. Journal of proteome research. 2014;13:4113–4119. doi: 10.1021/pr500496p. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21.Woo S, et al. Proteogenomic strategies for identification of aberrant cancer peptides using large-scale next-generation sequencing data. Proteomics. 2014;14:2719–2730. doi: 10.1002/pmic.201400206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Gawron D, Gevaert K, Van Damme P. The proteome under translational control. Proteomics. 2014;14:2647–2662. doi: 10.1002/pmic.201400165. [DOI] [PubMed] [Google Scholar]
- 23.Alfaro JA, Sinha A, Kislinger T, Boutros PC. Onco-proteogenomics: cancer proteomics joins forces with genomics. Nature methods. 2014;11:1107–1113. doi: 10.1038/nmeth.3138. [DOI] [PubMed] [Google Scholar]
- 24.Zhang B, et al. Proteogenomic characterization of human colon and rectal cancer. Nature. 2014;513:382–387. doi: 10.1038/nature13438. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 25.Pandey A, Pevzner PA. Proteogenomics. Proteomics. 2014;14:2631–2632. doi: 10.1002/pmic.201470173. [DOI] [PubMed] [Google Scholar]
- 26.Nesvizhskii AI. Proteogenomics: concepts, applications and computational strategies. Nature methods. 2014;11:1114–1125. doi: 10.1038/nmeth.3144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Armengaud J, et al. Non-model organisms, a species endangered by proteogenomics. Journal of proteomics. 2014;105:5–18. doi: 10.1016/j.jprot.2014.01.007. [DOI] [PubMed] [Google Scholar]
- 28.Han Y, Ma B, Zhang K. SPIDER: software for protein identification from sequence tags with de novo sequencing error. Journal of bioinformatics and computational biology. 2005;3:697–716. doi: 10.1142/s0219720005001247. [DOI] [PubMed] [Google Scholar]
- 29.Dasari S, et al. TagRecon: high-throughput mutation identification through sequence tagging. Journal of proteome research. 2010;9:1716–1726. doi: 10.1021/pr900850m. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Castellana NE, Pham V, Arnott D, Lill JR, Bafna V. Template proteogenomics: sequencing whole proteins using an imperfect database. Molecular & cellular proteomics : MCP. 2010;9:1260–1270. doi: 10.1074/mcp.M900504-MCP200. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Luge T, et al. Transcriptomics assisted proteomic analysis of Nicotiana occidentalis infected by Candidatus Phytoplasma mali strain AT. Proteomics. 2014;14:1882–1889. doi: 10.1002/pmic.201300551. [DOI] [PubMed] [Google Scholar]
- 32.Mason ME, Koch JL, Krasowski M, Loo J. Comparisons of protein profiles of beech bark disease resistant and susceptible American beech (Fagus grandifolia). Proteome science. 2013;11:2. doi: 10.1186/1477-5956-11-2.
- 33.Volkening JD, et al. A proteogenomic survey of the Medicago truncatula genome. Molecular & cellular proteomics : MCP. 2012;11:933–944. doi: 10.1074/mcp.M112.019471.
- 34.Hartmann EM, et al. Proteomics meets blue biotechnology: a wealth of novelties and opportunities. Marine genomics. 2014;17:35–42. doi: 10.1016/j.margen.2014.04.003.
- 35.Zhang C, Xu P, Zhu Y. Progress in proteogenomics of prokaryotes. Sheng wu gong cheng xue bao = Chinese journal of biotechnology. 2014;30:1026–1035.
- 36.de Souza GA, et al. Proteogenomic analysis of polymorphisms and gene annotation divergences in prokaryotes using a clustered mass spectrometry-friendly database. Molecular & cellular proteomics : MCP. 2011;10:M110.002527. doi: 10.1074/mcp.M110.002527.
- 37.Tovchigrechko A, Venepally P, Payne SH. PGP: parallel prokaryotic proteogenomics pipeline for MPI clusters, high-throughput batch clusters and multicore workstations. Bioinformatics. 2014;30:1469–1470. doi: 10.1093/bioinformatics/btu051.
- 38.Evans VC, et al. De novo derivation of proteomes from transcriptomes for transcript and protein identification. Nature methods. 2012;9:1207–1211. doi: 10.1038/nmeth.2227.
- 39.Armengaud J, Hartmann EM, Bland C. Proteogenomics for environmental microbiology. Proteomics. 2013;13:2731–2742. doi: 10.1002/pmic.201200576.
- 40.Li HD, Menon R, Omenn GS, Guan Y. Revisiting the identification of canonical splice isoforms through integration of functional genomics and proteomics evidence. Proteomics. 2014;14:2709–2718. doi: 10.1002/pmic.201400170.
- 41.Sun H, et al. Integration of mass spectrometry and RNA-Seq data to confirm human ab initio predicted genes and lncRNAs. Proteomics. 2014;14:2760–2768. doi: 10.1002/pmic.201400174.
- 42.Ucciferri N, Rocchiccioli S. Proteomics techniques for the detection of translated pseudogenes. Methods in molecular biology. 2014;1167:187–195. doi: 10.1007/978-1-4939-0835-6_12.
- 43.Branca RM, et al. HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics. Nature methods. 2014;11:59–62. doi: 10.1038/nmeth.2732.
- 44.Pawar H, Kulkarni A, Dixit T, Chaphekar D, Patole MS. A bioinformatics approach to reanalyze the genome annotation of kinetoplastid protozoan parasite Leishmania donovani. Genomics. 2014;104:554–561. doi: 10.1016/j.ygeno.2014.09.008.
- 45.Kiran AM, O'Mahony JJ, Sanjeev K, Baranov PV. Darned in 2013: inclusion of model organisms and linking with Wikipedia. Nucleic acids research. 2013;41:D258–261. doi: 10.1093/nar/gks961.
- 46.He T, Du P, Li Y. dbRES: a web-oriented database for annotated RNA editing sites. Nucleic acids research. 2007;35:D141–144. doi: 10.1093/nar/gkl815.
- 47.Park H, et al. Compact variant-rich customized sequence database and a fast and sensitive database search for efficient proteogenomic analyses. Proteomics. 2014;14:2742–2749. doi: 10.1002/pmic.201400225.
- 48.Conlon KP, et al. Fusion peptides from oncogenic chimeric proteins as putative specific biomarkers of cancer. Molecular & cellular proteomics : MCP. 2013;12:2714–2723. doi: 10.1074/mcp.M113.029926.
- 49.Gao X, et al. Quantitative profiling of initiating ribosomes in vivo. Nature methods. 2015;12:147–153. doi: 10.1038/nmeth.3208.
- 50.Lee S, et al. Global mapping of translation initiation sites in mammalian cells at single-nucleotide resolution. Proceedings of the National Academy of Sciences of the United States of America. 2012;109:E2424–2432. doi: 10.1073/pnas.1207846109.
- 51.Ingolia NT, et al. Ribosome profiling reveals pervasive translation outside of annotated protein-coding genes. Cell reports. 2014;8:1365–1379. doi: 10.1016/j.celrep.2014.07.045.
- 52.Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011;147:789–802. doi: 10.1016/j.cell.2011.10.002.
- 53.Fresno M, Jimenez A, Vazquez D. Inhibition of translation in eukaryotic systems by harringtonine. Eur J Biochem. 1977;72:323–330. doi: 10.1111/j.1432-1033.1977.tb11256.x.
- 54.Schneider-Poetsch T, et al. Inhibition of eukaryotic translation elongation by cycloheximide and lactimidomycin. Nature chemical biology. 2010;6:209–217. doi: 10.1038/nchembio.304.
- 55.Koch A, et al. A proteogenomics approach integrating proteomics and ribosome profiling increases the efficiency of protein identification and enables the discovery of alternative translation start sites. Proteomics. 2014;14:2688–2698. doi: 10.1002/pmic.201400180.
- 56.Van Damme P, Gawron D, Van Criekinge W, Menschaert G. N-terminal proteomics and ribosome profiling provide a comprehensive view of the alternative translation initiation landscape in mice and men. Molecular & cellular proteomics : MCP. 2014;13:1245–1261. doi: 10.1074/mcp.M113.036442.
- 57.Menschaert G, et al. Deep proteome coverage based on ribosome profiling aids mass spectrometry-based protein and peptide discovery and provides evidence of alternative translation products and near-cognate translation initiation events. Molecular & cellular proteomics : MCP. 2013;12:1780–1790. doi: 10.1074/mcp.M113.027540.
- 58.Staes A, et al. Selecting protein N-terminal peptides by combined fractional diagonal chromatography. Nature protocols. 2011;6:1130–1141. doi: 10.1038/nprot.2011.355.
- 59.Hartmann EM, Armengaud J. N-terminomics and proteogenomics, getting off to a good start. Proteomics. 2014;14:2637–2646. doi: 10.1002/pmic.201400157.
- 60.Crappe J, et al. Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs. BMC genomics. 2013;14:648. doi: 10.1186/1471-2164-14-648.
- 61.Bazzini AA, et al. Identification of small ORFs in vertebrates using ribosome footprinting and evolutionary conservation. The EMBO journal. 2014;33:981–993. doi: 10.1002/embj.201488411.
- 62.Aspden JL, et al. Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. eLife. 2014;3:e03528. doi: 10.7554/eLife.03528.
- 63.Slavoff SA, et al. Peptidomic discovery of short open reading frame-encoded peptides in human cells. Nature chemical biology. 2013;9:59–64. doi: 10.1038/nchembio.1120.
- 64.Ma J, et al. Discovery of human sORF-encoded polypeptides (SEPs) in cell lines and tissue. Journal of proteome research. 2014;13:1757–1765. doi: 10.1021/pr401280w.
- 65.Scheid JF, et al. Sequence and structural convergence of broad and potent HIV antibodies that mimic CD4 binding. Science. 2011;333:1633–1637. doi: 10.1126/science.1207227.
- 66.Muellenbeck MF, et al. Atypical and classical memory B cells produce Plasmodium falciparum neutralizing antibodies. The Journal of experimental medicine. 2013;210:389–399. doi: 10.1084/jem.20121970.
- 67.Fridy PC, et al. A robust pipeline for rapid production of versatile nanobody repertoires. Nature methods. 2014;11:1253–1260. doi: 10.1038/nmeth.3170.
- 68.Schumacher TN, Schreiber RD. Neoantigens in cancer immunotherapy. Science. 2015;348:69–74. doi: 10.1126/science.aaa4971.
- 69.Gubin MM, Artyomov MN, Mardis ER, Schreiber RD. Tumor neoantigens: building a framework for personalized cancer immunotherapy. J Clin Invest. 2015;125:3413–3421. doi: 10.1172/JCI80008.
- 70.Granados DP, et al. Impact of genomic polymorphisms on the repertoire of human MHC class I-associated peptides. Nature communications. 2014;5:3600. doi: 10.1038/ncomms4600.
- 71.Yadav M, et al. Predicting immunogenic tumour mutations by combining mass spectrometry and exome sequencing. Nature. 2014;515:572–576. doi: 10.1038/nature14001.
- 72.Bassani-Sternberg M, Pletscher-Frankild S, Jensen LJ, Mann M. Mass spectrometry of human leukocyte antigen class I peptidomes reveals strong effects of protein abundance and turnover on antigen presentation. Molecular & cellular proteomics : MCP. 2015;14:658–673. doi: 10.1074/mcp.M114.042812.
- 73.Caron E, et al. An open-source computational and data resource to analyze digital maps of immunopeptidomes. eLife. 2015;4. doi: 10.7554/eLife.07661.
- 74.Pineda SS, Undheim EA, Rupasinghe DB, Ikonomopoulou MP, King GF. Spider venomics: implications for drug discovery. Future medicinal chemistry. 2014;6:1699–1714. doi: 10.4155/fmc.14.103.
- 75.Biass D, et al. Uncovering intense protein diversification in a cone snail venom gland using an integrative venomics approach. Journal of proteome research. 2015;14:628–638. doi: 10.1021/pr500583u.
- 76.Dutertre S, et al. Deep venomics reveals the mechanism for expanded peptide diversity in cone snail venom. Molecular & cellular proteomics : MCP. 2013;12:312–329. doi: 10.1074/mcp.M112.021469.
- 77.Ueberheide BM, Fenyo D, Alewood PF, Chait BT. Rapid sensitive analysis of cysteine rich peptide venom components. Proceedings of the National Academy of Sciences of the United States of America. 2009;106:6910–6915. doi: 10.1073/pnas.0900745106.
- 78.Aman JW, et al. Insights into the origins of fish hunting in venomous cone snails from studies of Conus tessulatus. Proceedings of the National Academy of Sciences of the United States of America. 2015. doi: 10.1073/pnas.1424435112.
- 79.Safavi-Hemami H, et al. Specialized insulin is used for chemical warfare by fish-hunting cone snails. Proceedings of the National Academy of Sciences of the United States of America. 2015;112:1743–1748. doi: 10.1073/pnas.1423857112.
- 80.Brahma RK, McCleary RJ, Kini RM, Doley R. Venom gland transcriptomics for identifying, cataloging, and characterizing venom proteins in snakes. Toxicon : official journal of the International Society on Toxinology. 2015;93:1–10. doi: 10.1016/j.toxicon.2014.10.022.
- 81.Margres MJ, et al. Linking the transcriptome and proteome to characterize the venom of the eastern diamondback rattlesnake (Crotalus adamanteus). Journal of proteomics. 2014;96:145–158. doi: 10.1016/j.jprot.2013.11.001.
- 82.Edwards NJ. Novel peptide identification from tandem mass spectra using ESTs and sequence database compression. Molecular systems biology. 2007;3:102. doi: 10.1038/msb4100142.
- 83.Silvester N, et al. Content discovery and retrieval services at the European Nucleotide Archive. Nucleic acids research. 2015;43:D23–29. doi: 10.1093/nar/gku1129.
- 84.Kodama Y, Shumway M, Leinonen R, International Nucleotide Sequence Database Collaboration. The Sequence Read Archive: explosive growth of sequencing data. Nucleic acids research. 2012;40:D54–56. doi: 10.1093/nar/gkr854.
- 85.Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
- 86.Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
- 87.Trapnell C, Pachter L, Salzberg SL. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics. 2009;25:1105–1111. doi: 10.1093/bioinformatics/btp120.
- 88.Dobin A, et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013;29:15–21. doi: 10.1093/bioinformatics/bts635.
- 89.Trapnell C, et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature protocols. 2012;7:562–578. doi: 10.1038/nprot.2012.016.
- 90.Steijger T, et al. Assessment of transcript reconstruction methods for RNA-seq. Nature methods. 2013;10:1177–1184. doi: 10.1038/nmeth.2714.
- 91.DePristo MA, et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nature genetics. 2011;43:491–498. doi: 10.1038/ng.806.
- 92.Li H, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
- 93.Wan J, Qian SB. TISdb: a database for alternative translation initiation in mammalian cells. Nucleic acids research. 2014;42:D845–850. doi: 10.1093/nar/gkt1085.
- 94.Vanderperre B, Lucier JF, Roucou X. HAltORF: a database of predicted out-of-frame alternative open reading frames in human. Database : the journal of biological databases and curation. 2012;2012:bas025. doi: 10.1093/database/bas025.
- 95.Grillo G, et al. UTRdb and UTRsite (RELEASE 2010): a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic acids research. 2010;38:D75–80. doi: 10.1093/nar/gkp902.
- 96.Wethmar K, Barbosa-Silva A, Andrade-Navarro MA, Leutz A. uORFdb--a comprehensive literature database on eukaryotic uORF biology. Nucleic acids research. 2014;42:D60–67. doi: 10.1093/nar/gkt952.
- 97.Kim P, et al. ChimerDB 2.0--a knowledgebase for fusion genes updated. Nucleic acids research. 2010;38:D81–85. doi: 10.1093/nar/gkp982.
- 98.Kong F, et al. dbCRID: a database of chromosomal rearrangements in human diseases. Nucleic acids research. 2011;39:D895–900. doi: 10.1093/nar/gkq1038.
- 99.Frenkel-Morgenstern M, Gorohovski A, Vucenovic D, Maestre L, Valencia A. ChiTaRS 2.1--an improved database of the chimeric transcripts and RNA-seq data with novel sense-antisense chimeric RNA transcripts. Nucleic acids research. 2015;43:D68–75. doi: 10.1093/nar/gku1199.
- 100.Sherry ST, et al. dbSNP: the NCBI database of genetic variation. Nucleic acids research. 2001;29:308–311. doi: 10.1093/nar/29.1.308.
- 101.Li J, Duncan DT, Zhang B. CanProVar: a human cancer proteome variation database. Human mutation. 2010;31:219–228. doi: 10.1002/humu.21176.
- 102.Forbes SA, et al. COSMIC: exploring the world's knowledge of somatic mutations in human cancer. Nucleic acids research. 2015;43:D805–811. doi: 10.1093/nar/gku1075.
- 103.Picardi E, D'Erchia AM, Montalvo A, Pesole G. Using REDItools to Detect RNA Editing Events in NGS Datasets. Current protocols in bioinformatics. 2015;49:12.12.1–12.12.15. doi: 10.1002/0471250953.bi1212s49.
- 104.Zhang Q, Xiao X. Genome sequence-independent identification of RNA editing sites. Nature methods. 2015. doi: 10.1038/nmeth.3314.
- 105.Volders PJ, et al. An update on LNCipedia: a database for annotated human lncRNA sequences. Nucleic acids research. 2015;43:D174–180. doi: 10.1093/nar/gku1060.
- 106.Xie C, et al. NONCODEv4: exploring the world of long non-coding RNA genes. Nucleic acids research. 2014;42:D98–103. doi: 10.1093/nar/gkt1222.
- 107.Amberger J, Bocchini CA, Scott AF, Hamosh A. McKusick's Online Mendelian Inheritance in Man (OMIM). Nucleic acids research. 2009;37:D793–796. doi: 10.1093/nar/gkn665.
- 108.Spivak M, Weston J, Bottou L, Kall L, Noble WS. Improvements to the percolator algorithm for Peptide identification from shotgun proteomics data sets. Journal of proteome research. 2009;8:3737–3745. doi: 10.1021/pr801109k.
- 109.Kall L, Canterbury JD, Weston J, Noble WS, MacCoss MJ. Semi-supervised learning for peptide identification from shotgun proteomics datasets. Nature methods. 2007;4:923–925. doi: 10.1038/nmeth1113.
- 110.Degroeve S, Martens L. MS2PIP: a tool for MS/MS peak intensity prediction. Bioinformatics. 2013;29:3199–3203. doi: 10.1093/bioinformatics/btt544.
- 111.Sheynkman GM, et al. Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations. BMC genomics. 2014;15:703. doi: 10.1186/1471-2164-15-703.
- 112.Wang X, Zhang B. customProDB: an R package to generate customized protein databases from RNA-Seq data for proteomics search. Bioinformatics. 2013;29:3235–3237. doi: 10.1093/bioinformatics/btt543.
- 113.Khatun J, et al. Whole human genome proteogenomic mapping for ENCODE cell line data: identifying protein-coding regions. BMC genomics. 2013;14:141. doi: 10.1186/1471-2164-14-141.
- 114.Chaiyarit S, et al. Chromosome-centric Human Proteome Project (C-HPP): Chromosome 12. Journal of proteome research. 2014;13:3160–3165. doi: 10.1021/pr500009j.
- 115.Pinto SM, et al. Functional annotation of proteome encoded by human chromosome 22. Journal of proteome research. 2014;13:2749–2760. doi: 10.1021/pr401169d.
- 116.Liu S, et al. A chromosome-centric human proteome project (C-HPP) to characterize the sets of proteins encoded in chromosome 17. Journal of proteome research. 2013;12:45–57. doi: 10.1021/pr300985j.
- 117.Tanner S, et al. InsPecT: identification of posttranslationally modified peptides from tandem mass spectra. Analytical chemistry. 2005;77:4626–4639. doi: 10.1021/ac050102d.
- 118.de Souza GA, Arntzen MO, Wiker HG. MSMSpdbb: providing protein databases of closely related organisms to improve proteomic characterization of prokaryotic microbes. Bioinformatics. 2010;26:698–699. doi: 10.1093/bioinformatics/btq004.
- 119.Vaudel M, Barsnes H, Berven FS, Sickmann A, Martens L. SearchGUI: An open-source graphical user interface for simultaneous OMSSA and X!Tandem searches. Proteomics. 2011;11:996–999. doi: 10.1002/pmic.201000595.
- 120.Craig R, Beavis RC. TANDEM: matching proteins with tandem mass spectra. Bioinformatics. 2004;20:1466–1467. doi: 10.1093/bioinformatics/bth092.
- 121.Kim S, Pevzner PA. MS-GF+ makes progress towards a universal database search tool for proteomics. Nature communications. 2014;5:5277. doi: 10.1038/ncomms6277.
- 122.Dorfer V, et al. MS Amanda, a universal identification algorithm optimized for high accuracy tandem mass spectra. Journal of proteome research. 2014;13:3679–3684. doi: 10.1021/pr500202e.
- 123.Tabb DL, Fernando CG, Chambers MC. MyriMatch: highly accurate tandem mass spectral peptide identification by multivariate hypergeometric analysis. Journal of proteome research. 2007;6:654–661. doi: 10.1021/pr0604054.
- 124.Eng JK, Jahan TA, Hoopmann MR. Comet: an open-source MS/MS sequence database search tool. Proteomics. 2013;13:22–24. doi: 10.1002/pmic.201200439.
- 125.Diament BJ, Noble WS. Faster SEQUEST searching for peptide identification from tandem mass spectra. Journal of proteome research. 2011;10:3871–3879. doi: 10.1021/pr101196n.
- 126.Geer LY, et al. Open mass spectrometry search algorithm. Journal of proteome research. 2004;3:958–964. doi: 10.1021/pr0499491.
- 127.Frank A, Pevzner P. PepNovo: de novo peptide sequencing via probabilistic network modeling. Analytical chemistry. 2005;77:964–973. doi: 10.1021/ac048788h.
- 128.Tabb DL, Ma ZQ, Martin DB, Ham AJ, Chambers MC. DirecTag: accurate sequence tags from peptide MS/MS through statistical scoring. Journal of proteome research. 2008;7:3838–3846. doi: 10.1021/pr800154p.
- 129.Muth T, et al. DeNovoGUI: an open source graphical user interface for de novo sequencing of tandem mass spectra. Journal of proteome research. 2014;13:1143–1146. doi: 10.1021/pr4008078.
- 130.Jeong K, Kim S, Pevzner PA. UniNovo: a universal tool for de novo peptide sequencing. Bioinformatics. 2013;29:1953–1962. doi: 10.1093/bioinformatics/btt338.
- 131.Shevchenko A, Sunyaev S, Liska A, Bork P, Shevchenko A. Nanoelectrospray tandem mass spectrometry and sequence similarity searching for identification of proteins from organisms with unknown genomes. Methods in molecular biology. 2003;211:221–234. doi: 10.1385/1-59259-342-9:221.
- 132.Kayser JP, Vallet JL, Cerny RL. Defining parameters for homology-tolerant database searching. Journal of biomolecular techniques : JBT. 2004;15:285–295.
- 133.Jagtap P, et al. A two-step database search method improves sensitivity in peptide sequence matches for metaproteomics and proteogenomics studies. Proteomics. 2013;13:1352–1357. doi: 10.1002/pmic.201200352.
- 134.Vaudel M, et al. PeptideShaker enables reanalysis of MS-derived proteomics data sets. Nature biotechnology. 2015;33:22–24. doi: 10.1038/nbt.3109.
- 135.Franceschini A, et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic acids research. 2013;41:D808–815. doi: 10.1093/nar/gks1094.
- 136.Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature protocols. 2009;4:44–57. doi: 10.1038/nprot.2008.211.
- 137.Binns D, et al. QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics. 2009;25:3045–3046. doi: 10.1093/bioinformatics/btp536.
- 138.Ma K, Vitek O, Nesvizhskii AI. A statistical model-building perspective to identification of MS/MS spectra with PeptideProphet. BMC bioinformatics. 2012;13(Suppl 16):S1. doi: 10.1186/1471-2105-13-S16-S1.
- 139.Deutsch EW, et al. A guided tour of the Trans-Proteomic Pipeline. Proteomics. 2010;10:1150–1159. doi: 10.1002/pmic.200900375.
- 140.Gonnelli G, et al. A decoy-free approach to the identification of peptides. Journal of proteome research. 2015;14:1792–1798. doi: 10.1021/pr501164r.
- 141.Craig R, Cortens JP, Beavis RC. Open source system for analyzing, validating, and storing protein identification data. Journal of proteome research. 2004;3:1234–1242. doi: 10.1021/pr049882h.
- 142.Sanders WS, et al. The proteogenomic mapping tool. BMC bioinformatics. 2011;12:115. doi: 10.1186/1471-2105-12-115.
- 143.Ferro M, et al. PepLine: a software pipeline for high-throughput direct mapping of tandem mass spectrometry data on genomic sequences. Journal of proteome research. 2008;7:1873–1883. doi: 10.1021/pr070415k.
- 144.Menschaert G, et al. A hybrid, de novo based, genome-wide database search approach applied to the sea urchin neuropeptidome. Journal of proteome research. 2010;9:990–996. doi: 10.1021/pr900885k.
- 145.Kuhring M, Renard BY. iPiG: integrating peptide spectrum matches into genome browser visualizations. PloS one. 2012;7:e50246. doi: 10.1371/journal.pone.0050246.
- 146.Peterson ES, et al. VESPA: software to facilitate genomic annotation of prokaryotic organisms through integration of proteomic and transcriptomic data. BMC genomics. 2012;13:131. doi: 10.1186/1471-2164-13-131.
- 147.Brouwer RW, van Hijum SA, Kuipers OP. MINOMICS: visualizing prokaryote transcriptomics and proteomics data in a genomic context. Bioinformatics. 2009;25:139–140. doi: 10.1093/bioinformatics/btn588.
- 148.Goecks J, Nekrutenko A, Taylor J, Galaxy Team. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome biology. 2010;11:R86. doi: 10.1186/gb-2010-11-8-r86.
- 149.Robinson JT, et al. Integrative genomics viewer. Nature biotechnology. 2011;29:24–26. doi: 10.1038/nbt.1754.
- 150.Omasits U, Ahrens CH, Muller S, Wollscheid B. Protter: interactive protein feature visualization and integration with experimental proteomic data. Bioinformatics. 2014;30:884–886. doi: 10.1093/bioinformatics/btt607.
- 151.Risk BA, Spitzer WJ, Giddings MC. Peppy: proteogenomic search software. Journal of proteome research. 2013;12:3019–3025. doi: 10.1021/pr400208w.
- 152.Kumar D, et al. Proteogenomic analysis of Bradyrhizobium japonicum USDA110 using GenoSuite, an automated multi-algorithmic pipeline. Molecular & cellular proteomics : MCP. 2013;12:3388–3397. doi: 10.1074/mcp.M112.027169.
- 153.Uszkoreit J, Plohnke N, Rexroth S, Marcus K, Eisenacher M. The bacterial proteogenomic pipeline. BMC genomics. 2014;15(Suppl 9):S19. doi: 10.1186/1471-2164-15-S9-S19.
- 154.Nagaraj SH, et al. PGTools: a software suite for proteogenomics data analysis and visualization. Journal of proteome research. 2015. doi: 10.1021/acs.jproteome.5b00029.
- 155.Krzywinski M, et al. Circos: an information aesthetic for comparative genomics. Genome research. 2009;19:1639–1645. doi: 10.1101/gr.092759.109.
- 156.Thorvaldsdottir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics. 2013;14:178–192. doi: 10.1093/bib/bbs017.
- 157.Ghali F, et al. ProteoAnnotator--open source proteogenomics annotation software supporting PSI standards. Proteomics. 2014;14:2731–2741. doi: 10.1002/pmic.201400265.
- 158.Boekel J, et al. Multi-omic data analysis using Galaxy. Nature biotechnology. 2015;33:137–139. doi: 10.1038/nbt.3134.
- 159.Wolstencroft K, et al. The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud. Nucleic acids research. 2013;41:W557–561. doi: 10.1093/nar/gkt328.
- 160.Lindenbaum P, Le Scouarnec S, Portero V, Redon R. Knime4Bio: a set of custom nodes for the interpretation of next-generation sequencing data with KNIME. Bioinformatics. 2011;27:3200–3201. doi: 10.1093/bioinformatics/btr554.
- 161.Hunter AA, Macgregor AB, Szabo TO, Wellington CA, Bellgard MI. Yabi: An online research environment for grid, high performance and cloud computing. Source code for biology and medicine. 2012;7:1. doi: 10.1186/1751-0473-7-1.
- 162.Wen B, et al. sapFinder: an R/Bioconductor package for detection of variant peptides in shotgun proteomics experiments. Bioinformatics. 2014;30:3136–3138. doi: 10.1093/bioinformatics/btu397.
- 163.Jagtap PD, et al. Flexible and accessible workflows for improved proteogenomic analysis using the Galaxy framework. Journal of proteome research. 2014;13:5898–5908. doi: 10.1021/pr500812t.
- 164.Jagtap P, et al. Workflow for analysis of high mass accuracy salivary data set using MaxQuant and ProteinPilot search algorithm. Proteomics. 2012;12:1726–1730. doi: 10.1002/pmic.201100097.
- 165.Saito R, et al. A travel guide to Cytoscape plugins. Nature methods. 2012;9:1069–1076. doi: 10.1038/nmeth.2212.
- 166.ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.
- 167.Goode RJ, et al. The proteome browser web portal. Journal of proteome research. 2013;12:172–178. doi: 10.1021/pr3010056.
- 168.Yang S, et al. CAPER 3.0: a scalable cloud-based system for data-intensive analysis of Chromosome-centric Human Proteome Project datasets. Journal of proteome research. 2015. doi: 10.1021/pr501335w.
- 169.Jeong SK, et al. GenomewidePDB, a proteomic database exploring the comprehensive protein parts list and transcriptome landscape in human chromosomes. Journal of proteome research. 2013;12:106–111. doi: 10.1021/pr3009447.


