Author manuscript; available in PMC: 2016 Feb 24.
Published in final edited form as: J Proteome Res. 2015 Oct 7;14(12):4945–4948. doi: 10.1021/acs.jproteome.5b00688

Devising a Consensus Framework for Validation of Novel Human Coding Loci

Elspeth A Bruford, Lydie Lane, Jennifer Harrow
PMCID: PMC4765950  NIHMSID: NIHMS759884  PMID: 26367542

Abstract

A report on the Wellcome Trust retreat on devising a consensus framework for the validation of novel human protein coding loci, held in Hinxton, U.K., May 11–13, 2015.

Keywords: proteogenomics, transcriptomics, ribosome profiling, small ORF, long noncoding RNA



INTRODUCTION

At the HUPO 2011 Congress the Chromosome-Centric Human Proteome Project (c-HPP) was officially launched as a large multidisciplinary international effort to catalogue and characterize human gene products, with a focus on loci that have little or no evidence at the protein level, termed “missing proteins”.1 At that time there was no satisfactory evidence at the protein level for 33% of the 20 059 protein coding genes. Since this initial assessment the number of missing proteins has been reduced considerably, but as of October 2014, ~18% of the human proteome was still lacking evidence at the protein level.2 There are several reasons why some proteins are difficult to identify by mass spectrometry (MS) or antibody-based strategies. Some may be expressed in rarely studied tissues or cell types or are only expressed/induced in very specific conditions. Others are expressed at levels that are below the detection limits of current MS instruments. In addition, some protein sequences do not contain tryptic peptides or epitopes that can unambiguously identify the protein. Finally, a number of protein-coding loci that were predicted from genomic analysis simply do not exist, are pseudogenes, or give rise to nonprotein coding RNA. With technological advances in the proteogenomics field and better access to rare samples, solutions to overcome the issues of rare or low abundance proteins are emerging; however, due to increased data set sizes and lower detection limits, the fraction of incorrect protein identifications tends to increase. Furthermore, a number of recent publications have claimed to have found “new” proteins overlooked by the main reference databases, but careful reanalysis of the evidence provided indicates that many of these claims are invalid.

The “Devising a Consensus Framework for Validation of Novel Human Coding Loci” workshop was held from May 11 to 13, 2015, at the Wellcome Genome Campus at Hinxton, bringing together invited experts from the fields of genomics, transcriptomics, and MS-based proteomics, along with representatives from the major human genome and protein databases and proteomic repositories. A test set of putative protein coding loci from recent publications was selected and distributed to the participating annotation resources (HAVANA,3 neXtProt,4 RefSeq,5 and Swiss-Prot6) prior to the workshop by the organizing committee (Jen Harrow (GENCODE7), Elspeth Bruford (HGNC8), and Lydie Lane and Amos Bairoch (neXtProt4)). The central focus of the meeting was discussing these edge cases and consolidating a list of putative human protein products, with the eventual aim of formulating guidelines for publication that the biocuration and scientific research community can use to determine the evidence needed to support a novel locus being included in or excluded from the reference human protein-coding gene set.

IDENTIFYING/CONFIRMING NOVEL PROTEINS USING MASS SPEC AND RIBOSOME PROFILING

This session began with presentations from Alexey Nesvizhskii from University of Michigan (Ann Arbor, MI), Michael Tress from the Spanish National Cancer Research Centre (CNIO; Madrid, Spain), and Eric Deutsch from the Institute for Systems Biology (Seattle, WA). Nesvizhskii discussed false discovery rates (FDRs) in MS-based data sets, which are statistical measures of the accuracy of the protein identification. He stressed that FDR cut-offs must be adjusted depending on the locus types being studied. He also introduced his Web site CRAPome,9 the Contaminant Repository of Affinity Purification Data, which aggregates background contaminants from negative controls of affinity purification-MS studies. Tress, who works with the GENCODE consortium, also stressed the importance of FDRs and suggested that the participants of the workshop should devise some rules for filtering proteomic data to ensure that only high confidence data are used for validation. Two recent large-scale proteomics studies reported a large number of “novel protein coding genes”;10,11 however, the analysis of the GENCODE consortium12 showed that these two studies reported dubious peptide evidence for at least 200 olfactory receptor genes (which encode transmembrane proteins with highly restricted expression), even though the studies had not included data from olfactory tissue. Tress cited three possible reasons for the high number of erroneous predictions: some of the identified peptides match to more than one gene; some peptides were wrongly identified as containing a glutamine to pyroglutamic acid modification in non-N-terminal positions; and both studies included many low quality spectra. Deutsch, who is in charge of the PeptideAtlas,13 warned the participants of the danger of combining data sets, because this can produce a very high FDR; he said that the thresholds used in PeptideAtlas are very stringent to avoid random noise. 
For example, PeptideAtlas identifies only two discriminating olfactory receptor peptides, and even these two have now been reassessed as having more likely matches to peptide variants of abundant proteins. Evidence is regarded as insufficient if the peptides are fewer than seven residues or weak if fewer than two different (but potentially overlapping) peptides support one protein. In their 2015-03 release, PeptideAtlas has now raised the bar to 9 peptides of 9 or more amino acids.13 Gil Omenn of the University of Michigan (Ann Arbor, MI), chair of the HPP,2 commented that structural proteomics can also be useful in validation by predicting if the protein, including splice isoforms, may be functional. This was followed by discussion of some of the test set of loci that had been distributed prior to the meeting. The key point arising from this discussion was that the different annotation resources (HAVANA, neXtProt, RefSeq, and Swiss-Prot) were generally in agreement about their criteria for deciding coding status (e.g., evolutionary conservation, transcript evidence, ribosome profiling, published experimental characterization) but required help to decide whether MS data for a potential new ORF were of high enough quality to trigger further analysis. To this end, Deutsch suggested that candidate new proteins be sent to PeptideAtlas and that twice a year they would be processed against the combined set of human MS spectra and valid peptides reported back. Nesvizhskii proposed giving access to a web-based tool that will summarize the reliability of a peptide identification.
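The peptide-level rule of thumb described in this session (peptides shorter than seven residues are insufficient, and support from fewer than two different peptides is weak) can be sketched in a few lines of Python. This is only an illustration and not the PeptideAtlas implementation: the function name, the returned labels, and the choice to discard short peptides before counting distinct sequences are all assumptions made for this example.

```python
def classify_peptide_evidence(peptides, min_length=7, min_distinct=2):
    """Label the peptide support for one candidate protein.

    Hypothetical helper: the default thresholds follow the rule of thumb
    in the text (7 residues, 2 different peptides); the labels and the
    order of the checks are assumptions for illustration.
    """
    # Keep only distinct sequences that are long enough to count.
    usable = {p.upper() for p in peptides if len(p) >= min_length}
    if not usable:
        return "insufficient"
    if len(usable) < min_distinct:
        return "weak"
    return "meets threshold"


if __name__ == "__main__":
    # Made-up peptide sequences, used only to exercise the three outcomes.
    print(classify_peptide_evidence(["LSPEK"]))
    print(classify_peptide_evidence(["AVLDEGTK"]))
    print(classify_peptide_evidence(["AVLDEGTK", "VLDEGTKR"]))
```

Stricter cut-offs can be explored by changing `min_length` and `min_distinct`. Overlapping peptides are counted separately here, in line with the "potentially overlapping" wording of the rule.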

IDENTIFYING SMALL ORFs AND uORFs

Jon Mudge, from the HAVANA group (Hinxton, U.K.), started this session by describing their annotation of small coding DNA sequences (CDSs). This led to a discussion of the term “uORF” (upstream open reading frame), as some felt this may ascribe a function to the locus of regulating the translation of the CDS. Mudge stated that most 5′ untranslated regions (UTRs) contain some kind of uORF, so should they all be annotated, or only ones for which some functional data have been found? Juan Pablo Couso from the University of Sussex highlighted that there is a continuum between clearly defined protein coding loci and characterized nonprotein coding loci. Translation can be inferred from the results of ribosomal profiling, which assays the regions of mRNA protected by ribosomes while undergoing translation. A combination of ribosomal profiling and MS data has been used to identify small ORFs (smORFs) that are common in Drosophila14 and could potentially be emergent protein coding genes. Conservation across species can also argue in favor of the existence of these small proteins; however, Couso emphasized the difficulty in using BLAST for identifying smORFs under 70 amino acids (aa). It was proposed that smORFs of >70 aa be classed as protein coding if there were at least two lines of evidence (e.g., conservation and ribosome profiling), but smORFs of <70 aa would only be regarded as protein coding if experimental evidence at the protein level was available; however, it should be borne in mind that detection of such small proteins by MS can be challenging, as they are usually low abundance, have only a few tryptic sites, and are spatially/temporally restricted. From the few examples discussed at the end of the session, it appeared that it was sometimes difficult for curators to decide whether the data should be represented as a one-gene or two-gene model.
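The proposed smORF rule from this session can be written as a small decision function. The sketch below is an illustration only: the function name, argument names, and output labels are hypothetical. The proposal does not say how a smORF of exactly 70 aa should be handled, so the code assumes it falls into the stricter, shorter class.

```python
def classify_smorf(length_aa, evidence_lines, has_protein_level_evidence):
    """Apply the proposed smORF rule to one candidate.

    Hypothetical helper. `evidence_lines` is a collection of independent
    evidence types, e.g. {"conservation", "ribosome profiling"}.
    Assumption: a smORF of exactly 70 aa is treated like the <70 aa class.
    """
    if length_aa > 70:
        # Longer smORFs: at least two distinct lines of evidence suffice.
        coding = len(set(evidence_lines)) >= 2
    else:
        # Shorter smORFs: require experimental evidence at the protein level.
        coding = bool(has_protein_level_evidence)
    return "protein coding" if coding else "not yet supported"


if __name__ == "__main__":
    two_lines = {"conservation", "ribosome profiling"}
    print(classify_smorf(120, two_lines, False))
    print(classify_smorf(45, two_lines, False))
    print(classify_smorf(45, set(), True))
```

The length threshold is hard-coded to keep the example short. A curation pipeline would more likely treat it as a parameter and record which evidence types supported each decision.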

PSEUDOGENES CONFUSING THE PROTEOME LANDSCAPE

The next workshop session focused on pseudogenes and began with Cristina Sisu from Yale University (New Haven, CT), who presented her analysis of the human, worm, and fly pseudogenes,15 highlighting their finding of great variation in pseudogene composition between the species. Of particular relevance to the meeting were their criteria for predicting transcribed and translated pseudogenes in human. Sisu highlighted that there is currently little overlap between the 1098 transcribed pseudogenes predicted by the HAVANA group and the 1441 predicted automatically using RNA-seq data mapping to GENCODE 19. Harsha Gowda from the Institute of Bioinformatics in Bangalore (India) discussed transcribed and translated human pseudogenes from the Pandey lab,11 although it was evident that their analysis was less stringent. Jyoti Choudhary from the WTSI (Hinxton, U.K.) highlighted their validation of translated pseudogenes from analysis in mouse;16 however, the number of identified and validated loci is extremely low (around nine). Finally, Kim Pruitt from NCBI highlighted issues of annotating pseudogenes and correctly mapping expression data between pseudogenes and the parent gene from which they are derived. RefSeq has currently predicted over 1032 transcribed pseudogenes. If a pseudogene has coding potential, then RefSeq curators will change it to a protein coding gene and it will no longer be tracked as a pseudogene, unlike HAVANA, who annotate a pseudogene with coding potential or an MS hit as a translated pseudogene. At the end of the session, the annotation groups discussed difficult annotation examples, where assignment as a pseudogene or protein coding gene was ambiguous and involved investigating variation of haplotypes, underlying genome assembly errors, and using extremely stringent mapping options to confidently change predictions to coding genes.

LONG NONCODING RNAS

The workshop focus then switched to long noncoding RNAs (lncRNAs) with remote presentations from Chris Ponting from the University of Oxford (Oxford, U.K.), Mitch Guttman from Stanford University (Stanford, CA), and Tim Mercer from the Garvan Institute of Medical Research (Sydney, Australia) as well as Peter Jan Volders and Kenneth Verheggen from Ghent University (Gent, Belgium). Ponting discussed the low degree of conservation of most lncRNAs,17 including the example of the RNA gene H19, which contains ORFs that are not conserved and hence unlikely to encode proteins. He also mentioned that conversely there are lineage-specific examples of protein coding genes that are not well-conserved, such as the 2 Mb EYS gene that is mutated in some cases of retinitis pigmentosa in humans but has been pseudogenised in rodents. Guttman likewise mentioned the importance of evolutionary conservation; for example, as measured by a PhyloCSF score, which utilizes a multispecies genome alignment to identify conserved regions in determining coding status. He also talked about the ribosome release score (RRS),18 a ratio of the ribosome profiling reads in an ORF compared with the reads in the 3′ UTR of a transcript, which can be used to differentiate translated and nontranslated RNAs. There are very few lncRNAs that have both a high RRS and a high PhyloCSF score. Mercer talked about using CaptureSeq to identify novel lncRNA transcripts.19 While RNA-seq is good at detecting highly expressed genes, it is not so good at identifying transcripts that are more weakly expressed. In contrast, CaptureSeq is selective RNA sequencing that uses short nucleotide probes to target specific regions of interest and hence can provide improved detection and assembly of lncRNA gene models. 
Volders and Verheggen both represented the LNCipedia resource,20 and Volders discussed the need for a gold standard for ribosome profiling data to enable comparisons between results, while Verheggen introduced their ReSpin pipeline to reprocess data from the PRIDE proteomics identifications database and described how subtracting these data from transcripts with a low PhyloCSF score had enabled them to identify a high confidence set of lncRNA transcripts.
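The ribosome release score discussed in this session is described as a ratio of ribosome profiling reads in an ORF to reads in the 3′ UTR. A minimal sketch of that ratio follows. It is a simplification for illustration and not the published method: normalizing by region length and adding a pseudocount are assumptions made here, the published score18 may apply further normalization that is not reproduced, and the function and parameter names are hypothetical.

```python
def ribosome_release_score(orf_reads, utr3_reads, orf_length, utr3_length,
                           pseudocount=1.0):
    """Simplified ORF-to-3'-UTR ribosome footprint density ratio.

    Hypothetical sketch: length normalization and the pseudocount are
    assumptions made so the ratio is defined when the 3' UTR has no reads.
    """
    if orf_length <= 0 or utr3_length <= 0:
        raise ValueError("region lengths must be positive")
    # Footprint density (reads per nucleotide) in each region.
    orf_density = (orf_reads + pseudocount) / orf_length
    utr3_density = (utr3_reads + pseudocount) / utr3_length
    # Translated ORFs should show a sharp drop in footprints after the
    # stop codon, which gives a high ratio.
    return orf_density / utr3_density


if __name__ == "__main__":
    # Footprints concentrated in the ORF: high score.
    print(ribosome_release_score(299, 0, 300, 100))
    # Footprints spread evenly across ORF and 3' UTR: score near 1.
    print(ribosome_release_score(99, 99, 100, 100))
```

In this simplified form, a score close to 1 means footprint density does not fall after the putative stop codon, as expected for a nontranslated RNA. A large score reflects the ribosome release expected at the end of a translated ORF.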

HUMAN VARIATION

This session was devoted to human variation and its impact on genome annotation, with presentations from Marie-Paule Lefranc from the Institut de Génétique Humaine (Montpellier, France), Jens Mayer from the University of Saarland (Saarbrücken, Germany), and a remote presentation from Tsviya Olender at the Weizmann Institute (Rehovot, Israel). Lefranc introduced IMGT,21 the international ImMunoGeneTics Information System, and highlighted the structural conservation of immunoglobulin variable domains combined with their very high amino acid diversity; therefore, there can be huge variability in the resulting peptides, and some peptides may be expressed very specifically or at extremely low levels, making their detection very difficult. Mayer discussed transposable elements, in particular, endogenous retroviruses (ERVs) and long interspersed nuclear elements (LINEs), in the human genome. ERVs and related sequences comprise ~8% of the human genome, but only the ERV-K family has continued to replicate in the human genome, and its loci are highly polymorphic but also highly similar,22 making it difficult to distinguish which loci are actually protein coding. Olender, from the HORDE database of olfactory receptors,23 talked about their recent analysis of the human olfactory receptor repertoire (the largest gene family in the human genome) using Next Generation Sequencing (NGS) data from 4 samples of human olfactory epithelium versus 16 control tissues. Clonal expression of OR transcripts in individual olfactory epithelial cells might make detection of transcripts, let alone proteins, more problematic. As well as being highly restricted in their expression, some olfactory receptor loci can vary in coding status between individual genomes (called polymorphic or segregating pseudogenes), so they have also devised a metric called the “CORP” (Classifier for Olfactory Receptor Pseudogenes) score to estimate the pseudogene probability of a given allele.

CONCLUSIONS AND NEXT STEPS

It was clear from the workshop that there is considerable ambiguity surrounding how the different annotation groups interpret proteomics data when using it as evidence for a novel protein coding gene. The groups were reluctant to use proteomic data alone for validation, unless there were multiple high confidence hits to support it. The curation teams wanted to improve communication between the different groups (GENCODE, HGNC, neXtProt, RefSeq, Swiss-Prot) when trying to evaluate these edge cases as putative novel protein coding genes and proposed using the existing consensus CDS (CCDS) system24 provided by the NCBI to record the exchanges between curators. We hope that further examination of edge cases will be used in the development of guidelines from the curation groups that will aid the correct interpretation of MS evidence to validate novel protein coding genes. Members of the proteomics community attending the workshop are continuing discussions concerning the development of tools that the annotation community could use to distinguish between high-quality MS data and low-quality data that should not be used for evidence. PeptideAtlas (Deutsch) agreed to consider incorporating RefSeq as an additional reference proteome, probably in the next annual update. The annotation groups also proposed providing novel RNA-seq models for the proteogenomics community to increase their search space and hence the opportunities to find novel proteins. Conversely, it was recommended that claimed peptide matches to novel proteins from pseudogenes or lncRNAs should be searched for high-quality matches to known sequences with single amino acid variants or isobaric post-translational modifications. Novel annotations from chromosomes 14 and 21 will be used for a pilot analysis to see if MS data can be used in combination with other data to identify novel protein coding genes. 
Finally, the participants felt it would be useful to rerun the workshop at a suitable frequency (1 to 2 years) to discuss updates in technology and how the different annotation groups can incorporate these changes into their analyses.

Acknowledgments

We are grateful to Gil Omenn (University of Michigan, Ann Arbor, MI) and Amos Bairoch (SIB Swiss Institute of Bioinformatics) for their input and critical reading of the manuscript.

Funding

Funding for the retreat was provided by the Wellcome Trust Scientific Conferences program. E.A.B. is funded by National Human Genome Research Institute (NHGRI) grant U41HG003345 and Wellcome Trust grant 099129/Z/12/Z. J.H. is funded by National Institutes of Health grant U41HG007234 and Wellcome Trust grant WT098051. L.L. is funded by the Swiss Federation Commission for Technology and Innovation grant CTI 10214.

Footnotes

Notes

The authors declare no competing financial interest.

ADDITIONAL NOTE

Intended as part of The Chromosome-Centric Human Proteome Project 2015 special issue.

REFERENCES

(1) Paik Y-K, Jeong S-K, Omenn GS, Uhlen M, Hanash S, Cho SY, Lee H-J, Na K, Choi E-Y, Yan F, et al. The Chromosome-Centric Human Proteome Project for cataloging proteins encoded in the genome. Nat. Biotechnol. 2012;30:221–223. doi: 10.1038/nbt.2152.
(2) Omenn GS, Lane L, Lundberg EK, Beavis RC, Nesvizhskii AI, Deutsch EW. Metrics for the Human Proteome Project 2015: Progress on the Human Proteome and Guidelines for High-Confidence Protein Identification. J. Proteome Res. 2015;14:3452. doi: 10.1021/acs.jproteome.5b00499.
(3) Harrow JL, Steward CA, Frankish A, Gilbert JG, Gonzalez JM, Loveland JE, Mudge J, Sheppard D, Thomas M, Trevanion S, et al. The Vertebrate Genome Annotation browser 10 years on. Nucleic Acids Res. 2014;42:D771–D779. doi: 10.1093/nar/gkt1241.
(4) Gaudet P, Michel P-A, Zahn-Zabal M, Cusin I, Duek PD, Evalet O, Gateau A, Gleizes A, Pereira M, Teixeira D, et al. The neXtProt knowledgebase on human proteins: current status. Nucleic Acids Res. 2015;43:D764–D770. doi: 10.1093/nar/gku1178.
(5) Pruitt KD, Brown GR, Hiatt SM, Thibaud-Nissen F, Astashyn A, Ermolaeva O, Farrell CM, Hart J, Landrum MJ, McGarvey KM, et al. RefSeq: an update on mammalian reference sequences. Nucleic Acids Res. 2014;42:D756–D763. doi: 10.1093/nar/gkt1114.
(6) The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43:D204–D212. doi: 10.1093/nar/gku989.
(7) Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S, et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 2012;22:1760–1774. doi: 10.1101/gr.135350.111.
(8) Gray KA, Yates B, Seal RL, Wright MW, Bruford EA. Genenames.org: the HGNC resources in 2015. Nucleic Acids Res. 2015;43:D1079–D1085. doi: 10.1093/nar/gku1071.
(9) Mellacheruvu D, Wright Z, Couzens AL, Lambert J-P, St-Denis NA, Li T, Miteva YV, Hauri S, Sardiu ME, Low TY, et al. The CRAPome: a contaminant repository for affinity purification-mass spectrometry data. Nat. Methods. 2013;10:730–736. doi: 10.1038/nmeth.2557.
(10) Wilhelm M, Schlegl J, Hahne H, Gholami AM, Lieberenz M, Savitski MM, Ziegler E, Butzmann L, Gessulat S, Marx H, et al. Mass-spectrometry-based draft of the human proteome. Nature. 2014;509:582–587. doi: 10.1038/nature13319.
(11) Kim M-S, Pinto SM, Getnet D, Nirujogi RS, Manda SS, Chaerkady R, Madugundu AK, Kelkar DS, Isserlin R, Jain S, et al. A draft map of the human proteome. Nature. 2014;509:575–581. doi: 10.1038/nature13302.
(12) Ezkurdia I, Vazquez J, Valencia A, Tress M. Analyzing the first drafts of the human proteome. J. Proteome Res. 2014;13:3854–3855. doi: 10.1021/pr500572z.
(13) Deutsch EW, Sun Z, Campbell D, Kusebauch U, Chu CS, Mendoza L, Shteynberg D, Omenn GS, Moritz RL. The State of the Human Proteome in 2014/2015 as viewed through PeptideAtlas: enhancing accuracy and coverage through the Atlas-Prophet. J. Proteome Res. 2015;14:3461. doi: 10.1021/acs.jproteome.5b00500.
(14) Aspden JL, Eyre-Walker YC, Phillips RJ, Amin U, Mumtaz MAS, Brocard M, Couso J-P. Extensive translation of small Open Reading Frames revealed by Poly-Ribo-Seq. eLife. 2014;3:e03528. doi: 10.7554/eLife.03528.
(15) Sisu C, Pei B, Leng J, Frankish A, Zhang Y, Balasubramanian S, Harte R, Wang D, Rutenberg-Schoenberg M, Clark W, et al. Comparative analysis of pseudogenes across three phyla. Proc. Natl. Acad. Sci. U. S. A. 2014;111:13361–13366. doi: 10.1073/pnas.1407293111.
(16) Brosch M, Saunders GI, Frankish A, Collins MO, Yu L, Wright J, Verstraten R, Adams DJ, Harrow J, Choudhary JS, et al. Shotgun proteomics aids discovery of novel protein-coding genes, alternative splicing, and “resurrected” pseudogenes in the mouse genome. Genome Res. 2011;21:756–767. doi: 10.1101/gr.114272.110.
(17) Haerty W, Ponting CP. Unexpected selection to retain high GC content and splicing enhancers within exons of multiexonic lncRNA loci. RNA. 2015;21:320–332. doi: 10.1261/rna.047324.114.
(18) Guttman M, Russell P, Ingolia NT, Weissman JS, Lander ES. Ribosome profiling provides evidence that large noncoding RNAs do not encode proteins. Cell. 2013;154:240–251. doi: 10.1016/j.cell.2013.06.009.
(19) Clark MB, Mercer TR, Bussotti G, Leonardi T, Haynes KR, Crawford J, Brunck ME, Cao K-AL, Thomas GP, Chen WY, et al. Quantitative gene profiling of long noncoding RNAs with targeted RNA sequencing. Nat. Methods. 2015;12:339–342. doi: 10.1038/nmeth.3321.
(20) Volders P-J, Verheggen K, Menschaert G, Vandepoele K, Martens L, Vandesompele J, Mestdagh P. An update on LNCipedia: a database for annotated human lncRNA sequences. Nucleic Acids Res. 2015;43:D174–D180. doi: 10.1093/nar/gku1060. Database issue.
(21) Lefranc M-P, Giudicelli V, Duroux P, Jabado-Michaloud J, Folch G, Aouinti S, Carillon E, Duvergey H, Houles A, Paysan-Lafosse T, et al. IMGT®, the international ImMunoGeneTics information system® 25 years on. Nucleic Acids Res. 2015;43:D413–D422. doi: 10.1093/nar/gku1056.
(22) Schmitt K, Heyne K, Roemer K, Meese E, Mayer J. HERV-K(HML-2) rec and np9 transcripts not restricted to disease but present in many normal human tissues. Mobile DNA. 2015;6:4. doi: 10.1186/s13100-015-0035-7.
(23) Olender T, Nativ N, Lancet D. HORDE: comprehensive resource for olfactory receptor genomics. Methods Mol. Biol. 2013;1003:23–38. doi: 10.1007/978-1-62703-377-0_2.
(24) Farrell CM, O’Leary NA, Harte RA, Loveland JE, Wilming LG, Wallin C, Diekhans M, Barrell D, Searle SMJ, Aken B, et al. Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res. 2014;42:D865–D872. doi: 10.1093/nar/gkt1059.
