Hidden in plain sight: what remains to be discovered in the eukaryotic proteome?

Valerie Wood; Antonia Lock; Midori A Harris; Kim Rutherford; Jürg Bähler; Stephen G Oliver

doi:10.1098/rsob.180241

. 2019 Feb 20;9(2):180241. doi: 10.1098/rsob.180241

Hidden in plain sight: what remains to be discovered in the eukaryotic proteome?

Valerie Wood ^1,^2,^✉, Antonia Lock ³, Midori A Harris ^1,², Kim Rutherford ^1,², Jürg Bähler ³, Stephen G Oliver ^1,²

PMCID: PMC6395881 PMID: 30938578

Abstract

The first decade of genome sequencing stimulated an explosion in the characterization of unknown proteins. More recently, the pace of functional discovery has slowed, leaving around 20% of the proteins even in well-studied model organisms without informative descriptions of their biological roles. Remarkably, many uncharacterized proteins are conserved from yeasts to human, suggesting that they contribute to fundamental biological processes (BP). To fully understand biological systems in health and disease, we need to account for every part of the system. Unstudied proteins thus represent a collective blind spot that limits the progress of both basic and applied biosciences. We use a simple yet powerful metric based on Gene Ontology BP terms to define characterized and uncharacterized proteins for human, budding yeast and fission yeast. We then identify a set of conserved but unstudied proteins in S. pombe, and classify them based on a combination of orthogonal attributes determined by large-scale experimental and comparative methods. Finally, we explore possible reasons why these proteins remain neglected, and propose courses of action to raise their profile and thereby reap the benefits of completing the catalogue of proteins’ biological roles.

Keywords: fission yeast, budding yeast, human, unknown proteins, gene ontology, biocuration

1. Slow progress in characterizing unknowns

When the first eukaryotic chromosome (chromosome III of Saccharomyces cerevisiae) was sequenced in 1992, the most surprising discovery was that previously undetected protein-coding genes outnumbered mapped genes by a factor of five [1,2]. Researchers had generally assumed that few proteins remained to be discovered, especially in an organism as intensively studied as yeast. The completion of the S. cerevisiae genome sequence in 1996 confirmed that more than half the genes lacked any indication of their biochemical activity or broader biological role [2]. Over the ensuing two decades, complete genome sequences have become available for over 6500 eukaryotic species [3]. At first, characterization of newly discovered genes progressed rapidly in model species, as researchers supplemented classical biochemistry and forward genetics with reverse genetics, homology modelling and large-scale systematic techniques to study novel genes. Complete genome sequences also allowed the deployment of large-scale systematic techniques [4]. More recently, however, progress has slowed, even in well-studied species such as the budding yeast S. cerevisiae and the fission yeast Schizosaccharomyces pombe.

Figure 1 shows protein characterization over the past 28 years for these two yeasts. Notably, the proportions characterized in fission yeast (84%) and budding yeast (82%) have only slightly increased in the past decade (from 80%, as noted by Peña-Castillo & Hughes [11], and 77%, respectively). Across all studied eukaryotic species, the proportion of characterized proteins has reached a plateau around 80%, and exhibits a long-tailed distribution, with the biological roles of the remaining 20% still elusive.

Here, we use a simple yet powerful metric based on Gene Ontology (GO) biological process (BP) terms to define characterized and uncharacterized proteins for human and the two model yeasts. We then combine our GO-based classification with information about taxonomic conservation using fission yeast to identify a set of broadly conserved, but unstudied, proteins. We classify the fission yeast conserved but unstudied protein set based on a combination of orthogonal attributes (e.g. taxonomic conservation, mutant viability, protein sequence features, localization). Finally, we explore possible reasons why these proteins remain neglected, propose courses of action to raise their profile among bench researchers and bioinformaticians, and posit the benefits of completing catalogue of proteins’ biological roles.

2. Defining unknown metrics: what counts as ‘known’?

To estimate more precisely the proportion of a proteome that is characterized, and to provide inventories of uncharacterized gene products, the ‘known’ category must be rigorously defined. However, the gradual accumulation of data of many different types, from diverse experimental and computational methods and multiple sources, makes it challenging to draw a clear line between ‘known’ and ‘unknown’. For example, in 2004 Hughes et al. [12] observed that the then-current Yeast Proteome Database (YPD) listed 80% of S. cerevisiae genes as ‘known’, but also noted that by more stringent criteria based on GO annotation then in the Saccharomyces Genome Database (SGD), 30–40% remained unknown and others only poorly understood. Knowledge acquisition is necessarily a continuum—different experiments are performed at different scales (e.g. high- versus low-throughput) and yield results at different levels of biological detail (e.g. detecting DNA repair versus distinguishing mismatch repair from base excision repair) and confidence (stemming from variation in the quality of assays and the number of replicates performed). For these reasons, the characterization status of gene products does not fall on a simple linear scale. Biologists often make qualitative judgement calls to designate individual gene products as ‘novel’, ‘barely characterized’ or ‘relatively well characterized’. While this serves the purposes of individual researchers working on a gene-by-gene basis, a more quantitative and objective approach is required to summarize the status of functional characterization for an entire proteome, and to facilitate cross-species comparisons.

2.1. Metrics to describe functional characterization levels

To develop workable metrics for the status of the functional annotation of a given proteome, we have exploited GO annotation [13], and illustrate this scheme using the proteomes of S. cerevisiae, S. pombe and human as examples. Since the functional attributes of gene products of diverse species are routinely described using GO, these metrics are widely applicable.

The GO molecular function (MF) ontology describes molecular-level activities of gene products (such as catalytic, transporter and receptor activities). GO BP refers to ordered assemblies of MFs representing physiological roles of gene products (e.g. involvement in cytokinesis or DNA replication). GO cellular component (CC) provides the locations of gene products (organelles, complexes, etc.). Determining each annotation type relies on different experimental techniques, yielding complementary results and insights. We might know a gene product’s MF or its localization (CC), but not the physiological context in which that product acts (BP).

2.2. Annotation coverage for proteins by Gene Ontology aspect

One simple way to quantify the degree to which the function of a given gene product has been characterized is to report annotation to one, two or all three GO aspects (MF, BP, CC). GO aspect coverage provides a simple metric that is accessible for any species by counting gene products with (or without) annotation to each aspect. Figure 2 shows coverage by ontology aspect for human, fission yeast and budding yeast proteins. From this viewpoint, we can quickly assess the number of gene products annotated to all three aspects of GO (usually well characterized), or none at all (uncharacterized), and assess what types are absent for a species.

Figure 2. — GO aspect coverage of budding yeast, fission yeast and human proteins. Venn diagrams indicate the number of protein coding gene products annotated to each Gene Ontology aspect (biological process, molecular function, cellular component). Data sources: *S. pombe*, PomBase 25 September 2018; *S. cerevisiae*, YeastMine [14] 25 September 2018; human, HumanMine [15] and GO repository [16], both 26 September 2018.

2.3. Known physiological function: Gene Ontology slim process coverage

Although informative, the activities captured by MF and the localizations described by CC make only a limited contribution to knowledge about a gene product’s characterization if taken in isolation. Although some MF terms (such as those describing transcription factor activities, substrate-specific transmembrane transporter activities, and some specific catalytic activities) implicitly refer to physiological roles, biologically informative gene characterization usually requires additional data to place the activity into a broader physiological context. By contrast, BP annotation provides this context, and thereby reveals more about the role of a gene product in an organism’s biology, and provides a useful benchmark for preliminary characterization. As an example, knowing that a protein is a kinase (MF) is not very informative until we find that a deletion mutant of the gene encoding that protein is defective in meiotic nuclear division (BP) and the protein itself localizes to chromatin (CC).

To use this information as a measure of the progress in a protein’s functional characterization, we use the annotation overviews provided by tailored GO term subsets known as ‘GO slims’ [17]. We created a BP slim set covering as many annotated gene products as possible, while remaining informative about the physiological context in which they operate. As our starting point, we took the fission yeast GO BP slim developed at PomBase over 8 years [18], which provides excellent coverage of informative cellular processes for fission yeast (99.4% of annotated proteins with a known process), and minimizes overlap between terms. The PomBase slim aims to demonstrate the distribution of processes within distinct ‘modules’ of biology (cytokinesis, tRNA metabolism, DNA replication etc.) [19,20], and therefore excludes overly general BP terms, such as ‘metabolism’ or ‘cellular component organization’, that would increase coverage at the expense of specific context. Terms that recapitulate activities in the MF ontology (e.g. ‘protein phosphorylation’) or describe phenotypic observations but do not correspond to a specific physiological role for a gene product (e.g. ‘response to chemical’) are also excluded. Following the same principles, we extended the 53-term PomBase slim into a generalized process slim of 117 terms to use in cross-species analysis, as summarized in figure 3a.

Figure 3. — GO slim analysis of budding yeast, fission yeast and human proteins. (a) Generic GO biological process slim set creation flowchart. The fission yeast GO biological process slim [18] was applied to human and *S. cerevisiae* protein sets, and then iteratively extended to improve coverage by adding terms. All evidence codes were included except ‘reviewed computational analysis’ (RCA), which yields a higher rate of false positives than the others. Some processes were swapped (e.g. ‘cytoplasmic translation’ in the fission yeast slim for more general ‘translation’) to accommodate the less specific annotation available in other species. The fission yeast slim also omits overly broad terms (e.g. ‘metabolism’) and terms representing activities (molecular functions) in the biological process ontology (e.g. ‘phosphorylation’) because they do not add information about physiological roles; these terms were also excluded from our generic slim set even if inclusion would have increased coverage. (Terms specifically considered but omitted from the generic slim are listed in electronic supplementary material, table S11). At convergence (the point where no additional informative terms could be identified for gene products with biological process annotations), proteins annotated to slim terms were classified as ‘known’ (4393 *S. pombe*; 4936 *S. cerevisiae*; 16354 human). The remaining proteins with uninformative processes were classified as unknown, along with those already identified as unknown by annotation to the root node with evidence code ND (no data). Manual assessment of the remaining human proteins with no GO biological process annotation added 266 proteins, bringing the ‘known’ total to 16620. Final ‘unknown’ protein totals are 676 in *S. pombe*, 978 in *S. cerevisiae* and 3117 in human. The set of GO slim terms is available in electronic supplementary material, table S1. (b) Proportions of proteins with known GO slim biological role. For all three species, ‘known’ proteins have annotation to at least one term from the GO slim set (see A), and ‘unknown’ proteins do not. Because the human proteome includes some proteins that lack annotation in the GO database, the proportions of unannotated proteins that we found to be known (i.e. annotatable) and unknown are indicated separately. All protein datasets exclude dubious proteins and transposons. Analysis was performed using GOTermFinder [21], with GO data from 25 September 2018 and the GO slim created as described in A. Input protein lists are available in electronic supplementary materials, tables S2 (*S. pombe*), S3 (*S. cerevisiae*) and S4 (human). GOTermFinder output is available in electronic supplementary material, tables S5 (*S. pombe*), S6 (*S. cerevisiae*) and S7 (human).

For any annotation-based metric, it is important to distinguish unknown (or unstudied) from unannotated gene products. Here, unknown gene products are defined as those that have been evaluated by curators and have no annotation to any BP slim terms (these gene products are annotated to the root term ‘BP’ with the evidence code ‘no data (ND)’ [22,23]). Unannotated are those not explicitly indicated as unknown but which, nevertheless, have no annotation from experiment or inference. Because all fission yeast and budding yeast genes have been systematically assessed using all available data, any gene products lacking specific GO annotations can confidently be deemed to have unknown biological roles. For the human proteome, manual inspection of the unannotated proteins revealed that many can actually be annotated to a BP based on experimental data in the literature or by homology-based inference, and thus classified as characterized. To make this knowledge available, we manually curated 931 GO annotations for 502 human proteins from 310 publications, including BP assignments for 238 previously unannotated proteins. These annotations will be submitted to the Gene Ontology Consortium for inclusion in the human GO annotation dataset. Figure 3b shows the proportions of the S. pombe, S. cerevisiae and human proteomes that are known (i.e. annotated to informative BP terms), unannotated, annotable, or unknown. See electronic supplementary material data for GO slim term IDs (electronic supplementary material, table S1), input protein lists (electronic supplementary material, tables S2–S4), GO slim outputs (electronic supplementary material, tables S5–S7), unknown gene lists (electronic supplementary material, tables S8–S10), GO terms considered for the slim but not included (electronic supplementary material, table S11) and human GO annotations generated by this study (electronic supplementary material, table S12).

3. Why do these proteins remain unstudied?

Our GO slim-based characterization metric confirms the impression from simpler metrics that, for the two model yeasts and human, about 20% of proteins lack physiologically informative descriptions. Why do so many proteins, many of them conserved, remain unstudied? Below, we consider biological and sociological/cultural factors that contribute to the apparent lack of interest in these unknown proteins.

3.1. Biological bias

One factor influencing gene characterization is simply how easily one gene’s contribution to an organism can be detected. Deletion mutants of essential genes have a clear phenotype that indicates an important function—for yeasts, the failure to grow on rich media. As a consequence, these genes, and the core processes in which they participate, are well characterized. For example, only 24 of the genes in the fission yeast unknown set are essential in rich media. Changes in cell morphology are also readily identifiable phenotypes. Visual screening and analysis of the fission yeast genome-wide deletion collection for morphology phenotypes under standard laboratory conditions found obvious abnormalities for only 10% of 643 genes of unknown function [24]. The most commonly used experimental conditions, designed as they are to maximize cell growth, can hide environment-dependent roles. Many more of the 676 currently uncharacterized (per figure 3b) fission yeast genes are associated with growth or viability phenotypes upon specific chemical challenges (26.1%) than under standard laboratory conditions (3.6%) (PomBase [19,20], queries run 11 November 2018). In budding yeast, only 34% of all deletion mutants display a growth phenotype under standard laboratory conditions, whereas 97% of all genes are essential for optimal growth in at least one condition when assayed under multiple chemical or environmental perturbations [25].

Protein characterization has traditionally emphasized core BP over those that reflect interactions with the environment. However, analysis of the proteins that have recently (2016–2018) been removed from the fission yeast ‘conserved unknown’ set because their functions have been determined reveals that they most often participate in environment-responsive processes such as signalling, detoxification, proteostasis, lipostasis and mitochondrial organization (table 1). Many of these functions are associated with the age-related accumulation of damaged or misfolded proteins, which become debilitating over time. In humans, such functions are implicated in neurodegenerative diseases, such as Alzheimer’s and motor neuron diseases [26], which underscores the importance of making strenuous efforts to elucidate the functions of the remaining ‘conserved unknowns’. It should also be pointed out that the ecology of yeasts is poorly understood, and we postulate that many unknown proteins function in aspects of life that are not normally probed in the laboratory (e.g. interactions with pathogens, or survival within insect vectors). We anticipate that a greater variety of experimental conditions will supply more information about the unknown gene products catalogued here to reveal the roles of many gene products in processes fundamental for human health.

Table 1.

GO slim classification of conserved S. pombe proteins characterized between 2016 and 2018.

process	proteins
membrane biology
lipid metabolism	15
transmembrane transport	9
vesicle-mediated transport	9
organelle localization by membrane tethering	4
other membrane organization	3
other ER processes	2
	42
communication
signaling	9
transcription	6
chromatin organization	1
	16
catabolism and detoxification
detoxification	25
protein catabolism	5
apoptotic process	4
DNA repair	3
nucleobase-containing compound catabolism	5
autophagy	1
mannose catabolism	1
	44
mitochondrial processes and energy
mitochondrial gene expression	4
mitochondrial organization	3
energy generation	4
	11
other processes
tRNA metabolism (cytosolic)	3
ribosome biogenesis (cytosolic)	5
mRNA metabolism	2
cytoplasmic translation	2
cytoskeleton organization	3
protein folding (cytosolic)	1
protein complex assembly	1
nucleocytoplasmic transport	1
chromosome segregation	1
amino acid metabolism	2
cofactor metabolism	1
other	5
	27
total	140

Open in a new tab

3.2. Research bias

In fission yeast, the majority of new knowledge over the past decade provides increasing detail for previously described proteins. This bias towards studying already-known proteins is not peculiar to yeast research. In their essay ‘Too many roads not taken’ [27], Edwards et al. observe that 70% of human protein research still focuses on the 10% of proteins known before the human genome was sequenced. Although few studies have explored the causes of the observed emphasis on known proteins, we can identify a number of plausible contributing factors, which are largely borne out by a recent large-scale analysis of publications on human genes [28]. First, the complexity of biology demands that investigators narrow their study targets to a manageable range. Researchers with established interests in specific topics thus naturally focus their work where they have deep knowledge, and extend their studies to novel genes only if a strong lead emerges, for instance, from work in another species or from a data-mining approach that implicates them in BP already under investigation. Indeed, Stoeger et al. [28] find that research in model organisms strongly influences the initial study of individual human genes. Both papers also highlight pragmatic considerations, notably the availability of research tools, and socio-political factors including career timelines, funding priorities, and peer review, all of which exacerbate the tendency to avoid the wholly unknown. Risk-averse funders and reviewers tend not to favour long-range strategies aimed at genes without an existing functional context for fear of diverting resources towards targets whose significance is not guaranteed. Without a shift in perspective, proteins without any existing functional annotation will continue to be neglected, to the detriment of basic and applied biomedical research. Stoeger et al. [28] note that current research is not only slow to cover novel genes but also ‘can significantly deviate from the actual biological importance of individual genes’.

4. Classifying the conserved unknown proteins, or: what lies undiscovered?

The stubborn core of remaining proteins of unknown function are often dismissed as species-specific, but we have often found otherwise, and we can no longer afford to sweep these proteins under the carpet. Therefore, to provide further insight into why some gene products elude physiological characterization, we present a case study using the set of 410 fission yeast proteins of unknown physiological role that have orthologues outside the Schizosaccharomyces clade (the ‘conserved unknowns’). We classify these 410 proteins according to a range of orthogonal biological attributes (including taxonomic conservation, identification of a catalytic fold or domain, cellular localization, viability). Figure 4 presents the subset of 200 proteins in this group which are conserved outside fungi (in vertebrates, archaea or bacteria; details for the full set of 410 proteins are provided in electronic supplementary material, figure S1 and table S13). The number of conserved unknown proteins that play an essential role under permissive growth conditions is disproportionally low (4.4%; 18/410, versus 1278/4376 (29.2%) known, inferred or published), and almost half of these 18 essential proteins are localized to the mitochondria or the endoplasmic reticulum. Only 8 essential genes are conserved in vertebrates (all are organellar), and only a single protein out of 53 conserved between yeast and prokaryotes (but not vertebrates) is essential. A substantial proportion of the 200 proteins conserved outside the fungi (76) are absent from Saccharomyces due to the well-documented lineage-specific gene losses in its evolutionary history [34]; characterized gene products with this taxonomic distribution are most highly enriched for chromatin organization and mRNA metabolism. Unknown mitochondrial and endomembrane system proteins are enriched for proteins with transmembrane domains (59/114). Unknown nuclear proteins are predicted to include more transferases than other enzymatic activities; the set of nuclear proteins shared only with eukaryotes includes 19 domains of unknown function (DUFs) with no currently identifiable catalytic domain, and 12 protein–protein interaction domains (e.g. WD, ankyrin or TPR) that frequently function as scaffolds for protein complexes [30]. This multi-factorial classification can support the prediction of likely physiological roles that can be experimentally tested.

Figure 4. — Taxonomic conservation and features of unknown proteins. Classification of 210 conserved unknown fission yeast proteins along various axes. PomBase curators manually assign protein-coding genes to one of a set of broad taxonomic classifiers [20,29]. PomBase also maintains manually curated lists of orthologues between *S. pombe* and *S. cerevisiae*, and between *S. pombe* and human, three eukaryotic species separated by approximately 500–1000 million years of evolution. In combination, these inventories can be used to identify conservation across taxonomic space at different levels of specificity. Of the fission yeast ‘unknown’ protein-coding genes, 410 are conserved outside the *Schizosaccharomyces* clade. Of these, 210 are present either in fungi and vertebrates, or in fungi and prokaryotes (data from PomBase manual assignments, queried on 31 July 2018). Proteins were classed as catalytic (i.e. having an identifiable catalytic fold) or non-catalytic (no currently identifiable catalytic fold) based on protein domain, fold, clan or superfamily membership, using InterPro [30] and GO [13] assignments. Cellular locations using GO annotation are available for most of the unknown proteome based on a genome-wide localization study and inference from other models [31]. Viability data come from large-scale screens reported by Kim *et al.* [32] and Chen *et al.* [33]. The fission yeast ‘conserved unknown’ protein set [18] is reviewed continually for new functional data.

5. Strategies to link unknown genes to broad cellular roles

Despite almost a century of gene product-specific biochemical and genetic interrogation, and two decades of post-genomic research, a large number of proteins conserved from yeast to human still have no known biological role. Broadly conserved unknown eukaryotic proteins can be assumed to have important cellular roles conserved over 500 million years of evolution. It is, therefore, remarkable that this gene set has hardly reduced over the past decade. In this period, familiar genes have been studied in ever-greater depth, presumably at the expense of the characterization of genes of hitherto unknown function (e.g. over 33 000 papers with ‘p53’ in the title have been published since 2007). Assuming a diminishing return for studies on highly characterized proteins, investigations on unstudied proteins will have a relatively higher impact that is likely to outweigh the considerable initial efforts required to place them within the context of current knowledge.

To jump-start renewed progress in unknown gene characterization, two major stumbling blocks must be overcome. One is to identify the cohort of genes of unknown function, and the other is to develop mechanisms to bring the proteins that they encode to the attention of interested researchers. Here, we provide a framework that uses a generally accessible set of criteria based on manually curated data to identify and classify unstudied proteins, which could easily be extended with additional criteria for further annotation specificity in future iterations. The construction of inventories of unknown proteins will ultimately depend on accurate and complete functional annotation of all genes of the major model species.

Commentators on genome-scale research have long recognized that, in order to fully describe an organism’s protein complement, it will be necessary to deploy parallel experimental and computational methods, at both large and small scales [4,12]. Understanding how investigatory biases and the characteristics of particular gene sets have converged to prevent characterization will help us to identify the most promising routes to uncovering unknown functions. These, and other factors that contribute to the neglect of the characterization of conserved gene products that are likely to have novel biological roles, deserve further in-depth consideration. It is likely that, to fill the persistent knowledge gaps represented by the roughly 20% of proteins that remain uncharacterized, a creative combination of existing and emerging experimental and in silico methods will be required, as well as an increased awareness among the scientific community of the value of a full proteome description. Because basic knowledge at the cellular level provides the building blocks of translational research, drug discovery, personalized medicine, metabolomics and systems biology, comprehensive proteome characterization underpins the success of numerous and diverse endeavours across all of the biological and medical sciences.

Supplementary Material

Figure S1

rsob180241supp1.pdf^{(1.6MB, pdf)}

Supplementary Material

Table S1

rsob180241supp2.tsv^{(1.8KB, tsv)}

Supplementary Material

Table S2

rsob180241supp3.tsv^{(60.3KB, tsv)}

Supplementary Material

Table S3

rsob180241supp4.tsv^{(46.8KB, tsv)}

Supplementary Material

Table S4

rsob180241supp5.tsv^{(136.4KB, tsv)}

Supplementary Material

Table S5

rsob180241supp6.tsv^{(127.9KB, tsv)}

Supplementary Material

Table S6

rsob180241supp7.tsv^{(139.3KB, tsv)}

Supplementary Material

Table S7

rsob180241supp8.tsv^{(465.2KB, tsv)}

Supplementary Material

Table S8

rsob180241supp9.tsv^{(8KB, tsv)}

Supplementary Material

Table S9

rsob180241supp10.tsv^{(8.1KB, tsv)}

Supplementary Material

Table S10

rsob180241supp11.tsv^{(23.7KB, tsv)}

Supplementary Material

Table S11

rsob180241supp12.tsv^{(453B, tsv)}

Supplementary Material

Table S12

rsob180241supp13.tsv^{(93.5KB, tsv)}

Supplementary Material

Table S13

rsob180241supp14.tsv^{(28.8KB, tsv)}

Data accessibility

This article has no additional data.

Authors' contributions

V.W. conceived the project and wrote the initial draft. V.W., A.L. and M.A.H. prepared figures. All authors contributed to the discussion of ideas and manuscript revisions, and read and approved the final manuscript.

Competing interests

The authors declare no competing interests.

Funding

This work was supported by the Wellcome Trust (grant no. 104967/Z/14/Z to S.G.O.).

References

1.Oliver SG. et al. 1992. The complete DNA sequence of yeast chromosome III. Nature 357, 38–46. ( 10.1038/357038a0) [DOI] [PubMed] [Google Scholar]
2.Goffeau A. et al. 1996. Life with 6000 genes. Science 274, 546, 563–567 ( 10.1126/science.274.5287.546) [DOI] [PubMed] [Google Scholar]
3.NIH. 2018 Genome information by organism. See https://www.ncbi.nlm.nih.gov/genome/browse/ (accessed: 1 October 2018).
4.Oliver SG. 1996. From DNA sequence to biological function. Nature 379, 597–600. ( 10.1038/379597a0) [DOI] [PubMed] [Google Scholar]
5.Hodges PE, McKee AH, Davis BP, Payne WE, Garrels JI. 1999. The Yeast Proteome Database (YPD): a model for the organization and presentation of genome-wide functional data. Nucleic Acids Res. 27, 69–73. ( 10.1093/nar/27.1.69) [DOI] [PMC free article] [PubMed] [Google Scholar]
6.Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ. 2002. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet. 31, 255–265. ( 10.1038/ng906) [DOI] [PubMed] [Google Scholar]
7.MIPS. 2018 CYGD: the Comprehensive Yeast Genome Database. See http://mips.helmholtz-muenchen.de/proj/yeast/tables/inventy.html. [DOI] [PMC free article] [PubMed]
8.Christie KR, Hong EL, Cherry JM. 2009. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns. Trends Microbiol. 17, 286–294. ( 10.1016/j.tim.2009.04.005) [DOI] [PMC free article] [PubMed] [Google Scholar]
9.Gibney PA, Hickman MJ, Bradley PH, Matese JC, Botstein D. 2013. Phylogenetic portrait of the Saccharomyces cerevisiae functional genome. G3 (Bethesda) 3, 1335–1340. ( 10.1534/g3.113.006585) [DOI] [PMC free article] [PubMed] [Google Scholar]
10.PomBase. 2018 Protein-coding gene characterisation status. See https://www.pombase.org/status/genecharacterisation-statistics-history.
11.Peña-Castillo L, Hughes TR. 2007. Why are there still over 1000 uncharacterized yeast genes? Genetics 176, 7–14. ( 10.1534/genetics.107.074468) [DOI] [PMC free article] [PubMed] [Google Scholar]
12.Hughes TR, Robinson MD, Mitsakakis N, Johnston M. 2004. The promise of functional genomics: completing the encyclopedia of a cell. Curr. Opin. Microbiol. 7, 546–554. ( 10.1016/j.mib.2004.08.015) [DOI] [PubMed] [Google Scholar]
13.2017. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45, D331–D338. ( 10.1093/nar/gkw1108) [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Balakrishnan R, Park J, Karra K, Hitz BC, Binkley G, Hong EL, Sullivan J, Micklem G, Cherry JM. 2012. YeastMine—an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit. Database 2012, bar062 ( 10.1093/database/bar062) [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Smith RN, 2012. InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics 28, 3163–3165. ( 10.1093/bioinformatics/bts577) [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Gene Ontology. 2018 Download annotations. See http://current.geneontology.org/products/pages/downloads.html (accessed 26 September 2018).
17.Rhee SY, Wood V, Dolinski K, Draghici S. 2008. Use and misuse of the gene ontology annotations. Nat. Rev. Genet. 9, 509–515. ( 10.1038/nrg2363) [DOI] [PubMed] [Google Scholar]
18.PomBase. 2018 Fission yeast GO slim. See https://www.pombase.org/browsecuration/fission-yeast-go-slim-terms.
19.Lock A, Rutherford K, Harris MA, Wood V. 2018. PomBase: the scientific resource for fission yeast, pp. 49–68. New York, NY: Humana Press, [DOI] [PMC free article] [PubMed] [Google Scholar]
20.Lock A. 2019. PomBase 2018: user-driven reimplementation of the fission yeast database provides rapid and intuitive access to diverse, interconnected information. Nucleic Acids Res. 47, D821–D827. ( 10.1093/nar/gky961 [DOI] [PMC free article] [PubMed] [Google Scholar]
21.Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. 2004. GO::TermFinder—open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics 20, 3710–3715. ( 10.1093/bioinformatics/bth456) [DOI] [PMC free article] [PubMed] [Google Scholar]
22.Gaudet P, Škunca N, Hu J, Dessimoz C. 2017. Primer on the gene ontology. Methods in Molecular Biology, vol. 1446 New York, NY: Humana Press. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Gene Ontology. 2018 No biological data available (ND) evidence code. See http://wiki.geneontology.org/index.php/No_biological_Data_available_(ND)_evidence_code.
24.Hayles J, Wood V, Jeffery L, Hoe K-L, Kim D-U, Park H-O, Salas-Pino S, Heichinger C, Nurse P. 2013. A genome-wide resource of cell cycle and cell shape genes of fission yeast. Open. Biol. 3, 130053 ( 10.1098/rsob.130053) [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Hillenmeyer ME. et al. 2008. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science 320, 362–365. ( 10.1126/science.1150021) [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Castrillo JI, Oliver SG (eds). 2015. Systems biology of Alzheimer’s disease. Methods in Molecular Biology New York, NY: Humana Press. [PubMed] [Google Scholar]
27.Edwards AM, Isserlin R, Bader GD, Frye SV, Willson TM, Yu FH. 2011. Too many roads not taken. Nature 470, 163–165. ( 10.1038/470163a) [DOI] [PubMed] [Google Scholar]
28.Stoeger T, Gerlach M, Morimoto RI, Nunes Amaral LA. 2018. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 16, 1–25. ( 10.1371/journal.pbio.2006643) [DOI] [PMC free article] [PubMed] [Google Scholar]
29.PomBase. 2018 Taxonomic conservation. See https://www.pombase.org/documentation/taxonomic-conservation.
30.Finn RD. 2017. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199. ( 10.1093/nar/gkw1107) [DOI] [PMC free article] [PubMed] [Google Scholar]
31.Matsuyama A. et al. 2006. ORFeome cloning and global analysis of protein localization in the fission yeast Schizosaccharomyces pombe. Nat. Biotechnol. 24, 841–847. ( 10.1038/nbt1222) [DOI] [PubMed] [Google Scholar]
32.Kim D-U. et al. 2010. Analysis of a genome-wide set of gene deletions in the fission yeast Schizosaccharomyces pombe. Nat. Biotechnol. 28, 617–623. ( 10.1038/nbt.1628) [DOI] [PMC free article] [PubMed] [Google Scholar]
33.Chen J-S. et al. 2014. Identification of new players in cell division, DNA damage response, and morphogenesis through construction of Schizosaccharomyces pombe deletion strains. G3 (Bethesda) 5, 361–370. ( 10.1534/g3.114.015701) [DOI] [PMC free article] [PubMed] [Google Scholar]
34.Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387, 708–713. ( 10.1038/42711) [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Figure S1

rsob180241supp1.pdf^{(1.6MB, pdf)}

Table S1

rsob180241supp2.tsv^{(1.8KB, tsv)}

Table S2

rsob180241supp3.tsv^{(60.3KB, tsv)}

Table S3

rsob180241supp4.tsv^{(46.8KB, tsv)}

Table S4

rsob180241supp5.tsv^{(136.4KB, tsv)}

Table S5

rsob180241supp6.tsv^{(127.9KB, tsv)}

Table S6

rsob180241supp7.tsv^{(139.3KB, tsv)}

Table S7

rsob180241supp8.tsv^{(465.2KB, tsv)}

Table S8

rsob180241supp9.tsv^{(8KB, tsv)}

Table S9

rsob180241supp10.tsv^{(8.1KB, tsv)}

Table S10

rsob180241supp11.tsv^{(23.7KB, tsv)}

Table S11

rsob180241supp12.tsv^{(453B, tsv)}

Table S12

rsob180241supp13.tsv^{(93.5KB, tsv)}

Table S13

rsob180241supp14.tsv^{(28.8KB, tsv)}

Data Availability Statement

This article has no additional data.

[RSOB180241C1] 1.Oliver SG. et al. 1992. The complete DNA sequence of yeast chromosome III. Nature 357, 38–46. ( 10.1038/357038a0) [DOI] [PubMed] [Google Scholar]

[RSOB180241C2] 2.Goffeau A. et al. 1996. Life with 6000 genes. Science 274, 546, 563–567 ( 10.1126/science.274.5287.546) [DOI] [PubMed] [Google Scholar]

[RSOB180241C3] 3.NIH. 2018 Genome information by organism. See https://www.ncbi.nlm.nih.gov/genome/browse/ (accessed: 1 October 2018).

[RSOB180241C4] 4.Oliver SG. 1996. From DNA sequence to biological function. Nature 379, 597–600. ( 10.1038/379597a0) [DOI] [PubMed] [Google Scholar]

[RSOB180241C5] 5.Hodges PE, McKee AH, Davis BP, Payne WE, Garrels JI. 1999. The Yeast Proteome Database (YPD): a model for the organization and presentation of genome-wide functional data. Nucleic Acids Res. 27, 69–73. ( 10.1093/nar/27.1.69) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C6] 6.Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ. 2002. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet. 31, 255–265. ( 10.1038/ng906) [DOI] [PubMed] [Google Scholar]

[RSOB180241C7] 7.MIPS. 2018 CYGD: the Comprehensive Yeast Genome Database. See http://mips.helmholtz-muenchen.de/proj/yeast/tables/inventy.html. [DOI] [PMC free article] [PubMed]

[RSOB180241C8] 8.Christie KR, Hong EL, Cherry JM. 2009. Functional annotations for the Saccharomyces cerevisiae genome: the knowns and the known unknowns. Trends Microbiol. 17, 286–294. ( 10.1016/j.tim.2009.04.005) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C9] 9.Gibney PA, Hickman MJ, Bradley PH, Matese JC, Botstein D. 2013. Phylogenetic portrait of the Saccharomyces cerevisiae functional genome. G3 (Bethesda) 3, 1335–1340. ( 10.1534/g3.113.006585) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C10] 10.PomBase. 2018 Protein-coding gene characterisation status. See https://www.pombase.org/status/genecharacterisation-statistics-history.

[RSOB180241C11] 11.Peña-Castillo L, Hughes TR. 2007. Why are there still over 1000 uncharacterized yeast genes? Genetics 176, 7–14. ( 10.1534/genetics.107.074468) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C12] 12.Hughes TR, Robinson MD, Mitsakakis N, Johnston M. 2004. The promise of functional genomics: completing the encyclopedia of a cell. Curr. Opin. Microbiol. 7, 546–554. ( 10.1016/j.mib.2004.08.015) [DOI] [PubMed] [Google Scholar]

[RSOB180241C13] 13.2017. Expansion of the Gene Ontology knowledgebase and resources. Nucleic Acids Res. 45, D331–D338. ( 10.1093/nar/gkw1108) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C14] 14.Balakrishnan R, Park J, Karra K, Hitz BC, Binkley G, Hong EL, Sullivan J, Micklem G, Cherry JM. 2012. YeastMine—an integrated data warehouse for Saccharomyces cerevisiae data as a multipurpose tool-kit. Database 2012, bar062 ( 10.1093/database/bar062) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C15] 15.Smith RN, 2012. InterMine: a flexible data warehouse system for the integration and analysis of heterogeneous biological data. Bioinformatics 28, 3163–3165. ( 10.1093/bioinformatics/bts577) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C16] 16.Gene Ontology. 2018 Download annotations. See http://current.geneontology.org/products/pages/downloads.html (accessed 26 September 2018).

[RSOB180241C17] 17.Rhee SY, Wood V, Dolinski K, Draghici S. 2008. Use and misuse of the gene ontology annotations. Nat. Rev. Genet. 9, 509–515. ( 10.1038/nrg2363) [DOI] [PubMed] [Google Scholar]

[RSOB180241C18] 18.PomBase. 2018 Fission yeast GO slim. See https://www.pombase.org/browsecuration/fission-yeast-go-slim-terms.

[RSOB180241C19] 19.Lock A, Rutherford K, Harris MA, Wood V. 2018. PomBase: the scientific resource for fission yeast, pp. 49–68. New York, NY: Humana Press, [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C20] 20.Lock A. 2019. PomBase 2018: user-driven reimplementation of the fission yeast database provides rapid and intuitive access to diverse, interconnected information. Nucleic Acids Res. 47, D821–D827. ( 10.1093/nar/gky961 [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C21] 21.Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, Sherlock G. 2004. GO::TermFinder—open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics 20, 3710–3715. ( 10.1093/bioinformatics/bth456) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C22] 22.Gaudet P, Škunca N, Hu J, Dessimoz C. 2017. Primer on the gene ontology. Methods in Molecular Biology, vol. 1446 New York, NY: Humana Press. [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C23] 23.Gene Ontology. 2018 No biological data available (ND) evidence code. See http://wiki.geneontology.org/index.php/No_biological_Data_available_(ND)_evidence_code.

[RSOB180241C24] 24.Hayles J, Wood V, Jeffery L, Hoe K-L, Kim D-U, Park H-O, Salas-Pino S, Heichinger C, Nurse P. 2013. A genome-wide resource of cell cycle and cell shape genes of fission yeast. Open. Biol. 3, 130053 ( 10.1098/rsob.130053) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C25] 25.Hillenmeyer ME. et al. 2008. The chemical genomic portrait of yeast: uncovering a phenotype for all genes. Science 320, 362–365. ( 10.1126/science.1150021) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C26] 26.Castrillo JI, Oliver SG (eds). 2015. Systems biology of Alzheimer’s disease. Methods in Molecular Biology New York, NY: Humana Press. [PubMed] [Google Scholar]

[RSOB180241C27] 27.Edwards AM, Isserlin R, Bader GD, Frye SV, Willson TM, Yu FH. 2011. Too many roads not taken. Nature 470, 163–165. ( 10.1038/470163a) [DOI] [PubMed] [Google Scholar]

[RSOB180241C28] 28.Stoeger T, Gerlach M, Morimoto RI, Nunes Amaral LA. 2018. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 16, 1–25. ( 10.1371/journal.pbio.2006643) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C29] 29.PomBase. 2018 Taxonomic conservation. See https://www.pombase.org/documentation/taxonomic-conservation.

[RSOB180241C30] 30.Finn RD. 2017. InterPro in 2017-beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199. ( 10.1093/nar/gkw1107) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C31] 31.Matsuyama A. et al. 2006. ORFeome cloning and global analysis of protein localization in the fission yeast Schizosaccharomyces pombe. Nat. Biotechnol. 24, 841–847. ( 10.1038/nbt1222) [DOI] [PubMed] [Google Scholar]

[RSOB180241C32] 32.Kim D-U. et al. 2010. Analysis of a genome-wide set of gene deletions in the fission yeast Schizosaccharomyces pombe. Nat. Biotechnol. 28, 617–623. ( 10.1038/nbt.1628) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C33] 33.Chen J-S. et al. 2014. Identification of new players in cell division, DNA damage response, and morphogenesis through construction of Schizosaccharomyces pombe deletion strains. G3 (Bethesda) 5, 361–370. ( 10.1534/g3.114.015701) [DOI] [PMC free article] [PubMed] [Google Scholar]

[RSOB180241C34] 34.Wolfe KH, Shields DC. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387, 708–713. ( 10.1038/42711) [DOI] [PubMed] [Google Scholar]

PERMALINK

Hidden in plain sight: what remains to be discovered in the eukaryotic proteome?

Valerie Wood

Antonia Lock

Midori A Harris

Kim Rutherford

Jürg Bähler

Stephen G Oliver

Abstract

1. Slow progress in characterizing unknowns

Figure 1.

2. Defining unknown metrics: what counts as ‘known’?

2.1. Metrics to describe functional characterization levels

2.2. Annotation coverage for proteins by Gene Ontology aspect

Figure 2.

2.3. Known physiological function: Gene Ontology slim process coverage

Figure 3.

3. Why do these proteins remain unstudied?

3.1. Biological bias

Table 1.

3.2. Research bias

4. Classifying the conserved unknown proteins, or: what lies undiscovered?

Figure 4.

5. Strategies to link unknown genes to broad cellular roles

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Supplementary Material

Data accessibility

Authors' contributions

Competing interests

Funding

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases