GOcats: A tool for categorizing Gene Ontology into subgraphs of user-defined concepts

Eugene W Hinderer, III; Hunter N B Moseley

doi:10.1371/journal.pone.0233311

PLoS One. 2020 Jun 11;15(6):e0233311. doi: 10.1371/journal.pone.0233311

Eugene W Hinderer III ¹, Hunter N B Moseley ^1,^2,^3,^4,^5,^*

Editor: Marc Robinson-Rechavi⁶

PMCID: PMC7289357 PMID: 32525872

Abstract

Gene Ontology is used extensively in scientific knowledgebases and repositories to organize a wealth of biological information. However, interpreting annotations derived from differential gene lists is often difficult without manually sorting into higher-order categories. To address these issues, we present GOcats, a novel tool that organizes the Gene Ontology (GO) into subgraphs representing user-defined concepts, while ensuring that all appropriate relations are congruent with respect to scoping semantics. We tested GOcats’ performance using subcellular location categories to mine annotations from GO-utilizing knowledgebases and evaluated their accuracy against immunohistochemistry datasets in the Human Protein Atlas (HPA). In comparison to term categorizations generated from UniProt’s controlled vocabulary and from GO slims via OWLTools’ Map2Slim, GOcats outperformed these methods in its ability to mimic human-categorized GO term sets. Unlike the other methods, GOcats relies only on an input of basic keywords from the user (e.g. biologist), not a manually compiled or static set of top-level GO terms. Additionally, by identifying and properly defining relations with respect to semantic scope, GOcats can utilize the traditionally problematic relation, has_part, without encountering erroneous term mapping. We applied GOcats in the comparison of HPA-sourced knowledgebase annotations to experimentally-derived annotations provided by HPA directly. During the comparison, GOcats improved correspondence between the annotation sources by adjusting semantic granularity. GOcats enables the creation of custom, GO slim-like filters to map fine-grained gene annotations from gene annotation files to general subcellular compartments without needing to hand-select a set of GO terms for categorization. Moreover, GOcats can customize the level of semantic specificity for annotation categories. Furthermore, GOcats enables a safe and more comprehensive use of go-core with respect to semantic scoping, allowing more of the information available in GO to be utilized. Together, these improvements can impact a variety of GO knowledgebase data mining use-cases as well as knowledgebase curation and quality control.

Introduction

Gene Ontology (GO)

The Gene Ontology (GO) [1] is the most common biology-focused controlled vocabulary (CV) used to represent information and knowledge distilled from most biological and biomedical research data generated today, from classic wet-bench experiments to high-throughput analytical platforms, especially omics technologies. Each CV term in GO is assigned a unique alphanumeric code and is used to annotate genes and gene products in many other databases, including UniProt [2] and Ensembl [3]. GO is divided into three sub-ontologies: Cellular Component, Molecular Function, and Biological Process. A graph represents each sub-ontology, where individual GO terms are nodes connected by directional edges (i.e. relations). For example, the term “lobed nucleus” (GO:0098537) is connected by a directional is_a relation edge to the term “nucleus” (GO:0005634). In this graph context, the is_a relation defines the term “nucleus” as a parent of the term “lobed nucleus”. There are eleven types of relations used in the core version of GO; however, is_a is the most ubiquitous. The three GO sub-ontologies are “is_a disjoint”, meaning that no is_a relation connects a node in one sub-ontology to a node in another.
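The graph view described above can be illustrated with a minimal sketch. The two term IDs and the is_a edge come from the example in the text; the edge-list layout and the helper names (`build_parent_index`, `ancestors`) are assumptions made for illustration and are not taken from GO tooling or from GOcats.

```python
from collections import defaultdict

# Term names and the single edge below are the example given in the text.
TERM_NAMES = {
    "GO:0098537": "lobed nucleus",
    "GO:0005634": "nucleus",
}
# Each edge is (child_id, relation, parent_id). This layout is an assumption.
EDGES = [("GO:0098537", "is_a", "GO:0005634")]


def build_parent_index(edges):
    """Map each term to the set of (relation, parent) pairs it points to."""
    parents = defaultdict(set)
    for child, relation, parent in edges:
        parents[child].add((relation, parent))
    return parents


def ancestors(term, parents, relations=("is_a",)):
    """Collect every term reachable by following the allowed relations upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in parents.get(current, ()):
            if relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


parents = build_parent_index(EDGES)
print(ancestors("GO:0098537", parents))  # {'GO:0005634'}
```

Restricting `relations` to `("is_a",)` mirrors the parent-child reading of is_a in the text; other relation names could be passed in to follow additional edge types.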

There are also three versions of the GO database: go-basic, which is filtered to include only is_a and part_of relations; go (or go-core), which contains additional relations that may span sub-ontologies and that point both toward and away from the top of the ontology; and go-plus, which contains cross-references to entries in external databases and ontologies.

Growth and evolution of biological controlled vocabularies

GO and other CVs like the Unified Medical Language System [4,5] saw an explosion in development in the mid-1990s and early 2000s, coinciding with the increase in high-throughput experimentation and “big data” projects like the Human Genome Project. Their intended purpose is to standardize the functional descriptions of biological entities so that these functions can be referenced via annotations across large databases unambiguously, consistently, and with increased automation. However, ontology annotations are also utilized alongside automated pipelines that analyze protein-protein interaction networks and form predictions of unknown protein function based on these networks [6,7], for gene annotation enrichment analyses, and are now being leveraged for the creation of predictive disease models in the scope of systems biochemistry [8].

Difficulty in representing biological concepts derived from omics-level research

Differential abundance analyses for a range of omics-level technologies, especially transcriptomics technologies, can yield large lists of differential genes, gene-products, or gene variants. Many different GO annotation terms may be associated with these differential gene lists, making the lists difficult to interpret without manually sorting the terms into appropriate descriptive categories [9]. It is similarly non-trivial to give a broad overview of a gene set or make queries for genes with annotations for a specific biological concept. For example, a recent effort to create a protein-protein interaction network analysis database resorted to manually building a hierarchical localization tree from GO cellular compartment terms due to the “incongruity in the resolution of localization data” in various source databases and the fact that no published method existed at that time for the automated organization of such terms [6]. If a subgraph of GO could be programmatically extracted to represent a specific biological concept, a category-defining general term could be easily associated with all its ontological child terms within the subgraph.

Meanwhile, high-throughput transcriptomic and proteomic characterization efforts like those carried out by the Human Protein Atlas (HPA) now provide sophisticated pipelines for resolving expression profiles at organ, tissue, cellular and subcellular levels by integrating quantitative transcriptomics with microarray-based immunohistochemistry [10]. Such efforts create a huge amount of omics-level experimental data that is cross-validated and distilled into systems-level annotations linking genes, proteins, biochemical pathways, and disease phenotypes across our knowledgebases. However, annotations provided by such efforts may vary in terms of granularity, annotation sets used, or ontologies used. Therefore, (semi-)automated (i.e. at least partially automated) and unbiased methods for categorizing semantically-similar and biologically-related annotations are needed for integrating information from heterogeneous sources—even if the annotation terms themselves are standardized—to facilitate effective downstream systems-level analyses and integrated network-based modeling.

Term categorization approaches

Issues of term organization and term filtering have led to the development of GO slims—manually trimmed versions of the gene ontology containing only generalized terms [11], which represent concepts within GO. Other software, like Categorizer [9], can organize the rest of GO into representative categories using semantic similarity measurements between GO terms. GO slims may be used in conjunction with mapping tools, such as OWLTools’ (https://github.com/owlcollab/owltools) Map2Slim (M2S) or GOATools (https://zenodo.org/record/31628), to map fine-grained annotations within Gene Annotation Files (GAFs) to the appropriate generalized term(s) within the GO slim or within a list of GO terms of interest. While web-based tools such as QuickGO exist to help compile lists of GO terms [12], using M2S either relies completely on the structure of existing GO slims or requires input or selection of individual GO identifiers for added customization, and necessitates the use of other tools for mapping. UniProt has also developed a manually-created mapping of GO to a hierarchy of biologically-relevant concepts [13]. However, it is smaller and less maintained than GO slims, and is intended for use only within UniProt’s native data structure.

Semantic similarity in the context of broad term categorization

In addition to utilizing the inherent hierarchical organization of GO to categorize terms, other metrics may be used for categorization. For instance, semantic similarity can be combined with the GO structure to calculate a statistical value indicating whether a term should belong to a predefined group or category [9,14–17]. One rationale for this type of approach is that the topological distance between two terms in the ontology graph is not necessarily proportional to the semantic closeness in meaning between those terms, and semantic similarity reconciles potential inconsistencies between semantic closeness and graph distance. Additionally, some nodes have multiple parents, where one parent is more closely related to the child than the others [9]. Semantic similarity can help determine which parent is semantically more closely related to the term in question. While these issues are valid, we maintain that in the context of aggregating fine-grained terms into general categories, these considerations are not necessary. First, fluctuations in semantic distances between individual terms are not an issue once terms are binned into categories: all binned terms will be reduced to a single step away from the category-defining node. Second, the problem of choosing the most appropriate parent term for a GO term only causes problems when selecting a representative node for a category; however, since most paths eventually converge onto a common ancestor, any significantly diverging paths would have their meaning captured by rooting multiple categories to a single term, cleanly sidestepping the issue.

Maintenance of ontologies

Despite maintenance and standard policies for adding terms, ontological organization is still subject to human error and disagreement, necessitating quality assurance and revision, especially as ontologies evolve or merge. A recent review of current methods for biomedical ontology mapping highlights the importance of developing semi-automatic methods [18,19] to aid in ontology evolution efforts and reiterates the aforementioned concept of semantic correspondence in terms of scoping between terms [20]. Methods incorporating such correspondences have been published elsewhere, but these deal with issues of ontology evolution and merging, and not with categorizing terms into user-defined subsets [21,22]. Ontology merging also continues to be an active area of development for integrating functional, locational, and phenotypic information. To aid in this, another recent review points out the importance of integrating phenotypic information across various levels of organismal complexity, from the cellular level to the organ system level [8]. Thus, organizing location-relevant ontology terms into discrete categories is an important step toward this end.

GO Categorization Suite (GOcats)

For the reasons indicated above, we have developed a tool called the GO Categorization Suite (GOcats), which serves to streamline the process of slicing the ontology into subgraphs representing specific biological concepts. Unlike previously developed tools, GOcats works with a list of user-provided keywords and/or GO terms, along with the structure of GO and augmented relation properties. Based on this input, GOcats automatically extracts a subgraph of related GO terms, identifies a representative category-defining GO term for the subgraph, and maps all subgraph child GO terms to this representative GO term. In essence, GOcats automatically generates a concept-specific GO slim with only keywords and GO terms provided by a user, typically a biologist. Furthermore, GOcats allows the user to choose between the strict axiomatic interpretation or a looser semantic scoping interpretation of part-whole (mereological) relation edges within GO. Specifically, we consider scoping relations to be comprised of is_a, part_of, and has_part, and mereological relations to be comprised of part_of and has_part. In the next section, we evaluate GOcats’ ability to generate category-specific subgraphs and to utilize these subgraphs to compare knowledgebase annotations to their experimental source (i.e. the HPA). Due to the nature of the experimentally verified properties available from the HPA, our analysis in this paper focuses on cellular locations, especially subcellular locations. Also, this paper provides an in-depth description of GOcats’s methods and their implementation. In a prior publication, we demonstrated GOcats’s ability to improve gene-annotation enrichment analyses, involving all GO sub-ontologies [23].
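The keyword-to-category workflow described above can be sketched on a toy ontology. Everything in this block is hypothetical: the `X:` identifiers and edges are invented, the whole-word keyword match is a simplification, and the rule for picking the representative term (the seeded term with the most seeded descendants) is an assumption for illustration rather than the documented GOcats procedure.

```python
# Toy ontology (hypothetical IDs and edges, not real GO content).
TOY_NAMES = {
    "X:1": "nucleus",
    "X:2": "nuclear membrane",
    "X:3": "nuclear pore",
    "X:4": "chromatin",
    "X:5": "host cell nucleus",
    "X:6": "host cell",
}
# child -> set of parents; every edge is treated as a scoping relation.
TOY_PARENTS = {
    "X:2": {"X:1"},
    "X:3": {"X:2"},
    "X:4": {"X:1"},
    "X:5": {"X:6"},
}


def descendants(term, children):
    """Return all terms below `term` in the child index."""
    found, stack = set(), [term]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def categorize(names, child_to_parents, keywords):
    """Seed by keyword, pick a representative, then extend and prune."""
    children = {}
    for child, parent_set in child_to_parents.items():
        for parent in parent_set:
            children.setdefault(parent, set()).add(child)
    # Step 1: seed with terms whose name contains a keyword as a whole word.
    seeds = {t for t, name in names.items()
             if any(k in name.split() for k in keywords)}
    # Step 2 (assumed rule): representative = seed with most seeded descendants.
    representative = max(
        sorted(seeds), key=lambda t: len(descendants(t, children) & seeds))
    # Step 3: extend to all descendants of the representative; drop other seeds.
    subgraph = {representative} | descendants(representative, children)
    return {
        "representative": representative,
        "seeded": seeds,
        "added": subgraph - seeds,
        "removed": seeds - subgraph,
        "mapping": {term: representative for term in subgraph},
    }


result = categorize(TOY_NAMES, TOY_PARENTS, {"nucleus", "nuclear"})
print(result["representative"], sorted(result["mapping"]))
```

In this toy case "chromatin" is picked up only during extension, while "host cell nucleus" matches a keyword but is dropped because it does not root to the representative term.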

Results

GOcats compactly organizes GO subcellular localization terms into user-specified categories

As an initial proof-of-concept, we evaluated the automatic extraction and categorization of 25 subcellular locations, using GOcats’ “comprehensive” method of subgraph extension (See Methods and the go-core graph, data-version: releases/2016-01-12). Starting with common biological subcellular concepts like “nucleus”, “cytoplasm”, and “mitochondrion”, we recursively used terms not being categorized to identify additional subcellular concepts and associated keywords represented within the GO Cellular Component sub-ontology. Due to the eventual application to the HPA datasets, three unusual categories, “bacterial”, “viral”, and “other organism”, were included to prevent categorization of terms that would complicate a eukaryotic interpretation of the other 22 subcellular locations, within the context of a greedy subgraph extension algorithm. For these resulting 25 categories, 22 contained a designated GO term root-node that exactly matched the concept intended at the creation of the keyword list (Table 1).

Table 1. Summary of 25 example subcellular locations extracted by GOcats.

Subgraph name	User-input keywords	Predicted representative term (ID)	Nodes seeded from keyword search^a	Nodes added during graph extension^b	Seeded nodes not in subgraph^c	Total nodes^d
Aggresome	aggresome, aggresomal, aggresomes	aggresome (GO:0016235)	1	0	0	1
Bacterial	bacterial, bacteria, bacterial-type	bacterial-type flagellum (GO:0009288)	136	1	121	16
Cell Junction	junction	cell junction (GO:0030054)	68	16	34	50
Chromosome	chromosome, chromosomal, chromosomes	chromosome (GO:0005694)	120	122	31	211
Cytoplasm	cytoplasm, cytoplasmic	cytoplasm (GO:0005737)	296	1061	160	1197
Cytoplasmic Granule	granule, granules	secretory granule (GO:0030141)	81	16	50	47
Cytoskeleton	cytoskeleton, cytoskeletal	cytoskeleton (GO:0005856)	78	194	47	225
Cytosol	cytosol, cytosolic	cytosol (GO:0005829)	56	51	28	79
Endoplasmic Reticulum	endoplasmic, sarcoplasmic, reticulum	endoplasmic reticulum (GO:0005783)	113	39	51	101
Endosome	endosome, endosomes, endosomal	endosome (GO:0005768)	67	15	24	58
Extracellular	extracellular, secreted	extracellular region (GO:0005576)	142	123	85	180
Golgi Apparatus	golgi	golgi apparatus (GO:0005794)	67	12	25	54
Lysosome	lysosome, lysosomal, lysosomes	lysosome (GO:0005764)	42	7	16	33
Macromolecular Complex	protein, macromolecular	macromolecular complex (GO:0032991)	1317	969	184	2102
Microbody	microbody, microbodies	microbody (GO:0042579)	4	20	0	24
Mitochondrion	mitochondria, mitochondrial, mitochondrion	mitochondrion (GO:0005739)	134	2	44	92
Neuron Part	neuron, neuronal, neurons, synapse	neuron part (GO:0097458)	90	94	35	149
Nucleolus	nucleolus, nucleolar	nucleolus (GO:0005730)	25	11	12	24
Nucleus	nucleus, nuclei, nuclear	nucleus (GO:0005634)	288	340	118	510
Other Organism	other, host, organism	other organism (GO:0044215)	369	12	259	122
Plasma Membrane	plasma	plasma membrane (GO:0005886)	308	302	164	446
Plastid	plastid, chloroplast	plastid (GO:0009536)	95	48	8	135
Thylakoid	thylakoid, thylakoids	thylakoid (GO:0009579)	52	22	11	63
Vesicle	vesicle, vesicles	vesicle (GO:0031982)	198	90	85	203
Viral	virion, virus, viral	viral occlusion body (GO:0039679)	93	1	26	68

^aNodes seeded from keyword search.

^bNodes added through subgraph extension.

^cSeeded nodes removed due to subgraph omission.

^dBecause subgraph nodes may root to more than one representative root node, the totals in this table do not add up to the total number of GO terms in Cellular Component.

These subgraphs account for approximately 89% of GO’s Cellular Component sub-ontology. While keyword querying of GO provided an initial seeding of the growing subgraph, Table 1 highlights the necessity of re-analyzing the GO graph, both to remove terms erroneously added by the keyword search and to add appropriate subgraph terms not captured by the keyword search. For example, the “cytoplasm” subgraph grew from its initial seeding of 296 nodes to 1197 nodes after extension. Conversely, 136 nodes were seeded by keyword for the “bacterial” subgraph, but only 16 were rooted to the representative node.
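The node counts in Table 1 can be cross-checked with a short sketch. The relation used here (total = seeded + added - removed) is not stated in the text; it is an observation that holds for the rows copied below, and the names `TABLE1_ROWS` and `expected_total` are illustrative.

```python
# (seeded, added, removed, total) copied from four rows of Table 1.
TABLE1_ROWS = {
    "Bacterial": (136, 1, 121, 16),
    "Cytoplasm": (296, 1061, 160, 1197),
    "Macromolecular Complex": (1317, 969, 184, 2102),
    "Nucleus": (288, 340, 118, 510),
}


def expected_total(seeded, added, removed):
    """Seeded nodes, plus nodes added by extension, minus seeds that were dropped."""
    return seeded + added - removed


for name, (seeded, added, removed, total) in TABLE1_ROWS.items():
    # Fails loudly if a row does not follow the assumed bookkeeping.
    assert expected_total(seeded, added, removed) == total, name
    print(f"{name}: {seeded} + {added} - {removed} = {total}")
```

For the "bacterial" row this reproduces the point made above: 136 seeded nodes shrink to 16 once the 121 seeds that do not root to the representative node are removed.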

To assess the relative size and structure of subgraphs within GO, we visualized the category subgraphs as a network using Cytoscape 3.0 [24]. GOcats outputs a dictionary of individual GO term keys with a list of category-defining root-node values as part of its normal functionality.
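A dictionary of the shape described above (GO term keys, each with a list of category root nodes) can be turned into per-category term sets and pairwise overlap counts, which is one plausible input for a network view. The category IDs below are taken from Table 1; the `term_*` keys and their category assignments are invented placeholders, and the helper names are assumptions rather than GOcats functions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical excerpt: term -> list of category-defining root nodes.
term_to_categories = {
    "term_1": ["GO:0005634"],                # nucleus only
    "term_2": ["GO:0005634", "GO:0032991"],  # nucleus and macromolecular complex
    "term_3": ["GO:0005737", "GO:0032991"],  # cytoplasm and macromolecular complex
    "term_4": ["GO:0032991"],                # macromolecular complex only
}


def invert(mapping):
    """Build category -> set of member terms."""
    category_to_terms = defaultdict(set)
    for term, categories in mapping.items():
        for category in categories:
            category_to_terms[category].add(term)
    return dict(category_to_terms)


def shared_term_counts(category_to_terms):
    """Count terms shared by each pair of categories (candidate edge weights)."""
    counts = {}
    for a, b in combinations(sorted(category_to_terms), 2):
        shared = len(category_to_terms[a] & category_to_terms[b])
        if shared:
            counts[(a, b)] = shared
    return counts


category_to_terms = invert(term_to_categories)
edge_weights = shared_term_counts(category_to_terms)
print(edge_weights)
```

Terms that appear under a single category (such as `term_4` here) correspond to the "rooted to this category alone" case discussed for macromolecular complex.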

Of note, 2102 of the 3877 terms in Cellular Component could be rooted to a single concept: “macromolecular complex.” Despite cytosol being defined as “the part of the cytoplasm that does not contain organelles, but which does contain other particulate matter, such as protein complexes”, fewer than half of the terms rooted to macromolecular complex also rooted to cytosol or cytoplasm. Surprisingly, approximately 25% of the terms rooted to macromolecular complex are rooted to this category alone (Fig 1A). In this visualization, intracellular organelles tend to be clustered about cytoplasm, except for nucleus, which the GO consortium does not consider part of the cytoplasm. The visualization of the subgraph contents confirmed the uniqueness of the macromolecular complex category and showed the relative sizes of groups of GO terms shared between two or more categories. However, the macromolecular complex category somewhat complicates the visualization of category organization within GO, due to this category’s size and interconnectedness within the ontology.

Fig 1 — A. Network of 25 categories whose subgraphs account for 89% of the GO cellular component sub-ontology. B. Network of all categories from A except for Macromolecular Complex. C. Network of 20 categories used in the Human Protein Atlas subcellular localization immunohistochemistry raw data.

To better reflect what might be a biologist’s expectation for a cell’s overall organization, we produced another visualization with the macromolecular complex category omitted (Fig 1B). Despite the idiosyncrasies with the macromolecular complex subgraph, compartments that typically contain a large range of protein complexes, such as the nucleus, plasma membrane, and cytoplasm appear to be appropriately populated. Furthermore, concepts such as endomembrane trafficking can be gleaned from the network connectedness of representative nodes, such as lysosome, Golgi apparatus, vesicle, secretory granule, and cytoplasm. Overall, the patterns of connectedness in this network make more sense biologically, within the constraints of GO’s internal organization. In other words, it is easier to see the expected biological relationships between cellular locations in Fig 1B versus Fig 1A.

GOcats-derived category subgraphs compare well with similar subgraphs derived by other methods

We compared GOcats’ category subgraphs taken from the go-core database, data-version: releases/2016-01-12 to subgraphs of the manually-curated UniProt subcellular localization controlled vocabulary (CV) [13] (see Fig 2 and Methods) and to subgraphs created by M2S (see Methods). Differences in the sets of GO terms contained within these subgraphs can be attributed to differences in the number of edges between nodes—as is the case between GOcats and M2S since M2S does not traverse across has_part edges—and the number of overall nodes being evaluated—as is the case when comparing M2S and GOcats term sets to the UniProt CV terms sets since the UniProt CV contains considerably fewer GO terms. For the most part, GOcats category subgraphs are large supersets of UniProt CV subgraphs, as demonstrated by the high inclusion indices and low Jaccard indices in Table 2. In the comparison of GOcats and M2S subgraphs, the mappings for most categories are in very close agreement, as evidenced by both high inclusion and Jaccard indices in Table 3 and further highlighted in Fig 3A and 3B and S1 Data A-V [25]. Overall, GOcats robustly categorizes GO terms into category subgraphs with high similarity to existing GO-utilizing categorization methods while including information gleaned from has_part edges.
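The two agreement measures can be sketched as set operations. The Jaccard index is the standard intersection-over-union; the inclusion index is written here as the shared fraction of the smaller set, following the description that it quantifies how far the smaller subgraph is included in the larger one. The example sets are synthetic: their sizes (50 and 21) match the Cell Junction row of Table 2, and the overlap of 10 terms is inferred from the reported indices rather than stated in the text. Both functions assume non-empty sets.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


def inclusion_index(a, b):
    """Fraction of the smaller set that is also present in the larger set."""
    return len(a & b) / min(len(a), len(b))


# Synthetic stand-ins for term sets: 50 and 21 members with 10 in common.
gocats_terms = set(range(50))
uniprot_terms = set(range(40, 61))

print(round(inclusion_index(gocats_terms, uniprot_terms), 5))  # 0.47619
print(round(jaccard_index(gocats_terms, uniprot_terms), 6))    # 0.163934
```

A high inclusion index combined with a low Jaccard index, as in most rows of Table 2, indicates that one set is largely contained in a much bigger one.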

Table 2. Agreement summary between corresponding GOcats and UniProt CV subgraphs.

Location Category	Term ID	Inclusion Index	Jaccard Index	GOcats subgraph size	UniProt CV subgraph size
Bacterial-type Flagellum	GO:0009288	1	0.0625	16	1
Cell Junction	GO:0030054	0.47619	0.163934	50	21
Chromosome	GO:0005694	1	0.0189573	211	4
Cytoplasm	GO:0005737	0.809524	0.0141549	1197	21
Endoplasmic Reticulum	GO:0005783	0.818182	0.0873786	101	11
Endosome	GO:0005768	1	0.241379	58	14
Extracellular Region	GO:0005576	0.5625	0.0481283	180	16
Golgi Apparatus	GO:0005794	0.8	0.142857	54	10
Lysosome	GO:0005764	1	0.0909091	33	3
Mitochondrion	GO:0005739	1	0.0978261	92	9
Nucleus	GO:0005634	1	0.0294118	510	15
Plastid	GO:0009536	0.846154	0.307692	135	52

Table 3. Agreement summary between corresponding GOcats and Map2Slim subgraphs.

Location Category	Term ID	Inclusion Index^‡	Jaccard Index	GOcats subgraph size	Map2Slim subgraph size	"Has_part" relationships
Aggresome	GO:0016235	1	1	1	1	0
Bacterial-type Flagellum	GO:0009288	1	1	16	16	8
Cell Junction	GO:0030054	0.980392	0.980392	50	51	4
Chromosome	GO:0005694	0.984375	0.883178	211	192	40
Cytoplasm	GO:0005737	0.927273	0.452055	1197	605	38
Cytoskeleton	GO:0005856	0.812274	0.812274	225	277	10
Cytosol	GO:0005829	0.963415	0.963415	79	82	8
Endoplasmic Reticulum	GO:0005783	1	0.990099	101	100	4
Endosome	GO:0005768	1	1	58	58	0
Extracellular Region	GO:0005576	1	0.927778	180	167	2
Golgi Apparatus	GO:0005794	1	1	54	54	0
Lysosome	GO:0005764	1	1	33	33	0
Macromolecular Complex	GO:0032991	0.947274	0.947274	2102	2219	232
Microbody	GO:0042579	1	1	24	24	0
Mitochondrion	GO:0005739	0.978723	0.978723	92	94	8
Neuron Part	GO:0097458	1	0.993289	149	148	22
Nucleolus	GO:0005730	0.857143	0.857143	24	28	0
Nucleus	GO:0005634	0.991684	0.928016	510	481	168
Other Organism	GO:0044215	1	1	122	122	8
Plasma Membrane	GO:0005886	0.563081	0.547097	446	753	20
Plastid	GO:0009536	0.992647	0.992647	135	136	0
Secretory Granule	GO:0030141	1	1	47	47	0
Thylakoid	GO:0009579	1	1	63	63	0
Vesicle	GO:0031982	0.981132	0.757282	203	159	12
Viral Occlusion Body	GO:0039679	1	0.0147059	68	1	4

^‡ Inclusion index quantifies the extent to which the smaller subgraph is included in the larger subgraph

However, in some categories, M2S and GOcats disagree, as illustrated in Fig 3C and S1(E) Data. The most striking example of this is in the plasma membrane category, where M2S’s subgraph contained over 300 terms that were not mapped by GOcats. We manually examined these discrepancies in the plasma membrane category and noted that many of the terms uniquely mapped by M2S did not appear to be properly rooted to “plasma membrane” (S2 Data). M2S mapped terms such as “nuclear envelope,” “endomembrane system,” “cell projection cytoplasm”, and “synaptic vesicle, resting pool” to the plasma membrane category, while such questionable associations were not made using GOcats. Even though most terms included by M2S but excluded by GOcats exist beyond the scope of or are largely unrelated to the concept of “plasma membrane,” a few terms in the set did seem appropriate, such as “intrinsic component of external side of cell outer membrane.” However, for these examples, no logical semantic path could be traced between the term and “plasma membrane” in GO, indicating that these associations are not present in the ontology itself. These differences in mapping are due to our reevaluation of the has_part edges with respect to scope. As shown in Table 3, the categories with the greatest agreement between the two methods were those with no instances of has_part relations, which is the only relation in Cellular Component that is natively incongruent with respect to scope. However, there is no apparent correlation between the frequency of this relation and the extent of disagreement.
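One way to picture what it means for has_part to be incongruent with respect to scope is to normalize every edge so that it points from the narrower term to the broader one. The sketch below assumes that is_a and part_of already point from narrow to broad and that has_part points from the whole to its part, so has_part edges are flipped before traversal. This is an illustrative reading of the scoping idea, not the GOcats implementation, and the term names are placeholders.

```python
# Assumed direction of each relation with respect to semantic scope.
SCOPING_DIRECTION = {
    "is_a": "narrow_to_broad",
    "part_of": "narrow_to_broad",
    "has_part": "broad_to_narrow",
}

# Placeholder edges written as (source, relation, target).
RAW_EDGES = [
    ("organelle_membrane", "part_of", "organelle"),
    ("organelle", "has_part", "organelle_lumen"),
]


def to_scoping_edge(source, relation, target):
    """Return (narrower, broader) so every edge points toward broader scope."""
    if SCOPING_DIRECTION[relation] == "narrow_to_broad":
        return (source, target)
    return (target, source)


def broader_terms(term, raw_edges):
    """Follow scope-normalized edges upward from `term`."""
    scoped = [to_scoping_edge(*edge) for edge in raw_edges]
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        for narrower, broader in scoped:
            if narrower == current and broader not in found:
                found.add(broader)
                stack.append(broader)
    return found


print(broader_terms("organelle_lumen", RAW_EDGES))  # {'organelle'}
print(broader_terms("organelle", RAW_EDGES))        # set()
```

Following the raw has_part edge in its stored direction would instead place "organelle" under "organelle_lumen", which is the kind of erroneous mapping the scoping reinterpretation is meant to avoid.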

Custom-tailoring of GO slim-like categories with GOcats allows for robust knowledgebase gene annotation mining

The ability to query knowledgebases for genes and gene products related to a set of general concepts-of-interest is an important method for biologists and bioinformaticians alike. We hypothesized that grouping annotations into categories using GOcats and relevant keywords would more closely match the annotations categorized manually by the HPA consortium than either M2S or UniProt’s CV. Using the set of GO terms annotated in the HPA’s immunohistochemistry localization raw data as “concepts” (Table 4), we derived mappings to annotation categories generated from GOcats, M2S, and UniProt’s CV based on UniProt- and Ensembl-sourced annotations from the European Molecular Biology Laboratories-European Bioinformatics Institute (EMBL-EBI) QuickGO knowledgebase resource [12] (See Methods). In this context, the term “raw data” refers to processed, curated experimental data that is annotated, in contrast to the GO annotations derived from a knowledgebase.

Table 4. Summary of 20 subcellular locations used in the HPA raw experimental data extracted by GOcats.

Subgraph name	User-input keywords	Predicted representative term (ID)	Nodes seeded from keyword search	Nodes added during graph extension	Seeded nodes not in subgraph^a	Total nodes^b
Actin cytoskeleton	actin cytoskeleton	actin cytoskeleton (GO:0015629)	117	22	77	62
Aggresome	aggresome, aggresomal, aggresomes	aggresome (GO:0016235)	1	0	0	1
Cell Junction	junction	cell junction (GO:0030054)	68	16	34	50
Centrosome	centrosome	centrosome (GO:0005813)	10	2	5	7
Cytoplasm	cytoplasm, cytoplasmic	cytoplasm (GO:0005737)	296	1061	160	1197
Endoplasmic Reticulum	endoplasmic, sarcoplasmic, reticulum	endoplasmic reticulum (GO:0005783)	113	39	51	101
Focal adhesion	focal adhesion	focal adhesion (GO:0005925)	29	0	28	1
Golgi Apparatus	golgi	golgi apparatus (GO:0005794)	67	12	25	54
Intercellular bridge	intercellular bridge	intercellular bridge (GO:0045171)	24	2	19	7
Intermediate filament cytoskeleton	intermediate filament cytoskeleton	intermediate filament cytoskeleton (GO:0045111)	126	0	118	8
Intracellular membrane-bounded organelle (vesicle^c)	intracellular membrane-bounded organelle	intracellular membrane-bounded organelle (GO:0043231)	229	1116	118	1227
Microtubule cytoskeleton	microtubule cytoskeleton	microtubule cytoskeleton (GO:0015630)	112	55	68	109
Microtubule end	microtubule end	microtubule end (GO:1990752)	138	0	133	5
Microtubule organizing center	microtubule organizing center	microtubule organizing center (GO:0005815)	110	34	95	49
Mitochondrion	mitochondria, mitochondrial, mitochondrion	mitochondrion (GO:0005739)	134	2	44	92
Nuclear membrane	nuclear membrane	nuclear membrane (GO:0031965)	1151	0	1139	12
Nucleolus	nucleolus, nucleolar	nucleolus (GO:0005730)	25	11	12	24
Nucleoplasm	nucleoplasm	nucleoplasm (GO:0005654)	10	125	4	131
Nucleus	nucleus, nuclei, nuclear	nucleus (GO:0005634)	288	340	118	510
Plasma Membrane	plasma	plasma membrane (GO:0005886)	308	302	164	446

^aSeeded nodes removed due to subgraph omission.

^bBecause subgraph nodes may root to more than one representative root node, the totals in this table do not add up to the total number of GO terms in Cellular Component.

^cHPA conservatively annotates "vesicles" as intracellular membrane-bounded organelle.

Next, we evaluated how these derived annotation categories matched raw HPA data GO annotations (See Fig 4 and Methods). GOcats slightly outperformed M2S and significantly outperformed UniProt’s CV in the ability to query and extract genes and gene products from the knowledgebase that exactly matched the annotations provided by the HPA (Fig 5A). Similar relative results are seen for partially matched knowledgebase annotations. Genes in the “partial agreement,” “partial agreement is superset,” or “no agreement” groups may have annotations from other sources that place the gene in a location not tested by the HPA immunohistochemistry experiments or may be due to non-HPA annotations being at a higher semantic scoping than what the HPA provided. Also, novel localization provided by the HPA could explain genes in the “partial agreement” and “no agreement” groups. In this context, “partial agreement” refers to genes with at least one matching subcellular location, “partial agreement is superset” refers to genes where knowledgebase subcellular locations are a superset of the HPA dataset (these are mutually exclusive to the “partial agreement” category), "no agreement" refers to genes with no subcellular locations in common, and “no annotations” refers to genes in the experimental dataset that were not found in the knowledgebase.
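The agreement types defined above can be expressed as a small set-based classifier. This is a sketch of the definitions as worded in the text, not the evaluation code used in the study; the function name, the use of `None` to signal a gene missing from the knowledgebase, and the assumption that the HPA location set is non-empty are all illustrative choices.

```python
def classify_agreement(hpa_locations, kb_locations):
    """Classify one gene by comparing HPA and knowledgebase location sets.

    `kb_locations` is None when the gene is absent from the knowledgebase.
    """
    if kb_locations is None:
        return "no annotations"
    hpa, kb = set(hpa_locations), set(kb_locations)
    if kb == hpa:
        return "complete agreement"
    if kb > hpa:  # proper superset: everything in HPA plus something more
        return "partial agreement is superset"
    if kb & hpa:  # at least one shared location
        return "partial agreement"
    return "no agreement"


print(classify_agreement({"nucleus"}, {"nucleus"}))
print(classify_agreement({"nucleus"}, {"nucleus", "cytoplasm"}))
print(classify_agreement({"nucleus", "nucleolus"}, {"nucleus", "cytoplasm"}))
print(classify_agreement({"nucleus"}, {"cytoplasm"}))
print(classify_agreement({"nucleus"}, None))
```

Checking the superset case before the general overlap case keeps "partial agreement is superset" and "partial agreement" mutually exclusive, as the definitions require.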

Fig 5 — “Complete agreement” refers to genes where all subcellular locations derived from the knowledgebase and the HPA dataset matched, “partial agreement” refers to genes with at least one matching subcellular location, “partial agreement is superset” refers to genes where knowledgebase subcellular locations are a superset of the HPA dataset (these are mutually exclusive to the “partial agreement” category), "no agreement" refers to genes with no subcellular locations in common, and “no annotations” refers to genes in the experimental dataset that were not found in the knowledgebase. The more-generic categories used in panel B can be found in Table 5. A) Number of genes of the given agreement type when comparing mapped gene product annotations assigned by UniProt and Ensembl in the EMBL-EBI knowledgebase to those taken from The Human Protein Atlas’ raw data. Knowledgebase annotations were mapped by GOcats, Map2Slim, and the UniProt CV to the set of GO annotations used by the HPA in their experimental data. B) Shift in agreement following GOcats’ mapping of the same knowledgebase gene annotations and the set of annotations used in the raw experimental data using a more-generic set of location terms meant to rectify potential discrepancies in annotation granularity.

Furthermore, GOcats performed the categorization of HPA’s subcellular locations dataset in an average of 10.574 seconds over 50 test runs (standard deviation of 0.074 seconds), while M2S performed its mapping on the same data in an average of 14.837 seconds over 50 test runs (standard deviation of 0.300 seconds) (see Methods for hardware configuration details). These results are rather surprising since GOcats is implemented in Python [26], an interpreted language, whereas M2S is implemented in Java and compiled to Java byte code. However, through the use of Python decorators, GOcats recursively creates and stores ancestor and descendant node sets in a manner analogous to lazy evaluation, allowing the implementation of efficient subgraph-centric algorithms that precompute only the ancestor and descendant sets that are needed. Based on these results, GOcats should offer appreciable computational improvement on significantly larger datasets. This is demonstrated in GOcats’s application in annotation enrichment analysis involving all three GO sub-ontologies, which executes in just a few seconds [23].
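The decorator-based lazy evaluation mentioned above can be illustrated with the standard library's `functools.cached_property` (Python 3.8 or later). This is a generic sketch of the idea, not GOcats's own decorator or class design: the `Term` class and its attributes are hypothetical, and the graph is assumed to be acyclic and fully built before any ancestor set is requested.

```python
from functools import cached_property


class Term:
    """Hypothetical ontology node holding direct parents only."""

    def __init__(self, name):
        self.name = name
        self.parents = set()

    @cached_property
    def ancestors(self):
        """Compute the ancestor set on first access, then reuse the stored value."""
        result = set()
        for parent in self.parents:
            result.add(parent)
            result |= parent.ancestors  # recursion fills each parent's cache too
        return result


root, middle, leaf, other = Term("root"), Term("middle"), Term("leaf"), Term("other")
middle.parents.add(root)
leaf.parents.add(middle)
other.parents.add(root)

print(sorted(t.name for t in leaf.ancestors))  # ['middle', 'root']
print("ancestors" in middle.__dict__)          # True: cached during recursion
print("ancestors" in other.__dict__)           # False: never requested
```

Only the nodes that are actually visited pay the cost of computing their ancestor sets, which is the property a subgraph-centric algorithm would rely on. A cached value is not refreshed if `parents` changes afterwards.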

One key feature of GOcats is the ability to easily customize category subgraphs of interest. To improve agreement and rectify potential differences in term granularity, we used GOcats to organize HPA’s raw data annotation along with the knowledgebase data into slightly more generic categories (Table 5).

Table 5. Generic location categories used to resolve potential scoping inconsistencies in HPA raw data.

HPA annotation category	GOcats-customized general HPA category
Actin cytoskeleton	Cytoskeleton
Centrosome	Cytoskeleton
Intermediate filament cytoskeleton	Cytoskeleton
Microtubule cytoskeleton	Cytoskeleton
Microtubule end	Cytoskeleton
Microtubule organizing center	Cytoskeleton
Aggresome	Aggresome
Cell junction	Cell junction
Cytoplasm	Cytoplasm
Endoplasmic reticulum	Endoplasmic reticulum
Focal adhesion	Focal adhesion
Golgi apparatus	Golgi apparatus
Intercellular bridge	Intercellular bridge
Intracellular membrane-bounded organelle	Intracellular membrane-bounded organelle
Mitochondrion	Mitochondrion
Nucleus	Nucleus
Nucleoplasm	Nucleus
Nuclear membrane	Nucleus
Nucleolus	Nucleolus
Plasma membrane	Plasma membrane

In doing so, GOcats can query over twice as many knowledgebase-derived gene annotations with complete agreement with the more-generic HPA annotations, while also increasing the number of genes in the categories of “partial” and “partial agreement is superset” agreement types and decreasing the number of genes in the “no agreement” category (Fig 5B).

We then compared the methods’ mapping of knowledgebase gene annotations derived from HPA to the HPA experimental dataset to demonstrate how researchers could use the GOcats suite to evaluate how well their own experimental data are represented in public knowledgebases. Because the sets of gene annotations used in the HPA experimental dataset and in the HPA-derived knowledgebase annotations are identical, no term mapping occurred during the agreement evaluation, and so the assignment agreement was identical between GOcats and M2S. As expected, the number of genes in complete agreement was high, although a surprising number of genes were in partial agreement and some even had no annotations in agreement (Fig 5). We next broke down which locations were involved in each agreement type and noted that “nucleus,” “nucleolus,” and “nucleoplasm” had the highest disagreement relative to their sizes, but disagreements were present across nearly all categories (Table 6).

Table 6. Summary of gene location category agreement between manually-curated HPA raw data and GOcats/Map2Slim categorized HPA-derived annotations.

	Agreement^*
Location	Complete	Partial	Superset^‡	None	Not in Knowledgebase
Actin cytoskeleton	51	0	7	0	37
Aggresome	2	0	0	3	4
Cell Junction	36	0	17	0	51
Centrosome	58	3	17	0	49
Cytoplasm	1037	55	162	5	643
Endoplasmic Reticulum	66	1	7	0	39
Focal adhesion	27	5	9	0	17
Golgi Apparatus	159	5	43	0	137
Intercellular bridge	14	0	4	0	19
Intermediate filament cytoskeleton	18	1	4	0	23
Intracellular membrane-bounded organelle	283	6	50	1	212
Microtubule cytoskeleton	35	2	9	0	27
Microtubule end	2	0	0	0	0
Microtubule organizing center	32	0	5	0	14
Mitochondrion	263	4	55	0	154
Nuclear membrane	47	6	17	0	39
Nucleolus	266	10	69	6	163
Nucleoplasm	989	26	230	23	534
Nucleus	437	14	217	23	373
Plasma Membrane	265	12	55	0	225

^‡Knowledgebase genes mapped to a set of categories that is a superset of those manually assigned by the HPA in raw data

^*Numbers reflect how many times a location was involved in a particular agreement type; sums of all locations for an agreement category do not indicate the total number of genes for an agreement type.

Both M2S and GOcats avoid superset category term mapping; neither maps a category-representative GO term to another category-representative GO term if one supersedes the other (although GOcats has the option to enable this functionality). Therefore, discrepancies in annotation should not arise from the term mapping methods. Nevertheless, we hypothesized that some granularity-level discrepancies exist between the HPA experimental raw data and the HPA-assigned gene annotations in the knowledgebase. We performed the same custom category generic mapping as we did for the previous test and discovered that some disagreements were indeed accounted for by granularity-level discrepancies, as seen in the decrease in “partial” and “no agreement” categories and increase in “complete” agreement category following generic mapping (Fig 6, blue bars). For example, 26S proteasome non-ATPase regulatory subunit 3 (PSMD3) was annotated to the nucleus (GO:0005634) and cytoplasm (GO:0005737) in the experimental data but was annotated to the nucleoplasm (GO:0005654) and cytoplasm in the knowledgebase. By matching the common ancestor mapping term “nucleus”, GOcats can group the two annotations in the same category. In total, 132 terms were a result of semantic scoping discrepancies. It is worth noting that categories could be grouped into common categories to further improve agreement, for example “nucleolus” within “nucleus.”

Fig 6 — Number of genes in the given agreement type when comparing gene product annotations assigned by HPA in the EMBL-EBI knowledgebase to those in The Human Protein Atlas’ raw experimental data. “Complete agreement” refers to genes where all subcellular locations derived from the knowledgebase and the HPA dataset matched, “partial agreement” refers to genes with at least one matching subcellular location, “partial agreement is superset” refers to genes where knowledgebase subcellular locations are a superset of the HPA dataset (these are mutually exclusive to the “partial agreement” category), "no agreement" refers to genes with no subcellular locations in common, and “no annotations” refers to genes in the experimental dataset that were not found in the knowledgebase. The more-generic categories used in panel B can be found in Table 3.

Interestingly, among the remaining disagreeing assignments were some with fundamentally different annotations. Many of these are cases in which either the experimental data or the knowledgebase data have one or more additional locations distinct from the other. For example, NADH dehydrogenase [ubiquinone] 1 beta subcomplex subunit 6 (NDUB6) was localized only to the mitochondria (GO:0005739) in the experimental data yet has annotations to the mitochondria and the nucleoplasm (GO:0005654) in the knowledgebase. Why such discrepancies exist between experimental data and the knowledgebase is not clear.

We were also surprised by the high number of genes with “supportive” annotations in the HPA raw data that were not found in the EMBL-EBI knowledgebase when filtered to those annotated by HPA. As Fig 6 shows, roughly one-third of the annotations from the raw data were missing altogether from the knowledgebase; the gene was not present in the knowledgebase whatsoever. This was surprising because “supportive” was the highest confidence score for subcellular localization annotation.

Discussion

Discrepancies in the semantic granularity of gene annotations in knowledgebases represent a significant hurdle to overcome for researchers interested in mining genes based on a set of annotations used in experimental data. To demonstrate the potential GOcats has in resolving these discrepancies, we categorized annotations from HPA-sourced gene annotations using GOcats, M2S, and the UniProt subcellular localization CV. The HPA source was chosen because primary data from high-throughput immunofluorescence-based gene product localization experiments exist in publicly-accessible repositories and have been inspected by experts and given a confidence score [10]. As we show, utilizing only the set of specific annotations used in the HPA’s experimental data, M2S’s mapping matches only 366 identical sets of gene annotations from the knowledgebase with GOcats matching slightly more (Fig 5A). GOcats alleviates this problem by allowing researchers to define categories at a custom level of granularity so that categories may be specific enough to retain biological significance, but generic enough to encapsulate a larger set of knowledgebase-derived annotations. When we reevaluated the agreement between the raw data and knowledgebase annotations using custom GOcats categories for “cytoskeleton” and “nucleus”, the number of identical gene annotations increased to 776 (Fig 5B).

Because GOcats relies on user-input keywords to define categories, we understand that there is a risk of adding user bias when applying this method to organizing results of various analyses. While we have taken care to avoid bias in the comparisons made in this report, for example citing the exact category defining GO term for each category compared between methods (Fig 3, Tables 2 and 3) and reporting the exact common-sense categorizations applied when grouping location categories from HPA (Table 5), we strongly caution users to exercise similar care in their use as well. For instance, when categorizing results from annotation enrichment analyses it may be tempting to filter results to those categories defined by the user, which might conveniently eliminate unexpected (unwanted) highly-enriched terms. We do not condone the use of GOcats in this way. But because GOcats will always produce the same subgraph categorizations for the same set of keywords used with the same version of GO, we argue that our categorization is more reproducible and less prone to bias than manually grouping GO terms into categories or otherwise manually identifying major concepts represented from omics-level analyses. Furthermore, the set of keywords can be provided along with the version of GOcats, GO, and the dataset to enable reproducibility of analyses by others.

As GO continues to grow, automated methods to evaluate the structural organization of data will become necessary for curation and quality control. Because GOcats allows versatile interpretation of the GO directed acyclic graph (DAG) structure, it has many potential curation and quality control uses, especially for evaluating the high-level ontological organization of GO terms. For example, GOcats can facilitate the integrity checking of annotations that are added to public repositories by streamlining the process of extracting categories of annotations from knowledgebases and comparing them to the original annotations in the raw data. Interestingly, about one-third of the genes annotated with high-confidence in the HPA raw data were missing altogether from the EMBL-EBI knowledgebase when filtered to the HPA-sourced annotations. While this surprised us, the reason appears to be due to HPA’s use of two separate criteria for “supportive” annotation reliability scores and for knowledge-based annotations. For “supportive” reliability, one of several conditions must be met: i) two independent antibodies yielding similar or partly similar staining patterns, ii) two independent antibodies yielding dissimilar staining patterns, both supported by experimental gene/protein characterization data, iii) one antibody yielding a staining pattern supported by experimental gene/protein characterization data, iv) one antibody yielding a staining pattern with no available experimental gene/protein characterization data, but supported by other assay within the HPA, and v) one or more independent antibodies yielding staining patterns not consistent with experimental gene/protein characterization data, but supported by siRNA assay [10]. 
Meanwhile knowledge-based annotations are dependent on the number of cell lines annotated; specifically, the documentation states, “Knowledge-based annotation of subcellular location aims to provide an interpretation of the subcellular localization of a specific protein in at least three human cell lines. The conflation of immunofluorescence data from two or more antibody sources directed towards the same protein and a review of available protein/gene characterization data, allows for a knowledge-based interpretation of the subcellular location” (Uhlen et al., 2015). Unfortunately, we were unable to explore these differences further, since the experimental data-based subcellular localization annotations appeared aggregated across multiple cell lines, without specifying which cell lines were positive for each location. Meanwhile, tissue- and cell-line specific data, which contained expression level information, did not also contain subcellular localizations. Therefore, we would suggest that HPA and other major experimental data repositories always provide a specific annotation reliability category in their distilled experimental datasets that matches the criteria used for deposition of derived annotations in the knowledgebases. Such information will be invaluable for performing knowledgebase-level evaluation of large curated sets of annotations. One step better would involve providing a complete experimental and support data audit trail for each derived annotation curated for a knowledgebase, but this may be prohibitively difficult and time-consuming to do.

Looking towards the future, the work demonstrated here is a critical first step towards a goal of automatically enumerating all representable concepts within GO. Such an enumeration would provide scientists with the usable set of GO-representable concept subgraphs for a large variety of analyses unbiased by human selection. GOcats can derive subgraphs representing a specific concept by utilizing keywords and key terms, which would be a major component of an overall method to enumerate all representable concepts. We expect that two other major components will be required: the first is a way to derive possible keywords and key terms, and the second is a way to evaluate the quality of the concept subgraphs that are generated. We expect the latter evaluation to involve the development of various graph-based metrics for this purpose.

Conclusions

In this study, we: i) demonstrated an improvement in retrievable ontological information content by the reevaluation of GO’s has_part relation, ii) applied our new method GOcats toward the categorization and utilization of the GO Cellular Component sub-ontology, and iii) evaluated the ability of GOcats and other mapping tools to relate HPA experimental annotations to HPA knowledgebase GO Cellular Component annotations. GOcats outperforms the UniProt CV with respect to accurately deriving gene-product subcellular location from the UniProt and Ensembl database with the HPA raw dataset of gene localization annotations treated as the gold standard (Fig 5A). Moreover, the comparison of GOcats to M2S demonstrates similar mapping performance between the two methods, but with GOcats providing important improvements in mapping, computational speed, ease of use, and flexibility of use. In a previous publication, we demonstrated an improvement in the statistical power of gene-annotation enrichment analyses using GOcats along with all GO sub-ontologies [23].

In conclusion, GOcats enables the user to create custom, GO slim-like filters to map fine-grained gene annotations from GAFs to general subcellular compartments without needing to hand-select a set of GO terms for categorization. Moreover, users can use GOcats to quickly customize the level of semantic specificity for annotation categories. Furthermore, GOcats was designed for scientists who are less familiar with GO; however, the package has advanced features for users with more bioinformatics expertise. GOcats enables a safe and more comprehensive semantic scoping utilization of go-core, preventing mistakes that can easily arise from using go-core instead of go-basic. Together, these improvements can impact a variety of GO knowledgebase data mining use-cases as well as knowledgebase curation and quality control. Looking towards the future, GOcats provides a critical categorization method for a future automatic enumeration of all representable concepts within GO.

Methods

Methodological overview and design rationale

We designed GOcats with a biologist user in mind, who may not be aware of the dangers associated with using different versions of GO for organizing terms with tools like M2S or how to circumvent potential pitfalls. For instance, although the M2S documentation (https://github.com/owlcollab/owltools/wiki/Map2Slim) states, "We recommend the go-basic version of the ontology be used, which contains: subClassOf (is a), part of, regulates (+ positively and negatively regulates)" and, "You can also use the full version of GO and filter those relationships you do not want to consider,” a non-bioinformatician may not be aware of how to filter out relationships from GO in a way that is safe to use the tool—or, more pertinently—the user may wish to use a fuller extent of the information contained in the ontology when organizing their terms. Currently, GOcats version 1.1.4 can handle go-core’s is_a, part_of, and has_part relations, with the has_part reinterpreted to retain proper scoping semantics, as detailed below and elsewhere [23]. As the development of GOcats progresses, we plan on handling the organization of terms connected by additional relations such as negatively/positively_regulates.

GOcats uses the go-core version of the GO database, which contains relations that connect the separate ontologies and may point away from the root of the ontology. GOcats can either exclude non-scoping relations or invert has_part directionality into a part_of_some interpretation, maintaining the acyclicity of the graph. Therefore, it can represent go-core as a DAG.

GOcats is a Python package written in major version 3 of the Python programming language [26] and available on GitHub and the Python Package Index. It uses a Visitor design pattern implementation [27] to parse the go-core Ontology database file [4]. Searching with user-specified sets of keywords for each category, GOcats extracts subgraphs of the GO DAG and identifies, for each category in question, a representative node whose child nodes are detailed features of the component. Fig 7 illustrates this approach, and details follow in pseudocode.

To overcome issues regarding scoping ambiguity among mereological relations, we assigned properties indicating which term was broader in scope and which term was narrower in scope to each edge object created from each of the scope-relevant relations in GO. For example, in the node pair connected by a part_of or is_a edge (e.g. node 1 is_a node 2), node 1 is narrower in scope than node 2. Conversely, node 1 is broader in scope than node 2 when connected by a has_part edge (e.g. node 1 has_part node 2). This edge is therefore reinterpreted by GOcats as part_of_some. This reinterpretation is not meant to imply exclusivity in composition between the meronym and the holonym. It simply stands as a distinction between “part of all” which is what the current “part_of” relationship implies, and “part of some,” or to be more verbose “instance a is part of instance b in at least one known biological example.” We have described additional explanations and rationale for this re-interpretation elsewhere and demonstrate improvement in annotation enrichment analyses across GO Cellular Component, Molecular Function, and Biological Process sub-ontologies, when this re-interpretation is used [23].
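
The scoping assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GOcats source: the `SCOPE_RULES` table and the `scoped_pair` helper are hypothetical names introduced here, and the node labels are placeholders rather than real GO terms.

```python
# Hypothetical sketch: for an edge "subject relation object", record which
# end of the edge is narrower in scope for each scoping relation.
SCOPE_RULES = {
    "is_a": "subject",      # node 1 is_a node 2     -> node 1 is narrower
    "part_of": "subject",   # node 1 part_of node 2  -> node 1 is narrower
    "has_part": "object",   # node 1 has_part node 2 -> node 2 is narrower
}


def scoped_pair(subject, relation, obj):
    """Return (narrower, broader, effective_relation) for a scoping edge."""
    if relation not in SCOPE_RULES:
        raise ValueError(f"{relation!r} is not treated as a scoping relation")
    if SCOPE_RULES[relation] == "subject":
        return subject, obj, relation
    # has_part is read in the opposite direction, as part_of_some
    return obj, subject, "part_of_some"


print(scoped_pair("node_1", "is_a", "node_2"))
print(scoped_pair("node_1", "has_part", "node_2"))
```

With a rule table of this kind, every retained edge can be traversed from the narrower term to the broader term regardless of how the relation is written in the ontology file.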

While the default scoping relations in GOcats are is_a, part_of, and has_part, the user has the option to define the scoping relation set. For instance, one can create go-basic-like subgraphs from a go-core version ontology by limiting to only those relations contained in go-basic. For convenience, we have added a command line option, “go-basic-scoping,” which allows only nodes with is_a and part_of relations to be extracted from the graph. Detailed API documentation and user-friendly tutorials are available online (https://gocats.readthedocs.io/en/latest/).

For mapping purposes, Python dictionaries are created which map GO terms to their corresponding category or categories. For inter-subgraph analysis, another Python dictionary is created which maps each category to a list of all its graph members. By default, fine-grained terms do not map to category root-nodes that define a subgraph that is a superset of a category with a root-node nearer to the term. For example, a member of the “nucleolus” subgraph would map only to “nucleolus,” and not to both “nucleolus” and “nucleus”. However, the user also has the option to override this functionality if desired with a simple “--map-supersets” command line option. Furthermore, we have included the option for users to directly input GO terms as category representatives, should they not wish to use keywords to define subgraph categories. This is helpful for users who have already compiled lists of GO terms by hand for use with other tools.
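
The default behavior for superset categories can be sketched as follows. This is a simplified illustration, not the GOcats implementation: the function name `build_term_mapping`, its `map_supersets` parameter, and the term identifiers `T1` to `T3` are hypothetical, and "superset category" is approximated here by a strict-superset test on the member sets.

```python
def build_term_mapping(category_members, map_supersets=False):
    """Map each term to the categories whose subgraph contains it.

    category_members: dict of category name -> set of member term IDs.
    Unless map_supersets is True, a category is dropped for a term when
    another category containing that term has a strictly smaller member set
    nested inside it (assumption: this stands in for "nearer root-node").
    """
    mapping = {}
    for category, members in category_members.items():
        for term in members:
            mapping.setdefault(term, set()).add(category)
    if map_supersets:
        return mapping
    pruned = {}
    for term, categories in mapping.items():
        pruned[term] = {
            c for c in categories
            if not any(category_members[o] < category_members[c]
                       for o in categories if o != c)
        }
    return pruned


# Toy data: the "nucleolus" subgraph is nested inside the "nucleus" subgraph.
members = {"nucleus": {"T1", "T2", "T3"}, "nucleolus": {"T2", "T3"}}
default_map = build_term_mapping(members)
superset_map = build_term_mapping(members, map_supersets=True)
print(default_map["T3"], superset_map["T3"])
```

In this toy case `T3` maps only to "nucleolus" by default and to both categories when superset mapping is enabled, mirroring the example in the text.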

Implementation overview

As illustrated in the UML diagram in Fig 8A, the GOcats package is implemented using several modules that have clear dependencies starting from a command line interface (CLI) in gocats.py, which depends on most of the other modules including ontologyparser.py, godag.py, subdag.py and tools.py. GOcats uses 10 classes implemented across ontologyparser.py, godag.py, subdag.py, and dag.py to extract and internally represent the GO database. GoParser, which inherits from the base OboParser class (Fig 8B), utilizes a visitor design pattern and regular expressions to parse the flat GO database obo file and instantiate the objects necessary to represent the GO DAG structure. These instantiated objects include (Fig 8C): 1) the GoGraph container object for the parts of the graph, which inherits from a more generic OboGraph containing functions for adding, removing, and modifying nodes and edges; 2) GoGraphNode objects for representing each term parsed from the ontology, which inherit from AbstractNode; 3) AbstractEdge objects for representing each instance of a relation parsed from the ontology; and 4) DirectionalRelationship objects, which inherit from the more generic AbstractRelationship object for representing each type of directional relation encountered in the ontology (for GO, all relations are directional, and this distinction is made only in anticipation of future extensions to handle other ontologies).

AbstractEdge objects and AbstractNode objects contain references to one another, which simplifies the process of iterating through ancestor and descendant nodes and allows for functions such as AbstractEdge.connect_nodes, which requires that the edge object update the node object’s child_node_set and parent_node_set. In this context, AbstractNode is a true abstract base class, while AbstractEdge started out as an abstract base class but eventually became a concrete class during development. However, we see the possibility of AbstractEdge becoming a base class in the future.

Ancestors and descendants of a node are implemented as sets, which are lazily created through the use of a Python property decorator (i.e. Python’s preferred “getter” syntax). At the first access of these sets through the ancestor or descendent property, the set is calculated with a recursive algorithm, stored for future use, and returned for immediate access. Subsequent accesses simply return the stored set. If the set of edges within a node change, the ancestor and descendent node sets will be recalculated on their next access. This implementation prevents pre-calculation of these sets when they are not used, while enabling their reuse within efficient graph analysis methods.
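
A minimal sketch of this lazy, cached property pattern is shown below. It is not the GOcats class itself: the `Node` class, its `add_parent` method, and the `_ancestors` attribute are hypothetical names, and the cache reset is deliberately simplified.

```python
class Node:
    """Toy node whose ancestor set is computed on first access and cached."""

    def __init__(self, name):
        self.name = name
        self.parent_node_set = set()
        self._ancestors = None  # None means "not computed yet"

    def add_parent(self, parent):
        self.parent_node_set.add(parent)
        # Edges changed: drop this node's cached set so it is recomputed on
        # the next access. (Simplification: caches of nodes below this one
        # are not reset here.)
        self._ancestors = None

    @property
    def ancestors(self):
        if self._ancestors is None:
            found = set()
            for parent in self.parent_node_set:
                found.add(parent)
                found |= parent.ancestors  # recursion reuses parents' caches
            self._ancestors = found
        return self._ancestors


root, mid, leaf = Node("root"), Node("mid"), Node("leaf")
mid.add_parent(root)
leaf.add_parent(mid)
print(sorted(node.name for node in leaf.ancestors))
```

Nothing is computed until `leaf.ancestors` is first read; later reads return the stored set, which is what makes repeated subgraph queries cheap.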

AbstractEdge also contains a reference to a DirectionalRelationship object, which is critical for graph traversal. This is because DirectionalRelationship contains the true directionality of the mereological correspondence between the categorization relevant relations (is_a, part_of, and has_part). In other words, it is within this class that we define in which direction the edge should be traversed when categorizing terms. Currently these rules are hard-coded within GoParser’s relationship_mapping dictionary.

The gocats.py module (Fig 8A) implements the command line interface and is responsible for handling the command line arguments, using the provided keywords and specified arguments like namespace filters (e.g. Cellular Component, Molecular Function, and Biological Process) to instantiate a GoParser object, a GoGraph object and a SubGraph object for each set of provided keywords. After creation of the GoGraph internal representation, each category subgraph is created by first instantiating the SubGraph object and calling the from_filtered_graph function, which filters to those nodes from the GoGraph containing the keywords in their names and definition. Note that the SubGraph object and GoGraph object both inherit from OboGraph, and that the SubGraph object contains a reference to GoGraph object (supergraph data member) of which it is a subgraph. This design was implemented to avoid accidental alterations of the GoGraph object when altering the contents of the subgraph, and to allow for specialization of functions within SubGraph without needing to use unique names e.g. add_node(). GoGraphNode objects within the subgraph are wrapped by SubGraphNode objects, which are directly used by the SubGraph object, but retain all original properties such as name, definition, and sets of edge object references, otherwise insidious changes could occur to the GoGraph object when updating the SubGraph object. The SubGraph object also contains a CategoryNode object, which wraps the category representative GoGraphNode object(s) for the subgraph category.

Specific implementation details

User-provided keyword sets are used by GOcats to query GO terms’ name and definition fields to create an initial seeding of the subgraph with terms that contain at least one keyword. This seeding is a list of nodes from the whole go-core graph (supergraph) that pass the query. Node synonyms were not used, due to there being four types of synonyms in GO: exact, narrow, broad, and related. Also, many nodes within GO do not have synonyms, which may create an unequal utilization of nodes if synonyms were queried. However, in the future, synonym utilization for seeding purposes may be revisited.

FOR node in supergraph.nodes
    IF keyword from keyword_list in node.name or node.definition
        subgraph.seeding_list.append(node)
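
The seeding step can be written as runnable Python along the following lines. This is a sketch under assumptions: nodes are represented as plain dictionaries, `seed_nodes` is a hypothetical helper, the definitions are toy text rather than the real GO definitions, and whole-word, case-insensitive matching is an assumption about how keywords are compared.

```python
import re


def seed_nodes(nodes, keywords):
    """Return the nodes whose name or definition contains any keyword."""
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    seeds = []
    for node in nodes:
        text = f"{node['name']} {node['definition']}"
        if any(pattern.search(text) for pattern in patterns):
            seeds.append(node)
    return seeds


# Toy records: real GO identifiers and names, placeholder definitions.
toy_nodes = [
    {"id": "GO:0005634", "name": "nucleus", "definition": "toy definition"},
    {"id": "GO:0005654", "name": "nucleoplasm",
     "definition": "toy definition mentioning the nucleus"},
    {"id": "GO:0005739", "name": "mitochondrion", "definition": "toy definition"},
]
seeds = seed_nodes(toy_nodes, ["nucleus", "nuclear"])
print([node["id"] for node in seeds])
```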

Using the graph structure of GO, edges between these seed nodes are faithfully recreated except where edges link to a node that does not exist in the set of newly seeded GO terms. During this process, edges of appropriate scoping relations are used to create children and parent node sets for each node.

FOR edge in supergraph.edges
    IF edge.parent_node in subgraph.nodes AND edge.child_node in subgraph.nodes AND edge.relation is TYPE: SCOPING
        subgraph.edges.append(edge)
    ELSE
        PASS

FOR subnode in subgraph.nodes
    subnode.child_node_set = {child_node for child_node in supergraph.id_index[subnode.id].child_node_set if child_node.id in subgraph.id_index}
    subnode.parent_node_set = {parent_node for parent_node in supergraph.id_index[subnode.id].parent_node_set if parent_node.id in subgraph.id_index}

GOcats then selects a category representative node to represent the subgraph. To do this, a list of candidate representative nodes is compiled from non-leaf nodes in the subgraph (i.e. potential root-nodes) which have at least one keyword in the term name. A single category representative root-node is selected by recursively counting the number of descendants each candidate term has (i.e. creating node.descendants) and choosing the term with the most descendants.

FOR subnode in subgraph.nodes
    IF subnode.child_node_set != None AND ANY keyword in subnode.name
        candidate_list.append(subnode)
    ELSE
        PASS

representative_node = node in candidate_list with MAX(LEN(node.descendants))
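
A runnable sketch of this selection step follows. It uses plain dictionaries in place of GOcats node objects; `descendants`, `pick_representative`, and the single-letter node labels are hypothetical, and simple substring matching on the name is an assumption.

```python
def descendants(node_id, children):
    """All nodes reachable from node_id through the child map."""
    seen, stack = set(), list(children.get(node_id, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(children.get(current, ()))
    return seen


def pick_representative(children, names, keywords):
    """Choose the keyword-named non-leaf node with the most descendants."""
    candidates = [
        node_id for node_id in names
        if children.get(node_id)
        and any(keyword in names[node_id] for keyword in keywords)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda n: len(descendants(n, children)))


# Toy subgraph: a -> {b, c}, b -> {d}
toy_children = {"a": {"b", "c"}, "b": {"d"}}
toy_names = {"a": "nucleus", "b": "nucleus part", "c": "nucleus lumen", "d": "other"}
representative = pick_representative(toy_children, toy_names, ["nucleus"])
print(representative)
```

Here `a` and `b` are both candidates, and `a` is chosen because it has three descendants against one for `b`.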

Because highly-specific or uncommon features included in GO may not contain a keyword in their names or definitions but may still be part of the subgraph in question by the GO graph structure, GOcats re-traces the supergraph to find various node paths that reach the representative node. We have implemented two methods for this subgraph extension: i) comprehensive (greedy) extension, whereby all supergraph descendants of the representative node are added to the subgraph and ii) conservative extension, whereby the supergraph is checked for intermediate nodes between subgraph leaf nodes and the subgraph representative node that may not have seeded in the initial step.

Comprehensive (Greedy) extension:

FOR node in supergraph.nodes
    IF ANY (ancestor_node in node.ancestors) in subgraph
        subgraph.nodes.append(node)

UPDATE subgraph # appropriate edges added and parent/child nodes assigned

Conservative extension:

FOR leaf_node in subgraph.leaf_nodes # nodes with no children
    start_node = leaf_node
    end_node = representative_node
    FOR node in supergraph.start_node.ancestors ∩ supergraph.end_node.descendants
        subgraph.nodes.append(node)

UPDATE subgraph # appropriate edges added and parent/child nodes assigned
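
The conservative extension can be sketched with set intersections as below. This is an illustration rather than the GOcats code: the graph is held in two hypothetical dictionaries (`parents` and `children`), and the helper names and node labels are invented for the example.

```python
def reachable(start, neighbours):
    """All nodes reachable from start by following the neighbour map."""
    seen, stack = set(), list(neighbours.get(start, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(neighbours.get(current, ()))
    return seen


def conservative_extension(subgraph_nodes, representative, parents, children):
    """Add supergraph nodes lying between a subgraph leaf and the representative."""
    below_representative = reachable(representative, children)
    # A subgraph leaf has no children inside the current subgraph.
    leaves = [n for n in subgraph_nodes
              if not (set(children.get(n, ())) & subgraph_nodes)]
    extended = set(subgraph_nodes)
    for leaf in leaves:
        extended |= reachable(leaf, parents) & below_representative
    return extended


# Toy supergraph: top -> rep -> mid -> leaf, and top -> other
toy_parents = {"leaf": {"mid"}, "mid": {"rep"}, "rep": {"top"}, "other": {"top"}}
toy_children = {"top": {"rep", "other"}, "rep": {"mid"}, "mid": {"leaf"}}
extended = conservative_extension({"rep", "leaf"}, "rep", toy_parents, toy_children)
print(sorted(extended))
```

Only `mid`, which was not seeded by the keyword search but sits between `leaf` and `rep`, is added; `top` and `other` stay outside the subgraph.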

The subgraph is finally constrained to the descendants of the representative node in the subgraph. This excludes unrelated terms that were seeded by the keyword search due to serendipitous keyword matching.

Creating category mappings from UniProt’s subcellular location controlled vocabulary

We created mappings from fine-grained to general locations in UniProt’s subcellular location CV [2] for comparison to GOcats. To accomplish this, we parsed and recreated the graph structure of UniProt’s subcellular locations CV file [13] in a manner similar to the parsing of GO (Fig 2). Briefly, the flat-file representation of the CV file is parsed line-by-line and each term is stored in a dictionary along with information about its graph neighbors as well as its cross-referenced GO identifier. We assumed that terms without parent nodes in this graph are category-defining root-nodes and created a dictionary where a root-node key links to a list of all recursive children of that node in the graph. Only those terms with cross-referenced GO identifiers were included in the final mapping. The category subgraphs created from UniProt were compared to those with corresponding category root-nodes made by GOcats. An inclusion index, I, was calculated by considering the two subgraphs’ members as sets and applying the following equation:

I = \frac{| S_{n} \cap S_{g} |}{| S_{n} |}

(1)

where S_n and S_g are the set of members within the non-GOcats-derived category and GOcats-derived category, respectively. It is worth noting here that the size of the UniProt set was always smaller than the GOcats set. This is due to the inherent size differences between UniProt’s CV and the Cellular Component sub-ontology.
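
As a small worked example of Eq 1, the inclusion index can be computed directly from Python sets. The function name and the toy member sets are hypothetical and are not taken from the study data.

```python
def inclusion_index(s_n, s_g):
    """Fraction of the non-GOcats set s_n that is also in the GOcats set s_g."""
    if not s_n:
        raise ValueError("the non-GOcats set must not be empty")
    return len(s_n & s_g) / len(s_n)


# Toy sets: three of the four non-GOcats members are also in the GOcats set.
s_n = {"a", "b", "c", "d"}
s_g = {"a", "b", "c", "x", "y"}
print(inclusion_index(s_n, s_g))  # 3 / 4 = 0.75
```

An index of 1 means every term in the non-GOcats category is also in the corresponding GOcats category.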

Creating category mappings from Map2Slim

The Java implementation of OWLTools’ M2S does not include the ability to output a mapping file between fine-grained GO terms and their GO slim mapping target from the GAF that is mapped. To compare subgraph contents of GOcats categories to a comparable M2S “category,” we created a special custom GAF containing one line for each GO term in Cellular Component (data-version: releases/2016-01-12), in which both the gene ID column and the GO term annotation column were replaced by that line’s GO term. We then allowed M2S to map this GAF with a provided GO slim. The resulting mapped GAF was parsed to create a standalone mapping between the terms from the GO slim and a set of the terms in their subgraphs.

Mapping gene annotations to user-defined categories

To allow users to easily map gene annotations from fine-grained annotations to specified categories, we added functionality for accepting GAFs as input, mapping annotations within the GAF, and outputting a mapped GAF into a user-specified results directory. The input-output schemes used by GOcats and M2S are similar, with the exception that GOcats accepts the mapping dictionary created from category keywords, as described previously, instead of a GO slim. GAFs are parsed as tab-separated-value files. When a row contains a GO annotation in the mapping dictionary, the row is rewritten to replace the original fine-grained GO term with the corresponding category-defining GO term. If the gene annotation is not in the mapping dictionary, the row is not copied to the mapped GAF and is instead added to a separate file containing a list of unmapped genes for review. The mapped GAF and list of unmapped genes are then saved to the user-specified results directory.
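
The row-rewriting logic can be sketched as follows. This is not the GOcats implementation: `map_gaf_lines` is a hypothetical helper, the rows are shortened to their first five columns, the accessions and the identifier `GO:9999999` are placeholders, and emitting one output row per category is an assumption. The column positions follow the GAF format, in which the GO ID is the fifth tab-separated column and lines beginning with "!" are comments.

```python
GO_ID_COLUMN = 4      # GAF column 5 (GO ID), as a 0-based index
OBJECT_ID_COLUMN = 1  # GAF column 2 (DB Object ID)


def map_gaf_lines(lines, mapping):
    """Return (mapped_rows, unmapped_ids) for an iterable of GAF lines.

    mapping: dict of fine-grained GO ID -> list of category-defining GO IDs.
    """
    mapped_rows, unmapped_ids = [], []
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comment/header lines
        row = line.rstrip("\n").split("\t")
        categories = mapping.get(row[GO_ID_COLUMN], [])
        if not categories:
            unmapped_ids.append(row[OBJECT_ID_COLUMN])
            continue
        for category in categories:
            new_row = list(row)
            new_row[GO_ID_COLUMN] = category
            mapped_rows.append("\t".join(new_row))
    return mapped_rows, unmapped_ids


toy_lines = [
    "!gaf-version: 2.1",
    "UniProtKB\tTOY0001\tGENE1\t\tGO:0005654",
    "UniProtKB\tTOY0002\tGENE2\t\tGO:9999999",
]
toy_mapping = {"GO:0005654": ["GO:0005634"]}  # nucleoplasm -> nucleus
mapped_rows, unmapped_ids = map_gaf_lines(toy_lines, toy_mapping)
print(mapped_rows, unmapped_ids)
```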

Visualizing and characterizing intersections of category subgraphs

To compare the contents of category subgraphs made by GOcats, UniProt CV, and M2S, we took the set of subgraph terms for each category in each method, converted them into a Pandas DataFrame [28] representation, and plotted the intersections using the UpSetR R package [25]. Inclusion indices were also computed for M2S categories using Eq 1. Jaccard indices were computed for every subgraph pair to evaluate the similarity between subgraphs of the same concept, created by different methods.
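As an illustration of the pairwise similarity calculation, the sketch below computes the Jaccard index (size of the intersection divided by size of the union) for every pair of methods. The function names and the dictionary layout, method name mapped to a set of GO IDs for one category, are assumptions made for this example.

```python
from itertools import combinations


def jaccard_index(a, b):
    """Jaccard index of two collections: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets have no members to compare; report 0.0 by convention here.
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(subgraphs):
    """Jaccard index for every pair of methods.

    subgraphs: dict of method name -> set of GO IDs in that method's
        subgraph for a single category (hypothetical layout).
    """
    return {
        (m1, m2): jaccard_index(subgraphs[m1], subgraphs[m2])
        for m1, m2 in combinations(sorted(subgraphs), 2)
    }
```

For example, two subgraphs that share two terms out of four distinct terms overall have a Jaccard index of 0.5.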

Assigning generalized subcellular locations to genes from the knowledgebase and comparing assignments to experimentally-determined locations

We first mapped two GAFs downloaded from the EMBL-EBI QuickGO resource [12] using GOcats, the UniProt CV, and M2S. We filtered the gene annotations by dataset source and evidence type, resulting in separate GAFs containing annotations from the following sources: UniProt-Ensembl and HPA. Both GAFs had the evidence type Inferred from Electronic Annotation (IEA) filtered out, because it is generally considered to be the least reliable evidence type for gene annotation and in the interest of minimizing memory usage. We used these data to assess the performance of the mapping methods in their ability to assign genes to subcellular locations based on annotations from knowledgebases by comparing these assignments to those made experimentally in HPA’s localization dataset (Fig 3A). Comparison results for each gene were aggregated into five types: i) “complete agreement” for genes where all subcellular locations derived from the knowledgebase and the HPA dataset matched, ii) “partial agreement” for genes with at least one matching subcellular location, iii) “partial superset” for genes where knowledgebase subcellular locations are a superset of the HPA dataset, iv) “no agreement” for genes with no subcellular locations in common, and v) “no annotations” for genes in the experimental dataset that were not found in the knowledgebase.
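The per-gene comparison can be sketched as a small classification function. This is an illustrative reading of the categories listed above, not the authors' code: the function name is hypothetical, a gene missing from the knowledgebase is represented by None, and the order in which the checks are applied (a superset being reported in preference to plain partial agreement) is an assumption.

```python
def classify_agreement(kb_locations, hpa_locations):
    """Assign one comparison type to a gene.

    kb_locations: set of category locations from the knowledgebase,
        or None if the gene was not found there (assumed convention).
    hpa_locations: set of locations from the experimental dataset.
    """
    if kb_locations is None:
        return "no annotations"
    kb, hpa = set(kb_locations), set(hpa_locations)
    if not kb & hpa:
        return "no agreement"          # nothing in common
    if kb == hpa:
        return "complete agreement"    # identical location sets
    if kb > hpa:
        return "partial superset"      # knowledgebase strictly contains HPA
    return "partial agreement"         # some, but not all, locations match
```

Counting the returned labels over all genes in the experimental dataset gives the aggregated comparison for one mapping method.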

Only gene product localizations from the HPA dataset with a “supportive” confidence score were used for this analysis (n = 4795). We created a GO slim by looking up the corresponding GO term for each location in this dataset with the aid of QuickGO term basket and filtering tools. The resulting GO slim served as input for the creation of mapped GAFs using M2S. To create mapped GAFs using GOcats, we entered keywords related to each location in the HPA dataset (Table 4). We matched the identifier in the “gene name” column of the experimental data with the identifier in the “database object symbol” column in the GAF to compare gene annotations. Our assessment of comparing the HPA raw data to mapped gene annotations from the knowledgebase represents the ability to accurately query and mine genes and their annotations from the knowledgebase into categories of biological significance. Our assessment of comparing the methods’ mapping output to the HPA raw dataset represents the ability of these methods to evaluate the representation of HPA’s latest experimental data as it exists in public repositories.

Running time tests

To compare the runtimes of GOcats and M2S for categorizing HPA’s subcellular location dataset, each method was run separately on the same machine with the following configuration: Intel® Core™ i7-4930K CPU with 6 hyperthreaded cores clocked at 3.40 GHz and 64 GB of RAM clocked at 1866 MHz. We used the Linux “time” command with no additional options and reported the real time from its output. The datasets and scripts used can be found in our Figshare repository (See Availability of Data and Material). We used the dataset contained in our ScriptsDirectory/KBData/11-02-2016/hpa-no_IEA.goa for these comparisons. For M2S we executed a custom script that can be found within ScriptsDirectory/runscripts:

sh owlmultitest.sh

which ran the following command, found in the same subdirectory, 50 times:

time sh owltoolsspeedtest.sh

For GOcats, we executed a custom script that can be found within ScriptsDirectory/runscripts:

sh gcmultitest.sh

which ran the following command, found in the same subdirectory, 50 times:

time sh GOcatsspeedtest.sh

Both tests were executed using the same version of the go-core used across all other analyses performed in this work, which is data version: releases/2016-01-12.
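For readers who prefer to repeat such a timing test from Python rather than with shell scripts, a rough equivalent is sketched below. It is not the contents of owlmultitest.sh or gcmultitest.sh, which are available only in the Figshare repository; the function name is hypothetical, and wall-clock time measured with time.perf_counter is used as a stand-in for the real time reported by the Linux “time” command.

```python
import subprocess
import time


def time_command(command, repeats=50):
    """Run a command repeatedly and return the wall-clock duration of
    each run in seconds (hypothetical helper)."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so a failed run is
        # never silently counted as a fast one.
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations
```

A call such as time_command(["sh", "GOcatsspeedtest.sh"]), run from the directory containing that script, would then yield 50 durations that can be summarized with the statistics module.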

Supporting information

S1 Data. Visualizing the degree of overlap between the category subgraphs created by GOcats, Map2Slim, and the UniProt CV (additional categories).

(DOCX)

S2 Data. List of GO terms mapped by Map2Slim to the term plasma membrane that were not mapped to this location by GOcats.

(DOCX)

S1 File

(DOCX)

Acknowledgments

We thank Dr. Robert M. Flight for his advice and expertise regarding the statistics reported in this project, for the generation of the plots in Fig 3, and for his feedback during the drafting of the manuscript. We thank Dr. Thilakam Murali for extensive feedback on the general scientific readability of the manuscript.

Data Availability

GOcats is an open-source Python software package under a BSD-3 License, available on GitHub at https://github.com/MoseleyBioinformaticsLab/GOcats and on the Python Package Index (PyPI) at https://pypi.python.org/pypi/GOcats. Documentation can be found at http://gocats.readthedocs.io/en/latest/. The exact version of GOcats used in this study, along with all scripts used to generate results, can be found in the Figshare repository at https://doi.org/10.6084/m9.figshare.7064516 and at https://doi.org/10.6084/m9.figshare.7064549. The version of GO used to generate these results is go-core (go.obo) data-version: releases/2016-01-12. The UniProt Controlled Vocabulary file can be found at https://www.uniprot.org/docs/subcell.txt. Associated GO terms are indicated by the GO identifier in each stanza. Map2Slim is available on GitHub (https://github.com/owlcollab/owltools/wiki/Map2Slim) and requires OWLTools, also available via GitHub (https://github.com/owlcollab/owltools/wiki/Install-OWLTools#building-from-source). Subcellular location data was obtained from version 15 of the Human Protein Atlas and can be downloaded at http://v15.proteinatlas.org/download/subcellular_location.csv.zip.

Funding Statement

This work was supported in part by grants NSF 1419282 (Moseley), NIH 1U24DK097215-01A1 (Higashi, Fan, Lane, Moseley), and NIH UL1TR001998-01 (Kern).

References

1. Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J. Gene Ontology: tool for the unification of biology. Nat Genet [Internet]. 2000;25(1):25–9. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/pmc3037419/
2. The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res [Internet]. 2015;43(D1):D204–12. Available from: http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gku989
3. Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2015. Nucleic Acids Res [Internet]. 2015;43(D1):D662–9. Available from: http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gku1010
4. Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res [Internet]. 2015;43(D1):D1049–56. Available from: http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gku1179
5. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res [Internet]. 2004;32(suppl 1):D267–70. Available from: http://nar.oxfordjournals.org/content/32/suppl_1/D267
6. Veres DV, Gyurko DM, Thaler B, Szalay KZ, Fazekas D, Korcsmaros T, et al. ComPPI: a cellular compartment-specific database for protein-protein interaction network analysis. Nucleic Acids Res [Internet]. 2015;43(D1):D485–93. Available from: http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gku1007
7. Huttlin EL, Ting L, Bruckner RJ, Gebreab F, Gygi MP, Szpyt J, et al. The BioPlex Network: A Systematic Exploration of the Human Interactome. Cell [Internet]. 2015;162(2):425–40. Available from: http://www.sciencedirect.com/science/article/pii/S0092867415007680
8. Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semantics [Internet]. 2015;6:17. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4404592&tool=pmcentrez&rendertype=abstract
9. Na D, Son H, Gsponer J. Categorizer: a tool to categorize genes into user-defined biological groups based on semantic similarity. BMC Genomics [Internet]. 2014;15:1091. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4298957&tool=pmcentrez&rendertype=abstract
10. Uhlen M, Fagerberg L, Hallstrom BM, Lindskog C, Oksvold P, Mardinoglu A, et al. Tissue-based map of the human proteome. Science [Internet]. 2015;347(6220):1260419. Available from: http://www.sciencemag.org/content/347/6220/1260419
11. GO Slim and Subset Guide [Internet]. [cited 2016 Nov 22]. Available from: http://geneontology.org/page/go-slim-and-subset-guide
12. Binns D, Dimmer EC, Huntley RP, Barrell DG, O’Donovan C, Apweiler R. QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics [Internet]. 2009;25(22):3045–6. Available from: http://doi.wiley.com/10.1002/pmic.200800002
13. The UniProt Consortium. subcell.txt [Internet]. 2015 [cited 2015 May 27]. Available from: http://www.uniprot.org/docs/subcell
14. Jiang JJ. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of International Conference Research on Computational Linguistics (ROCLING X). 1997.
15. Lin D. An Information-Theoretic Definition of Similarity. In: ICML ’98 Proceedings of the Fifteenth International Conference on Machine Learning. 1998. p. 296–304.
16. Resnik P. Semantic Similarity in a Taxonomy: An Information Based Measure and Its Application to Problems of Ambiguity in Natural Language. J Artificial Intell Res. 1999;11:95–130.
17. Schlicker A, Domingues FS, Rahnenführer J, Lengauer T. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics [Internet]. 2006;7:302. Available from: http://www.scopus.com/inward/record.url?eid=2-s2.0-33748335463&partnerID=tZOtx3y1
18. Abeysinghe R, Hinderer EW, Moseley HNB, Cui L. Auditing Subtype Inconsistencies among Gene Ontology Concepts. In: The 2nd International Workshop on Semantics-Powered Data Analytics (SEPDA 2017), in conjunction with IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2017.
19. Abeysinghe R, Zheng F, Hinderer EW, Moseley HNB, Cui L. A Lexical Approach to Identifying Subtype Inconsistencies in Biomedical Terminologies. In: Quality Assurance of Biological and Biomedical Ontologies and Terminologies Workshop, Bioinformatics and Biomedicine (BIBM), 2018 IEEE International Conference. 2018.
20. Groß A, Pruski C, Rahm E. Evolution of Biomedical Ontologies and Mappings: Overview of Recent Approaches. Comput Struct Biotechnol J [Internet]. 2016;14:1–8. Available from: http://linkinghub.elsevier.com/retrieve/pii/S2001037016300319
21. Groß A, Dos Reis JC, Hartung M, Pruski C, Rahm E. Semi-automatic adaptation of mappings between life science ontologies. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). 2013;7970 LNBI:90–104.
22. Cesar J, Reis D, Santec CR, Tudor CRPH, Da Silveira M, Reynaud-delaître C. Mapping Adaptation Actions for the Automatic Reconciliation of Dynamic Ontologies. Cikm. 2013;599–608.
23. Hinderer EW, Flight RM, Dubey R, Macleod JN, Moseley HNB. Advances in gene ontology utilization improve statistical power of annotation enrichment. PLoS One. 2019;14(8):1–20.
24. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res [Internet]. 2003;13(11):2498–504. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=403769&tool=pmcentrez&rendertype=abstract
25. Lex A, Gehlenborg N, Strobelt H, Vuillemot R. UpSet: Visualization of Intersecting Sets Supplementary Material. IEEE Trans Vis Comput Graph. 2014;20(12):1983–1992.
26. van Rossum G, Drake F. The Python Language Reference Manual. Network Theory Ltd; 2011.
27. Gamma E, Helm R, Johnson R, Vlissides J, Booch G. Design Patterns: Elements of Reusable Object-Oriented Software. 1st ed. Addison-Wesley Professional; 1994.
28. McKinney W. Data Structures for Statistical Computing in Python. Proc 9th Python Sci Conf [Internet]. 2010;1697900(Scipy):51–6. Available from: http://conference.scipy.org/proceedings/scipy2010/mckinney.html

PLoS One. doi: 10.1371/journal.pone.0233311.r001

Decision Letter 0

Marc Robinson-Rechavi

28 Feb 2020

PONE-D-20-00699

GOcats: A tool for categorizing Gene Ontology into subgraphs of user-defined concepts

PLOS ONE

Dear Dr. Moseley,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

As you will see, both reviewers appreciated your work and considered it a welcome addition to the literature, but had important remarks concerning the relation of your work to previous work, its impact on downstream analyses, and the presentation of the manuscript. All of these remarks are constructive and helpful, and I invite you to take them into account in your revision.

We would appreciate receiving your revised manuscript by Apr 13 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

1. Thank you for including your funding statement; "This work was supported in part by grants NSF 1419282 (Moseley), NIH 1U24DK097215-01A1 (Higashi, Fan, Lane, Moseley), and NIH UL1TR001998-01 (Kern)."

Please provide an amended statement that declares *all* the funding or sources of support (whether external or internal to your organization) received during this study, as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now. Please also include the statement “There was no additional external funding received for this study.” in your updated Funding Statement.

Please include your amended Funding Statement within your cover letter. We will change the online submission form on your behalf.

We note that your BSD-3 License of the software may have copyright restrictions. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures or software specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

1. You may seek permission from the original copyright holder of the software to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

2. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors present GOcats, a tool that categorizes the Gene Ontology (GO) into subgraphs based on user inputs. Similar to GO-slim, it generates a sub-graph of the full GO using a list of user-provided keywords, while also handling the semantic scoping of the relationships within the GO. The paper itself only focuses on the sub-ontology under subcellular locations in order to be able to report its accuracy against immunohistochemistry datasets in the Human Protein Atlas (HPA). Overall, it aims to reduce the manual effort required in hand-selecting a set of GO terms for categorization, which can improve the current complex workflows of GO analysis.

The paper is well-written, with an appreciable level of background and sufficient details for a scientist who is not completely familiar with the GO. The authors present how their work fits in among the numerous GO tools out there, and provide several examples of how the output of GOcats improves the correspondence between the annotation sources. The source code, scripts, and documentation are complete and appear up-to-date with the manuscript. The Methods section includes important details about their implementation, accompanied by reasonable explanations. For example, they have considered different user scenarios such as comprehensive and conservative subgraph extensions. As a statistician, I appreciate their discussion regarding how user-input keywords that prune the GO can increase the bias during enrichment analysis. Well-written software developed with the intent to maintain statistical reproducibility can also more generally benefit the field of data analysis. Overall, I would recommend this paper for publication, conditioned on the authors addressing some of my concerns below and some minor suggestions that could improve readability.

Major concerns:

1. Distinction between this manuscript and their previous one [Hinderer EW, Flight RM, Dubey R, Macleod JN, Moseley HNB. Advances in gene ontology utilization improve statistical power of annotation enrichment. PLoS One. 2019;14(8):1–20.]

a. From what I understood, the previous manuscript also introduces GOcats as a tool that ‘organizes the Gene Ontology into subgraphs representing user-defined concepts, while ensuring that all appropriate relations are congruent with respect to scoping semantics.’ Is the underlying algorithm improved in any way? Was the previous software unable to perform the same analyses done in this paper?

b. At the end of the introduction, they mention ‘In a prior publication, we demonstrated GOcats’s ability to improve gene-annotation enrichment analyses, involving all GO sub-ontologies (23).’ I suggest that either (i) they should reference (23) earlier in the introduction to make it clear when it was initially developed, or (ii) they could discuss the differences between results from the previous paper and those in this paper. I skimmed the previous paper and indeed did not find any overlap, but I think it is the authors’ responsibility to clarify the differences from their previous work in this follow-up manuscript.

2. Technical correctness of the algorithms. The pseudo-code in the Methods are interspersed and difficult to read without cross referencing. Here are some detailed pointers for the authors:

a. They mention ‘A single category representative root-node is selected by *recursively* counting the number of children each candidate term has and choosing the term with most children’. It is technically wrong to call the code block following the description recursive, because it is iterative and does not consider the children of newly appended nodes.

b. The object class definitions are inconsistent. I see both ‘FOR subnode in subgraph.nodes’ and ‘FOR subnode in subgraph’, and similar inconsistencies for the supergraph class in different code blocks. Although I couldn’t find the corresponding source code for these steps, I strongly suggest that the authors double-check their source and make the pseudo-code reflect their implementation more precisely.

c. There are very long one-line statements such as ‘subnode.parent_node_set = …’ which are written in Python3 set-comprehension form. This is unnecessary for pseudo-code and makes it quite difficult to parse the logic.

3. Time complexity and scalability.

a. When comparing themselves with M2S, they say that even though GOcats was implemented in Python and M2S was implemented in Java, GOcats was faster in performance due to ‘the utilization of stored ancestor and descendent node’. The statement is not convincing because M2S also stores the data, but perhaps in a different format. They should be clear, for example, whether it is because the node information is faster to access because it is in memory when the subgraph is built, or because they have a better way to parse the flat files that both M2S and they use. Right now, this discussion sounds too vague and mysterious to me. So instead of letting a reader guess what is happening, they should add just a couple more reasons accounting for any overhead that they and M2S incur.

b. They say that ‘GOcats should offer appreciable computation improvement’ on significantly larger datasets. To support this claim, I urge the authors to give some discussion of what their computational time complexity is for a user to get an idea of how the software will scale. (Also, Python could be less efficient in scaling up compared to Java due to its innate memory management.) How long did it take to run GOcats with the largest input they had so far? What if someone wants to run it on a larger sub-ontology such as ‘biological processes’?

Minor comments:

Again, this paper is well-presented. The following comments are mainly cosmetic changes that fix some small writing issues here and there:

1. In the Introduction, what do they mean by ‘(semi-) automated’? The term semi-automated is not well defined in this paper. I would rather be explicit about what manual procedures are needed, because GOcats could be interpreted as semi-automated, because it requires user-input (and perhaps user verification that the subgraph is indeed useful by visualization).

2. ‘Due to the nature of experimentally verified properties available, …’. sounds vague. What is the ‘nature’ of the ‘properties’?

3. ‘Due to the eventual application to the HPA datasets, …, were included to prevent categorization of terms that would complicate a eukaryotic interpretation…’. Were they ‘included’ or ‘excluded’? Why would including them prevent complications?

4. The typo ‘indeces’ occurs in two places in the manuscript: ‘indeces’ -> ‘indices’

5. ‘Moreover, GOcats comparison with…’ sounds awkward grammatically. Consider rephrasing the whole sentence.

6. Punctuation ‘.’ missing at the end of the legends of Figures 1C, 2, and 4.

Reviewer #2: The paper presents GOcats, a novel tool that organizes the Gene Ontology (GO) into subgraphs representing user-defined concepts. This tool aims at mitigating the issues introduced by manual selection of higher-order GO terms to summarize results.

GOcats was evaluated by using subcellular location categories to mine annotations from GO-utilizing knowledgebases and assessing their accuracy against immunohistochemistry datasets in the Human Protein Atlas (HPA), where it was shown to produce results comparable to mapping to GO slims and, in some cases, potentially better results.

The tool addresses an important aspect. Many GO-based analyses suffer from manual selection of high-level GO terms to summarize results, which is time consuming and introduces potential bias.

However, there are a number of issues that need to be clarified or improved upon.

1. The paper does not present results on how GOcats avoids the bias introduced by manual categorization. While it does eliminate the effort of manual work (just as M2S does), the authors themselves are aware of the potential for misuse.

2. Since GOcats is based on user input of keywords, the results for the analysis of the exact same data will very likely be different when done by two different researchers. This threatens reproducibility and comparison between studies. I would like to see the authors expand on this and on how their tool should be used to allow reproducibility. Please compare to GO slims, which are a shared model.

3. GOcats was developed for users not familiar with GO or bioinformatics. However, the correct usage of GOcats relies on the user defining keywords at the right level of granularity (Figures 5a and 5b illustrate this). If a user is not familiar with GO, how can they select an appropriate granularity level?

4. There are several references to the "traditionally problematic relation, has-part". Since GOcats has different usage modes ("user has the option to define the scoping relation set"), it would be really good to see the impact that switching this on or off has on the results.

5. There is a rather long review of semantic similarity and ontology mapping/evolution, themes that appear to be only marginally related to the topic. These portions of the text could be summarized.

6. A relevant application of GOcats is presented as "GOcats can facilitate the integrity checking of annotations that are added to public repositories by streamlining the process of extracting categories of annotations from knowledgebases and comparing them to the original annotations in the raw data."

This needs to be explained in more detail. The workflow presented in the paper uses the raw data as the keyword input for GOcats, so it is not clear to me how exactly it would result in independent integrity checking.

Also, this use case would be a lot stronger if the reader were given an idea of how often cellular localization identification is not made to GO.

7. GOCats is potentially generalizable to the other GO types. Why was it only applied to cellular component? The other GO branches are much larger than CC; would this impact GOCats negatively?

8. In general, I find that the evaluation of the tool is lacking. I understand how it can potentially be useful, but the evaluation is based on a small controlled vocabulary in use by HPA. A better evaluation would be to run user studies, with users selecting keywords to categorize their data and reporting on their experience and usefulness of the results.

Minor:

9. This statement on page 20 is unclear to me: "Overall, the patterns of connectedness in this network make more sense biologically, within the constraints of GO’s internal organization." More sense compared to what?

10. On page 28, the definitions of partial agreement and no agreement are not easy to understand in the text. The definitions should not be in the Figure 5 caption but in the main text.

11. In Table 1, how were the 25 examples selected?

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Jun 11;15(6):e0233311. doi: 10.1371/journal.pone.0233311.r002

Author response to Decision Letter 0

31 Mar 2020

Reviewer #1:

The authors present GOcats, a tool that categorizes the Gene Ontology (GO) into subgraphs based on user inputs. Similar to GO-slim, it generates a sub-graph of the full GO using a list of user-provided keywords, while also handling the semantic scoping of the relationships within the GO. The paper itself only focuses on the sub-ontology under subcellular locations in order to be able to report its accuracy against immunohistochemistry datasets in the Human Protein Atlas (HPA). Overall, it aims to reduce the manual effort required in hand-selecting a set of GO terms for categorization, which can improve the current complex workflows of GO analysis.

Response:

We thank the reviewer for their thorough review of our manuscript and recognizing the significance of our methods. We have addressed each of the reviewer’s comments below:

Issue 1:

Major concerns:

Response:

In a perfect world, this manuscript would have been published first. However, the other paper was published first, because the significance of the application of GOcats was easier to perceive for other reviewers. This manuscript provides an in-depth description of GOcats’s methods and implementation and demonstrates different types of applications involving knowledgebase curation and deriving effective subcellular localization information from gene-product annotations. The other paper demonstrates the application of GOcats in annotation enrichment analysis, which is probably the most recognized use-case for GO at this time.

Issue 2:

Response:

To better contrast this manuscript from the previous paper, we have added the following statements that clarify that this manuscript provides a thorough description of GOcats’s methods and implementation, along with their application in deriving gene-product-specific subcellular localization information and in knowledgebase curation:

“Due to the nature of the experimentally verified properties available from the HPA, our analysis in this paper focuses on cellular locations, especially subcellular locations. Also, this paper provides an in-depth description of GOcats’s methods and their implementation. In a prior publication, we demonstrated GOcats’s ability to improve gene-annotation enrichment analyses, involving all GO sub-ontologies (23).”

Issue 3:

2. Technical correctness of the algorithms. The pseudo-code in the Methods is interspersed and difficult to read without cross referencing. Here are some detailed pointers for the authors:

Response:

The recursion is partly hidden in this example, because recursion is used to generate descendent sets. We stated this earlier in the manuscript as follows:

“At the first access of these sets through the ancestor or descendent property, the set is calculated with a recursive algorithm, stored for future use, and returned for immediate access.”

Part of the problem is deciding what level of detail to directly include in the pseudocode. To make the recursion clearer in this case, we have restated that the recursion takes place in the calculation of the descendents:

“A single category representative root-node is selected by recursively counting the number of children each candidate term has (i.e. creating the node.descendents) and choosing the term with the most children.”
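The root-node selection described in the statement above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the GOcats source: the `Node` class, its `children` attribute, the `descendants` property, and the `pick_root` helper are hypothetical names, and the toy terms are chosen purely for the example.

```python
class Node:
    """Hypothetical ontology term used for illustration; not the GOcats class."""

    def __init__(self, name):
        self.name = name
        self.children = set()
        self._descendants = None  # filled in on first access

    @property
    def descendants(self):
        # Recursively collect every node reachable through child edges,
        # then store the result so later accesses do no extra work.
        if self._descendants is None:
            found = set()
            for child in self.children:
                found.add(child)
                found |= child.descendants
            self._descendants = found
        return self._descendants


def pick_root(candidates):
    """Return the candidate term with the largest descendant set."""
    return max(candidates, key=lambda node: len(node.descendants))


# Toy graph: nucleus -> {nucleolus, nuclear envelope -> nuclear pore}
nucleus, nucleolus = Node("nucleus"), Node("nucleolus")
envelope, pore = Node("nuclear envelope"), Node("nuclear pore")
nucleus.children = {nucleolus, envelope}
envelope.children = {pore}

root = pick_root([nucleus, envelope, nucleolus])
print(root.name)  # nucleus
```

In this toy graph the candidate "nucleus" has three descendants, more than any other candidate, so it is chosen as the category representative.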

Issue 4:

Response:

We have made all of the pseudocode examples consistent.

Issue 5:

Response:

We reduced the font size and the amount of indentation to reduce the breakup of code statements across multiple lines. This improves the readability of these pseudocode blocks in the manuscript. We will work with the journal style editors to create an equivalent in the published form.

Issue 6:

3. Time complexity and scalability.

Response:

We have added the following statement to make this point clearer:

“However through the use of Python decorators, GOcats recursively creates and stores ancestor and descendent node sets in a manner analogous to lazy evaluation, allowing the implementation of efficient subgraph-centric algorithms that only precomputes the ancestor and descendent sets that are needed.”
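One way to picture the decorator-based lazy evaluation described in this statement is the sketch below. It is an assumption-laden illustration rather than the GOcats implementation: the `lazy_property` decorator, the `Term` class, and the call counter are all hypothetical and exist only to show that a set is computed on first access and reused afterwards.

```python
import functools


def lazy_property(compute):
    """Hypothetical decorator: run `compute` on first access, then reuse the stored value."""
    attr = "_lazy_" + compute.__name__

    @property
    @functools.wraps(compute)
    def wrapper(self):
        if not hasattr(self, attr):
            setattr(self, attr, compute(self))
        return getattr(self, attr)

    return wrapper


class Term:
    calls = 0  # counts how many ancestor sets were actually computed

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = set(parents)

    @lazy_property
    def ancestors(self):
        Term.calls += 1
        found = set(self.parents)
        for parent in self.parents:
            found |= parent.ancestors  # recursion reuses stored parent sets
        return found


cell = Term("cell")
nucleus = Term("nucleus", [cell])
nucleolus = Term("nucleolus", [nucleus])
ribosome = Term("ribosome", [cell])  # never queried, so never computed

first = nucleolus.ancestors   # computes the sets for nucleolus, nucleus, and cell
second = nucleolus.ancestors  # served from the stored value
print(sorted(t.name for t in first), Term.calls)  # ['cell', 'nucleus'] 3
```

Only the sets along the queried path are built; the unqueried `ribosome` term incurs no computation, which is the sense in which only the needed ancestor and descendent sets are precomputed.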

Issue 7:

Response:

In the other GOcats paper, we demonstrate the use of GOcats on all three GO sub-ontologies, including biological process. GOcats generates these results for all three GO sub-ontologies in a few seconds. We believe that the precomputation and storage of the ancestor and descendent sets has complexity O(n log n); however, since GOcats runs on all of GO in just seconds, we have not felt the need to rigorously test the computational complexity of GOcats’s algorithms. We have added the following statement to support our point that GOcats performs very efficiently on all of GO:

“Based on these results, GOcats should offer appreciable computational improvement on significantly larger datasets. This is demonstrated in GOcats’s application in annotation enrichment analysis involving all three GO sub-ontologies, which executes in just a few seconds (23).”

Issue 8:

Minor comments:

Again, this paper is well-presented. The following comments are mainly cosmetic changes that fix some small writing issues here and there:

Response:

By the term “(semi-)automated”, we mean at least partially automated. We do consider several use-cases of GOcats to be semi-automated due to the requirement for user input. We have added a clarifying phrase to the Introduction:

“Therefore, (semi-)automated (i.e. at least partially automated) and unbiased methods for categorizing semantically-similar and biologically-related annotations are needed for integrating information from heterogeneous sources—even if the annotation terms themselves are standardized—to facilitate effective downstream systems-level analyses and integrated network-based modeling.”

Issue 9:

2. ‘Due to the nature of experimentally verified properties available, …’. sounds vague. What is the ‘nature’ of the ‘properties’?

Response:

We have added the clarifying phrase “from the HPA” into this sentence:

“Due to the nature of the experimentally verified properties available from the HPA, our analysis in this paper focuses on cellular locations, especially subcellular locations.”

Issue 10:

Response:

Thank you again! We sometimes forget to explicitly spell out all of the logical steps in our arguments. In this instance, we are considering the implications of a greedy subgraph extension algorithm. We have tried to make this easier to follow with the following revision:

“Due to the eventual application to the HPA datasets, three unusual categories, “bacterial”, “viral”, and “other organism”, were included to prevent categorization of terms that would complicate a eukaryotic interpretation of the other 22 subcellular locations, within the context of a greedy subgraph extension algorithm.”

Issue 11:

4. Typo ‘indeces’ occur in two places in the manuscript: ‘indeces’ -> ‘indices’

Response:

Fixed.

Issue 12:

5. ‘Moreover, GOcats comparison with…’ sounds awkward grammatically. Consider rephrasing the whole sentence.

Response:

We rephrased it as follows:

“Moreover, the comparison of GOcats to M2S demonstrates similar mapping performance between the two methods, but with GOcats providing important improvements in mapping, computational speed, ease of use, and flexibility of use.”

Issue 13:

6. Punctuation ‘.’ Missing at the end of the legend of Figures 1C, 2, and 4.

Response:

Fixed.

Reviewer #2:

The paper presents GOCats, a novel tool that organizes the Gene Ontology (GO) into subgraphs representing user-defined concepts. This tool aims at mitigating the issues introduced by manual selection of higher-order GO terms to summarize results.

The tool addresses an important aspect. Many GO-based analyses suffer from manual selection of high-level GO terms to summarize results, which is time consuming and introduces potential bias.

However, there are a number of issues that need to be clarified or improved upon.

Response:

We thank the reviewer for their review of our manuscript. We have addressed each of the reviewer’s comments below:

Issue 1:

Response:

There are two different sources of bias that are mentioned in this manuscript. GOcats provides an automated way to build subgraph categories. This eliminates potential bias that can come from the manual building of these subgraph categories. We have the following statement in the manuscript, highlighting this point:

“But because GOcats will always produce the same subgraph categorizations for the same set of keywords used with the same version of GO, we argue that our categorization is more reproducible and less prone to bias than manually grouping GO terms into categories or otherwise manually identifying major concepts represented from omics-level analyses.”

However, GOcats is still prone to bias that comes from any user input, which in this case, is the keywords and terms provided by the user. We have tried to be careful and clearly indicate what biases GOcats can and cannot avoid.

Issue 2:

Response:

This is an issue of reproducibility. Using the same keywords with the same versions of GO and datasets will produce the same results. We have a public FigShare repository that includes all of the manuscript’s results and the programs and scripts used to produce these results. One researcher simply needs to provide the keywords they used along with the versions of GOcats, GO, and their datasets for another researcher to reanalyze. This is no different than a shared GO slim. Also, GO slims do change over time. We have added the following statement to highlight the point of enhanced reproducibility:

“Furthermore, the set of keywords can be provided along with the version of GOcats, GO, and the dataset to enable reproducibility of analyses by others.”
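As a rough illustration of what sharing "the set of keywords ... along with the version of GOcats, GO, and the dataset" could look like in practice, the sketch below bundles these items into one record with a fingerprint. Everything here is an assumption for illustration: the `analysis_record` helper, the field names, and the placeholder version and dataset strings are hypothetical and are not part of GOcats.

```python
import hashlib
import json


def analysis_record(keywords_by_category, gocats_version, go_release, dataset_id):
    """Bundle the inputs another researcher would need to repeat a categorization."""
    record = {
        "gocats_version": gocats_version,
        "go_release": go_release,
        "dataset": dataset_id,
        # Sort categories and keywords so the record does not depend on input order.
        "categories": {
            name: sorted(words)
            for name, words in sorted(keywords_by_category.items())
        },
    }
    canonical = json.dumps(record, sort_keys=True)
    record["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record


# Placeholder values only; real analyses would record their actual versions.
a = analysis_record(
    {"nucleus": ["nucleus", "nuclear"], "cytoplasm": ["cytoplasm", "cytoplasmic"]},
    gocats_version="x.y.z", go_release="YYYY-MM-DD", dataset_id="example-dataset",
)
b = analysis_record(
    {"cytoplasm": ["cytoplasmic", "cytoplasm"], "nucleus": ["nuclear", "nucleus"]},
    gocats_version="x.y.z", go_release="YYYY-MM-DD", dataset_id="example-dataset",
)
print(a["fingerprint"] == b["fingerprint"])  # True
```

Two researchers who record the same keywords and versions obtain the same fingerprint regardless of the order in which the keywords were typed, which gives a quick check that two analyses started from identical inputs.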

Issue 3:

Response:

GOcats has multiple use-cases. We describe how it can be used for generating GO slim-like categories, deriving subcellular location from a large knowledgebase, and knowledgebase curation. In another publication, we demonstrate GOcats’s use in annotation enrichment analysis. These use-cases require different levels of GO and bioinformatics expertise. Dataset harmonization and knowledgebase curation, where granularity adjustment would be useful, would require quite a bit of expertise. We have tried to illustrate this with the following added statement:

“Furthermore, GOcats was designed for scientists who are less familiar with GO; however, the package has advanced features for users with more bioinformatics expertise.”

Issue 4:

Response:

Our other publication on GOcats provides results illustrating the effect of turning on and off the has_part relationship in categorization and annotation enrichment analysis.

Issue 5:

5. There is a rather long review of semantic similarity and ontology mapping/evolution, themes that appear to be only marginally related to the topic. These portions of the text could be summarized.

Response:

We have found it necessary to provide an introduction to these topics so that those unfamiliar with them can understand the significance of our work. A lot of people use GO for a wide range of purposes and have quite a bit of expertise in specific applications of GO. Therefore, a lot of people view themselves as “experts” on GO and on ontologies as a whole; however, they have little formal training in ontologies as an area of research. Therefore, we find ourselves needing to provide the necessary background in order for others to understand exactly what problems we are solving and why the solutions are significant.

Issue 6:

6. A relevant application of GOCats is presented as "GOcats can facilitate the integrity checking of annotations that are added to public repositories by streamlining the process of extracting categories of annotations from knowledgebases and comparing them to the original annotations in the raw data."

Also, this use case would be a lot stronger if the reader were given an idea of how often cellular localization identification is not made to GO.

Response:

The word “raw” is a misnomer. We meant to contrast the raw HPA datasets with respect to derived ontology-normalized information stored in a knowledgebase. The raw HPA datasets are actually highly processed results that used a controlled vocabulary to describe specific subcellular locations. To make this point clearer, we have added the following statement:

‘In this context, the term “raw data” refers to processed, curated experimental data that is annotated as a contrast to the GO annotations derived from a knowledgebase.’

In response to the reviewer’s comment about an independent integrity check, we directly compared the subcellular localization indicated by the “raw” HPA datasets to the HPA-deposited annotations in the knowledgebase. We hope this is made clearer by the statement we added above.

With respect to the reviewer’s request that we provide an idea of how often cellular localization identification is not made with GO, we cannot feasibly review all published uses of GO to estimate how often this occurs. However, we have pointed out a major instance in which cellular compartments were manually organized into a hierarchical localization tree:

‘For example, a recent effort to create a protein-protein interaction network analysis database resorted to manually building a hierarchical localization tree from GO cellular compartment terms due to the “incongruity in the resolution of localization data” in various source databases and the fact that no published method existed at that time for the automated organization of such terms (6).’

Issue 7:

7. GOCats is potentially generalizable to the other GO types. Why was it only applied to cellular component? The other GO branches are much larger than CC; would this impact GOCats negatively?

Response:

Our other GOcats publication illustrates GOcats applied to all three GO sub-ontologies.

Issue 8:

Response:

GOcats is a versatile tool. Our other publication illustrates GOcats’s use in annotation enrichment analyses of RNAseq datasets. This manuscript is meant to provide a thorough description of the methods and implementation with interesting applications in categorization and curation. We have provided a thorough evaluation of these use-cases. Evaluation of the usability of GOcats by end-users is not the point of this manuscript. Demonstration of new capabilities in automated categorization methods is the point of this manuscript. We have done this.

Issue 9:

Minor:

9. This statement on page 20 is unclear to me: "Overall, the patterns of connectedness in this network make more sense biologically, within the constraints of GO’s internal organization." More sense compared to what?

Response:

We have added the following clarifying statement:

“In other words, it is easier to see the expected biological relationships between cellular locations in Figure 1B versus Figure 1A.”

Issue 10:

10. On page 28, the definitions of partial agreement and no agreement are not easy to understand in the text. The definitions should not be in the Figure 5 caption but in the main text.

Response:

Respectfully, we disagree. It is painful for the reader to have to flip back and forth between the figure and the main text to understand what each column means. However, we have added these definitions to the main text as well:

‘In this context, “partial agreement” refers to genes with at least one matching subcellular location, “partial agreement is superset” refers to genes where knowledgebase subcellular locations are a superset of the HPA dataset (these are mutually exclusive to the “partial agreement” category), "no agreement" refers to genes with no subcellular locations in common, and “no annotations” refers to genes in the experimental dataset that were not found in the knowledgebase.’
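The category definitions quoted above can be read as a small decision procedure, sketched below for a single gene. This is an illustration under assumptions, not the analysis code used for Figure 5: the `agreement_category` function is hypothetical, the "complete agreement" label for an exact match is assumed because the quoted definitions do not cover that case, and each gene is assumed to have at least one HPA location.

```python
def agreement_category(hpa_locations, kb_locations):
    """Classify one gene by comparing HPA locations to knowledgebase locations.

    `kb_locations` is None when the gene is not found in the knowledgebase.
    Assumes `hpa_locations` is non-empty.
    """
    if kb_locations is None:
        return "no annotations"
    hpa, kb = set(hpa_locations), set(kb_locations)
    if hpa == kb:
        return "complete agreement"  # assumed label for an exact match
    if kb > hpa:
        # Knowledgebase locations strictly contain the HPA locations;
        # checked before "partial agreement" so the two stay mutually exclusive.
        return "partial agreement is superset"
    if hpa & kb:
        return "partial agreement"
    return "no agreement"


examples = {
    "superset": agreement_category({"nucleus"}, {"nucleus", "cytosol"}),
    "partial": agreement_category({"nucleus", "golgi"}, {"nucleus", "cytosol"}),
    "none": agreement_category({"nucleus"}, {"cytosol"}),
    "missing": agreement_category({"nucleus"}, None),
}
for label, category in examples.items():
    print(label, "->", category)
```

Testing the superset condition before the overlap condition is what keeps "partial agreement is superset" and "partial agreement" mutually exclusive, as the definition requires.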

Issue 11:

11. In Table 1, how were the 25 examples selected?

Response:

As we stated in the text:

‘Starting with common biological subcellular concepts like “nucleus”, “cytoplasm”, and “mitochondrion”, we recursively used terms not being categorized to identify additional subcellular concepts and associated keywords represented within the GO Cellular Component sub-ontology.’

Attachment

Submitted filename: response_to_reviewer_comments_v2.docx (DOCX, 25.4KB)

PLoS One. doi: 10.1371/journal.pone.0233311.r003

Decision Letter 1

Marc Robinson-Rechavi

4 May 2020

GOcats: A tool for categorizing Gene Ontology into subgraphs of user-defined concepts

PONE-D-20-00699R1

Dear Dr. Moseley,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: N/A

Reviewer #3: N/A

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors have addressed my concerns and fixed the places that I was confused about. Please make sure that the figures are readable in the final print.

Reviewer #3: Thank you for addressing my comments.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #3: Yes: Pascale Gaudet

PLoS One. doi: 10.1371/journal.pone.0233311.r004

Acceptance letter

Marc Robinson-Rechavi

28 May 2020

PONE-D-20-00699R1

GOcats: A tool for categorizing Gene Ontology into subgraphs of user-defined concepts

Dear Dr. Moseley:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Marc Robinson-Rechavi

Academic Editor

PLOS ONE

Associated Data


Supplementary Materials

S1 Data. Visualizing the degree of overlap between the category subgraphs created by GOcats, Map2Slim, and the UniProt CV (additional categories). (DOCX, 1.3MB)

S2 Data. List of GO terms mapped by Map2Slim to the term plasma membrane that were not mapped to this location by GOcats. (DOCX, 23.1KB)

S1 File. (DOCX, 12.9KB)


Data Availability Statement

[pone.0233311.ref001] 1.Ashburner M, Ball C, Blake J, Botstein D, Butler H, Cherry J. Gene Ontology: tool for the unification of biology. Nat Genet [Internet]. 2000;25(1):25–9. Available from: http://www.ncbi.nlm.nih.gov/pmc/articles/pmc3037419/ [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref002] 2.The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res [Internet]. 2015;43(D1):D204–12. Available from: http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gku989 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref003] 3.Cunningham F, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2015. Nucleic Acids Res [Internet]. 2015;43(D1):D662–9. Available from: http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gku1010 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref004] 4.Gene Ontology consortium. Gene Ontology Consortium: going forward. Nucleic Acids Res [Internet]. 2015;43(D1):D1049–56. Available from: http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gku1179 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref005] 5.Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res [Internet]. 2004;32(suppl 1):D267–70. Available from: http://nar.oxfordjournals.org/content/32/suppl_1/D267%5Cnhttp://nar.oxfordjournals.org/content/32/suppl_1/D267.full.pdf%5Cnhttp://nar.oxfordjournals.org/content/32/suppl_1/D267.short%5Cnhttp://www.ncbi.nlm.nih.gov/pubmed/14681409 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref006] 6.Veres D V., Gyurko DM, Thaler B, Szalay KZ, Fazekas D, Korcsmaros T, et al. ComPPI: a cellular compartment-specific database for protein-protein interaction network analysis. Nucleic Acids Res [Internet]. 2015;43(D1):D485–93. Available from: http://nar.oxfordjournals.org/lookup/doi/10.1093/nar/gku1007 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref007] 7.Huttlin EL, Ting L, Bruckner RJ, Gebreab F, Gygi MP, Szpyt J, et al. The BioPlex Network: A Systematic Exploration of the Human Interactome. Cell [Internet]. 2015;162(2):425–40. Available from: http://www.sciencedirect.com/science/article/pii/S0092867415007680 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref008] 8.Papatheodorou I, Oellrich A, Smedley D. Linking gene expression to phenotypes via pathway information. J Biomed Semantics [Internet]. 2015;6:17 Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4404592&tool=pmcentrez&rendertype=abstract [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref009] 9.Na D, Son H, Gsponer J. Categorizer: a tool to categorize genes into user-defined biological groups based on semantic similarity. BMC Genomics [Internet]. 2014;15:1091 Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4298957&tool=pmcentrez&rendertype=abstract [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref010] 10.Uhlen M, Fagerberg L, Hallstrom BM, Lindskog C, Oksvold P, Mardinoglu A., et al. Tissue-based map of the human proteome. Science (80-) [Internet]. 2015;347(6220):1260419–1260419. Available from: http://www.sciencemag.org/content/347/6220/1260419 [DOI] [PubMed] [Google Scholar]

[pone.0233311.ref011] 11.GO Slim and Subset Guide [Internet]. [cited 2016 Nov 22]. http://geneontology.org/page/go-slim-and-subset-guide

[pone.0233311.ref012] 12.Binns D, Dimmer EC, Huntley RP, Barrell DG, O’Donovan C, Apweiler R. QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics [Internet]. 2009;25(22):3045–6. Available from: http://doi.wiley.com/10.1002/pmic.200800002 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref013] 13.The UniProt Consortium. subcell.txt [Internet]. 2015 [cited 2015 May 27]. http://www.uniprot.org/docs/subcell

[pone.0233311.ref014] 14.Jiang JJ. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. In: Proceedings of International Conference Research on Computational Linguistics (ROCLING X). 1997.

[pone.0233311.ref015] 15.Lin D. An Information-Theoretic Definition of Similarity. In: ICML ‘98 Proceedings of the Fifteenth International Conference on Machine Learning. 1989. p. 296–304.

[pone.0233311.ref016] 16.Resnik P. Semantic Similarity in a Taxonomy: An Information Based Measure and Its Application to Problems of Ambiguity in Natural Language. J Aritificial Intell Res. 1999;11:95–130. [Google Scholar]

[pone.0233311.ref017] 17.Schlicker A, Domingues FS, Rahnenführer J, Lengauer T. A new measure for functional similarity of gene products based on Gene Ontology. BMC Bioinformatics [Internet]. 2006;7:302 Available from: http://www.scopus.com/inward/record.url?eid=2-s2.0-33748335463&partnerID=tZOtx3y1 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref018] 18.Abeysinghe R, Hinderer EW, Moseley HNB, Cui L. Auditing Subtype Inconsistencies among Gene Ontology Concepts. In: The 2nd International Workshop on Semantics-Powered Data Analytics (SEPDA 2017)—in conjunction with IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2017.

[pone.0233311.ref019] 19.Abeysinghe R, Zheng F, Hinderer EW, Moseley HNB, Cui L. A Lexical Approach to Identifying Subtype Inconsistencies in Biomedical Terminologies. In: Quality Assurance of Biological and Biomedical Ontologies and Terminologies Workshop—Bioinformatics and Biomedicine (BIBM), 2018 IEEE International Conference. 2018.

[pone.0233311.ref020] 20.Groß A, Pruski C, Rahm E. Evolution of Biomedical Ontologies and Mappings: Overview of Recent Approaches. Comput Struct Biotechnol J [Internet]. 2016;14:1–8. Available from: http://linkinghub.elsevier.com/retrieve/pii/S2001037016300319 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref021] 21.Groß A, Dos Reis JC, Hartung M, Pruski C, Rahm E. Semi-automatic adaptation of mappings between life science ontologies. Lect Notes Comput Sci (including Subser Lect Notes Artif Intell Lect Notes Bioinformatics). 2013;7970 LNBI:90–104. [Google Scholar]

[pone.0233311.ref022] 22.Cesar J, Reis D, Santec CR, Tudor CRPH, Da Silveira M, Reynaud-delaître C. Mapping Adaptation Actions for the Automatic Reconciliation of Dynamic Ontologies. Cikm. 2013;599–608. [Google Scholar]

[pone.0233311.ref023] 23.Hinderer EW, Flight RM, Dubey R, Macleod JN, Moseley HNB. Advances in gene ontology utilization improve statistical power of annotation enrichment. PLoS One. 2019;14(8):1–20. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref024] 24.Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res [Internet]. 2003;13(11):2498–504. Available from: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=403769&tool=pmcentrez&rendertype=abstract [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref025] 25.Lex A, Gehlenborg N, Strobelt H, Vuillemot R. UpSet: Visualization of Intersecting Sets Supplementary Material. IEEE Trans Vis Comput Graph. 2014;20(12):1983–1992. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0233311.ref026] 26.van Rossum G, Drake F. The Python Language Reference Manual. Network Theory Ltd; 2011. [Google Scholar]

[pone.0233311.ref027] 27.Gamma E, Helm R, Johnson R, Vlissides J, Booch G. Design Patterns: Elements of Reusable Object-Oriented Software 1st Edition Addison-Wesley Professional; 1994. [Google Scholar]

28. McKinney W. Data Structures for Statistical Computing in Python. Proc 9th Python Sci Conf [Internet]. 2010;1697900(Scipy):51–6. Available from: http://conference.scipy.org/proceedings/scipy2010/mckinney.html
