The dcGO Domain-Centric Ontology Database in 2023: New Website and Extended Annotations for Protein Structural Domains

Chaohui Bao; Chang Lu; James Lin; Julian Gough; Hai Fang

doi:10.1016/j.jmb.2023.168093

Author manuscript; available in PMC: 2023 Aug 29.

Published in final edited form as: J Mol Biol. 2023 Apr 13;435(14):168093. doi: 10.1016/j.jmb.2023.168093

PMCID: PMC7614987 EMSID: EMS185259 PMID: 37061086

Abstract

Protein structural domains have been less studied than full-length proteins in terms of ontology annotations. The dcGO database has filled this gap by providing mappings from protein domains to ontologies. The dcGO update in 2023 extends annotations for protein domains of multiple definitions (SCOP, Pfam, and InterPro) with commonly used ontologies that are categorised into functions, phenotypes, diseases, drugs, pathways, regulators, and hallmarks. This update adds new dimensions to the utility of both ontology and protein domain resources. A newly designed website at http://www.protdomainonto.pro/dcGO offers a more centralised and user-friendly way to access the dcGO database, with enhanced faceted search returning term- and domain-specific information pages. Users can navigate both ontology terms and annotated domains through improved ontology hierarchy browsing. A newly added facility enables domain-based ontology enrichment analysis.

Keywords: protein structural domains, ontologies, annotations, enrichment analysis, computational resources

Introduction

Computational prediction of protein structures has become feasible,¹ but most available protein sequences lack biological annotations.² Protein structural domains have received less attention than full-length proteins in terms of ontology annotations, such as annotations using Gene Ontology (GO).³ To fill this gap, about ten years ago we developed a domain-centric method⁴ to create the dcGO database,⁵ an ontology resource that provides annotations for protein structural domains. A growing number of ontologies have been created to annotate full-length proteins; however, there remains a significant need to use these ontologies to annotate protein domains. Domain-centric ontology annotation resources are essential because protein domains often act as the functional units of proteins and have been shown to be useful in protein function prediction^6,7 and, more recently, in hypothesis-free phenotype prediction.⁸

Over time, dcGO has evolved to support domain-centric annotations not only for protein domains taken from the Structural Classification of Proteins (SCOP) at both the superfamily and family levels,⁹ but also for domains from Pfam¹⁰ and InterPro.¹¹ In parallel with the growth in ontology knowledge-bases, these domain-centric annotations are available across various knowledge contexts, ranging from functions and pathways to phenotypes and diseases, and even drugs. Systematic mappings from protein domains to ontology terms, via dcGO, maximise the utility of both ontology and domain resources.

Since our previous publications closely related to dcGO,^4,5,12,13 we have continued to expand ontologies and domains and, most notably, have redesigned the website (Figure 1). The website includes a booklet-style user manual and features enhanced faceted search (augmenting search results with a faceted navigation system¹⁴), improved ontology hierarchy browsing, and domain-based ontology enrichment analysis. Together, these improvements constitute the dcGO database update in 2023, which we describe in detail in the following sections.

Figure 1. **(A)** The content. *Top*: ontologies are categorised into functions, pathways, phenotypes, diseases, drugs, regulators, and hallmarks. *Bottom*: a treemap summarises the database content. Each box represents an ontology and is color-coded by the total number of annotations per ontology. The treemap reports the numbers of annotations, ontology terms, and protein domains of different definitions (i.e., SCOP, Pfam, and InterPro). SF, SCOP superfamilies; FA, SCOP families. **(B)** The website. It includes interfaces for browsing the ontology hierarchy and annotated domains, performing domain-based ontology enrichment analysis, getting help on database access, and using the faceted search to explore the dcGO resource. Notably, the faceted search enables simultaneous search for protein domains (of different definitions) and ontology terms (of various categories).

Materials and methods

The dcGO building method

The building method has evolved over time and can be simplified into the following steps:

(i) Prepare a correspondence matrix^5,6 that records the observed number of proteins (i.e. matrix entries) with structural domains (in columns) and ontology terms (in rows).
(ii) Deduce associations/annotations between domains and terms from the correspondence matrix using Fisher’s exact test. The annotation significance is measured by false discovery rate (FDR) with Benjamini-Hochberg corrections for multiple hypothesis testing,¹⁵ and the annotation strength is quantified by a hypergeometric distribution-based score (or ‘annotation score’) rescaled into the 1–100 range.
(iii) Propagate domain-centric annotations to all ancestor terms (along with annotation scores) according to the ‘True Path Rule’, which respects the directed acyclic graph of an ontology (e.g. GO).¹⁶ This rule ensures that a protein domain annotated to a term is also annotated to all of that term’s ancestors on paths towards the ontology root.⁵
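Steps (i) and (ii) can be illustrated with a minimal sketch. It is not the dcGO implementation: the toy proteins, domains (`D1` to `D3`), terms (`T1`, `T2`), and all function names are hypothetical, and the one-sided Fisher's exact test is computed directly as a hypergeometric upper tail. The rescaled 1–100 annotation score is omitted because its exact formula is not given here.

```python
from collections import defaultdict
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n proteins from N, of which K carry the term."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def domain_term_associations(protein_domains, protein_terms):
    """Test every co-occurring (domain, term) pair; return {pair: (p, fdr)}."""
    proteins = set(protein_domains) & set(protein_terms)
    with_domain = defaultdict(set)
    with_term = defaultdict(set)
    for protein in proteins:
        for domain in protein_domains[protein]:
            with_domain[domain].add(protein)
        for term in protein_terms[protein]:
            with_term[term].add(protein)
    # Step (i): correspondence counts, kept sparsely per (domain, term) pair.
    pairs = sorted((d, t) for d in with_domain for t in with_term
                   if with_domain[d] & with_term[t])
    # Step (ii): one-sided Fisher's exact test as a hypergeometric upper tail.
    pvalues = [hypergeom_upper_tail(len(with_domain[d] & with_term[t]),
                                    len(proteins),
                                    len(with_term[t]),
                                    len(with_domain[d]))
               for d, t in pairs]
    fdrs = benjamini_hochberg(pvalues)
    return {pair: (p, q) for pair, p, q in zip(pairs, pvalues, fdrs)}


# Hypothetical toy inputs: domain composition and term annotations of proteins.
protein_domains = {"P1": {"D1"}, "P2": {"D1", "D2"}, "P3": {"D1"},
                   "P4": {"D2"}, "P5": {"D2"}, "P6": {"D3"}}
protein_terms = {"P1": {"T1"}, "P2": {"T1"}, "P3": {"T1"},
                 "P4": {"T2"}, "P5": set(), "P6": {"T2"}}

result = domain_term_associations(protein_domains, protein_terms)
for pair, (p, fdr) in result.items():
    print(pair, round(p, 4), round(fdr, 4))
```

In this toy example all three proteins carrying `D1` are annotated with `T1`, so that pair receives the smallest p-value, while pairs that co-occur only once are far from significant.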

In summary, the dcGO building method takes as inputs ontology terms attached to proteins and the domain composition of proteins, and then statistically infers mappings from protein domains to ontology terms within a probabilistic framework. For further details, users are referred to our previous publications on the method.^4,5 In this 2023 update, the method has been applied to almost all commonly used ontologies for protein domains of different definitions, which are described in greater detail below.
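The propagation step (iii) can be sketched in a few lines. The example below is illustrative only: the term identifiers and scores are hypothetical, and keeping the maximum score when an ancestor is reached through several descendants is an assumption made for the sketch, not a rule stated in the text.

```python
def ancestors(term, parents):
    """All ancestors of a term in a DAG given as {child: set of parents}."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(annotations, parents):
    """Apply the True Path Rule to {(domain, term): score} annotations.

    Assumption: an ancestor inherits the highest score among the
    annotations that reach it.
    """
    propagated = dict(annotations)
    for (domain, term), score in annotations.items():
        for ancestor in ancestors(term, parents):
            key = (domain, ancestor)
            propagated[key] = max(propagated.get(key, 0), score)
    return propagated


# Hypothetical ontology: T1 is the root; T4 has two parents (a DAG, not a tree).
parents = {"T2": {"T1"}, "T3": {"T2"}, "T4": {"T1", "T2"}}
annotations = {("D1", "T3"): 40, ("D1", "T4"): 70}

propagated = propagate(annotations, parents)
print(sorted(propagated.items()))
```

After propagation, domain `D1` is annotated to `T2` and to the root `T1` as well as to its two directly inferred terms.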

Protein domains of different definitions

Presently, the dcGO database provides ontology annotations for protein domains taken from SCOP,⁹ Pfam,¹⁰ and InterPro¹¹ (Figure 1(A)). Annotations are supported for SCOP at both the superfamily and family levels. SCOP domains are classified into a superfamily if there is structural, sequence, and functional evidence for a common evolutionary ancestor. Superfamilies can be further divided into families based on high sequence similarity or related function. In addition to SCOP, ontology annotations have also been extended to approximately 1,000 Pfam domains and around 800 InterPro domains, two popular protein family resources.

Commonly used ontologies

The dcGO update in 2023 now conveniently organises ontologies into seven broad categories (Figure 1(A)):

(i) Functions: GO¹⁷ (accessed in October 2022), which includes GO Biological Process (GOBP), GO Molecular Function (GOMF), and GO Cellular Component (GOCC).
(ii) Phenotypes: This category includes Human Phenotype Ontology (HPO)¹⁸ (June 2022 release), Mammalian Phenotype Ontology (MPO)¹⁹ (accessed in July 2022), and other phenotype and anatomy ontologies for model organisms such as WormBase²⁰ (WS284 release), FlyBase²¹ (6.48 release), ZFIN²² (accessed in July 2022), and TAIR²³ (accessed in July 2022).
(iii) Diseases: This category includes Mondo Disease Ontology (MONDO) that harmonises disease definitions across the world²⁴ (v2023-01-04 release), and Experimental Factor Ontology (EFO) used to annotate genome-wide association study (GWAS) disease traits²⁵ (3.44.0 release).
(iv) Drugs: That is, druggable categories from DGIdb²⁶ (2022-Feb release) and target tractability buckets (Bucket) from Open Targets²⁷ (22.06 release).
(v) Pathways: This category primarily includes sources from KEGG²⁸ (103.0 release), REACTOME²⁹ (version 81 release), PANTHER³⁰ (17.0 release), WikiPathways³¹ (July 2022 release), and MitoPathways from MitoCarta³² (MitoCarta3.0 version).
(vi) Regulators: That is, ENRICHR Consensus TFs³³ (accessed in July 2022) and TRRUST³⁴ (2018.04.16 release).
(vii) Hallmarks: Molecular signature hallmarks from MSigDB³⁵ (v7.5.1 release).

The dcGO website

The website has been revamped using the Mojolicious Perl real-time web framework (https://mojolicious.org) and Bootstrap (https://getbootstrap.com) to support a mobile-first and responsive web experience for all major browsers and devices. To enable faceted search, the website uses the typeahead JavaScript library (https://twitter.github.io/typeahead.js), which includes a suggestion engine for queries (ontology terms or protein domains) and a user interface view for rendering suggestions and handling hyperlinks from search results. Enrichment results from domain-based enrichment analysis are rendered using the bookdown R package (https://bookdown.org), which generates self-contained dynamic HTML files in the enrichment results page. The source code for the dcGO website is made available at GitHub (https://github.com/hfang-bristol/dcGO).

Results and discussion

Faceted search as a hub to explore the dcGO resource

The dcGO website offers a faceted search (Figure 1(B)) that lets users perform multiple tasks by following hyperlinks from the search results. This is enabled by a flexible JavaScript library for building robust typeaheads (see Materials and methods). The search engine supports full-text queries for protein domains and ontology terms. When users search for an ontology term, the results are hyperlinked to a term-specific page, which displays a table of annotated domains. Similarly, when searching for a particular protein domain, the results are hyperlinked to a domain-specific page, which displays a table of ontology terms used to annotate that protein domain. These tabular displays include annotation scores that quantify the support for annotations between domains and terms. By clicking on the hyperlinks provided, users can easily switch between domain-specific and term-specific pages. In short, the faceted search not only provides search results but also interconnects all database contents, enabling users to perform integrated mining of the dcGO resource.
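The idea of returning domains and ontology terms from a single query, grouped by facet, can be sketched as follows. This is a simplified illustration and not the typeahead-based implementation used by the website: the `RECORDS` index and the `faceted_search` function are hypothetical, although the identifiers and descriptions are taken from Tables 1 and 2 and from the EFO example in the text.

```python
from collections import defaultdict

# Hypothetical in-memory index: (identifier, description, facet).
# Facets follow the domain definitions and ontology categories of Figure 1(A).
RECORDS = [
    ("PF00017", "SH2 domain", "Pfam"),
    ("PF00018", "SH3 domain", "Pfam"),
    ("PF01582", "TIR domain", "Pfam"),
    ("EFO:0000540", "immune system disease", "Diseases"),
    ("GO:0002376", "immune system process", "Functions"),
    ("GO:0006952", "defense response", "Functions"),
]


def faceted_search(query, records=RECORDS):
    """Case-insensitive substring search; hits are grouped by facet."""
    needle = query.strip().lower()
    if not needle:
        return {}
    hits = defaultdict(list)
    for identifier, description, facet in records:
        if needle in identifier.lower() or needle in description.lower():
            hits[facet].append((identifier, description))
    return dict(hits)


print(faceted_search("immune"))
print(faceted_search("SH2"))
```

A single query such as `immune` returns hits under both the Diseases and Functions facets, which is what allows one search box to serve as an entry point to both term-specific and domain-specific pages.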

Browsing ontology hierarchy and annotated domains

The dcGO website features the ‘Ontology Hierarchy’ navigation that allows users to browse ontology hierarchies. Figure 1(A) summarises the ontologies currently supported in the database. As before, the most abundant annotations are seen for ontologies related to functions and phenotypes. The least abundant domain-centric annotations are seen for mitochondrial pathways, which have recently been added to the dcGO database. The ontology hierarchy has a node for each term and directed edges linking it to its child nodes. All direct children of the current node are listed underneath, allowing users to browse the hierarchy in a downward direction. In addition to the hierarchy itself, the toggle panels for domain-centric annotations are also displayed separately for SCOP, Pfam, and InterPro.

To illustrate how users can access ontologies and annotated domains, we take as an exemplar the EFO,³⁶ a newly added ontology in the dcGO database that enables domain-centric annotations with GWAS disease traits (Figure 2(A)). The hierarchy roots of all supported ontologies in dcGO can be found on the landing page, including the EFO root term ‘disease’ (EFO:0000408). This root term is hyperlinked to its detailed hierarchy page (Click 1 of Figure 2(A)), displaying its 35 child terms in a table. In this table, each child term [such as ‘immune system disease’ (EFO:0000540)] provides a hyperlink to both the hierarchy page and the term-specific page (Click 2 of Figure 2(A)). The term-specific page displays a table of annotated domains, grouped separately by SCOP, Pfam, and InterPro. For example, a total of 33 Pfam domains are annotated to the ‘immune system disease’ term, and these annotations are sorted by their annotation scores (Click 3 of Figure 2(A); also see Table 1). Users can explore these annotations using hyperlinks that lead to the domain-centric pages. In summary, the ontology hierarchy interfaces offer a more integrated and cohesive way to navigate ontology terms and annotated domains.

Figure 2. The integers in the hexagons denote sequential clicks. **(A)** Interfaces for exploring the ontology hierarchy and annotated domains. *Top-left*: the hierarchy page lists all supported ontologies, including Experimental Factor Ontology (EFO). *Top-right*: the EFO term ‘disease’ (EFO:0000408) and its child terms. Each child term provides a hyperlink to the hierarchy page and a hyperlink to the term-specific page. *Bottom*: the term-specific page for the child term ‘immune system disease’ (EFO:0000540), which lists the annotated domains separately for SCOP, Pfam and InterPro; for example, Pfam domain annotations (also listed in Table 1). **(B)** Domain-based ontology enrichment analysis for identifying enriched ontology terms from user-input protein domains. *Left*: the user-request interface, which takes a list of user-input protein domains and their matched domain type, available ontologies, and additional parameters for more control over the enrichment analysis and results. Enrichment results include a table (see Table 2) and a dot plot, all embedded into a self-contained dynamic HTML file available for exploration and download.

Table 1. List of Pfam domains annotated to the EFO term ‘immune system disease’.

Identifier	Description	Annotation score [1–100]
PF07654	Immunoglobulin C1-set domain	59
PF00969	Class II histocompatibility antigen, beta domain	39
PF00605	Interferon regulatory factor transcription factor	25
PF01023	S-100/ICaBP type calcium binding domain	23
PF01582	TIR domain	23
PF00017	SH2 domain	22
PF00229	TNF (Tumour Necrosis Factor) family	22
PF00020	TNFR/NGFR cysteine-rich region	22
PF00048	Small cytokines (intecrine/chemokine), interleukin-8 like	19
PF01108	Tissue factor	18
PF00619	Caspase recruitment domain	17
PF00008	EGF-like domain	16
PF03770	Inositol polyphosphate kinase	16
PF01017	STAT protein, all-alpha domain	16
PF02864	STAT protein, DNA binding domain	16
PF02865	STAT protein, protein interaction domain	16
PF09294	Interferon-alpha/beta receptor, fibronectin type III	15
PF10401	Interferon-regulatory factor 3	15
PF00129	Class I Histocompatibility antigen, domains alpha 1 and 2	14
PF00178	Ets-domain	14
PF00993	Class II histocompatibility antigen, alpha domain	13
PF00001	7 transmembrane receptor (rhodopsin family)	12
PF00023	Ankyrin repeat	12
PF00656	Caspase domain	11
PF07686	Immunoglobulin V-set domain	11
PF02198	Sterile alpha motif (SAM)/Pointed domain	11
PF07714	Protein tyrosine and serine/threonine kinase	10
PF00018	SH3 domain	10
PF07716	Basic region leucine zipper	8
PF00170	bZIP transcription factor	8
PF00173	Cytochrome b5-like Heme/Steroid binding domain	8
PF00130	Phorbol esters/diacylglycerol binding domain (C1 domain)	5
PF00169	PH domain	2

A new facility supporting domain-based ontology enrichment analysis

The dcGO resource provides a unique reference knowledgebase for domain-centric ontology annotations, and a new facility has been developed to perform enrichment analysis for user-input protein domains. This facility identifies enriched ontology terms from a list of protein domains, a feature not available in web-based enrichment analysis tools that focus on genes/proteins (for example, the DAVID web server³⁷). The user-request interface (Click 4 of Figure 2(B)) allows users to input a list of protein domains and their matched domain type, as well as select available ontologies (organised by category; see Figure 1(A)). Additional parameters can be specified to control the analysis and results. The interface provides an example showcase (that is, the 33 Pfam domains described above in Figure 2(A)). In the enrichment results page, the enriched ontology terms are presented in an interactive table, along with significance information such as Z-scores and FDR, and the member domains that overlap with the input domains (Click 5 of Figure 2(B); Table 2). The results are also illustrated in the ‘Dotplot of enriched ontology terms’ tab, which shows the top five terms with their respective Z-scores and FDR. All enrichment results are embedded into a self-contained dynamic HTML file, which can be downloaded and explored interactively in a new browser window.

Table 2. List of top 5 enriched GOBP terms.

Term ID	Term Name	Z-score	FDR	Num	Member domains
GO:0002376	immune system process	13.2	6.90E-17	22	PF00001, PF00008, PF00017, PF00018, PF00020, PF00048, PF00129, PF00130, PF00169, PF00229, PF00605, PF00619, PF00656, PF00969, PF00993, PF01108, PF01582, PF07654, PF07686, PF07714, PF07716, PF10401
GO:0048522	positive regulation of cellular process	9.19	3.30E-14	29	PF00001, PF00008, PF00017, PF00018, PF00020, PF00023, PF00048, PF00129, PF00130, PF00169, PF00170, PF00178, PF00229, PF00605, PF00619, PF00656, PF00969, PF00993, PF01017, PF01023, PF01582, PF02198, PF02864, PF02865, PF07654, PF07686, PF07714, PF07716, PF10401
GO:0002684	positive regulation of immune system process	12.3	7.70E-13	16	PF00001, PF00017, PF00018, PF00020, PF00048, PF00129, PF00130, PF00169, PF00229, PF00619, PF00969, PF00993, PF01582, PF07654, PF07686, PF07714
GO:0006952	defense response	11.4	9.60E-13	18	PF00001, PF00017, PF00018, PF00020, PF00048, PF00129, PF00605, PF00619, PF01017, PF01023, PF01108, PF01582, PF02864, PF02865, PF07654, PF07714, PF09294, PF10401
GO:0006950	response to stress	9.55	9.60E-13	24	PF00001, PF00008, PF00017, PF00018, PF00020, PF00023, PF00048, PF00129, PF00130, PF00169, PF00170, PF00605, PF00619, PF01017, PF01023, PF01108, PF01582, PF02864, PF02865, PF07654, PF07714, PF07716, PF09294, PF10401
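The columns of Table 2 (Z-score, FDR, Num, and member domains) can be illustrated with a small sketch of an over-representation analysis. The statistical details here are assumptions rather than the documented dcGO procedure: the sketch uses a hypergeometric test with Benjamini-Hochberg correction and defines the Z-score as the observed overlap standardised by the hypergeometric mean and variance. The function names, the background of ten domains, and the two terms are hypothetical.

```python
from math import comb, sqrt


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when drawing n domains from N, of which K carry the term."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def enrich(input_domains, term_to_domains, background):
    """Return one row per term, sorted by FDR (most significant first)."""
    universe = set(background)
    query = set(input_domains) & universe
    N, n = len(universe), len(query)
    rows = []
    for term, members in term_to_domains.items():
        annotated = set(members) & universe
        overlap = query & annotated
        K, k = len(annotated), len(overlap)
        # Mean and variance of the hypergeometric distribution.
        expected = n * K / N
        variance = (n * (K / N) * (1 - K / N) * (N - n) / (N - 1)
                    if N > 1 else 0.0)
        z = (k - expected) / sqrt(variance) if variance > 0 else 0.0
        rows.append({"term": term, "z": z,
                     "p": hypergeom_upper_tail(k, N, K, n),
                     "num": k, "members": sorted(overlap)})
    for row, fdr in zip(rows, benjamini_hochberg([r["p"] for r in rows])):
        row["fdr"] = fdr
    return sorted(rows, key=lambda r: r["fdr"])


# Hypothetical toy data: ten background domains, two annotated terms.
background = [f"D{i}" for i in range(1, 11)]
term_to_domains = {"term_A": {"D1", "D2", "D3", "D4"},
                   "term_B": {"D3", "D8", "D9", "D10"}}

results = enrich(["D1", "D2", "D3"], term_to_domains, background)
for row in results:
    print(row["term"], round(row["z"], 2), round(row["fdr"], 4),
          row["num"], ", ".join(row["members"]))
```

Here `term_A` covers all three input domains and ranks first, whereas `term_B` overlaps by a single domain and is not enriched.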

Conclusion

In this updated version of the dcGO resource, our continued focus is on providing systematic mappings from protein domains to ontologies. We are excited to introduce a new website with enhanced data analyses and a unique facility for identifying ontology knowledge enrichments from the perspective of domain-centric annotations. Our commitment to updating the resource twice a year includes integrating information from our previously established resources such as XGR,³⁸ SUPERFAMILY,³⁹ and Priority index.^40–42 Looking to the future, we are also excited to explore the potential of large language models⁴³ in generating domain-centric ontologies, following their success in generating functional protein sequences.⁴⁴

Acknowledgements

This work was supported by the following funding sources: National Natural Science Foundation of China [32170663 to H. Fang]; Shanghai Pujiang Program [21PJ1409600 to H. Fang]; Program for Professor of Special Appointment (Eastern Scholar) at Shanghai Institutions of Higher Learning [H. Fang]; and Innovative Research Team of High-Level Local Universities in Shanghai.

Footnotes

CRediT authorship contribution statement

Chaohui Bao: Data curation, Writing – original draft. Chang Lu: Data curation, Writing – review & editing. James Lin: Writing – review & editing. Julian Gough: Conceptualization, Resources. Hai Fang: Conceptualization, Supervision, Data curation, Resources, Funding acquisition, Writing – original draft, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability

All dcGO data and online tools are provided to the public free of charge.

References

1. Tunyasuvunakool K, Adler J, Wu Z, Green T, Zielinski M, Zidek A, Bridgland A, Cowie A, et al. Highly accurate protein structure prediction for the human proteome. Nature. 2021;596:590–596. doi: 10.1038/s41586-021-03828-1.
2. Scaiewicz A, Levitt M. The language of the protein universe. Curr Opin Genet Dev. 2015;35:50–56. doi: 10.1016/j.gde.2015.08.010.
3. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25:25–29. doi: 10.1038/75556.
4. de Morais DAL, Fang H, Rackham OJL, Wilson D, Pethica R, Chothia C, Gough J. SUPERFAMILY 1.75 including a domain-centric gene ontology method. Nucleic Acids Res. 2011;39:D427–D434. doi: 10.1093/nar/gkq1130.
5. Fang H, Gough J. dcGO: database of domain-centric ontologies on functions, phenotypes, diseases and more. Nucleic Acids Res. 2013;41:D536–D544. doi: 10.1093/nar/gks1080.
6. Fang H, Gough J. A domain-centric solution to functional genomics via dcGO Predictor. BMC Bioinf. 2013;14:1–11. doi: 10.1186/1471-2105-14-S3-S9.
7. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013;10:221–227. doi: 10.1038/nmeth.2340.
8. Lu C, Zaucha J, Gam R, Fang H, Smithers B, Oates ME, Bernabe-Rubio M, Williams J, et al. Hypothesis-free phenotype prediction within a genetics-first framework. Nat Commun. 2023;14:919. doi: 10.1038/s41467-023-36634-6.
9. Murzin AG, Brenner SE, Hubbard T, Chothia C. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995;247:536–540. doi: 10.1006/jmbi.1995.0159.
10. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar GA, Sonnhammer ELL, Tosatto SCE, Paladin L, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021;49:D412–D419. doi: 10.1093/nar/gkaa913.
11. Blum M, Chang HY, Chuguransky S, Grego T, Kandasaamy S, Mitchell A, Nuka G, Paysan-Lafosse T, et al. The InterPro protein families and domains database: 20 years on. Nucleic Acids Res. 2021;49:D344–D354. doi: 10.1093/nar/gkaa977.
12. Fang H. dcGOR: an R package for analysing ontologies and protein domain annotations. PLoS Comput Biol. 2014;10:e1003929. doi: 10.1371/journal.pcbi.1003929.
13. Oates ME, Stahlhacke J, Vavoulis D, Smithers B, Rackham O, Sardar A, Zaucha J, Thurlby N, et al. a doubling of data. Nucleic Acids Res. 2015;43:D227–D233. doi: 10.1093/nar/gku1041.
14. Tunkelang D. Faceted Search. Springer; Cham: 2009.
15. Benjamini Y, Hochberg Y. Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J R Stat Soc Ser B-Methodological. 1995;57:289–300. doi: 10.1111/j.2517-6161.1995.tb02031.x.
16. Ashburner M, Ball CA, Blake JA, Butler H, Cherry JM, Eppig JT, Harris M, Hill DP, et al. Creating the Gene Ontology resource: Design and implementation. Genome Res. 2001;11:1425–1433. doi: 10.1101/gr.180801.
17. Carbon S, Douglass E, Good BM, Unni DR, Harris NL, Mungall CJ, Basu S, Chisholm RL, et al. The Gene Ontology resource: Enriching a GOld mine. Nucleic Acids Res. 2021;49:D325–D334. doi: 10.1093/nar/gkaa1113.
18. Kohler S, Gargano M, Matentzoglu N, Carmody LC, Lewis-Smith D, Vasilevsky NA, Danis D, Balagura G, et al. The human phenotype ontology in 2021. Nucleic Acids Res. 2021;49:D1207–D1217. doi: 10.1093/nar/gkaa1043.
19. Bogue MA, Philip VM, Walton DO, Grubb SC, Dunn MH, Kolishovski G, Emerson J, Mukherjee G, et al. Mouse Phenome Database: A data repository and analysis suite for curated primary mouse phenotype data. Nucleic Acids Res. 2020;48:D716–D723. doi: 10.1093/nar/gkz1032.
20. Harris TW, Arnaboldi V, Cain S, Chan J, Chen WJ, Cho J, Davis P, Gao S, et al. WormBase: A modern Model Organism Information Resource. Nucleic Acids Res. 2020;48:D762–D767. doi: 10.1093/nar/gkz920.
21. Gramates LS, Agapite J, Attrill H, Calvi BR, Crosby MA, dos Santos G, Goodman JL, Goutte-Gattat D, et al. FlyBase: a guided tour of highlighted features. Genetics. 2022;220. doi: 10.1093/genetics/iyac035.
22. Bradford Y, Conlin T, Dunn N, Fashena D, Frazer K, Howe DG, Knight J, Mani P, et al. ZFIN: enhancements and updates to the Zebrafish Model Organism Database. Nucleic Acids Res. 2011;39:D822–D829. doi: 10.1093/nar/gkq1077.
23. Lamesch P, Berardini TZ, Li D, Swarbreck D, Wilks C, Sasidharan R, Muller R, Dreher K, et al. The Arabidopsis Information Resource (TAIR): Improved gene annotation and new tools. Nucleic Acids Res. 2012;40:1202–1210. doi: 10.1093/nar/gkr1090.
24. Shefchek KA, Harris NL, Gargano M, Matentzoglu N, Unni D, Brush M, Keith D, Conlin T, et al. An integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2020;48:D704–D715. doi: 10.1093/nar/gkz997.
25. Sollis E, Mosaku A, Abid A, Buniello A, Cerezo M, Gil L, Groza T, Güneş O, et al. The NHGRI-EBI GWAS Catalog: knowledgebase and deposition resource. Nucleic Acids Res. 2023;51:D977–D985. doi: 10.1093/nar/gkac1010.
26. Freshour SL, Kiwala S, Cotto KC, Coffman AC, McMichael JF, Song JJ, Griffith M, Griffith OL, et al. Integration of the Drug-Gene Interaction Database (DGIdb 4.0) with open crowdsource efforts. Nucleic Acids Res. 2021;49:D1144–D1151. doi: 10.1093/nar/gkaa1084.
27. Ochoa D, Hercules A, Carmona M, Suveges D, Baker J, Malangone C, Lopez I, Miranda A, et al. The next-generation Open Targets Platform: reimagined, redesigned, rebuilt. Nucleic Acids Res. 2023;51:D1353–D1359. doi: 10.1093/nar/gkac1046.
28. Kanehisa M, Furumichi M, Sato Y, Kawashima M, Ishiguro-Watanabe M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 2023;51:D587–D592. doi: 10.1093/nar/gkac963.
29. Gillespie M, Jassal B, Stephan R, Milacic M, Rothfels K, Senff-Ribeiro A, Griss J, Sevilla C, et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50:D687–D692. doi: 10.1093/nar/gkab1028.
30. Thomas PD, Ebert D, Muruganujan A, Mushayahama T, Albou LP, Mi H. PANTHER: Making genome-scale phylogenetics accessible to all. Protein Sci. 2022;31:8–22. doi: 10.1002/pro.4218.
31. Martens M, Ammar A, Riutta A, Waagmeester A, Slenter DN, Hanspers K, Miller RA, Digles D, et al. WikiPathways: Connecting communities. Nucleic Acids Res. 2021;49:D613–D621. doi: 10.1093/nar/gkaa1024.
32. Rath S, Sharma R, Gupta R, Ast T, Chan C, Durham TJ, Goodman RP, Grabarek Z, et al. MitoCarta3.0: An updated mitochondrial proteome now with sub-organelle localization and pathway annotations. Nucleic Acids Res. 2021;49:D1541–D1547. doi: 10.1093/nar/gkaa1011.
33. Xie Z, Bailey A, Kuleshov MV, Clarke DJB, Evangelista JE, Jenkins SL, Lachmann A, Wojciechowicz ML, et al. Gene Set Knowledge Discovery with Enrichr. Curr Protoc. 2021;1:e90. doi: 10.1002/cpz1.90.
34. Han H, Cho J-W, Lee S, Yun A, Kim H, Bae D, Yang S, Kim CY, et al. TRRUST v2: an expanded reference database of human and mouse transcriptional regulatory interactions. Nucleic Acids Res. 2018;46:D380–D386. doi: 10.1093/nar/gkx1013.
35. Liberzon A, Birger C, Thorvaldsdottir H, Ghandi M, Mesirov JP, Tamayo P. The molecular signatures database hallmark gene set collection. Cell Syst. 2015;1:417–425. doi: 10.1016/j.cels.2015.12.004.
36. Buniello A, Macarthur JAL, Cerezo M, Harris LW, Hayhurst J, Malangone C, McMahon A, Morales J, et al. The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic Acids Res. 2019;47:D1005–D1012. doi: 10.1093/nar/gky1120.
37. Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, Imamichi T, Chang W, et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 2022;50:W216–W221. doi: 10.1093/nar/gkac194.
38. Fang H, Knezevic B, Burnham KL, Knight JC. XGR software for enhanced interpretation of genomic summary data, illustrated by application to immunological traits. Genome Med. 2016;8:1–20. doi: 10.1186/s13073-016-0384-y.
39. Gough J, Karplus K, Hughey R, Chothia C. Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol. 2001;313:903–919. doi: 10.1006/jmbi.2001.5080.
40. Fang H, De Wolf H, Knezevic B, Burnham KL, Osgood J, Sanniti A, Lledo Lara A, Kasela S, et al. The ULTRA-DD Consortium. A genetics-led approach defines the drug target landscape of 30 immune-related traits. Nat Genet. 2019;51:1082–1091. doi: 10.1038/s41588-019-0456-1.
41. Fang H, Knight JC. Priority index: database of genetic targets in immune-mediated disease. Nucleic Acids Res. 2022;50:D1358–D1367. doi: 10.1093/nar/gkab994.
42. Fang H. PiER: web-based facilities tailored for genetic target prioritisation harnessing human disease genetics, functional genomics and protein interactions. Nucleic Acids Res. 2022;50:W583–W592. doi: 10.1093/nar/gkac379.
43. Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–1901. doi: 10.48550/arXiv.2005.14165.
44. Madani A, Krause B, Greene ER, Subramanian S, Mohr BP, Holton JM, Olmos JL, Xiong C, et al. Large language models generate functional protein sequences across diverse families. Nat Biotechnol. 2023. doi: 10.1038/s41587-022-01618-2.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

All dcGO data and online tools are provided to the public free of charge.

