With this issue, GENETICS expands its scope to include publication of a new article type: “Knowledgebase and Database Resources.” This article type encompasses knowledgebases that “have the primary function to extract, accumulate, organize, annotate, and link growing bodies of information related to core datasets” and databases, which are repositories that “on the other hand accept submission of relevant data from the community to store, organize, validate, archive, preserve and distribute” (https://grants.nih.gov/grants/guide/pa-files/par-20-097.html).
The issue includes 8 knowledgebase articles: 6 highlight advances in established model organism knowledgebase projects for zebrafish (Zebrafish Information Network or ZFIN, Bradford et al. 2022), budding yeast (Saccharomyces Genome Database or SGD, Engel et al. 2021), fission yeast (PomBase, Harris et al. 2021), Caenorhabditis elegans (WormBase, Davis et al. 2022), Drosophila (FlyBase, Gramates et al. 2022), and rat (Rat Genome Database or RGD, Vedi et al. 2022); one describes a new project, on a second fission yeast, Schizosaccharomyces japonicus (JaponicusDB, Rutherford et al. 2022); and one updates an effort to integrate data and process for several of the most highly used knowledgebases, the Alliance of Genome Resources (2022). Throughout this editorial, we will retain the traditional term “Model Organism Database” (MOD) for these knowledgebases rather than introducing yet another, albeit more appropriate, description. To accompany the MOD reports, Jonathan Hodgkin, the former nomenclature coordinator for C. elegans, provides a “worm-centric” Perspective (Hodgkin 2022).
Model organisms represent irreplaceable tools for the understanding of biological systems and have shaped biological knowledge for over a century. Despite technical advances allowing more experiments to be performed directly in cultured human cells or artificially derived organs (“organoids”), basic research in model species remains the cornerstone for identification and description of gene product function (Fields and Johnston 2005). Indeed, even the interpretation of the term “model organism” as “an organism for which a wealth of tools and resources exist” is so pervasive that it is increasingly used to refer to only one of the small number of species supported by any particular dedicated MOD (Russell et al. 2017).
Expertly curated data are critical to the efficient and successful application of model species as research tools (e.g. Oliver et al. 2016; Bellen et al. 2021; Lipshitz 2021). The primary aim of MODs is to systemize and integrate heterogeneous published knowledge for the species of interest in accordance with FAIR (Findable, Accessible, Interoperable, and Reusable) sharing principles (Wilkinson et al. 2016). The first online MODs were established in the 1990s (Lipshitz 2021), when the primary focus was data collection—identifying genes, refining gene structures, collecting genetic interactions, phenotypes, and functions. In the first decade of the 21st century, the focus expanded from data collection to standardization as MODs embraced the use of ontologies (sophisticated graph-based dictionaries) to systematically describe the biological attributes of individual gene products in a way that can be understood by both humans and computers (Ashburner and Lewis 2002). A large part of the MOD biocurators’ workload is to contribute to the development, maintenance, and revision of these bio-ontologies. Currently, the MOD annotation corpus encompasses almost 4 million experimentally supported function and phenotype statements extracted from over 200,000 publications (Alliance of Genome Resources 2022), but these statements represent only a fraction of the knowledge buried in the scientific literature. The past decade has seen a further shift in focus for the MODs as they have moved beyond standardized descriptions; increasingly, the curated statements are used to build causal knowledge graphs, where entities (gene products) are connected to each other and to ontology terms by different relationships to provide complex qualitative models of known biology (Thomas et al. 2019; Alliance of Genome Resources 2022). 
MODs provide a community hub with deep connections to the biological researchers they represent and provide extensive outreach in the form of tailored advice, workshops, gene name arbitration, discussion forums, and even a mechanism to publish and deposit brief novel findings (Raciti et al. 2018). To actively facilitate research advances, MODs are also designed, developed, and maintained in close consultation with the model organism research community they support (Marygold et al. 2016; Lock et al. 2019; Harris et al. 2020). Genetics research and the MODs form a virtuous cycle, with research feeding knowledge to the MODs, which in turn facilitate research.
Looking forward, the workload of biocurators at MODs can be classified into three monumental challenges of critical importance to biological research. The first is to systematically describe biological variation by connecting diverse and detailed phenotypes to specific alleles and variations (see Bradford et al. 2022; Davis et al. 2022; Engel et al. 2021; Harris et al. 2021). The second is to describe the “normal” functions of gene products and the connections among them through assembly into pathways (see Alliance of Genome Resources 2022; Gramates et al. 2022; Harris et al. 2021). The third is to ensure that curation is aligned with current knowledge. Extensive quality control procedures are implemented across the MODs to remove erroneous historic annotations and ontology terms (Wood et al. 2020; Carbon et al. 2021). For example, between 2019 and 2021, the MODs and Swiss-Prot revised 16,000 Gene Ontology (GO) annotations supported by 3,272 scientific articles as a result of review processes (Carbon et al. 2021).
Together these three goals currently constitute the major workload of the MOD biocurators and are essential to fully decipher the molecular bases of biology and disease; but in each of these areas, MODs are only able to scratch the surface with currently available—sadly, shrinking—resources. Outwardly, the MODs may appear to represent data silos because they comprise multiple independent resources. In reality, they are more analogous to a federated network since the curated data within them is made interoperable by the proactive use of shared standards, data models, tools, ontologies, quality control pipelines, and data exchange formats wherever possible (Alliance of Genome Resources Consortium 2020; Kishore et al. 2020; Wood et al. 2020; Carbon et al. 2021). Interoperability is increasingly consolidated by the work of the Alliance of Genome Resources (2022), which is standardizing the remaining concepts used by the eukaryotic MODs.
Machine learning (ML) and artificial intelligence (AI)—especially natural language processing—are sometimes considered by funding bodies as a potential panacea to solve the “biocuration problem.” However, these approaches need lots of data in a standard format and training sets defined by experts. The MODs produce both. In return, ML/AI approaches are helpful in identifying papers to curate, identifying entities (gene products) and, in specific cases, can extract knowledge from text (Muller et al. 2018). ML and AI perform best with well-constrained data (numerical values, limited classes)—curated data are categorical, highly complex, and with hundreds of thousands of heterogeneous classes, often not explicitly labeled in text or figures. Better performance for knowledge extraction might well be achieved with larger volumes of expertly curated data, but will still require continual expert validation.
Several of the subject species of the MOD reports in this issue represent the longest standing complete eukaryotic genomes, but their primary annotations of genes continue to evolve. As examples, Engel et al. (2021) describe newly annotated genes in Saccharomyces cerevisiae, while WormBase has added predictions of new transcripts based on multiple-species alignment (Davis et al. 2022). Even when gene sets are stable, nomenclature evolves; this can be seen in yeast, which has a thousand genes with names based on ORFs. SGD has made significant progress rationalizing gene names (Engel et al. 2021). Tracking merges and splits in gene identifiers is a core function of these knowledgebases, as is tracking the multiple names for the same gene. Indeed, PomBase (Harris et al. 2021) and WormBase (Davis et al. 2022) both mention ID mapping as an important part of the service provided, and RGD’s Multi-Ontology Enrichment Tool (MOET) does ID mapping for the mammalian strains covered (Vedi et al. 2022).
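To make the idea of tracking merges and splits concrete, the sketch below resolves a retired gene identifier to its current successor(s). It is an illustration only: the identifiers, the `HISTORY` table, and the function name `resolve_current_ids` are hypothetical and do not reflect the data model of any particular MOD.

```python
from typing import Dict, List, Set

# Hypothetical history table: retired ID -> list of successor IDs.
# One successor models a merge or rename; several successors model a split.
HISTORY: Dict[str, List[str]] = {
    "GENE:0001": ["GENE:0003"],               # 0001 merged into 0003
    "GENE:0002": ["GENE:0003"],               # 0002 merged into 0003
    "GENE:0004": ["GENE:0005", "GENE:0006"],  # 0004 split into two genes
    "GENE:0005": ["GENE:0007"],               # 0005 later replaced by 0007
}


def resolve_current_ids(gene_id: str, history: Dict[str, List[str]]) -> Set[str]:
    """Follow successor links until only IDs with no successor remain."""
    current: Set[str] = set()
    seen: Set[str] = set()  # guards against accidental cycles in the table
    stack = [gene_id]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        successors = history.get(node)
        if successors:
            stack.extend(successors)
        else:
            current.add(node)  # no recorded successor: treat as current
    return current
```

An identifier that never appears in the history table is returned unchanged, which lets the same call be applied to a gene list that mixes old and current identifiers.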
Displays of the hard-won curation continue to evolve. FlyBase (Gramates et al. 2022) has an esthetically pleasing multitime course display for ChIP-seq and RNA-seq data (TopoView). Given that many of these MODs have similar data yet each has unique displays, it seems that more and better use could be made of the curated data. Vedi et al. (2022) describe a new version of their comprehensive multiontology enrichment tool that analyzes a set of genes for statistically enriched gene ontology, phenotype, human disease, chemical interaction, and pathway terms for several mammalian species. This is a significant step forward, and the approach should be extensible to more species and additional ontologies to which gene annotations are made. WormBase (Davis et al. 2022) discusses a text-mining approach (WormiCloud) to find clusters of terms and then feeds them into a 3-ontology enrichment tool. We look forward to seeing the best features of all of these tools and displays combined and broadly used.
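As a rough illustration of what "statistically enriched" can mean for a gene set, the sketch below computes a one-sided hypergeometric p-value for a single ontology term. This is a common choice for enrichment analysis, but it is an assumption here: this editorial does not describe the statistics used by MOET or by the WormBase tool, and the function name and parameters are hypothetical. A real tool would also correct for testing many terms at once.

```python
from math import comb


def hypergeom_enrichment_p(population: int, annotated: int,
                           sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes by chance.

    population: number of genes in the background set
    annotated:  background genes annotated to the term
    sample:     number of genes in the query list
    hits:       query genes annotated to the term
    """
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie within the population")
    if not (0 <= hits <= min(annotated, sample)):
        raise ValueError("hits cannot exceed annotated or sample")
    total = comb(population, sample)  # all ways to draw the query list
    # Sum the upper tail: exactly k annotated genes, for k = hits .. maximum.
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, min(annotated, sample) + 1)
    )
    return tail / total


# Example call with made-up counts: 20,000 background genes, 150 annotated
# to a term, and a 40-gene query list containing 6 annotated genes.
p_value = hypergeom_enrichment_p(20000, 150, 40, 6)
```

The same calculation applies unchanged to any ontology for which gene annotations exist, which is one reason this style of analysis extends naturally to additional species and ontologies.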
The various MOD articles touch briefly on how authors can contribute to the knowledgebases. One approach is for authors to index their articles postpublication, noting the types of information and whether they describe new genes, phenotypes, variants, etc. FlyBase has “Fast Track Your Paper” (Gramates et al. 2022) and WormBase has “Author First Pass” (Davis et al. 2022), both of which implement this strategy. Most MODs have webforms for submitting data. PomBase has a successful community curation program where authors curate their own papers in full, and curation is approved by professional curators (Harris et al. 2021). The PomBase system has also been successfully implemented for JaponicusDB, but there the community is responsible for both curation and checking (Harris et al. 2021; Rutherford et al. 2022). As an effort to garner more control over the way authors and knowledgebases interact, MODs such as SGD and PomBase discuss their use of the microPublication Biology platform (Raciti et al. 2018).
There is a heartening trend toward coordination. The S. japonicus knowledgebase uses the infrastructure of PomBase. The Alliance of Genome Resources is leveling (and raising!) the playing field for its member organisms. The genetics community deserves and needs the best features of all of these knowledgebases to facilitate and expand its range of research. The articles in this issue drive the knowledgebases upwards to meet this aspiration.
Literature cited
- Alliance of Genome Resources Consortium. Alliance of Genome Resources Portal: unified model organism research platform. Nucleic Acids Res. 2020;48:D650–D658.
- Alliance of Genome Resources Consortium. Harmonizing model organism data in the Alliance of Genome Resources. Genetics. 2022; 10.1093/genetics/iyac022.
- Ashburner M, Lewis S. On ontologies for biologists: the Gene Ontology–untangling the web. Novartis Found Symp. 2002;247:66–80; discussion 80–83, 84–90, 244–252.
- Bellen HJ, Hubbard EJA, Lehmann R, Madhani HD, Solnica-Krezel L, Southard-Smith EM. Model organism databases are in jeopardy. Development. 2021;148(19):dev200193.
- Bradford YM, Van Slyke CE, Ruzicka L, Singer A, Eagle A, Fashena D, Howe DG, Frazer K, Martin R, Paddock H, et al. Zebrafish Information Network, the knowledgebase for Danio rerio research. Genetics. 2022; 10.1093/genetics/iyac016.
- Carbon S, Douglass E, Good BM, Unni DR, Harris NL, Mungall CJ, Basu S, Chisholm RL, Dodson RJ, Hartline E, et al.; The Gene Ontology Consortium. The Gene Ontology resource: enriching a GOld mine. Nucleic Acids Res. 2021;49(D1):D325–D334.
- Davis P, Zarowiecki M, Arnaboldi V, Becerra A, Cain S, Chan J, Chen WJ, Cho J, da Veiga Beltrame E, Diamantakis S, et al. WormBase in 2022—data, processes, and tools for analyzing Caenorhabditis elegans. Genetics. 2022; 10.1093/genetics/iyac003.
- Engel SR, Wong ED, Nash RS, Aleksander S, Alexander M, Douglass E, Karra K, Miyasato SR, Simison M, Skrzypek MS, et al. New data and collaborations at the Saccharomyces Genome Database: updated reference genome, alleles, and the Alliance of Genome Resources. Genetics. 2021; 10.1093/genetics/iyab224.
- Fields S, Johnston M. Cell biology. Whither model organism research? Science. 2005;307(5717):1885–1886.
- Gramates LS, Agapite J, Attrill H, Calvi BR, Crosby MA, dos Santos G, Goodman JL, Goutte-Gattat D, Jenkins VK, Kaufman T, et al. FlyBase: A guided tour of highlighted features. Genetics. 2022; 10.1093/genetics/iyac035.
- Harris TW, Arnaboldi V, Cain S, Chan J, Chen WJ, Cho J, Davis P, Gao S, Grove CA, Kishore R, et al. WormBase: a modern Model Organism Information Resource. Nucleic Acids Res. 2020;48(D1):D762–D767.
- Harris MA, Rutherford KM, Hayles J, Lock A, Bähler J, Oliver SG, Mata J, Wood V. Fission stories: Using PomBase to understand Schizosaccharomyces pombe biology. Genetics. 2021; 10.1093/genetics/iyab222.
- Hodgkin J. Rectifying nematode names. Genetics. 2022; 10.1093/genetics/iyab228.
- Kishore R, Arnaboldi V, Van Slyke CE, Chan J, Nash RS, Urbano JM, Dolan ME, Engel SR, Shimoyama M, Sternberg PW, et al. Automated generation of gene summaries at the Alliance of Genome Resources. Database (Oxford). 2020;2020:baaa037.
- Lipshitz HD. The descent of databases. Genetics. 2021;217(3):iyab023.
- Lock A, Rutherford K, Harris MA, Hayles J, Oliver SG, Bähler J, Wood V. PomBase 2018: user-driven reimplementation of the fission yeast database provides rapid and intuitive access to diverse, interconnected information. Nucleic Acids Res. 2019;47(D1):D821–D827.
- Marygold SJ, Crosby MA, Goodman JL; FlyBase Consortium. Using FlyBase, a database of Drosophila genes and genomes. Methods Mol Biol. 2016;1478:1–31.
- Muller HM, Van Auken KM, Li Y, Sternberg PW. Textpresso central: a customizable platform for searching, text mining, viewing, and curating biomedical literature. BMC Bioinformatics. 2018;19(1):94.
- Oliver SG, Lock A, Harris MA, Nurse P, Wood V. Model organism databases: essential resources that need the support of both funders and users. BMC Biol. 2016;14:49.
- Raciti D, Yook K, Harris TW, Schedl T, Sternberg PW. Micropublication: incentivizing community curation and placing unpublished data into the public domain. Database (Oxford). 2018;2018:bay013.
- Russell JJ, Theriot JA, Sood P, Marshall WF, Landweber LF, Fritz-Laylin L, Polka JK, Oliferenko S, Gerbich T, Gladfelter A, et al. Non-model model organisms. BMC Biol. 2017;15(1):55.
- Rutherford KM, Harris MA, Oliferenko S, Wood V. JaponicusDB: Rapid deployment of a model organism database for an emerging model species. Genetics. 2022; 10.1093/genetics/iyab223.
- Thomas PD, Hill DP, Mi H, Osumi-Sutherland D, Van Auken K, Carbon S, Balhoff JP, Albou L-P, Good B, Gaudet P, et al. Gene Ontology Causal Activity Modeling (GO-CAM) moves beyond GO annotations to structured descriptions of biological functions and systems. Nat Genet. 2019;51(10):1429–1433.
- Vedi M, Nalabolu HS, Lin C-W, Hoffman MJ, Smith JR, Brodie K, De Pons JL, Demos WM, Gibson AC, Hayman GT, et al. MOET: a web-based gene set enrichment tool at the Rat Genome Database for multiontology and multispecies analyses. Genetics. 2022; 10.1093/genetics/iyac005.
- Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3(1):160018.
- Wood V, Carbon S, Harris MA, Lock A, Engel SR, Hill DP, Van Auken K, Attrill H, Feuermann M, Gaudet P, et al. Term Matrix: a novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns. Open Biol. 2020;10(9):200149.