Briefings in Bioinformatics. 2025 Dec 11;26(6):bbaf659. doi: 10.1093/bib/bbaf659

Strategies for community-sourced biocuration in bioinformatics: a case study on MIBiG 4.0

Kai Blin 1, Catarina Loureiro 2, Nico L L Louwen 3, Jorge C Navarro-Muñoz 4, Hans Gerstmans 5,6,7, Serina L Robinson 8, Adriano Rutz 9, Zachary L Reitz 10, Drew T Doering 11, Justin J J van der Hooft 12,13, Tilmann Weber 14, Marnix H Medema 15,2,, Mitja M Zdouc 16,2,
PMCID: PMC12696713  PMID: 41378881

Abstract

Biocuration is essential to transform molecular sequence data into standardized, machine-readable resources. Such curated datasets enable comparative analysis, predictive modeling, and data integration across bioinformatics platforms. While professional biocuration is resource-intensive and usually limited to institutional settings, community-driven approaches can mobilize large-scale annotation of specialized datasets and are more resilient to disruptions in scientific funding. Here, we present a model for community-powered curation applied to the Minimum Information about a Biosynthetic Gene Cluster (MIBiG) repository. Through a framework of workflows for metadata capture, annotation validation, and contributor coordination, the MIBiG 4.0 initiative recruited 267 scientists across 178 institutions from 33 countries, volunteering an estimated 4000 h of work. These efforts expanded the MIBiG repository by 22% and enhanced its usability in downstream applications such as comparative genomics, natural product discovery, and machine learning. We provide strategies and actionable lessons for adopting this model, supporting the sustainability of curated bioinformatics resources central to nucleic acid research and related fields.

Keywords: biocuration, open science, data standards, genome mining, secondary metabolites

Graphical Abstract


Alt text: A graphical abstract showing the interactions between the MIBiG Core Team, the Interest Group Coordinators, and the MIBiG Contributors and Reviewers in collaborative data curation, facilitated by social and technical workflows.

Introduction

Biological research activities, such as (meta)genome sequencing, generate large quantities of data, but the interpretations of their results are customarily published in narrative form as journal articles. Despite the development of data standards and associated online repositories for several key types of biological data, characterizations of various biological and biochemical entities are still frequently released as heterogeneous, unstructured data files lacking metadata. Such data formats are difficult to interpret computationally, hindering large-scale comparison, reuse, and data integration. Biocuration is the process of transforming these dispersed and unstructured data into structured, computable resources [1]. It involves tasks such as extracting and validating biological information from experimental studies, harmonizing terminology and metadata, and integrating the results into existing knowledge frameworks such as databases or ontologies. Consequently, biocuration serves as a bridge between primary research outputs and machine-readable data, enabling standardized and comprehensive computational analysis, supporting machine-learning applications, and promoting FAIR (Findable, Accessible, Interoperable, and Reusable) data principles [2–5]. Despite its crucial importance to nucleic acid research, biocuration faces challenges. Performed by highly educated and experienced experts, professional biocuration is resource-intensive and usually limited to institutional settings [6]. Even though the benefits of biocuration are estimated to exceed 20 times its direct operational costs [1], traditional academic crediting systems (e.g. citation counts, article publications) are rarely applicable to biocuration, reducing its visibility [2].
While text mining and large language models (LLMs) may be used in domains such as sequence annotation and biosynthetic gene cluster (BGC) annotation [7–9], they still require expert oversight and validation and should be considered preprocessing tools rather than autonomous actors [10–14]. As an inherently manual process, biocuration is a major rate-limiting step in the interpretation and subsequent systematization of biological data, directly impacting the availability of high-quality molecular datasets and resources usable in comparative genomics or the training of machine learning algorithms.

To address this bottleneck, crowd-sourced and community-driven biocuration strategies have been developed. For instance, biocuration initiatives centered on education benefit from a large pool of (under)graduate participants and can produce high-quality annotations while supporting learning outcomes, but they are usually limited to a single institution and require substantial initial training and ongoing supervision [15, 16]. Another example is the general-purpose knowledge base Wikidata, which provides a readily available, scalable, and community-driven platform for expert data curation and has been successfully applied to the curation of biosynthetic pathways [17]. However, its underlying data structure and editing conventions require familiarization, and the absence of dedicated data validation and review mechanisms presents a potential barrier to Wikidata’s adoption by projects that demand stringent data accuracy. Furthermore, domain-specific content may fall outside Wikidata’s notability guidelines, requiring additional considerations before data deposition. These limitations reduce the practicality of current crowd-sourced biocuration strategies for specialist initiatives that require rigorous data validation and editorial oversight, such as BGC databases, enzyme function annotation, and other molecular data resources. There is a need for platforms that offer a low barrier to entry, effectively use the time and expertise of domain experts, are cross-institutional, and can accommodate a large number of participants. At the same time, such models need to ensure sustainability, governance, and attribution mechanisms, while also producing standardized data outputs that integrate well into computational bioinformatics pipelines [1, 18].

The Minimum Information about a Biosynthetic Gene Cluster (MIBiG) repository [19, 20] is regarded as the gold-standard reference database for BGCs by the natural products community. MIBiG is widely used by bioinformatics tools for genome mining, comparative genomics, and natural product discovery pipelines [21–28]. Since its inception in 2015 [29, 30], MIBiG has been maintained and updated by a small, cross-institutional volunteer team, regularly expanding its data schema and coverage through community-driven annotation hackathons (“Annotathons”) in line with scientific advances [29, 31, 32]. However, these activities were organized mostly on an ad hoc basis and encountered difficulties in accommodating a growing number of international participants. During the preparation of the fourth iteration of the MIBiG Annotathons (MIBiG 4.0), we sought to improve our data curation workflow by creating a dedicated organizational framework consisting of workflows for data collection and validation, communication, and task coordination, inspired by the Open Data, Open Code, and Open Infrastructure (O3) guidelines [18]. Centered on Annotathon events, this biocuration model enabled the largest MIBiG data curation effort to date, engaging 267 scientists across 178 institutions from 33 countries spanning 20 time zones (New Zealand to California) who contributed an estimated 4000 person-hours to extract and curate literature-derived data [20]. In this Position article, we present the biocuration model developed through the MIBiG 4.0 initiative, evaluate its impact on the quality and sustainability of molecular data resources, compare it to selected existing community curation models, and provide recommendations for related efforts seeking to adopt the model.
We believe that this biocuration model provides a scalable strategy for generating high-quality, community-driven bioinformatics resources across nucleic acids, protein, metabolite, and integrative omics databases, suitable for adoption by both new and established resources [33–41].

Preparing foundations: establishing the project framework

Successful community-driven biocuration projects require a clear organizational framework, with a well-defined governance model to ensure maintenance and sustainability of the resource [18]. For MIBiG 4.0, the existing governing body was expanded to establish a core organizational team with dedicated roles and responsibilities (Fig. 1a): a project officer, infrastructure officer, software engineer, and communications officer. This team was responsible for setting up infrastructure, coordinating the overall curation effort, and implementing editorial oversight, such as decision-making about accommodating edge cases in the curation workflow. During the initial preparation phase, the team established foundational governance elements, guided by a clear project aim: to update the MIBiG repository with the latest knowledge on secondary metabolite BGCs and to establish a review process for submitted data. To support this aim, two new community roles outside the core organizational team were introduced and staffed with volunteer field experts (Fig. 1a): (i) interest group coordinators, who assisted contributors within predefined interest groups organized around specific topics (e.g. biosynthesis, chemical structures), helped resolve domain-specific questions, and facilitated the creation of detailed, high-quality entries; and (ii) reviewers, who evaluated submitted entries, providing feedback where needed to ensure data accuracy and consistency. Finally, a detailed project timeline was created, structured around key milestones including participant recruitment, training sessions, and data collection events (Annotathons), with clearly defined deadlines (Fig. 1b).

Figure 1.

A composite figure showing the social and technical workflows employed by the MIBiG 4.0 initiative. A timeline indicates the overall timeframe of the project.

Schematic overview of curation effort. The figure shows the organizational aspects and workflow governing the MIBiG 4.0 curation model. (a) Communication and coordination framework: the MIBiG organizational team directs and supports the interest group coordinators and reviewers, who, in turn, guide and assist contributors with their annotation effort. (b) Timeline of major organizational events leading up to, during, and after the MIBiG 4.0 Annotathons: after the Annotathons, an internal “clean-up-a-thon” was held to finalize unfinished entries. (c) General curation workflow: contributors coordinate through Kanban boards and submit data to the MIBiG submission portal, followed by review and approval (or request for changes) by at least one reviewer.

Participant recruitment and onboarding

Successful community curation efforts rely on clear recruitment and onboarding strategies that define the project scope, set contributor expectations, and equip participants with the tools needed to contribute effectively. These strategies must also accommodate diverse backgrounds and establish mechanisms for resolving potential conflicts [18]. In the MIBiG 4.0 initiative, participant mobilization targeted both previous Annotathon contributors and individual domain specialists, complemented by calls for participation on social media, resulting in 398 preregistrations. Follow-up kickoff events (Fig. 1b) introduced potential participants to their roles, familiarized them with the project’s collaborative model and infrastructure, and outlined communication and coordination workflows. The criterion for co-authorship (a time commitment of at least 6 h of biocuration) was defined, and a code of conduct based on the Contributor Covenant (https://www.contributor-covenant.org/) was established. To support onboarding, we developed comprehensive training resources outlining editorial guidelines, including educational videos and a step-by-step data submission guide [42]. Given the highly international composition of the participant pool, kickoff meetings were scheduled across multiple time zones to ensure accessibility.

Creating effective communication channels

Effective communication is vital for the success of global, remotely coordinated community efforts. To this end, the MIBiG 4.0 curation model employed a mix of synchronous and asynchronous communication strategies. Formal announcements, including schedules, training materials, and event updates, were distributed via email to ensure consistent outreach. To foster community building and informal, real-time interactions between participants, an instant messaging application was used. Already utilized in previous Annotathons, a “workspace” of the messaging application Slack (https://mibigannotathons.slack.com) was updated to include interest group-dedicated “channels”, each moderated by one or more interest group coordinators (Fig. 1a). During Annotathon sessions, a live video call using the Zoom software (https://www.zoom.com/) was available to facilitate real-time, face-to-face interaction and support in interest group-specific breakout rooms. Of note, we do not endorse any of the closed-source, commercial tools utilized in the MIBiG Annotathons; they were selected mainly for their availability and/or widespread use among participants.

Infrastructure development and workflow implementation

Effective biocuration relies on three key elements: a defined data curation format, standardized data capture, and effective data validation strategies, which directly shape the workflow design and infrastructure requirements. In the MIBiG repository, BGC entries are stored as JavaScript Object Notation (JSON) text files, defined by a regularly updated bespoke JSON Schema (https://json-schema.org/) data standard [20]. To facilitate interaction with this data model, a dedicated submission portal was created (Fig. 1c). This web server [43] employed a series of forms with built-in data validation, safeguarding against erroneous input. Additionally, the use of persistent identifiers and application programming interfaces (APIs) allowed querying of related resources, such as fetching chemical compound information present in the NPAtlas database. A login system and user privilege management allowed the use of the web portal for both data submission and review. To assess contributor engagement, each edit (a modification submitted through the portal) was recorded, timestamped, and linked to an anonymized contributor ID. Since the time investment per edit could vary considerably, we also gathered self-reported time commitments and other qualitative data through an anonymous post-hackathon questionnaire (Supplementary File 3), complemented by anonymous observation letters from selected participants (Supplementary Information). Around the MIBiG submission portal, a data curation workflow was designed (Fig. 2). To promote autonomous and efficient task coordination, we prepared an online Kanban-style work management system using the free tier version of Trello (https://trello.com). Each phase of the data curation workflow (data collection, review, and revision) was represented as a column, with individual entries modeled as cards (Fig. 1c) [20].
This structure provided a transparent overview of task status and availability, thereby facilitating distributed collaboration among remote participants. Contributors and reviewers were able to self-assign tasks and coordinate directly, while a tag-based system supported targeted requests for specialist input without requiring centralized oversight. As a result, workflow bottlenecks could be promptly detected and problematic entries flagged for revision, ensuring continuity and consistency throughout the curation process. Before the Annotathons, the Kanban boards were prepopulated with 817 “stub entries”, corresponding to either newly proposed BGC records or identified issues in existing entries (Supplementary File 1).
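As a concrete illustration of the portal's schema-driven validation, the sketch below shows how a bespoke JSON Schema can reject malformed submissions before they reach a reviewer. The schema and entry fields are hypothetical simplifications for illustration only, not the actual MIBiG data standard; the example assumes the widely used third-party `jsonschema` Python package.

```python
# Minimal sketch of schema-based input validation, conceptually similar to
# the checks performed by the MIBiG submission portal. The schema below is
# a simplified, hypothetical stand-in for the real MIBiG JSON Schema.
from jsonschema import Draft7Validator

BGC_ENTRY_SCHEMA = {
    "type": "object",
    "required": ["accession", "compounds", "completeness"],
    "properties": {
        # Hypothetical accession format: "BGC" followed by seven digits
        "accession": {"type": "string", "pattern": "^BGC[0-9]{7}$"},
        "compounds": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["name"],
                "properties": {"name": {"type": "string"}},
            },
        },
        "completeness": {"enum": ["complete", "incomplete", "unknown"]},
    },
}

def validate_entry(entry: dict) -> list[str]:
    """Return a list of human-readable validation errors (empty if valid)."""
    validator = Draft7Validator(BGC_ENTRY_SCHEMA)
    return [error.message for error in validator.iter_errors(entry)]

ok = validate_entry({
    "accession": "BGC0000001",
    "compounds": [{"name": "example compound"}],
    "completeness": "complete",
})
bad = validate_entry({"accession": "not-an-accession", "compounds": []})
```

Collecting all errors with `iter_errors` (rather than failing on the first) mirrors how form-based portals can report every problem in a submission at once.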

Figure 2.

This figure shows the biocuration workflow using flowchart symbols. Starting from scientific literature, data curation is coordinated via a Kanban board. Individual contributors add to the process, which ends with final approval by a reviewer.

Workflow of the curation process. Starting from scientific literature, contributors create cards on the Kanban board, representing either newly proposed BGC records or issues identified in existing entries. Contributors self-assign to cards and enter data through the MIBiG submission portal. Once a submission is completed, the corresponding card becomes available for review. Reviewers, like contributors, can self-assign to cards and perform reviews via the MIBiG submission portal. Reviewers may request revisions requiring further corrections or approve entries, after which the record is considered complete.
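The card life cycle described above can be viewed as a small state machine. The sketch below is an illustrative model of that workflow; the state names and transition rules are our own simplification, not code from the MIBiG portal or Trello.

```python
# Hypothetical state machine mirroring the Kanban card life cycle of the
# curation workflow: contributors move cards forward, reviewers either
# approve or send them back for revision.
from enum import Enum, auto

class CardState(Enum):
    OPEN = auto()               # stub entry, unassigned
    IN_PROGRESS = auto()        # contributor self-assigned, entering data
    IN_REVIEW = auto()          # submission complete, awaiting a reviewer
    CHANGES_REQUESTED = auto()  # reviewer asked for revisions
    APPROVED = auto()           # entry considered complete

# Allowed transitions; APPROVED is terminal.
TRANSITIONS = {
    CardState.OPEN: {CardState.IN_PROGRESS},
    CardState.IN_PROGRESS: {CardState.IN_REVIEW, CardState.OPEN},
    CardState.IN_REVIEW: {CardState.APPROVED, CardState.CHANGES_REQUESTED},
    CardState.CHANGES_REQUESTED: {CardState.IN_PROGRESS},
    CardState.APPROVED: set(),
}

def advance(state: CardState, new_state: CardState) -> CardState:
    """Move a card to a new state, rejecting illegal transitions."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition: {state.name} -> {new_state.name}")
    return new_state
```

Encoding the transitions explicitly also models the collaborative pattern of Fig. 4b, in which a card can cycle between revision and review several times, with different contributors picking it up, before reaching approval.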

The MIBiG curation model resulted in high yield and contributor satisfaction

In 2024, the MIBiG 4.0 initiative held eight 3-h Annotathon sessions, engaging 267 scientists in data curation and review efforts, over three times the number of participants from the previous edition [20, 32]. In total, contributors performed 8304 edits using the MIBiG submission portal, creating 557 new entries and updating 590 existing ones, growing the MIBiG repository by 22% and expanding nearly 25% of existing entries [20]. Based on self-reported data, this effort represented an estimated total of 4000 volunteer hours (Supplementary File 3). Analysis of activity over time showed that contributor participation was highest during scheduled Annotathon events, with 71% of edits occurring on just 4 days (Fig. 3a). Additionally, 29% of edits (2417) occurred outside scheduled events (Fig. 3b), indicating that many participants appreciated the possibility to continue working outside the planned Annotathon hours. Edit distribution per participant revealed that a small group of “super-contributors” (10% of participants, Fig. 3c) was responsible for 42% of all edits (Supplementary File 2). This pattern mirrors observations in other community science efforts, such as Wikidata [44], highlighting the importance of recognizing, rewarding, and retaining these individuals for future community initiatives, ideally in leadership and mentorship roles.
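The "super-contributor" share reported above can be computed directly from per-contributor edit counts. The sketch below shows the calculation on a made-up distribution; the actual analysis used the 8304 portal edits from 267 contributors.

```python
# Illustrative computation of the edit share of the top 10% most active
# contributors, mirroring the analysis behind Fig. 3c. The edit counts in
# `example` are invented for demonstration purposes.

def top_decile_share(edits_per_contributor: list[int]) -> float:
    """Fraction of all edits contributed by the top 10% of contributors."""
    counts = sorted(edits_per_contributor, reverse=True)
    n_top = max(1, round(len(counts) * 0.10))  # at least one contributor
    return sum(counts[:n_top]) / sum(counts)

# Hypothetical skewed distribution: a few heavy contributors, a long tail.
example = [120, 110, 95, 20, 15, 12, 10, 8, 5, 5]
share = top_decile_share(example)  # top 1 of 10 contributors -> 120/400
```

Such heavy-tailed contribution distributions are typical of community science projects, which is why identifying and retaining the most active participants matters for sustainability.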

Figure 3.

Figure visualizing participation of contributors over time.

Overview of contributor engagement, measured in edits performed in the MIBiG submission portal. (a) Shows the total count of edits per day (horizontal axis shared with (b)), with blue diamonds indicating individual Annotathon dates (two 3-h sessions per day; “clean-up-a-thon” referring to a data completion session with a smaller number of participants); (b) breaks down the edit count per day to individual contributors, each assigned a random color; (c) boxplot showing the distribution of edits per contributor (median 19, mean 31.1, SD 37.5), with each blue dot representing an individual contributor in the overlaid strip plot.

To gain insight into participant demographics and assess satisfaction with the MIBiG curation model, we conducted an anonymous exit poll, to which 82 participants (28% of contributors) responded. Additionally, five participants provided detailed qualitative observations (Supplementary Information). The 82 respondents represented a diverse, global community with members across all academic career stages, from (under)graduate students to senior faculty. Satisfaction among respondents was generally high: 67% identified contributing to a shared effort as their primary motivation (Supplementary Fig. S5). Notably, 91.5% indicated they would participate again (Supplementary Fig. S18) and recommend the initiative to others (Supplementary Fig. S19), even though 76.8% were first-time participants (Supplementary Fig. S4).

Case study

To gain insight into the social dynamics and interaction patterns that emerged during the MIBiG Annotathons, we examined in detail the provenance and creation history of selected entries. Activities were tracked through Kanban board cards, which also facilitated communication among participants via comments. Two distinct modes of interaction were observed. The first followed a traditional workflow, in which a contributor created an entry that was subsequently peer-reviewed, revised upon request, and corrected by the original contributor (Fig. 4a). In contrast, we also frequently observed a more collaborative pattern, exemplified in Fig. 4b, where multiple contributors jointly refined an entry: contributor 1 initiated the entry, a reviewer suggested revisions, contributor 2 implemented the changes, and a second reviewer performed the final inspection and completion. Although anecdotal, these case studies illustrate the advantages of a Kanban-based coordination framework for managing community-driven, parallel annotation efforts, thereby making effective use of complementary expertise.

Figure 4.

Figure showing two case studies of collaborative biocuration.

Examples of biocuration within the MIBiG workflow. (a) Improvement of an existing entry, in which the contributor and reviewer followed a traditional peer-review model (Supplementary Fig. S21). (b) Creation of a new entry through the coordinated input of four participants. Notably, a contributor other than the original creator performed the revision, illustrating the collaborative practices fostered in the MIBiG Annotathons (Supplementary Fig. S22).

Systematic comparison with related biocuration models

To explore the transferability of the MIBiG curation model to other communities, we performed a systematic comparison with related collaborative biocuration strategies (Table 1). We investigated the type of collaboration model, its target audience, and participant autonomy, compared the required technical and semantic expertise (knowledge of programs and ontologies, respectively), evaluated the complexity of customizing the provided curation infrastructure, and outlined data quality assurance. The MIBiG 4.0 collaboration model allowed the recruitment of a large number of international experts who performed a high volume of data curation in a relatively short time (eight 3-h sessions) due to adherence to principles described by the Behavioural Insights Team as the EAST framework (Easy, Attractive, Social, Timely, https://www.bi.team/wp-content/uploads/2014/04/BIT-EAST-1.pdf). While attendance in the MIBiG Annotathons presumed expertise in some aspect of natural product research, participation was “easy” since no knowledge of data structures, ontologies, or specialist software was required, with graphical user interfaces for data capture (MIBiG submission portal) and coordination (Kanban) appealing to experimentalists. While contribution to the MIBiG database undoubtedly benefited researchers using it as a reference database or as a knowledge base, the MIBiG curation model also presented a tangible and attractive incentive (co-authorship on a subsequent publication) contingent on clearly communicated requirements (at least 6 h of curation work). Organizing participation in the form of hackathons introduced a valuable social component, fostering informal collaboration, community-building, and creating a sense of belonging to a collective effort. Additionally, the MIBiG curation model emphasized timeliness, with defined hackathon dates and clear deadlines conveying a sense of urgency.
We believe that these aspects can be generalized to other biocuration efforts, motivating the creation of novel resources and allowing existing systems to improve their efficiency. Continuously running biocuration models could mobilize existing and new participants in hackathon-like events, incentivizing participation with co-authorships on community papers. Models employing technically challenging vocabularies could benefit from dedicated interfaces that facilitate data input, validation, and review. The MIBiG curation model has already inspired the creation of a related resource, the Minimum Information about a Tailoring Enzyme database, which combines a continuous model with regular hackathons and invites all data contributors to be co-authors on its subsequent publication [45]. At the same time, the MIBiG curation model is organizationally demanding, requiring detailed project planning and bespoke infrastructure. Contributor expertise may vary considerably, and the focus on easy-to-use data capture systems over ontologies can impact data quality consistency, which may be addressed with additional participant training. Therefore, related efforts should carefully weigh the advantages and disadvantages of the MIBiG curation model and its components, as specified in detail in the next section.

Table 1.

Systematic comparison between MIBiG 4.0 and other representative community-driven biocuration models.

Name & reference | Description | Collaboration model & participant autonomy | Required technical and semantic expertise | Curation infrastructure setup complexity | Data input control & quality assurance | Transferability of curation model (data, organizational)
MIBiG 4.0 [20] | Curation of natural product biosynthetic gene clusters | Hackathons recruiting field experts; high autonomy | Low (online web portal, forms with auto-fill) | High (data submission web server, Kanban board for coordination) | High (automated validation, peer review) | High
CACAO using GONUTS software [15] | Curation of gene ontology terms | Education-based competitions recruiting undergraduates; low autonomy | Medium (online wiki instance; knowledge of Gene Ontology terms) | High (custom wiki instance) | High (ontologies, peer review) | Medium (university-level, focus on Gene Ontology curation)
PomBase using Canto software [46, 47] | Knowledge base for fission yeast Schizosaccharomyces pombe | Continuous curation recruiting publication authors via email; high autonomy | Medium (online annotation workflow; ontology knowledge required) | Medium (web server instance can be customized) | High (ontologies, peer review, co-curation of contributor and professional curator) | High (variable ontologies)
LOTUS using Wikidata [17, 40] | Curation of natural product provenance (producing organism) | Continuous curation without dedicated recruitment; high autonomy | High (knowledge of ontologies; contribution conventions) | High (ontology implementation, considerations to adhere to notability guidelines) | Low (ontologies available but limited automated checks, no peer review) | Low
Wikipathways using GitHub [35] | Biosynthetic pathway curation | Continuous curation without dedicated recruitment; high autonomy | High (extensive training to use PathVisio software) | Medium (leveraging free-to-use infrastructure such as GitHub) | High (automated checks, peer review) | High
PDBe-KB using institutional infrastructure [48] | Aggregator of functional annotations for Protein Data Bank | Continuous curation addressing partner resources; low autonomy | High (structured JSON files deposited via FTP) | High (software stack of private/public FTP, API, webpage) | High (defined data exchange format, processing pipeline) | Low (complex infrastructure, technical expertise)
Paired Omics Data Platform using institutional infrastructure [41] | Reference database for paired genomic and metabolomic data | Initial recruitment of selected experts (high autonomy); currently continuous curation (low autonomy) | Medium (online annotation workflow; ontology knowledge required) | High (ontology implementation, considerations to adhere to notability guidelines) | High (automated checks, peer review) | Medium

Lessons learned in developing a community-driven biocuration framework

The MIBiG 4.0 biocuration model enabled a series of successful Annotathon events, significantly expanding the MIBiG repository as a reference database for bioinformatics applications. The Kanban-style work management system proved to be highly beneficial in guiding efforts and implementing editorial oversight. Pre-populating the Kanban boards with “stub entries” helped focus participants’ attention and facilitated self-directed, independent task assignment. Unexpectedly, the Kanban boards also served as a platform for discussions about entry-specific details, and archiving such exchanges would have been valuable for preserving entry histories, similar to the “Discussion” pages used on Wikipedia (Supplementary Figs S21 and S22). At the same time, issues arose when contributors self-assigned many more “stub entries” than they could possibly work on, despite guidelines to limit reservations. Further, contributors frequently forgot to unassign themselves from tasks, leading to confusion about task status. Nevertheless, the Kanban boards were welcomed by the participants, receiving a satisfaction score of 4.15/5.0 from questionnaire respondents (Supplementary File 3).

The MIBiG submission portal was a valuable upgrade from the previous spreadsheet-based approach. Structuring data submission through input forms encouraged participation without requiring prior training, while automated validation and the integrated review system reduced the incorporation of mis-annotated data. By the end of the Annotathons, ~40% of new or modified entries had successfully passed the review workflow, which was only possible due to the newly implemented infrastructure. Participants also expressed interest in a modular submission framework that would support annotation and review of specific data types in bulk, rather than entry by entry, enabling more effective use of specialist expertise (e.g. adding and reviewing of chemical structures). Although some contributors missed the simplicity of spreadsheets, overall satisfaction with the MIBiG submission portal was high, with an average satisfaction score of 4.16/5.0 (Supplementary File 3).

In comparison, respondents were less satisfied with the overall curation workflow, giving it an average score of 3.44/5.0 (Supplementary File 3). This was primarily due to the need to switch between separate platforms for coordination (Kanban) and data submission (submission portal). Additionally, the use of a “freemium” software solution for the Kanban board imposed limitations on availability and customizability, and required user consent to its terms of use (including the collection of personal data). Ideally, an open-source Kanban software would have been directly integrated into the submission portal, seamlessly linking task assignment with data curation and automatically managing entry reservations and task statuses. While this would enhance the user experience, developing such a sophisticated solution is labor-intensive, exceeds the capacities of most community initiatives, and raises concerns about long-term maintenance, funding, and sustainability [18]. One possible solution would be a crowd-sourced effort to develop a dedicated, free open-source software package that integrates task coordination, data submission and validation, and communication tools into a customizable curation framework, possibly as an extension of the existing Wikibase infrastructure (https://wikiba.se/). Examples of such reusable software packages are Canto [46] and Wikipathways [35], both providing infrastructure for various community-driven biocuration efforts [47, 49, 50]. The availability of such a software template would enable initiatives to focus their resources primarily on biocuration rather than platform engineering, thereby democratizing access. A “standardized” software package could also facilitate automated import of curated data into Wikidata, combining the strengths of specialized biocuration with the benefits of large-scale knowledge graph integration.
While the MIBiG organizational team does not currently plan to lead such an effort, we hope to provide inspiration by making the source code of the MIBiG submission portal freely available [43].

In addition to technical workflows, the social workflows implemented in the MIBiG 4.0 curation model played a key role in facilitating the curation process. The assignment of predefined roles helped clarify responsibilities and improve communication. The core organizational team was instrumental in coordinating contributors, managing infrastructure, and resolving technical issues. The newly introduced roles of interest group coordinators (responsible for editorial oversight and clarifying edge cases) and reviewers (data quality assurance) were generally well-received, earning satisfaction scores of 3.8/5.0 and 4.0/5.0, respectively (Supplementary File 3). Notably, reviewers experienced a high workload, as they represented only 16.5% of participants. Future initiatives may benefit from investing in dedicated training to better define the scope of the review process and reduce the perceived workload. While clearly defined roles facilitate task triaging, smaller initiatives may struggle to recruit sufficient personnel. One possible solution could be consolidating roles like project and communication officer, or reviewer and interest group coordinator. Moreover, the sustainability and resilience of social workflows must be carefully considered. Redundancy should be built in to ensure continuity when key participants leave the project, and thorough documentation must be available to facilitate onboarding of new members. The establishment of a comprehensive governance model and a clear development roadmap is crucial to ensure the long-term success and sustainability of the project [18].

Next steps and future perspectives

Since its inception a decade ago, the MIBiG database has been continuously expanded through regular annotation hackathons designed to keep pace with scientific advances. While the current community curation model enables efficient and timely participation, several aspects can still be improved. In preparation for the upcoming MIBiG 5.0 Annotathons, we plan to streamline our submission platform to integrate data curation and coordination, to improve internal data validation and adherence to controlled terminology, and to decrease our reliance on third-party tools. We will also consolidate the roles of interest group coordinators and reviewers into a unified “senior participant” role responsible for editorial oversight and data quality, with a target of ~25% of participants serving in this capacity to distribute the workload more effectively and better support less-experienced contributors. In parallel, we are investigating a hybrid human-AI curation model, in which data parsing is performed by a domain-specific large language model (LLM) and subsequently verified through expert review. This workflow has the potential to increase the throughput of data curation by allocating human expertise to validation rather than data extraction. Furthermore, the prediction-validation feedback loop provides a foundation for active learning, enabling iterative model retraining through the identification and correction of misannotations. Preliminary evaluations in our laboratories are promising (data not shown), but broader implementation of this approach is only envisioned for later iterations of the MIBiG Annotathons, once its robustness and workflow integration have been more thoroughly assessed.

Despite the sustained efforts of the MIBiG community and other crowd-sourced curation initiatives, biocuration inherently lags behind the pace of primary research. This gap will persist until data deposition in machine-readable formats with rich metadata becomes a prerequisite for manuscript acceptance. To support this transition, MIBiG allows entries to be embargoed by the data submitter, enabling researchers to prepare and reference accession numbers during manuscript submission. In parallel, emerging publication formats such as micropublications [51] and nanopublications [52] are promising alternatives to traditional narrative scientific articles, allowing researchers to share small, machine-readable knowledge snippets that are both citable and creditable. We are exploring ways to integrate such formats into MIBiG to enhance the recognition of individual contributions.

Ultimately, scientists generating primary data are best positioned to curate their results accurately. By engaging researchers directly and lowering the technical barriers to participation, initiatives such as MIBiG aim to promote a cultural shift in which biocuration becomes an integral and recognized component of the scientific publication process.

Conclusions

In this work, we presented a new biocuration model, which was successfully applied to a large-scale, community-driven effort to annotate molecular data. By combining social and technical workflows, this cost-effective approach accommodates a large number of participants and appeals to both novice and expert contributors, creating opportunities for the former while making efficient use of the time of the latter. While the MIBiG curation model is particularly well suited to bioinformatics-based initiatives involving the extraction of gene, protein, and pathway annotations, chemical structure information, or phenotype associations, its general principles (hackathon-centered community curation incentivized by co-authorship) are broadly applicable. Additionally, the MIBiG model fosters community-building and collaboration, which in turn drives a virtuous circle of data generation, curation, and reuse. These concepts can be extended to other biological and biomedical domains to promote community participation in data curation. The lessons learned from MIBiG 4.0 offer a blueprint for developing sustainable, open, and collaborative scientific resources that facilitate bioinformatics analyses and advance biological understanding.

Key Points

  • Professional biocuration is essential for ensuring high-quality molecular datasets, but cannot keep pace with the rapid growth of biological data.

  • Existing crowd-sourced strategies often lack scalability and rigorous data validation mechanisms.

  • We present a new community-driven biocuration framework, developed during the MIBiG 4.0 annotation hackathons, which combines technical workflows with Annotathon-centered expert participation.

  • This strategy enables the rapid generation of expert-curated molecular data and is transferable to other bioinformatics annotation initiatives.

Supplementary Material

MIBiG_position_10_25_bib_revision_SI_bbaf659
Sup_1_MIBiG_4_initial_card_set_trello_NPI_bbaf659
Sup_2_edits_per_date_per_editor_bbaf659
Sup_3_MIBiG_4_0_Post_Annotathon_Questionnaire_NPI_bbaf659

Acknowledgements

The authors thank Julia K. Schink for valuable inspiration in the development of the methodology.

Contributor Information

Kai Blin, The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Building 220, Kongens Lyngby 2800, Denmark.

Catarina Loureiro, Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, Wageningen, 6708 PB, the Netherlands.

Nico L L Louwen, Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, Wageningen, 6708 PB, the Netherlands.

Jorge C Navarro-Muñoz, Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, Wageningen, 6708 PB, the Netherlands.

Hans Gerstmans, VIB-KU Leuven Center for Microbiology, Flanders Institute for Biotechnology, Kasteelpark Arenberg 31, Leuven 3001, Belgium; Department of Biology, Laboratory for Biomolecular Discovery & Engineering, KU Leuven, Kasteelpark Arenberg 31, Leuven 3001, Belgium; Department of Biosystems, Biosensors Group, KU Leuven, Willem de Croylaan 42, box 2428, Leuven 3001, Belgium.

Serina L Robinson, Department of Environmental Microbiology, Swiss Federal Institute of Aquatic Science and Technology, Ueberlandstrasse 133, Duebendorf 8600, Switzerland.

Adriano Rutz, Institute for Molecular Systems Biology, ETH Zürich, Otto-Stern-Weg 3, Zürich 8093, Switzerland.

Zachary L Reitz, Department of Ecology, Evolution and Marine Biology, University of California, 1169 Biological Sciences II, Santa Barbara, CA 93106, United States.

Drew T Doering, Lawrence Berkeley National Laboratory, US Department of Energy Joint Genome Institute, 1 Cyclotron Road, Berkeley, CA 94720, United States.

Justin J J van der Hooft, Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, Wageningen, 6708 PB, the Netherlands; Department of Biochemistry, University of Johannesburg, C2 Lab Building 224, Kingsway Campus, Cnr University & Kingsway Road, Auckland Park, Johannesburg 2006, South Africa.

Tilmann Weber, The Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Building 220, Kongens Lyngby 2800, Denmark.

Marnix H Medema, Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, Wageningen, 6708 PB, the Netherlands.

Mitja M Zdouc, Bioinformatics Group, Wageningen University & Research, Droevendaalsesteeg 1, Wageningen, 6708 PB, the Netherlands.

Author contributions

K.B., C.L., M.H.M., M.M.Z. (Conceptualization); H.G., S.L.R., Z.L.R., D.T.D., M.M.Z. (Data curation); K.B., C.L., T.W., M.H.M. (Funding acquisition); K.B., C.L., M.M.Z. (Methodology); K.B., T.W., M.H.M., M.M.Z. (Project administration); K.B., N.L.L.L. (Software); C.L., M.M.Z. (Visualization); C.L., M.M.Z. (Writing—original draft); K.B., C.L., J.C.N.M., H.G., S.L.R., A.R., Z.L.R., D.T.D., J.J.J.v.d.H., T.W., and M.H.M. (Writing—review & editing).

Conflict of interest: J.J.J.v.d.H. is a member of the scientific advisory board of NAICONS Srl., Milano, Italy, and consults for Corteva Agriscience, Indianapolis, IN, USA. M.H.M. is a member of the scientific advisory board of Hexagon Bio. All other authors declare no competing interests.

Funding

C.L. and N.L.L.L. were supported by the NWO Open Science Project “BiG-CODEC” No. OSF.23.1.044; H.G. was supported by the Research Foundation-Flanders (FWO) under the scope of a junior postdoctoral fellowship (1229222 N); S.L.R. was supported by the Swiss National Science Foundation (PZPGP2_209124); T.W. and K.B. were supported by the Novo Nordisk Foundation, NNF20CC003558; T.W. was furthermore supported by the Danish National Research Foundation CeMiSt, DNRF137; M.M.Z. was supported by the Dutch Research Council (NWO) Grant KICH1.LWV04.21.013. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under Grant Agreement no. 101000392 (MARBLES). The work conducted by the U.S. Department of Energy Joint Genome Institute (https://ror.org/04xm1d337), a DOE Office of Science User Facility, is supported by the Office of Science of the U.S. Department of Energy operated under Contract No. DE-AC02-05CH11231.

Data availability

Computer code for the MIBiG submission portal web application is deposited on Zenodo under https://zenodo.org/records/13970328 [43].

References

  • 1. International Society for Biocuration. Biocuration: Distilling data into knowledge. PLoS Biol 2018;16:e2002846. 10.1371/journal.pbio.2002846
  • 2. Haendel M, Su A, McMurry J. FAIR-TLC: Metrics to Assess Value of Biomedical Digital Repositories: Response to RFI NOT-OD-16-133. 10.5281/zenodo.203295
  • 3. Wilkinson MD, Dumontier M, Aalbersberg IJJ. et al. The FAIR guiding principles for scientific data management and stewardship. Sci Data 2016;3:160018. 10.1038/sdata.2016.18
  • 4. Hanson MA, Barreiro PG, Crosetto P. et al. The strain on scientific publishing. Quant Sci Stud 2024;5:823–43. 10.1162/qss_a_00327
  • 5. Stephens ZD, Lee SY, Faghri F. et al. Big data: Astronomical or genomical? PLoS Biol 2015;13:e1002195. 10.1371/journal.pbio.1002195
  • 6. Davies SR. Working in biocuration: Contemporary experiences and perspectives. Database (Oxford) 2025;2025:baaf003. 10.1093/database/baaf003
  • 7. Skunca N, Altenhoff A, Dessimoz C. Quality of computationally inferred gene ontology annotations. PLoS Comput Biol 2012;8:e1002533. 10.1371/journal.pcbi.1002533
  • 8. Kalmer TL, Ancajas CMF, Cheng Z. et al. Assessing the ability of ChatGPT to extract natural product bioactivity and biosynthesis data from publications. bioRxiv 2024. 10.1101/2024.08.01.606186
  • 9. Niyonkuru E, Harry Caufield J, Carmody LC. et al. Leveraging generative AI to accelerate biocuration of medical actions for rare disease. medRxiv 2024. 10.1101/2024.08.22.24310814
  • 10. UniProt Consortium. UniProt: A worldwide hub of protein knowledge. Nucleic Acids Res 2019;47:D506–15. 10.1093/nar/gky1049
  • 11. Huang L, Yu W, Ma W. et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv 2023. 10.48550/arXiv.2311.05232
  • 12. Zhou L, Schellaert W, Martínez-Plumed F. et al. Larger and more instructable language models become less reliable. Nature 2024;634:61–8. 10.1038/s41586-024-07930-y
  • 13. Steyvers M, Tejeda H, Kumar A. et al. What large language models know and what people think they know. Nat Mach Intell 2025;7:221–31. 10.1038/s42256-024-00976-7
  • 14. Caufield H, Kroll C, O’Neil ST. et al. CurateGPT: A flexible language-model assisted biocuration tool. arXiv 2024. 10.48550/arXiv.2411.00046
  • 15. Ramsey J, McIntosh B, Renfro D. et al. Crowdsourcing biocuration: The community assessment of community annotation with ontologies (CACAO). PLoS Comput Biol 2021;17:e1009463. 10.1371/journal.pcbi.1009463
  • 16. Renfro DP, McIntosh BK, Venkatraman A. et al. GONUTS: The gene ontology normal usage tracking system. Nucleic Acids Res 2011;40:D1262–9. 10.1093/nar/gkr907
  • 17. Waagmeester A, Stupp G, Burgstaller-Muehlbacher S. et al. Science forum: Wikidata as a knowledge graph for the life sciences. eLife 2020;9:e52614. 10.7554/eLife.52614
  • 18. Hoyt CT, Gyori BM. The O3 guidelines: Open data, open code, and open infrastructure for sustainable curated scientific resources. Sci Data 2024;11:547. 10.1038/s41597-024-03406-w
  • 19. Medema MH, de Rond T, Moore BS. Mining genomes to illuminate the specialized chemistry of life. Nat Rev Genet 2021;22:553–71. 10.1038/s41576-021-00363-7
  • 20. Zdouc MM, Blin K, Louwen NLL. et al. MIBiG 4.0: Advancing biosynthetic gene cluster curation through global collaboration. Nucleic Acids Res 2024;53:D678–90.
  • 21. Blin K, Shaw S, Vader L. et al. antiSMASH 8.0: Extended gene cluster detection capabilities and analyses of chemistry, enzymology, and regulation. Nucleic Acids Res 2025;53:W32–8. 10.1093/nar/gkaf334
  • 22. van Heel AJ, de Jong A, Song C. et al. BAGEL4: A user-friendly web server to thoroughly mine RiPPs and bacteriocins. Nucleic Acids Res 2018;46:W278–81. 10.1093/nar/gky383
  • 23. Merwin NJ, Mousa WK, Dejong CA. et al. DeepRiPP integrates multiomics data to automate discovery of novel ribosomally synthesized natural products. Proc Natl Acad Sci USA 2020;117:371–80. 10.1073/pnas.1901493116
  • 24. Liu M, Li Y, Li H. Deep learning to predict the biosynthetic gene clusters in bacterial genomes. J Mol Biol 2022;434:167597. 10.1016/j.jmb.2022.167597
  • 25. Hannigan GD, Prihoda D, Palicka A. et al. A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Res 2019;47:e110. 10.1093/nar/gkz654
  • 26. Sanchez S, Rogers JD, Rogers AB. et al. Expansion of novel biosynthetic gene clusters from diverse environments using SanntiS. bioRxiv 2023. 10.1101/2023.05.23.540769
  • 27. Carroll LM, Larralde M, Fleck JS. et al. Accurate de novo identification of biosynthetic gene clusters with GECCO. bioRxiv 2021. 10.1101/2021.05.03.442509
  • 28. Poynton EF, van Santen JA, Pin M. et al. The natural products atlas 3.0: Extending the database of microbially-derived natural products. Nucleic Acids Res 2025;53:D691–9. 10.1093/nar/gkae1093
  • 29. Medema MH, Kottmann R, Yilmaz P. et al. Minimum information about a biosynthetic gene cluster. Nat Chem Biol 2015;11:625–31. 10.1038/nchembio.1890
  • 30. Epstein SC, Charkoudian LK, Medema MH. A standardized workflow for submitting data to the minimum information about a biosynthetic gene cluster (MIBiG) repository: Prospects for research-based educational experiences. Stand Genomic Sci 2018;13:16. 10.1186/s40793-018-0318-y
  • 31. Kautsar SA, Blin K, Shaw S. et al. MIBiG 2.0: A repository for biosynthetic gene clusters of known function. Nucleic Acids Res 2020;48:D454–8. 10.1093/nar/gkz882
  • 32. Terlouw BR, Blin K, Navarro-Muñoz JC. et al. MIBiG 3.0: A community-driven effort to annotate experimentally validated biosynthetic gene clusters. Nucleic Acids Res 2023;51:D603–10. 10.1093/nar/gkac1049
  • 33. Bansal P, Morgat A, Axelsen KB. et al. Rhea, the reaction knowledgebase in 2022. Nucleic Acids Res 2021;50:D693–700. 10.1093/nar/gkab1016
  • 34. Kanehisa M, Furumichi M, Sato Y. et al. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res 2022;51:D587–92. 10.1093/nar/gkac963
  • 35. Agrawal A, Balcı H, Hanspers K. et al. WikiPathways 2024: Next generation pathway database. Nucleic Acids Res 2023;52:D679–89. 10.1093/nar/gkad960
  • 36. Augustijn HE, Karapliafis D, Joosten KMM. et al. LogoMotif: A comprehensive database of transcription factor binding site profiles in actinobacteria. J Mol Biol 2024;436:168558. 10.1016/j.jmb.2024.168558
  • 37. Terrapon N, Lombard V, Drula E. et al. The carbohydrate-active enzyme (CAZy) database: Principles and usage guidelines. In: Aoki-Kinoshita KF (ed.), A Practical Guide to Using Glycomics Databases. Tokyo: Springer Japan, 2016, 117–31.
  • 38. Jones MR, Pinto E, Torres MA. et al. CyanoMetDB, a comprehensive public database of secondary metabolites from cyanobacteria. Water Res 2021;196:117017. 10.1016/j.watres.2021.117017
  • 39. Chandrasekhar V, Rajan K, Kanakam SRS. et al. COCONUT 2.0: A comprehensive overhaul and curation of the collection of open natural products database. Nucleic Acids Res 2024;53:D634–43. 10.1093/nar/gkae1063
  • 40. Rutz A, Sorokina M, Galgonek J. et al. The LOTUS initiative for open knowledge management in natural products research. eLife 2022;11:e70780. 10.7554/eLife.70780
  • 41. Schorn MA, Verhoeven S, Ridder L. et al. A community resource for paired genomic and metabolomic data mining. Nat Chem Biol 2021;17:363–8. 10.1038/s41589-020-00724-z
  • 42. Zdouc MM, Navarro-Muñoz JC, Loureiro C. et al. Training Materials for Minimum Information about a Biosynthetic Gene Cluster (MIBiG) 4.0 Annotathons. Zenodo 2024. 10.5281/zenodo.15585490
  • 43. Louwen NLL, Blin K. Minimum Information about a Biosynthetic Gene Cluster (MIBiG) Submission Portal. Zenodo 2024. 10.5281/zenodo.13970328
  • 44. Sarasua C, Checco A, Demartini G. et al. The evolution of power and standard Wikidata editors: Comparing editing behavior over time to predict lifespan and volume of edits. Comput Support Coop Work 2019;28:843–82. 10.1007/s10606-018-9344-y
  • 45. Rutz A, Probst D, Aguilar C. et al. MITE: The minimum information about a tailoring enzyme database for capturing specialized metabolite biosynthesis. Nucleic Acids Res 2025;gkaf969. 10.1093/nar/gkaf969
  • 46. Rutherford KM, Harris MA, Lock A. et al. Canto: An online tool for community literature curation. Bioinformatics 2014;30:1791–2. 10.1093/bioinformatics/btu103
  • 47. Lock A, Harris MA, Rutherford K. et al. Community curation in PomBase: Enabling fission yeast experts to provide detailed, standardized, sharable annotation from research publications. Database (Oxford) 2020;2020:baaa028. 10.1093/database/baaa028
  • 48. PDBe-KB consortium. PDBe-KB: A community-driven resource for structural and functional annotations. Nucleic Acids Res 2020;48:D344–53. 10.1093/nar/gkz853
  • 49. Öztürk-Çolak A, Marygold SJ, Antonazzo G. et al. FlyBase: Updates to the Drosophila genes and genomes database. Genetics 2024;227:iyad211. 10.1093/genetics/iyad211
  • 50. Rutherford KM, Harris MA, Oliferenko S. et al. JaponicusDB: Rapid deployment of a model organism database for an emerging model species. Genetics 2022;220:iyab223. 10.1093/genetics/iyab223
  • 51. Raciti D, Yook K, Harris TW. et al. Micropublication: Incentivizing community curation and placing unpublished data into the public domain. Database (Oxford) 2018;2018:bay013. 10.1093/database/bay013
  • 52. Kuhn T, Barbano PE, Nagy ML. et al. Broadening the scope of nanopublications. The Semantic Web: Semantics and Big Data 2013;7882:487–501. 10.1007/978-3-642-38288-8_33
