. Author manuscript; available in PMC: 2025 Dec 29.
Published in final edited form as: J Chem Inf Model. 2025 Dec 18;66(1):138–151. doi: 10.1021/acs.jcim.5c02082

BGC-MAC and BGC-MAP: Attention-Based Models for Biosynthetic Gene Cluster Classification and Product Matching

Kechen Lu 1, Mengting Li 2, Hao Geng 3, Wenqiang Xu 2, Manyun Chen 2, Tian-Min Fu 1, Hendrik Luesch 2,4, Yousong Ding 2, Wen Jun Xie 2
PMCID: PMC12744964  NIHMSID: NIHMS2131865  PMID: 41412814

Abstract

Natural products, synthesized via enzymes encoded by biosynthetic gene clusters (BGCs), represent a major source of therapeutic agents. Accurate BGC annotation is essential to unlock the vast potential of natural product diversity. However, BGC annotation remains challenging due to our incomplete understanding of the enzymatic logic underlying biosynthesis. Here, we present two deep learning models trained on experimentally validated BGC–natural product pairs to advance BGC annotation. The BGC-Multihead Attention Classifier (BGC-MAC) classifies BGCs by natural product class, outperforming antiSMASH and DeepBGC. The BGC-Multihead Attention Product-matcher (BGC-MAP) associates BGCs with product structures, demonstrating potential to prioritize candidate BGCs given a natural product, or to identify potential natural products from a given BGC. Importantly, the models’ cross-attention mechanisms enable explainable AI, identifying key protein domains and revealing BGC–substructure relationships in the biosynthesis without requiring prior annotations. Together, BGC-MAC and BGC-MAP establish a data-driven, explainable AI framework that enhances BGC annotation, deepens biosynthetic insight, and accelerates the discovery of new natural products. The software is available at https://github.com/EvoCatalysis/BGC_annotation.

Graphical Abstract

[Graphical abstract image: nihms-2131865-f0001.jpg]

Introduction

Natural products, biosynthesized by bacteria, fungi, plants and animals, are among the most important sources of therapeutic agents.1 Their remarkable structural diversity has made them indispensable in the development of antibiotics, chemotherapeutics, immunosuppressants, and crop protection agents. Notably, approximately 66% of all approved drugs are derived from natural products, the majority of which originate from microorganisms.2,3 Biosynthetic gene clusters (BGCs), sets of co-located genes encoding biosynthetic proteins, primarily govern the biosynthesis of natural products.4,5 Advances in genome and metagenome mining have led to the identification of millions of putative BGCs6, predicted by detecting adjacent protein domains commonly associated with biosynthetic functions.7,8 However, only around 3,000 BGCs have been experimentally linked to their natural products, underscoring a vast uncharacterized reservoir of chemical diversity and the need for computational tools that can accurately annotate BGCs and link them to their products.9

Annotating BGCs largely depends on rule-based methods such as those implemented in antiSMASH (ANTIbiotics & Secondary Metabolite Analysis SHell). antiSMASH10 employs profile-Hidden Markov Models (pHMMs) to identify biosynthetic domains and classify BGCs into product types, such as nonribosomal peptide (NRP), ribosomally synthesized and post-translationally modified peptide (RiPP), polyketide, terpene, alkaloid, and others. However, rule-based approaches often fail to capture higher-order features, such as gene order, regulatory elements, or context-dependent interactions between domains11, which may be critical for accurate function and product prediction.

Recent advances in machine learning have begun to address these limitations. For example, DeepBGC11 uses a random forest classifier that predicts product classes from Pfam domain composition features enriched in biosynthetic regions. The random forest was trained on annotated BGCs in Minimum Information about a Biosynthetic Gene cluster (MIBiG). BigCARP12 leverages a self-supervised masked language model trained on antiSMASH-predicted BGC classes, learning contextual representations of protein domains to predict the natural product class. However, machine learning models often exhibit higher false positive rates than rule-based approaches and also suffer from false negatives when predicting known BGC types.13 Furthermore, previous deep learning models function as ‘black boxes’, failing to identify the key domains driving their predictions and thereby offering limited biosynthetic insight.

Beyond classification, linking BGCs to the chemical structures of their encoded products is critical for elucidating biosynthetic pathways and prioritizing clusters for experimental characterization. To date, rule-based methods have been developed to predict the structures of trans-AT14 and cis-AT polyketides15, RiPPs16, and secondary metabolites more broadly17. PRISM17 predicts BGC product structures by modeling biosynthetic pathways as transformations of chemical graphs. It applies 618 virtual tailoring reactions to modify chemical subgraphs based on enzymatic content, generating potential structures. However, this approach relies on homology to known enzymes, limiting its capacity to predict novel structures. Current rules for secondary metabolite biosynthesis are far from complete, restricting structure prediction accuracy. While some methods incorporate deep learning16 and mass spectrometry data15 to better characterize BGCs and predict product structures, their predictions still rely on rule-based systems and focus exclusively on specific classes of natural products.

Here, we developed two independent deep learning models, BGC-MAC and BGC-MAP, to accurately classify BGCs by product class and to link BGCs with product structure, respectively. Both models employ cross-attention layers to integrate information between the BGC and either the product class (BGC-MAC) or the product structure (BGC-MAP). BGC-MAC achieves better performance than antiSMASH (version 7.0)10 and DeepBGC11. In addition, BGC-MAC identifies class-discriminative domains without prior annotation. BGC-MAP directly associates BGCs with their product structures, achieving an encouraging AUROC (>0.80). It also demonstrates the potential to prioritize BGC candidates for known natural products or vice versa, thus reducing the experimental burden. Together, BGC-MAC and BGC-MAP establish a data-driven and explainable AI framework that advances genome mining, improves BGC annotation, and accelerates natural product discovery. By coupling predictive power with biological interpretability, these tools not only facilitate drug discovery but also provide new mechanistic insights into the biosynthetic logic of natural products. For practical usage, both models accept BGC GenBank files in antiSMASH format, with usage guidelines provided in the GitHub repository.

Results

Architectures of BGC-MAC and BGC-MAP

We assembled a training dataset of 2,635 BGCs from the MIBiG database (version 4.0), 2,114 of which contain experimentally characterized natural products.9 There are 3,551 unique natural products and 3,824 BGC-product pairs (Figure 1A). A BGC may belong to multiple biosynthetic classes (Figure 1B). BGCs possess a hierarchical structure that can be broken down into individual enzymes and their corresponding Pfam domains. Such domains are critical for BGC identification and classification, as Pfam domains recognize protein families involved in biosynthesis.8 Following previous works11,12, we parsed BGCs into antiSMASH-annotated domains, resulting in 78,075 unique enzyme domains (Figure 1A). Pretrained ESM218 was used to embed each domain sequence into a 1,280-dimensional vector. The SMILES string of each natural product was tokenized by splitting every character into an individual token.
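As a minimal illustration of the product-side preprocessing, the character-level SMILES tokenization described above can be sketched in a few lines of Python. This is a simplification: some chemistry-aware tokenizers merge two-character element symbols such as "Cl", which this sketch deliberately does not.

```python
def tokenize_smiles(smiles: str) -> list[str]:
    # Split every character of the SMILES string into its own token,
    # mirroring the paper's stated character-level scheme rather than
    # chemistry-aware tokenizers that merge "Cl"/"Br".
    return list(smiles)

tokens = tokenize_smiles("CC(=O)O")  # acetic acid -> 7 tokens
```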

Figure 1. Overview of BGC-product training data and model architecture.

Figure 1.

(A) 2,635 experimentally validated BGCs were extracted from the MIBiG database for model training and evaluation. BGCs are represented as sequences of domains. 78,075 unique enzyme domains, 3,551 unique natural products and 3,824 BGC-product pairs are included in our dataset. (B) Distribution of BGCs across different biosynthetic class combinations. Class combinations with fewer than 5 members are omitted from the illustration. (C) Cross-attention is leveraged in BGC-MAC for classification and in BGC-MAP for BGC-product matching. The dotted line corresponds to a skip connection. (D) The BGC encoder takes BGC domain sequences initialized by ESM2 embeddings. The SMILES encoder takes a SMILES string as input. Both domain and SMILES representations are updated using a self-attention mechanism enhanced with RoPE. N: length of the domain sequence. n: length of the SMILES string.

We trained two independent models: BGC-MAC for BGC classification and BGC-MAP for BGC-product matching (Figure 1C). To train BGC-MAC, each BGC was assigned to at least one of six MIBiG classes: NRP, other, polyketide, RiPP, saccharide and terpene. To train BGC-MAP, for each positive BGC-product pair, we sampled five negative BGC-product pairs. Both models utilize a BGC encoder (Figure 1D) which takes a list of 1,280-dimensional vectors of enzyme domains as input, and the representations are updated using a multihead self-attention mechanism19 enhanced with Rotary Position Embeddings (RoPE)20. Similarly, the product encoder (Figure 1D) processes SMILES tokens using multihead self-attention with RoPE.
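The rotary embedding at the heart of both encoders can be illustrated with a minimal Python sketch applied to a single token vector. This is a generic RoPE implementation, not the models' actual code, and the base constant 10000 is the conventional default rather than a value reported in the paper.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply Rotary Position Embeddings to one token vector (sketch).

    Each pair of dimensions (2i, 2i+1) is rotated by the angle
    pos / base**(i/d), so dot products between rotated queries and
    keys depend only on relative position."""
    d = len(vec)
    out = [0.0] * d
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out[i] = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out
```

Because the rotation preserves vector norms and encodes position relatively, RoPE is a natural fit for variable-length domain and SMILES sequences.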

Cross-modal feature integration was achieved through cross-attention layers. In BGC-MAC, six class tokens act as queries to interact with the BGC encoder outputs, enabling class-specific focus on relevant domains of the BGC. In BGC-MAP, BGC representations query product representations to identify critical substructures for matching.
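A single-head, pure-Python sketch of this cross-attention step is shown below. The toy queries stand in for class tokens (BGC-MAC) or BGC representations (BGC-MAP); the real models use multiple heads and learned projection matrices that are omitted here.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention (sketch).

    In BGC-MAC the queries would be the six class tokens and the
    keys/values the BGC encoder outputs; here everything is toy data.
    Returns the attended outputs and the attention weights, which are
    what the interpretability analyses later inspect."""
    d = len(keys[0])
    outputs, attn_maps = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        attn_maps.append(weights)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs, attn_maps
```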

For both BGC-MAC and BGC-MAP, the training dataset was randomly divided into ten folds, with fold 10 reserved as the test set. The remaining nine folds were used to train an ensemble model, where each fold was used once as the validation set during cross-validation. We refer the reader to Methods for more details about data preprocessing, model architecture and training.
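Under the stated protocol, the split can be sketched as follows; the shuffling seed and the equal-sized interleaved folds are illustrative assumptions, not details reported in the paper.

```python
import random

def make_folds(items, n_folds=10, seed=0):
    """Randomly split items into n_folds. Per the paper's protocol,
    fold 10 (index 9) is held out as the test set and each of the
    remaining nine folds serves once as the validation set for one
    ensemble member."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    test_set = folds[-1]
    splits = []
    for v in range(n_folds - 1):
        val = folds[v]
        train = [x for i, f in enumerate(folds[:-1]) if i != v for x in f]
        splits.append((train, val))
    return splits, test_set
```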

BGC-MAC rivals antiSMASH and DeepBGC in natural product classification

We formulated the BGC classification task as multilabel classification, where each class prediction is treated as a separate classification task. BGC-MAC uses separate prediction heads to output six scores (0–1) for each BGC, indicating the likelihood of belonging to each of the six classes. The loss function computes the weighted average of per-label losses, with weights assigned to negative samples to address data imbalance. A nine-model ensemble of BGC-MAC was trained and validated using cross-validation to learn BGCs’ biosynthetic classes. Final model performance was generated by averaging scores on the held-out test set (fold 10 with 263 BGCs) across all ensemble models.
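The weighted per-label loss can be sketched as below. The specific negative-sample weight is a placeholder assumption, since the paper states only that weights are assigned to negative samples to counter class imbalance.

```python
import math

def weighted_multilabel_bce(scores, labels, neg_weight=0.5):
    """Weighted average of per-label binary cross-entropy losses
    (sketch). scores: six predicted probabilities; labels: six 0/1
    class labels; neg_weight down-weights negative labels."""
    total, wsum = 0.0, 0.0
    for p, y in zip(scores, labels):
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp for numerical safety
        w = 1.0 if y == 1 else neg_weight
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        wsum += w
    return total / wsum
```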

We first evaluated the performance of BGC-MAC. Because certain BGCs belong to more than one class, the 263 test BGCs comprise 310 BGC–class pairs in total. BGC-MAC achieved AUROC scores >0.97 for RiPP, saccharide and terpene, and >0.96 for NRP and polyketide, demonstrating strong separation of positive and negative instances (Table 1 and Figure 2A). The “other” class yielded a lower AUROC (0.857), likely due to its heterogeneous nature or ambiguous features compared to well-defined classes. NRP, polyketide, and RiPP showed balanced precision-recall trade-offs, yielding F1 scores >0.85. Saccharide and terpene, with fewer positive samples, had lower F1 scores (Table 1).
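The AUROC values reported here have a simple probabilistic reading: the score equals the probability that a randomly chosen positive instance outranks a randomly chosen negative one. A direct Mann-Whitney-style computation makes this concrete:

```python
def auroc(scores, labels):
    """AUROC computed as the probability that a random positive
    outranks a random negative; ties count half. O(P*N) pairwise
    version for clarity, not efficiency."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```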

Table 1.

Performance of BGC-MAC, antiSMASH and DeepBGC in BGC classification

Class       BGC count   Metric      BGC-MAC   antiSMASH   DeepBGC
NRP         78          AUROC       0.964     0.919       0.940
                        Recall      0.885     0.897       0.846
                        Precision   0.908     0.864       0.880
                        F1          0.896     0.881       0.863
Other       52          AUROC       0.857     0.655       0.783
                        Recall      0.692     0.404       0.269
                        Precision   0.600     0.512       0.824
                        F1          0.643     0.452       0.406
Polyketide  99          AUROC       0.965     0.940       0.916
                        Recall      0.970     0.960       0.939
                        Precision   0.897     0.880       0.894
                        F1          0.932     0.918       0.916
RiPP        39          AUROC       0.989     0.965       0.977
                        Recall      0.923     0.949       0.897
                        Precision   0.923     0.902       0.946
                        F1          0.923     0.925       0.921
Saccharide  25          AUROC       0.980     0.516       0.937
                        Recall      0.840     0.040       0.640
                        Precision   0.700     0.333       0.941
                        F1          0.764     0.071       0.761
Terpene     17          AUROC       0.997     0.878       0.897
                        Recall      1.000     0.765       0.647
                        Precision   0.680     0.868       0.786
                        F1          0.810     0.813       0.710

Figure 2. Comparison between BGC-MAC and antiSMASH in biosynthetic class prediction.

Figure 2.

(A) ROC curves on the test set. The dotted line displays the ROC curve expected for a random model. (B) Class-wise comparison of BGC-MAC and antiSMASH. Red dots indicate classes where BGC-MAC correctly predicted more samples than antiSMASH, and blue dots indicate the opposite. (C) Confusion matrix of BGC-MAC on the test set. (D) Confusion matrix of antiSMASH on the test set. BGCs with multiple biosynthetic classes were excluded in panels C and D.

We compared the performance of BGC-MAC with antiSMASH (version 7.0) and DeepBGC (version 0.1.31). antiSMASH predictions were generated by processing the test set MIBiG BGC GenBank files with antiSMASH using default parameters. DeepBGC prediction scores were generated by its random forest classifier following the BGC detection step. For a fair comparison, we retrained the DeepBGC classifier on our training dataset and evaluated its performance on the same test BGCs used for BGC-MAC. All models provide MIBiG-defined class output and can be compared directly. BGC-MAC outperformed antiSMASH and DeepBGC in AUROC across every product class (Table 1). BGC-MAC had higher F1 scores than DeepBGC across all classes, and it also surpassed antiSMASH in all classes except RiPP and terpene, where its scores were comparable. BGC-MAC also achieved higher precision than antiSMASH for all classes except terpene, demonstrating its ability to overcome the high false positive rates typically associated with machine learning-based approaches13. Interestingly, antiSMASH failed to identify most saccharide BGCs in the test set, resulting in a low recall rate. This is likely because antiSMASH is intentionally designed to avoid confidently identifying these clusters during processing.

Given that antiSMASH is the most widely used tool for BGC annotation, we further compared per-class performance between BGC-MAC and antiSMASH. We constructed scatter plots depicting, for each category, the counts of samples that BGC-MAC predicted correctly but antiSMASH did not, and vice versa (Figure 2B). The results showed BGC-MAC’s superiority over antiSMASH in four classes. Consequently, leveraging BGC-MAC to rectify antiSMASH prediction errors, particularly within these classes, holds promise for enhanced predictive accuracy through model ensembling (Supplementary Table S5). The confusion matrices for BGC-MAC (Figure 2C) and antiSMASH (Figure 2D) were consistent with the per-class recall and precision scores. For BGCs with a single class, BGC-MAC had fewer false negatives than antiSMASH except in RiPP, and fewer false positives except in terpene. The most common input for BGC-MAC is antiSMASH-formatted GenBank files, whose boundaries may be broader than MIBiG’s due to the inclusion of extended adjacent regions. We evaluated the influence of these extended borders and found the difference to be small (Supplementary Note 1).

Given the differences between bacterial and fungal BGCs21 and the predominance of bacterial BGCs in our training data (Supplementary Figure S1C), we assessed BGC-MAC’s performance on the 47 fungal BGCs within the test set. BGC-MAC achieved F1 scores comparable to antiSMASH for major classes such as NRP and polyketide (Supplementary Table S1). Furthermore, a large-scale assessment of 7,162 fungal BGCs from the antiSMASH database confirmed its generalization capabilities, showing encouraging concordance with antiSMASH for the NRP, polyketide and terpene classes (Supplementary Figure S3C).

BGC-MAC identifies key functional domains with attention mechanism

We investigated whether BGC-MAC can identify key domains involved in natural product biosynthesis. We hypothesized that important domains play determining roles in the classification and thus have high attention weights from the model’s cross-attention mechanism. To test this, we analyzed the functional role of high attention-weighted domains.

Each gene in the MIBiG dataset is annotated with a gene_kind label, indicating its functional role—such as biosynthetic for core enzymes that assemble the main scaffold of the natural product, biosynthetic-additional for tailoring enzymes that modify the core structure, and other categories like transport and regulatory for genes involved in export or expression control. We extracted the gene_kind field of each gene annotated in the MIBiG data file, assigning the same gene_kind value to all domains within a gene. We also annotated each domain using the Pfam database. Pfam annotations provide finer-grained identification of functional domains, revealing their specific biochemical activity.

We first ranked the domains in each correctly classified BGC in the test set based on their attention weights from the cross-attention layer. The top 20% of attention-weighted domains were then extracted and grouped into different BGC classes. For model interpretation, we compared the distribution of high attention-weighted domains with the original domain distribution within each BGC class in terms of their gene_kind or Pfam annotation.
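This ranking step is straightforward to express in Python; the domain names and weights below are placeholders for illustration.

```python
def top_attention_domains(domains, weights, frac=0.2):
    """Rank a BGC's domains by cross-attention weight and keep the
    top fraction, mirroring the interpretation protocol described in
    the text (top 20% by default)."""
    ranked = sorted(zip(domains, weights), key=lambda t: t[1], reverse=True)
    k = max(1, int(len(ranked) * frac))
    return [d for d, _ in ranked[:k]]
```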

Gene_kind annotation analysis revealed that biosynthetic and biosynthetic-additional domains accounted for a higher proportion among high-attention domains across most classes (Figure 3A), indicating at a coarse-grained level that biosynthesis-associated domains contribute most to BGC-MAC’s classification decisions.

Figure 3. BGC-MAC’s attention mechanism identifies key domains.

Figure 3.

(A) Comparison of domain gene_kind distributions in each class, shown as paired stacked bar charts. In each pair, the lower bar represents the original distribution, while the upper bar shows the distribution of high-attention domain gene kinds identified by the model. (B) Attention heatmap from the cross-attention layer for BGC0000198, with its true biosynthetic class highlighted in red. Key domain smCOG names are labeled. (C) Distribution of Pfam domain annotations across six classes. For each class, the top five high-attention Pfam domains are displayed, with red bars indicating their proportion among high-attention domains and blue bars showing their proportion in the original data.

Pfam annotation analysis further clarified the model’s recognition of domains critical for specific biosynthetic classes (Figure 3C). For NRP, the condensation domain and AMP-binding (adenylation) domain showed significantly higher proportions among high-attention domains (40.26% and 33.71%, respectively) compared to their frequencies in the original data. In NRP biosynthesis, amino acids are assembled into peptides by a modular enzymatic architecture. The condensation domain catalyzes peptide bond formation by facilitating the nucleophilic attack of a downstream amine on an upstream thioester-activated acyl group, extending the growing peptide chain.22 The adenylation domain selectively recognizes and activates specific amino acids, which are then tethered to a peptidyl carrier protein (thiolation domain).

Ketoacyl synthase (KS) and acyl transferase (AT) domains are highly represented among domains with high attention scores for polyketides. The AT domain is responsible for selecting an acyl-CoA substrate for the biosynthesis of polyketides, while the KS domain catalyzes the ketoacylation reaction via a Claisen mechanism, extending the polyketide intermediate chain with two carbons.23

For saccharides and terpenes, BGC-MAC also identified key domains: glycosyltransferase and terpene synthase domains were enriched among high-attention domains for saccharide and terpene BGCs, respectively. In terpenoid biosynthesis, P450s play a crucial role in oxidative functionalization, particularly in modifying terpene scaffolds. However, their proportion among high-attention domains (9.09%) is only slightly higher than in the original dataset (7.98%), likely because P450s act as tailoring enzymes rather than contributing to scaffold biosynthesis.24 P450s are also not specific to any single BGC class, constituting 0.73%, 2.19% and 1.13% of all domains in NRP, other, and polyketide BGCs, respectively.

To further examine BGC-MAC’s interpretability, we analyzed the attention map of BGC0000198, a polyketide-saccharide hybrid gene cluster encoding arenimycin A25 (Figure 3B). While antiSMASH classified this cluster solely as a polyketide, BGC-MAC correctly predicted its dual identity as both a polyketide and a saccharide BGC. BGC-MAC attributed high attention scores to two saccharide-related domains: dTDP-glucose synthase and MGT family glycosyltransferase.

The results indicated that BGC-MAC identifies key functional domains related to the classification of a given BGC using sequence information alone. This capability provides valuable insights into the underlying biosynthetic mechanisms and offers potential for mining enzyme functions related to biosynthesis.

BGC-MAP links BGCs to their encoded natural products

To establish associations between BGCs and the chemical structures of natural products, we trained BGC-MAP to predict whether a given chemical structure is encoded by a BGC. Similar to BGC-MAC, BGC-MAP was trained as an ensemble model and validated using nine-fold cross-validation. Model performance was then evaluated on the held-out test set (fold 10 with 2,525 BGC–product pairs). The final prediction score represents the probability that a given compound is encoded by the input BGC.

BGC-MAP demonstrated encouraging performance, achieving AUROC scores >0.80 across most biosynthetic classes (Figure 4A). This corresponds to a greater than 0.80 probability that a randomly chosen positive instance is ranked above a negative one. The cutoff during evaluation is 0.5, which separates the prediction score distributions of positive (62% >0.5) and negative (85% <0.5) samples (Supplementary Figure S4). Further evaluation via the confusion matrix showed that the model achieved recall between 0.519 and 0.699 and precision between 0.342 and 0.446 across classes (Figure 4B).
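For reference, the threshold-based metrics reported here follow the standard definitions, sketched below with toy scores and labels.

```python
def precision_recall(scores, labels, threshold=0.5):
    """Precision and recall at a fixed score cutoff (0.5 in the
    evaluation described above). Returns (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```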

Figure 4. BGC-MAP performance in product matching and application.

Figure 4.

(A) ROC curves for the test set. The dotted line displays the ROC curve expected for a random model. (B) Confusion matrix of BGC-MAP for six biosynthetic classes on the test set. (C) Applying BGC-MAP to search for target BGCs and products. Product to BGC: given a compound, identify the responsible BGC from multiple candidates. BGC to product: given a BGC, identify the target compound among a library of candidate compounds.

We also investigated the impact of the ensemble learning strategy. The ensemble model consistently outperformed the average of the individual models across all classes and metrics (precision, recall, AUC), achieving higher overall precision (0.420 vs 0.378), recall (0.617 vs 0.593) and AUC (0.843 vs 0.815), demonstrating more robust classification by aggregating predictions (Supplementary Table S6).

BGC-MAP performance in product-to-BGC prioritization

BGC-MAP could be used in different scenarios. In cases involving a single BGC-product pair, users can directly interpret the prediction score to assess whether the product is likely produced by the BGC. For multi-to-one matching—such as ranking multiple candidate compounds against a single BGC, or vice versa—users can prioritize pairs with high scores without relying on an absolute threshold (Figure 4C).
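The multi-to-one scenario reduces to sorting candidates by prediction score; a minimal sketch (with hypothetical candidate IDs standing in for BGC-MAP outputs) is:

```python
def prioritize(pairs):
    """Rank (candidate_id, prediction_score) pairs by score, highest
    first, returning 1-based ranks. The scores here stand in for
    BGC-MAP forward passes over each candidate pairing."""
    ranked = sorted(pairs, key=lambda t: t[1], reverse=True)
    return [(i + 1, cid, s) for i, (cid, s) in enumerate(ranked)]

# e.g. three candidate BGCs scored against one compound
ranking = prioritize([("bgc_a", 0.2), ("bgc_b", 0.7), ("bgc_c", 0.5)])
```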

In natural product discovery, top-down approaches start with a newly isolated compound and work backward to identify the responsible BGC from multiple candidates in the genomic data.26 Computational methods that can prioritize candidates are crucial for reducing the extensive experimental workload.27 To validate BGC-MAP’s performance in this scenario, we randomly selected five positive BGC-product pairs from the test set. For each product, we chose the genome assembly harboring the target BGC and retrieved all other BGC annotations within the genome from the antiSMASH database as candidates. BGC-MAP was then employed to compute and rank the prediction score between the compound and each candidate BGC (Table 2). The ground-truth clusters BGC0000693 and BGC0001007 were both ranked among the top three candidates in their respective genomes. Although the ground-truth BGC0000448 received a prediction score below the positive threshold (<0.5), it ranked in the top ten among the candidates.

Table 2.

Performance of BGC-MAP in identifying native BGCs encoding target compounds across BGC candidates within one genome

BGC ID Class Prediction score Score>0.5 Genome assembly Ranking
BGC0000448 NRP 0.22 False GCF_001896135 9/33
BGC0000693 Saccharide 0.68 True GCF_004117095 2/18
BGC0001007 NRP/Polyketide 0.79 True GCF_000514775 3/26
BGC0001790 NRP 0.69 True GCF_902831635 7/23
BGC0002209 Polyketide 0.05 False GCF_009556855 26/58

To further evaluate the model’s performance on non-MIBiG pairs, we retrieved three experimentally validated BGC-product pairs from the literature. These were designated BGCs #1–#3, corresponding respectively to the products dapalide A28 (NRP), lyngbyapeptin A29 (NRP-polyketide hybrid), and merosterol30 (terpene) (Figure 4C). We then queried these products against the 2,635 MIBiG BGCs together with the three validated BGCs. For BGC #1 (dapalide A, NRP) and BGC #2 (lyngbyapeptin A, NRP-polyketide hybrid), despite the presence of many NRP and polyketide BGCs in MIBiG, BGC-MAP maintained strong discriminative power: BGC #2 ranked 46th with a high score (0.76) and BGC #1 ranked 395th of 2,638. Although BGC #3 had a low score of 0.23, it still ranked within the top ~30% of all candidates.

For BGC #3, the candidate BGCs with low scores (<0.1) were primarily of the NRP (803), polyketide (614), and RiPP (289) classes (Figure 5A). This demonstrated that even though the true pair’s score was only 0.23, BGC-MAP was still able to filter out many structurally inconsistent candidates. On the other hand, among BGCs that scored highly (>0.5), most were polyketide (231) and the correct class, terpene (111). This suggests that, in this example, BGC-MAP confused polyketide BGCs with terpene structures similar to merosterol, leading to a ranking of 803rd out of 2,638 candidates for the true pair.

Figure 5. Performance of BGC-MAP in prioritizing BGC-product pairs.

Figure 5.

(A) Class distribution of candidate BGCs in merosterol (BGC#3’s product) to BGC matching task. The bars show the class distribution for BGCs with prediction scores below 0.1 versus those with scores above 0.5. (B) Prediction scores versus rankings for the true products of 10 test BGCs queried against the NP Atlas compound database. The green dashed line indicates the default score threshold (0.5) for a positive link, and the red dashed line represents the expected random ranking. (C, D) t-SNE visualization of the chemical space (defined by ECFP) of 10,826 product candidates for (C) BGC0000676 (terpene) and (D) BGC0001007 (NRP-Polyketide Hybrid). Each point represents a compound, colored by the ranking of its prediction. The true product is represented as a red dot. (E) Average ranking of the k-nearest neighbors defined by Euclidean distance on ECFP fingerprints for the true products of seven query BGCs. The lines show how the ranking changes with neighborhood size (K).

BGC-MAP performance in BGC-to-product prioritization

We also evaluated the reverse scenario: ranking a large library of candidate compounds against a single BGC. Using seven randomly selected MIBiG BGCs and the three non-MIBiG examples as queries, we searched against >10,000 representative compounds from the Natural Product Atlas (NP Atlas) database. Several true BGC-product pairs ranked within the top 300 candidates (BGC0000693, BGC0001007, BGC0001790, BGC #2) with high prediction scores (Figure 5B). Notably, the novel pair BGC #2–lyngbyapeptin A had a prediction score of 0.76 and ranked 43rd among all 10,829 candidates. Only one BGC-product pair (BGC0002209) ranked worse than random chance.

We further visualized the chemical space of all candidates using t-SNE31 based on their Extended Connectivity Fingerprints (ECFP)32. Coloring each compound by the ranking of its prediction, we observed that regions surrounding the true product tend to show better rankings (Figure 5C,D). However, since t-SNE visualizations primarily preserve local similarities, we sought to quantify this trend by focusing on the seven of ten BGC-product pairs with overall rankings below 2,000. For the target compounds of these pairs, we calculated the average ranking of their k-nearest neighbors, defined by Euclidean distance. We observed that for small values of k, the average ranking of the neighbors was also low. As k increased, the average ranking rose and gradually approached 3,000 (Figure 5E). This trend indicated that as the neighborhood expands to include more structurally dissimilar compounds, their average ranking worsens, confirming that the model correctly deems them poor candidates for the target BGC. This suggests BGC-MAP prunes irrelevant regions of chemical space, creating a favorable landscape for prioritization.
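The k-nearest-neighbor ranking analysis can be sketched as follows; the toy coordinate lists stand in for ECFP bit vectors, and rankings are 1-based positions in the model's sorted predictions.

```python
import math

def knn_average_ranking(target_fp, fps, rankings, k):
    """Average prediction ranking of the k nearest neighbors of a
    target compound, with neighbors defined by Euclidean distance
    between fingerprint vectors (toy lists here stand in for ECFPs)."""
    dists = [(math.dist(target_fp, fp), r) for fp, r in zip(fps, rankings)]
    dists.sort(key=lambda t: t[0])
    nearest = dists[:k]
    return sum(r for _, r in nearest) / k
```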

However, the true products of the novel pairs BGC #1 (ranked 1,889/10,829) and BGC #3 (3,542/10,829) (Supplementary Table S9) fell outside the top 1,000. This suggests that, although these rankings are better than random selection, BGC-MAP’s ability to reduce experimental effort may be constrained when the candidate space is large. This is likely due to the high density and structural redundancy of the NP Atlas library, where many structurally similar compounds may interfere.

Overall, BGC-MAP demonstrated encouraging potential for both product-to-BGC and BGC-to-product matching by surpassing random selection in most cases and filtering out false pairs. Its practical use is more promising when the candidate set is smaller and contains a larger proportion of compounds from categories distinct from the query.

BGC-MAP identifies key substructures in natural products with attention mechanism

To evaluate BGC-MAP’s ability to identify critical substructure-domain relations, we analyzed cross-attention heatmaps, averaged across all attention heads and ensemble models, for three distinct BGCs and their associated products. Specifically, we averaged the attention scores along the BGC domain dimension. Our results indicated that BGC-MAP effectively identified critical substructures within natural products using its attention mechanism. We present two illustrative examples below.
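The averaging procedure described above can be sketched directly; each head's map is a domains-by-tokens matrix (toy values here), and the output is one attention score per SMILES token.

```python
def averaged_attention(heads):
    """Average a stack of cross-attention maps (one per head and/or
    ensemble member), then average along the BGC-domain axis to get
    one score per SMILES token, as in the heatmap analysis.

    heads: list of [n_domains x n_tokens] nested lists."""
    n_heads = len(heads)
    n_dom, n_tok = len(heads[0]), len(heads[0][0])
    # mean over heads, elementwise
    mean_map = [[sum(h[i][j] for h in heads) / n_heads
                 for j in range(n_tok)] for i in range(n_dom)]
    # mean over the domain axis -> per-token score
    return [sum(mean_map[i][j] for i in range(n_dom)) / n_dom
            for j in range(n_tok)]
```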

BGC0001007 (NRP-PKS hybrid, Lymphostin):

The pyrrolo[4,3,2-de]quinoline nitrogen (critical for core formation from L-tryptophan) and the N-acetylation site displayed moderate-to-high attention weights (Figure 6A). This corresponds to steps in the lymphostin biosynthetic pathway, where NAT-mediated acylation orchestrates N-acetylated diketide aldehyde assembly, although the mechanistic details of nucleophilic quinoline formation remain unreported. In addition, moderate attention was observed on a double bond in the polyketide moiety (Figure 6A), a common substructure in polyketides produced by dehydratases33 or other mechanisms34,35. In this case, the methyltransferase LymB converts the β-keto intermediate (O=C-C-C=O) into an α,β-unsaturated methyl ester (O=C-C=C-O-CH3), which generates the C=C double bond36.

Figure 6. Visualization of BGC-MAP attention score aligned with product structures.

Figure 6.

High attention scores correspond to key structural features of the predicted products. The bar chart represents domain-averaged attention scores on SMILES tokens. (A) Visualization of the BGC0001007 product and its corresponding SMILES string. Two nitrogen-containing groups are colored orange and lime; the double bond is colored yellow. (B) Visualization of the BGC0001790 product and its corresponding SMILES string. The β-lactam ring region, peptide-bond region, and amino group are highlighted in orange, yellow, and lime, respectively.

BGC0001790 (NRP, Sulfazecin):

The biosynthesis of sulfazecin is initiated by two NRPSs (SulI and SulM) containing three canonical modules (M1–M3): D-Glu is activated by the A1 domain, L-Ala is activated and epimerized to D-Ala in M2, and L-2,3-Dap is incorporated via M3, forming a D,D,L-tripeptide. The timing of N-sulfonation remains unresolved. β-lactam ring formation occurs through a mechanism distinct from the C domain-catalyzed β-elimination/addition seen in nocardicin G37. C-3 methoxylation is introduced after β-lactam formation.38 High attention weights were observed on the N-sulfonation group, the β-lactam ring (ring 1 in the SMILES string), and the peptide-bond regions (Figure 6B). The chiral carbon atom of one amino acid also had high attention weights. In this instance, besides the peptide bonds common to NRPs, the model also prioritized the sulfonamide moiety to infer BGC-product associations, reflecting a substructure-driven pattern-recognition strategy.

Despite identifying key substructures in natural products, the observed band-like distribution of attention weights, in which high-attention substructures correlate broadly with nearly all domains in a BGC, suggests that BGC-MAP does not explicitly learn a one-to-one mapping between protein domains and natural product substructures. Our analysis of three additional BGC-product pairs (Supplementary Note 3, Supplementary Figure S56) revealed that many substructures with high attention weights do not consistently correspond to particular biosynthetic steps; rather, these moieties are merely class-defining features of certain natural products. This may be because natural product biosynthesis involves numerous reaction steps, so each substructure is associated with multiple domains rather than a single specific one, precluding a one-to-one mapping.

Discussion

Our study established a deep learning framework trained on experimentally validated BGCs to advance their biosynthetic classification (BGC-MAC) and predict BGC-product associations (BGC-MAP). Despite being trained on a limited dataset, BGC-MAC demonstrated exceptional performance, achieving AUROC scores exceeding 0.96 for most biosynthetic classes, with the "other" class as the exception, and outperforming antiSMASH and DeepBGC in average AUROC across all classes. BGC-MAC can also classify fungal BGCs. Attention weight analysis indicated that BGC-MAC identifies functional enzyme domains closely related to a given BGC class without prior annotation, highlighting the efficiency of data-driven approaches built on pretrained embeddings in capturing BGC features. One can also combine BGC-MAC and antiSMASH to cross-validate predictions for a BGC of interest.

BGC-MAP directly links product structures with BGC sequences, achieving AUROC scores above 0.80 for most classes. Case studies revealed its potential to prioritize candidate BGCs for a given natural product, or vice versa. Visualization of the chemical space and prediction scores suggested that BGC-MAP excludes irrelevant regions of chemical space during prioritization. Analysis of attention weights revealed BGC-MAP's ability to identify key substructures in the BGC-natural product matching task, offering insights into biosynthetic mechanisms and potential enzyme functions for genome mining.

Nonetheless, several limitations remain. Insufficient training data in underrepresented classes restricts the performance of BGC-MAC and BGC-MAP. The complexity of biosynthetic pathways prevents precise substructure-to-domain mapping in BGC-MAP, as seen in the band-like distribution of attention scores across domains. In the future, further fine-tuning and additional training data could develop BGC-MAC and BGC-MAP into powerful prioritization tools.

Looking forward, expanding the number of experimentally validated BGCs will be critical for improving model performance. As datasets grow, we expect deep learning models to substantially surpass rule-based methods, particularly for novel natural products. Moreover, incorporating precursor data or intermediate metabolites could enhance BGC-MAP's ability to link specific domains to substructures.

In summary, our models represent a substantial advance in BGC annotation, offering a scalable, data-driven alternative to traditional rule-based tools and laying the groundwork for more precise biosynthetic predictions as experimental datasets expand.

Methods

Preparing BGC Dataset

We assembled a training dataset using BGCs from the MIBiG (version 4.0) database, where the biosynthetic classes and product structures are annotated based on experimental evidence.9 Retired entries and one BGC with excessive enzyme domains (BGC0002977, with >4,000 domains) were excluded from the dataset, resulting in 2,635 experimentally validated BGCs for model training and evaluation. We identified 2,114 BGCs with at least one characterized natural product, yielding 3,824 BGC-product pairs because some BGCs produce multiple products. Since the same natural product can be synthesized by different BGCs, these pairs encompass 3,551 unique natural products. For each BGC, we extracted the SMILES string of its product, its biosynthetic class defined by MIBiG, and the list of constituent enzyme sequences. Enzymes in the MIBiG data file were split into individual domains to prevent excessively long sequences, and for consistency, we also refer to the unsplit enzymes as domains, yielding a total of 78,075 unique domains. Each BGC was represented as a sequence of enzyme domains, where each domain served as an individual token. Biosynthetic classes and product SMILES strings serve as the ground-truth labels for BGC-MAC and BGC-MAP, respectively, during training and evaluation. The taxonomy distribution of the BGCs is shown in Supplementary Figure S1.

Preprocessing data

We used the ESM-2 model with 33 layers and 650M parameters18, a masked protein language model, to embed each domain sequence into a 1,280-dimensional vector. We initialized the model with the pretrained weights esm2_t33_650M_UR50D.pt; the model processes the amino acid sequence of each domain through 33 attention-based update layers. The final representation of each domain was taken from the output vector of the <cls> token rather than from averaged residue embeddings. For SMILES tokenization, we applied the regular expression-based method developed by Schwaller et al.39, resulting in a vocabulary size of 138.
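As an illustration, the regular-expression SMILES tokenizer of Schwaller et al. can be sketched as follows; the pattern is the one published in their work, while the helper name and the round-trip check are ours:

```python
import re

# Regex from Schwaller et al.: matches bracket atoms, two-letter halogens,
# aromatic atoms, bonds, ring-closure digits (incl. %NN), and branches.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\."
    r"|=|#|-|\+|\\|\/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str):
    tokens = SMILES_REGEX.findall(smiles)
    # A faithful tokenization reproduces the input when re-joined.
    assert "".join(tokens) == smiles, "untokenizable characters in SMILES"
    return tokens
```

The vocabulary is then the set of distinct tokens observed over the training products.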

Sampling negative BGC-product pairs for BGC-MAP

For the 3,824 positive BGC-product pairs in our dataset, we generated five negative samples per BGC by randomly sampling natural products. To challenge BGC-MAP to learn non-trivial features, three negative samples were restricted to natural products structurally similar to the true product. Specifically, we computed the pairwise similarity between all true natural products and 10% of the compounds (89,502) sampled from the COCONUT database40 using the FingerprintSimilarity function in RDKit41. For each positive BGC-product pair, three negative candidates were sampled from the subset whose similarity values fell in the top 0.1% among the 89,502 candidates, subject to a maximum similarity threshold of 0.8. To increase the diversity of negative data and avoid potential bias from the COCONUT database, two negative samples were sampled from products encoded by other BGCs in the MIBiG database9. For the 521 BGCs lacking validated product structures, four negative samples were generated without similarity constraints solely from the COCONUT database, resulting in a total of 25,252 data points. To address potential inconsistency between the SMILES string formats of MIBiG and COCONUT, we canonicalized all SMILES strings with RDKit prior to training (via Chem.MolFromSmiles followed by Chem.MolToSmiles(canonical=True)).
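The hard-negative selection logic (top 0.1% similarity, capped at 0.8) can be sketched as follows, assuming the Tanimoto similarities to the candidate pool have already been computed with RDKit; the function name and the fallback behavior are illustrative:

```python
import numpy as np

def sample_hard_negatives(sims, n_hard=3, top_frac=0.001, max_sim=0.8, rng=None):
    """Pick structurally similar ("hard") negative products for one positive pair.

    sims: 1-D array of Tanimoto similarities between the true product and every
    candidate compound (e.g., the COCONUT subsample). Candidates must fall in
    the top `top_frac` of similarities while staying below `max_sim`.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    cutoff = np.quantile(sims, 1.0 - top_frac)
    eligible = np.where((sims >= cutoff) & (sims < max_sim))[0]
    if len(eligible) < n_hard:
        # Fallback (our choice, not specified in the text): take the most
        # similar compounds that still satisfy the max_sim cap.
        eligible = np.argsort(np.where(sims < max_sim, sims, -1.0))[-n_hard:]
    return rng.choice(eligible, size=n_hard, replace=False)
```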

Deep learning model architecture

We implemented and trained BGC-MAC and BGC-MAP using PyTorch.42 For clarity, let B = (D_1, …, D_N) denote the whole domain sequence of a BGC, where N is the number of domains and each D_i is the 1,280-dimensional vector obtained from ESM-2. The biosynthetic classes were represented as six class tokens C = (c_1, …, c_6) (NRP, Other, Polyketide, RiPP, Saccharide, and Terpene). The structure of a natural product was represented by a sequence of SMILES tokens S = (s_1, …, s_n), where n denotes the length of the tokenized SMILES string.

BGC encoder.

The BGC encoder applies an attention mechanism over the BGC sequence to learn relationships between different domains within biosynthetic pathways. The input B ∈ ℝ^{N×1280} undergoes a linear transformation, and the resulting 512-dimensional representation B ∈ ℝ^{N×512} is updated by a multi-head self-attention layer19 integrated with RoPE.20 The query, key, and value matrices are first obtained (eq (1)). RoPE then applies a position-dependent rotation to the query and key vectors, embedding the relative positional information of domains within a BGC directly into the attention scores.

Q = BW_Q,  K = BW_K,  V = BW_V  (1)

For the query vector q_m at position m in Q and the key vector k_n at position n in K, we apply the rotary transformation as follows:

q̃_m = R_{θ,m} q_m,  k̃_n = R_{θ,n} k_n  (2)

where R_{θ,m} is a rotation matrix determined by position m, defined such that each pair of dimensions (2i, 2i+1) of the d-dimensional vector is rotated by an angle θ_{i,m} = m · 10000^{−2i/d}. The value matrix V remains unchanged. The attention output is then computed as

Attention(Q, K, V) = softmax(Q̃K̃ᵀ / √d_k) V  (3)

Here, Q̃ = (q̃_1, …, q̃_N) and K̃ = (k̃_1, …, k̃_N) denote the rotated query and key matrices, respectively, and d_k is the dimensionality of each attention head. The output of the attention layer, B_out, is combined with the input via a residual connection and layer normalization, followed by a feedforward network (FFN) with another residual connection and normalization:

H = LayerNorm(B + B_out),  Output = LayerNorm(H + FFN(H))  (4)
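Eqs (1)–(3) can be sketched in NumPy as follows; the residual/LayerNorm/FFN block of eq (4) is omitted for brevity, and the shapes and names are illustrative rather than taken from the released implementation:

```python
import numpy as np

def rope_rotate(x):
    """Apply rotary position embedding to x of shape (seq_len, d), d even.
    The pair (2i, 2i+1) at position m is rotated by the angle m * 10000**(-2i/d)."""
    n, d = x.shape
    i = np.arange(d // 2)
    theta = 10000.0 ** (-2.0 * i / d)             # per-pair base frequencies
    ang = np.arange(n)[:, None] * theta[None, :]  # (n, d/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_self_attention(B, WQ, WK, WV):
    """Eqs (1)-(3): project, rotate Q and K, leave V unrotated, attend."""
    Q, K, V = B @ WQ, B @ WK, B @ WV
    Qr, Kr = rope_rotate(Q), rope_rotate(K)
    scores = Qr @ Kr.T / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)             # row-wise softmax
    return w @ V
```

Because each pair of dimensions is rotated by an orthogonal matrix, the rotation preserves vector norms while making the Q·K dot products depend only on relative positions.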

SMILES encoder.

The SMILES encoder utilizes attention to capture the contextual relationships between tokens that define atoms and their connectivity. The tokenized SMILES string of a natural product serves as input to the encoder, and an embedding layer transforms the sequence of token indices into vector representations.

S = Embedding(indices),  indices ∈ ℕⁿ,  S ∈ ℝ^{n×512}  (5)

Then S is updated in the same way as in the BGC encoder.

BGC-MAC.

BGC-MAC takes a BGC sequence B ∈ ℝ^{N×1280} and six class-token indices as input to classify the BGC. An embedding layer transforms the class indices into vector representations C ∈ ℝ^{6×512}. The BGC sequence is then fed into the BGC encoder to learn the functional organization and interdependencies of its domains. The class representations (queries) interact with the BGC encoder outputs B_O ∈ ℝ^{N×512} (keys and values) through a cross-attention layer; the respective query, key, and value matrices are obtained as follows:

Q = CW_Q,  K = B_O W_K,  V = B_O W_V  (6)

The cross-attention layer allows each class to selectively attend to relevant parts of the BGC and extract class-specific information. The output of the cross-attention is obtained as

CrossAttention(Q, K, V) = softmax(QKᵀ / √d_k) V  (7)

The output is combined with the input via a residual connection and layer normalization, followed by an FFN with another residual connection and normalization. It then passes through six independent binary prediction heads, one per class. Each head outputs a real number between 0 and 1 after a sigmoid activation, describing the probability that the BGC belongs to the corresponding biosynthetic class. In model evaluation, we assigned the input to a class when its prediction score was greater than 0.5. During inference, the model generates a CSV file containing the raw prediction score for each class.
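The class-token cross-attention of eqs (6)–(7) together with the per-class sigmoid heads can be sketched as follows; the residual connection, LayerNorm, and FFN are omitted, and all weight names are illustrative:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def class_cross_attention(C, BO, WQ, WK, WV, head_W, head_b):
    """Eqs (6)-(7) plus the six independent binary heads.

    C:  (6, d)  class-token embeddings (queries)
    BO: (N, d)  BGC-encoder outputs (keys and values)
    head_W, head_b: (6, d) and (6,), one sigmoid head per class.
    """
    Q, K, V = C @ WQ, BO @ WK, BO @ WV
    attended = softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V  # (6, d)
    logits = (attended * head_W).sum(axis=1) + head_b      # one logit per class
    return 1.0 / (1.0 + np.exp(-logits))                   # per-class probabilities
```

Each class token thus extracts its own view of the BGC before its binary head decides membership.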

BGC-MAP.

BGC-MAP takes a BGC sequence and the SMILES token indices of a natural product as input to judge whether the given structure is encoded by the BGC. The SMILES indices are fed into the SMILES encoder to capture crucial structural information, and the BGC sequence representations are updated in the BGC encoder. Cross-attention is then performed, where the BGC embeddings B_O ∈ ℝ^{N×512} act as queries and the product embeddings S ∈ ℝ^{n×512} serve as keys and values. The respective query, key, and value matrices are obtained as follows:

Q = B_O W_Q,  K = SW_K,  V = SW_V  (8)

The cross-attention layer extracts information about the relationship between product substructure and BGC domains. The output of the cross-attention is obtained as

CrossAttention(Q, K, V) = softmax(QKᵀ / √d_k) V  (9)

The output is combined with the input via a residual connection and layer normalization, followed by an FFN with another residual connection and normalization. It is then aggregated by mean pooling and passed through a classification head. The final output, after a sigmoid activation, is a real number between 0 (not a product) and 1 (product), representing the probability that the given structure is synthesized by the BGC. In model evaluation, we assigned the input as a BGC product when its prediction score was greater than 0.5. During inference, the model generates a CSV file containing the raw prediction score for each given BGC-product pair.
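The final pooling and classification step can be sketched as follows; the head weights are illustrative placeholders:

```python
import numpy as np

def map_prediction_head(H, w, b):
    """Mean-pool the cross-attention output over the N BGC domains and apply
    a sigmoid classification head.

    H: (N, d) cross-attention output; w: (d,) head weights; b: scalar bias.
    """
    pooled = H.mean(axis=0)                         # (d,) one vector per BGC-product pair
    return 1.0 / (1.0 + np.exp(-(pooled @ w + b)))  # probability the pair matches
```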

Model training and optimization

For each model, the full dataset was shuffled with a fixed random seed to ensure reproducibility and divided into 10 equal folds. Fold 10 was held out as the fixed test set for final model evaluation. The composition of each fold can be found in Supplementary Table S2 and the metadata file in Zenodo. Over the remaining 9 folds, 9-fold cross-validation was applied: in each iteration (i = 1 to 9), fold i served as the validation set while the other 8 folds formed the training set. This process generated 9 distinct training-validation splits used to train an ensemble of 9 models, each trained on a unique combination of training and validation data, while the test set was fixed for final evaluation (Supplementary Figure S2). For BGC-MAC, fold 10 contains 263 BGCs and the remaining folds contain 2,372; for BGC-MAP, fold 10 contains 2,525 BGC-product pairs and the remaining folds contain 22,727.
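The splitting scheme can be sketched as follows; the seed value and near-equal fold sizes are illustrative of, not identical to, the released pipeline:

```python
import random

def make_folds(items, n_folds=10, seed=42):
    """Shuffle with a fixed seed and split into n_folds near-equal folds.
    Folds 0-8 feed the 9-fold cross-validation; the last fold is the
    held-out test set."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::n_folds] for i in range(n_folds)]

def cv_splits(folds):
    """Yield (train, validation) pairs over the first 9 folds, one per
    ensemble member; the last (test) fold is never touched."""
    dev = folds[:-1]
    for i in range(len(dev)):
        train = [x for j, f in enumerate(dev) if j != i for x in f]
        yield train, dev[i]
```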

To mitigate class imbalance (more negatives than positives), binary cross-entropy (BCE) loss with weighted positive samples was employed to prevent the model from assigning too many samples to the over-represented negative class. For both BGC-MAC and BGC-MAP, the per-BGC training losses are defined as:

ℒ_MAC(t) = Σ_{i=1}^{6} w_i(t) · l_BCE(p̂_i, p_i)  (10)
ℒ_MAP = l_BCE(p̂, p)  (11)

The training process employed an early stopping strategy, where training was halted if the validation loss did not improve for 5 consecutive epochs, and the model from the epoch with the lowest validation loss was selected for downstream experiments.
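The weighted BCE loss and the early-stopping rule can be sketched as follows; this is a simplification in which the positive weight is a fixed scalar, whereas in the actual training loop the class weights w_i(t) of eq (10) evolve under GradNorm:

```python
import numpy as np

def weighted_bce(p_hat, p, pos_weight=1.0, eps=1e-12):
    """Binary cross-entropy with up-weighted positive samples (cf. eqs 10-11).
    pos_weight > 1 counteracts the excess of negative pairs."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    loss = -(pos_weight * p * np.log(p_hat) + (1 - p) * np.log(1 - p_hat))
    return loss.mean()

class EarlyStopper:
    """Halt when validation loss fails to improve for `patience` epochs and
    remember the epoch of the best loss for checkpoint selection."""
    def __init__(self, patience=5):
        self.patience, self.best, self.best_epoch, self.bad = patience, float("inf"), -1, 0

    def step(self, epoch, val_loss):
        if val_loss < self.best:
            self.best, self.best_epoch, self.bad = val_loss, epoch, 0
        else:
            self.bad += 1
        return self.bad >= self.patience  # True -> stop training
```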

To increase randomness during BGC-MAP training, the BGCs of n negative pairs were replaced by BGCs from positive pairs in the same batch to create dynamic negative samples, where n equals the number of positive pairs in the batch.

To find the best hyperparameters, we used the Python package Optuna43 to perform hyperparameter optimization, leveraging its efficient Tree-structured Parzen Estimator (TPE) algorithm, for the following hyperparameters: the learning rates of the model and of GradNorm, the hidden dimension of the attention layers, the dropout rate, and the weight for negative data points. The final hyperparameters for all models are tabulated in Supplementary Tables S3 and S4.

Model evaluation

For BGC-MAC and BGC-MAP, prediction scores from the nine ensemble models were aggregated to enhance robustness. The ensemble was evaluated on the withheld test set, with the final prediction scores obtained by averaging the outputs of all models in the ensemble. The following metrics were computed for each model:

Accuracy = (n_TP + n_TN) / (n_TP + n_TN + n_FP + n_FN)  (12)
Precision = n_TP / (n_TP + n_FP)  (13)
Recall = n_TP / (n_TP + n_FN)  (14)
F1 = 2 × Precision × Recall / (Precision + Recall)  (15)

where n_TP, n_TN, n_FP, and n_FN refer to the counts of true positives, true negatives, false positives, and false negatives, respectively.
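Eqs (12)–(15) follow directly from binary labels and thresholded predictions:

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 (eqs 12-15) from binary
    labels and 0/1 predictions; undefined ratios default to 0."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```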

We compared the performance of BGC-MAC with the rule-based antiSMASH (version 7.0)10 and the machine learning-based DeepBGC (version 0.1.31)11 to evaluate its ability to classify BGCs in the test set, using AUROC and F1 scores. Both antiSMASH and DeepBGC output classes defined by MIBiG, enabling a direct comparison with BGC-MAC. antiSMASH predictions were generated by processing the GenBank files of test-set BGCs with default parameters, and the predicted classes were extracted from the "category" field of the protocluster feature in the GenBank output. DeepBGC prediction scores for each class were obtained from the TSV output file generated by the DeepBGC pipeline command. Because the released DeepBGC random forest classifier was trained on 2,502 BGCs from MIBiG (version 3.1), we retrained the classifier on MIBiG (version 4.0) with the test BGCs held out. For an input BGC, we assigned the classification result as 'None' if any of the following conditions applied: (1) all BGC-MAC prediction scores were below 0.5, (2) antiSMASH or DeepBGC provided no annotation, or (3) DeepBGC's classifier produced no confident class prediction.

To examine BGC-MAC's performance on fungal BGC classification, we extracted the fungal BGCs from the test set and separately computed their precision, recall, F1-score, and AUC (Supplementary Table S1). For the generalization analysis, a search was conducted in the antiSMASH database for BGCs belonging to the superkingdom Eukaryota, yielding 7,162 fungal BGCs in GenBank format. These BGCs spanned several fungal phyla, including Ascomycota, Basidiomycota, Chytridiomycota, and Mucoromycota, and the GenBank files were subsequently used as input for BGC-MAC prediction.

To evaluate the performance of BGC-MAP in BGC-to-product prioritization, five positive BGC-product pairs were randomly sampled from the test set. For each pair, we queried and selected a genome assembly harboring the target BGC with high similarity (>80%) through KnownClusterBlast in the antiSMASH database (version 4)44. Then, for each genome, we downloaded all of its antiSMASH-annotated BGCs and used BGC-MAP to compute and rank the prediction score of the target compound paired with each BGC.

To evaluate the performance of BGC-MAP in product-to-BGC prioritization, we downloaded the NP Atlas database (version 2024.09), which contains 36,454 natural products grouped into 10,508 clusters. The candidate set for ranking consisted of 10,508 representative products (one from each NP Atlas cluster; seed = 42) and 333 compounds from the BGC-MAP test set (one per BGC), yielding a total of 10,826 candidates after deduplication. For querying the 3 non-MIBiG BGC-compound pairs, whose corresponding products were included in the candidate pool, the total number of candidates was 10,829. The expected rank of a randomly selected target candidate among a pool of ~10,830 candidates was calculated to be approximately 5,415.

t-SNE and K-Nearest Neighbor Visualization

To represent each BGC, we averaged the ESM embedding of all enzyme domains within the cluster to generate a single 1,280-dimensional embedding. We applied t-SNE for dimensionality reduction to two components as implemented in scikit-learn with a perplexity of 30, a random seed of 42, and a maximum of 1,000 iterations.

To visualize the chemical space of candidate products in the NP Atlas query task, we computed ECFPs using RDKit’s RDKitFPGenerator with a fingerprint size of 1,024 bits. The resulting ECFP representations were then reduced to two dimensions using t-SNE with a perplexity of 50 to account for the dataset’s density. Because t-SNE primarily preserves local similarity rather than global distance, we additionally computed the k-nearest neighbors for each target compound based on Euclidean distance in the ECFP space and calculated the average ranking among its k nearest neighbors.
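The k-nearest-neighbor ranking analysis can be sketched as follows, with fingerprints and BGC-MAP scores assumed precomputed; the function name and argument layout are illustrative:

```python
import numpy as np

def knn_average_rank(target_fp, candidate_fps, scores, k=10):
    """Average BGC-MAP rank of a target compound's k nearest neighbors.

    target_fp:     (d,) fingerprint of the query compound
    candidate_fps: (M, d) fingerprints of all candidates
    scores:        (M,) model prediction scores (higher = better match);
                   rank 1 is the top-scoring candidate.
    """
    dists = np.linalg.norm(candidate_fps - target_fp, axis=1)  # Euclidean in ECFP space
    nn = np.argsort(dists)[:k]
    order = np.argsort(-scores)             # descending score order
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(scores) + 1)  # 1-based ranks
    return ranks[nn].mean()
```

A low average rank among structural neighbors indicates that the model concentrates high scores in the locally relevant region of chemical space.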

HMMER annotation for interpretability analysis

To interpret the model, each domain in the BGCs was annotated with protein family domains using hmmscan in pyHMMER45 against Pfam database version 37.1,46 with an E-value threshold of 1e-10. We extracted average attention weights from the cross-attention layers of each of the nine models in the BGC-MAC ensemble. For each biosynthetic-class token, we selected the top 20% of attention-weighted domains associated with that class from correctly predicted positive samples in the test set. The corresponding Pfam annotations and MIBiG gene_kind annotations of these domains were extracted. By comparing the distribution of high-attention domains with the original domain distribution in each class, we assessed whether the model effectively identifies key domains of specific biosynthetic classes.
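The comparison of high-attention domains against the background distribution can be sketched as an enrichment ratio; the paper reports the two distributions themselves, so this exact ratio formulation, the Pfam accessions, and the function name are illustrative:

```python
import numpy as np
from collections import Counter

def high_attention_enrichment(attn, pfam_labels, top_frac=0.20):
    """Enrichment of Pfam families among the top-20% attention-weighted domains.

    attn:        (N,) per-domain attention weights for one class token
    pfam_labels: length-N Pfam annotation per domain
    Returns {family: fraction_in_top / fraction_overall} for families in the top set.
    """
    n_top = max(1, int(round(top_frac * len(attn))))
    top_idx = np.argsort(attn)[-n_top:]              # highest-attention domains
    top = Counter(pfam_labels[i] for i in top_idx)
    overall = Counter(pfam_labels)
    return {fam: (top[fam] / n_top) / (overall[fam] / len(attn)) for fam in top}
```

Ratios well above 1 indicate Pfam families the class token preferentially attends to.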

Supplementary Material

SI

Supplementary notes detailing the influence of BGC borders on model performance, the classification of fungal BGCs, and extended examples for BGC-MAP interpretation; supplementary tables detailing model performance metrics, hyperparameter optimization, dataset composition, and the effects of ensemble learning; supplementary figures illustrating dataset taxonomy distributions, training strategy, and BGC-MAP attention visualizations.

Acknowledgements

The authors acknowledge support from the UF Blue Future Medicine Initiative and from the National Institutes of Health grants RM1GM145426 (H.L., Y.D.), R35GM128742 (Y.D.), and R35GM159995 (W.J.X.).

Footnotes

Competing interests

The authors declare no competing financial interest.

Data and software availability

All data generated in this study and all processed data used to produce the results of this study have been deposited in the Zenodo repository at https://zenodo.org/records/17458129. Source data for all figures are provided with this paper.

Code availability

The Python code used to generate all results is publicly available at https://github.com/EvoCatalysis/BGC_annotation.

Reference

  • (1).Newman DJ; Cragg GM Natural Products as Sources of New Drugs from 1981 to 2014. J. Nat. Prod 2016, 79 (3), 629–661. 10.1021/acs.jnatprod.5b01055. [DOI] [PubMed] [Google Scholar]
  • (2).Dobson PD; Patel Y; Kell DB ‘Metabolite-Likeness’ as a Criterion in the Design and Selection of Pharmaceutical Drug Libraries. Drug Discovery Today 2009, 14 (1), 31–40. 10.1016/j.drudis.2008.10.011. [DOI] [PubMed] [Google Scholar]
  • (3).Newman DJ; Cragg GM Natural Products as Sources of New Drugs over the Nearly Four Decades from 01/1981 to 09/2019. J. Nat. Prod 2020, 83 (3), 770–803. 10.1021/acs.jnatprod.9b01285. [DOI] [PubMed] [Google Scholar]
  • (4).Martin JF Clusters of Genes for the Biosynthesis of Antibiotics: Regulatory Genes and Overproduction of Pharmaceuticals. Journal of Industrial Microbiology 1992, 9 (2), 73–90. 10.1007/BF01569737. [DOI] [PubMed] [Google Scholar]
  • (5).Martin JF; Liras P Organization and Expression of Genes Involved in the Biosynthesis of Antibiotics and Other Secondary Metabolites. Annual Review of Microbiology 1989, 43, 173–206. 10.1146/annurev.mi.43.100189.001133. [DOI] [Google Scholar]
  • (6).Udwary DW; Doering DT; Foster B; Smirnova T; Kautsar SA; Mouncey NJ The Secondary Metabolism Collaboratory: A Database and Web Discussion Portal for Secondary Metabolite Biosynthetic Gene Clusters. Nucleic Acids Res 2025, 53 (D1), D717–D723. 10.1093/nar/gkae1060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (7).Atanasov AG; Zotchev SB; Dirsch VM; Supuran CT Natural Products in Drug Discovery: Advances and Opportunities. Nat Rev Drug Discov 2021, 20 (3), 200–216. 10.1038/s41573-020-00114-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (8).Medema MH; Fischbach MA Computational Approaches to Natural Product Discovery. Nat Chem Biol 2015, 11 (9), 639–648. 10.1038/nchembio.1884. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (9).Zdouc MM; Blin K; Louwen NLL; Navarro J; Loureiro C; Bader CD; Bailey CB; Barra L; Booth TJ; Bozhüyük KAJ; Cediel-Becerra JDD; Charlop-Powers Z; Chevrette MG; Chooi YH; D’Agostino PM; de Rond T; Del Pup E; Duncan KR; Gu W; Hanif N; Helfrich EJN; Jenner M; Katsuyama Y; Korenskaia A; Krug D; Libis V; Lund GA; Mantri S; Morgan KD; Owen C; Phan C-S; Philmus B; Reitz ZL; Robinson SL; Singh KS; Teufel R; Tong Y; Tugizimana F; Ulanova D; Winter JM; Aguilar C; Akiyama DY; Al-Salihi SAA; Alanjary M; Alberti F; Aleti G; Alharthi SA; Rojo MYA; Arishi AA; Augustijn HE; Avalon NE; Avelar-Rivas JA; Axt KK; Barbieri HB; Barbosa JCJ; Barboza Segato LG; Barrett SE; Baunach M; Beemelmanns C; Beqaj D; Berger T; Bernaldo-Agüero J; Bettenbühl SM; Bielinski VA; Biermann F; Borges RM; Borriss R; Breitenbach M; Bretscher KM; Brigham MW; Buedenbender L; Bulcock BW; Cano-Prieto C; Capela J; Carrion VJ; Carter RS; Castelo-Branco R; Castro-Falcón G; Chagas FO; Charria-Girón E; Chaudhri AA; Chaudhry V; Choi H; Choi Y; Choupannejad R; Chromy J; Donahey MSC; Collemare J; Connolly JA; Creamer KE; Crüsemann M; Cruz AA; Cumsille A; Dallery J-F; Damas-Ramos LC; Damiani T; de Kruijff M; Martín BD; Sala GD; Dillen J; Doering DT; Dommaraju SR; Durusu S; Egbert S; Ellerhorst M; Faussurier B; Fetter A; Feuermann M; Fewer DP; Foldi J; Frediansyah A; Garza EA; Gavriilidou A; Gentile A; Gerke J; Gerstmans H; Gomez-Escribano JP; González-Salazar LA; Grayson NE; Greco C; Gomez JEG; Guerra S; Flores SG; Gurevich A; Gutiérrez-García K; Hart L; Haslinger K; He B; Hebra T; Hemmann JL; Hindra H; Höing L; Holland DC; Holme JE; Horch T; Hrab P; Hu J; Huynh T-H; Hwang J-Y; Iacovelli R; Iftime D; Iorio M; Jayachandran S; Jeong E; Jing J; Jung JJ; Kakumu Y; Kalkreuter E; Kang KB; Kang S; Kim W; Kim GJ; Kim H; Kim HU; Klapper M; Koetsier RA; Kollten C; Kovács ÁT; Kriukova Y; Kubach N; Kunjapur AM; Kushnareva AK; Kust A; Lamber J; Larralde M; Larsen NJ; Launay AP; Le N-T-H; Lebeer S; Lee BT; Lee K; Lev KL; Li 
S-M; Li Y-X; Licona-Cassani C; Lien A; Liu J; Lopez JAV; Machushynets NV; Macias MI; Mahmud T; Maleckis M; Martinez-Martinez AM; Mast Y; Maximo MF; McBride CM; McLellan RM; Bhatt KM; Melkonian C; Merrild A; Metsä-Ketelä M; Mitchell DA; Müller AV; Nguyen G-S; Nguyen HT; Niedermeyer THJ; O’Hare JH; Ossowicki A; Ostash BO; Otani H; Padva L; Paliyal S; Pan X; Panghal M; Parade DS; Park J; Parra J; Rubio MP; Pham HT; Pidot SJ; Piel J; Pourmohsenin B; Rakhmanov M; Ramesh S; Rasmussen MH; Rego A; Reher R; Rice AJ; Rigolet A; Romero-Otero A; Rosas-Becerra LR; Rosiles PY; Rutz A; Ryu B; Sahadeo L-A; Saldanha M; Salvi L; Sánchez-Carvajal E; Santos-Medellin C; Sbaraini N; Schoellhorn SM; Schumm C; Sehnal L; Selem N; Shah AD; Shishido TK; Sieber S; Silviani V; Singh G; Singh H; Sokolova N; Sonnenschein EC; Sosio M; Sowa ST; Steffen K; Stegmann E; Streiff AB; Strüder A; Surup F; Svenningsen T; Sweeney D; Szenei J; Tagirdzhanov A; Tan B; Tarnowski MJ; Terlouw BR; Rey T; Thome NU; Torres Ortega LR; Tørring T; Trindade M; Truman AW; Tvilum M; Udwary DW; Ulbricht C; Vader L; van Wezel GP; Walmsley M; Warnasinghe R; Weddeling HG; Weir ANM; Williams K; Williams SE; Witte TE; Rocca SMW; Yamada K; Yang D; Yang D; Yu J; Zhou Z; Ziemert N; Zimmer L; Zimmermann A; Zimmermann C; van der Hooft JJJ; Linington RG; Weber T; Medema MH MIBiG 4.0: Advancing Biosynthetic Gene Cluster Curation through Global Collaboration. Nucleic Acids Research 2024, gkae1115. 10.1093/nar/gkae1115. [DOI] [Google Scholar]
  • (10).Blin K; Shaw S; Augustijn HE; Reitz ZL; Biermann F; Alanjary M; Fetter A; Terlouw BR; Metcalf WW; Helfrich EJN; van Wezel GP; Medema MH; Weber T antiSMASH 7.0: New and Improved Predictions for Detection, Regulation, Chemical Structures and Visualisation. Nucleic Acids Research 2023, 51 (W1), W46–W50. 10.1093/nar/gkad344. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (11).Hannigan GD; Prihoda D; Palicka A; Soukup J; Klempir O; Rampula L; Durcak J; Wurst M; Kotowski J; Chang D; Wang R; Piizzi G; Temesi G; Hazuda DJ; Woelk CH; Bitton DA A Deep Learning Genome-Mining Strategy for Biosynthetic Gene Cluster Prediction. Nucleic Acids Research 2019, 47 (18), e110. 10.1093/nar/gkz654. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (12).Rios-Martinez C; Bhattacharya N; Amini AP; Crawford L; Yang KK Deep Self-Supervised Learning for Biosynthetic Gene Cluster Detection and Product Classification. PLoS Comput Biol 2023, 19 (5), e1011162. 10.1371/journal.pcbi.1011162. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (13).Mullowney MW; Duncan KR; Elsayed SS; Garg N; van der Hooft JJJ; Martin NI; Meijer D; Terlouw BR; Biermann F; Blin K; Durairaj J; Gorostiola González M; Helfrich EJN; Huber F; Leopold-Messer S; Rajan K; de Rond T; van Santen JA; Sorokina M; Balunas MJ; Beniddir MA; van Bergeijk DA; Carroll LM; Clark CM; Clevert D-A; Dejong CA; Du C; Ferrinho S; Grisoni F; Hofstetter A; Jespers W; Kalinina OV; Kautsar SA; Kim H; Leao TF; Masschelein J; Rees ER; Reher R; Reker D; Schwaller P; Segler M; Skinnider MA; Walker AS; Willighagen EL; Zdrazil B; Ziemert N; Goss RJM; Guyomard P; Volkamer A; Gerwick WH; Kim HU; Müller R; van Wezel GP; van Westen GJP; Hirsch AKH; Linington RG; Robinson SL; Medema MH Artificial Intelligence for Natural Product Drug Discovery. Nat Rev Drug Discov 2023, 22 (11), 895–916. 10.1038/s41573-023-00774-7. [DOI] [PubMed] [Google Scholar]
  • (14).Helfrich EJN; Ueoka R; Dolev A; Rust M; Meoded RA; Bhushan A; Califano G; Costa R; Gugger M; Steinbeck C; Moreno P; Piel J Automated Structure Prediction of Trans-Acyltransferase Polyketide Synthase Products. Nat Chem Biol 2019, 15 (8), 813–821. 10.1038/s41589-019-0313-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (15). Yan D; Zhou M; Adduri A; Zhuang Y; Guler M; Liu S; Shin H; Kovach T; Oh G; Liu X; Deng Y; Wang X; Cao L; Sherman DH; Schultz PJ; Kersten RD; Clement JA; Tripathi A; Behsaz B; Mohimani H Discovering Type I Cis-AT Polyketides through Computational Mass Spectrometry and Genome Mining with Seq2PKS. Nat Commun 2024, 15 (1), 5356. 10.1038/s41467-024-49587-1.
  • (16). Merwin NJ; Mousa WK; Dejong CA; Skinnider MA; Cannon MJ; Li H; Dial K; Gunabalasingam M; Johnston C; Magarvey NA DeepRiPP Integrates Multiomics Data to Automate Discovery of Novel Ribosomally Synthesized Natural Products. Proceedings of the National Academy of Sciences 2020, 117 (1), 371–380. 10.1073/pnas.1901493116.
  • (17). Skinnider MA; Johnston CW; Gunabalasingam M; Merwin NJ; Kieliszek AM; MacLellan RJ; Li H; Ranieri MRM; Webster ALH; Cao MPT; Pfeifle A; Spencer N; To QH; Wallace DP; Dejong CA; Magarvey NA Comprehensive Prediction of Secondary Metabolite Structure and Biological Activity from Microbial Genome Sequences. Nat Commun 2020, 11 (1), 6058. 10.1038/s41467-020-19986-1.
  • (18). Lin Z; Akin H; Rao R; Hie B; Zhu Z; Lu W; Smetanin N; Verkuil R; Kabeli O; Shmueli Y; dos Santos Costa A; Fazel-Zarandi M; Sercu T; Candido S; Rives A Evolutionary-Scale Prediction of Atomic-Level Protein Structure with a Language Model. Science 2023, 379 (6637), 1123–1130. 10.1126/science.ade2574.
  • (19). Bahdanau D; Cho K; Bengio Y Neural Machine Translation by Jointly Learning to Align and Translate. arXiv May 19, 2016. 10.48550/arXiv.1409.0473.
  • (20). Su J; Lu Y; Pan S; Murtadha A; Wen B; Liu Y RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv November 8, 2023. 10.48550/arXiv.2104.09864.
  • (21). Robey MT; Caesar LK; Drott MT; Keller NP; Kelleher NL An Interpreted Atlas of Biosynthetic Gene Clusters from 1,000 Fungal Genomes. Proceedings of the National Academy of Sciences 2021, 118 (19), e2020230118. 10.1073/pnas.2020230118.
  • (22). Fischbach MA; Walsh CT Assembly-Line Enzymology for Polyketide and Nonribosomal Peptide Antibiotics: Logic, Machinery, and Mechanisms. Chem. Rev 2006, 106 (8), 3468–3496. 10.1021/cr0503097.
  • (23). Nivina A; Yuet KP; Hsu J; Khosla C Evolution and Diversity of Assembly-Line Polyketide Synthases. Chem. Rev 2019, 119 (24), 12524–12547. 10.1021/acs.chemrev.9b00525.
  • (24). Kakumu Y; Chaudhri AA; Helfrich EJN The Role and Mechanisms of Canonical and Non-Canonical Tailoring Enzymes in Bacterial Terpenoid Biosynthesis. Nat. Prod. Rep 2025. 10.1039/D4NP00048J.
  • (25). Asolkar RN; Kirkland TN; Jensen PR; Fenical W Arenimycin, an Antibiotic Effective against Rifampin- and Methicillin-Resistant Staphylococcus Aureus from the Marine Actinomycete Salinispora Arenicola. J Antibiot 2010, 63 (1), 37–39. 10.1038/ja.2009.114.
  • (26). Waterworth SC The Role of Nucleotide Sequencing in Natural Product Drug Discovery. J. Nat. Prod 2025. 10.1021/acs.jnatprod.5c00876.
  • (27). Tran PN; Yen M-R; Chiang C-Y; Lin H-C; Chen P-Y Detecting and Prioritizing Biosynthetic Gene Clusters for Bioactive Compounds in Bacteria and Fungi. Appl Microbiol Biotechnol 2019, 103 (8), 3277–3287. 10.1007/s00253-019-09708-z.
  • (28). Ellis EK; Ióca LP; Liu J; Chen M; Bruner SD; Ding Y; Paul VJ; Donia MS; Luesch H Structure Determination and Biosynthesis of Dapalides A–C, Glycosylated Kahalalide F Analogues from the Marine Cyanobacterium Dapis Sp. J. Nat. Prod 2025, 88 (9), 2138–2150. 10.1021/acs.jnatprod.5c00757.
  • (29). Klein D; Braekman J-C; Daloze D; Hoffmann L; Castillo G; Demoulin V Lyngbyapeptin A, a Modified Tetrapeptide from Lyngbya Bouillonii (Cyanophyceae). Tetrahedron Letters 1999, 40 (4), 695–696. 10.1016/S0040-4039(98)02451-4.
  • (30). Moosmann P; Ecker F; Leopold-Messer S; Cahn JKB; Dieterich CL; Groll M; Piel J A Monodomain Class II Terpene Cyclase Assembles Complex Isoprenoid Scaffolds. Nat. Chem 2020, 12 (10), 968–972. 10.1038/s41557-020-0515-3.
  • (31). van der Maaten L; Hinton G Visualizing Data Using T-SNE. Journal of Machine Learning Research 2008, 9 (86), 2579–2605.
  • (32). Rogers D; Hahn M Extended-Connectivity Fingerprints. J. Chem. Inf. Model 2010, 50 (5), 742–754. 10.1021/ci100050t.
  • (33). Li Y; Dodge GJ; Fiers WD; Fecik RA; Smith JL; Aldrich CC Functional Characterization of a Dehydratase Domain from the Pikromycin Polyketide Synthase. J. Am. Chem. Soc 2015, 137 (22), 7003–7006. 10.1021/jacs.5b02325.
  • (34). Skiba MA; Bivins MM; Schultz JR; Bernard SM; Fiers WD; Dan Q; Kulkarni S; Wipf P; Gerwick WH; Sherman DH; Aldrich CC; Smith JL Structural Basis of Polyketide Synthase O-Methylation. ACS Chemical Biology 2018. 10.1021/acschembio.8b00687.
  • (35). Yin Z; Dickschat JS Cis Double Bond Formation in Polyketide Biosynthesis. Nat. Prod. Rep 2021, 38 (8), 1445–1468. 10.1039/D0NP00091D.
  • (36). Miyanaga A; Janso JE; McDonald L; He M; Liu H; Barbieri L; Eustáquio AS; Fielding EN; Carter GT; Jensen PR; Feng X; Leighton M; Koehn FE; Moore BS Discovery and Assembly-Line Biosynthesis of the Lymphostin Pyrroloquinoline Alkaloid Family of mTOR Inhibitors in Salinispora Bacteria. J. Am. Chem. Soc 2011, 133 (34), 13311–13313. 10.1021/ja205655w.
  • (37). Gaudelli NM; Townsend CA Epimerization and Substrate Gating by a TE Domain in β-Lactam Antibiotic Biosynthesis. Nat Chem Biol 2014, 10 (4), 251–258. 10.1038/nchembio.1456.
  • (38). Li R; Oliver RA; Townsend CA Identification and Characterization of the Sulfazecin Monobactam Biosynthetic Gene Cluster. Cell Chemical Biology 2017, 24 (1), 24–34. 10.1016/j.chembiol.2016.11.010.
  • (39). Schwaller P; Laino T; Gaudin T; Bolgar P; Hunter CA; Bekas C; Lee AA Molecular Transformer: A Model for Uncertainty-Calibrated Chemical Reaction Prediction. ACS Cent. Sci 2019, 5 (9), 1572–1583. 10.1021/acscentsci.9b00576.
  • (40). Sorokina M; Merseburger P; Rajan K; Yirik MA; Steinbeck C COCONUT Online: Collection of Open Natural Products Database. Journal of Cheminformatics 2021, 13 (1), 2. 10.1186/s13321-020-00478-9.
  • (41). Landrum G RDKit: Open-Source Cheminformatics. 2006.
  • (42). Paszke A; Gross S; Massa F; Lerer A; Bradbury J; Chanan G; Killeen T; Lin Z; Gimelshein N; Antiga L; Desmaison A; Köpf A; Yang E; DeVito Z; Raison M; Tejani A; Chilamkurthy S; Steiner B; Fang L; Bai J; Chintala S PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv December 3, 2019. 10.48550/arXiv.1912.01703.
  • (43). Akiba T; Sano S; Yanase T; Ohta T; Koyama M Optuna: A Next-Generation Hyperparameter Optimization Framework. arXiv July 25, 2019. 10.48550/arXiv.1907.10902.
  • (44). Blin K; Shaw S; Medema MH; Weber T The antiSMASH Database Version 4: Additional Genomes and BGCs, New Sequence-Based Searches and More.
  • (45). Larralde M; Zeller G PyHMMER: A Python Library Binding to HMMER for Efficient Sequence Analysis. Bioinformatics 2023, 39 (5), btad214. 10.1093/bioinformatics/btad214.
  • (46). Mistry J; Chuguransky S; Williams L; Qureshi M; Salazar GA; Sonnhammer ELL; Tosatto SCE; Paladin L; Raj S; Richardson LJ; Finn RD; Bateman A Pfam: The Protein Families Database in 2021. Nucleic Acids Research 2021, 49 (D1), D412–D419. 10.1093/nar/gkaa913.

Associated Data
Supplementary Materials

SI

Data Availability Statement

All data generated in this study and all processed data used to produce its results have been deposited in the Zenodo repository at https://zenodo.org/records/17458129. Source data for all figures are provided with this paper.

The Python code used to generate all results is publicly available at https://github.com/EvoCatalysis/BGC_annotation.
