Skip to main content
The Journal of Biological Chemistry logoLink to The Journal of Biological Chemistry
. 2022 Sep 12;298(10):102480. doi: 10.1016/j.jbc.2022.102480

The Natural Product Domain Seeker version 2 (NaPDoS2) webtool relates ketosynthase phylogeny to biosynthetic function

Leesa J Klau 1,2,, Sheila Podell 1,, Kaitlin E Creamer 1,, Alyssa M Demko 1, Hans W Singh 1, Eric E Allen 1,3, Bradley S Moore 1,4, Nadine Ziemert 1, Anne Catrin Letzel 1, Paul R Jensen 1,
PMCID: PMC9582728  PMID: 36108739

Abstract

The Natural Product Domain Seeker (NaPDoS) webtool detects and classifies ketosynthase (KS) and condensation domains from genomic, metagenomic, and amplicon sequence data. Unlike other tools, a phylogeny-based classification scheme is used to make broader predictions about the polyketide synthase (PKS) and nonribosomal peptide synthetase (NRPS) genes in which these domains are found. NaPDoS is particularly useful for the analysis of incomplete biosynthetic genes or gene clusters, as are often observed in poorly assembled genomes and metagenomes, or when loci are not clustered, as in eukaryotic genomes. To help support the growing interest in sequence-based analyses of natural product biosynthetic diversity, here we introduce version 2 of the webtool, NaPDoS2, available at http://napdos.ucsd.edu/napdos2. This update includes the addition of 1417 KS sequences, representing a major expansion of the taxonomic and functional diversity represented in the webtool database. The phylogeny-based KS classification scheme now recognizes 41 class and subclass assignments, including new type II PKS subclasses. Workflow modifications accelerate run times, allowing larger datasets to be analyzed. In addition, default parameters were established using statistical validation tests to maximize KS detection and classification accuracy while minimizing false positives. We further demonstrate the applications of NaPDoS2 to assess PKS biosynthetic potential using genomic, metagenomic, and PCR amplicon datasets. These examples illustrate how NaPDoS2 can be used to predict biosynthetic potential and detect genes involved in the biosynthesis of specific structure classes or new biosynthetic mechanisms.

Abbreviations: ACP, acyl carrier protein; BGC, biosynthetic gene cluster; C, condensation; ECH, enoyl-CoA hydratase; ER, enoyl reductase; FAS, fatty acid synthase; HR, highly reducing; KS, ketosynthase; NaPDoS, Natural Product Domain Seeker; NRPS, nonribosomal peptide synthetase; NR, nonreducing; PKS, polyketide synthase; PUFA, polyunsaturated fatty acid; PTM, polycyclic tetramate macrolactam; trans-AT, trans-acyl transferase


Increased access to DNA sequence data coupled with a better understanding of the molecular genetics of natural product biosynthesis are driving major advances in natural products research. These advances have been facilitated by webtools such as antiSMASH 6.0 (1) and PRISM 4 (2) that identify natural product biosynthetic gene clusters (BGCs) from assembled sequence data and provide insight into the types of small molecules produced (3). These tools have proven instrumental for genome mining research (4), while others have been developed to address more specific topics such as resistance-guided antibiotic discovery (5), biosynthetic gene biogeography (6), and trans-acyl transferase (trans-AT) substrate specificity (7). The Natural Product Domain Seeker (NaPDoS) is a specialized webtool used to assess biosynthetic diversity based on short sequence tags and thus does not require BGC assembly (8). It targets ketosynthase (KS) and condensation (C) domains to make broader predictions about the polyketide synthase (PKS) and non-ribosomal peptide synthetase (NRPS) genes, respectively, in which they reside. NaPDoS detects these domains and classifies them using a phylogeny-based scheme that reflects well-supported biosynthetic knowledge and established PKS and NRPS function.

Here, we introduce version 2 of the NaPDoS webtool (NaPDoS2), which includes an updated KS database and classification scheme that better reflects the expanded taxonomic distributions and functional diversity of PKSs. These enzymes produce structurally diverse polyketides ranging from lipids to macrolides, which represent an important source of compounds for pharmaceutical and other biotechnological applications (9). Polyketides also serve important ecological functions ranging from antioxidants (10) to allelochemicals (11) and thus can provide insight into how organisms interact with each other and the environment. PKSs share much in common with fatty acid synthases (FASs), generating compounds via the successive decarboxylative condensation and processing of acyl-CoA precursors (12). PKSs have been broadly divided into three types based on their organization and function (13) of which NaPDoS2 detects and classifies KSs associated with types I and II. While canonical type I PKSs have a modular organization and function in an assembly line fashion, some function iteratively while others (e.g., trans-AT) lack a cognate acyl-transferase domain (9, 14). Similarly, canonical type II PKSs function iteratively and were originally best known to produce aromatic polyketides. Yet some type II PKSs function noniteratively (13), while others produce linear specialized metabolites (15). A central feature of PKSs is the KS domain, which in most cases catalyzes a Claisen condensation between the extender unit and the growing, thioester-linked polyketide chain. In recent years, our knowledge of KS functional diversity has expanded significantly to reveal new enzymology and diverse product outcomes across biology. The broad distributions and functional specificities among type I and II PKSs can, in many cases, be resolved phylogenetically (16, 17, 18), with these evolutionary relationships forming the basis of the NaPDoS2 classification scheme.

The NaPDoS2 website, available at http://napdos.ucsd.edu/napdos2/, includes many updated features that improve the usability of the tool for natural product discovery. Here, we report database and pipeline modifications that provide broader taxonomic coverage, better resolution among functionally characterized PKSs, a new subclassification scheme for type II PKSs, an increased capacity for processing large datasets, and the enhanced detection of eukaryotic KSs. Statistical validation tests have been used to select parameters for optimizing sensitivity and specificity, including both detection and classification accuracy based on query sequence length. The upgraded webtool was used to demonstrate the utility of NaPDoS2 for predicting biosynthetic potential in genomic, metagenomic, and amplicon datasets.

Results and discussion

Pipeline efficiency and interface upgrades

As in the original release (8), NaPDoS2 detects and classifies KS and C domains from nucleotide or amino acid sequence data. While both versions follow the same general workflow, substitution of the more computationally efficient program DIAMOND (19) for NCBI BLAST (20), eliminates the need for an extra Hidden Markov Model prefiltering step (Fig. S1). Speed improvements using DIAMOND are relatively modest for small jobs but can reach orders of magnitude for large datasets, especially those consisting of short query sequences (19). The NaPDoS2 pipeline executes most rapidly on amino acid sequences, which do not need to be translated. Although processing times increase with total nucleotide sequence length and the number of database matches, results for microbial genomes, assembled metagenomes, and PCR KS amplicons containing thousands of hits can typically be obtained within seconds to minutes (Table S1). These improvements enable users to perform large-scale analyses that were not feasible with the original NaPDoS release.

User interface upgrades include the addition of a “Domain Classification Summary” page that lists the total number of domains detected in the query data as grouped by their NaPDoS2 classification (Fig. S2A). This page provides the option to select classes of specific interest for more detailed investigation (Fig. S2B), which is particularly useful when large numbers of domains are detected. Each BGC represented in the database is linked to a summary page that includes a representative structure and the classification of each KS or C domain within that BGC (Fig. S2C). The independent classification of each domain is particularly useful given that a single gene or BGC can contain multiple domain types (21). New webtool features also include quick start instructions outlining the NaPDoS2 workflow (Fig. S3A) and a downloadable documentation and tutorial file.

Database expansion

The primary goals for NaPDoS2 were to expand the KS database and classification scheme to include biosynthetic functions that were not represented in the original release, to supplement poorly populated classes, and to provide greater taxonomic coverage. One thousand four hundred seventeen new KS sequences were added, raising the database total to 1877 (average length 418 ± 49 amino acids), an increase of 308% (Fig. S4). Most of the additional sequences were derived from the MIBiG repository of experimentally verified BGCs (22), including new type II PKSs, type I fungal PKSs, and type I FASs from both bacterial and fungal sources. A few uncharacterized protist and metazoan sequences were included due to the scarcity of experimental verification among these groups. Although 84 C domain sequences were added, their classification scheme has not been updated and remains an important goal for future releases. The taxonomic breakdown of the current NaPDoS2 KS database sequences is 93.8% bacteria, 4.7% fungi, and 1.5% other eukaryotes (Table S2), reflecting the taxonomic skew of available experimental data. To improve result interpretation, database (match) IDs now display the BGC name, domain number, class, and subclass identifiers for each domain.

Phylogeny-based KS classification

The ability to predict PKS and NRPS biosynthetic potential based on KS or C domains distinguishes NaPDoS2 from other bioinformatic tools. The domain classification scheme is derived from KS sequence phylogenies and their relationships to established biochemical functions, gene architectures, and structural features of the compounds produced. While not all classes and subclasses are monophyletic, functionally coherent clades form the basis of the classification scheme. Phylogenies generated from the updated KS database allowed us to establish 41 class and subclass assignments associated with type I and II PKSs and FASs (Table S2 and Fig. S5). This represents an increase of 410% over the original release. To verify the sufficiency of using KS domains to indicate broader PKS context, we assessed the genomic neighborhoods of KSs detected in a variety of genomic and metagenomic datasets. In all cases, the NaPDoS2 KS classification agreed with the antiSMASH 6.0 (1) classification (Fig. S6). In some cases, NaPDoS2 provided more detailed information including the identification of type II subclasses, omega-3 polyunsaturated fatty acid (PUFA) KSs, and highly, partially, and nonreducing fungal KSs. NaPDoS2 can also distinguish among different KS classes within a single gene or BGC (Fig. S6).

With the classification scheme established, a simplified reference tree was generated using 414 sequences representing all class and subclass assignments (Fig. 1). Broadly, this reference phylogeny delineates the well-established relationships between FASs and type I and II PKSs (17). The enhanced classification of type I FAS KSs from bacteria/fungi, protists, and metazoa allows users to better distinguish between KSs associated with specialized metabolism and fatty acid biosynthesis across diverse taxa. The expanded eukaryotic type I coverage includes KSs from fungal iterative cis-AT PKSs and several protist (Phyla Amoebozoa and Apicomplexa) and metazoan (Phyla Chordata, Echinodermata, and Nematoda) PKSs, including those linked to characterized natural products from Caenorhabditis elegans (23) and Dictyostelium discoideum (24). The seven type I PKS classes described in the original NaPDoS release have been reorganized into three classes (modular cis-AT, iterative cis-AT, and trans-AT) and 14 subclasses in the NaPDoS2 release. Example structures associated with each KS type are shown in Figure 2 with associated metadata (Table S3) and descriptions of each KS type (Fig. S5 and website documentation) provided to help connect KSs with structures and relevant references.

Figure 1.

Figure 1

KS phylogeny-based classification. Maximum likelihood phylogeny generated from 414 KS sequences. Clades are color-coded and labeled according to their NaPDoS2 classification. Transfer bootstrap expectation (TBE) support was estimated using Booster (72). The full name of each sequence can be viewed in Fig. S7, which can be used to link a query match to a specific location in the tree. Thiolases from Escherichia coli, Zoogloea ramigera, and Streptomyces avermitilis were used as outgroups. An expanded phylogeny of the type II KSs is presented in Figure 3. FAS, fatty acid synthase; KS, ketosynthase; NaPDoS2, Natural Product Domain Seeker version 2; PKS, polyketide synthase; PUFA, polyunsaturated fatty acid.

Figure 2.

Figure 2

Example structures for the type I KS classes and subclasses recognized by NaPDoS2. Descriptions of each KS type, associated metadata, and references can be found in Fig. S5 and Table S3. Information on each structure can also be accessed from the BGC tab on the website (note: 1-heptadecene = Moorea producens olefin synthase, 4-Z-annimycin = annimycin, and eicosapentaenoic acid = Schizochytrium polyunsaturated fatty acid on the website). Colors correspond to Figure 1. KS, ketosynthase; NaPDoS2, Natural Product Domain Seeker version 2.

Modular cis-AT KSs

The modular cis-AT class comprises KSs associated with the canonical assembly line PKSs (e.g., erythromycin biosynthesis) and now includes four subclasses of which two [olefin synthase and tandem enoyl-CoA hydratase (ECH)] are new to NaPDoS2 (Fig. 1). The olefin synthase subclass is best known from cyanobacteria and is associated with the formation of a terminal olefin on a fatty acyl precursor (25). These KS sequences form two clades in the reference tree, which reflects the sporadic distribution of the OLS pathway among cyanobacteria (25). The tandem ECH subclass is associated with gene cassettes that introduce a branch at the β-keto position. In these PKSs, the KS domain is located immediately downstream of tandem ECH and enoyl reductase (ER) domains, which catalyze decarboxylation and reduction reactions, respectively, as seen in cylindrocyclophane biosynthesis (26) and reviewed elsewhere in detail (27). Most of the modular cis-AT KSs in the database are not associated with a specialized function and thus are not assigned to a subclass.

Iterative cis-AT KSs (bacteria)

The iterative cis-AT class (iPKSs) maintains a modular organization, with multiple enzymatic domains on a single protein, yet instead of functioning as an assembly line these enzymes catalyze an iterative series of elongation steps (28). The expanded NaPDoS2 classification scheme now includes seven iPKS subclasses. The four observed in bacteria include the aromatic and polycyclic tetramate macrolactam (PTM) subclasses (29), both of which are new to NaPDoS2, and the enediyne and PUFA subclasses, which were identified in the original release. PUFA KSs now form three clades in the reference tree, which correlate with the three KS domains typically present in PUFA PKSs (30). Aromatic iPKSs generally produce simple monocyclic or bicyclic aromatic compounds that are distinct from enediynes and PUFAs (29). The PTM-like iPKSs produce complex compounds containing a macrocyclic lactam with an embedded tetramic acid moiety fused with a polycyclic system (i.e., 2–3 rings) derived from polyene chains (29). The ability to quickly recognize PTM biosynthetic potential provides opportunities to expand on the number of compounds discovered in this unusual and often biologically active class (31). New iterative type I PKSs continue to be discovered, including some that are unexpectedly abundant and widespread in the genus Streptomyces (31), providing opportunities to expand this class in future updates beyond the seven subclasses currently recognized by NaPDoS2.

Iterative cis-AT KSs (fungi)

Iterative cis-AT KSs are also observed in fungi and can be delineated into highly reducing (HR), partially reducing, and NR iPKS subclasses (32). These subclasses differ in the number and type of β-keto processing ketoreductase, dehydratase, and ER domains present. HR iPKS BGCs, which contain all three β-keto processing domains, often contain a C-methylation domain responsible for ɑ-carbon methylation and some are fused with NRPS modules (33). Their products are usually linear or cyclic, nonaromatic compounds. Partially reducing iPKS BGCs lack ER and sometimes also dehydratase domains. Due to their similarity to bacterial aromatic iPKSs, they have been hypothesized to originate from a horizontal gene transfer event between bacteria and fungi (34). NR iPKS BGCs lack all three β-keto processing domains and have specialized domains that facilitate starter unit loading and product folding (33).

Trans-AT KSs

KS domains from trans-AT PKSs form multiple clades in the reference tree (Fig. 1), which may reflect KS substrate-specificity (35). Nonetheless, three functionally distinct subclasses can be recognized by NaPDoS2 of which those associated with β-branching modules and hybrid, nonelongating KSs are new to NaPDoS2. The β-branching module, comprising a KS domain, a cryptic “B” domain, and an acyl carrier protein (ACP) domain, represents one of the mechanisms for the formation of β-branching in trans-AT PKSs (36, 37). Hybrid, nonelongating KSs (KS0) typically follow an NRPS module, lack the catalytic histidine required for elongation, and are proposed to facilitate the transfer of the growing polyketide to the next module (7, 38). The hybrid trans-AT KS subclass clades with the hybrid cis-AT subclass and functions similarly in modules that lack AT domains.

Type II KSs

A major aim for the NaPDoS2 upgrade was to expand the classification of type II KS domains based on established functions and evolutionary relationships (39). A phylogeny of the type II KS domains in the updated database (Fig. 3) provided sufficient resolution to delineate five functionally defined classes and four sequences that have yet to be assigned a functional class (annotated as type II unclassified). This represents a major improvement over the original NaPDoS release, which simply identified a sequence as being associated with a type II PKS. The most highly populated of the five type II KS classes recognized by NaPDoS2 are associated with PKSs that function iteratively to produce reactive poly-β-ketoacyl intermediates that cyclize to polycyclic aromatic compounds. The type II aromatic class could be further delineated into nine functionally defined subclasses based on a concatenated alignment of the two KS subunits, which generated a better resolved phylogeny (Fig. 4). Example structures (Fig. 5) along with the associated metadata (Table S3) help connect each KS type with structures and relevant references.

Figure 3.

Figure 3

Type II KS phylogeny-based classification. Maximum likelihood phylogeny of 212 KS sequences. Clades are color-coded and labeled according to their NaPDoS2 classification. TBE bootstrap support was estimated using Booster (72). See Fig. S8 for sequence annotations. Thiolases from E. coli, Z. ramigera, and S. avermitilis were used as outgroups. Type II KS sequences that lack further annotation have yet to be assigned a functional class. KS, ketosynthase; NaPDoS2, Natural Product Domain Seeker version 2; TBE, transfer bootstrap expectation.

Figure 4.

Figure 4

Type II aromatic KS phylogeny-based classification. Maximum likelihood phylogeny of concatenated KSα and β subunits from 59 type II BGCs. Clades are annotated based on NaPDoS2 classification. Structural motifs associated with subclasses shown in colored rings (white, not determined). See Fig. S9 for sequence annotations and Table S3 for biosynthetic references. TBE bootstrap support was estimated using Booster (72). BGCs, biosynthetic gene clusters; KS, ketosynthase; NaPDoS2, Natural Product Domain Seeker version 2; TBE, transfer bootstrap expectation.

Figure 5.

Figure 5

Example structures for the type II KS classes and subclasses recognized by NaPDoS2. Descriptions of each KS type, associated metadata, and references can be found in Fig. S5 and Table S3. Information on each structure can also be accessed from the BGC tab on the website (note: bromoalterochromide A = alterochromide on the website). Colors correspond to Figure 3. BGC, biosynthetic gene cluster; KS, ketosynthase; NaPDoS2, Natural Product Domain Seeker version 2.

Type II KSs function as a heterodimer, with the KSα subunit responsible for the iterative elongation of a nascent poly-β-keto chain, the KSβ subunit determining the chain length (also referred to as the chain-length factor), and both subunits mediating the first ring cyclization reaction (38, 40). The KSα and KSβ subunits are implemented as individual sequences in the NaPDoS2 workflow thus allowing them to be distinguished. To define these subclasses, the tree topology was compared with poly-β-keto chain length, carbon-carbon position of the first ring cyclization, type of starter unit, and the core and intermediate structures of the products (Fig. 4). These subclasses are consistent with previous phylogenies (16, 41) and chemotypes identified in PKMiner (42) and antiSMASH version 5.0 (43) based on full-length BGC analyses. KSs from several unusual type II aromatic PKSs that are poorly represented in the database (e.g., resistomycin biosynthesis) were not assigned subclasses and are instead annotated as “type II aromatic unclassified” in NaPDoS2.

The four remaining type II KS classes are more unusual and include one that is responsible for introducing β-branching in trans-AT PKSs. These KSs are associated with HMGS or HCS (3-hydroxy-3-methylglutaryl coenzyme A synthase) gene cassettes that encode several monodomain proteins including a decarboxylating KS that can be resolved in the type II KS phylogeny (Fig. 3) (27). Another unusual type II KS class recognized in the phylogeny functions noniteratively. The KSs associated with these enzymes facilitate the incorporation of acetate, propionate, and succinate building blocks leading to the production of macrocyclic compounds such as the nonactin antibiotics, in which case they catalyze C-O bond formation (44). Noniterative type II PKSs contain multiple type II and III KSs in which each domain is responsible for a single condensation step (45, 46). The remaining type II classes represent KSs associated with the production of highly reduced polyenes or aryl polyenes (Fig. 3), with the latter representing the most prominent biosynthetic family observed across a wide taxonomic distribution of bacterial genomes (47). Similar to type II aromatic PKSs, polyene and aryl polyene PKS also contain KSα and KSβ subunits with biosynthesis proceeding through a series of alternating elongation and reduction steps (15, 48, 49). Aryl polyene PKSs also possess homodimeric KSs that function to complement the heterodimer (15, 49). These are most similar to KSs observed in type II FASs (not shown in the tree) and are classified as such in the NaPDoS2 database.

Performance evaluation

Recognizing that computational prediction algorithms are seldom perfect, leave-one-out cross validation (50) and receiver operating characteristic (ROC) curves (51) are powerful statistical tools for quantifying sensitivity and specificity but require calibration using known positive and negative control datasets that maximize discriminatory power. A set of positive controls (213 sequences) for evaluating NaPDoS2 performance was obtained by clustering all full-length, experimentally verified KS domains in the NaPDoS2 database at 50% amino acid identity. These sequences belong to five conserved domain families within the condensing enzyme superfamily of the NCBI Conserved Domain Database (Fig. S10, A and B), which encompasses enzymes that catalyze decarboxylating or nondecarboxylating Claisen-like condensation reactions for the synthesis and degradation of fatty acids and polyketides from all kingdoms of life (23% eukaryota, 70% bacteria, and 7% archaea). Negative controls (308 sequences) were selected from condensing enzyme families falling immediately outside the NaPDoS2 clades and similarly clustered at 50% amino acid identity (Fig. S10A). These negative controls, which include beta-ketoacyl-ACP synthases, ketoacyl-ACP synthases III, type III chalcone and stilbene synthases, thiolases, and sterol carrier protein-associated thiolases, were augmented with additional sequences (52) (Fig. S10C) and KSs from type III PKSs retrieved from MIBiG 2.0 (Fig. S10D).

Leave-one-out cross-validation scores were generated for positive and negative control sequences based on the e-values of their closest nonself match using both the original NaPDoS and NaPDoS2 KS reference databases. ROC curves were calculated from these scores to generate area under the curve (AUC) values (Fig. 6A), demonstrating the superior performance of NaPDoS2 (AUC = 0.987) versus the original release (AUC = 0.978). ROC curves were further used to establish e-value cutoff points that maximize sensitivity and minimize false positives, identifying optimum values as 1e-8 for NaPDoS2 and 4e-11 for the original NaPDoS release. Although these cutoff values provided equivalent sensitivity (detection of true positives) for their respective algorithms, the highest achievable specificity obtainable for NaPDoS was 93.0% (7% false positive rate) versus 97.2% for NaPDoS2 (2.8% false positive rate). This improvement is most likely explained by the expanded database underlying the NaPDoS2 pipeline. Consequences of these statistically identified differences in real-world use cases are presented in the “NaPDoS2 applications” section below.

Figure 6.

Figure 6

Receiver operating characteristic curves for NaPDoS and NaPDoS2. Nonredundant sets of 213 positive and 308 negative control KS domains were compared based on the BLASTP e-values of their closest non-self-match in each NaPDoS database. Optimal cutoff values were selected to maximize sensitivity. FP, false positives; KS, ketosynthase; NaPDoS2, Natural Product Domain Seeker version 2; TP, true positives.

The effects of partial KS sequences on detection and classification accuracy were evaluated using both full-length database sequences (typically 425 amino acids) and shorter overlapping subsets of 30, 50, 100, and 200 amino acids (aa) covering the entire length of these sequences. These subsets were designed to mimic domain fragments encountered in draft genomes, metagenomic assemblies, or KS amplicon sequences. Using the previously established 1e-8 cutoff for maximum sensitivity, NaPDoS2 detected 99% of the 200aa length subsequences as KS domains, of which 85% were correctly classified (Fig. 7). Detection and classification accuracy declined with shorter sequences, falling dramatically for sequences <50 amino acids. Length-dependent performance degradation was also observed using 1e-5 and 1e-10 e-values as cutoff scores (Fig. S11). These results illustrate the difficulty of analyzing unassembled, next-generation sequencing reads and short contigs covering partial domain fragments.

Figure 7.

Figure 7

Effect of query size on KS detection and classification accuracy. Classifications were based on a 1e-8 BLASTP e-value cutoff score for the closest non-self-database match. Test sequences of varying lengths were obtained as overlapping sliding window subsequences covering the full length of 213 nonredundant, positive control KS domains. KS, ketosynthase.

Based on performance evaluation results, default NaPDoS2 parameters were set at an e-value of 1e-8 and a minimum alignment length of 200 amino acids; however, users may choose to adjust these settings for their individual datasets. Sensitivity declines for partial KS domains can be partially offset by decreasing BLAST stringency, at the cost of increasing false positives and misclassifications. PCR-generated amplicons can provide better sensitivity than similarly sized random sequence fragments but may still be too short to obtain accurate classifications. Confident assignments to poorly populated classes or subclasses are particularly challenging given the sequence diversity observed within highly populated classes and subclasses (Fig. S12). Some ambiguities may be resolved by using NaPDoS2 sequence alignments to generate detailed phylogenetic trees, but others will remain until additional functional studies are reported.

NaPDoS2 applications

Large-scale performance evaluations targeting genomic, metagenomic, and amplicon sequence data from a variety of bacterial, fungal, metazoa, and environmental sources were conducted to demonstrate the utility of NaPDoS2 in identifying polyketide biosynthetic potential (Table 1; accession numbers in Table S3). These examples include the integration of NaPDoS2 output with other webtools (Fig. S3B).

Table 1.

NaPDoS2 applications

Table # Application Data type Biological source Dataset Ref.
S4, S5 Bacterial type II KS Genome Bacteria 118 Salinispora strains (53)
S6 Fungal KS & FAS Genome Fungi 27 Fungal spp. (55)
S7 Fungal KS & FAS Genome Fungi 159 Fungal MIBiG 2.0 PKS BGCs (22)
S7 Environmental type II KS Amplicon Environmental DNA 147 KS clones from soil (61)
S8 Eukaryotic KS & FAS Genome Metazoa Elysia chlorotica (56, 57)
S9 Environmental type I & II KS, FAS Metagenome Environmental DNA 20 Marine sediment samples (60)
S10 Environmental type I KS Amplicon Environmental DNA eSNaPD v2.0 KS sequences from soil (62)
S11 Environmental type II KS Amplicon Environmental DNA Type II KS sequences from 12 soil samples (63)
S12 Environmental and cultured type I & type II KS Amplicon Bacteria, environmental DNA Type I and II KS sequences from lake sediment and enrichment cultures (64)

The utility of NaPDoS2 was demonstrated across a variety of data types and biological sources. Details for each analysis can be found in Tables S4–S12 as summarized below. Table S3 lists accession numbers for all analyses.

Genomes

While the original NaPDoS release simply identified KSs as type II, the new release provides more sensitive detection and more detailed classifications. We analyzed 118 Salinispora genomes (53) with NaPDoS2 and detected a total of 662 type II KS domains in contrast to 363 using the original release (Table S4). The type II KSs detected by NaPDoS2 were delineated into seven functionally defined classes and subclasses; those that were unassigned may represent new functional diversity. A broader summary of all KSs detected in these genomes provided the first evidence that S. arenicola has the potential to produce PTMs (Table S5), a class of structurally complex natural products that exhibits diverse biological activities (54). These results provide new insight into the biosynthetic potential of this marine actinomycete genus.

Another important NaPDoS2 improvement is the ability to detect and classify eukaryotic KS sequences. This is illustrated by the analysis of 27 taxonomically diverse fungal genomes (55), where KSs ranged from one in Malassezia globosa to 50 in Aspergillus niger (Table S6) and the majority could be assigned to the HR subclass. We next analyzed all 159 fungal PKS BGCs in the MIBiG 2.0 repository (22) and identified 182 KS domains, all of which matched MIBiG 2.0 descriptions and literature reports (Table S7). In contrast, the original NaPDoS release only identified 14 KS domains from these same BGCs. Although relatively few metazoan PKSs have been experimentally characterized, NaPDoS2 recovered the recently described FAS and FAS-like KSs from the Elysia chlorotica sacoglossan genome (Table S8) (56, 57) and correctly classified them as metazoan type I FASs. A phylogenetic tree generated using NaPDoS2 confirmed their divergence from previously characterized animal FASs (data not shown). In an exploratory search for KSs in other eukaryotic genomes, NaPDoS2 detected 37 modular cis-AT domains, 17 type I FAS domains, 2 type II FAS domains, and 1 protist-type KS domain from the dinoflagellate Symbodium minimus (58), and six type II FAS domains and one type II KSα aromatic anthracycline domain from the diatom Nitzschia inconpicua (59), thus further validating its utility for analyzing eukaryotic sequences. While these data show that NaPDoS2 can detect and classify KS domains from complex metazoan datasets, it is best suited for predicted proteins, translated coding sequences, or transcriptomes, since it cannot excise introns associated with eukaryotic sequence data.

Metagenomes

A single NaPDoS2 analysis can provide a simultaneous overview of both bacterial and eukaryotic polyketide and fatty acid biosynthetic potential in large metagenomic datasets. While it is not reliant on fully assembled BGCs, assembled contigs are recommended to avoid the reduced classification accuracies associated with short sequence reads (Figs. 7 and S11). To illustrate this application, we assessed 20 assembled marine sediment metagenomes deposited in the Paired Omics Data Platform (60). We observed a wide range in the numbers and types of KSs detected, which can provide important insight when selecting samples for further study (Table S9). These results show how NaPDoS2 can be used to identify samples with the potential to produce compounds in rare but biologically active classes, such as enediynes and PTMs, to expand on poorly understood metazoan PKS diversity and, in cases with low sequence similarity to database or BLAST matches, detect new functional diversity. Trimmed domains can be used as search queries to assess broader genomic context (Fig. S6), compare with previously reported BGCs (22), and potentially identify the host organism when phylogenetic markers are encountered in the KS-containing contig.

Amplicons

NaPDoS2 is particularly useful for the analysis of KS/C domain PCR amplicon datasets, where it can be used not only to classify sequences into specific functional categories and assign a top BGC product match but also to assess primer specificity and remove nontarget sequences prior to downstream analysis. We illustrate the applications of NaPDoS2 using four KS amplicon datasets, starting with 147 KS sequences cloned from soil eDNA using type II–specific KS primers (61). Both versions of NaPDoS identified all 147 KSs as type II, while NaPDoS2 further delineated them into three type II aromatic subclasses, which agrees with the original report (Table S7). We next compared the NaPDoS2 output with that from eSNaPD v 2.0, which relates KS amplicon sequences to a database of characterized BGCs (6, 62). NaPDoS2 detected all 381 KS sequences in the eSNaPD v 2.0 New Mexico desert soil library “NM_KS_ARRY_LIB01” dataset (62) and classified the vast majority as cis-AT modular (Table S10). Additionally, NaPDoS2 classified virtually all KS sequences within what eSNaPD2 v 2.0 listed as novel clusters, providing new information about the biosynthetic diversity within this library (6, 62) (Table S10).

Larger KS amplicon datasets generated from soil eDNA and lake sediment enrichment cultures using type I and II specific primers were also evaluated (63, 64). While the original analyses estimated biosynthetic potential based solely on the closest MIBiG repository match, NaPDoS2 further delineated the sequences into specific type I and type II classes and subclasses. Analysis of the type II soil amplicons (63) revealed that many were not identified as KSs and appear to represent off-target sequences. Of those identified as KSs, a number were classified by NaPDos2 as type II FASs and a few as type I PKSs (Table S11). This illustrates the application of NaPDoS2 to assess primer specificity and identify potential nontarget KS sequences prior to downstream analyses.

Finally, we addressed the question of whether lowering the minimum alignment length for amplicons might increase the number of false positives, as previously observed for random domain fragments (Fig. 7). To do this, we analyzed 40,000 randomly selected sequences from longer amplicon (602 bp) type I and II KS datasets from lake sediment enrichment cultures (64) using a range of minimum amino acid alignment lengths (5000 of these sequences shown in Table S12). We placed both hit and nonhit sequences in a phylogenetic context within the condensing enzyme superfamily tree (52, 64) and mapped the conserved domains with TREND (65) (Fig. S13). Sequences identified by NaPDoS2 as KSs, regardless of alignment length (30–200aa), fall within the two clades associated with the NaPDoS2 database positive control sequences. Conversely, the sequences that NaPDoS2 did not identify as KSs fell outside of these clades and were associated with off-target, nonketosynthase domains such as AMP-binding domains and multiple phage-related domains. These results confirm that shorter NaPDoS2 alignment length settings can be used to assess polyketide biosynthetic potential from amplicon datasets and highlight the value of verifying detection and classification accuracy using phylogenetic approaches and conserved domain architectures.

Conclusions

While several tools can detect and classify the gene clusters associated with natural product biosynthesis (1, 6), NaPDoS2 employs KS and C domains as sequence tags to predict biosynthetic potential. This approach makes it well-suited for nonclustered or incomplete BGCs, amplicons, and eukaryotic genomes where other tools are less effective. The NaPDoS2 update features an expanded KS database and classification scheme that better reflects the broader taxonomic and functional diversity now recognized among type I and II PKSs. It provides a single workflow to detect and classify KS domains from diverse biological origins including bacteria, fungi, and other eukaryotes and to distinguish among those involved in fatty acid and specialized metabolite production. Updates to workflow efficiency can now accommodate the larger genomic, metagenomic, and amplicon datasets achievable with next-generation sequencing. NaPDoS2 provides a rapid method to identify microorganisms or environments with the potential to yield rare classes of compounds, such as those produced by noniterative type II PKSs (13). It can be used to prioritize samples for cultivation and to identify potentially new biosynthetic mechanisms when sequences are phylogenetically distinct from those previously characterized, as recently demonstrated for the standalone KS (salC) that functions as an aldolase/β-lactone synthase in salinosporamide A biosynthesis (66). The NaPDoS2 upgrades expand on the PKS diversity that can be detected with this tool and provide a method to quickly assess biosynthetic potential in a manner that facilitates more targeted approaches to natural product discovery.

Experimental procedures

Sequence database expansion

Experimentally validated PKS BGCs were selected from MIBiG (https://mibig.secondarymetabolites.org/) and published reports. Sequences were extracted from GenBank (https://www.ncbi.nlm.nih.gov/genbank/) and KS domains identified using annotations from both published literature and MIBiG. In silico predictions were made using the NRPS/PKS prediction tool (67). KS domains were further annotated (e.g., type II noniterative) according to experimentally verified functions and information derived from the associated gene or BGC. Metadata including BGC name, BGC type (e.g., PKS, NRPS), MIBiG accession number, PubMed reference, source organism, and example product name and structure were recorded for each BGC. The complete NaPDoS2 amino acid sequence database (KS and C domains in FASTA format), BGC table, and domain metadata can be downloaded from the NaPDoS2 website.

KS sequence alignment, phylogenetic analysis, and classification

Phylogenies generated from the 1877 KS database sequences (1417 new and 460 existing) were used to establish class and subclass assignments based on correspondence between functional annotation and tree topology. Sequences were aligned using MAFFT (v 7.017) (68) as implemented in Geneious (v 6.1.8) using the FFT-NS-i × 1000 alignment algorithm, BLOSUM62 scoring matrix, and defaults for both gap open penalty and offset value (1.53 and 0.123, respectively). The alignments were trimmed, exported to PHYLIP, and phylogenies generated using the PhyML online tool (http://www.atgc-montpellier.fr/phyml/) with smart model selection enabled (69).

The type II aromatic PKS subclasses were established using a concatenated alignment of the KSα and KSβ subunits and phylogenies generated using the methods described above. Subclasses were delineated based on phylogenetic groupings and features of the core polyketide structure including the length of the poly-β-ketoacyl intermediate, the carbon position of the first ring cyclization, and the type of starter unit. The accuracy of select NaPDoS2 KS classifications from the use case analyses were assessed by analyzing the associated contig with antiSMASH 6.0 (1), the NRPS/PKS prediction tool (67), and transATor (7) (Fig. S6).

KS reference trees

A subset of 414 KS sequences representative of all class and subclass assignments were chosen to generate a KS reference tree (Figs. 1 and S7). Sequences were aligned using MAFFT (68) (v 7.407, FFT-NS-i × 1000 alignment algorithm, BLOSUM62 scoring matrix, default gap open penalty = 1.53, and offset value = 0.123) and trimmed using trimAI (70) (1.4.1 with automatic configuration). The trimmed alignment was used to estimate a maximum likelihood tree using FastTree (71) (v 2.1.11, LG+G model of evolution) with 1000 bootstraps. Booster (72) (v 0.3.1) was used to estimate support as transfer bootstrap expectation values. These programs were implemented in ngphylogeny.fr (71) and run locally as a docker image. Two type II KS phylogenies were generated following the same steps. One comprising all 201 type II KS sequences in the database, eight FAS sequences, and three thiolase outgroup sequences (Figs. 3 and S8). The second used concatenated KSα and KSβ subunits from 59 type II aromatic BGCs (Figs. 4 and S9). Trees were visualized using TreeViewer (https://treeviewer.org/) and annotations added using Adobe Illustrator.

The NaPDoS2 workflow

Database sequences and associated metadata are stored in a back-end MySQL database linked to the NaPDoS2 web portal through CGI-scripting as previously described (8). The transeq tool from the EMBOSS package (v 6.6) is used for 6-frame translations of nucleic acid queries (73). BLAST queries of amino acid sequences are performed using DIAMOND v 0.9.29 (19). Multiple sequence alignments are obtained with MUSCLE v 3.8 (74) using the profile alignment feature to merge query sequences with previously aligned database sequences. Phylogenetic trees are constructed from trimmed amino acid sequences using FastTree v 2.2.1 (75). Graphical tree depictions are generated using Newick Utilities v 1.5.0.

Performance testing

The 1877 full-length amino acid sequences in the NaPDoS2 KS database were clustered at 50% identity using CD-HIT v 4.7 (76), yielding 213 nonredundant positive controls. Using the CD-Search (77) function of the curated NCBI Conserved Domain Database (78), negative controls were selected from subfamilies within the condensing enzyme superfamily (cl09938) that are functionally related to type I and II KS domains but phylogentically distinct from the positive control sequences. An additional 49 sequences (52) and 14 KSs from type III PKSs in the MIBiG 2.0 repository (22) were added for a total of 697 sequences (Table S3), which were also clustered at 50% identity using CD-HIT to obtain 308 nonredundant negative controls.

Cross-validation and receiver operating characteristic curves

Domain detection sensitivity and specificity were determined using BLASTP searches of full-length positive and negative control sequences against the NaPDoS and NaPDoS2 databases, excluding self-matches, to generate leave-one-out cross-validation values. Receiver operating characteristic (ROC) curves were constructed and AUC values calculated from these results using easyROC v 1.3 (79). EasyROC data output tables were used to identify potential cutoff points based on the most restrictive cross-validation e-value at which 100% of true positives were detected, thus maximizing sensitivity with the minimum possible number of false positives.

KS detection and classification accuracy

A custom perl script (sequence_subdivider.pl, available at https://github.com/spodell/NaPDoS2_website) was used to subdivide the 213 positive control KS sequences into test sets containing overlapping subsequences of 30, 50, 100, or 200 amino acids, each offset by a 10 amino acid sliding window start site. These size-selected test sets, which contained 8555, 8129, 7064, and 4934 subsequences, respectively, were analyzed using the NaPDoS2 workflow to assess the effects of query size on classification accuracy. Accuracy evaluations for each test set were based on the percentage of subsequences whose best nonself, BLASTP match had the same NaPDoS2 classification as the original, full-length sequence from which it was derived.

Application use cases

Accession information for all sequences and datasets analyzed is provided in Table S3. All analyses used the following default settings unless noted otherwise: NaPDoS version 1: domain detection: Hidden Markov Model 1e-5, 200aa minimum alignment length, pathway assignment: e-value cutoff of 1e-5. NaPDoS2: e-value cutoff 1e-8 and 200aa minimum alignment length. Sequence files containing >500,000 sequences or larger than 500 MB were split into smaller subunits using a custom perl script (serialize_seqs.pl, available at https://github.com/spodell/NaPDoS2_website).

Genomes and metagenomes

Salinispora genome protein sequences (53) downloaded from NCBI and JGI IMG/MER (80) were concatenated into a single FASTA file for NaPDoS2 analysis. Fungal genome protein sequences were downloaded from NCBI; fungal PKS BGCs were extracted from the MIBiG 2.0 repository using the query “Kingdom: Fungi AND BGC type: pks”. Coding sequences and predicted proteins for E. chlorotica were downloaded from NCBI (56). Trimmed E. chlorotica KS sequences generated by NaPDoS2 were aligned with previously published EcPKS1, EcPKS2, and EcFAS sequences (57) and used to construct a phylogenetic tree with the closest NaPDoS2 database hits. Marine sediment metagenomes were selected from the Paired Omics Data Platform (60) and downloaded from NCBI SRA.

Amplicons

KS amplicon sequences were analyzed using minimum alignment lengths of 50aa in NaPDoS2 unless otherwise noted. Sequence accession and dataset references are listed in Table S3. Random subsets of query sequences were obtained using custom perl scripts (get_seq_info.pl, randomize_lines.pl, serialize_large_list.pl, and getseq_multiple.pl, available at https://github.com/spodell/NaPDoS2_website).

Data availability

Relevant alignment files, phylogenetic tree files, webtool documentation file, example tutorials, sequence files, and other supporting information can be found on the corresponding OSF project page: https://osf.io/uzhcp/. The code used to construct version two of the Natural Product Domain Seeker website can be found on the corresponding Github repository: https://github.com/spodell/NaPDoS2_website. All accession information and dataset references can be found in Table S3 (separate Microsoft Excel file).

Supporting information

Additional figures and tables referred to in the text (1, 7, 8, 22, 52, 53, 56, 57, 60, 61, 62, 63, 64, 65, 67, 74, 75, 78, 81, 82, 83, 84, 85).

Figs. S1–S13: NaPDoS2 workflow and website screenshots, updated database statistics and classification validation, positive/negative control sequence selection, classification category accuracy validation, phylogenetic context of amplicon dataset analysis, and expanded KS phylogenetic trees.

Tables S1–S12: NaPDoS2 dataset analysis speed comparisons, classification category taxa, application use case analyses of genome, metagenome, and amplicon datasets, and accession information for all analyzed sequences and datasets. Table S3 (Microsoft Excel file): all dataset and sequence accession information.

Conflict of interest

The authors declare that they have no conflicts of interest with the contents of this article.

Acknowledgments

The authors acknowledge James Wang and Ghulam Mustafa for assistance in identifying sequences added to the database. We also thank Dulce Guillén-Matus and Jeong Sang Yi for providing useful feedback on beta-versions of the webtool.

Author contributions

L. J. K., S. P., N. Z., A. C. L., and P. R. J conceptualization; L. J. K., K. E. C., A. M. D., S. P., N. Z., and A. C. L. methodology; L. J. K., K. E. C., A. M. D., S. P., H. W. S., B. S. M., P. R. J., and A. C. L. formal analyses; L. J. K., K. E. C., S. P., and A. M. D. data curation, L. J. K. writing-original draft; S. P., K. E. C. validation; S. P. software; S. P., K. E. C., A. M. D., H. W. S., B. S. M., E. E. A., P. R. J., and N. Z. writing–review and editing; B. S. M. and P. R. J. funding acquisition.

Funding and additional information

This research was supported by the National Institutes of Health (grant no. 5R01GM085770 to P. R. J. and B. S. M.), the National Science Foundation Graduate Research Fellowship Program (grant no. DGE-1650112 to K. E. C. and A. M. D.; grant no. DGE-2038238 to H. W. S.), and the National Science Foundation Division of Molecular and Cellular Biosciences (grant MCB-1149552 to E. E. A.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health or the National Science Foundation.

Edited by F. Peter Guengerich

Footnotes

Present addresses for Alyssa M Demko: Smithsonian Marine Station, Fort Pierce, FL 34949, United States.

Present addresses for Nadine Ziemert: Interfaculty Institute of Microbiology and Infection Medicine, Institute for Bioinformatics and Medical Informatics (IBMI), University of Tübingen, Auf der Morgenstelle 28, 72,076 Tübingen, Germany. German Centre for Infection Research (DZIF), Partner Site Tübingen, Germany.

Supporting information

Supporting Information
mmc1.pdf (4MB, pdf)
Table S3
mmc2.xlsx (138.8KB, xlsx)

References

  • 1.Blin K., Shaw S., Kloosterman A.M., Charlop-Powers Z., van Wezel G.P., Medema M.H., et al. antiSMASH 6.0: improving cluster detection and comparison capabilities. Nucl. Acids Res. 2021;1:W29–W35. doi: 10.1093/nar/gkab335. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Skinnider M.A., Johnston C.W., Gunabalasingam M., Merwin N.J., Kieliszek A.M., MacLellan R.J., et al. Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences. Nat. Comm. 2020;11:6058. doi: 10.1038/s41467-020-19986-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Medema M.H. The year 2020 in natural product bioinformatics: an overview of the latest tools and databases. Nat. Prod. Rep. 2021;38:301–306. doi: 10.1039/d0np00090f. [DOI] [PubMed] [Google Scholar]
  • 4.Medema M.H., de Rond T., Moore B.S. Mining genomes to illuminate the specialized chemistry of life. Nat. Rev. Genet. 2021;22:553–571. doi: 10.1038/s41576-021-00363-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Mungan M.D., Alanjary M., Blin K., Weber T., Medema M.H., Ziemert N. Arts 2.0: feature updates and expansion of the antibiotic resistant target seeker for comparative genome mining. Nucl. Acids Res. 2020;48:W546–W552. doi: 10.1093/nar/gkaa374. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Reddy B.V.B., Milshteyn A., Charlop-Powers Z., Brady S.F. eSNaPD: a versatile, web-based bioinformatics platform for surveying and mining natural product biosynthetic diversity from metagenomes. Chem. Biol. 2014;21:1023–1033. doi: 10.1016/j.chembiol.2014.06.007. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Helfrich E.J., Ueoka R., Dolev A., Rust M., Meoded R.A., Bhushan A., et al. Automated structure prediction of trans-acyltransferase polyketide synthase products. Nat. Chem. Biol. 2019;15:813–821. doi: 10.1038/s41589-019-0313-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Ziemert N., Podell S., Penn K., Badger J.H., Allen E., Jensen P.R. The natural product domain seeker NaPDoS: a phylogeny based bioinformatic tool to classify secondary metabolite gene diversity. PLoS One. 2012;7 doi: 10.1371/journal.pone.0034064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Hertweck C. The biosynthetic logic of polyketide diversity. Angew. Chem. Int. Ed. 2009;48:4688–4716. doi: 10.1002/anie.200806121. [DOI] [PubMed] [Google Scholar]
  • 10.Schöner T.A., Gassel S., Osawa A., Tobias N.J., Okuno Y., Sakakibara Y., et al. Aryl polyenes, a highly abundant class of bacterial natural products, are functionally related to antioxidative carotenoids. ChemBioChem. 2016;17:247–253. doi: 10.1002/cbic.201500474. [DOI] [PubMed] [Google Scholar]
  • 11.Wietz M., Duncan K., Patin N., Jensen P. Antagonistic interactions mediated by marine bacteria: the role of small molecules. J. Chem. Ecol. 2013;39:879–891. doi: 10.1007/s10886-013-0316-x. [DOI] [PubMed] [Google Scholar]
  • 12.Fischbach M.A., Walsh C.T. Assembly-line enzymology for polyketide and nonribosomal peptide antibiotics: logic, machinery, and mechanisms. Chem. Rev. 2006;106:3468–3496. doi: 10.1021/cr0503097. [DOI] [PubMed] [Google Scholar]
  • 13.Shen B. Polyketide biosynthesis beyond the type I, II and III polyketide synthase paradigms. Curr. Opin. Chem. Biol. 2003;7:285–295. doi: 10.1016/s1367-5931(03)00020-6. [DOI] [PubMed] [Google Scholar]
  • 14.Lohman J.R., Ma M., Osipiuk J., Nocek B., Kim Y., Chang C., et al. Structural and evolutionary relationships of “AT-less” type I polyketide synthase ketosynthases. Proc. Nat. Acad. Sci. U. S. A. 2015;112:12693–12698. doi: 10.1073/pnas.1515460112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Grammbitter G.L., Schmalhofer M., Karimi K., Shi Y.-M., Schoner T.A., Tobias N.J., et al. An uncommon type II PKS catalyzes biosynthesis of aryl polyene pigments. J. Amer. Chem. Soc. 2019;141:16615–16623. doi: 10.1021/jacs.8b10776. [DOI] [PubMed] [Google Scholar]
  • 16.Metsa-Ketela M., Halo L., Munukka E., Hakala J., Mantsala P., Ylihonko K. Molecular evolution of aromatic polyketides and comparative sequence analysis of polyketide ketosynthase and 16S ribosomal DNA genes from various Streptomyces species. Appl. Environ. Microbiol. 2002;68:4472–4479. doi: 10.1128/AEM.68.9.4472-4479.2002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Jenke-Kodama H., Sandmann A., Muller R., Dittmann E. Evolutionary implications of bacterial polyketide synthases. Mol. Biol. Evol. 2005;22:2027–2039. doi: 10.1093/molbev/msi193. [DOI] [PubMed] [Google Scholar]
  • 18.Moffitt M.C., Neilan B.A. Evolutionary affiliations within the superfamily of ketosynthases reflect complex pathway associations. J. Mol. Evol. 2003;56:446–457. doi: 10.1007/s00239-002-2415-0. [DOI] [PubMed] [Google Scholar]
  • 19.Buchfink B., Xie C., Huson D.H. Fast and sensitive protein alignment using DIAMOND. Nat. Met. 2015;12:59–60. doi: 10.1038/nmeth.3176. [DOI] [PubMed] [Google Scholar]
  • 20.Altschul S.F., Gish W., Miller W., Myers E.W., Lipman D.J. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
  • 21.Sigrist R., Luhavaya H., McKinnie S.M., Ferreira da Silva A., Jurberg I.D., Moore B.S., et al. Nonlinear biosynthetic assembly of alpiniamide by a hybrid cis/trans-AT PKS-NRPS. ACS Chem. Biol. 2020;15:1067–1077. doi: 10.1021/acschembio.0c00081. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Kautsar S.A., Blin K., Shaw S., Navarro-Muñoz J.C., Terlouw B.R., van der Hooft J.J., et al. MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucl. Acids Res. 2020;48:D454–D458. doi: 10.1093/nar/gkz882. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Feng L., Gordon M.T., Liu Y., Basso K.B., Butcher R.A. Mapping the biosynthetic pathway of a hybrid polyketide-nonribosomal peptide in a metazoan. Nat. Comm. 2021;12:4912. doi: 10.1038/s41467-021-24682-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Austin M.B., Saito T., Bowman M.E., Haydock S., Kato A., Moore B.S., et al. Biosynthesis of Dictyostelium discoideum differentiation-inducing factor by a hybrid type I fatty acid–type III polyketide synthase. Nat. Chem. Biol. 2006;2:494–502. doi: 10.1038/nchembio811. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Coates R.C., Podell S., Korobeynikov A., Lapidus A., Pevzner P., Sherman D.H., et al. Characterization of cyanobacterial hydrocarbon composition and distribution of biosynthetic pathways. PLoS One. 2014;9 doi: 10.1371/journal.pone.0085140. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Nakamura H., Hamer H.A., Sirasani G., Balskus E.P. Cylindrocyclophane biosynthesis involves functionalization of an unactivated carbon center. J. Am. Chem. Soc. 2012;134:18518–18521. doi: 10.1021/ja308318p. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Walker P., Weir A., Willis C., Crump M. Polyketide β-branching: diversity, mechanism and selectivity. Nat. Prod. Rep. 2021;38:723–756. doi: 10.1039/d0np00045k. [DOI] [PubMed] [Google Scholar]
  • 28.Herbst D.A., Townsend C.A., Maier T. The architectures of iterative type I PKS and FAS. Nat. Prod. Rep. 2018;35:1046–1069. doi: 10.1039/c8np00039e. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Chen H., Du L. Iterative polyketide biosynthesis by modular polyketide synthases in bacteria. Appl. Microbiol. Biotechnol. 2016;100:541–557. doi: 10.1007/s00253-015-7093-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Metz J.G., Roessler P., Facciotti D., Levering C., Dittrich F., Lassner M., et al. Production of polyunsaturated fatty acids by polyketide synthases in both prokaryotes and eukaryotes. Science. 2001;293:290–293. doi: 10.1126/science.1059593. [DOI] [PubMed] [Google Scholar]
  • 31.Cao S., Blodgett J.A., Clardy J.J.O.L. Targeted discovery of polycyclic tetramate macrolactams from an environmental Streptomyces strain. Org. Lett. 2010;12:4652–4654. doi: 10.1021/ol1020064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Gallo A., Ferrara M., Perrone G. Phylogenetic study of polyketide synthases and nonribosomal peptide synthetases involved in the biosynthesis of mycotoxins. Toxins. 2013;5:717–742. doi: 10.3390/toxins5040717. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Chooi Y.-H., Tang Y. Navigating the fungal polyketide chemical space: from genes to molecules. J. Org. Chem. 2012;77:9933–9953. doi: 10.1021/jo301592k. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Schmitt I., Lumbsch H.T. Ancient horizontal gene transfer from bacteria enhances biosynthetic capabilities of fungi. PLoS One. 2009;4:e4437. doi: 10.1371/journal.pone.0004437. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Nguyen T., Ishida K., Jenke-Kodama H., Dittmann E., Gurgui C., Hochmuth T., et al. Exploiting the mosaic structure of trans-acyltransferase polyketide synthases for natural product discovery and pathway dissection. Nat. Biotechnol. 2008;26:225–233. doi: 10.1038/nbt1379. [DOI] [PubMed] [Google Scholar]
  • 36.Hertweck C. Decoding and reprogramming complex polyketide assembly lines: prospects for synthetic biology. Trends Biochem. Sci. 2015;40:189–199. doi: 10.1016/j.tibs.2015.02.001. [DOI] [PubMed] [Google Scholar]
  • 37.Bretschneider T., Heim J.B., Heine D., Winkler R., Busch B., Kusebauch B., et al. Vinylogous chain branching catalysed by a dedicated polyketide synthase module. Nature. 2013;502:124–128. doi: 10.1038/nature12588. [DOI] [PubMed] [Google Scholar]
  • 38.Chen A., Re R.N., Burkart M.D. Type II fatty acid and polyketide synthases: deciphering protein–protein and protein–substrate interactions. Nat. Prod. Rep. 2018;35:1029–1045. doi: 10.1039/c8np00040a. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Hillenmeyer M.E., Vandova G.A., Berlew E.E., Charkoudian L.K. Evolution of chemical diversity by coordinated gene swaps in type II polyketide gene clusters. Proc. Natl. Acad. Sci. U. S. A. 2015;112:13952–13957. doi: 10.1073/pnas.1511688112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Tang Y., Tsai S.-C., Khosla C. Polyketide chain length control by chain length factor. J. Amer. Chem. Soc. 2003;125:12708–12709. doi: 10.1021/ja0378759. [DOI] [PubMed] [Google Scholar]
  • 41.Komaki H., Harayama S. Sequence diversity of type-II polyketide synthase genes in Streptomyces. Actinomycetologica. 2006;20:42–48. [Google Scholar]
  • 42.Kim J., Yi G.-S. PKMiner: a database for exploring type II polyketide synthases. BMC Microbiol. 2012;12:169. doi: 10.1186/1471-2180-12-169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Villebro R., Shaw S., Blin K., Weber T. Sequence-based classification of type II polyketide synthase biosynthetic gene clusters for antiSMASH. J. Ind. Microbiol. Biotechnol. 2019;46:469–475. doi: 10.1007/s10295-018-02131-9. [DOI] [PubMed] [Google Scholar]
  • 44.Kwon H.-J., Smith W.C., Scharon A.J., Hwang S.H., Kurth M.J., Shen B. CO bond formation by polyketide synthases. Science. 2002;297:1327–1330. doi: 10.1126/science.1073175. [DOI] [PubMed] [Google Scholar]
  • 45.Walczak R.J., Woo A.J., Strohl W.R., Priestley N.D. Nonactin biosynthesis: the potential nonactin biosynthesis gene cluster contains type II polyketide synthase-like genes. FEMS Microbiol. Lett. 2000;183:171–175. doi: 10.1111/j.1574-6968.2000.tb08953.x. [DOI] [PubMed] [Google Scholar]
  • 46.Rebets Y., Brötz E., Manderscheid N., Tokovenko B., Myronovskyi M., Metz P., et al. Insights into the pamamycin biosynthesis. Angew. Chem. Intern. Ed. 2015;54:2280–2284. doi: 10.1002/anie.201408901. [DOI] [PubMed] [Google Scholar]
  • 47.Cimermancic P., Medema M.H., Claesen J., Kurita K., Brown L.C.W., Mavrommatis K., et al. Insights into secondary metabolism from a global analysis of prokaryotic biosynthetic gene clusters. Cell. 2014;158:412–421. doi: 10.1016/j.cell.2014.06.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Du D., Katsuyama Y., Horiuchi M., Fushinobu S., Chen A., Davis T.D., et al. Structural basis for selectivity in a highly reducing type II polyketide synthase. Nat. Chem. Biol. 2020;16:776–782. doi: 10.1038/s41589-020-0530-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Lee W.C., Choi S., Jang A., Yeon J., Hwang E., Kim Y. Structural basis of the complementary activity of two ketosynthases in aryl polyene biosynthesis. Sci. Rep. 2021;11:1–10. doi: 10.1038/s41598-021-95890-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Hastie T., Tibshirani R., Friedman J.H., Friedman J.H. Springer; New York: 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. [Google Scholar]
  • 51.Fawcett T. An introduction to ROC analysis. Pattern Recog. Lett. 2006;27:861–874. [Google Scholar]
  • 52.Jiang C., Kim S.Y., Suh D.-Y. Divergent evolution of the thiolase superfamily and chalcone synthase family. Mol. Phylogenet. Evol. 2008;49:691–701. doi: 10.1016/j.ympev.2008.09.002. [DOI] [PubMed] [Google Scholar]
  • 53.Millán-Aguiñaga N., Chavarria K.L., Ugalde J.A., Letzel A.-C., Rouse G.W., Jensen P.R. Phylogenomic insight into Salinispora (bacteria, Actinobacteria) species designations. Sci. Rep. 2017;7:3564. doi: 10.1038/s41598-017-02845-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Zhang G., Zhang W., Saha S., Zhang C. Recent advances in discovery, biosynthesis and genome mining of medicinally relevant polycyclic tetramate macrolactams. Curr. Top. Med. Chem. 2016;16:1727–1739. doi: 10.2174/1568026616666151012112818. [DOI] [PubMed] [Google Scholar]
  • 55.Almeida H., Tsang A., Diallo A.B. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2019. Supporting supervised learning in fungal Biosynthetic Gene Cluster discovery: new benchmark datasets; pp. 1280–1287. [Google Scholar]
  • 56.Cai H., Li Q., Fang X., Li J., Curtis N.E., Altenburger A., et al. A draft genome assembly of the solar-powered sea slug Elysia chlorotica. Sci. Data. 2019;6:1–13. doi: 10.1038/sdata.2019.22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Torres J.P., Lin Z., Winter J.M., Krug P.J., Schmidt E.W. Animal biosynthesis of complex polyketides in a photosynthetic partnership. Nat. Comm. 2020;11:190022. doi: 10.1038/s41467-020-16376-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Beedessee G., Hisata K., Roy M.C., Satoh N., Shoguchi E. Multifunctional polyketide synthase genes identified by genomic survey of the symbiotic dinoflagellate, Symbiodinium minutum. BMC Genomics. 2015;16:1–11. doi: 10.1186/s12864-015-2195-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Oliver A., Podell S., Pinowska A., Traller J.C., Smith S.R., McClure R., et al. Diploid genomic architecture of Nitzschia inconspicua, an elite biomass production diatom. Sci. Rep. 2021;11:1–14. doi: 10.1038/s41598-021-95106-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Schorn M.A., Verhoeven S., Ridder L., Huber F., Acharya D.D., Aksenov A.A., et al. A community resource for paired genomic and metabolomic data mining. Nat. Chem. Biol. 2021;17:363–368. doi: 10.1038/s41589-020-00724-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Wawrik B., Kerkhof L., Zylstra G.J., Kukor J.J. Identification of unique type II polyketide synthase genes in soil. Appl. Environ. Microbiol. 2005;71:2232–2238. doi: 10.1128/AEM.71.5.2232-2238.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Owen J.G., Reddy B.V.B., Ternei M.A., Charlop-Powers Z., Calle P.Y., Kim J.H., et al. Mapping gene clusters within arrayed metagenomic libraries to expand the structural diversity of biomedically relevant natural products. Proc. Natl. Acad. Sci. U. S. A. 2013;110:11797–11802. doi: 10.1073/pnas.1222159110. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Borsetto C., Amos G.C., da Rocha U.N., Mitchell A.L., Finn R.D., Laidi R.F., et al. Microbial community drivers of PK/NRP gene diversity in selected global soils. Microbiome. 2019;7:1–11. doi: 10.1186/s40168-019-0692-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Elfeki M., Alanjary M., Green S.J., Ziemert N., Murphy B.T. Assessing the efficiency of cultivation techniques to recover natural product biosynthetic gene populations from sediment. ACS Chem. Biol. 2018;13:2074–2081. doi: 10.1021/acschembio.8b00254. [DOI] [PubMed] [Google Scholar]
  • 65.Gumerov V.M., Zhulin I.B. Trend: a platform for exploring protein function in prokaryotes based on phylogenetic, domain architecture and gene neighborhood analyses. Nucl. Acids Res. 2020;48:W72–W76. doi: 10.1093/nar/gkaa243. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Bauman K.D., Shende V.V., Chen P.Y.-T., Trivella D.B., Gulder T.A., Vellalath S., et al. Enzymatic assembly of the salinosporamide γ-lactam-β-lactone anticancer warhead. Nat. Chem. Biol. 2022;18:538–546. doi: 10.1038/s41589-022-00993-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Bachmann B.O., Ravel J. Methods for in silico prediction of microbial polyketide and nonribosomal peptide biosynthetic pathways from DNA sequence data. Met. Enzymol. 2009;458:181–217. doi: 10.1016/S0076-6879(09)04808-3. [DOI] [PubMed] [Google Scholar]
  • 68.Katoh K., Standley D.M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.Lefort V., Longueville J.-E., Gascuel O. Sms: smart model selection in PhyML. Mol. Biol. Evol. 2017;34:2422–2424. doi: 10.1093/molbev/msx149. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Capella-Gutiérrez S., Silla-Martínez J.M., Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Lemoine F., Correia D., Lefort V., Doppelt-Azeroual O., Mareuil F., Cohen-Boulakia S., et al. NGPhylogeny. Fr: new generation phylogenetic services for non-specialists. Nucl. Acids Res. 2019;47:W260–W265. doi: 10.1093/nar/gkz303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Lemoine F., Domelevo Entfellner J.-B., Wilkinson E., Correia D., Dávila Felipe M., De Oliveira T., et al. Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature. 2018;556:452–456. doi: 10.1038/s41586-018-0043-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Rice P., Longden I., Bleasby A. Emboss: the European molecular biology open software suite. Trends Genet. 2000;16:276–277. doi: 10.1016/s0168-9525(00)02024-2. [DOI] [PubMed] [Google Scholar]
  • 74.Edgar R.C. Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinform. 2004;5:113. doi: 10.1186/1471-2105-5-113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Price M.N., Dehal P.S., Arkin A.P. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One. 2010;5 doi: 10.1371/journal.pone.0009490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Fu L., Niu B., Zhu Z., Wu S., Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–3152. doi: 10.1093/bioinformatics/bts565. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Marchler-Bauer A., Bryant S.H. CD-search: protein domain annotations on the fly. Nucl. Acids Res. 2004;32:W327–W331. doi: 10.1093/nar/gkh454. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Lu S., Wang J., Chitsaz F., Derbyshire M.K., Geer R.C., Gonzales N.R., et al. CDD/SPARCLE: the conserved domain database in 2020. Nucl. Acids Res. 2020;48:D265–D268. doi: 10.1093/nar/gkz991. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Goksuluk D., Korkmaz S., Zararsiz G., Karaagaoglu A.E. easyROC: an interactive web-tool for ROC curve analysis using R language environment. R. J. 2016;8:213. [Google Scholar]
  • 80.Chen I.-M.A., Chu K., Palaniappan K., Pillay M., Ratner A., Huang J., et al. IMG/M v. 5.0: an integrated data management and comparative analysis system for microbial genomes and microbiomes. Nucl. Acids Res. 2019;47:D666–D677. doi: 10.1093/nar/gky901. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Marchler-Bauer A., Anderson J.B., Derbyshire M.K., DeWeese-Scott C., Gonzales N.R., Gwadz M., et al. CDD: a conserved domain database for interactive domain family analysis. Nucl. Acids Res. 2007;35:237–240. doi: 10.1093/nar/gkl951. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 82.Letunic I., Bork P. Interactive tree of life (ITOL) v4: recent updates and new developments. Nucl. Acids Res. 2019;47:W256–W259. doi: 10.1093/nar/gkz239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Miller, M. A, Pfeiffer, W, and Schwartz, T. Creating the CIPRES science gateway for inference of large phylogenetic trees. In 2010 Gateway Computing Environments Workshop, GCE 2010; New Orleans, LA, 2010; pp 1–8.
  • 84.Darriba D., Taboada G.L., Doallo R., Posada D. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics. 2011;27:1164–1165. doi: 10.1093/bioinformatics/btr088. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Stamatakis A. RAxML Version 8: A tool for phylogenetic analysis and post- analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. doi: 10.1093/bioinformatics/btu033. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information
mmc1.pdf (4MB, pdf)
Table S3
mmc2.xlsx (138.8KB, xlsx)

Data Availability Statement

Relevant alignment files, phylogenetic tree files, webtool documentation file, example tutorials, sequence files, and other supporting information can be found on the corresponding OSF project page: https://osf.io/uzhcp/. The code used to construct version two of the Natural Product Domain Seeker website can be found on the corresponding Github repository: https://github.com/spodell/NaPDoS2_website. All accession information and dataset references can be found in Table S3 (separate Microsoft Excel file).


Articles from The Journal of Biological Chemistry are provided here courtesy of American Society for Biochemistry and Molecular Biology

RESOURCES