SUMMARY
Plant natural products or specialized metabolites play a vital role in drug discovery and development, with many clinically important derivatives such as the anticancer drugs topotecan (derived from the natural alkaloid camptothecin) and etoposide (derived from the natural polyphenol podophyllotoxin). Remarkable advances in understanding plant natural product metabolism have been achieved at an unprecedented pace over the past 15 years. The integration of high‐throughput technologies in genomics, transcriptomics, and metabolomics has generated vast datasets that provide a more comprehensive understanding of plant metabolism. Additionally, advances in computational tools, machine learning, and data analytics have played a crucial role in processing and interpreting the massive amounts of newly available data, enabling researchers to uncover intricate regulatory networks and identify key components of biosynthetic pathways. This review navigates the evolving landscape of plant biosynthetic pathway elucidation accelerated by innovative multidisciplinary strategies that capitalize on big data. We highlight recent advances in plant‐specialized biosynthesis that illustrate how big data are increasingly leveraged to unravel the complexities of plant metabolism.
Keywords: plant, natural product biosynthesis, big data, artificial intelligence, multiomics
Significance Statement
Decoding complex plant metabolic pathways remains a significant scientific challenge. This review highlights how the integration of increasingly abundant big data from multi‐omics technologies with advanced computational approaches, including machine learning and AI‐powered tools, is transforming biosynthetic pathway discovery and enhancing our understanding of plant chemical diversity, with broad implications in innovative bioproduction.
INTRODUCTION
Plants produce an enormous reservoir of chemicals called secondary metabolites or natural products in a lineage‐specialized manner. Also known as specialized metabolites, these chemicals are thought to play important eco‐physiological roles in plant adaptation and have been known for their therapeutic bioactivities (Pichersky & Lewinsohn, 2011). Beyond the vast array of metabolites produced by plants and their uneven distribution across phylogenetic taxa, plant metabolic diversity is further exemplified by variations in metabolism within different plant organs and developmental stages, illustrated by recent high‐resolution analyses at finer scales, such as specific cell types, individual cells, or even organelles (Li et al., 2025; Li, Wood, et al., 2023; Mehta et al., 2024; Vu et al., 2024).
Such diversity and complexity convolute efforts in untangling plant metabolic pathways and devising biosynthetic platforms of valuable plant natural products. The comprehension and practical application of the biosynthesis of high‐value phytochemicals, including pharmaceutical natural products, have been impeded by their low abundance in plants, complex molecular structures, and intricate biosynthetic steps. Over the past half‐century, plant biochemists have dedicated extensive efforts to unravel the organization and mechanisms of plant metabolism and to find alternative methods to produce natural products at feasible yields, such as purification from plant cell cultures and reconstituting biosynthetic pathways in heterologous hosts. Outstanding examples include the complete elucidation of the biosynthetic route toward noscapine (Dang et al., 2015; Winzer et al., 2012), morphine (Farrow et al., 2015; Guo et al., 2018), vinblastine (Caputi et al., 2018; Qu et al., 2018), colchicine (Nett et al., 2020), strychnine (Hong et al., 2022), saponin adjuvants (Martin et al., 2024; Reed et al., 2023), and limonoids (Chuang et al., 2023; De La Peña et al., 2023). The majority of these discoveries have been made in the past decade thanks to the increasingly abundant plant omics data and powerful computational tools. These large data‐driven findings shed light on the intricate nature of plant metabolism, its complex networks of biochemical reactions of numerous enzymes and metabolites, and the unique genomic organization of its genetic elements (Figure 1). Herein, we discuss the integration of multi‐omics analyses and advanced large data analytics approaches in deciphering the spatial and temporal dynamics of plant metabolism, as well as the prospect of an artificial intelligence (AI)‐powered future of natural products biosynthesis innovation.
Figure 1.
Multi‐omics methods for biosynthetic gene discovery.
Homology search (OrthoFinder), co‐expression analysis (SOM, hierarchical clustering), and genome cluster and synteny analysis (plantiSMASH) (middle panel) are the main approaches to mining omics data (left panel), and they have accelerated plant natural product pathway reconstruction over the past two decades (right panel). Figure created with icons from BioRender.com under academic license (Agreement no.: MH28BH5EQV).
OMICS‐GUIDED BIOSYNTHESIS ELUCIDATION
A biosynthetic pathway can be conceptually viewed as a puzzle assembled from individual pieces representing metabolic steps, pathways, and networks. The arrangement of these puzzle pieces mostly reflects the logical principles of enzymatic reactions and can, in principle, be rationally predicted based on the chemical structures/scaffolds of the metabolites. Typically, enzyme identification commences with chemical intuition‐informed prediction by considering plausible chemical transformations and enzymes known to catalyze similar reactions. For example, studies of the strychnine biosynthetic pathway in Strychnos nux‐vomica used the previously elucidated steps of geissoschizine oxidation as the starting point for biosynthetic pathway discovery. Based on the chemical logic that the pathway includes decarboxylation, oxidation, and reduction steps through the known strychnos alkaloid norfluorocurarine, candidate enzymes were effectively selected, and the whole pathway was successfully reconstituted (Hong et al., 2022).
Traditionally, enzyme and pathway discovery in specialized metabolism employed classical biochemical methods, including activity assays of crude protein extracts, isotope labeling of metabolites, synthetic oligodeoxynucleotide hybridization probes, homology‐based cloning, and expressed sequence tags library sequencing. Recent examples include the elucidation of the long‐elusive benzaldehyde synthase. Using crude protein extract from Petunia hybrida, researchers showed that two distinct, unrelated subunits of the benzaldehyde synthase were required for its activity (Huang et al., 2022). In addition, the once common radioisotope‐labeled feeding approach was recently employed in elucidating the pathway of galanthamine biosynthesis with great success (Mehta et al., 2024).
Although the aforementioned classical and somewhat labor‐intensive methodologies remain valuable in gene discovery, the emergence of next‐generation sequencing (NGS) in the late 2000s has revolutionized the pathway discovery landscape, providing comprehensive omics datasets as indispensable resources for the efficient identification of functional genes involved in plant‐specialized metabolism. The vast majority of pathway discovery studies have since relied on the large datasets generated by these techniques. For instance, relevant plant tissues, organs, or cells are collected to extract the RNA and DNA materials to construct the transcriptomic and genomic profiles. In addition, either untargeted or targeted metabolomics analyses are carried out from the same tissues/organs/cells to establish the transcriptome‐metabolome correlation network. Biosynthetic pathway elucidation can subsequently start with a robust bioinformatic analysis to identify candidate genes/enzymes or predict the biosynthetic pathway. A set of candidate genes for any single step can then be selected using bioinformatics tools based on various features such as homology to known genes/enzymes catalyzing the predicted reactions (homology‐based screening, BLAST search), expression profiles in relation to previously elucidated genes in the pathway (co‐expression analysis, hierarchical clustering, differential expression analysis), their location in the genome (synteny analysis, cluster finder), etc. The candidate genes are further cloned into expression vectors and transformed into heterologous hosts (e.g., Escherichia coli bacteria, Saccharomyces cerevisiae yeast, or Nicotiana benthamiana tobacco) to functionally validate the recombinant proteins (Dang et al., 2012; Nguyen et al., 2021). Over the past decade, the Agrobacterium‐mediated transient expression in N. benthamiana technique has accelerated the functional characterization of plant biosynthetic enzymes. 
Compared to heterologous expression in E. coli or yeast, this approach allows for rapid and simultaneous co‐expression of multiple metabolic genes with significantly less effort in engineering and optimizing the cloning platform (Kwan et al., 2023).
Following biochemical characterization, the putative gene can be silenced by virus‐induced gene silencing (VIGS) or RNA interference (RNAi) techniques to confirm its function and/or establish its physiological relevance in planta (Dang et al., 2012). Advances in omics technologies now permit the generation of highly contiguous genome assemblies, detection of transcripts and metabolites at the level of single cells, and high‐resolution determination of gene regulatory features with rapidly decreasing cost. The enormous volume and intricacy of genomics, transcriptomics, and metabolomics data require robust tools for data management (acquisition, processing, and storage) and mining (data visualization, co‐regulation, and correlation). For instance, based on sequenced data, a plant genome ranges from 61 million to 160 billion base pairs and requires effective tools for extracting meaningful insights from the resulting large, complex, and high‐dimensional datasets (Fernández et al., 2024; Zedek et al., 2024). The storage and accessibility of big data to be used for AI training are emerging concerns. Most publicly available datasets fail to have appropriate metadata, standardized formatting, or transparent links for data access. The FAIR (Findability, Accessibility, Interoperability, and Reusability) data principles are critical to making data sharing more efficient and helping original contributors receive proper citation and recognition when their datasets are reused. These would not only facilitate reproducibility and ethical reuse but also provide equal access to data‐driven innovation, especially given that AI tools now depend on large, well‐annotated datasets for data training (Hafner et al., 2025).
Recent works have demonstrated diverse and innovative methods of harnessing the wealth of genomics and metabolomics data to advance our understanding of plant metabolic pathways. While these approaches remain centered around three core categories, namely homology‐based identification, genomic proximity, and expression‐based identification, they have been powered by increasingly advanced machine learning (ML) and data mining techniques (Table 1).
Table 1.
Novel multi‐omics tools in the studies of plant specialized metabolism
| Type of analysis | Tools | Examples of elucidated pathways | Notable references |
|---|---|---|---|
| Co‐expression analysis | Pearson correlation | Etoposide | Lau and Sattely (2015) |
| | | Colchicine | Nett et al. (2020) |
| | | Strychnine | Hong et al. (2022) |
| | | Triterpene | De La Peña et al. (2023) |
| | Self‐organizing maps | Vinblastine | Payne et al. (2017) |
| | | Ajmaline | Dang et al. (2017, 2018) |
| | | Camptothecin | Nguyen et al. (2021) |
| | | Triterpene | Chuang et al. (2023) |
| | | Indole alkaloids | Wang, Xiao, et al. (2022) |
| | Supervised machine learning | Tropane alkaloids | Srinivasan and Smolke (2021) |
| | | Monoterpene indole alkaloid | Stander et al. (2023) |
| | | Benzylisoquinoline alkaloids | Vavricka et al. (2022) |
| Homology‐based gene discovery | OrthoFinder (Emms & Kelly, 2019) | Spirooxindole alkaloids | Nguyen et al. (2023) |
| | | Benzylisoquinoline alkaloids | Carr et al. (2024) |
| | KIPEs | Flavonoid biosynthesis | Pucker et al. (2020); Rempel et al. (2023) |
| Genomic analysis | plantiSMASH (Kautsar et al., 2017) | Triterpenes | De La Peña et al. (2023); Reed et al. (2023) |
| | | Avenacin | Li et al. (2021) |
| | | Noscapine | Winzer et al. (2012) |
| | Synteny and pangenome analysis | Benzylisoquinoline alkaloids | Li et al. (2020) |
| | | Thalianol | Liu, Cheema, et al. (2020) |
| | | Tomato flavor | Gao et al. (2019) |
| | | Monoterpene biosynthesis | Liu et al. (2023) |
| | | Monoterpene indole alkaloids | Cuello et al. (2024) |
| | | Triterpenoids | Hodgson et al. (2019) |
| | | Gene evolution in grass crops | Lovell et al. (2022) |
| Multiomics | Phylogeny chemoinformatics | Aucubin | Rodríguez‐López et al. (2022) |
| | MEANtools | α‐Tomatine and falcarinol | Singh et al. (2024) |
| | GTC (Graph Transformer and CNN) | Multiple classes of pathways predicted | Bao et al. (2023) |
| | ML | Genome information | Toubiana et al. (2019) |
| | AgroNT | | Mendoza‐Revilla et al. (2024) |
A NEW TAKE ON CO‐EXPRESSION ANALYSIS
One of the most prevalent approaches in pathway discovery is correlation analysis. Arguably the backbone of the field, correlation analysis uses statistical tools to identify genes that share similar expression patterns. It is most often applied to transcriptomic data and can also be adapted to other data types, such as proteomic data. The core assumption of the method is that genes and enzymes involved in a pathway share a similar expression pattern. A commonly used statistical tool for correlation analysis is the Pearson correlation, which compares expression patterns across different conditions and returns a correlation coefficient between –1 and +1, where +1 indicates a perfect positive linear correlation and –1 a perfect negative one. Once a biosynthetic gene is elucidated or a candidate gene is identified, it can then be used as a bait to identify others that share its expression pattern. Pearson correlation was utilized effectively in untangling metabolic genes in the biosynthesis of etoposide aglycon and lycopodium alkaloids (Lau & Sattely, 2015; Nett et al., 2021, 2023). Co‐expression analysis also contributed to the complete elucidation of the colchicine and strychnine pathways, as well as the identification of 22 enzymes involved in limonoid biosynthesis (De La Peña et al., 2023; Hong et al., 2022; Nett et al., 2020). In the past few years, correlation analysis has moved beyond the organ/tissue level to the sub‐tissue (Mehta et al., 2024) and single‐cell levels. An elegant example is a single‐cell multi‐omics study of Catharanthus roseus (Li, Wood, et al., 2023), which spatially identified both metabolites and genes in leaves and then correlated the expression patterns to identify genes involved in the vinblastine pathway. This combination of omics led to the discovery of a previously unknown enzyme responsible for producing the bisindole alkaloid anhydrovinblastine (Li, Wood, et al., 2023) and a new cell‐type‐specific transcription factor (Li et al., 2025). 
Single‐cell sequencing not only reveals more intricate details of previously studied biosyntheses, such as in the case of vinblastine, but also offers a promising avenue for elucidating elusive pathways, especially those that restrict specialized metabolites to specific cell types. Although Pearson correlation is a rapid and useful approach for plant transcriptomic co‐expression analysis, a key limitation is that the threshold used to decide whether a set of genes is strongly correlated varies across datasets, typically ranging from 0.6 to 0.8. This inconsistency, coupled with batch effects and dataset‐specific noise, can compromise reproducibility. To mitigate batch effects and this threshold variability, Spearman correlation and mutual rank (MR) offer robust alternatives for large‐scale co‐expression network analyses. Spearman correlation is computed on ranked expression values, which reduces sensitivity to outliers and captures monotonic relationships (linear or non‐linear) across samples (Pividori et al., 2019). By ranking gene pairs based on the geometric mean of their reciprocal Pearson correlation ranks, mutual rank reduces the influence of arbitrary thresholds and enhances network stability (Obayashi & Kinoshita, 2009; Wisecaver et al., 2017).
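The three measures discussed above can be sketched in a few lines of Python with NumPy. This is a minimal illustration, not the implementation used in any of the cited studies: the expression matrix, the gene labels, and the helper names (`rank_rows`, `mutual_rank`) are hypothetical, and ties in the ranking are broken by position rather than averaged as in standard Spearman correlation.

```python
import numpy as np

def rank_rows(x):
    """Rank values within each row (1 = smallest).
    Simplification: ties are broken by position, not averaged."""
    order = np.argsort(x, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(x.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, x.shape[1] + 1)
    return ranks

def pearson_matrix(expr):
    """Gene-by-gene Pearson correlation (rows = genes, columns = samples)."""
    return np.corrcoef(expr)

def spearman_matrix(expr):
    """Spearman correlation, computed as Pearson correlation of per-gene ranks."""
    return np.corrcoef(rank_rows(expr).astype(float))

def mutual_rank(corr):
    """MR(a, b) = sqrt(rank of b in a's list * rank of a in b's list).
    Partners are ranked by descending correlation; lower MR = stronger link."""
    c = np.array(corr, dtype=float)
    np.fill_diagonal(c, np.inf)      # each gene ranks itself first ...
    ranks = rank_rows(-c) - 1        # ... so its best partner gets rank 1
    return np.sqrt(ranks * ranks.T)

# Hypothetical expression values across five samples (illustrative only).
expr = np.array([
    [1, 2, 3, 4, 5],     # bait gene
    [2, 4, 6, 8, 10],    # gene_1: linear with the bait
    [1, 4, 9, 16, 25],   # gene_2: monotonic but non-linear
    [5, 3, 4, 1, 2],     # gene_3: roughly opposite trend
], dtype=float)

pearson = pearson_matrix(expr)
spearman = spearman_matrix(expr)
mr = mutual_rank(pearson)
```

In this toy matrix, gene_2 follows the bait monotonically but not linearly, so its Spearman coefficient with the bait is higher than its Pearson coefficient, while the mutual rank expresses the bait and gene_1 relationship as a rank rather than as a value to be compared against a fixed cutoff.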
In addition to Pearson correlation, ML algorithms, such as self‐organizing maps (SOM) or feed‐forward networks, have also been explored with encouraging results (Table 1). Unsupervised ML analysis methods are being explored to investigate various biosynthetic pathways. For example, SOM, a form of unsupervised ML that groups data points based on how similar their expression patterns are, has enabled the discovery of many genes in monoterpenoid indole alkaloid (MIA) biosynthesis over the past decade. Successful examples include a nitrate/peptide family (NPF) transporter in C. roseus (Payne et al., 2017), a sarpagan bridge enzyme and vinorine hydroxylase in the ajmaline pathway in Rauvolfia serpentina (Dang et al., 2017, 2018), and camptothecin‐10‐hydroxylase in Camptotheca acuminata (Nguyen et al., 2021). SOM was also recently used in identifying a cytochrome P450 and two isomerases involved in limonoid and quassinoid biosynthesis from Ailanthus altissima (Chuang et al., 2023), and a new cytochrome P450 in the benzophenone and xanthone biosynthetic pathways (Wang, Malaco Morotti, et al., 2022; Wang, Xiao, et al., 2022). When prior knowledge of the pathway is unavailable, supervised ML, which relies on algorithms trained from labeled data to make predictions or decisions on unknown steps of the pathways, could be useful. Artificial neural networks or feed‐forward neural networks have been trained to identify genes involved in the biosynthesis and transportation of specialized metabolites. These methods train the algorithms with the expression patterns of known biosynthetic genes (e.g., those encoding for enzymes previously shown to act on the substrates in the pathway) in contrast with those of unrelated genes (e.g., genes encoding for enzymes with no activity against pathway substrates). Among the first reports of the use of a feed‐forward neural network is the identification of transporters of tropane alkaloids in Atropa belladonna (Srinivasan & Smolke, 2021). 
These neural networks were further utilized to identify several genes involved in yohimbane biosynthesis in Rauvolfia tetraphylla (Stander et al., 2023). In another example, an artificial neural network trained on the 23 genes encoding upstream steps leading to strictosidine allowed the successful prediction of the 22 known downstream steps in MIA biosynthesis in C. roseus, demonstrating the power of supervised ML methods for pathway exploration beyond tropane alkaloids (Dugé De Bernonville et al., 2022). To identify the uncharacterized branches in Papaver somniferum alkaloid biosynthesis, support vector machines, a type of supervised ML algorithm, were trained to automate the selection of target sequences from over 100 candidates present throughout highly duplicated carboxy‐lyase and oxidase gene families. This led to the discovery of an aromatic acetaldehyde synthase and a phenyl‐pyruvate decarboxylase, which play crucial roles in initiating the benzylisoquinoline alkaloid (BIA) biosynthesis pathway (Vavricka et al., 2022). These examples demonstrate that co‐expression studies, whether in the form of differential gene expression, gene co‐expression networks, or hierarchical clustering, in combination with supervised or unsupervised ML methods, can expedite the elucidation of metabolic pathways (Table 1).
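The supervised set-up described above, in which expression profiles of known pathway genes are contrasted with those of unrelated genes, can be illustrated with a deliberately simple stand-in. The sketch below uses a nearest-centroid classifier rather than the feed-forward networks or support vector machines of the cited studies; all profiles, labels, and function names are hypothetical.

```python
import numpy as np

def zscore_rows(x):
    """Scale each gene profile to zero mean and unit variance across samples."""
    x = np.asarray(x, dtype=float)
    mu = x.mean(axis=1, keepdims=True)
    sd = x.std(axis=1, keepdims=True)
    sd[sd == 0] = 1.0  # flat profiles become all zeros instead of dividing by 0
    return (x - mu) / sd

def fit_centroids(profiles, labels):
    """Average the scaled profiles of each labeled class."""
    z = zscore_rows(profiles)
    labels = np.asarray(labels)
    return {str(c): z[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(profiles, centroids):
    """Assign each candidate to the class with the nearest centroid."""
    z = zscore_rows(profiles)
    classes = sorted(centroids)
    dists = np.stack(
        [np.linalg.norm(z - centroids[c], axis=1) for c in classes], axis=1
    )
    return [classes[i] for i in dists.argmin(axis=1)]

# Hypothetical training set; columns = root, stem, leaf, flower.
train_profiles = [
    [9, 2, 1, 1],  # known pathway gene
    [8, 3, 1, 2],  # known pathway gene
    [1, 2, 9, 8],  # unrelated gene
    [2, 1, 8, 9],  # unrelated gene
]
train_labels = ["pathway", "pathway", "unrelated", "unrelated"]

centroids = fit_centroids(train_profiles, train_labels)
candidates = [[7, 2, 2, 1], [1, 1, 7, 9]]
predictions = predict(candidates, centroids)
```

The first candidate, whose expression peaks in the same organ as the known pathway genes, is labeled "pathway", and the second is labeled "unrelated". A neural network or support vector machine replaces the centroid step with a learned decision boundary, but the labeled training data are organized in the same way.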
AN EXPANSION ON HIGH‐THROUGHPUT HOMOLOGY‐BASED DISCOVERY
Homology‐based discovery is one of the most utilized tools for identifying candidate genes for pathway elucidation. As many plant‐specialized metabolites share a core scaffold, an assumption is often made that their biosynthetic pathways share similar types of reactions, catalyzed by enzyme variants with similar sequences. The simplest method for identifying candidate genes by homology is the Basic Local Alignment Search Tool (BLAST), which searches for the sequence of an enzyme with known function against a database containing the genetic sequences of the plant(s) of interest. While BLAST remains useful in small‐scale analyses, newer high‐throughput homology‐based search methods have been developed to harness the publicly available transcriptome and genomic data with promising potential in new pathway elucidation. For instance, OrthoFinder, initially developed for comparative genomics, has emerged as a viable tool for identifying genes/enzymes of similar sequences and, hypothetically, similar functions, across different plant species (Emms & Kelly, 2019). A typical OrthoFinder pipeline first translates coding sequences into the amino acid sequences and then generates orthologous groups (orthogroups) based on the homology of the sequences across different species. In a recent study, OrthoFinder was used to identify the cytochrome P450 catalyzing the long‐elusive formation of spirooxindole alkaloids in Mitragyna speciosa. The search was based on previous knowledge that cytochrome P450 enzymes from the CYP71 enzyme family catalyze various oxidative rearrangements of heteroyohimbane and secoyohimbane scaffolds to new MIA scaffolds in various plant species (Nguyen et al., 2023). Examples include the formation of the strychnos scaffold from geissoschizine by CYP71D1V1 in C. roseus (Tatsis et al., 2017), sarpagan scaffold by CYP71AY4 in R. 
serpentina and CYP71AY5 in Gelsemium sempervirens (Dang et al., 2018), and akuammilan scaffold by CYP71AY1 in Alstonia scholaris (Wang, Malaco Morotti, et al., 2022; Wang, Xiao, et al., 2022). OrthoFinder analysis of more than 10 plant species revealed a CYP71 orthogroup that was specific to the spirooxindole‐producing plant M. speciosa and absent from spirooxindole‐free species; one enzyme in this orthogroup was found to be the long‐sought spirooxindole synthase (Nguyen et al., 2023). OrthoFinder is also commonly used to detect the expansion and contraction of gene families across multiple species, providing insight into the genetic elements and evolution underlying plant metabolic diversity (Carr et al., 2024; Chen et al., 2020; Kang et al., 2021; Rai et al., 2021; Stander et al., 2023; Yang et al., 2019). OrthoFinder or similar orthologue inference‐based analyses can leverage large transcriptomic or genomic datasets to find variants of the same enzymes in different species for evolution studies or for biocatalyst mining. Another similar and effective method for gene discovery, ortholog communities (OCs) analysis, provides a robust approach for discovering conserved gene modules involved in specialized metabolism across multiple species. In a recent study on strigolactone biosynthesis in rice, OCs were constructed by integrating transcriptomes from Arabidopsis thaliana, tomato, and rice under phosphorus deficiency conditions. The orthology mapping was conducted by linking co‐expression networks of highly co‐expressed genes in individual species (Pearson correlation coefficient, PCC > 0.70). Through this analysis, OsCYP706C2 was functionally characterized as a key enzyme in the biosynthesis of strigolactone (Li et al., 2024). By prioritizing genes that exhibit both strong co‐expression and evolutionary conservation, this cross‐species strategy effectively narrows down large datasets to a manageable set of high‐confidence candidates. 
In addition, the Tool to Infer Orthologs from Genome Alignment (TOGA) is designed to identify orthologous genes between species, enabling the transfer of functional annotations and the detection of gene losses and duplications (Kirilenko et al., 2023). While OrthoFinder uses protein sequences to identify orthologous genes, TOGA detects orthologues by using partial alignments of introns and intergenic regions of genomic data. With such functions, TOGA represents a new generation of orthologous analysis tools with implications beyond pathway discovery in plant science. Firstly, large alignment data from TOGA analyses enable the identification of biosynthetic genes among plants in the same family or plants sharing similar metabolic profiles. Secondly, TOGA‐powered phylogenetic analyses of orthologous genes from various species can illustrate genomic evolutionary events such as gene loss and duplication. Thirdly, TOGA provides transferable gene annotations for the new genome sequences using well‐studied genomes as references. TOGA presents certain limitations, primarily due to its reliance on whole‐genome alignments for orthology inference, which imposes substantial computational demands and typically necessitates access to high‐performance computing infrastructure. Moreover, as TOGA infers orthologues based on alignment quality and syntenic context, its performance can be adversely affected by low‐contiguity assemblies, poorly annotated genome sequences, or substantial sequence divergence among input genome sequences. While TOGA has been used in only a handful of non‐plant studies in the past 2 years, it is expected to become a common tool in future biosynthetic pathway studies.
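The presence/absence reasoning used in the spirooxindole example above (an orthogroup found in the producing species but not in non‑producers) can be expressed as a simple filter over an orthogroup gene‑count table. The sketch below is an assumption‑laden illustration: the orthogroup identifiers, species labels, and counts are invented, and the in‑memory dictionary stands in for whatever table an orthology tool actually produces.

```python
def lineage_specific_orthogroups(counts, producers, non_producers):
    """Return orthogroups with at least one gene in every producer species
    and no genes in any non-producer species.

    counts: {orthogroup_id: {species: number_of_genes}}
    """
    hits = []
    for orthogroup, per_species in counts.items():
        in_all_producers = all(per_species.get(s, 0) > 0 for s in producers)
        in_no_non_producer = all(per_species.get(s, 0) == 0 for s in non_producers)
        if in_all_producers and in_no_non_producer:
            hits.append(orthogroup)
    return sorted(hits)

# Hypothetical gene counts per orthogroup and species (illustrative only).
counts = {
    "OG_A": {"producer_1": 3, "non_producer_1": 0, "non_producer_2": 0},
    "OG_B": {"producer_1": 5, "non_producer_1": 4, "non_producer_2": 6},
    "OG_C": {"producer_1": 0, "non_producer_1": 2, "non_producer_2": 1},
    "OG_D": {"producer_1": 1},  # species missing from a row count as zero
}

candidates = lineage_specific_orthogroups(
    counts,
    producers=["producer_1"],
    non_producers=["non_producer_1", "non_producer_2"],
)
```

Here `candidates` contains OG_A and OG_D. In practice such a filter only shortlists orthogroups; the genes they contain still require the functional validation described earlier.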
Functional annotation of genes and enzymes in natural product biosynthesis has also benefited from recent advances in AI and ML techniques. As specialized metabolic pathways in plants are relatively under‐investigated compared to those in microbes and animals (Pucker, 2024), BLAST search and similar tools can provide unreliable results for the identification of orthologous genes as only short lengths of sequences are compared. Other tools, such as InterProScan5, look for conserved regions to provide annotation based on databases of characterized enzymes (Jones et al., 2014). However, like BLAST‐based approaches, it focuses on local alignments and may overlook full‐length sequence context. In contrast, global alignment tools such as MAFFT (Katoh & Standley, 2013) and MUSCLE5 (Edgar, 2022) align entire sequences to capture broader evolutionary relationships, though they might be slower and less scalable for large datasets. Recently, new AI/ML‐powered tools have been developed. A good example is CLEAN (Contrastive Learning–Enabled Enzyme Annotation), which uses a contrastive learning framework to assign the Enzyme Commission (EC) numbers to understudied enzyme classes and help correct mis‐annotations (Yu et al., 2023). CLEAN has been used to identify new halogenases and is gaining increasing interest in studies focused on discovering new unique enzymes and biocatalysts (Reed & Seebeck, 2024; Yuan et al., 2023). Other homology‐based methods, such as KIPEs (Knowledge‐based Identification of Pathway Enzymes) and the MYB_annotator/bHLH_annotator, could also be useful for the identification of enzymes involved in the biosynthesis of other classes of natural products. KIPEs perform MAFFT‐based similarity searches using curated bait sequences against the sequences from the plant of interest (Pucker et al., 2020). The results are then further refined by identifying conserved residues and domains. 
Originally developed for flavonoid metabolism, KIPEs can be applied to a variety of biosynthetic pathways if suitable bait sequences are available. Similar methods have also been developed by the same researchers for the identification of MYB and bHLH transcription factors, both of which have been shown to be involved in the regulation of plant phenylpropanoid metabolism (Pucker, 2022; Thoben & Pucker, 2023). These flexible tools herald the future of robust and user‐friendly AI/ML techniques capable of providing high‐quality and specific enzyme identification.
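The refinement step mentioned above, checking whether a candidate retains functionally important residues of a characterized reference enzyme, can be sketched as follows. This is not the KIPEs implementation; the aligned sequences, the key positions, and the function name are hypothetical, and the alignment is assumed to have been produced beforehand by an external aligner.

```python
def conserved_residue_report(aligned_ref, aligned_cand, key_positions):
    """Compare a candidate with a reference at selected reference positions.

    aligned_ref, aligned_cand: equal-length aligned sequences ('-' = gap).
    key_positions: 1-based positions in the ungapped reference sequence.
    Returns {position: (reference_residue, candidate_residue, is_conserved)}.
    """
    if len(aligned_ref) != len(aligned_cand):
        raise ValueError("aligned sequences must have the same length")
    report = {}
    ref_pos = 0
    for ref_aa, cand_aa in zip(aligned_ref, aligned_cand):
        if ref_aa == "-":
            continue  # insertion in the candidate; no reference position here
        ref_pos += 1
        if ref_pos in key_positions:
            report[ref_pos] = (ref_aa, cand_aa, ref_aa == cand_aa)
    return report

# Hypothetical alignment; reference positions 3 and 4 are treated as essential.
reference = "MK-HDLG"
candidate = "MKAHELG"
report = conserved_residue_report(reference, candidate, key_positions={3, 4})
all_conserved = all(ok for _, _, ok in report.values())
```

In this toy case the residue at reference position 3 is retained but the one at position 4 is substituted, so the candidate would be flagged for closer inspection rather than accepted on sequence similarity alone.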
BIG DATA‐ENABLED NEW GENOMICS FRONTIER
Genes involved in the same metabolic process are often clustered together in the genomes of bacteria and fungi (Jensen, 2016), facilitating pathway discovery. In plants, there are far fewer examples of biosynthetic pathways for which encoding genes are physically co‐located in the genome. Nevertheless, since the first discovery of the benzoxazinone DIMBOA biosynthetic gene cluster from Zea mays (Frey et al., 1997), evidence of gene clusters in plant natural products biosynthesis has steadily increased with over 40 examples identified to date covering molecules from various classes of specialized metabolites, including terpenoids, cyanogenic glycosides, alkaloids, fatty acid derivatives, and phenylpropanoids (Ji et al., 2024). Such discoveries were enabled by the advances of third‐generation long‐read sequencing platforms, such as PacBio single‐molecule real‐time (SMRT) sequencing (Ferrarini et al., 2013) and Oxford Nanopore technology (Schmidt et al., 2017), which have significantly improved the genome assembly contiguity by overcoming the high fragmentation of short reads from NGS (Schadt et al., 2010). Complementary scaffolding technologies, such as chromosome‐conformation capture (Hi‐C), enable the generation of chromosome‐scale assemblies, and thereby the identification of physically clustered biosynthetic genes (Rai et al., 2021). An emerging technique, Pore‐C, combines the Nanopore sequencing technology with the chromatin conformation capture, which offers high‐resolution genome architecture crucial for chromosome‐scale genome assembly (Jo et al., 2024). One well‐known gene cluster example is found in the noscapine biosynthetic pathway in opium poppy, with 10 clustered genes encoding for cytochrome P450s, O‐methyltransferases, an esterase, and an acetyltransferase (Winzer et al., 2012). 
The physical clustering of biosynthetic pathway genes has inspired the development of computational methods for mining sequenced plant genomes to identify new candidate biosynthetic gene clusters (BGCs). Tools such as plantiSMASH (Kautsar et al., 2017) and PhytoClust (Töpfer et al., 2017) (Figure 2) are key examples of such algorithms and have been successfully used in the identification of many gene clusters, such as the QS17 (Reed et al., 2023) and QS21 clusters (Martin et al., 2024). Such methods are readily applicable to BGCs containing genes encoding substantial numbers of consecutive steps in a pathway, such as the 10‐gene noscapine biosynthetic pathway cluster in opium poppy (Winzer et al., 2012) or the 12‐gene avenacin biosynthetic pathway cluster in oats (Li et al., 2021). The identification of “loose” or more fragmented clusters would require alternative approaches that involve targeting a specific gene family responsible for key scaffold‐generating enzymes (rather than tailoring enzymes), such as oxidosqualene cyclases (OSCs), across closely related sequenced genomes. Statistical methods are then applied to investigate the genomic regions surrounding predicted OSC genes, helping to identify the enrichment of genes associated with different classes of tailoring enzymes in the triterpenoid biosynthetic pathway in Brassicaceae (Liu, Suarez Duran, et al., 2020). Furthermore, Python scripts have been developed to identify genomic regions involved in specialized metabolism in plants that have many characterized enzymes known to be involved in the production of MIAs, such as R. tetraphylla and C. roseus (Li, Wood, et al., 2023; Stander et al., 2023). DeepBGC (Deep Biosynthetic Gene Cluster) is another tool, which uses ML to find patterns in genomic data that may lead to BGC discovery (Hannigan et al., 2019). Rather than relying solely on the fixed rules used by traditional BGC detection methods, its algorithm learns patterns from the data, allowing more nuanced predictions. 
Another example is PRISM4, which predicts microbial BGC and includes ML to provide chemical intuition into what the chemical structure of the predicted product will be (Skinnider et al., 2020). While neither of these programs was designed directly for use with plant genomes, they provide an example of where the field can evolve.
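The general idea behind rule‑based cluster mining can be conveyed with a sliding window over the gene order of a chromosome. The following is a simplified heuristic for illustration only and does not reproduce the algorithm of plantiSMASH, PhytoClust, or DeepBGC; the window size, thresholds, and enzyme‑family labels are assumptions.

```python
def find_candidate_clusters(gene_families, window=6, min_genes=3, min_families=2):
    """Flag runs of neighbouring genes enriched in biosynthetic enzyme families.

    gene_families: list ordered by chromosomal position; each entry is an
    enzyme-family label (e.g. "CYP") or None for a non-biosynthetic gene.
    Returns (start_index, end_index) tuples, inclusive.
    """
    n = len(gene_families)
    flagged = [False] * n
    for start in range(0, max(n - window, 0) + 1):
        idx = [i for i in range(start, min(start + window, n))
               if gene_families[i] is not None]
        families = {gene_families[i] for i in idx}
        # Require several biosynthetic genes from more than one family, so that
        # a tandem array of a single family is not reported as a cluster.
        if len(idx) >= min_genes and len(families) >= min_families:
            for i in range(idx[0], idx[-1] + 1):
                flagged[i] = True
    # Merge flagged positions into contiguous intervals.
    clusters, i = [], 0
    while i < n:
        if flagged[i]:
            j = i
            while j + 1 < n and flagged[j + 1]:
                j += 1
            clusters.append((i, j))
            i = j + 1
        else:
            i += 1
    return clusters

# Hypothetical gene order on one chromosome (illustrative only).
chromosome = [None, None, "CYP", "OMT", None, "CYP", "AT",
              None, None, None, None, "CYP", None, None, None, None]
clusters = find_candidate_clusters(chromosome)
```

For this toy chromosome the function reports a single candidate region spanning genes 2 to 6, while the isolated cytochrome P450 at position 11 is ignored. The "loose" clusters discussed above would be missed by such fixed thresholds, which is part of the motivation for the family‑anchored and ML‑based approaches.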
Figure 2.
Representation of the available tools that leverage genomic data from multiple plant species to identify syntenic elements, biosynthetic gene clusters, and evolutionary events between genomes. Figure created with icons from BioRender.com under academic license (Agreement no.: MH28BH5EQV).
The increasing availability of sequenced genomes across plant lineages offers an opportunity to develop effective complementary tools for pathway discovery and evolution studies. Synteny analysis, especially between plants from the same family, reveals valuable information from both exonic and nonexonic regions. However, genome structure, with its inherent complexities of gene movement, inversions, duplications, deletions, and mutations, complicates this task. Synteny analysis tools such as MCScanX (Multiple Collinearity Scan toolkit X) and TOGA identify genes that are similar enough to be annotated similarly between species (Kirilenko et al., 2023; Wang et al., 2012) (Figure 2). This comparative annotation and parallel analysis can provide insights into enzyme function loss and identify duplication events driving metabolite diversity. For instance, a JCVI (J. Craig Venter Institute) pipeline based on MCScanX was recently used to discover a monoterpene biosynthetic gene cluster in Schizonepeta tenuifolia (Liu et al., 2023). The syntenic analysis found that this cluster evolved from a region associated with the diversity of monoterpenes in the Lamiaceae. Syntenic analysis was also used to investigate the production of MIAs in the Apocynaceae family by comparing the genome sequence of an MIA‐free species, Pachypodium lamerei, with those of its MIA‐producing relatives (Cuello et al., 2024), revealing the importance of transposable elements in MIA biosynthesis. In another example, three copies of sterol isomerases were found to be clustered in the genome of Melia azedarach (Hodgson et al., 2019). Two of these were identified as melianol oxide isomerases, while the third was not involved in specialized metabolism. Notably, one of the two functional isomerases would likely have been missed without this genomic context, because its low expression level precluded amplification.
Recently, GENESPACE has emerged as a new computational platform for enzyme discovery in the big data era. By integrating conserved gene order and orthology, it defines the expected physical position of all genes across multiple sequenced genomes, allowing researchers to visualize and explore related DNA sequences and determine gene loss or duplication. GENESPACE has successfully tracked the positions of essential genes in the evolution of grass crops, including maize, wheat, and rice. As exploring the genetic code in this way can lead to a better understanding of the evolution of important genomic regions, this method may also allow scientists to identify target genes for biosynthetic pathway discovery (Lovell et al., 2022).
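The idea of conserved gene order can be sketched in a few lines of Python. The example below is a simplified illustration, not the MCScanX, TOGA, or GENESPACE algorithm: it assumes that ortholog pairs are already known, considers only same‐orientation runs (inversions are ignored), and uses hypothetical gene names and gap thresholds. Genes in one genome that fall inside a collinear block but lack a partner in the other genome are natural candidates for gene loss or lineage‐specific gain.

```python
from typing import Dict, List, Tuple


def collinear_blocks(order_a: List[str], order_b: List[str],
                     orthologs: Dict[str, str], max_gap: int = 2,
                     min_anchors: int = 3) -> List[List[Tuple[str, str]]]:
    """Group ortholog pairs into blocks that keep the same gene order.

    order_a / order_b: gene IDs in chromosomal order for each genome.
    orthologs: maps a gene in genome A to its ortholog in genome B.
    max_gap: number of intervening genes tolerated between anchors.
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    # Anchors are (position in A, position in B, gene in A).
    anchors = [(i, pos_b[orthologs[g]], g) for i, g in enumerate(order_a)
               if g in orthologs and orthologs[g] in pos_b]

    blocks, current = [], []
    for i, j, g in anchors:
        if current:
            prev_i, prev_j, _ = current[-1]
            same_run = ((i - prev_i) <= max_gap + 1
                        and 0 < (j - prev_j) <= max_gap + 1)
            if not same_run:
                if len(current) >= min_anchors:
                    blocks.append(current)
                current = []
        current.append((i, j, g))
    if len(current) >= min_anchors:
        blocks.append(current)

    return [[(g, order_b[j]) for _, j, g in block] for block in blocks]


# Hypothetical toy genomes: a4 has no ortholog in genome B, and a6 maps
# to a distant position, so it is excluded from the block.
genome_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
genome_b = ["b9", "b1", "b2", "bx", "b3", "b7", "b5"]
pairs = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5", "a6": "b9"}
toy_blocks = collinear_blocks(genome_a, genome_b, pairs)
```

In this toy case, `a4` lies within the block but has no counterpart in genome B, which is the kind of pattern that would prompt a closer look at possible gene loss or gain.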
Genome and transcriptome sequencing have become a routine practice, transforming our understanding of the genetic elements that underlie the metabolic diversification across a wide range of plant taxa. Gene and genome duplications have been recognized as the primary types of genomic variation identified in comparative studies of sequenced genomes of related plants (Lichman et al., 2020; Moore & Purugganan, 2005). These duplications are subsequently shaped by genetic drift and natural selection, which drive the retention and functional diversification of these genetic elements (Ohno, 1970; Panchy et al., 2016). This process results in the remarkable diversity of specialized metabolites observed among plant species. In addition to single‐genome analysis, an approach known as pangenomics has unveiled additional under‐explored types of genomic structural variations in the entire conserved (core genome) and diversified (dispensable genome) genetic information of a species (He et al., 2025). To date, more than 30 plant pangenomes have been constructed, offering deeper insights into the molecular mechanisms that drive plant metabolic diversity (Zhou & Liu, 2022). For example, Li et al. (2020) sequenced 10 opium poppy cultivars and uncovered gene copy‐number variations that explain substantial differences in benzylisoquinoline alkaloid production across the cultivars. Notably, genes involved in the benzylisoquinoline alkaloid biosynthetic pathways within BGCs were more prone to gene copy‐number variations than those outside BGCs. A tomato pangenome comprising 725 accessions allowed for the identification of a presence‐absence variation in the promoter region of TomLoxC, a 13‐lipoxygenase gene associated with tomato flavor (Gao et al., 2019). In an A. thaliana pangenome project, Liu, Cheema, et al. (2020) analyzed 22 genomic regions containing the thalianol biosynthetic cluster and showed that chromosomal inversions functioned as a mechanism for reorganizing distant metabolic genes into the core cluster region, resulting in a more compact gene cluster. This compacted thalianol cluster was found in approximately 80% of the genomes examined, suggesting that tighter clustering of metabolic genes offers evolutionary or functional advantages (Liu, Cheema, et al., 2020). The enhanced resolution provided by pangenomics facilitates the elucidation of the genetic underpinnings of plant metabolic complexity. Intriguingly, such analysis highlights how these elements and organizations are evolutionarily repeated beyond the genome of the species in question, opening up a new frontier for mining metabolic diversity and biocatalytic prowess in plants.
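The distinction between core and dispensable genomes can be made concrete with a small Python sketch. It assumes that gene presence has already been called per accession; the accession names, gene names, and the `core_fraction` threshold are hypothetical, and real pangenome pipelines work from assemblies or graph structures rather than a ready‐made table.

```python
from typing import Dict, List, Set, Tuple


def classify_pangenome(presence: Dict[str, Set[str]],
                       core_fraction: float = 1.0
                       ) -> Tuple[Set[str], Dict[str, List[str]]]:
    """Split genes into core and dispensable sets.

    presence: maps each accession to the set of genes detected in it.
    core_fraction: share of accessions that must carry a gene for it
                   to count as core (1.0 means every accession).
    Returns (core genes, {dispensable gene: accessions carrying it}).
    """
    accessions = sorted(presence)
    all_genes = set().union(*presence.values())
    core: Set[str] = set()
    dispensable: Dict[str, List[str]] = {}
    for gene in all_genes:
        carriers = [acc for acc in accessions if gene in presence[acc]]
        if len(carriers) / len(accessions) >= core_fraction:
            core.add(gene)
        else:
            # Presence-absence variation: record who carries the gene.
            dispensable[gene] = carriers
    return core, dispensable


# Hypothetical presence calls for three accessions.
toy_presence = {
    "acc1": {"geneA", "geneB", "geneC"},
    "acc2": {"geneA", "geneC"},
    "acc3": {"geneA", "geneB"},
}
toy_core, toy_dispensable = classify_pangenome(toy_presence)
```

The carrier lists for dispensable genes can then be compared with metabolite profiles of the same accessions to look for presence‐absence variations that track with a chemotype.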
Current omics‐based approaches have primarily targeted specific pathways in specific species, typically using one or two types of omics data at a time. However, recent advances in data analysis and integration provide new opportunities to associate temporal and spatial expression levels with metabolite abundance, and to match mass‐spectral or structural features to enzyme families. Increasing data availability now paves the way for de novo pathway discovery based on biosynthetic gene co‐expression or metabolic network analyses via supervised or unsupervised multi‐omics integration. This approach can identify potential connections between metabolites and transcripts exhibiting correlated abundance across samples, using reaction rules associated with enzyme families encoded by those transcripts. MEANtools, for example, highlights a new hypothesis generation capability (Singh et al., 2024) and was successfully used to predict five of seven steps in the well‐characterized α‐tomatine and falcarinol metabolic pathway. The primary advantage of MEANtools lies in its ability to generate testable hypotheses about metabolic pathways with minimal prior knowledge. By integrating metabolomics and transcriptomics data, it can overcome the dependence on specific “bait” genes. Furthermore, MEANtools automates the identification of critical Pfam domains required for specific reactions and allows users to adjust reaction‐Pfam domain associations based on confidence or taxonomic origin of the data. This approach may lay the groundwork for future research in integrated genomic, transcriptomic, and metabolomic data to link natural product chemistry to biosynthesis genes and producers, and to explore biosynthetic diversity in nature.
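The general logic of linking correlated transcripts and metabolites through reaction rules can be sketched as follows. This is a toy illustration of the concept and not the MEANtools implementation: the two mass‐shift rules, the tolerance, the correlation cutoff, and all metabolite and transcript names are assumptions made for the example. A metabolite pair whose mass difference matches a rule (here, gain of one oxygen or one methylene unit) is linked to transcripts of the corresponding enzyme family whose expression correlates with the abundance of the putative product.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


# Illustrative reaction rules: monoisotopic mass shift (Da) -> enzyme family.
REACTION_RULES = {
    15.9949: "cytochrome P450",    # hydroxylation (+O)
    14.0157: "methyltransferase",  # methylation (+CH2)
}


def propose_links(metabolites: Dict[str, Tuple[float, List[float]]],
                  transcripts: Dict[str, Tuple[str, List[float]]],
                  min_r: float = 0.8,
                  tol: float = 0.005) -> List[Tuple[str, str, str]]:
    """Return (substrate, product, transcript) hypotheses.

    metabolites: name -> (mass, abundance across samples)
    transcripts: id -> (enzyme family, expression across the same samples)
    """
    proposals = []
    for sub, (mass_sub, _) in metabolites.items():
        for prod, (mass_prod, abundance) in metabolites.items():
            shift = mass_prod - mass_sub
            for rule_shift, family in REACTION_RULES.items():
                if abs(shift - rule_shift) > tol:
                    continue
                for tid, (fam, expression) in transcripts.items():
                    if fam == family and pearson(expression, abundance) >= min_r:
                        proposals.append((sub, prod, tid))
    return proposals


# Hypothetical data for four samples.
toy_metabolites = {
    "M1": (300.1000, [1.0, 2.0, 3.0, 4.0]),
    "M2": (316.0949, [2.0, 4.0, 6.0, 8.5]),
}
toy_transcripts = {
    "T1": ("cytochrome P450", [10.0, 20.0, 30.0, 41.0]),
    "T2": ("cytochrome P450", [40.0, 30.0, 20.0, 10.0]),
    "T3": ("methyltransferase", [10.0, 20.0, 30.0, 40.0]),
}
toy_links = propose_links(toy_metabolites, toy_transcripts)
```

Each returned triple is only a hypothesis to be tested experimentally; correlation across a handful of samples is weak evidence on its own.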
AI‐POWERED APPROACHES IN NATURAL PRODUCT BIOSYNTHESIS
ML‐based methods are increasingly used in various aspects of metabolic pathway elucidation, including pathway prediction and reconstruction, enzyme prediction, metabolite identification, and reaction prediction (Shah et al., 2021). ML can combine and analyze diverse data types to reveal patterns not apparent through other methods (Figure 3), thereby identifying novel metabolites and predicting reactions and pathways to guide experimental work. An outstanding example is GTC (Graph Transformer and CNN) (Bao et al., 2023), which uses a graph transformer network to capture molecular structural features and a convolutional neural network (CNN) to learn from SMILES string representations. The model was first trained with the KEGG dataset and further fine‐tuned using plant‐derived datasets from the Plant Metabolic Network (Hawkins et al., 2025). Interestingly, GTC could also classify natural products across multiple different classes of compounds. Inspired by the advancements in ML for pathway prediction, Kim et al. (2024) introduced READRetro, a bio‐retrosynthesis tool designed to predict the biosynthetic pathways of plant natural products by integrating cutting‐edge deep learning architectures with a retrieval‐augmented dual‐view approach. READRetro combines an ensemble of two dual‐representation‐based models, Retroformer and Graph2SMILES, with a rule‐based reaction retriever and a pathway retriever that utilizes KEGG pathways. Compared to conventional models such as BioNavi‐NP (Zheng et al., 2022) and RetroPath RL (Koch et al., 2020), READRetro demonstrated marked improvements in predicting both single‐step and multi‐step biosynthetic reactions. Importantly, READRetro goes beyond predicting known pathways, proposing potential routes for compounds with incompletely elucidated biosynthesis. 
This work highlights the potential of combining advanced deep learning techniques with knowledge retrieval to expand our understanding of plant metabolic pathways and provides a practical tool for exploring biosynthesis and production of secondary metabolites. While READRetro shows significant promise in predicting complex pathways, future enhancements may focus on increasing its predictive accuracy for long, multi‐step pathways and broadening its applicability to uncover previously unknown biosynthetic routes. Further exemplifying the application of ML, Toubiana et al. (2019) used correlation network analysis and ML techniques to predict metabolic pathways in tomatoes. First, metabolite correlation networks were established using metabolomics data. ML then distinguished between tomato and non‐tomato pathways. The ML model confidently predicted 22 pathways, four of which were subsequently validated via experimental functional characterization. An outstanding advantage of this ML method is that it does not require genomic data and can thus be applied to plants without sequenced genomes. It can also complement other genomic methods.
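Multi‐step retrosynthetic planning can be pictured as a search backwards from a target compound through single‐step predictions until a known building block is reached. The Python sketch below shows only that search skeleton. It is not READRetro, BioNavi‐NP, or RetroPath RL: the single‐step predictions are supplied as a hand‐written, hypothetical lookup table rather than by a trained model, and the compound names are placeholders.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def retro_search(target: str,
                 single_step: Dict[str, List[str]],
                 building_blocks: Set[str],
                 max_depth: int = 5) -> Optional[List[str]]:
    """Breadth-first search from a target back to a building block.

    single_step: maps a compound to candidate direct precursors
                 (stand-in for a learned single-step predictor).
    max_depth: maximum number of reaction steps in a proposed route.
    Returns the route ordered from building block to target, or None.
    """
    queue = deque([[target]])
    seen = {target}
    while queue:
        path = queue.popleft()
        current = path[-1]
        if current in building_blocks:
            return list(reversed(path))
        if len(path) > max_depth:
            continue  # route already uses max_depth steps; do not extend
        for precursor in single_step.get(current, []):
            if precursor not in seen:
                seen.add(precursor)
                queue.append(path + [precursor])
    return None


# Hypothetical single-step predictions and building blocks.
toy_steps = {
    "target": ["int_B", "int_C"],
    "int_B": ["int_A"],
    "int_C": [],
    "int_A": ["block_1"],
}
toy_route = retro_search("target", toy_steps, {"block_1"})
```

Breadth‐first search returns a shortest route in terms of step count; practical tools instead rank candidate routes with model scores and retrieve known reactions or pathways where available.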
Figure 3.
The increasingly large and complex plant datasets require powerful AI/ML to mine for biocatalysts and even whole pathways. Figure created with icons from BioRender.com under academic license (Agreement no.: MH28BH5EQV).
Generative AI‐based methods, including large language models (LLMs), are beginning to be applied to understand plant natural product biosynthesis and regulation. AgroNT, a foundational LLM trained primarily on the sequenced genomes of crop plants, predicts RNA polyadenylation sites, splice sites, long non‐coding RNAs, chromatin accessibility, and tissue‐specific gene expression (Mendoza‐Revilla et al., 2024). This valuable tool has facilitated the understanding of the underlying mechanisms of specialized metabolite production in crops such as rice, soy, and potato. Integrating the growing knowledge of non‐crop plant genomes into future tools can enhance the accuracy of biosynthetic pathway predictions. LLMs have also found applications in mapping natural products onto the tree of life. Busta et al. (2024) recently developed a proof‐of‐concept workflow to establish a phytochemical database by combining manual annotation and text mining to identify associations between plant species and tyrosine‐derived compounds. LLMs were then developed and tested for their ability to complement the text mining approaches. Although the researchers found LLMs helpful in extracting compound‐species associations, small‐scale manual curation remains necessary. This work demonstrates the potential of LLMs to expand phytochemical maps and provide greater insight into key evolutionary events driving chemical diversity in plants. Despite not directly leading to new enzyme discovery, this LLM‐based phytochemical map lays the groundwork for using LLMs such as Gemini, GPT, and Llama to condense and provide insights into large datasets as they are trained on increasingly more scientific information (Figure 3). One of the largest challenges will be integrating existing data within the models and devising strategies to prevent misleading “hallucinations” of the models.
AI/ML models are increasingly used to interpret genomic data, predicting regulatory elements, splicing, expression, and interactions across various life domains. While early applications were largely focused on human and prokaryotic systems, their utility in plant genomics is rapidly expanding. DeepSEA is an example of a tool that directly predicts chromatin features from DNA sequences, primarily focusing on regulatory element prediction (Zhou & Troyanskaya, 2015). Tools like this could be used to find regulatory elements similar to the ORCA (Octadecanoid‐Responsive Catharanthus AP2/ERF) cluster in C. roseus, which activates genes in MIA pathways in different species (Singh et al., 2020). DeepVariant leverages NGS data to identify single nucleotide polymorphisms (SNPs) and insertion–deletion events (Poplin et al., 2018), enabling the detection of changes in traditionally challenging regions and the discovery of novel events. Being able to detect such events rapidly would allow for the discovery of genomic changes like those seen in hydroxynitrile glucosides in different barley cultivars where SNPs were directly correlated to hydroxynitrile glucoside levels and a deletion event saw the loss of hydroxynitrile glucoside biosynthesis in one cultivar (Ehlert et al., 2019). SpliceAI, released by Illumina, identifies cryptic splice sites and events that alter mRNA transcript splicing (de Sainte Agathe et al., 2023). Being able to rapidly identify alternative splicing events allows for the discovery of phenomena such as what is seen with strictosidine glucosidase (SGD) in C. roseus, where a truncated SGD called shSGD regulates MIA biosynthesis by interfering with full‐length SGD (Carqueijeiro et al., 2021). Enformer, a deep neural network, integrates long‐range genomic interactions up to 200 000 base pairs away to predict gene regulation events (Avsec et al., 2021). This tool could be used to identify events such as long‐range interactions seen in MIA biosynthesis in C. roseus or the marneal biosynthetic cluster in A. thaliana (Li, Wood, et al., 2023; Roulé et al., 2022). AlphaMissense predicts missense variant effects across the human proteome using evolutionary and structural features to classify amino acid substitution impacts (Cheng et al., 2023). Missense mutations can have a large impact on biosynthetic pathways. An excellent example of this is in carotenoid biosynthesis, where missense mutations have been shown to change the color of different plant varieties (Gupta & Hirschberg, 2022). DNABERT, a DNA learning algorithm, performs diverse genomic tasks, including identifying promoters, splice sites, and enhancers (Ji et al., 2021). DNABERT2 is trained on more diverse genome sequences, but the exact species list is undisclosed. EVO and its successor EVO2 are models trained on diverse genome sequences spanning all observed evolution (Nguyen et al., 2024). They hold the potential to revolutionize biology by predicting genetic or biological features and generating proposed biological constructs for answering biological questions or for use in synthetic biology (Brixi et al., 2025). These new models have the potential to provide insight into both newly and previously sequenced genomes at levels that have not been possible before and may provide the key to unlocking many biosynthetic pathways that have remained elusive. The information they provide could help expand and improve upon existing and newly published computational pipelines, such as CoExpPhylo, designed to discover biosynthetic pathways, by adding more context and information to existing tools (Grünig & Pucker, 2025).
AI‐based protein structure predictors, such as AlphaFold (Jumper et al., 2021), ESMFold (Lin et al., 2023), and RoseTTAFold (Baek et al., 2021), hold significant promise for advancing plant biosynthetic pathway elucidation, particularly in predicting enzyme functions. Their remarkable accuracy enables the identification of conserved features and functional motifs, providing a foundation for reliable structure‐based annotation transfer (SAT) and improving orthogroup categorization even at 1% sequence identity (Bordin et al., 2023). Current computational tools, such as PARSE (Derry & Altman, 2023), FASSO (Andorf et al., 2022), and PANDA‐3D (Zhao et al., 2024), now leverage the AlphaFold Database (Varadi et al., 2024) to infer protein function. Notably, CBP60‐DB employed AlphaFold predictions to classify 1996 unique plant proteins, uncovering conserved calmodulin‐binding domains critical for stress signaling (Amani et al., 2023). Similar strategies hold promise for identifying catalytic residues and predicting the scaffolding substrates within the plant cytochrome P450 family. Currently, the AlphaFold DB covers just four plant models: A. thaliana, Glycine max, Oryza sativa, and Z. mays. The ongoing expansion of plant genomic and transcriptomic datasets is expected to substantially enrich the AlphaFold DB and enhance the utility of AI‐guided protein structure prediction in plant biosynthetic research.
Beyond predicting naturally occurring protein structures, generative AI models like RFDiffusion and ESM3 offer exciting potential to transform plant‐specialized metabolism research through the de novo design of novel protein folds that may not exist in nature with tailored catalytic functions (Hayes et al., 2025; Watson et al., 2023). Unlike traditional docking‐based approaches, these tools allow the creation of enzymes for uncharacterized reactions and the engineering of existing enzymes with enhanced or altered properties (Ahern et al., 2025; Lauko et al., 2025). Complementing de novo enzyme design, AI‐based molecular docking offers further opportunities to advance plant‐specialized metabolism research by providing predictive insights into enzyme‐substrate interactions and specificities. A wide range of tools, such as DynamicBind (Lu et al., 2024), DeepDock (Liao et al., 2019), and Lingo3DMol (Feng et al., 2024), each built on distinct computational cores, are currently available. These advanced docking methods enable high‐throughput screening to identify potential substrates, products, or modulators of metabolic enzymes. Collectively, AI‐assisted approaches could significantly enhance enzyme annotation, pathway discovery, the rational design of synthetic pathways, and unlock plant chemical diversity for sustainable biocatalysis, pharmaceuticals, and agricultural applications.
NATURAL PRODUCT METABOLISM IN THE DATA‐INTENSIVE FUTURE OF BIOLOGICAL SCIENCES
Since the dawn of NGS technologies in the 2010s, the study of plant metabolism has made giant leaps. Starting from traditional techniques with inducible cell cultures and the cloning of the coding sequence of phenylalanine ammonia‐lyase in Phaseolus vulgaris in 1985 (Edwards et al., 1985), the field has evolved from elucidating a pathway roughly every year (e.g., avenacin in 2004 and benzylisoquinoline in 2005) to witnessing several whole‐pathway discoveries annually in recent years (Boccia et al., 2024; Chuang et al., 2023; De La Peña et al., 2023; Hong et al., 2022; Martin et al., 2024; Reed et al., 2023). The emergence of big omics data has brought not only a wealth of biological data but also sophisticated algorithms and paradigms that pave the way for further biosynthetic gene discovery and characterization.
As the capacity to generate omics data becomes less of a limiting factor, we can focus on extracting more meaningful information. This includes single‐cell methods that characterize genetic, biochemical, and metabolic profiles at increasingly higher resolutions. Recent work using quantitative single‐cell mass spectrometry (MS) in C. roseus showed dynamic, cell‐type‐specific accumulation of vinblastine precursors, highlighting the precise cellular contexts of steps in the metabolic network. When integrated with single‐cell RNA‐Seq data or spatial transcriptomics, multi‐omics analyses at the single‐cell level help identify co‐localized transcripts and metabolites, accelerating the discovery of elusive enzymes and regulatory factors (Vu et al., 2024). Current advances in proteomics also aid the identification of biosynthetic genes. Nanopore‐based single‐protein sequencing is a novel approach with potential for proteomics studies in plant‐specialized metabolism (Motone et al., 2024). By reading full‐length enzyme sequences in a single pass, this method can reveal post‐translational modifications that are common in plant enzymes, such as phosphorylation, ubiquitination, and glycosylation, across multiple proteoforms, even at low abundance (Vu et al., 2018). As throughput and sensitivity continue to improve, omics data generation will transform the study of plant‐specialized metabolism.
With more elucidated pathways and curated data providing a positive feedback effect, AI/ML algorithms and resources will continue to be improved for specialized metabolic pathway discovery, even in non‐model plants or species with limited data. New tools could be developed to mine genomic “dark matter,” helping uncover novel natural products and biosynthetic pathways. AI/ML approaches are expected to be a common feature of natural product biosynthesis research, including uses in mining unexplored or orphan BGCs for new metabolites, retrosynthetically predicting novel biosynthetic routes, and identifying potential therapeutic or industrial compounds through computational screening of predicted metabolites.
The increasing abundance and resolution (temporal and spatial) of omics data will require ever more powerful data analytics tools to extract meaningful information (e.g., to pinpoint the exact cell types and events in biosynthetic pathways). Less data‐intensive AI/ML models may also become necessary where data or computational resources are not readily available (Li, Persaud, et al., 2023). In addition to understanding genomic structural variations linked to metabolic traits as seen in pangenomics and presence‐absence variation analysis, AI‐assisted tools can also facilitate evolutionary analyses to uncover pathway diversification across species. Despite challenges in data availability, tool power/efficiency, and model interpretability, the ongoing refinement of algorithms and growing accessibility of AI tools promise to revolutionize plant metabolic research, with implications for agriculture, medicine, and overall sustainability.
CONFLICT OF INTEREST
The authors declare no conflicts of interest.
ACKNOWLEDGMENTS
M. McConnachie receives funding from NSERC‐CGSM, Michael Smith Foundation Supplementary Scholarship, and Stober Foundation. T.‐T.T. Dang receives funding from NSERC‐DFG (ALLRP 580347‐22) and Michael Smith Health Research BC (SCH‐2020‐0401). T.‐T.T. Dang is grateful to the Plant Journal and the Phytochemical Society of North America for the TPJ‐PSNA Early Career Award.
DATA AVAILABILITY STATEMENT
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.
REFERENCES
- Ahern, W., Yim, J., Tischer, D., Salike, S., Woodbury, S.M., Kim, D. et al. (2025) Atom level enzyme active site scaffolding using RFdiffusion2. bioRxiv. Available from: 10.1101/2025.04.09.648075
- Amani, K., Shivnauth, V. & Castroverde, C.D.M. (2023) CBP60‐DB: an alphafold‐predicted plant kingdom‐wide database of the CALMODULIN‐BINDING PROTEIN 60 protein family with a novel structural clustering algorithm. Plant Direct, 7, e509.
- Andorf, C.M., Sen, S., Hayford, R.K., Portwood, J.L., Cannon, E.K., Harper, L.C. et al. (2022) FASSO: an AlphaFold based method to assign functional annotations by combining sequence and structure orthology. bioRxiv. Available from: 10.1101/2022.11.10.516002
- Avsec, Ž., Agarwal, V., Visentin, D., Ledsam, J.R., Grabska‐Barwinska, A., Taylor, K.R. et al. (2021) Effective gene expression prediction from sequence by integrating long‐range interactions. Nature Methods, 18, 1196–1203.
- Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G.R. et al. (2021) Accurate prediction of protein structures and interactions using a three‐track neural network. Science, 373, 871–876.
- Bao, H., Zhao, J., Zhao, X., Zhao, C., Lu, X. & Xu, G. (2023) Prediction of plant secondary metabolic pathways using deep transfer learning. BMC Bioinformatics, 24, 348.
- Boccia, M., Kessler, D., Seibt, W., Grabe, V., Rodríguez López, C.E., Grzech, D. et al. (2024) A scaffold protein manages the biosynthesis of steroidal defense metabolites in plants. Science, 386, 1366–1372.
- Bordin, N., Dallago, C., Heinzinger, M., Kim, S., Littmann, M., Rauer, C. et al. (2023) Novel machine learning approaches revolutionize protein knowledge. Trends in Biochemical Sciences, 48, 345–359.
- Brixi, G., Durrant, M.G., Ku, J., Poli, M., Brockman, G., Chang, D. et al. (2025) Genome modeling and design across all domains of life with Evo 2. bioRxiv. 2025.02.18.638918. Available from: 10.1101/2025.02.18.638918v1
- Busta, L., Hall, D., Johnson, B., Schaut, M., Hanson, C.M., Gupta, A. et al. (2024) Mapping of specialized metabolite terms onto a plant phylogeny using text mining and large language models. The Plant Journal, 120, 406–419.
- Caputi, L., Franke, J., Farrow, S.C., Chung, K., Payne, R.M.E., Nguyen, T.D. et al. (2018) Missing enzymes in the biosynthesis of the anticancer drug vinblastine in Madagascar periwinkle. Science, 360, 1235–1239.
- Carqueijeiro, I., Koudounas, K., Dugé de Bernonville, T., Sepúlveda, L.J., Mosquera, A., Bomzan, D.P. et al. (2021) Alternative splicing creates a pseudo‐strictosidine β‐d‐glucosidase modulating alkaloid synthesis in Catharanthus roseus. Plant Physiology, 185, 836–856.
- Carr, S.C., Rehman, F., Hagel, J.M., Chen, X., Ng, K.K.S. & Facchini, P.J. (2024) Two ubiquitous aldo‐keto reductases in the genus Papaver support a patchwork model for morphine pathway evolution. Communications Biology, 7, 1–18.
- Chen, Y., Ma, T., Zhang, L., Kang, M., Zhang, Z., Zheng, Z. et al. (2020) Genomic analyses of a “living fossil”: the endangered dove‐tree. Molecular Ecology Resources, 20, 756–769.
- Cheng, J., Novati, G., Pan, J., Bycroft, C., Žemgulytė, A., Applebaum, T. et al. (2023) Accurate proteome‐wide missense variant effect prediction with AlphaMissense. Science, 381(6664), eadg7492. Available from: 10.1126/science.adg7492
- Chuang, L., Liu, S. & Franke, J. (2023) Post‐cyclization skeletal rearrangements in plant triterpenoid biosynthesis by a pair of branchpoint isomerases. Journal of the American Chemical Society, 145, 5083–5091.
- Cuello, C., Jansen, H.J., Abdallah, C., Zamar Mbadinga, D.L., Birer Williams, C., Durand, M. et al. (2024) The Madagascar palm genome provides new insights on the evolution of Apocynaceae specialized metabolism. Heliyon, 10, e28078.
- Dang, T.T., Franke, J., Tatsis, E. & O'Connor, S.E. (2017) Dual catalytic activity of a cytochrome P450 controls bifurcation at a metabolic branch point of alkaloid biosynthesis in Rauwolfia serpentina. Angewandte Chemie International Edition in English, 56, 9440–9444.
- Dang, T.‐T.T., Chen, X. & Facchini, P.J. (2015) Acetylation serves as a protective group in noscapine biosynthesis in opium poppy. Nature Chemical Biology, 11, 104–106.
- Dang, T.‐T.T., Franke, J., Carqueijeiro, I.S.T., Langley, C., Courdavault, V. & O'Connor, S.E. (2018) Sarpagan bridge enzyme has substrate‐controlled cyclization and aromatization modes. Nature Chemical Biology, 14, 760–763.
- Dang, T.T.T., Onoyovwi, A., Farrow, S.C. & Facchini, P.J. (2012) Biochemical genomics for gene discovery in benzylisoquinoline alkaloid biosynthesis in opium poppy and related species, 1st edition. Amsterdam: Elsevier Inc. Available from: 10.1016/B978-0-12-394290-6.00011-2
- De La Peña, R., Hodgson, H., Liu, J.C.‐T., Stephenson, M.J., Martin, A.C., Owen, C. et al. (2023) Complex scaffold remodeling in plant triterpene biosynthesis. Science, 379, 361–368.
- de Sainte Agathe, J.‐M., Filser, M., Isidor, B., Besnard, T., Gueguen, P., Perrin, A. et al. (2023) SpliceAI‐visual: a free online tool to improve SpliceAI splicing variant interpretation. Human Genomics, 17, 7.
- Derry, A. & Altman, R.B. (2023) Explainable protein function annotation using local structure embeddings. bioRxiv. Available from: 10.1101/2023.10.13.562298
- Dugé De Bernonville, T., Amor Stander, E., Dugé De Bernonville, G., Besseau, S. & Courdavault, V. (2022) Predicting monoterpene indole alkaloid‐related genes from expression data with artificial neural networks. In: Courdavault, V. & Besseau, S. (Eds.) Catharanthus roseus. Methods in molecular biology. New York: Springer US, pp. 131–140. Available from: 10.1007/978-1-0716-2349-7_10
- Edgar, R.C. (2022) Muscle5: high‐accuracy alignment ensembles enable unbiased assessments of sequence homology and phylogeny. Nature Communications, 13(1), 6968. Available from: 10.1038/s41467-022-34630-w
- Edwards, K. , Cramer, C.L. , Bolwell, G.P. , Dixon, R.A. , Schuch, W. & Lamb, C.J. (1985) Rapid transient induction of phenylalanine ammonia‐lyase mRNA in elicitor‐treated bean cells. Proceedings of the National Academy of Sciences of the United States of America, 82, 6731–6735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ehlert, M. , Jagd, L.M. , Braumann, I. , Dockter, C. , Crocoll, C. , Motawia, M.S. et al. (2019) Deletion of biosynthetic genes, specific SNP patterns and differences in transcript accumulation cause variation in hydroxynitrile glucoside content in barley cultivars. Scientific Reports, 9, 5730. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Emms, D.M. & Kelly, S. (2019) OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology, 20, 1–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Farrow, S.C. , Hagel, J.M. , Beaudoin, G.A.W. , Burns, D.C. & Facchini, P.J. (2015) Stereochemical inversion of (S)‐reticuline by a cytochrome P450 fusion in opium poppy. Nature Chemical Biology, 11, 728–732. [DOI] [PubMed] [Google Scholar]
- Feng, W. , Wang, L. , Lin, Z. , Zhu, Y. , Wang, H. , Dong, J. et al. (2024) Generation of 3D molecules in pockets via a language model. Nature Machine Intelligence, 6, 62–73. [Google Scholar]
- Fernández, P. , Amice, R. , Bruy, D. , Christenhusz, M.J.M. , Leitch, I.J. , Leitch, A.L. et al. (2024) A 160 Gbp fork fern genome shatters size record for eukaryotes. iScience, 27, 109889. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ferrarini, M. , Moretto, M. , Ward, J.A. , Šurbanovski, N. , Stevanović, V. , Giongo, L. et al. (2013) An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome. BMC Genomics, 14, 670. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Frey, M. , Chomet, P. , Glawischnig, E. , Stettner, C. , Grün, S. , Winklmair, A. et al. (1997) Analysis of a chemical plant defense mechanism in grasses. Science, 277, 696–699. [DOI] [PubMed] [Google Scholar]
- Gao, L. , Gonda, I. , Sun, H. , Ma, Q. , Bao, K. , Tieman, D.M. et al. (2019) The tomato pan‐genome uncovers new genes and a rare allele regulating fruit flavor. Nature Genetics, 51, 1044–1051. [DOI] [PubMed] [Google Scholar]
- Grünig, N. & Pucker, B. (2025) CoExpPhylo – a novel pipeline for biosynthesis gene discovery. bioRxiv. 2025.04.03.647051. Available from: 10.1101/2025.04.03.647051
- Guo, L., Winzer, T., Yang, X., Li, Y., Ning, Z., He, Z. et al. (2018) The opium poppy genome and morphinan production. Science, 362, 343–347.
- Gupta, P. & Hirschberg, J. (2022) The genetic components of a natural color palette: a comprehensive list of carotenoid pathway mutations in plants. Frontiers in Plant Science, 12, 806184. Available from: 10.3389/fpls.2021.806184
- Hafner, A., DeLeo, V., Deng, C.H., Elsik, C.G., Fleming, D.S., Harrison, P.W. et al. (2025) Data reuse in agricultural genomics research: challenges and recommendations. GigaScience, 14, giae106.
- Hannigan, G.D., Prihoda, D., Palicka, A., Soukup, J., Klempir, O., Rampula, L. et al. (2019) A deep learning genome‐mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Research, 47, e110.
- Hawkins, C., Xue, B., Yasmin, F., Wyatt, G., Zerbe, P. & Rhee, S.Y. (2025) Plant Metabolic Network 16: expansion of underrepresented plant groups and experimentally supported enzyme data. Nucleic Acids Research, 53, D1606–D1613.
- Hayes, T., Rao, R., Akin, H., Sofroniew, N.J., Oktay, D., Lin, Z. et al. (2025) Simulating 500 million years of evolution with a language model. Science, 387, 850–858.
- He, W., Li, X., Qian, Q. & Shang, L. (2025) The developments and prospects of plant super‐pangenomes: demands, approaches, and applications. Plant Communications, 6(2), 101230. Available from: 10.1016/j.xplc.2024.101230
- Hodgson, H., De La Peña, R., Stephenson, M.J., Thimmappa, R., Vincent, J.L., Sattely, E.S. et al. (2019) Identification of key enzymes responsible for protolimonoid biosynthesis in plants: opening the door to azadirachtin production. Proceedings of the National Academy of Sciences of the United States of America, 116, 17096–17104.
- Hong, B., Grzech, D., Caputi, L., Sonawane, P., López, C.E.R., Kamileen, M.O. et al. (2022) Biosynthesis of strychnine. Nature, 607, 617–622.
- Huang, X.‐Q., Li, R., Fu, J. & Dudareva, N. (2022) A peroxisomal heterodimeric enzyme is involved in benzaldehyde synthesis in plants. Nature Communications, 13, 1352.
- Jensen, P.R. (2016) Natural products and the gene cluster revolution. Trends in Microbiology, 24, 968–977.
- Ji, W., Osbourn, A. & Liu, Z. (2024) Understanding metabolic diversification in plants: branchpoints in the evolution of specialized metabolism. Philosophical Transactions of the Royal Society, B: Biological Sciences, 379, 20230359.
- Ji, Y., Zhou, Z., Liu, H. & Davuluri, R.V. (2021) DNABERT: pre‐trained bidirectional encoder representations from transformers model for DNA‐language in genome. Bioinformatics, 37, 2112–2120.
- Jo, J., Park, J.‐S., Won, H., Jeong, J.S., Jung, T.W., Lee, K.J. et al. (2024) The first chromosomal‐level genome assembly of Sageretia thea using nanopore long reads and pore‐C technology. Scientific Data, 11, 959.
- Jones, P., Binns, D., Chang, H.‐Y., Fraser, M., Li, W., McAnulla, C. et al. (2014) InterProScan 5: genome‐scale protein function classification. Bioinformatics, 30, 1236–1240.
- Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O. et al. (2021) Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589.
- Kang, M., Fu, R., Zhang, P., Lou, S., Yang, X., Chen, Y. et al. (2021) A chromosome‐level Camptotheca acuminata genome assembly provides insights into the evolutionary origin of camptothecin biosynthesis. Nature Communications, 12, 3531.
- Katoh, K. & Standley, D.M. (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4), 772–780. Available from: 10.1093/molbev/mst010
- Kautsar, S.A., Suarez Duran, H.G., Blin, K., Osbourn, A. & Medema, M.H. (2017) PlantiSMASH: automated identification, annotation and expression analysis of plant biosynthetic gene clusters. Nucleic Acids Research, 45, W55–W63.
- Kim, T., Lee, S., Kwak, Y., Choi, M., Park, J., Hwang, S.J. et al. (2024) READRetro: natural product biosynthesis predicting with retrieval‐augmented dual‐view retrosynthesis. New Phytologist, 243, 2512–2527.
- Kirilenko, B.M., Munegowda, C., Osipova, E., Jebb, D., Sharma, V., Blumer, M. et al. (2023) Integrating gene annotation with orthology inference at scale. Science, 380, eabn3107. Available from: 10.1126/science.abn3107
- Koch, M., Duigou, T. & Faulon, J.‐L. (2020) Reinforcement learning for bioretrosynthesis. ACS Synthetic Biology, 9, 157–168.
- Kwan, B.D., Seligmann, B., Nguyen, T.‐D., Franke, J. & Dang, T.‐T.T. (2023) Leveraging synthetic biology and metabolic engineering to overcome obstacles in plant pathway elucidation. Current Opinion in Plant Biology, 71, 102330.
- Lau, W. & Sattely, E.S. (2015) Six enzymes from mayapple that complete the biosynthetic pathway to the etoposide aglycone. Science, 349, 1224–1228.
- Lauko, A., Pellock, S.J., Sumida, K.H., Anishchenko, I., Juergens, D., Ahern, W. et al. (2025) Computational design of serine hydrolases. Science, 388, eadu2454.
- Li, C., Colinas, M., Wood, J.C., Vaillancourt, B., Hamilton, J.P., Jones, S.L. et al. (2025) Cell‐type‐aware regulatory landscapes governing monoterpene indole alkaloid biosynthesis in the medicinal plant Catharanthus roseus. New Phytologist, 245, 347–362.
- Li, C., Haider, I., Wang, J.Y., Quinodoz, P., Suarez Duran, H.G., Méndez, L.R. et al. (2024) OsCYP706C2 diverts rice strigolactone biosynthesis to a noncanonical pathway branch. Science Advances, 10, eadq3942.
- Li, C., Wood, J.C., Vu, A.H., Hamilton, J.P., Rodriguez Lopez, C.E., Payne, R.M. et al. (2023) Single‐cell multi‐omics in the medicinal plant Catharanthus roseus. Nature Chemical Biology, 19, 1031–1041.
- Li, K., Persaud, D., Choudhary, K., DeCost, B., Greenwood, M. & Hattrick‐Simpers, J. (2023) Exploiting redundancy in large materials datasets for efficient machine learning with less data. Nature Communications, 14, 7283. Available from: 10.1038/s41467-023-42992-y
- Li, Q., Ramasamy, S., Singh, P., Hagel, J.M., Dunemann, S.M., Chen, X. et al. (2020) Gene clustering and copy number variation in alkaloid metabolic pathways of opium poppy. Nature Communications, 11, 1190.
- Li, Y., Leveau, A., Zhao, Q., Feng, Q., Lu, H., Miao, J. et al. (2021) Subtelomeric assembly of a multi‐gene pathway for antimicrobial defense compounds in cereals. Nature Communications, 12, 2563.
- Liao, Z., You, R., Huang, X., Yao, X., Huang, T. & Zhu, S. (2019) DeepDock: enhancing ligand‐protein interaction prediction by a combination of ligand and structure information. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). San Diego, CA, USA: IEEE, pp. 311–317.
- Lichman, B.R., Godden, G.T. & Buell, C.R. (2020) Gene and genome duplications in the evolution of chemodiversity: perspectives from studies of Lamiaceae. Current Opinion in Plant Biology, 55, 74–83.
- Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W. et al. (2023) Evolutionary‐scale prediction of atomic‐level protein structure with a language model. Science, 379, 1123–1130.
- Liu, C., Smit, S.J., Dang, J., Zhou, P., Godden, G.T., Jiang, Z. et al. (2023) A chromosome‐level genome assembly reveals that a bipartite gene cluster formed via an inverted duplication controls monoterpenoid biosynthesis in Schizonepeta tenuifolia. Molecular Plant, 16, 533–548.
- Liu, Z., Cheema, J., Vigouroux, M., Hill, L., Reed, J., Paajanen, P. et al. (2020) Formation and diversification of a paradigm biosynthetic gene cluster in plants. Nature Communications, 11, 5354. Available from: 10.1038/s41467-020-19153-6
- Liu, Z., Suarez Duran, H.G., Harnvanichvech, Y., Stephenson, M.J., Schranz, M.E., Nelson, D. et al. (2020) Drivers of metabolic diversification: how dynamic genomic neighbourhoods generate new biosynthetic pathways in the Brassicaceae. New Phytologist, 227, 1109–1123. Available from: 10.1111/nph.16338
- Lovell, J.T., Sreedasyam, A., Schranz, M.E., Wilson, M., Carlson, J.W., Harkess, A. et al. (2022) GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. eLife, 11, e78526.
- Lu, W., Zhang, J., Huang, W., Zhang, Z., Jia, X., Wang, Z. et al. (2024) DynamicBind: predicting ligand‐specific protein‐ligand complex structure with a deep equivariant generative model. Nature Communications, 15, 1071.
- Martin, L.B.B., Kikuchi, S., Rejzek, M., Owen, C., Reed, J., Orme, A. et al. (2024) Complete biosynthesis of the potent vaccine adjuvant QS‐21. Nature Chemical Biology, 20, 493–502.
- Mehta, N., Meng, Y., Zare, R., Kamenetsky‐Goldstein, R. & Sattely, E. (2024) A developmental gradient reveals biosynthetic pathways to eukaryotic toxins in monocot geophytes. Cell, 187, 5620–5637.e10.
- Mendoza‐Revilla, J., Trop, E., Gonzalez, L., Roller, M., Dalla‐Torre, H., de Almeida, B.P. et al. (2024) A foundational large language model for edible plant genomes. Communications Biology, 7, 1–18.
- Moore, R.C. & Purugganan, M.D. (2005) The evolutionary dynamics of plant duplicate genes. Current Opinion in Plant Biology, 8, 122–128.
- Motone, K., Kontogiorgos‐Heintz, D., Wee, J., Kurihara, K., Yang, S., Roote, G. et al. (2024) Multi‐pass, single‐molecule nanopore reading of long protein strands. Nature, 633, 662–669.
- Nett, R.S., Dho, Y., Low, Y.‐Y. & Sattely, E.S. (2021) A metabolic regulon reveals early and late acting enzymes in neuroactive Lycopodium alkaloid biosynthesis. Proceedings of the National Academy of Sciences of the United States of America, 118, e2102949118.
- Nett, R.S., Dho, Y., Tsai, C., Passow, D., Martinez Grundman, J., Low, Y.‐Y. et al. (2023) Plant carbonic anhydrase‐like enzymes in neuroactive alkaloid biosynthesis. Nature, 624, 182–191.
- Nett, R.S., Lau, W. & Sattely, E.S. (2020) Discovery and engineering of colchicine alkaloid biosynthesis. Nature, 584, 148–153.
- Nguyen, E., Poli, M., Durrant, M.G., Kang, B., Katrekar, D., Li, D.B. et al. (2024) Sequence modeling and design from molecular to genome scale with Evo. Science, 386, eado9336.
- Nguyen, T.‐A.M., Grzech, D., Chung, K., Xia, Z., Nguyen, T.‐D. & Dang, T.‐T.T. (2023) Discovery of a cytochrome P450 enzyme catalyzing the formation of spirooxindole alkaloid scaffold. Frontiers in Plant Science, 14, 1–7.
- Nguyen, T.‐A.M., Nguyen, T.‐D., Leung, Y.Y., McConnachie, M., Sannikov, O., Xia, Z. et al. (2021) Discovering and harnessing oxidative enzymes for chemoenzymatic synthesis and diversification of anticancer camptothecin analogues. Communications Chemistry, 4, 177.
- Obayashi, T. & Kinoshita, K. (2009) Rank of correlation coefficient as a comparable measure for biological significance of gene coexpression. DNA Research, 16, 249–260.
- Ohno, S. (1970) Duplication for the sake of producing more of the same. In: Ohno, S. (Ed.) Evolution by gene duplication. Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 59–65. Available from: 10.1007/978-3-642-86659-3_11
- Panchy, N., Lehti‐Shiu, M. & Shiu, S.‐H. (2016) Evolution of gene duplication in plants. Plant Physiology, 171, 2294–2316.
- Payne, R.M.E., Xu, D., Foureau, E., Teto Carqueijeiro, M.I.S., Oudin, A., Bernonville, T.D. et al. (2017) An NPF transporter exports a central monoterpene indole alkaloid intermediate from the vacuole. Nature Plants, 3, 16208.
- Pichersky, E. & Lewinsohn, E. (2011) Convergent evolution in plant specialized metabolism. Annual Review of Plant Biology, 62, 549–566.
- Pividori, M., Cernadas, A., De Haro, L.A., Carrari, F., Stegmayer, G. & Milone, D.H. (2019) Clustermatch: discovering hidden relations in highly diverse kinds of qualitative and quantitative data without standardization. Bioinformatics, 35, 1931–1939.
- Poplin, R., Chang, P.‐C., Alexander, D., Schwartz, S., Colthurst, T., Ku, A. et al. (2018) A universal SNP and small‐indel variant caller using deep neural networks. Nature Biotechnology, 36(10), 983–987. Available from: 10.1038/nbt.4235
- Pucker, B. (2022) Automatic identification and annotation of MYB gene family members in plants. BMC Genomics, 23, 220.
- Pucker, B. (2024) Functional annotation – how to tackle the bottleneck in plant genomics. Available from: https://www.preprints.org/manuscript/202402.0645/v1 [Accessed 18th May 2024].
- Pucker, B., Reiher, F. & Schilbert, H.M. (2020) Automatic identification of players in the flavonoid biosynthesis with application on the biomedicinal plant Croton tiglium. Plants, 9, 1103.
- Qu, Y., Easson, M.E.A.M., Simionescu, R., Hajicek, J., Thamm, A.M.K., Salim, V. et al. (2018) Solution of the multistep pathway for assembly of corynanthean, strychnos, iboga, and aspidosperma monoterpenoid indole alkaloids from 19E‐geissoschizine. Proceedings of the National Academy of Sciences of the United States of America, 115, 3180–3185.
- Rai, A., Hirakawa, H., Nakabayashi, R., Kikuchi, S., Hayashi, K., Rai, M. et al. (2021) Chromosome‐level genome assembly of Ophiorrhiza pumila reveals the evolution of camptothecin biosynthesis. Nature Communications, 12, 405.
- Reed, J., Orme, A., El‐Demerdash, A., Owen, C., Martin, L.B.B., Misra, R.C. et al. (2023) Elucidation of the pathway for biosynthesis of saponin adjuvants from the soapbark tree. Science, 379, 1252–1264.
- Reed, J.H. & Seebeck, F.P. (2024) Reagent engineering for group transfer biocatalysis. Angewandte Chemie International Edition, 63, e202311159.
- Rempel, A., Choudhary, N. & Pucker, B. (2023) KIPEs3: automatic annotation of biosynthesis pathways. PLoS One, 18, e0294342.
- Rodríguez‐López, C.E., Jiang, Y., Kamileen, M.O., Lichman, B.R., Hong, B., Vaillancourt, B. et al. (2022) Phylogeny‐aware chemoinformatic analysis of chemical diversity in Lamiaceae enables iridoid pathway assembly and discovery of aucubin synthase. Molecular Biology and Evolution, 39, msac057.
- Roulé, T., Christ, A., Hussain, N., Huang, Y., Hartmann, C., Benhamed, M. et al. (2022) The lncRNA MARS modulates the epigenetic reprogramming of the marneral cluster in response to ABA. Molecular Plant, 15, 840–856.
- Schadt, E.E., Turner, S. & Kasarskis, A. (2010) A window into third‐generation sequencing. Human Molecular Genetics, 19, R227–R240.
- Schmidt, M.H.‐W., Vogel, A., Denton, A.K., Schmidt, M.H., Istace, B., Wormit, A. et al. (2017) De novo assembly of a new Solanum pennellii accession using Nanopore sequencing. The Plant Cell, 29, 2336–2348.
- Shah, H.A., Liu, J., Yang, Z. & Feng, J. (2021) Review of machine learning methods for the prediction and reconstruction of metabolic pathways. Frontiers in Molecular Biosciences, 8, 634141. Available from: 10.3389/fmolb.2021.634141
- Singh, K.S., Duran, H.S., Pup, E.D., Zafra‐Delgado, O., Wees, S.C.M.V., Van der Hooft, J.J.J. et al. (2024) MEANtools: multi‐omics integration towards metabolite anticipation and biosynthetic pathway prediction. bioRxiv. 2024.12.22.629970. Available from: 10.1101/2024.12.22.629970
- Singh, S.K., Patra, B., Paul, P., Liu, Y., Pattanaik, S. & Yuan, L. (2020) Revisiting the ORCA gene cluster that regulates terpenoid indole alkaloid biosynthesis in Catharanthus roseus. Plant Science, 293, 110408.
- Skinnider, M.A., Johnston, C.W., Gunabalasingam, M., Merwin, N.J., Kieliszek, A.M., MacLellan, R.J. et al. (2020) Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences. Nature Communications, 11, 6058.
- Srinivasan, P. & Smolke, C.D. (2021) Engineering cellular metabolite transport for biosynthesis of computationally predicted tropane alkaloid derivatives in yeast. Proceedings of the National Academy of Sciences of the United States of America, 118, e2104460118.
- Stander, E.A., Lehka, B., Carqueijeiro, I., Cuello, C., Hansson, F.G., Jansen, H.J. et al. (2023) The Rauvolfia tetraphylla genome suggests multiple distinct biosynthetic routes for yohimbane monoterpene indole alkaloids. Communications Biology, 6, 1–19.
- Tatsis, E.C., Carqueijeiro, I., Bernonville, T.D.D., Franke, J., Dang, T.T., Oudin, A. et al. (2017) A three enzyme system to generate the Strychnos alkaloid scaffold from a central biosynthetic intermediate. Nature Communications, 8, 1–9.
- Thoben, C. & Pucker, B. (2023) Automatic annotation of the bHLH gene family in plants. BMC Genomics, 24, 780.
- Töpfer, N., Fuchs, L.‐M. & Aharoni, A. (2017) The PhytoClust tool for metabolic gene clusters discovery in plant genomes. Nucleic Acids Research, 45, 7049–7063.
- Toubiana, D., Puzis, R., Wen, L., Sikron, N., Kurmanbayeva, A., Soltabayeva, A. et al. (2019) Combined network analysis and machine learning allows the prediction of metabolic pathways from tomato metabolomics data. Communications Biology, 2, 1–13.
- Varadi, M., Bertoni, D., Magana, P., Paramval, U., Pidruchna, I., Radhakrishnan, M. et al. (2024) AlphaFold protein structure database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Research, 52, D368–D375.
- Vavricka, C.J., Takahashi, S., Watanabe, N., Takenaka, M., Matsuda, M., Yoshida, T. et al. (2022) Machine learning discovery of missing links that mediate alternative branches to plant alkaloids. Nature Communications, 13, 1405.
- Vu, A.H., Kang, M., Wurlitzer, J., Heinicke, S., Li, C., Wood, J.C. et al. (2024) Quantitative single‐cell mass spectrometry provides a highly resolved analysis of natural product biosynthesis partitioning in plants. Journal of the American Chemical Society, 146(34), 23891–23900. Available from: 10.1021/jacs.4c06336
- Vu, L.D., Gevaert, K. & De Smet, I. (2018) Protein language: post‐translational modifications talking to each other. Trends in Plant Science, 23, 1068–1080.
- Wang, Y., Malaco Morotti, A.L., Xiao, Y., Wang, Z., Wu, S., Chen, J. et al. (2022) Decoding the cytochrome P450 catalytic activity in divergence of benzophenone and xanthone biosynthetic pathways. ACS Catalysis, 12, 13630–13637.
- Wang, Y., Tang, H., DeBarry, J.D., Tan, X., Li, J., Wang, X. et al. (2012) MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Research, 40, e49.
- Wang, Z., Xiao, Y., Wu, S., Chen, J., Li, A. & Tatsis, E.C. (2022) Deciphering and reprogramming the cyclization regioselectivity in bifurcation of indole alkaloid biosynthesis. Chemical Science, 13, 12389–12395. Available from: 10.1039/d2sc03612f
- Watson, J.L., Juergens, D., Bennett, N.R., Trippe, B.L., Yim, J., Eisenach, H.E. et al. (2023) De novo design of protein structure and function with RFdiffusion. Nature, 620, 1089–1100.
- Winzer, T., Gazda, V., He, Z., Kaminski, F., Kern, M., Larson, T.R. et al. (2012) A Papaver somniferum 10‐gene cluster for synthesis of the anticancer alkaloid noscapine. Science, 336, 1704–1708.
- Wisecaver, J.H., Borowsky, A.T., Tzin, V., Jander, G., Kliebenstein, D.J. & Rokas, A. (2017) A global coexpression network approach for connecting genes to specialized metabolic pathways in plants. The Plant Cell, 29, 944–959.
- Yang, X., Kang, M., Yang, Y., Xiong, H., Wang, M., Zhang, Z. et al. (2019) A chromosome‐level genome assembly of the Chinese tupelo Nyssa sinensis. Scientific Data, 6, 282.
- Yu, T., Cui, H., Li, J.C., Luo, Y., Jiang, G. & Zhao, H. (2023) Enzyme function prediction using contrastive learning. Science, 379, 1358–1363.
- Yuan, Y., Shi, C. & Zhao, H. (2023) Machine learning‐enabled genome mining and bioactivity prediction of natural products. ACS Synthetic Biology, 12, 2650–2662.
- Zedek, F., Šmerda, J., Halasová, A., Adamec, L., Veleba, A., Plačková, K. et al. (2024) The smallest angiosperm genomes may be the price for effective traps of bladderworts. Annals of Botany, 134, 1131–1138.
- Zhao, C., Liu, T. & Wang, Z. (2024) PANDA‐3D: protein function prediction based on AlphaFold models. NAR Genomics and Bioinformatics, 6, lqae094.
- Zheng, S., Zeng, T., Li, C., Chen, B., Coley, C.W., Yang, Y. et al. (2022) Deep learning driven biosynthetic pathways navigation for natural products with BioNavi‐NP. Nature Communications, 13, 3342.
- Zhou, X. & Liu, Z. (2022) Unlocking plant metabolic diversity: a (pan)‐genomic view. Plant Communications, 3(2), 100300. Available from: 10.1016/j.xplc.2022.100300
- Zhou, J. & Troyanskaya, O.G. (2015) Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12, 931–934.
Data Availability Statement
Data sharing not applicable to this article as no datasets were generated or analysed during the current study.