Abstract
Natural products (NPs) produced by microorganisms and plants are a major source of drugs, herbicides, and fungicides. Thanks to recent advances in DNA sequencing, bioinformatics, and genome mining tools, a vast amount of data on NP biosynthesis has been generated over the years, which has been increasingly exploited to develop machine learning (ML) tools for NP discovery. In this review, we discuss the latest advances in developing and applying ML tools for exploring the potential NPs that can be encoded by genomic language and predicting the types of bioactivities of NPs. We also examine the technical challenges associated with the development and application of ML tools for NP research.
Keywords: machine learning, natural product, genome mining, biosynthetic gene cluster, bioactivity prediction, model construction

Introduction
For thousands of years, natural products (NPs) have been crucial to human health and well-being1. Recent advances in DNA sequencing, bioinformatics, and genome mining have made the discovery of NPs more efficient. However, with more compounds being discovered, it has become increasingly challenging to avoid discovery of previously characterized NPs. Additionally, exploring the biological functions of NPs remains difficult, particularly as some NPs exist in very small quantities, preventing extensive screening of their bioactivity. To aid in the discovery of NPs and characterization of their bioactivity, researchers have developed various strategies such as high-throughput biosynthetic gene cluster (BGC) discovery2, 3, BGC activation by CRISPR/Cas9-mediated genome editing4, 5, elicitor application6, and manipulation of global or pathway-specific regulators7, 8. Over the last twenty years, NP research has been revolutionized by the development of computational tools for every aspect of NP discovery, ranging from BGC identification to structure prediction to linking genes to compounds9, 10.
Machine learning (ML) is a subset of artificial intelligence (AI) that involves the use of algorithms and statistical models to enable computers to learn from data without being explicitly programmed. Over the past few decades, the concepts and tools of ML have permeated various research fields. In NP research, ML tools have played a crucial role in improving our understanding of NPs, including detecting BGCs, predicting chemical structures, and profiling activity, as summarized in a number of recent reviews9, 11–13. With the prediction power of these ML tools, it is possible to process a vast amount of genomic and molecular data in a high-throughput manner, which aids in selecting an experimentally feasible set for functional validation. A key to comprehending NP chemistry and biology is understanding the genomes in which their biosynthetic pathways are encoded. Given the increasing availability of microbial genomes, ML-based genome mining approaches offer a profound opportunity to decipher the genomic language of BGCs and better understand NP chemistry and diversity13. Once BGC-derived NP structures are elucidated, various ML tools can provide further information on bioactivity (such as anti-bacterial, anti-cancer, and anti-inflammatory activity), molecular targets, and other features11–13.
The workflow for building an ML model consists of four main parts: dataset preparation, molecular representations and descriptors, model training, and model evaluation (Fig. 1). Dataset preparation is crucial for generating a successful ML model. A high-quality NP dataset is a prerequisite and leads to better model performance. Zhang et al. identified specific aspects that need to be considered when preparing a dataset for ML model training, such as balanced positive and negative instances, applicability domain, data consistency, inevitable data errors, and database structure11. Featurization plays a crucial role in translating genomic language and chemical structure information into computer-readable formats. It is an essential step in modeling and predicting new BGCs as well as the properties of NPs and other compounds. One common example of featurization is the generation of molecular representations and descriptors. These enable the conversion of complex molecular structures into meaningful numerical features that can be utilized in various computational analyses and predictive models. Early molecular representations, such as SMILES (simplified molecular-input line-entry system)14, SMARTS (SMILES arbitrary target specification)15, Daylight sCIS16, OpenEye Scientific Software17, and InChI (international chemical identifier)18, were created to store and retrieve molecular information and identify shared molecular features or substructures from databases. Novel molecular representations, such as DeepSMILES19 and SELFIES20, have emerged for practical use in ML tools. Molecular fingerprints, such as ECFP (extended connectivity fingerprints)21 and MACCS (molecular access system) keys22, have been developed for efficient substructure searching in growing chemical databases and reduced storage space.
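To make the idea of a fingerprint concrete, the following minimal sketch hashes overlapping character n-grams of a SMILES string into a fixed-length set of bits and compares molecules by Tanimoto similarity. It is a toy stand-in and not ECFP or MACCS: real fingerprints are computed from the molecular graph by a cheminformatics toolkit. The function names, the n-gram length, and the bit length are illustrative assumptions.

```python
import zlib


def toy_fingerprint(smiles, n=3, n_bits=1024):
    """Hash every length-n substring of a SMILES string to a bit index.

    Toy illustration only: real fingerprints such as ECFP operate on the
    molecular graph, not on raw SMILES characters.
    """
    grams = {smiles[i:i + n] for i in range(max(len(smiles) - n + 1, 1))}
    # zlib.crc32 is deterministic across runs, unlike the built-in hash().
    return {zlib.crc32(g.encode("ascii")) % n_bits for g in grams}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


molecules = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "salicylic_acid": "C1=CC=C(C(=C1)C(=O)O)O",
    "ethanol": "CCO",
}
fingerprints = {name: toy_fingerprint(s) for name, s in molecules.items()}

print(tanimoto(fingerprints["aspirin"], fingerprints["salicylic_acid"]))
print(tanimoto(fingerprints["aspirin"], fingerprints["ethanol"]))
```

Because each molecule is reduced to a small set of integers, similarity searches against a large database become cheap set operations, which is the practical motivation for fingerprints noted above.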
Additionally, unlike chemoinformaticians, computational chemists usually use molecular representations to compute molecular descriptors that describe the structure of a compound as low-dimensional, meaningful features. Model training involves selecting an appropriate ML algorithm for the data and learning task. Supervised algorithms, such as neural networks (e.g., graph neural network (GNN), convolutional neural network (CNN), deep neural network (DNN))23–25, LDA (linear discriminant analysis)26, NB (naive Bayes)27, SVM (support vector machine)28, DT (decision tree)29, and RF (random forest)30, are commonly used for NP prediction (Fig. 2). The choice of algorithm depends on factors such as data quantity and quality, type of learning task, and interpretability of results. In ML, the ability of a model to make accurate predictions on new, unseen data is referred to as its generalization ability. To evaluate this ability, the dataset is typically split into training data (the portion of the dataset used to train the models), validation data (the subset employed to tune model hyperparameters and compare different models during cross-validation), and testing data (the held-out set utilized to evaluate the final performance of the selected model). The model is trained on the training set and its performance is evaluated on the testing set using evaluation metrics chosen according to the type of problem being solved. Common metrics for classification tasks include accuracy, precision, recall, and F1 score, while regression tasks use mean squared error, mean absolute error, and R-squared. To check that the model performs consistently across different dataset partitions, cross-validation is employed. This process randomly partitions the dataset into multiple training and validation sets.
By utilizing cross-validation, one can effectively compare different models, select the best model and hyperparameters, and subsequently employ a held-out test set to obtain a more accurate measure of the optimal model’s real-world performance. This approach enhances the reliability and robustness of the model’s evaluation, leading to more meaningful and dependable results in practical applications. A model that performs well on both the testing set and cross-validation is considered to have good generalization ability and can be used to make predictions on new, unseen data.
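The partitioning scheme and classification metrics described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: the 20% test fraction, the five folds, and the helper names are choices made here for the example, and in practice a library such as scikit-learn would normally be used.

```python
import random


def split_holdout(items, test_fraction=0.2, seed=0):
    """Shuffle once and set aside a held-out test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def k_fold(items, k=5):
    """Yield (training, validation) partitions; each item is validated exactly once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, folds[i]


def precision_recall_f1(y_true, y_pred):
    """Binary classification metrics; 1 marks the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


samples = list(range(100))                   # stand-ins for labeled BGCs or compounds
development, test = split_holdout(samples)   # the test set is used only once, at the end
for training, validation in k_fold(development, k=5):
    pass  # fit on `training`, compare models and hyperparameters on `validation`

print(precision_recall_f1([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```

Keeping the test set out of the cross-validation loop is what allows the final score to serve as an estimate of performance on unseen data.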
Fig. 1: An overview of an ML-enabled workflow for discovery of NPs.

The general workflow consists of model construction and experimental validation. Model construction involves four main parts: data preparation, molecular representation and descriptor, model training, and model evaluation. The red frame denotes model construction for BGC prediction. The blue frame represents model construction for bioactivity prediction. BGC: biosynthetic gene cluster; SMILES: simplified molecular-input line-entry system; InChI: international chemical identifier.
Fig. 2: Supervised learning.

a, Basic architecture of supervised learning. b, Examples present the commonly used supervised algorithms for NP discovery: neural networks, linear discriminant analysis (LDA), naive Bayes (NB), support vector machine (SVM), decision tree (DT), and random forest (RF).
This review will examine how ML tools have been applied in NP discovery, with a particular focus on how ML tools are leveraged to comprehend the unique “genomic language” that provides insights into NP chemistry. Additionally, we will explore the applications of ML tools in predicting the biological effects of NPs.
ML-assisted genome mining of NPs
NPs are structurally diverse and can be grouped into many classes based on their biosynthetic principles. Numerous genome mining tools have been developed to identify BGCs directly from genome information. Most of them (e.g., antiSMASH31 and PRISM32) use the Basic Local Alignment Search Tool (BLAST) or profile hidden Markov models (pHMMs) to mine signature genes that are responsible for the biosynthesis of a specific class of NPs and then determine the boundaries of the BGCs based on a set of pre-defined rules. Over the years, ML tools have been introduced to genome mining with the goal of discovering new BGCs that may be overlooked by traditional rule-based models. Here, we discuss the ML-based genome mining tools developed for different classes of NPs (Table 1).
Table 1.
Machine learning tools for genome mining of NPs.
| Name | Scope of application | ML algorithms | Data source | Training dataset size/feature | Ref. |
|---|---|---|---|---|---|
| NRPSpredictor2 | Predict NRPS adenylation domain specificity | SVM, Transductive SVM | Manually curated dataset | 576 (labeled data for SVM), 5,096 (unlabeled data for TSVM); 12 AAindex and z-scales descriptors | 34 |
| SANDPUMA | Predict NRPS adenylation domain specificity | DT | MIBiG57 and manually curated dataset | 928 | 31 |
| RiPPMiner | Predict RiPP BGC subclasses, the leader cleavage site of precursor peptide, cross-links and post-translationally modified residues in the core peptide | SVM, RF | RiPPDB (manually curated database) | 513 | 36 |
| RODEO | Identify and rank RiPP precursor peptides belonging to specific subclasses and evaluates the genomic neighborhood | SVM | Manually curated dataset | 350 | 38 |
| decRiPPter | Predict RiPP precursor peptides in a class-independent manner and identify corresponding BGCs using pan-genomics | SVM | MIBiG 2.0 (ref. 58) | 175 (positive dataset), 20,000 (negative dataset) | 44 |
| NeuRiPP | Identify RiPP precursor peptides belonging to known subclasses | Parallel CNN | RiPP-PRISM59, Thiofinder60, high-confidence RODEO predictions | 2,726 (positive dataset), 19,224 (negative dataset); matrix of one-hot vectors | 45 |
| NLPPrecursor | Identify RiPP precursor peptides belonging to known subclasses | NLP | RiPPs identified by RiPP-PRISM59 | ~3,000 (token vectors of length bptt) | 46 |
| DeepBGC | Identify BGCs for all major NP classes and predict the molecular activity of the NPs | BiLSTM, RNN, Skip-gram neural network, RF | ClusterFinder53 training set | 617 (positive dataset), 10,128 (negative dataset); 102-dimensional vectors and two binary flags | 51 |
| Deep-BGCpred | Identify BGCs for all major NP classes | CNN, Stacked BiLSTM, RF | MIBiG 1.5, ClusterFinder53 training set | 1,984 (positive dataset), 10,128 (negative dataset); pfam2vec embedding vectors and continuous vector representations | 52 |
| BiGCARP | Identify BGCs for all major NPs classes | ESM-1b, BERT, CNN | antiSMASH | 127,000 (mask and corrupt tokens) | 54 |
| AniAMPpred | Identify AMPs from the animal kingdom | SVM, CNN | NCBI, StarPepDB61 | 16,096 (positive dataset), 15,747 (negative dataset) | 48 |
| AMP prediction | Identify AMPs from peptide sequence features | LSTM, Attention, BERT | ADAM62, APD63, CAMP64, LAMP65 | 10,321 (positive dataset), 3,030,124 (negative dataset); transformer-based bidirectional encoder representations | 49 |
Non-ribosomally synthesized peptides
Non-ribosomally synthesized peptides (NRPs) are synthesized by multi-modular mega-enzymes named non-ribosomal peptide synthetases (NRPSs). Each module minimally consists of three domains: an adenylation domain (A-domain), a peptidyl carrier protein domain (PCP-domain), and a condensation domain (C-domain), which are responsible for recruiting the substrate, tethering it, and condensing it into the growing peptide chain, respectively33. The primary structure of the NRP depends on the sequential order of the modules and their domain composition. To aid in the discovery of new NRPs, Röttig et al. developed NRPSpredictor234 to predict the amino acid substrate specificity of the A-domain. Built on 34 specificity-conferring active site residues in the A-domain, NRPSpredictor2 employs an SVM trained on 576 labeled A-domains and a transductive SVM trained with 5,096 unlabeled A-domains for the prediction of substrate specificity. For bacteria, the predictor can predict both the gross physicochemical properties of an A-domain’s substrates and the specific amino acid substrate. For fungi, the predictor can only predict the gross physicochemical properties of substrates because of insufficient fungal training data. In another study, Blin et al. developed SANDPUMA (Specificity of Adenylation Domain Prediction Using Multiple Algorithms)31 for ensemble prediction of A-domain substrate specificity, using a decision tree schema that performs individual predictions and combines the results into a single prediction. With an expanded training dataset containing 928 unique A-domain sequences, the ensemble method significantly outperforms the individual methods by leveraging the strengths of active site motif (ASM), SVM, prediCAT (a phylogenetically driven algorithm), and pHMMs.
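The idea of combining several A-domain predictors into a single call can be illustrated with a short sketch. It uses a plain majority vote with a "no call" fallback, which is a deliberately simplified stand-in chosen here and not the decision tree that SANDPUMA actually uses. The function name, the agreement threshold, and the example calls are hypothetical.

```python
from collections import Counter


def combine_predictions(calls, min_agreement=2):
    """Return the substrate named by most predictors, or None without enough agreement.

    `calls` maps a predictor name to its predicted substrate (None = no call).
    Simplified majority vote; not the published SANDPUMA decision tree.
    """
    votes = Counter(c for c in calls.values() if c is not None)
    if not votes:
        return None
    ranked = votes.most_common()
    top, count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == count
    return top if count >= min_agreement and not tied else None


# Hypothetical per-method calls for one A-domain (values invented for illustration).
calls = {"ASM": "Val", "SVM": "Val", "prediCAT": "Ile", "pHMM": None}
print(combine_predictions(calls))
```

Returning no call when the methods disagree trades coverage for precision, a trade-off that any ensemble of this kind has to settle explicitly.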
Ribosomally synthesized and post-translationally modified peptides
Ribosomally synthesized and post-translationally modified peptides (RiPPs) are an emerging class of NPs that are especially attractive for ML-based genome mining efforts due to the relatively small size of RiPP BGCs and the lack of universal signature biosynthetic genes across all RiPP families. Based on the type of post-translational modification installed on the precursor peptide, RiPPs can be categorized into more than 40 subclasses35. In 2017, Agrawal et al. developed RiPPMiner36 to predict chemical structures and subclasses of RiPPs directly from precursor peptide sequences based on SVM and RF classifiers trained on 513 experimentally characterized RiPPs from 13 RiPP subclasses. RiPPMiner can also predict the leader cleavage site, complex cross-links, and post-translationally modified residues in the core peptide for the major RiPP subclasses like lanthipeptides, cyanobactins, thiopeptides, and lasso peptides that contain more than 50 entries in the training dataset. An updated version of RiPPMiner called RiPPMiner-Genome37 can directly take genome sequences as input for automated identification of RiPP BGCs.
In another study, Tietz et al. developed RODEO (Rapid ORF Description and Evaluation Online)38 for mining RiPP BGCs. Unlike RiPPMiner, which uses a whole genome or precursor peptide sequence as input, RODEO uses a single protein of interest as the query and captures the neighboring genomic region to predict the function of nearby genes by analyzing their Pfam pHMMs. A tripartite procedure of heuristic scoring, SVM, and motif analysis is then used to predict and rank precursor peptides. RODEO first demonstrated its utility by surveying the lasso peptide biosynthetic landscape, revealing over 1,400 BGCs and guiding the discovery of five novel lasso peptides. It has been further developed to survey additional RiPP subclasses including thiopeptides39, lanthipeptides40, linaridins41, ranthipeptides42, and graspetides43.
Despite the progress in predicting RiPP BGCs belonging to known subclasses, genome mining of new RiPP subclasses remains a daunting challenge. In 2020, Kloosterman et al. established the Data-driven Exploratory Class-independent RiPP TrackER (decRiPPter)44 to tackle this challenge. It combines an SVM, trained on 175 known RiPP precursors, that identifies candidate precursor genes regardless of RiPP subclass with pan-genomic analyses that identify the corresponding BGCs among operon-like structures sparsely distributed across genomes. Analysis of 1,295 Streptomyces genomes using decRiPPter led to the discovery of a new lanthipeptide subfamily, serving as an experimental validation of the approach. Because it is geared towards novelty, this approach inevitably yields more false positives than the above-mentioned genome mining tools for RiPPs.
In a departure from traditional ML-based tools such as SVM and decision tree (DT) classifiers, deep learning-based genome mining methods have also been utilized to identify RiPP precursor peptides with higher accuracy. In 2019, de Los Santos developed a deep neural network (DNN) classifier, NeuRiPP45, which was trained on over 9,454 peptide sequences for identifying known precursor peptides and new precursor peptide-like sequences, with the best parallel convolutional neural network (CNN) architecture achieving over 99% accuracy. Another tool, NLPPrecursor46, developed by Merwin et al., employs natural language processing (NLP)47 to identify precursor peptides in a class-independent manner but is parameterized for the detection of known RiPP subclasses. NLPPrecursor is a part of DeepRiPP, which also includes two other modules to automate the selective discovery of novel RiPPs. One module is Basic Alignment of Ribosomal Encoded Products Locally (BARLEY), which prioritizes loci that encode novel products by matching the predicted RiPP to a chemical structure database of previously characterized members using a cheminformatic local alignment algorithm. The last module, Computational Library for Analysis of Mass Spectra (CLAMS), automates the identification of the corresponding product in mass spectrometry data by comparative metabolomic analysis. By integrating these three modules, DeepRiPP successfully guided the discovery of three novel RiPPs, including deepstreptin (a lasso peptide) and two lanthipeptides, deepflavo and deepginsen.
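Before a peptide sequence can be passed to a CNN, it has to be converted into numbers. Table 1 lists a matrix of one-hot vectors as the NeuRiPP input feature; the sketch below shows what such an encoding looks like in general. The padding length, the handling of unknown residues, and the example sequence are assumptions made for illustration and are not taken from the NeuRiPP implementation.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(peptide, max_length=120):
    """Return a max_length x 20 matrix; rows past the peptide end stay all-zero."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for position, residue in enumerate(peptide[:max_length].upper()):
        column = INDEX.get(residue)
        if column is not None:  # unknown residues (e.g. X) are left as zeros
            matrix[position][column] = 1
    return matrix


# An arbitrary example sequence, padded to 30 positions.
encoded = one_hot_encode("MSTKDFNLDLVSVSKKDSGASPR", max_length=30)
print(len(encoded), len(encoded[0]), sum(map(sum, encoded)))
```

Padding every peptide to the same length gives the network a fixed input shape, at the cost of truncating any sequence longer than the chosen maximum.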
Anti-microbial peptides
Beyond RiPPs, the discovery of anti-microbial peptides (AMPs) has also benefited greatly from various ML tools. In 2021, Sharma et al. developed AniAMPpred, which uses an SVM and a 1D CNN with Word2vec embedding to identify AMPs from the animal kingdom. Trained on a curated dataset consisting of 10,187 AMPs and 15,747 non-AMPs, the model classifies both AMPs and non-AMPs across diverse peptides of varying lengths with an F1 score of 96% on independent datasets. They further used AniAMPpred to identify 436 probable AMPs from the genome of Helobdella robusta but did not proceed with experimental validation48. In a recent work on AMP prediction, Ma et al. combined three NLP models (Long Short-Term Memory (LSTM), Attention, and Bidirectional Encoder Representations from Transformers (BERT)) for mining AMPs from the human gut microbiome49. On the same test dataset, the combined model outperformed other available AMP prediction methods in terms of Area Under the Precision-Recall Curve (AUPRC) and precision. Experimentally, 181 of 216 candidate AMPs showed antimicrobial activity (a positive rate of >83%). For a comprehensive review of ML-enabled AMP discovery and design, we refer readers to the review by Yan et al.50.
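When negatives vastly outnumber positives, as in the 10,321 versus 3,030,124 split listed for AMP prediction in Table 1, accuracy says little and ranking-based metrics such as AUPRC are more informative. The sketch below computes average precision, one common estimator of AUPRC, from scratch. The function name and the toy labels and scores are assumptions for illustration.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each true positive."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Invented labels (1 = AMP) and model scores for five peptides.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(average_precision(labels, scores))

# The experimental hit rate reported above: 181 active candidates out of 216 tested.
print(181 / 216)
```

A classifier that ranks every true AMP above every non-AMP scores 1.0 regardless of how many negatives the dataset contains.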
Other ML-based genome mining tools
Compared with ML-based genome mining tools developed for specific classes of NPs, examples of utilizing ML for comprehensive identification of BGCs regardless of NP class are still limited. In 2019, Hannigan et al. developed DeepBGC51, which employs a BiLSTM (Bidirectional Long Short-Term Memory) RNN (Recurrent Neural Network) and a word2vec-like skip-gram word embedding neural network (pfam2vec). Trained with 617 positive and 10,128 negative samples, it improved the detection of BGCs belonging to known classes and showed great potential to identify novel BGC classes. DeepBGC was supplemented with an RF classifier that enables accurate classification of BGC product classes and some degree of prediction of the corresponding biological activities. In 2022, Yang et al. reported an improved version of DeepBGC called Deep-BGCpred52, which combines a multi-source Pfam domain encoder with a stacked BiLSTM model for predicting BGCs with improved accuracy and reduced false-positive rates. Benchmark experiments showed that Deep-BGCpred is superior to the existing NP class-independent genome mining tool, ClusterFinder53, which predicts BGCs via pHMMs of a sequence of Pfam annotations. As with other supervised algorithms, the performance of DeepBGC and Deep-BGCpred relies heavily on the quality of the negative examples, which should contain no true BGCs mislabeled as negatives yet still resemble true BGCs.
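A skip-gram embedding such as pfam2vec treats the ordered Pfam domains of a genome as words in a sentence and learns from which domains occur near each other. The sketch below shows only the first step, generating (center, context) training pairs. The window size, the function name, and the ordering of the example accessions are assumptions for illustration rather than details of the DeepBGC implementation.

```python
def skipgram_pairs(domains, window=2):
    """Return (center, context) pairs for every domain and its neighbors within `window`."""
    pairs = []
    for i, center in enumerate(domains):
        lo, hi = max(0, i - window), min(len(domains), i + window + 1)
        pairs.extend((center, domains[j]) for j in range(lo, hi) if j != i)
    return pairs


# Pfam-style accessions in genomic order; the ordering is invented for illustration.
genome_region = ["PF00109", "PF02801", "PF00698", "PF00550", "PF00975"]
pairs = skipgram_pairs(genome_region, window=1)
for center, context in pairs:
    print(center, "->", context)
```

A small neural network trained to predict the context domain from the center domain then yields one dense vector per Pfam family, and these vectors serve as the input features for the downstream sequence model.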
In a recent study, Rios-Martinez et al. pioneered the use of a self-supervised masked language model called BiGCARP54, which uses the ByteNet encoder dilated CNN architecture55 with linear input embedding and output decoding layers, for predicting and classifying BGCs from microbial genomes. Trained on 127,000 BGC sequences represented as ESM-1b-pretrained embeddings of protein family domains56, BiGCARP can capture meaningful patterns in BGCs, with AUROC (area under the receiver operating characteristic curve) scores ranging from 0.936 to 0.950, and outperforms DeepBGC on classifying four out of seven product classes. This improvement is attributed to the training data used for BiGCARP, which is roughly 100 times larger than that used for DeepBGC. However, it is still unclear whether BiGCARP can detect truly novel BGCs that contain noncanonical biosynthetic domains from underrepresented sources.
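Self-supervised masked language modeling needs no labels because the training signal comes from hiding part of the input and asking the model to recover it. The sketch below shows that masking step for a BGC represented as a sequence of domain tokens. The 15% mask rate follows the common BERT convention and is an assumption here, as are the function name and the mask symbol; none of them is taken from the BiGCARP code.

```python
import random

MASK = "<mask>"


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with MASK; return the inputs and the targets."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_masked = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    inputs = [MASK if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}  # what the model must recover
    return inputs, targets


# Pfam-style tokens standing in for one BGC; the ordering is invented for illustration.
bgc = ["PF00109", "PF02801", "PF00698", "PF00550", "PF00975", "PF00501", "PF00668"]
inputs, targets = mask_tokens(bgc)
print(inputs)
print(targets)
```

During training, the loss is computed only at the masked positions, so the model has to infer each hidden domain from its genomic neighbors.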
ML-based prediction of NP bioactivity
NP discovery has been greatly accelerated by the aforementioned antiSMASH and PRISM 4 and by emerging ML tools for genome mining. ML also offers a unique opportunity to link the molecular structures of NPs with bioactivity. Several ML tools have been developed to predict various types of activity, such as anti-microbial, anti-cancer, and anti-inflammatory activity, as well as molecular targets. In the following sections, we present examples of ML-assisted bioactivity prediction for NPs, including the ML tools used, data sources, and dataset sizes (Table 2). It is worth noting that these ML tools rely heavily on NP structural information for bioactivity prediction. This may represent a major drawback because they are limited to known NPs that have undergone structural characterization, and obtaining structures for novel NPs can be challenging. To address this limitation, alternative methods for predicting NP activity from the gene cluster have been reported32, 51, 66. For instance, Skinnider et al. introduced PRISM 4, a comprehensive platform capable of predicting the chemical structures of genomically encoded antibiotics, covering all classes of bacterial antibiotics currently in clinical use. The high accuracy of chemical structure prediction facilitated the development of ML tools to predict the likely biological activity of encoded molecules32. To gain a deeper understanding of these studies and the methods employed, we encourage readers to consult the relevant literature. Moreover, data sources such as ChEMBL, PubChem, and FDA-approved drugs encompass both NPs and synthetic compounds. Most of the models discussed in this review have been trained on datasets that include both synthetic molecules and NPs.
It is important to recognize that synthetic compounds occupy a distinct area of chemical space compared to NPs, which could potentially lead to reduced accuracy when these models are employed to predict the bioactivity of NPs.
Table 2.
Machine learning tools for bioactivity prediction of NPs.
| Name | Scope of application | ML algorithms | Data source | Training dataset size | Ref. |
|---|---|---|---|---|---|
| Anti-microbial | |||||
| KNIME | Anti-malarial | NB, SMO, RF, VP | ChEMBL, PubChem, literature, thesis | 1147 | 72 |
| -- | Anti-fungal | ISE | FDA approved drugs, ADG | 3132 | 69 |
| -- | Anti-MRSA | RF, SVM, GP, CNN | ChEMBL, PubChem, ZINC, literature | 6645 (for molecular descriptors), 155 (for NMR descriptors) | 67 |
| -- | Anti-microbial | ISE | CMC, ADG, literature | 3520 | 68 |
| -- | Antibiotic discovery | DMP-DNN | FDA approved drugs | 2335 | 70 |
| Anti-cancer | |||||
| CDRUG | Anti-cancer | RFW, TC, KMM | NCI-60 DTP | 18369 | 73 |
| -- | Anti-cancer | DT, SVM, RF, RoF | GDSC, PubChem | 8420 | 75 |
| CDK+PM6 Rf | Anti-cancer, Antibiotic | SVM, RF, CT | PubChem, AntiMarin | 1746 | 76 |
| -- | Anti-cancer | ISE | CMC, NCI drug dictionary | 3509 | 79 |
| -- | Anti-cancer | Causation analysis | Experiment data | 28 | 80 |
| KekuleScope | Anti-cancer | CNN, DNN, RF | ChEMBL | 62981 | 78 |
| Anti-inflammation | |||||
| -- | Anti-inflammation | LDA | MicroSource, literature | 824 | 81 |
| -- | Anti-ulcerative colitis | LDA | MicroSource, Sigma-Aldrich databases | 53 | 82 |
| -- | Anti-inflammation | ISE | AnalytiCon Discovery | 3333 | 83 |
| InflamNat | Anti-inflammation, compound-target interaction | MTT | Literature | 1351 | 84 |
| Target prediction | |||||
| STarFish | Target protein | kNN, RF, MLP | AfroCancer, AfroDb, AfroMalaria, AnalytiCon, Carotenoids, ConMedNP, InterBioScreen (IBS), Mitishamba, NANPDB, Natural Product Atlas, NPACT, NPASS, NuBBE, pANAPL, SANCDB, Super Natural II, TCM, TIPdb, UNPD, ZINC, ChEMBL | 438258 | 86 |
| DeepDTA | Target protein | CNN | Davis, KIBA | 30056 (from Davis), 118254 (from KIBA) | 87 |
| DeepAffinity | Nuclear estrogen receptors, GPCR, Ion channel, Receptor tyrosine kinases | GNN, RNN | BindingDB, STITCH, UniRef | 489280 | 88 |
| DeepConv-DTI | Target protein | CNN | DrugBank, KEGG, IUPHAR | 48193 | 90 |
| -- | GPCR, Ion channel, Transporter, Receptor, Enzyme, Others | SVM | DrugBank | 2107 (for GPCR), 502 (for Ion channel), 311 (for Transporter), 199 (for Receptor), 410 (for Enzyme), 83 (for Others) | 85 |
| DEEPScreen | Target protein | CNN | ChEMBL | 769935 | 91 |
| MolTrans | Target protein | CNN | BIOSNAP, DAVIS, BindingDB | 27482 (from BIOSNAP), 11103 (from DAVIS), 32601 (from BindingDB) | 92 |
| DeepRelations | Target protein | GCN, GIN, RNN | Davis, KIBA, PDBbind | 25046 (from Davis), 98545 (from KIBA), 2921 (from PDBbind) | 89 |
| -- | Target PKBβ | QSAR | SWMD | 157 | 93 |
| DeepCYP | Target CYP450 | DNN | PubChem BioAssay | 17143 | 94 |
| -- | Target human plasma proteins | RF, BT, MLR, KNN, SVR, MNN | Votano, PKDB, DrugBank | 1209 | 95 |
| -- | Target SIRT1 | QSAR | PubChem | 354 | 96 |
| -- | Target ERα | NB, RP | BindingDB, DUD-E | 6556 | 97 |
Anti-microbial
The use of ML tools in predicting the bioactivity of NPs has gained significant attention in recent years. One area where ML has been extensively employed is the prediction of anti-microbial activity. In 2018, Dias et al. developed two QSAR (quantitative structure-activity relationship) models, one using molecular descriptors (approach A) and the other using 1D NMR descriptors (approach B), to discover new inhibiting agents against methicillin-resistant Staphylococcus aureus (MRSA) infection. In approach A, they built regression models on 6645 molecules retrieved from various databases, achieving an R² of 0.68 and an RMSE of 0.59 for the test set. In approach B, a new NP drug discovery methodology was developed using 1D NMR descriptors, with the best model achieving a prediction accuracy of over 77% for both training and test datasets67. Masalha et al. developed an ML tool using the ISE (iterative stochastic elimination) algorithm that efficiently predicts NPs to assist in the discovery of low-cost antibacterial drugs, achieving an AUC of 0.957 and identifying 72% of the antibacterial drugs in the top 1% of a mixed set of active and inactive substances68. In another study, they also used the ISE algorithm to predict the antifungal activity of NPs, resulting in a predictive model with an AUC of 0.89 that detected 42% of the antifungal drugs in the top 1% of the screened chemicals69. Moving beyond QSAR and ISE approaches, Stokes et al. in 2020 developed a deep neural network (DNN) model (Chemprop) to predict molecules with anti-bacterial activity, identifying a molecule called halicin that demonstrated bactericidal activity against various pathogens in murine models. Additionally, the model identified eight anti-bacterial compounds that were structurally different from known antibiotics, highlighting its potential for identifying novel anti-bacterial agents70. In 2023, Liu et al. trained the same algorithm (Chemprop) on a growth inhibition dataset for Acinetobacter baumannii. The authors then conducted in silico predictions for structurally novel molecules targeting A. baumannii, which led to the discovery of abaucin, an antibacterial compound exhibiting narrow-spectrum activity against A. baumannii. These findings show that Chemprop can be retrained to find antibacterial compounds against different pathogens71. Rather than relying on a single model, Egieyeh et al. trained four different binary classifiers, NB, RF, SMO (sequential minimal optimization), and VP (voted perceptron), on a dataset of NPs with in vitro antimalarial activity and applied their best models to 450 NPs from the InterBioScreen chemical library, achieving consistent antiplasmodial bioactivity class prediction for 54% of the compounds in the NP library72.
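Figures such as "72% of the antibacterial drugs in the top 1%" describe early enrichment: how many of the known actives a model places at the very top of a ranked library. A minimal sketch of that calculation follows. The function name and the synthetic library are assumptions for illustration and do not reproduce any dataset from the studies above.

```python
def recall_in_top_fraction(y_true, scores, fraction=0.01):
    """Fraction of all actives that rank within the top `fraction` of the scored library."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    actives_in_top = sum(label for _, label in ranked[:n_top])
    total_actives = sum(y_true)
    return actives_in_top / total_actives if total_actives else 0.0


# Synthetic library of 1000 compounds with 10 actives; scores simply decrease with index,
# so the first 10 compounds form the top 1%, and 7 of them are active.
labels = [1] * 7 + [0] * 3 + [1] * 3 + [0] * 987
scores = [1000 - i for i in range(1000)]
print(recall_in_top_fraction(labels, scores, fraction=0.01))
```

This metric matters in virtual screening because only the top-ranked compounds are usually purchased or isolated for testing, so performance further down the list has little practical effect.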
Anti-cancer
Several studies have utilized ML tools in anti-cancer drug discovery to predict the anti-cancer bioactivity of chemical compounds. Li et al. developed CDRUG (Cancer Drug), a web server that uses a hybrid score (HSCORE) to predict the anti-cancer bioactivity of NPs. The model was trained on a dataset of 8565 active compounds and 9804 inactive compounds from the NCI-60 Developmental Therapeutics Program (DTP) project, achieving an AUC of 0.878, indicating its effectiveness in distinguishing active from inactive compounds73. Using CDRUG, the group predicted the anti-cancer bioactivity of 21334 compounds from 2402 plants in the traditional Chinese medicine database (TCM), with 5278 compounds predicted as anti-cancer compounds and 346 compounds showing high potency in the 60 cancer cell lines test. Similarity analysis revealed that 75% of the 5278 compounds were highly similar to approved anti-cancer drugs74. In another study, Yue et al. developed an ML tool to predict the sensitivity of cancer cells to NPs using various cell lines. The study built DT, SVM, RF, and rotation forest (RoF) models for anti-cancer drug response prediction using both genomic characterizations (gene expression) and chemical descriptors. RoF achieved the best performance, with an AUC of 0.87 in 10-fold cross-validation, and curcumin and resveratrol were evaluated to validate the model75. Pereira et al. used an RF-based QSAR model to predict the antitumor and antibiotic activities of compounds, identifying 25 and 4 lead compounds for antibiotic and antitumor drug design, respectively76. The study validated the usefulness of quantum-chemical descriptors in discriminating biologically active from inactive compounds, and the predictive performance was better than that of the previous model using only CDK descriptors77. Cortés Ciriano et al. developed the KekuleScope tool, which applies CNNs to 2D compound representations, demonstrating that the in vitro activity of compounds on cancer cell lines and protein targets can be accurately predicted from their Kekulé structure representations alone. The results also showed that including additional fully-connected layers in the CNNs increased their predictive power by up to 10%, and that averaging the output of RF models and CNNs led to lower prediction errors for multiple datasets than either model alone78. In other studies, Rayan et al. used the ISE algorithm to create a model that predicts the anticancer activity of NPs, identifying twelve NPs as potential anticancer drug candidates79. Wang et al. employed a causation discovery algorithm, which displayed more robust performance than stepwise regression, to identify anti-cancer compounds from Panax ginseng (PG) extracts, with ginsenoside Rb1 identified as the most active compound80.
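The observation that averaging RF and CNN outputs can lower prediction error has a simple explanation: when two models err in different directions, their errors partly cancel. The sketch below demonstrates this with two sets of predictions and the root-mean-square error (RMSE). All values are invented for illustration and the two models are placeholders, not the RF and CNN models of the cited study.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between measured and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def average_models(*prediction_lists):
    """Average the predictions of several models compound by compound."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]


# Invented pIC50-like values: model A tends to overshoot where model B undershoots.
measured = [5.0, 6.0, 7.0, 8.0]
model_a = [5.4, 6.5, 6.8, 8.6]
model_b = [4.7, 5.6, 7.3, 7.5]
ensemble = average_models(model_a, model_b)

for name, predictions in [("A", model_a), ("B", model_b), ("ensemble", ensemble)]:
    print(name, round(rmse(measured, predictions), 3))
```

The benefit depends on the models making at least partly uncorrelated errors; averaging two models that fail in the same way would not help.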
Anti-inflammation
Anti-inflammatory drugs are known for their undesirable side effects. To tackle this issue, Galvez-Llompart et al. used Molecular Topology and LDA to develop a topological-mathematical model to identify new anti-inflammatory drugs from NPs. The model was validated externally and led to the discovery of 74 compounds with actual anti-inflammatory activity, 54 of which had been previously described in the literature as anti-inflammatory81. In a subsequent study, the same group developed a QSAR model based on molecular topology for predicting the IL-6-mediated (interleukin-6) anti-ulcerative colitis activity of compounds, which led to the discovery of four potentially bioactive compounds: alizarin-3-methylimino-N,N-diacetic acid (AMA), Calcein, (+)-dibenzyl-l-tartrate (DLT), and Ro 41–0960. In vitro testing on two cell lines demonstrated that three of these compounds were able to significantly reduce IL-6 levels, with Ro 41–0960 showing particular effectiveness. This study demonstrated the effectiveness of molecular topology as a tool for selecting potentially active compounds in the treatment of ulcerative colitis82. Separately, Aswad et al. developed a predictive model using the ISE algorithm to identify NPs with potential anti-inflammatory activity. The model was able to differentiate between active and inactive anti-inflammatory molecules and identified ten NPs as anti-inflammatory drug candidates, which highlights the potential of the ISE algorithm in identifying NPs with anti-inflammatory properties83.
InflamNat is an online tool which contains a database of 1351 NPs with their physicochemical properties, anti-inflammatory bioactivities, and molecular targets, along with two ML-based predictive tools specifically designed for NPs. The tools use a novel multi-tokenization transformer model (MTT) as a sequential encoder to predict the anti-inflammatory activity of NPs and the compound-target relationship. The experimental results showed that the proposed predictive tools achieved high accuracy in predicting both anti-inflammatory activity and compound-target interactions, with AUC values of 0.842 and 0.872, respectively. The study demonstrates the urgent need for well-curated databases and user-friendly predictive tools to facilitate NP-inspired drug development84.
Target prediction
Validating the molecular targets of NPs is crucial in identifying potential candidates for NP-based drugs. However, the traditional process of determining compound-target interaction requires extensive in vitro or in vivo experiments. To address this limitation, utilizing ML tools to predict the compound-target interaction can significantly reduce the required effort.
Several ML tools have been developed to predict protein targets of bioactive compounds. Keum et al. used data from the DrugBank database to develop six classification-prediction models for compound-target interactions in humans. Using these models, the study predicted the interactions of compounds from NPs and identified several disease-related proteins, including G-protein-coupled receptors (GPCRs), ion channels, enzymes, receptors, and transporters, as potential targets of natural herbal compounds85. Similarly, Cockroft et al. developed STarFish, a computational target fishing model that utilized kNN, RF, and MLP algorithms to identify protein targets of bioactive compounds by cross-referencing 20 NP databases with the ChEMBL bioactivity database. During cross-validation, the models achieved strong performance, with AUROC (area under the receiver operating characteristic curve) scores ranging from 0.94 to 0.99 and BEDROC (Boltzmann-enhanced discrimination of receiver operating characteristic) scores from 0.89 to 0.94, but their performance decreased when tested on the NP dataset. However, implementing a model stacking approach significantly improved the prediction of protein targets of NPs, with increased AUROC and BEDROC scores86.
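Because AUROC is the metric reported throughout this section, a minimal sketch of how it can be computed may help. It uses the pairwise-ranking interpretation: the fraction of (active, inactive) compound pairs in which the active compound receives the higher score, with ties counted as one half. The function name `auroc` and the example scores are assumptions introduced for illustration; BEDROC, which additionally up-weights actives ranked early in the list, is not implemented here.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (active, inactive) pairs ranked correctly.

    scores: predicted scores, higher means more likely active.
    labels: 1 for active, 0 for inactive.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    inactives = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not inactives:
        raise ValueError("need at least one active and one inactive compound")
    correct = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                correct += 1.0
            elif a == i:
                correct += 0.5  # ties count as half a correct ranking
    return correct / (len(actives) * len(inactives))

# Hypothetical scores: one active (0.2) is ranked below one inactive (0.8).
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The quadratic pairwise loop is chosen for clarity; rank-based implementations give the same value more efficiently on large datasets.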
Ozturk et al. proposed a deep learning model that predicted drug-target interaction (DTI) binding affinities by using only sequence information of both targets and drugs, which outperformed existing methods such as KronRLS and SimBoost. Unlike most computational methods that focus on binary classification, the proposed model utilized advanced deep learning algorithms such as CNNs to model protein sequences and compound 1D representations for binding affinity prediction87. Karimi et al. used a semi-supervised deep learning model that combines recurrent and convolutional neural networks (RNN-CNN) and integrates domain knowledge to predict target selectivity. The model outperformed conventional options, achieving relative errors in IC50 within 5-fold for test cases and within 20-fold for protein classes not included in training88. Their subsequent study curated a dataset with both affinities and contacts of compound-protein interactions and assessed the interpretability of various DeepAffinity versions. The model showed generalizability in affinity prediction and superior interpretability, with potential applications in contact-assisted docking, structure-free binding site prediction, and structure-activity relationship studies89. Lee et al. developed a deep learning model capable of predicting DTIs on a large scale using raw protein sequences, which can handle a variety of protein lengths and target protein classes90. In addition, Rifaioglu et al. proposed DEEPScreen, a large-scale DTI prediction system for early-stage drug discovery that employed deep CNNs to learn complex features from readily available 2D structural representations of compounds91.
Another study by Huang et al. described MolTrans, a deep learning model that improves DTI prediction for in silico drug discovery by incorporating a knowledge-inspired sub-structural pattern mining algorithm and an interaction modeling module. MolTrans also uses an augmented transformer encoder to better extract and capture semantic relations among sub-structures from massive unlabeled biomedical data, resulting in DTI prediction with increased accuracy and interpretability92.
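The "within 5-fold" and "within 20-fold" error figures quoted above for IC50 can be made concrete with a short sketch. It assumes affinities are expressed as pIC50 = -log10(IC50 in mol/L), so that a difference of d log units corresponds to a 10^d-fold ratio in IC50. The function names `fold_error` and `within_fold` and the example values are hypothetical and are not taken from the cited studies.

```python
import math

def fold_error(pic50_pred, pic50_true):
    """Ratio between predicted and true IC50, always >= 1.

    Assumes both inputs are pIC50 values (negative log10 of molar IC50).
    """
    return 10 ** abs(pic50_pred - pic50_true)

def within_fold(pic50_pred, pic50_true, fold):
    """True if the predicted IC50 is within `fold`-fold of the true IC50."""
    return fold_error(pic50_pred, pic50_true) <= fold

# Thresholds on the log scale: 5-fold is about 0.70 log units, 20-fold about 1.30.
print(round(math.log10(5), 2), round(math.log10(20), 2))

# A prediction that is off by 0.5 log units is within 5-fold;
# one that is off by 1.0 log unit is within 20-fold but not within 5-fold.
print(within_fold(6.5, 6.0, 5), within_fold(7.0, 6.0, 5), within_fold(7.0, 6.0, 20))
```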
In addition, ML tools have been developed for the prediction of specific target proteins such as protein kinase B (PKBβ)93, cytochrome P450 (CYP450)94, human plasma proteins95, sirtuin 1 (SIRT1)96, and estrogen receptor α (ERα)97. For instance, Davis et al. utilized a QSAR model to identify potential anti-cancer compounds from a seaweed metabolite database. Using a hybrid genetic algorithm and multiple linear regression analysis, they identified molecular descriptors that played a role in anti-cancer activity, with Baumann’s alignment-independent topological descriptors playing a significant role in variation of activity. Subsequently, they performed a docking study of two crystal structures of PKBβ to identify novel ATP-competitive inhibitors of PKBβ, with Callophycin A exhibiting better ligand efficiency than other PKBβ inhibitors. In silico pharmacokinetic and toxicity studies also showed that Callophycin A had a high drug score compared to other inhibitors93. Li et al. developed a multitask DNN model to predict the inhibitory effect of a compound against five major CYP450 isoforms, namely, 1A2, 2C9, 2C19, 2D6, and 3A4. They also built linear regression models to quantify how the other tasks contributed to the prediction difference of a given task between single-task and multi-task models. Furthermore, sensitivity analysis was applied to extract useful knowledge about CYP450 inhibition, which may shed light on the structural features of these isoforms and give hints about how to avoid side effects during drug development94. Sun et al. used six ML algorithms and 26 molecular descriptors to develop QSAR models that could predict plasma protein binding (PPB) fractions of 967 pharmaceuticals. The models demonstrated excellent performance and could be useful for chemists in predicting PPB from molecular structure.
Furthermore, the study identified important structural descriptors that contribute to the predictive power of the models, providing guidance for the modification of chemicals95. In another application, a QSAR model was used to generate an inhibitor structure pattern for SIRT1, a deacetylase enzyme associated with aging, diabetes, and cancer. The pattern was used for ligand-based virtual screening of over one million active compounds from Chinese herbs, leading to the identification of 12 compounds as SIRT1 inhibitors. Molecular docking software confirmed that three of these compounds had high affinity for SIRT196. In a separate study, Pang et al. developed two ML models, NB and recursive partitioning (RP), to identify ERα antagonists from an in-house NP library. The models predicted 162 compounds as ERα antagonists, which were then evaluated by molecular docking. Eight representative compounds were selected and tested in an ERα competitor assay and a luciferase reporter gene assay, showing varying levels of antagonistic activity against ERα97.
Future perspectives
ML has shown valuable potential in NP research, especially in genome mining, scaffold prediction, and the prediction of NP properties such as drug-likeness, toxicity, and biological activity98. However, there are several technical limitations that need to be addressed in order to fully exploit the potential of ML for NPs99, 100. One of the main limitations is the lack of integrated and standardized NP databases, which can serve as the training data for ML models. The available databases with structure and bioactivity information for NPs (e.g., ChEMBL, PubChem, ZINC NPs) and databases for BGCs (e.g., antiSMASH, MIBiG, BiG-FAM) have been extensively reviewed9, 11. The existing databases are often incomplete, contain errors, and lack standardized annotations, making it difficult to train accurate ML models. The solution to this limitation is to construct high-quality and large-scale NP databases that are standardized and comprehensive, such as the recently launched NPAtlas database101. Another limitation is the featurization of NP structures, which involves transforming chemical structures into numerical descriptors that can be used as inputs for ML models12. Traditional featurization methods may not capture the unique structural features of NPs, requiring the development of new featurization methods that incorporate the structural diversity and complexity of NPs. An example of such a method is the DeepChem library102, which uses deep learning to generate molecular representations that capture 3D structural information. A third limitation is the lack of ML algorithms that can handle small and biased datasets, which are common in NP research103. Traditional ML algorithms may not perform well on small datasets or when the classes are imbalanced.
To overcome the challenges posed by small and imbalanced datasets in NP discovery, various techniques to enhance the performance of ML models have been proposed, such as data augmentation, transfer learning, contrastive learning, and ensemble methods. By applying these methods, ML models can better handle limited and unevenly distributed data, leading to improved prediction performance on NP discovery104, 105. Leveraging transfer learning and multitask learning strategies can significantly boost the efficiency and efficacy of ML models for NP discovery. By pre-training models on vast datasets from related domains and subsequently fine-tuning them on smaller NP datasets, the models can adapt and generalize to the specific context of NPs. This approach not only leads to more accurate predictions but also reduces the data requirements for training, making it particularly valuable in scenarios with limited available data. The prospect of detecting NPs with true novelty and accuracy remains a challenge due to the limited and unbalanced training data consisting of canonical BGCs. A possible solution to this limitation is the integration of ML with rule-based models that use predefined rules or logic to make decisions. In the context of imbalanced datasets, combining ML with rule-based models can help improve the performance and generalization of the predictions. This approach could improve the detection of BGCs that deviate significantly from existing biosynthetic schemes. Finally, the integration of ML with other computational approaches, such as molecular docking, molecular dynamics simulations, and quantum chemical calculations, offers a promising direction in NP research. Hybrid models that combine ML with these complementary techniques can provide a more comprehensive understanding of the interactions and activities of NPs106. This synergy allows researchers to gain deeper insights into the molecular mechanisms underlying NP actions. 
Additionally, the use of NLP can improve the efficiency of data extraction from the vast amount of literature on NPs. However, the use of NLP in NP research is still in its early stage, and there are several challenges to overcome, such as the complexity and variability of natural language and the lack of standardized annotations47.
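The class-imbalance problem discussed in this section can be illustrated with a deliberately simple baseline. The techniques named in the text (data augmentation, transfer learning, contrastive learning, and ensemble methods) each require a full model to demonstrate, so the sketch below instead shows a different and simpler idea: inverse-frequency class weights, where each class is weighted by n_samples / (n_classes × class count) so that rare classes contribute as much to the training loss as common ones. The helper name `balanced_class_weights` and the 90/10 label split are assumptions made for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency.

    Uses n_samples / (n_classes * count), a common heuristic that gives
    every class the same total weight across the dataset.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical dataset: 90 inactive (0) and 10 active (1) compounds.
labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Such weights would typically be passed to a loss function or classifier that supports per-class weighting; whether this improves NP activity prediction in a given case remains an empirical question.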
Conclusions
ML has emerged as a powerful tool for NP discovery, assisting in genome mining and enabling the prediction of bioactivity. This review summarizes the various ML tools utilized in genome mining and bioactivity prediction, along with the associated limitations and potential solutions in the NP research field. Although there are many technical challenges associated with the use of ML tools for NPs, the ongoing development and application of these tools hold immense promise in the discovery of new NPs and understanding of their biological effects.
Appended glossary of ML terms:
| Name | Abbreviation | Description |
|---|---|---|
| Support Vector Machine | SVM | Supervised machine learning algorithm used for classification, regression, and outlier detection analysis. |
| Natural Language Processing | NLP | A machine learning technology that focuses on enabling computers to understand, interpret, and generate human language. |
| Recurrent Neural Network | RNN | A type of artificial neural network designed to model sequential data by allowing the network to persist information from previous time steps. |
| Long Short-Term Memory | LSTM | A type of RNN architecture designed to handle the vanishing gradient problem in standard RNNs. |
| Bidirectional Long Short-Term Memory | BiLSTM | A variant of the LSTM network that captures the dependencies of a sequence in both forward and backward directions. |
| Convolutional neural network | CNN | A type of neural network designed for image recognition and processing. |
| Bidirectional Encoder Representations from Transformers | BERT | A pre-trained natural language processing model using an unsupervised learning approach. |
| Naive Bayes | NB | A probabilistic classification algorithm based on Bayes’ theorem, which is commonly used in text classification and spam filtering. |
| Random Forest | RF | A type of ensemble ML algorithm that combines multiple decision trees to improve the accuracy and robustness of the model. |
| Sequential Minimal Optimization | SMO | A popular algorithm for solving the optimization problem in SVMs, to find the optimal values of the parameters that define the SVM hyperplane. |
| Voted Perceptron | VP | A type of Perceptron algorithm that uses multiple weight vectors instead of a single weight vector for binary classification. |
| Iterative Stochastic Elimination | ISE | A type of wrapper method evaluating different subsets of features by iteratively removing one feature at a time based on their importance, until a desired level of accuracy is achieved. |
| Gaussian Processes | GP | A type of non-parametric model that is used to model complex, non-linear relationships between variables, without making any assumptions about the underlying distribution of the data. |
| Deep Neural Network | DNN | A type of artificial neural network that is composed of multiple layers of interconnected processing nodes. |
| Directed-message Passing Deep Neural Network | DMP-DNN | A type of deep learning architecture that is used for processing and modeling graph-structured data. |
| Classification Tree/ Decision Tree | CT/ DT | An ML model that is constructed by recursively partitioning the input space into smaller regions, and used for classification and regression tasks. |
| Frequency-Weighted Fingerprint | FWF | A binary vector that encodes the presence or absence of certain chemical substructures in a molecule. |
| Tanimoto Coefficient | TC | A similarity metric used to measure the similarity between two molecular fingerprints. |
| MinMax Kernel | KMM | A type of kernel function, and a similarity measure between two data points in a feature space. |
| Rotation Forest | ROF | An ensemble learning method combining multiple decision tree classifiers into a single model. |
| Linear Discriminant Analysis | LDA | A supervised learning method that seeks to find a linear combination of features that best separates the classes of a given dataset. |
| k-Nearest Neighbor | kNN | A non-parametric and simple algorithm that makes predictions based on the similarity between a new data point and its k nearest neighbors in the training dataset. |
| Multilayer perceptron/ Multilayer Neural Network | MLP/ MNN | A type of feedforward artificial neural network composed of multiple layers of interconnected processing nodes that is widely used for supervised learning tasks such as classification, regression, and prediction. |
| Graph Neural Network | GNN | A type of neural network designed to operate on data structured as graphs and used for tasks such as node classification, link prediction, and graph classification. |
| Graph Convolutional Network | GCN | A type of GNN that uses a convolution-like operation to aggregate information from neighboring nodes in a graph. |
| Graph Isomorphism Network | GIN | A type of GNN that consists of multiple graph convolutional layers and aims to address the problem of graph isomorphism. |
| Quantitative Structure-Activity Relationship | QSAR | Using statistical and ML techniques to establish a relationship between a set of molecular descriptors (such as molecular weight, shape, and chemical properties) and the activity or property of interest (such as biological activity, solubility, or toxicity). |
| Boost Tree | BT | A type of ensemble learning method for combining multiple weak learners to form a strong learner and used for both regression and classification tasks. |
| Multiple Linear Regression | MLR | A statistical modeling technique used to analyze the relationship between two or more independent variables and a dependent variable. |
| Support Vector Regression | SVR | A variation of SVM used for regression analysis. |
| Recursive Partitioning | RP | Involves recursively splitting the data into smaller subsets based on the values of the input variables to create a decision tree to make predictions and used for classification and regression tasks. |
| Multi-Tokenization Transformer | MTT | A type of neural network architecture used in natural language processing tasks, such as language modeling and text classification. |
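Two of the glossary entries, the Tanimoto coefficient and the MinMax kernel, are simple enough to show as a minimal sketch. It assumes that a binary fingerprint is represented as the set of indices of its "on" bits and that a count fingerprint is a dictionary mapping a substructure key to its count. The function names, the feature keys, and the example values are hypothetical, and returning 1.0 for two empty fingerprints is a convention chosen here.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient: shared 'on' bits divided by all 'on' bits."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 1.0  # assumed convention for two empty fingerprints
    return len(a & b) / len(union)

def minmax_kernel(counts_a, counts_b):
    """MinMax kernel: sum of element-wise minima over sum of element-wise maxima."""
    keys = set(counts_a) | set(counts_b)
    numerator = sum(min(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    denominator = sum(max(counts_a.get(k, 0), counts_b.get(k, 0)) for k in keys)
    if denominator == 0:
        return 1.0  # assumed convention for two empty count vectors
    return numerator / denominator

# Hypothetical binary fingerprints: 2 shared bits out of 5 distinct bits.
print(tanimoto({1, 4, 7, 9}, {4, 7, 8}))

# Hypothetical substructure counts for two molecules.
print(minmax_kernel({"C": 3, "N": 1}, {"C": 1, "O": 2}))
```

On binary fingerprints the two measures coincide, since the minimum and maximum of 0/1 values reduce to intersection and union.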
Acknowledgements
This work was supported by an AI Research Institutes program funded by the U.S. National Science Foundation under grant no. 2019897 (H.Z.) and by a grant from the National Institutes of Health (AI144967 to H.Z.).
Footnotes
Conflicts of interest
There are no conflicts to declare.
References
- 1.Newman DJ; Cragg GM, Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. Journal of natural products 2020, 83 (3), 770–803.
- 2.Ayikpoe RS; Shi C; Battiste AJ; Eslami SM; Ramesh S; Simon MA; Bothwell IR; Lee H; Rice AJ; Ren H, A scalable platform to discover antimicrobials of ribosomal origin. Nature communications 2022, 13 (1), 6135.
- 3.Yuan Y; Cheng S; Bian G; Yan P; Ma Z; Dai W; Chen R; Fu S; Huang H; Chi H, Efficient exploration of terpenoid biosynthetic gene clusters in filamentous fungi. Nature Catalysis 2022, 5 (4), 277–287.
- 4.Zhang MM; Wong FT; Wang Y; Luo S; Lim YH; Heng E; Yeo WL; Cobb RE; Enghiad B; Ang EL, CRISPR–Cas9 strategy for activation of silent Streptomyces biosynthetic gene clusters. Nature chemical biology 2017, 13 (6), 607–609.
- 5.Culp EJ; Yim G; Waglechner N; Wang W; Pawlowski AC; Wright GD, Hidden antibiotics in actinomycetes can be identified by inactivation of gene clusters for common antibiotics. Nature Biotechnology 2019, 37 (10), 1149–1154.
- 6.Xu F; Wu Y; Zhang C; Davis KM; Moon K; Bushin LB; Seyedsayamdost MR, A genetics-free method for high-throughput discovery of cryptic microbial metabolites. Nature chemical biology 2019, 15 (2), 161–168.
- 7.Bok JW; Keller NP, LaeA, a regulator of secondary metabolism in Aspergillus spp. Eukaryotic cell 2004, 3 (2), 527–535.
- 8.Mao XM; Xu W; Li D; Yin WB; Chooi YH; Li YQ; Tang Y; Hu Y, Epigenetic genome mining of an endophytic fungus leads to the pleiotropic biosynthesis of natural products. Angewandte Chemie 2015, 127 (26), 7702–7706.
- 9.Hemmerling F; Piel J, Strategies to access biosynthetic novelty in bacterial genomes for drug discovery. Nature Reviews Drug Discovery 2022, 21 (5), 359–378.
- 10.Ren H; Shi C; Zhao H, Computational tools for discovering and engineering natural product biosynthetic pathways. iScience 2020, 23 (1), 100795.
- 11.Zhang R; Li X; Zhang X; Qin H; Xiao W, Machine learning approaches for elucidating the biological effects of natural products. Natural Product Reports 2021, 38 (2), 346–361.
- 12.Jeon J; Kang S; Kim HU, Predicting biochemical and physiological effects of natural products from molecular structures using machine learning. Natural Product Reports 2021, 38 (11), 1954–1966.
- 13.Prihoda D; Maritz JM; Klempir O; Dzamba D; Woelk CH; Hazuda DJ; Bitton DA; Hannigan GD, The application potential of machine learning and genomics for understanding natural product diversity, chemistry, and therapeutic translatability. Natural Product Reports 2021, 38 (6), 1100–1108.
- 14.Weininger D, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of chemical information and computer sciences 1988, 28 (1), 31–36.
- 15.James CA, Daylight theory manual. http://www.daylight.com/dayhtml/doc/theory/theory.toc.html 2004.
- 16.Mayhoub M; Carter D, Towards hybrid lighting systems: A review. Lighting Research & Technology 2010, 42 (1), 51–71.
- 17.OEChem T, Openeye scientific software. Inc., Santa Fe, NM, USA 2012.
- 18.Heller SR; McNaught A; Pletnev I; Stein S; Tchekhovskoi D, InChI, the IUPAC international chemical identifier. Journal of cheminformatics 2015, 7 (1), 1–34.
- 19.O’Boyle N; Dalke A, DeepSMILES: an adaptation of SMILES for use in machine-learning of chemical structures. 2018.
- 20.Tiidenberg K; Gómez Cruz E, Selfies, image and the re-making of the body. Body & society 2015, 21 (4), 77–102.
- 21.Rogers D; Hahn M, Extended-connectivity fingerprints. Journal of chemical information and modeling 2010, 50 (5), 742–754.
- 22.Polton D, Installation and operational experiences with MACCS (Molecular Access System). Online Review 1982, 6 (3), 235–242.
- 23.Scarselli F; Gori M; Tsoi AC; Hagenbuchner M; Monfardini G, The graph neural network model. IEEE transactions on neural networks 2008, 20 (1), 61–80.
- 24.Yamashita R; Nishio M; Do RKG; Togashi K, Convolutional neural networks: an overview and application in radiology. Insights into imaging 2018, 9, 611–629.
- 25.Wen W; Wu C; Wang Y; Chen Y; Li H, Learning structured sparsity in deep neural networks. Advances in neural information processing systems 2016, 29.
- 26.Balakrishnama S; Ganapathiraju A, Linear discriminant analysis-a brief tutorial. Institute for Signal and information Processing 1998, 18 (1998), 1–8.
- 27.Zhang H, The optimality of naive Bayes. Aa 2004, 1 (2), 3.
- 28.Noble WS, What is a support vector machine? Nature biotechnology 2006, 24 (12), 1565–1567.
- 29.Laurent H; Rivest RL, Constructing optimal binary decision trees is NP-complete. Information processing letters 1976, 5 (1), 15–17.
- 30.Belgiu M; Drăguţ L, Random forest in remote sensing: A review of applications and future directions. ISPRS journal of photogrammetry and remote sensing 2016, 114, 24–31.
- 31.Blin K; Shaw S; Kloosterman AM; Charlop-Powers Z; van Wezel GP; Medema MH; Weber T, antiSMASH 6.0: improving cluster detection and comparison capabilities. Nucleic Acids Res 2021, 49 (W1), W29–W35.
- 32.Skinnider MA; Johnston CW; Gunabalasingam M; Merwin NJ; Kieliszek AM; MacLellan RJ; Li H; Ranieri MRM; Webster ALH; Cao MPT; Pfeifle A; Spencer N; To QH; Wallace DP; Dejong CA; Magarvey NA, Comprehensive prediction of secondary metabolite structure and biological activity from microbial genome sequences. Nat Commun 2020, 11 (1), 6058.
- 33.Walsh CT, Insights into the chemical logic and enzymatic machinery of NRPS assembly lines. Nat Prod Rep 2016, 33 (2), 127–135.
- 34.Rottig M; Medema MH; Blin K; Weber T; Rausch C; Kohlbacher O, NRPSpredictor2--a web server for predicting NRPS adenylation domain specificity. Nucleic Acids Res 2011, 39 (Web Server issue), W362–7.
- 35.Montalban-Lopez M; Scott TA; Ramesh S; Rahman IR; van Heel AJ; Viel JH; Bandarian V; Dittmann E; Genilloud O; Goto Y; Burgos MJG; Hill C; Kim S; Koehnke J; Latham JA; Link AJ; Martinez B; Nair SK; Nicolet Y; Rebuffat S; Sahl HG; Sareen D; Schmidt EW; Schmitt L; Severinov K; Sussmuth RD; Truman AW; Wang H; Weng JK; van Wezel GP; Zhang Q; Zhong J; Piel J; Mitchell DA; Kuipers OP; van der Donk WA, New developments in RiPP discovery, enzymology and engineering. Nat Prod Rep 2021, 38 (1), 130–239.
- 36.Agrawal P; Khater S; Gupta M; Sain N; Mohanty D, RiPPMiner: a bioinformatics resource for deciphering chemical structures of RiPPs based on prediction of cleavage and cross-links. Nucleic Acids Research 2017, 45 (W1), W80–W88.
- 37.Agrawal P; Amir S; Deepak; Barua D; Mohanty D, RiPPMiner-Genome: A Web Resource for Automated Prediction of Crosslinked Chemical Structures of RiPPs by Genome Mining. J Mol Biol 2021, 433 (11).
- 38.Tietz JI; Schwalen CJ; Patel PS; Maxson T; Blair PM; Tai HC; Zakai UI; Mitchell DA, A new genome-mining tool redefines the lasso peptide biosynthetic landscape. Nature Chemical Biology 2017, 13 (5), 470–+.
- 39.Schwalen CJ; Hudson GA; Kille B; Mitchell DA, Bioinformatic Expansion and Discovery of Thiopeptide Antibiotics. J Am Chem Soc 2018, 140 (30), 9494–9501.
- 40.Walker MC; Eslami SM; Hetrick KJ; Ackenhusen SE; Mitchell DA; van der Donk WA, Precursor peptide-targeted mining of more than one hundred thousand genomes expands the lanthipeptide natural product family. Bmc Genomics 2020, 21 (1).
- 41.Georgiou MA; Dommaraju SR; Guo X; Mast DH; Mitchell DA, Bioinformatic and Reactivity-Based Discovery of Linaridins. Acs Chem Biol 2020, 15 (11), 2976–2985.
- 42.Hudson GA; Burkhart BJ; DiCaprio AJ; Schwalen CJ; Kille B; Pogorelov TV; Mitchell DA, Bioinformatic Mapping of Radical S-Adenosylmethionine-Dependent Ribosomally Synthesized and Post-Translationally Modified Peptides Identifies New C alpha, C beta, and C gamma-Linked Thioether-Containing Peptides. J Am Chem Soc 2019, 141 (20), 8228–8238.
- 43.Ramesh S; Guo X; DiCaprio AJ; De Lio AM; Harris LA; Kille BL; Pogorelov TV; Mitchell DA, Bioinformatics-Guided Expansion and Discovery of Graspetides. Acs Chem Biol 2021, 16 (12), 2787–2797.
- 44.Kloosterman AM; Cimermancic P; Elsayed SS; Du C; Hadjithomas M; Donia MS; Fischbach MA; van Wezel GP; Medema MH, Expansion of RiPP biosynthetic space through integration of pan-genomics and machine learning uncovers a novel class of lantibiotics. Plos Biol 2020, 18 (12).
- 45.de Los Santos ELC, NeuRiPP: Neural network identification of RiPP precursor peptides. Sci Rep 2019, 9 (1), 13406.
- 46.Merwin NJ; Mousa WK; Dejong CA; Skinnider MA; Cannon MJ; Li HX; Dial K; Gunabalasingam M; Johnston C; Magarvey NA, DeepRiPP integrates multiomics data to automate discovery of novel ribosomally synthesized natural products. P Natl Acad Sci USA 2020, 117 (1), 371–380.
- 47.Nadkarni PM; Ohno-Machado L; Chapman WW, Natural language processing: an introduction. Journal of the American Medical Informatics Association 2011, 18 (5), 544–551.
- 48.Sharma R; Shrivastava S; Kumar Singh S; Kumar A; Saxena S; Kumar Singh R, AniAMPpred: artificial intelligence guided discovery of novel antimicrobial peptides in animal kingdom. Briefings in Bioinformatics 2021, 22 (6), bbab242.
- 49.Ma Y; Guo Z; Xia B; Zhang Y; Liu X; Yu Y; Tang N; Tong X; Wang M; Ye X, Identification of antimicrobial peptides from the human gut microbiome using deep learning. Nature Biotechnology 2022, 40 (6), 921–931.
- 50.Yan J; Cai J; Zhang B; Wang Y; Wong DF; Siu SW, Recent Progress in the Discovery and Design of Antimicrobial Peptides Using Traditional Machine Learning and Deep Learning. Antibiotics 2022, 11 (10), 1451.
- 51.Hannigan GD; Prihoda D; Palicka A; Soukup J; Klempir O; Rampula L; Durcak J; Wurst M; Kotowski J; Chang D; Wang RR; Piizzi G; Temesi G; Hazuda DJ; Woelk CH; Bitton DA, A deep learning genome-mining strategy for biosynthetic gene cluster prediction. Nucleic Acids Research 2019, 47 (18).
- 52.Yang Z; Liao B; Hsieh C; Han C; Fang L; Zhang S, Deep-BGCpred: A unified deep learning genome-mining framework for biosynthetic gene cluster prediction. bioRxiv 2021, 2021.11. 15.468547.
- 53.Cimermancic P; Medema MH; Claesen J; Kurita K; Brown LCW; Mavrommatis K; Pati A; Godfrey PA; Koehrsen M; Clardy J; Birren BW; Takano E; Sali A; Linington RG; Fischbach MA, Insights into Secondary Metabolism from a Global Analysis of Prokaryotic Biosynthetic Gene Clusters. Cell 2014, 158 (2), 412–421.
- 54.Rios-Martinez C; Bhattacharya N; Amini AP; Crawford L; Yang KK, Deep self-supervised learning for biosynthetic gene cluster detection and product classification. bioRxiv 2022, 2022.07. 22.500861.
- 55.Kalchbrenner N; Espeholt L; Simonyan K; van den Oord A; Graves A; Kavukcuoglu K, Neural machine translation in linear time. ArXiv preprint, 2016. URL https://arxiv.org/abs/1610.10099.
- 56.Rives A; Meier J; Sercu T; Goyal S; Lin Z; Liu J; Guo D; Ott M; Zitnick CL; Ma J, Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences 2021, 118 (15), e2016239118.
- 57.Medema MH; Kottmann R; Yilmaz P; Cummings M; Biggins JB; Blin K; de Bruijn I; Chooi YH; Claesen J; Coates RC; Cruz-Morales P; Duddela S; Dusterhus S; Edwards DJ; Fewer DP; Garg N; Geiger C; Gomez-Escribano JP; Greule A; Hadjithomas M; Haines AS; Helfrich EJN; Hillwig ML; Ishida K; Jones AC; Jones CS; Jungmann K; Kegler C; Kim HU; Kotter P; Krug D; Masschelein J; Melnik AV; Mantovani SM; Monroe EA; Moore M; Moss N; Nutzmann HW; Pan GH; Pati A; Petras D; Reen FJ; Rosconi F; Rui Z; Tian ZH; Tobias NJ; Tsunematsu Y; Wiemann P; Wyckoff E; Yan XH; Yim G; Yu FG; Xie YC; Aigle B; Apel AK; Balibar CJ; Balskus EP; Barona-Gomez F; Bechthold A; Bode HB; Borriss R; Brady SF; Brakhage AA; Caffrey P; Cheng YQ; Clardy J; Cox RJ; De Mot R; Donadio S; Donia MS; van der Donk WA; Dorrestein PC; Doyle S; Driessen AJM; Ehling-Schulz M; Entian KD; Fischbach MA; Gerwick L; Gerwick WH; Gross H; Gust B; Hertweck C; Hofte M; Jensen SE; Ju JH; Katz L; Kaysser L; Klassen JL; Keller NP; Kormanec J; Kuipers OP; Kuzuyama T; Kyrpides NC; Kwon HJ; Lautru S; Lavigne R; Lee CY; Linquan B; Liu XY; Liu W; Luzhetskyy A; Mahmud T; Mast Y; Mendez C; Metsa-Ketela M; Micklefield J; Mitchell DA; Moore BS; Moreira LM; Muller R; Neilan BA; Nett M; Nielsen J; O’Gara F; Oikawa H; Osbourn A; Osburne MS; Ostash B; Payne SM; Pernodet JL; Petricek M; Piel J; Ploux O; Raaijmakers JM; Salas JA; Schmitt EK; Scott B; Seipke RF; Shen B; Sherman DH; Sivonen K; Smanski MJ; Sosio M; Stegmann E; Sussmuth RD; Tahlan K; Thomas CM; Tang Y; Truman AW; Viaud M; Walton JD; Walsh CT; Weber T; van Wezel GP; Wilkinson B; Willey JM; Wohlleben W; Wright GD; Ziemert N; Zhang CS; Zotchev SB; Breitling R; Takano E; Glockner FO, Minimum Information about a Biosynthetic Gene cluster. Nature Chemical Biology 2015, 11 (9), 625–631.
- 58.Kautsar SA; Blin K; Shaw S; Navarro-Munoz JC; Terlouw BR; van der Hooft JJJ; van Santen JA; Tracanna V; Duran HGS; Andreu VP; Selem-Mojica N; Alanjary M; Robinson SL; Lund G; Epstein SC; Sisto AC; Charkoudian L; Collemare J; Linington RG; Weber T; Medema MH, MIBiG 2.0: a repository for biosynthetic gene clusters of known function. Nucleic Acids Research 2020, 48 (D1), D454–D458.
- 59.Skinnider MA; Johnston CW; Edgar RE; Dejong CA; Merwin NJ; Rees PN; Magarvey NA, Genomic charting of ribosomally synthesized natural product chemical space facilitates targeted mining. P Natl Acad Sci USA 2016, 113 (42), E6343–E6351.
- 60.Li J; Qu XD; He XY; Duan L; Wu GJ; Bi DX; Deng ZX; Liu W; Ou HY, ThioFinder: A Web-Based Tool for the Identification of Thiopeptide Gene Clusters in DNA Sequences. Plos One 2012, 7 (9).
- 61.Aguilera-Mendoza L; Marrero-Ponce Y; García-Jacas CR; Chavez E; Beltran JA; Guillen-Ramirez HA; Brizuela CA, Automatic construction of molecular similarity networks for visual graph mining in chemical space of bioactive peptides: an unsupervised learning approach. Scientific reports 2020, 10 (1), 18074.
- 62.Veltri D; Kamath U; Shehu A, Deep learning improves antimicrobial peptide recognition. Bioinformatics 2018, 34 (16), 2740–2747.
- 63.Wang G; Li X; Wang Z, APD3: the antimicrobial peptide database as a tool for research and education. Nucleic acids research 2016, 44 (D1), D1087–D1093.
- 64.Waghu FH; Barai RS; Gurung P; Idicula-Thomas S, CAMPR3: a database on sequences, structures and signatures of antimicrobial peptides. Nucleic acids research 2016, 44 (D1), D1094–D1097.
- 65.Zhao X; Wu H; Lu H; Li G; Huang Q, LAMP: a database linking antimicrobial peptides. PloS one 2013, 8 (6), e66557.
- 66.Walker AS; Clardy J, A machine learning bioinformatics method to predict biological activity from biosynthetic gene clusters. Journal of Chemical Information and Modeling 2021, 61 (6), 2560–2571.
- 67.Dias T; Gaudêncio SP; Pereira F, A computer-driven approach to discover natural product leads for methicillin-resistant Staphylococcus aureus infection therapy. Marine drugs 2018, 17 (1), 16.
- 68.Masalha M; Rayan M; Adawi A; Abdallah Z; Rayan A, Capturing antibacterial natural products with in silico techniques. Molecular Medicine Reports 2018, 18 (1), 763–770.
- 69.Rayan M; Abdallah Z; Abu-Lafi S; Masalha M; Rayan A, Indexing natural products for their antifungal activity by filters-based approach: Disclosure of discriminative properties. Current Computer-Aided Drug Design 2019, 15 (3), 235–242.
- 70.Stokes JM; Yang K; Swanson K; Jin W; Cubillos-Ruiz A; Donghia NM; MacNair CR; French S; Carfrae LA; Bloom-Ackermann Z, A deep learning approach to antibiotic discovery. Cell 2020, 180 (4), 688–702.e13.
- 71.Liu G; Catacutan DB; Rathod K; Swanson K; Jin W; Mohammed JC; Chiappino-Pepe A; Syed SA; Fragis M; Rachwalski K, Deep learning-guided discovery of an antibiotic targeting Acinetobacter baumannii. Nature Chemical Biology 2023, 1–9.
- 72.Egieyeh S; Syce J; Malan SF; Christoffels A, Predictive classifier models built from natural products with antimalarial bioactivity using machine learning approach. PLoS One 2018, 13 (9), e0204644.
- 73.Li G-H; Huang J-F, CDRUG: a web server for predicting anticancer activity of chemical compounds. Bioinformatics 2012, 28 (24), 3334–3335.
- 74.Dai S-X; Li W-X; Han F-F; Guo Y-C; Zheng J-J; Liu J-Q; Wang Q; Gao Y-D; Li G-H; Huang J-F, In silico identification of anti-cancer compounds and plants from traditional Chinese medicine database. Scientific reports 2016, 6 (1), 1–11.
- 75.Yue Z; Zhang W; Lu Y; Yang Q; Ding Q; Xia J; Chen Y, Prediction of cancer cell sensitivity to natural products based on genomic and chemical properties. PeerJ 2015, 3, e1425.
- 76.Pereira F; Latino DA; Gaudêncio SP, QSAR-assisted virtual screening of lead-like molecules from marine and microbial natural sources for antitumor and antibiotic drug discovery. Molecules 2015, 20 (3), 4848–4873.
- 77.Pereira F; Latino DA; Gaudêncio SP, A chemoinformatics approach to the discovery of lead-like molecules from marine and microbial sources en route to antitumor and antibiotic drugs. Marine drugs 2014, 12 (2), 757–778.
- 78.Cortés-Ciriano I; Bender A, KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. Journal of cheminformatics 2019, 11 (1), 1–16.
- 79.Rayan A; Raiyn J; Falah M, Nature is the best source of anticancer drugs: Indexing natural products for their anticancer bioactivity. PloS one 2017, 12 (11), e0187925.
- 80.Wang Y; Jin Y; Zhou C; Qu H; Cheng Y, Discovering active compounds from mixture of natural products by data mining approach. Medical & biological engineering & computing 2008, 46, 605–611.
- 81.Galvez-Llompart M; Zanni R; García-Domenech R, Modeling natural anti-inflammatory compounds by molecular topology. International Journal of Molecular Sciences 2011, 12 (12), 9481–9503.
- 82.Galvez-Llompart M; del Carmen Recio Iglesias M; Gálvez J; García-Domenech R, Novel potential agents for ulcerative colitis by molecular topology: suppression of IL-6 production in Caco-2 and RAW 264.7 cell lines. Molecular diversity 2013, 17, 573–593.
- 83.Aswad M; Rayan M; Abu-Lafi S; Falah M; Raiyn J; Abdallah Z; Rayan A, Nature is the best source of anti-inflammatory drugs: Indexing natural products for their anti-inflammatory bioactivity. Inflammation Research 2018, 67, 67–75.
- 84.Zhang R; Ren S; Dai Q; Shen T; Li X; Li J; Xiao W, InflamNat: web-based database and predictor of anti-inflammatory natural products. Journal of Cheminformatics 2022, 14 (1), 1–11.
- 85.Keum J; Yoo S; Lee D; Nam H, Prediction of compound-target interactions of natural products using large-scale drug and protein information. BMC bioinformatics 2016, 17 (6), 417–425.
- 86.Cockroft NT; Cheng X; Fuchs JR, STarFish: a stacked ensemble target fishing approach and its application to natural products. Journal of chemical information and modeling 2019, 59 (11), 4906–4920.
- 87.Öztürk H; Özgür A; Ozkirimli E, DeepDTA: deep drug–target binding affinity prediction. Bioinformatics 2018, 34 (17), i821–i829.
- 88.Karimi M; Wu D; Wang Z; Shen Y, DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 2019, 35 (18), 3329–3338.
- 89.Karimi M; Wu D; Wang Z; Shen Y, Explainable deep relational networks for predicting compound–protein affinities and contacts. Journal of chemical information and modeling 2020, 61 (1), 46–66.
- 90.Lee I; Keum J; Nam H, DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS computational biology 2019, 15 (6), e1007129.
- 91.Rifaioglu AS; Nalbat E; Atalay V; Martin MJ; Cetin-Atalay R; Doğan T, DEEPScreen: high performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chemical science 2020, 11 (9), 2531–2557.
- 92.Huang K; Xiao C; Glass LM; Sun J, MolTrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics 2021, 37 (6), 830–836.
- 93.Davis GDJ; Vasanthi AHR, QSAR based docking studies of marine algal anticancer compounds as inhibitors of protein kinase B (PKBβ). European Journal of Pharmaceutical Sciences 2015, 76, 110–118.
- 94.Li X; Xu Y; Lai L; Pei J, Prediction of human cytochrome P450 inhibition using a multitask deep autoencoder neural network. Molecular Pharmaceutics 2018, 15 (10), 4336–4345.
- 95.Sun L; Yang H; Li J; Wang T; Li W; Liu G; Tang Y, In silico prediction of compounds binding to human plasma proteins by QSAR models. ChemMedChem 2018, 13 (6), 572–581.
- 96.Sun Y; Zhou H; Zhu H; Leung SW, Ligand-based virtual screening and inductive learning for identification of SIRT1 inhibitors in natural products. Scientific Reports 2016, 6 (1), 1–10.
- 97.Pang X; Fu W; Wang J; Kang D; Xu L; Zhao Y; Liu A-L; Du G-H, Identification of estrogen receptor α antagonists from natural products via in vitro and in silico approaches. Oxidative medicine and cellular longevity 2018, 2018.
- 98.Saldívar-González F; Aldas-Bulos V; Medina-Franco J; Plisson F, Natural product drug discovery in the artificial intelligence era. Chemical Science 2022, 13 (6), 1526–1546.
- 99.Greener JG; Kandathil SM; Moffat L; Jones DT, A guide to machine learning for biologists. Nature Reviews Molecular Cell Biology 2022, 23 (1), 40–55.
- 100.Sapoval N; Aghazadeh A; Nute MG; Antunes DA; Balaji A; Baraniuk R; Barberan C; Dannenfelser R; Dun C; Edrisi M, Current progress and open challenges for applying deep learning across the biosciences. Nature Communications 2022, 13 (1), 1728.
- 101.van Santen JA; Poynton EF; Iskakova D; McMann E; Alsup TA; Clark TN; Fergusson CH; Fewer DP; Hughes AH; McCadden CA, The Natural Products Atlas 2.0: A database of microbially-derived natural products. Nucleic acids research 2022, 50 (D1), D1317–D1323.
- 102.Ramsundar B; Pande V; Eastman P; Feinberg E; Gomes J; Leswing K; Pappu A; Wu M, Democratizing deep-learning for drug discovery, quantum chemistry, materials science and biology. GitHub repository 2016.
- 103.Kaur H; Pannu HS; Malhi AK, A systematic review on imbalanced data challenges in machine learning: Applications and solutions. ACM Computing Surveys (CSUR) 2019, 52 (4), 1–36.
- 104.Yu T; Cui H; Li JC; Luo Y; Jiang G; Zhao H, Enzyme function prediction using contrastive learning. Science 2023, 379 (6639), 1358–1363.
- 105.Yu T; Boob AG; Volk MJ; Liu X; Cui H; Zhao H, Machine learning-enabled retrobiosynthesis of molecules. Nature Catalysis 2023, 1–15.
- 106.Tsai C-F; Chen M-L, Credit rating by hybrid machine learning techniques. Applied soft computing 2010, 10 (2), 374–380.
