Abstract
Natural products exhibit diverse and typically nonflat structures, which could be essential in drug–target interactions. Given limited bioactivity data for natural products in public databases, multitask learning (MTL) offers a promising strategy to improve quantitative structure–activity relationship (QSAR)-based predictions. This study optimized MTL with evolutionary relatedness metrics of proteins to enhance the prediction of natural product bioactivity, particularly when data are scarce, and identified conditions under which MTL is most effective. A curated data set of predicted natural products with bioactivity against enzymes from ChEMBL was constructed using binary classification filtering. Single-task learning (STL) served as the baseline, feature-based MTL (FBMTL) was applied across all proteins within each protein group, and instance-based MTL (IBMTL), a variant of FBMTL, incorporated evolutionary relatedness metrics. IBMTL outperformed STL and FBMTL across most protein groups, suggesting that evolutionary relatedness improves performance. Significant improvements were observed in the kinase and cytochrome P450 protein groups, whose proteins are classified at more specific levels of ChEMBL’s 6-level hierarchical protein classification. In the kinase group, IBMTL performed best at the target parent level, highlighting a trade-off between relatedness and data set size. This study demonstrates the potential of MTL in natural product-based drug discovery by leveraging evolutionary relatedness despite limited data availability.


Introduction
Natural products comprise a wide range of metabolites derived from plants, animals, bacteria, and fungi. Over time, their structures and structural analogs have been crucial in many therapeutic areas. Their relatively high degree of three-dimensionality, in contrast to the frequently flat structures of synthetic compounds, may be important in interactions with drug targets. Furthermore, their unique and diverse chemical structures, which are often challenging to replicate synthetically, provide a valuable reservoir of potential drug leads. Between 1981 and 2019, although only 6.1% of approved small-molecule drugs were directly derived from natural products, 66.7% had connections to natural products, highlighting their critical role in drug discovery.
Over 250,000 structures of known natural products are available in public databases. Testing such a large number of compounds directly against specific diseases through wet-laboratory experiments is highly resource-intensive, expensive, and time-consuming. Quantitative structure–activity relationship (QSAR) modeling, a mathematical approach designed to uncover relationships between molecular structural features and biological properties, offers a promising alternative in drug discovery. It enables the rapid and efficient processing of large compound data sets without significant loss of precision. The predictivity and accuracy of QSAR models, however, depend on factors such as the data set size and the quality of the chemical and biological data.
A few public databases provide biological activity data for natural products against target proteins, such as the Natural Product Activity and Species Source (NPASS) database. However, the data currently available in such databases remain generally limited for machine learning-based QSAR studies. To address this limitation, the performance of machine learning-based QSAR models may be improved by adding data on related tasks and modeling such tasks in parallel, a technique known as multitask learning (MTL).
Conventional QSAR models typically follow a single-task learning (STL) approach, in which each task is addressed independently and different tasks are treated as separate, unrelated entities. In contrast, MTL-based QSAR models enhance learning efficiency and prediction accuracy by leveraging multiple tasks simultaneously and enabling the transfer of shared information among tasks. However, the effectiveness of the transferred information depends on the relatedness of the combined tasks. There are several types of MTL, including traditional MTL, feature-based MTL (FBMTL), and instance-based MTL (IBMTL).
Traditional MTL uses a response matrix in which rows represent objects and columns represent tasks, with missing values where objects are not annotated for certain tasks. This approach avoids object duplication, ensuring balanced representation, but relies on algorithms such as artificial neural networks (ANN) that can handle missing responses during training. FBMTL reformulates the problem by adding a task identifier column, enabling a shared response vector without missing values. IBMTL extends FBMTL by incorporating similarity measures between tasks as additional variables, where these similarity values provide quantitative relationships among the tasks. Both FBMTL and IBMTL duplicate objects for each task for which they are annotated, making them suitable for traditional algorithms such as random forest (RF) and support vector machines (SVM), which are designed to handle one response at a time.
Sadawi et al. demonstrated that IBMTL using an evolutionary relatedness metric of proteins outperformed FBMTL and an STL baseline in QSAR studies. Building on this finding, in this study we aimed to optimize MTL with evolutionary relatedness metrics specifically for predicting the activity of natural products against target proteins using limited data sets. These data sets consisted of predicted natural products and their associated bioactivities against target proteins from the ChEMBL database, obtained through binary classification filtering. Predicted natural products were used because experimentally validated bioactivity data for natural products are severely lacking in public databases. Using a predicted natural products data set ensures that the training set reflects the cases to which we aim to generalize.
In addition, previous studies had not extensively explored the conditions that favor the performance of MTL models, particularly evolutionary relatedness, the number of proteins, and the amount of bioactivity data for compounds against target proteins. This study investigated how these factors influence MTL performance.
Finally, we conducted virtual screening to evaluate whether our optimized model could identify potential compounds from a large natural product library against a target protein and whether its results aligned with those obtained from protein–ligand docking. For this evaluation, we selected AKT2 as the representative target protein, as it is notably overexpressed in pancreatic cancer, a disease with a 5-year relative survival rate of only 12.8% (2014–2020) that makes it the most lethal type of cancer. Targeting the AKT signaling pathway could therefore be a promising therapeutic strategy for treating pancreatic cancer. Moreover, AKT2 belongs to the kinase group, in which IBMTL demonstrated significantly superior performance compared to STL, supporting its suitability as a representative target for evaluation. Using the best-performing model, we screened over 200,000 natural products against AKT2 and identified ten potential inhibitors.
Results and Discussion
Classification-Based Filtering of the ChEMBL Database for Predicted Natural Products and Preprocessed Data Distribution
The data set was derived from the ChEMBL 35 database, as specific public databases providing biological activity data for natural products against target proteins (e.g., the NPASS database) remain very limited. A filtering model was designed to identify likely natural products from the ChEMBL 35 database. The model was built through binary classification studies using a data set comprising natural products and synthetic compounds from the InterBioScreen database.
Figure 1a depicts the accuracy of models created from combinations of 21 molecular fingerprints and four machine-learning algorithms in distinguishing natural products from synthetic compounds. Most models demonstrated excellent performance, with accuracy exceeding 90%, highlighting their strong ability to distinguish between natural products and synthetic compounds. Among these models, the best performance was achieved by the model combining the Avalon fingerprint with the random forest algorithm, attaining an accuracy of 95.79%.
Figure 1. Performance of machine learning models in compound classification: (a) accuracy comparison for classifying natural products and synthetic compounds across molecular fingerprint–learning algorithm combinations, (b) sensitivity comparison of the best binary classification model in predicting natural product labels using internal and external data sets.
The robustness of the best-performing classification model was subsequently evaluated using external test sets composed of natural products from several databases, including NPASS, Natural Product (NP) Atlas, Comprehensive Marine Natural Products Database (CMNPD), and Molport. As shown in Figure 1b, the external validation results show that the classification model for generating natural product labels still yields a high sensitivity, above 80%, indicating that the model generalizes well to external test sets. The sensitivities for the CMNPD and NP Atlas databases were the two lowest, possibly because these databases focus on more specialized natural product collections (marine natural products and microbially derived natural products, respectively), resulting in lower generalizability of the model to these data sets.
The optimized model was used to filter the ChEMBL 35 database, identifying predicted natural products along with their bioactivity data against proteins at level 2 and below within the enzyme group (level 1) based on ChEMBL’s 6-level hierarchical protein classification (as shown in Figure S1). This filtering process, followed by data preprocessing, resulted in a data set comprising 83,641 unique predicted natural products and 123,248 bioactivity data points.
After obtaining the predicted natural products, their chemical space was visualized using the t-distributed stochastic neighbor embedding (t-SNE) approach, along with the chemical space of the natural products and the synthetic compounds used for building the binary classification model. Representative compounds from all three groups were selected for this comparison, as shown in Figure 2.
Figure 2. Chemical space visualization of natural products, synthetic compounds, and predicted natural products using t-SNE.
In Figure 2, some natural products overlap with the region of synthetic compounds due to the ambiguous nature of chemical space, where certain natural products share structural similarities with synthetic compounds. For example, salicylic acid is a natural product but is structurally similar to aspirin, a synthetic compound derived from it. Nevertheless, the majority of natural products occupy distinct positions, indicating that their structural characteristics differ from those of synthetic compounds. Moreover, natural products tend to occupy a broader region than synthetic compounds, reflecting greater structural diversity. Most of the predicted natural products align with the natural products region, although some overlap with the synthetic compounds region. These results confirm that the predicted natural products used in this study reasonably represent natural products.
The number of proteins decreases drastically after data preprocessing, as illustrated in Figure S2. Similarly, the distribution of the number of compounds per protein after preprocessing is dominated by the 50–250 compounds class, followed by the 251–500 compounds class, as shown in Figure S3. These distribution patterns indicate that this study relies on small data sets, making the MTL technique relevant, as it is expected to enhance learning performance by leveraging knowledge from related tasks.
Comparative Performance Analysis of STL and MTL Models on Protein Groups within the Enzyme Group
In the MTL studies, we employed FBMTL, IBMTL, and a baseline MTL (BMTL). The FBMTL and IBMTL approaches were selected to prevent missing data in the data frame, enabling the use of various traditional machine learning algorithms rather than being limited to deep learning approaches. Deep learning models typically require large data sets to learn meaningful patterns. When data are limited, they tend to struggle to generalize well, whereas traditional machine learning methods handle small data sets more effectively, leading to more reliable predictions. BMTL was included as a comparative approach, representing a widely used MTL method that utilizes a traditional MTL data frame and employs a deep learning algorithm (in this study, a graph convolutional network).
We compared three regression algorithms (ANN, RF, and SVM) on proteins in the kinase group using STL, FBMTL, and two types of IBMTL: one utilizing amino acid global sequence similarity (AA-GSS) attributes and the other utilizing amino acid local sequence similarity (AA-LSS) attributes. RF emerged as the preferred algorithm, as it consistently achieved the lowest average RMSE values in the FBMTL, IBMTL AA-GSS, and IBMTL AA-LSS models in the kinase group, as shown in Figure S5. This learning algorithm, together with the Avalon fingerprint (identified above as the best-performing molecular fingerprint), was selected for the subsequent MTL studies (excluding BMTL) across all protein groups.
Figure 3 compares the average RMSE values among the five approaches across different protein groups. FBMTL achieves lower average RMSE values than STL in some groups. Notably, the improvement is statistically significant in the kinase group (p < 0.001), while in other groups, such as phosphatase, lyase, and ligase, the improvements are not statistically significant. However, the average RMSE values of FBMTL models are higher than those of IBMTL models in all protein groups, indicating that incorporating protein similarity attributes, represented in this study by amino acid global and local sequence similarities, can improve model performance.
Figure 3. Performance comparison of five models: STL, FBMTL, IBMTL AA-GSS, IBMTL AA-LSS, and BMTL across all protein groups within the enzyme group.
IBMTL AA-GSS and IBMTL AA-LSS exhibit lower average RMSE values than STL in most protein groups, except for the hydrolase group. Statistically significant improvements over STL are observed only in the kinase group (p < 0.001) and cytochrome P450 group (p < 0.01) for both IBMTL methods. These two groups consist of proteins classified at more specific (higher-numbered) levels of ChEMBL’s 6-level hierarchical protein classification (Tables S3 and S9). In contrast, both IBMTL models perform poorly in hydrolase, oxidoreductase, and transferase groups, whose members are classified at less specific (lower-numbered) levels, specifically level 2 (Tables S5–S7). In ChEMBL’s protein classification system, as depicted in Figure S1, protein groups assigned to higher-numbered levels tend to consist of proteins with greater similarity, whereas those at lower-numbered levels generally contain more diverse members. These results suggest that the evolutionary relatedness of proteins within a protein group plays a crucial role in the performance of IBMTL models.
However, this pattern does not hold for the protease group. Although most of its proteins are classified at level 5 (Table S4), and its data set size is comparable to that of the kinase group (Figure S4), the performance improvement of IBMTL over STL is not statistically significant. This anomaly is further discussed in the section on applying multitask learning models across the three hierarchical protein group levels.
BMTL underperforms compared to the other models across most protein groups, showing good performance only in the phosphodiesterase group. The performance gap is even more pronounced in the isomerase, lyase, and ligase groups, which can be attributed to their relatively small data set sizes (Figure S4). These limited data present challenges for the graph convolutional network, a deep learning method, making it difficult to generalize well. These results highlight that our models tend to perform better than BMTL, a widely used type of MTL that employs a more complex deep learning architecture, especially for protein groups with small data sets.
Impact of Three Hierarchical Protein Group Levels on MTL Models’ Performance
The previous results, where IBMTL significantly outperformed STL in the kinase group but showed only a slight performance improvement in the protease group, led us to examine the impact of the three hierarchical protein group levels on MTL model performance in the kinase and protease groups. As explained in the experimental section, the three hierarchical levels are the superclass, target parent, and protein class. To ensure a fair comparison across these levels, each Protein Class ID was required to contain at least two proteins. As a result, the number of target proteins analyzed in this section was reduced to 85 for the kinase group (from 123 in the previous section) and 90 for the protease group (from 105 in the previous section).
Figure 4a compares the average RMSE values among the four approaches across the three hierarchical protein group levels in the kinase group. In the FBMTL models, performance at all three levels was statistically superior to that of STL (p < 0.001). The best performance was observed at the target parent level, although the difference from the protein class level was only marginal.
Figure 4. Performance comparison of STL, FBMTL, IBMTL AA-GSS, and IBMTL AA-LSS models across superclass, target parent, and protein class levels in the kinase group (a) and protease group (b).
IBMTL models consistently outperformed STL (p < 0.001) across all three hierarchical levels and performed better than FBMTL at every level. Among these three levels, IBMTL models performed best at the target parent level, compared to the protein superclass and protein class levels. This pattern suggests that, within the kinase group, IBMTL’s optimal performance results from a trade-off between the number of proteins (and the amount of bioactivity data for compounds) included in a model and the average similarity among these proteins. At the superclass level, the model consisted of a single protein group, meaning that all proteins with bioactivity data for the predicted natural products were used together, leading to a lower average protein similarity. In contrast, at the protein class level, multiple protein groups were used, distributing proteins among them, which reduced the number of proteins per model but increased the average protein similarity.
Figure 4b compares the average RMSE values of the four approaches across the three hierarchical protein group levels within the protease group. A gradual performance improvement of all MTL approaches is observed from the superclass to the target parent level and, ultimately, to the protein class level. Unlike kinases, which share a highly conserved catalytic domain within their family, proteases exhibit considerable diversity in their catalytic domains, and some are not evolutionarily related to other proteases. This diversity accounts for the poorer performance of MTL models at the superclass level, where the protein set is more heterogeneous. However, as proteins are grouped into more specific protein groups at the protein class level, the performance of MTL models improves.
Among the MTL models in the protease group, both IBMTL AA-GSS and IBMTL AA-LSS outperform STL and FBMTL at all levels. Notably, IBMTL AA-GSS and IBMTL AA-LSS achieved their best performance at the protein class level, showing a statistically significant improvement over STL (p < 0.01). These findings, observed in both the kinase and protease groups, were consistent with a study by Moon and Kim, which reported that applying MTL models to more diverse targets tends to degrade, rather than improve, performance.
Impact of Protein Sequence Similarity, Number of Proteins, and Amount of Bioactivity Data on MTL Model’s Performance in the Kinase Group at the Target Parent Level
To investigate factors influencing MTL models’ performance, we examined the impact of average protein sequence similarity, the number of proteins, and the amount of bioactivity data available for compounds against target proteins within the kinase group at the target parent level, where FBMTL and IBMTL previously demonstrated the lowest average RMSE. Since both evolutionary relatedness metrics used in IBMTL yield similar average RMSE values, we included only one (AA-LSS) in this section.
Figure 5a shows that when the average local sequence similarity of proteins is greater than 30%, FBMTL and IBMTL AA-LSS perform better than STL. STL outperforms FBMTL and IBMTL AA-LSS only when the average local sequence similarity is very low, below 30%. At low similarity, IBMTL outperforms FBMTL by a noticeable margin, but as similarity increases, the average RMSE of IBMTL tends to approach that of FBMTL. These results imply that at high average local sequence similarity, the evolutionary relatedness attributes have little additional impact on IBMTL compared to FBMTL.
Figure 5. Performance comparison of STL, FBMTL, and IBMTL AA-LSS models in the kinase group at the target parent level: (a) by average local protein sequence similarity class, (b) by number of proteins, and (c) by amount of bioactivity data for compounds, each within the same Target Parent ID.
Besides protein sequence similarity, the performance of MTL models is also affected by the number of proteins associated with the same Target Parent ID. Figure 5b reveals that both FBMTL and IBMTL AA-LSS consistently outperform STL, achieving lower average RMSE values across all numbers of proteins. When the number of proteins within the same Target Parent ID is high, IBMTL AA-LSS performs better than FBMTL, and the performance gap becomes more pronounced. Conversely, as the number of proteins decreases, the performance difference tends to narrow. This pattern indicates that when the number of proteins in a given group is small, the protein sequence similarity attributes used by IBMTL AA-LSS do not contribute substantially to performance improvement over FBMTL, and vice versa.
The amount of bioactivity data available for compounds against target proteins within the same Target Parent ID also affects performance. Figure 5c shows that both FBMTL and IBMTL AA-LSS models outperform STL across all classes of bioactivity data availability. When the amount of bioactivity data within the same Target Parent ID is high, IBMTL AA-LSS shows superior performance compared to FBMTL, with a widening performance gap. As the amount of bioactivity data decreases, this gap becomes smaller. However, this pattern is partly influenced by the number of proteins within the same Target Parent ID, since the amount of bioactivity data is correlated with the number of proteins.
Virtual Screening Using the Best-Performing Model to Predict Natural Product Bioactivities for AKT2
After evaluating the performance of the four models (STL, FBMTL, IBMTL AA-GSS, and IBMTL AA-LSS) based on their average RMSE values across the three hierarchical protein group levels, their specific performance for AKT2 was also determined, as shown in Table 1. However, performance evaluation at the protein class level was not conducted, as AKT2 belongs to a Target Parent ID that includes only a single Protein Class ID. Consequently, the model performances at the target parent and protein class levels are identical.
Table 1. Performance of STL, FBMTL, IBMTL AA-GSS, and IBMTL AA-LSS Models for AKT2 Based on Average RMSE Across Two Hierarchical Protein Group Levels.
| model | protein group level | average RMSE |
|---|---|---|
| STL | not applicable | 0.640 ± 0.084 |
| FBMTL | superclass | 0.618 ± 0.066 |
| IBMTL AA-GSS | superclass | 0.582 ± 0.104 |
| IBMTL AA-LSS | superclass | 0.593 ± 0.110 |
| FBMTL | target parent | 0.561 ± 0.066 |
| IBMTL AA-GSS | target parent | 0.561 ± 0.071 |
| IBMTL AA-LSS | target parent | 0.559 ± 0.072 |
IBMTL AA-LSS achieved the best performance at the target parent level, yielding the lowest average RMSE. Therefore, for this virtual screening, we selected this model at the target parent level to predict the biological activity of compounds collected from multiple natural product databases.
After removing structural duplicates, we obtained a data set comprising 242,118 unique natural product structures. From this virtual screening, the ten compounds with the lowest predicted IC50 values, indicating the highest potential biological activity against AKT2, are listed in Table 2. Also included are reference compounds, consisting of representative predicted natural products from ChEMBL against AKT2 across a range of activities and a known synthetic inhibitor of AKT2, GSK690693, with their actual IC50 values.
Table 2. Ten Most Promising Natural Products with the Lowest Predicted IC50 Values against AKT2 Identified by the Best-Performing Model, Along with Reference Compounds and Their Binding Affinities from Protein–Ligand Docking.
| no. | compound ID | database source | compound name | molecular formula | predicted IC50 (nM) | actual IC50 (nM) | binding affinity, AutoDock Vina (kcal/mol) | binding affinity, MOE (kcal/mol) |
|---|---|---|---|---|---|---|---|---|
| ten most promising natural products with the lowest predicted IC50 values against AKT2 identified by the best-performing model | | | | | | | | |
| 1 | STOCK1N-98616 | InterBioScreen | | C37H35ClN6O8S | 4.92 | | –9.9 | –9.38 |
| 2 | CMNPD3014 | CMNPD | Methoxydechlorochartelline A | C21H16Br4N4O2 | 5.62 | | –8.4 | –7.69 |
| 3 | CMNPD2423 | CMNPD | Chartelline A | C20H13Br4ClN4O | 5.64 | | –8.0 | –6.94 |
| 4 | CMNPD24363 | CMNPD | Premarineosin A | C25H33N3O2 | 6.00 | | –8.1 | –5.98 |
| 5 | CMNPD24364 | CMNPD | 16-ketopremarineosin A | C25H31N3O3 | 6.30 | | –8.0 | –6.41 |
| 6 | CMNPD3015 | CMNPD | Chartelline A | C20H13Br4ClN4O | 6.32 | | –8.1 | –5.99 |
| 7 | CMNPD3013 | CMNPD | Chartelline C | C20H15Br2ClN4O | 6.60 | | –7.9 | –6.22 |
| 8 | STOCK1N-95086 | InterBioScreen | | C23H19BrClNO6S | 6.64 | | –8.2 | –7.56 |
| 9 | CMNPD3012 | CMNPD | Chartelline B | C20H14Br3ClN4O | 6.64 | | –8.4 | –5.80 |
| 10 | CMNPD23804 | CMNPD | Atkamine A | C40H52BrN3O3S | 6.69 | | –8.4 | –9.40 |
| representative predicted natural products from ChEMBL against AKT2 across a range of activities | | | | | | | | |
| 11 | CHEMBL2177367 | ChEMBL | | C28H38ClN5O3 | | 10 | –9.2 | –8.85 |
| 12 | CHEMBL2325740 | ChEMBL | | C22H27ClN6O2 | | 130 | –8.9 | –8.58 |
| 13 | CHEMBL227605 | ChEMBL | | C19H17N5 | | 430 | –8.7 | –7.19 |
| 14 | CHEMBL227727 | ChEMBL | | C13H13N5 | | 1600 | –7.2 | –6.50 |
| 15 | CHEMBL264666 | ChEMBL | | C10H13N5 | | 59,000 | –6.7 | –6.06 |
| known synthetic inhibitor of AKT2 | | | | | | | | |
| 16 | GSK690693 | PDB | | C21H27N7O3 | | 13 | –10.3 | –9.57 |

Label to distinguish the compound from another that shares the same name but has a different structure.
We further evaluated these virtual screening results using an alternative method, protein–ligand docking. Using the crystal structure of human AKT2 in complex with GSK690693 (PDB code: 3D0E), we obtained binding affinity scores for the ten hit candidates. As summarized in Table 2, two of the ten hit candidates exhibited promising binding affinity scores when compared with the reference compounds. Compounds STOCK1N-98616 and CMNPD23804 achieved binding affinity scores of −9.9 and −9.38, and −8.4 and −9.40 kcal/mol in AutoDock Vina and Molecular Operating Environment (MOE), respectively.
Representative predicted natural products from ChEMBL targeting AKT2 were used to examine the relationship between biological activity and binding affinity across a range of activity levels. These compounds showed actual IC50 values that correlated with their binding affinity scores, where lower IC50 values tend to correspond to more negative binding affinity scores. For comparison, the representative predicted natural product with the lowest actual IC50 value, CHEMBL2177367, scored −9.2 and −8.85 kcal/mol, while GSK690693 scored −10.3 and −9.57 kcal/mol. These findings indicate that IBMTL AA-LSS can yield predictions that, for certain compounds, align with docking results, although this evidence is limited to AKT2. To fully establish the reliability of the model, further experimental confirmation in the laboratory will be necessary.
Conclusion
In this paper, we employed evolutionary relatedness metrics (AA-GSS and AA-LSS) in an MTL model called IBMTL. We compared its performance against an MTL model that does not incorporate evolutionary relatedness attributes (FBMTL) and a baseline model (STL). This study used a small data set of predicted natural products along with their biological activity data against target proteins, which are classified under protein groups (level 2) within the enzyme group (level 1) in the ChEMBL 35 database.
IBMTL AA-GSS and IBMTL AA-LSS outperformed FBMTL and STL in most protein groups, suggesting that incorporating evolutionary relatedness attributes can enhance model performance. However, statistically significant improvements by the IBMTL models were observed in protein groups consisting of proteins that share common ancestry. In contrast, both models exhibited lower performance in protein groups that included diverse and evolutionarily unrelated members. These findings indicate that the evolutionary relatedness of proteins influences the performance of IBMTL models.
Further analysis within the kinase group at three hierarchical protein group levels confirmed that FBMTL and IBMTL models exhibit statistically significant improvements over STL. Notably, the models achieved the best performance at the target parent level, suggesting that this reflects a trade-off between evolutionary relatedness and data set size.
We also investigated how average protein similarity, the number of proteins, and the amount of bioactivity data for compounds within the same Target Parent ID influence model performance for the kinase group at the target parent level. FBMTL and IBMTL AA-LSS consistently outperform STL when average amino acid local sequence similarity exceeds 30%. However, the performance gap between FBMTL and IBMTL AA-LSS tends to narrow as the average amino acid local sequence similarity increases. A similar trend is observed concerning the number of proteins and the amount of bioactivity data for compounds: FBMTL and IBMTL AA-LSS outperform STL across all levels of data availability, but their performance difference narrows under limited data conditions.
Finally, using the best-performing model for AKT2, we predicted the biological activity of natural products from public databases and identified ten compounds with the lowest predicted IC50 values. These compounds had lower predicted IC50 values than the actual IC50 value of GSK690693, a known synthetic inhibitor of AKT2. The ten compounds and the reference compounds with their actual IC50 values were also assessed using protein–ligand docking. The results showed that, for certain compounds, the model produced outcomes that aligned with docking results, although this finding is limited to the AKT2 target.
Experimental Section
This study was conducted in four main stages. First, an optimized natural product filter was developed through binary classification studies and applied to screen predicted natural products along with their bioactivities from the ChEMBL database, followed by data preprocessing. Second, the performance of STL and MTL regression models was evaluated for protein groups under the ChEMBL enzyme group. Third, MTL models were further applied at three hierarchical levels of protein groups: superclass, target parent, and protein class. Fourth and finally, virtual screening was performed using the best-performing model for AKT2 to predict the bioactivities of natural products from public databases. The overall workflow of this study is illustrated in Figure 6.
Figure 6. (a) Schematic illustration of data collection and preparation. (b) Schematic illustration of STL and MTL studies for protein groups under the enzyme group. (c) Illustration of MTL model applications across three protein group levels: superclass, target parent, and protein class. (d) Schematic illustration of virtual screening using the best-performing model for AKT2 to predict bioactivities of natural products from public databases.
Data Set for STL and MTL Studies
Given the limited availability of public databases specifically providing biological activity data for natural products against target proteins, this study utilized the ChEMBL 35 database. A filter model was constructed to identify likely natural products from the ChEMBL 35 database. This filter was developed through binary classification studies using a data set of 10,000 natural products and 10,000 synthetic compounds, both randomly selected from InterBioScreen.
The binary classification studies utilized 21 molecular fingerprints from RDKit and CDK (i.e., AtomPair, Avalon, ECFP0, ECFP2, ECFP4, ECFP6, Estate, Extended, FCFP0, FCFP2, FCFP4, FCFP6, FeatMorgan, Layered, MACCS, Morgan, Pattern, Pubchem, RDKit, Standard, and Torsion) and four machine-learning algorithms (i.e., 1-nearest neighbor (1-NN), random forest (RF), support vector machine (SVM), and artificial neural network (ANN)). Five-fold stratified cross-validation was used to maintain balanced labels within each fold and to ensure robust model evaluation across data sets.
The combination of a molecular fingerprint and a machine-learning algorithm that achieved the highest accuracy in distinguishing natural products from synthetic compounds was further evaluated using external test sets, each comprising natural products from one of the following databases: NPASS, NP Atlas, CMNPD, and Molport. This external validation was performed five times using different sets of 2000 natural products from each database. Since these databases do not contain synthetic compounds to serve as decoys, as in the InterBioScreen database, sensitivity (%) was used as the performance metric instead of accuracy (%) to compare internal and external validations. The best molecular fingerprint–machine-learning algorithm combination was then selected as the optimized trained filter and subsequently applied to screen data from the ChEMBL 35 database to obtain predicted natural products along with their biological activities against target proteins.
The binary classification and external validation studies were performed using KNIME Analytics Platform (v. 5.2.0). In this platform, OpenBabel was used for compound file format conversions, while RDKit supported tasks such as format conversion, salt and small fragments removal, and the generation of molecular fingerprints. Chemistry Development Kit (CDK) was also utilized to generate molecular fingerprints from compound structures.
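The classification workflow itself was built in KNIME; purely as an illustration, a minimal Python sketch of the corresponding step (Avalon fingerprints plus a random forest evaluated by stratified 5-fold cross-validation, with hypothetical variable names and toy parameter choices) could look like the following.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Avalon import pyAvalonTools
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def avalon_fp(smiles, n_bits=1024):
    """Avalon fingerprint as a NumPy array; returns None for unparsable SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    fp = pyAvalonTools.GetAvalonFP(mol, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

def cross_validate_filter(natural_smiles, synthetic_smiles, n_splits=5):
    """Stratified 5-fold CV accuracy of an RF natural-product/synthetic classifier."""
    X, y = [], []
    for smi, label in [(s, 1) for s in natural_smiles] + [(s, 0) for s in synthetic_smiles]:
        fp = avalon_fp(smi)
        if fp is not None:
            X.append(fp)
            y.append(label)  # 1 = natural product, 0 = synthetic compound
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    clf = RandomForestClassifier(n_estimators=500, random_state=42)
    return cross_val_score(clf, np.array(X), np.array(y), cv=cv, scoring="accuracy")

# Usage with the two 10,000-compound InterBioScreen SMILES lists (not shown here):
# scores = cross_validate_filter(natural_smiles, synthetic_smiles)
# print(f"accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```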
Data Preprocessing
This study included only the proteins classified into protein groups at level 2 and below within the enzyme group (defined at level 1), based on ChEMBL’s 6-level hierarchical protein classification (Figure S1). Additionally, proteins were filtered to include only single proteins. The data set was further refined to include only compounds with biological activity specifically reported as half-maximal inhibitory concentration (IC50). Compounds with IC50 values reported in nanomolar units were retained, while values in other units were converted to nanomolar. Records with IC50 values of zero or below, missing units, or relational operators other than "=" were excluded. The IC50 values in nanomolar for each structure were then converted into logarithmic values (pIC50).
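As a minimal illustration of this activity filtering and pIC50 conversion (column names are hypothetical), note that pIC50 = −log10(IC50 in mol/L) = 9 − log10(IC50 in nM):

```python
import numpy as np
import pandas as pd

# One bioactivity record per row, with IC50 already converted to nanomolar
# (hypothetical column names and toy values).
records = pd.DataFrame({
    "compound_id": ["C1", "C2", "C3"],
    "target_id":   ["T1", "T1", "T2"],
    "relation":    ["=", "=", ">"],
    "ic50_nM":     [100.0, -5.0, 2500.0],
})

# Keep only exact, positive IC50 values reported with the "=" relation.
records = records[(records["relation"] == "=") & (records["ic50_nM"] > 0)].copy()

# pIC50 = -log10(IC50 in mol/L) = 9 - log10(IC50 in nM)
records["pIC50"] = 9.0 - np.log10(records["ic50_nM"])
print(records)  # only C1 survives, with pIC50 = 7.0
```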
In the molecular-level preprocessing, charged compounds were neutralized using the nonforced uncharging method in RDKit, in which, if a positive or negative charge cannot be removed, the corresponding opposite charge is preserved to maintain overall neutrality. Salt counterions, which are not relevant to the core structure responsible for bioactivity, were removed. Small fragments were also removed by retaining only the largest fragment in each molecule, ensuring that only the main chemical structure was used for modeling. Additionally, SMILES normalization was performed by converting each structure into canonical SMILES, thereby standardizing atom ordering and aromaticity representation.
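A minimal RDKit sketch of these molecular-level steps (the exact order of operations and standardization options in the original workflow may differ):

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize
from rdkit.Chem.SaltRemover import SaltRemover

# Non-forced by default: charges that cannot be removed keep their counter-charge.
uncharger = rdMolStandardize.Uncharger()
fragment_chooser = rdMolStandardize.LargestFragmentChooser()
salt_remover = SaltRemover()

def standardize(smiles):
    """Strip salts/small fragments, neutralize, and return canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = salt_remover.StripMol(mol)      # remove common salt counterions
    mol = fragment_chooser.choose(mol)    # keep only the largest fragment
    mol = uncharger.uncharge(mol)         # neutralize charges where possible
    return Chem.MolToSmiles(mol)          # canonical SMILES normalization

print(standardize("C[NH+](C)C.[Cl-]"))    # -> "CN(C)C"
```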
In cases of duplicate records (i.e., multiple IC50 values for the same compound against the same protein), the most recent record by year was selected. Additional criteria were applied to the data set, ensuring that each protein group contained at least two proteins and that each protein had at least 50 predicted natural products with IC50 values. The predicted natural products in SMILES were then converted into molecular fingerprints, using the type that performed best in the prior binary classification study.
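The deduplication and size thresholds can be expressed, for example, with pandas (column names hypothetical; the study's thresholds were at least two proteins per group and at least 50 compounds per protein):

```python
import pandas as pd

# One row per compound-protein bioactivity record with a publication year.
records = pd.DataFrame({
    "compound_id": ["C1", "C1", "C2", "C3"],
    "target_id":   ["T1", "T1", "T1", "T2"],
    "pIC50":       [6.0, 6.5, 7.1, 5.2],
    "year":        [2015, 2021, 2019, 2020],
})

# For duplicate compound-protein pairs, keep only the most recent record.
records = (records.sort_values("year")
                  .drop_duplicates(subset=["compound_id", "target_id"], keep="last"))

# Keep only proteins with at least `min_compounds` compounds
# (50 in the study; lowered here so the toy frame is not emptied).
min_compounds = 2
counts = records.groupby("target_id")["compound_id"].nunique()
records = records[records["target_id"].isin(counts[counts >= min_compounds].index)]
print(records)
```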
Data preprocessing was conducted in Python (v. 3.10.13). Pandas (v. 1.5.3) was utilized for tabular data manipulation, including reading, filtering, deduplication, and sorting. NumPy (v. 1.26.2) facilitated numerical transformations, such as log conversion of IC50 values. RDKit was employed for molecular structure processing, including compound uncharging, salt and small fragments removal, SMILES standardization, and molecular fingerprint generation.
The predicted natural products obtained after data preprocessing, together with the natural products and synthetic compounds used to build the binary classification model, were visualized in chemical space using the t-SNE approach. For this comparison, 2500 representative compounds were selected from each of the three groups to evaluate whether the predicted natural products used in this study adequately represent natural products.
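A minimal sketch of this visualization step (random bit vectors stand in for the 2500 fingerprints per group; perplexity and other t-SNE settings are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# fps_np, fps_syn, fps_pred: fingerprint matrices (rows = compounds) for the three
# groups; random bits stand in here for the representative compounds of each group.
rng = np.random.default_rng(0)
fps_np, fps_syn, fps_pred = (rng.integers(0, 2, size=(100, 1024)) for _ in range(3))

X = np.vstack([fps_np, fps_syn, fps_pred])
labels = (["natural product"] * len(fps_np) + ["synthetic compound"] * len(fps_syn)
          + ["predicted natural product"] * len(fps_pred))

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
for name in sorted(set(labels)):
    idx = [i for i, lab in enumerate(labels) if lab == name]
    plt.scatter(emb[idx, 0], emb[idx, 1], s=5, label=name)
plt.legend(); plt.xlabel("t-SNE 1"); plt.ylabel("t-SNE 2"); plt.show()
```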
STL and MTL Studies
This study employed STL and three types of MTL: FBMTL, IBMTL, and BMTL. STL and BMTL served as baselines for comparing the performance of FBMTL and IBMTL. As illustrated in Figure 7a, STL utilizes a data frame consisting of rows representing compounds described by molecular descriptors, such as molecular fingerprints (MFPs), and annotated with pIC50 values for a specific protein. In STL, a separate model is built for each protein within a protein group. It uses only the molecular fingerprints of a particular compound as input features and outputs a pIC50 value.
Figure 7. Data frame formulation for single-task learning (a), traditional multitask learning (b), feature-based multitask learning (c), and instance-based multitask learning (d).
Figure 7b illustrates traditional MTL, where a data set consisting of N compounds is associated with a pIC50 matrix with T columns, each representing a specific target protein. Since not all compounds may be annotated for every target, the pIC50 value matrix may include missing values. A single model is built with the molecular fingerprints of a particular compound as input features and a set of pIC50 values for all the targets as the output. BMTL adopts this type of MTL setting.
In contrast, FBMTL, as shown in Figure 7c, introduces a target identifier, TargetID. This TargetID is converted using one-hot encoding and used as an input feature alongside the molecular fingerprints. All the data sets from individual proteins within the same protein group are concatenated into a single data frame. This data frame formulation allows for a shared pIC50 value vector across all proteins within a protein group, ensuring no missing values, and each compound is replicated k times, where k represents the number of proteins with available annotations for that compound. This concatenated data set is used to build a single model. The TargetID and the molecular fingerprints are used as input features, and the model outputs a pIC50 value for the given compound–target pair. FBMTL is thus an MTL approach that appends a task identifier (in this study, TargetID) to the features, enabling shared learning across tasks under the assumption that they use identical or similar feature representations.
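A minimal sketch of this FBMTL data frame construction (column names hypothetical; two toy bits stand in for the full fingerprint): the per-protein tables are concatenated, and the TargetID is one-hot encoded next to the fingerprint bits.

```python
import pandas as pd

# Per-protein data frames: fingerprint bit columns plus a pIC50 column.
per_protein = {
    "T1": pd.DataFrame({"fp_0": [1, 0], "fp_1": [0, 1], "pIC50": [6.1, 7.3]}),
    "T2": pd.DataFrame({"fp_0": [1, 1], "fp_1": [1, 0], "pIC50": [5.4, 6.8]}),
}

# Concatenate all proteins into one frame and tag each row with its TargetID.
frames = []
for target_id, df in per_protein.items():
    df = df.copy()
    df["TargetID"] = target_id
    frames.append(df)
fbmtl = pd.concat(frames, ignore_index=True)

# One-hot encode the TargetID so it can be used alongside the fingerprint bits.
fbmtl = pd.get_dummies(fbmtl, columns=["TargetID"], prefix="Target")

X = fbmtl.drop(columns="pIC50")   # fingerprints + one-hot task identifier
y = fbmtl["pIC50"]                # single shared response vector, no missing values
print(fbmtl)
```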
IBMTL extends FBMTL by incorporating quantitative similarity among proteins as instance weights, which serve as additional features. As shown in Figure 7d, the STargetID columns indicate the levels of similarity among proteins in the concatenated data set. These values are calculated as the number of amino acid matches with similar properties, divided by the aligned length, expressed as a percentage. This study used two types of similarity: AA-GSS, calculated using the Needleman–Wunsch algorithm, and AA-LSS, determined using the Smith–Waterman algorithm. For example, in the row showing Compound1’s activity against Target15, the feature STarget10 indicates the similarity between Target10 and Target15. In this row, the value of STarget15 will be 1, as the protein is compared to itself. IBMTL is therefore an MTL approach that builds on FBMTL by leveraging data instances from all tasks to construct a model for each task through instance weighting.
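A sketch of how such pairwise similarity features might be computed with Biopython's PairwiseAligner (global mode for a Needleman–Wunsch-style alignment, local mode for Smith–Waterman). The scoring scheme used in the study (substitution matrix, gap penalties, and the exact similarity-counting convention) is not specified here, so the parameters below are assumptions.

```python
from Bio import Align
from Bio.Align import substitution_matrices

blosum62 = substitution_matrices.load("BLOSUM62")

def sequence_similarity(seq_a, seq_b, mode="local"):
    """Percent similarity: aligned positions with a positive BLOSUM62 score / aligned length."""
    aligner = Align.PairwiseAligner()
    aligner.mode = mode                    # "global" ~ Needleman-Wunsch, "local" ~ Smith-Waterman
    aligner.substitution_matrix = blosum62
    aligner.open_gap_score = -10           # assumed gap penalties
    aligner.extend_gap_score = -0.5
    alignment = aligner.align(seq_a, seq_b)[0]
    row_a, row_b = str(alignment[0]), str(alignment[1])   # aligned sequences with gaps
    similar = sum(
        1
        for a, b in zip(row_a, row_b)
        if a != "-" and b != "-" and blosum62[a, b] > 0
    )
    return 100.0 * similar / len(row_a)

# Toy example with two short peptide sequences.
print(sequence_similarity("MKTAYIAKQR", "MKTAHIAKQR", mode="local"))
```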
Before conducting MTL studies, we first carried out a preliminary study to identify the best machine-learning algorithm for regression. Using the kinase group and its subgroups as examples, we evaluated three algorithms: ANN, RF, and SVM. After selecting the best algorithm, we conducted MTL studies on the entire data set.
Model Construction and Hyperparameter Optimization
We developed code to generate STL and MTL models in Python (v. 3.10.13). For STL and MTL studies, TensorFlow (v. 2.10.0) was used for models with the ANN algorithm, while Scikit-learn (v. 1.0.2) was used for models with RF and SVM algorithms.
The predictive performance was evaluated using the root-mean-square error (RMSE), calculated as the square root of the mean squared difference between the predicted and actual pIC50 values. A lower RMSE indicates more accurate predictions. This metric was consistently applied across both STL and MTL models for fair comparison.
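In symbols, with y_i the observed and ŷ_i the predicted pIC50 of test compound i among n compounds:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$$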
In STL, a separate model was built for each protein within a protein group. Nested cross-validation, with a 5-fold outer loop for performance estimation and a 5-fold inner loop for hyperparameter tuning, was performed using the selected machine learning algorithm. The model predicted the pIC50 values of compounds in the outer test sets for each protein, and RMSE was then calculated individually for each protein.
In FBMTL, nested stratified cross-validation with 5-fold outer and inner loops was applied using a machine learning algorithm to predict pIC50 values of compounds in the test sets across all proteins. Stratified sampling based on TargetID ensured that compounds associated with each protein were proportionally represented in each fold’s training and test sets. After obtaining the predicted pIC50 values for all compounds in the outer test sets, the data were reassigned to each protein based on the TargetID. This reassignment allowed the RMSE to be determined for each protein. IBMTL used a similar model construction to FBMTL, but with additional protein similarity features, AA-GSS and AA-LSS, which were generated using Biopython (v. 1.83).
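A condensed sketch of this FBMTL/IBMTL outer loop, with stratification on TargetID and per-protein RMSE reassignment (variable names hypothetical; the inner hyperparameter search is sketched separately below):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold

def fbmtl_outer_cv(X, y, target_ids, n_splits=5, random_state=42):
    """5-fold outer CV stratified by TargetID; returns RMSE per protein."""
    X, y, target_ids = np.asarray(X), np.asarray(y), np.asarray(target_ids)
    outer = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    preds = np.empty_like(y, dtype=float)
    for train_idx, test_idx in outer.split(X, target_ids):   # stratify on the task label
        model = RandomForestRegressor(n_estimators=500, random_state=random_state)
        model.fit(X[train_idx], y[train_idx])
        preds[test_idx] = model.predict(X[test_idx])
    # Reassign predictions to their proteins and compute RMSE for each one.
    return {
        t: mean_squared_error(y[target_ids == t], preds[target_ids == t], squared=False)
        for t in np.unique(target_ids)
    }

# Usage: rmse_per_protein = fbmtl_outer_cv(X_fbmtl, y_pic50, target_id_column)
```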
As a comparative MTL method, BMTL, which represents a widely used MTL approach, was used in this section. This method is a multitask deep learning method that utilizes a traditional MTL data frame and employs a deep learning algorithm, in this study, a graph convolutional network. BMTL was implemented using kMoL (v. 1.1.9.3).
Hyperparameters for each model were optimized using Bayesian optimization with a Tree-structured Parzen Estimator (TPE) in Optuna (v. 4.0.0). To ensure robust and unbiased evaluation, a nested cross-validation strategy was employed for all models. For each outer training set, an inner 5-fold cross-validation was performed to optimize the hyperparameters. During this inner loop, the inner training folds were used for model training, and the inner validation folds for hyperparameter tuning. The best hyperparameters from this inner loop were then used to train a model on the entire outer training set, and its performance was evaluated on the outer test set. This nested strategy ensures that the test data remain completely unseen during both model training and hyperparameter optimization. The hyperparameters used in the preliminary and main studies are summarized in Tables S1 and S2.
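A minimal sketch of the inner-loop search with Optuna's TPE sampler for the RF models (the tuned hyperparameters and ranges below are placeholders; the actual search spaces are given in Tables S1 and S2):

```python
import numpy as np
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

def tune_inner(X_outer_train, y_outer_train, n_trials=50, seed=42):
    """Inner 5-fold CV to pick RF hyperparameters minimizing the mean RMSE."""
    X, y = np.asarray(X_outer_train), np.asarray(y_outer_train)

    def objective(trial):
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "max_depth": trial.suggest_int("max_depth", 5, 50),
            "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
        }
        rmses = []
        for tr, va in KFold(n_splits=5, shuffle=True, random_state=seed).split(X):
            model = RandomForestRegressor(random_state=seed, **params)
            model.fit(X[tr], y[tr])
            rmses.append(mean_squared_error(y[va], model.predict(X[va]), squared=False))
        return float(np.mean(rmses))

    study = optuna.create_study(direction="minimize",
                                sampler=optuna.samplers.TPESampler(seed=seed))
    study.optimize(objective, n_trials=n_trials)
    return study.best_params
```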
Statistical Analysis
The performance of the five models (STL, FBMTL, IBMTL AA-GSS, IBMTL AA-LSS, and BMTL) was evaluated by the average RMSE, calculated as the mean RMSE of all proteins within a given protein group. The performance of each MTL model was compared to that of the STL model. A one-tailed paired t-test (right-tailed) was conducted to evaluate statistical significance under the alternative hypothesis that the average RMSE of the STL model is greater than that of the MTL model. Statistical significance was determined at p < 0.05, p < 0.01, and p < 0.001. The statistical significance analyses were performed using SciPy (v. 1.13.0).
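For illustration, the comparison reduces to a paired, right-tailed test on the per-protein RMSE vectors (toy numbers shown in place of the real values):

```python
from scipy import stats

# Per-protein RMSE values for the same proteins under two models (toy numbers).
rmse_stl = [0.71, 0.65, 0.80, 0.62, 0.74]
rmse_mtl = [0.66, 0.60, 0.77, 0.63, 0.69]

# H1: mean RMSE of STL is greater than that of the MTL model (right-tailed).
t_stat, p_value = stats.ttest_rel(rmse_stl, rmse_mtl, alternative="greater")
print(f"t = {t_stat:.3f}, one-tailed p = {p_value:.4f}")
```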
Applying MTL Models Across Three Hierarchical Protein Group Levels
After performing MTL studies using all pIC50 data of compounds for all proteins classified at level 2 and below within the enzyme group, we extended the analysis to lower levels. The term “superclass level” was used for the previously analyzed level. At the next lower level, referred to as the “target parent level”, the analysis focused on data for proteins within the enzyme group classified at level 2 and below that share the same Target Parent ID. Finally, at the “protein class level”, the analysis examined data for proteins classified at level 2 and below within the enzyme group that share both the same Target Parent ID and Protein Class ID.
In the ChEMBL 35 database, Protein Class ID is a unique identifier for each protein family classification, while Target Parent ID serves as a unique identifier for the parent of a protein family. To ensure the feasibility of MTL studies down to the lowest level, an additional criterion was applied: each Protein Class ID must contain at least two proteins.
Virtual Screening for Potential AKT2 Inhibitors from Natural Product Databases
To identify AKT2 inhibitor hit compounds, virtual screening was conducted across several natural product databases, including NPASS, NP Atlas, CMNPD, InterBioScreen, and Molport, comprising a total of 243,547 natural product structures. Prior to screening, all structures were converted to their uncharged forms using the nonforced uncharger in RDKit to ensure consistency with the charge-state treatment used in building the machine learning model. Salt counterions and small fragments were removed. SMILES normalization was performed, and duplicate SMILES in the concatenated data set were eliminated, retaining only one unique entry per compound.
Afterward, all compounds were converted into molecular fingerprints of the same type used in the STL and MTL studies. Target ID and Similarity features, based on AKT2 protein data, were also included. Finally, the best-performing model and hyperparameters from the previous MTL studies were used to predict the pIC50 values of the natural products in the data set. The predictions were then sorted to identify ten compounds with the highest predicted pIC50 values, which were subsequently converted into predicted IC50 values. The predicted IC50 values of these ten compounds were compared with the actual IC50 values of reference compounds, including representative predicted natural products from ChEMBL against AKT2 across a range of activities from low to high and a known synthetic inhibitor of AKT2, GSK690693.
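A minimal sketch of this ranking step, converting the predicted pIC50 values back to IC50 in nM (names are hypothetical; `model` stands for the trained IBMTL AA-LSS regressor and `X_screen` for the fingerprint-plus-task-feature matrix of the screened library):

```python
import pandas as pd

def rank_hits(model, X_screen, compound_ids, top_n=10):
    """Predict pIC50 for the screening library and return the top_n compounds."""
    pic50_pred = model.predict(X_screen)
    hits = pd.DataFrame({
        "compound_id": compound_ids,
        "predicted_pIC50": pic50_pred,
        # IC50 (nM) = 10 ** (9 - pIC50), the inverse of the pIC50 transform above
        "predicted_IC50_nM": 10.0 ** (9.0 - pic50_pred),
    })
    return hits.sort_values("predicted_pIC50", ascending=False).head(top_n)
```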
Evaluation of Virtual Screening Results Using Protein–Ligand Docking
We compared the virtual screening results with those obtained using protein–ligand docking. The docking results were further compared with those of the reference compounds. Compounds that lacked 3D structures, available only as 2D structures or SMILES, were first converted into 3D structures using Avogadro (v. 1.2.0). The geometries of the ten selected natural products and the negative control were optimized using a semiempirical method, PM6, implemented in MOPAC.
The crystal structure of human AKT2 in complex with GSK690693 (PDB code: 3D0E) was obtained from the Protein Data Bank and prepared using AutoDockTools (v. 1.5.7). Water molecules were removed, and the cocrystallized ligand, GSK690693, was extracted from the structure. Polar hydrogens were added, nonpolar hydrogens were merged, and Gasteiger charges were assigned to the target protein.
Protein–ligand docking was carried out using two software programs: AutoDock Vina (v. 1.1.2) and MOE (v. 2024.0601). Docking with AutoDock Vina was performed using PyRx (v. 1.1), with the docking grid box centered on the cocrystallized ligand at coordinates x = 22.521, y = −19.611, z = 7.41, and a box size of 26.25 × 26.25 × 26.25 Å. The docking parameters included an exhaustiveness of 64 and the number of models set to 27.
For docking in MOE, the binding site was identified using the Site Finder tool and selected based on the highest Propensity for Ligand Binding (PLB) score. The Triangle Matcher method was used for docking placement, and Rigid Receptor was selected for refinement. The London dG scoring function was applied during placement to generate 60 poses, followed by refinement using the Generalized-Born Volume Integral/Weighted Surface Area (GBVI/WSA) dG scoring function, resulting in 10 final poses.
Supplementary Material
Acknowledgments
The authors acknowledge the Indonesia Endowment Fund for Education (LPDP), Ministry of Finance, Republic of Indonesia, for funding support through a doctoral scholarship awarded to the first author under contract no. SKPB-512/LPDP/LPDP.3/2023.
Glossary
Abbreviations
- 1-NN
1-nearest neighbor
- AA-GSS
amino acid global sequence similarity
- AA-LSS
amino acid local sequence similarity
- AKT2
AKT serine/threonine kinase 2
- ANN
artificial neural network
- BMTL
baseline multitask learning
- CMNPD
comprehensive marine natural products database
- FBMTL
feature-based multitask learning
- GBVI/WSA
generalized-born volume integral/weighted surface area
- IBMTL
instance-based multitask learning
- IC50
50% inhibitory concentration
- MFP
molecular fingerprint
- MOE
molecular operating environment
- MTL
multitask learning
- NP
natural product
- NPASS
natural product activity and species source
- PLB
propensity for ligand binding
- QSAR
quantitative structure–activity relationship
- RF
random forest
- RMSE
root-mean-square error
- SMILES
simplified molecular input line entry system
- STL
single-task learning
- SVM
support vector machine
All data, KNIME workflows, and Python code needed to reproduce the experiments in this paper are available on GitHub at https://github.com/donny-ramadhan/MTL-Natural-Products
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.5c05094.
Figure S1: ChEMBL’s 6-level hierarchical protein classification. Figure S2: distribution of proteins before and after data preprocessing for protein groups. Figure S3: distribution of bioactivity data for compounds per protein. Figure S4: distribution of bioactivity data for compounds per protein group. Figure S5: performance comparison of ANN, RF, and SVM on proteins in the kinase group. Figure S6: superposition of the crystal pose and the redocked pose of the cocrystallized ligand. Table S1: hyperparameters used in nested cross-validation in STL, FBMTL, and IBMTL studies. Table S2: hyperparameters used in nested cross-validation in BMTL studies. Tables S3–S13: amount of bioactivity data for predicted natural products per protein in each protein group. Table S14: species of origin and structures of the ten most promising natural products against AKT2 identified by virtual screening (PDF)
The authors declare no competing financial interest.
References
- Mullowney M. W., Duncan K. R., Elsayed S. S., Garg N., van der Hooft J. J. J., Martin N. I., Meijer D., Terlouw B. R., Biermann F., Blin K., Durairaj J., Gorostiola González M., Helfrich E. J. N., Huber F., Leopold-Messer S., Rajan K., de Rond T., van Santen J. A., Sorokina M., Balunas M. J., Beniddir M. A., van Bergeijk D. A., Carroll L. M., Clark C. M., Clevert D. A., Dejong C. A., Du C., Ferrinho S., Grisoni F., Hofstetter A., Jespers W., Kalinina O. V., Kautsar S. A., Kim H., Leao T. F., Masschelein J., Rees E. R., Reher R., Reker D., Schwaller P., Segler M., Skinnider M. A., Walker A. S., Willighagen E. L., Zdrazil B., Ziemert N., Goss R. J. M., Guyomard P., Volkamer A., Gerwick W. H., Kim H. U., Müller R., van Wezel G. P., van Westen G. J. P., Hirsch A. K. H., Linington R. G., Robinson S. L., Medema M. H.. Artificial Intelligence for Natural Product Drug Discovery. Nat. Rev. Drug Discovery. 2023;22:895–916. doi: 10.1038/s41573-023-00774-7. [DOI] [PubMed] [Google Scholar]
- Atanasov A. G., Zotchev S. B., Dirsch V. M., Orhan I. E., Banach M., Rollinger J. M., Barreca D., Weckwerth W., Bauer R., Bayer E. A., Majeed M., Bishayee A., Bochkov V., Bonn G. K., Braidy N., Bucar F., Cifuentes A., D’Onofrio G., Bodkin M., Diederich M., Dinkova-Kostova A. T., Efferth T., El Bairi K., Arkells N., Fan T. P., Fiebich B. L., Freissmuth M., Georgiev M. I., Gibbons S., Godfrey K. M., Gruber C. W., Heer J., Huber L. A., Ibanez E., Kijjoa A., Kiss A. K., Lu A., Macias F. A., Miller M. J. S., Mocan A., Müller R., Nicoletti F., Perry G., Pittalà V., Rastrelli L., Ristow M., Russo G. L., Silva A. S., Schuster D., Sheridan H., Skalicka-Woźniak K., Skaltsounis L., Sobarzo-Sánchez E., Bredt D. S., Stuppner H., Sureda A., Tzvetkov N. T., Vacca R. A., Aggarwal B. B., Battino M., Giampieri F., Wink M., Wolfender J. L., Xiao J., Yeung A. W. K., Lizard G., Popp M. A., Heinrich M., Berindan-Neagoe I., Stadler M., Daglia M., Verpoorte R., Supuran C. T.. Natural Products in Drug Discovery: Advances and Opportunities. Nat. Rev. Drug Discovery. 2021;20:200–216. doi: 10.1038/s41573-020-00114-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Singh K., Gupta J. K., Chanchal D. K., Shinde M. G., Kumar S., Jain D., Almarhoon Z. M., Alshahrani A. M., Calina D., Sharifi-Rad J., Tripathi A.. Natural Products as Drug Leads: Exploring Their Potential in Drug Discovery and Development. Naunyn Schmiedebergs Arch. Pharmacol. 2025;398:4673. doi: 10.1007/s00210-024-03622-6. [DOI] [PubMed] [Google Scholar]
- Newman D. J., Cragg G. M.. Natural Products as Sources of New Drugs over the Nearly Four Decades from 01/1981 to 09/2019. J. Nat. Prod. 2020;83:770–803. doi: 10.1021/acs.jnatprod.9b01285. [DOI] [PubMed] [Google Scholar]
- Chen Y., Garcia de Lomana M., Friedrich N.-O., Kirchmair J.. Characterization of the Chemical Space of Known and Readily Obtainable Natural Products. J. Chem. Inf. Model. 2018;58(8):1518–1532. doi: 10.1021/acs.jcim.8b00302. [DOI] [PubMed] [Google Scholar]
- Wei H., McCammon J. A.. Structure and Dynamics in Drug Discovery. npj Drug Discov. 2024;1(1):1. doi: 10.1038/s44386-024-00001-2. [DOI] [Google Scholar]
- Mao J., Akhtar J., Zhang X., Sun L., Guan S., Li X., Chen G., Liu J., Jeon H.-N., Kim M. S., No K. T., Wang G.. Comprehensive Strategies of Machine-Learning-Based Quantitative Structure-Activity Relationship Models. iScience. 2021;24(9):103052. doi: 10.1016/j.isci.2021.103052. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mansouri K., Moreira-Filho J. T., Lowe C. N., Charest N., Martin T., Tkachenko V., Judson R., Conway M., Kleinstreuer N. C., Williams A. J.. Free and Open-Source QSAR-Ready Workflow for Automated Standardization of Chemical Structures in Support of QSAR Modeling. J. Cheminf. 2024;16(1):19. doi: 10.1186/s13321-024-00814-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rácz A., Bajusz D., Héberger K.. Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification. Molecules. 2021;26(4):1111. doi: 10.3390/molecules26041111. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhao H., Yang Y., Wang S., Yang X., Zhou K., Xu C., Zhang X., Fan J., Hou D., Li X., Lin H., Tan Y., Wang S., Chu X.-Y., Zhuoma D., Zhang F., Ju D., Zeng X., Chen Y. Z.. NPASS Database Update 2023: Quantitative Natural Product Activity and Species Source Database for Biomedical Research. Nucleic Acids Res. 2023;51(D1):D621–D628. doi: 10.1093/nar/gkac1069. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Valsecch, C. ; Grisoni, F. ; Consonni, V. ; Ballabio, D. ; Todeschini, R. . Multitask Learning for Quantitative Structure–Activity Relationships: A Tutorial. In Machine Learning and Deep Learning in Computational Toxicology; Huixiao, H. , Ed.; Springer Nature, 2023. [Google Scholar]
- Valsecchi C., Grisoni F., Motta S., Bonati L., Ballabio D.. NURA: A Curated Dataset of Nuclear Receptor Modulators. Toxicol. Appl. Pharmacol. 2020;407:115244. doi: 10.1016/j.taap.2020.115244.
- Valsecchi C., Collarile M., Grisoni F., Todeschini R., Ballabio D., Consonni V.. Predicting Molecular Activity on Nuclear Receptors by Multitask Neural Networks. J. Chemom. 2022;36(2):e3325. doi: 10.1002/cem.3325.
- Zhao Z., Qin J., Gou Z., Zhang Y., Yang Y.. Multi-Task Learning Models for Predicting Active Compounds. J. Biomed. Inf. 2020;108:103484. doi: 10.1016/j.jbi.2020.103484.
- Wenzel J., Matter H., Schmidt F.. Predictive Multitask Deep Neural Network Models for ADME-Tox Properties: Learning from Large Data Sets. J. Chem. Inf. Model. 2019;59(3):1253–1268. doi: 10.1021/acs.jcim.8b00785.
- Sadawi N., Olier I., Vanschoren J., Van Rijn J. N., Besnard J., Bickerton R., Grosan C., Soldatova L., King R. D.. Multi-Task Learning with a Natural Metric for Quantitative Structure Activity Relationship Learning. J. Cheminf. 2019;11(1):68. doi: 10.1186/s13321-019-0392-1.
- Gaulton A., Bellis L. J., Bento A. P., Chambers J., Davies M., Hersey A., Light Y., McGlinchey S., Michalovich D., Al-Lazikani B., Overington J. P.. ChEMBL: A Large-Scale Bioactivity Database for Drug Discovery. Nucleic Acids Res. 2012;40(D1):D1100–D1107. doi: 10.1093/nar/gkr777.
- National Cancer Institute. Cancer Stat Facts: Pancreatic Cancer. https://seer.cancer.gov/statfacts/html/pancreas.html (accessed February 27, 2025).
- He Y., Sun M. M., Zhang G. G., Yang J., Chen K. S., Xu W. W., Li B.. Targeting PI3K/Akt Signal Transduction for Cancer Therapy. Signal Transduction Targeted Ther. 2021;6(1):425. doi: 10.1038/s41392-021-00828-5.
- Seo M., Shin H. K., Myung Y., Hwang S., No K. T.. Development of Natural Compound Molecular Fingerprint (NC-MFP) with the Dictionary of Natural Products (DNP) for Natural Product-Based Drug Development. J. Cheminf. 2020;12(1):6. doi: 10.1186/s13321-020-0410-3.
- InterBioScreen Ltd. Natural Compounds. https://www.ibscreen.com/natural-compounds (accessed February 27, 2025).
- Poynton E. F., van Santen J. A., Pin M., Contreras M. M., McMann E., Parra J., Showalter B., Zaroubi L., Duncan K. R., Linington R. G.. The Natural Products Atlas 3.0: Extending the Database of Microbially Derived Natural Products. Nucleic Acids Res. 2025;53(D1):D691–D699. doi: 10.1093/nar/gkae1093.
- Lyu C., Chen T., Qiang B., Liu N., Wang H., Zhang L., Liu Z.. CMNPD: A Comprehensive Marine Natural Products Database towards Facilitating Drug Discovery from the Ocean. Nucleic Acids Res. 2021;49(D1):D509–D515. doi: 10.1093/nar/gkaa763.
- Molport. Natural Product and NP-Like Compound Library. https://www.molport.com/shop/natural-compound-database (accessed February 28, 2025).
- Simões R. S., Maltarollo V. G., Oliveira P. R., Honorio K. M.. Transfer and Multi-Task Learning in QSAR Modeling: Advances and Challenges. Front. Pharmacol. 2018;9:74. doi: 10.3389/fphar.2018.00074.
- Dou B., Zhu Z., Merkurjev E., Ke L., Chen L., Jiang J., Zhu Y., Liu J., Zhang B., Wei G.-W.. Machine Learning Methods for Small Data Challenges in Molecular Science. Chem. Rev. 2023;123(13):8736–8780. doi: 10.1021/acs.chemrev.3c00189.
- Hanks S. K., Quinn A. M., Hunter T.. The Protein Kinase Family: Conserved Features and Deduced Phylogeny of the Catalytic Domains. Science. 1988;241(4861):42–52. doi: 10.1126/science.3291115.
- López-Otín C., Bond J. S.. Proteases: Multifunctional Enzymes in Life and Disease. J. Biol. Chem. 2008;283(45):30433–30437. doi: 10.1074/jbc.R800035200.
- Moon C., Kim D.. Prediction of Drug–Target Interactions through Multi-Task Learning. Sci. Rep. 2022;12(1):18323. doi: 10.1038/s41598-022-23203-y.
- Heerding D. A., Rhodes N., Leber J. D., Clark T. J., Keenan R. M., Lafrance L. V., Li M., Safonov I. G., Takata D. T., Venslavsky J. W., Yamashita D. S., Choudhry A. E., Copeland R. A., Lai Z., Schaber M. D., Tummino P. J., Strum S. L., Wood E. R., Duckett D. R., Eberwein D., Knick V. B., Lansing T. J., McConnell R. T., Zhang S., Minthorn E. A., Concha N. O., Warren G. L., Kumar R.. Identification of 4-(2-(4-Amino-1,2,5-Oxadiazol-3-Yl)-1-Ethyl-7-{[(3S)-3-Piperidinylmethyl]Oxy}-1H-Imidazo[4,5-c]Pyridin-4-Yl)-2-Methyl-3-Butyn-2-Ol (GSK690693), a Novel Inhibitor of AKT Kinase. J. Med. Chem. 2008;51(18):5663–5679. doi: 10.1021/jm8004527.
- Berthold, M. R.; Cebron, N.; Dill, F.; Gabriel, T. R.; Kötter, T.; Meinl, T.; Ohl, P.; Sieb, C.; Thiel, K.; Wiswedel, B. The Konstanz Information Miner. In Data Analysis, Machine Learning and Applications; Preisach, C., Burkhardt, H., Schmidt-Thieme, L., Decker, R., Eds.; Springer Berlin Heidelberg: Berlin, Heidelberg, 2008; pp 319–326.
- O’Boyle N. M., Banck M., James C. A., Morley C., Vandermeersch T., Hutchison G. R.. Open Babel: An Open Chemical Toolbox. J. Cheminf. 2011;3(1):33. doi: 10.1186/1758-2946-3-33.
- RDKit. RDKit: Open-Source Cheminformatics. https://www.rdkit.org.
- Beisken S., Meinl T., Wiswedel B., de Figueiredo L. F., Berthold M. R., Steinbeck C.. KNIME-CDK: Workflow-Driven Cheminformatics. BMC Bioinformatics. 2013;14(1):257. doi: 10.1186/1471-2105-14-257.
- RDKit. RDKit MolStandardize Uncharger Class Reference. https://www.rdkit.org/docs/cppapi/classRDKit_1_1MolStandardize_1_1Uncharger.html (accessed August 18, 2025).
- Hermansyah O., Bustamam A., Yanuar A.. Virtual Screening of Dipeptidyl Peptidase-4 Inhibitors Using Quantitative Structure–Activity Relationship-Based Artificial Intelligence and Molecular Docking of Hit Compounds. Comput. Biol. Chem. 2021;95:107597. doi: 10.1016/j.compbiolchem.2021.107597.
- Kausar S., Falcao A. O.. An Automated Framework for QSAR Model Building. J. Cheminf. 2018;10(1):1. doi: 10.1186/s13321-017-0256-5.
- Weininger D.. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988;28(1):31–36. doi: 10.1021/ci00057a005.
- McKinney, W. Data Structures for Statistical Computing in Python. In Python in Science Conference, 2010, pp 56–61.
- Harris C. R., Millman K. J., van der Walt S. J., Gommers R., Virtanen P., Cournapeau D., Wieser E., Taylor J., Berg S., Smith N. J., Kern R., Picus M., Hoyer S., van Kerkwijk M. H., Brett M., Haldane A., del Río J. F., Wiebe M., Peterson P., Gérard-Marchant P., Sheppard K., Reddy T., Weckesser W., Abbasi H., Gohlke C., Oliphant T. E.. Array Programming with NumPy. Nature. 2020;585(7825):357–362. doi: 10.1038/s41586-020-2649-2.
- Needleman S. B., Wunsch C. D.. A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins. J. Mol. Biol. 1970;48(3):443–453. doi: 10.1016/0022-2836(70)90057-4.
- Smith T. F., Waterman M. S.. Identification of Common Molecular Subsequences. J. Mol. Biol. 1981;147(1):195–197. doi: 10.1016/0022-2836(81)90087-5.
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Goodfellow, I.; Harp, A.; Irving, G.; Jia, Y.; Jozefowicz, R.; Kaiser, L.; Kudlur, M.; Levenberg, J.; Mané, D.; Monga, R.; Moore, S.; Murray, D.; Olah, C.; Schuster, M.; Shlens, J.; Steiner, B.; Sutskever, I.; Talwar, K.; Tucker, P.; Vanhoucke, V.; Vasudevan, V.; Viégas, F.; Vinyals, O.; Warden, P.; Wattenberg, M.; Wicke, M.; Yu, Y.; Zheng, X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. https://www.tensorflow.org/.
- Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., Duchesnay É.. Scikit-Learn: Machine Learning in Python. J. Mach. Learn. Res. 2011;12(85):2825–2830. doi: 10.5555/1953048.2078195.
- Cock P. J. A., Antao T., Chang J. T., Chapman B. A., Cox C. J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B., de Hoon M. J. L.. Biopython: Freely Available Python Tools for Computational Molecular Biology and Bioinformatics. Bioinformatics. 2009;25(11):1422–1423. doi: 10.1093/bioinformatics/btp163.
- Cozac R., Hasic H., Choong J. J., Richard V., Beheshti L., Froehlich C., Koyama T., Matsumoto S., Kojima R., Iwata H., Hasegawa A., Otsuka T., Okuno Y.. KMoL: An Open-Source Machine and Federated Learning Library for Drug Discovery. J. Cheminf. 2025;17(1):22. doi: 10.1186/s13321-025-00967-9.
- Bergstra, J.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for Hyper-Parameter Optimization. In Proceedings of the 25th International Conference on Neural Information Processing Systems; NIPS’11; Curran Associates Inc.: Red Hook, NY, USA, 2011; pp 2546–2554.
- Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining; KDD ’19; Association for Computing Machinery: New York, NY, USA, 2019; pp 2623–2631.
- Virtanen P., Gommers R., Oliphant T. E., Haberland M., Reddy T., Cournapeau D., Burovski E., Peterson P., Weckesser W., Bright J., van der Walt S. J., Brett M., Wilson J., Millman K. J., Mayorov N., Nelson A. R. J., Jones E., Kern R., Larson E., Carey C. J., Polat İ., Feng Y., Moore E. W., VanderPlas J., Laxalde D., Perktold J., Cimrman R., Henriksen I., Quintero E. A., Harris C. R., Archibald A. M., Ribeiro A. H., Pedregosa F., van Mulbregt P., Vijaykumar A., Bardelli A. P., Rothberg A., Hilboll A., Kloeckner A., Scopatz A., Lee A., Rokem A., Woods C. N., Fulton C., Masson C., Häggström C., Fitzgerald C., Nicholson D. A., Hagen D. R., Pasechnik D. V., Olivetti E., Martin E., Wieser E., Silva F., Lenders F., Wilhelm F., Young G., Price G. A., Ingold G. L., Allen G. E., Lee G. R., Audren H., Probst I., Dietrich J. P., Silterra J., Webber J. T., Slavič J., Nothman J., Buchner J., Kulick J., Schönberger J. L., de Miranda Cardoso J. V., Reimer J., Harrington J., Rodríguez J. L. C., Nunez-Iglesias J., Kuczynski J., Tritz K., Thoma M., Newville M., Kümmerer M., Bolingbroke M., Tartre M., Pak M., Smith N. J., Nowaczyk N., Shebanov N., Pavlyk O., Brodtkorb P. A., Lee P., McGibbon R. T., Feldbauer R., Lewis S., Tygier S., Sievert S., Vigna S., Peterson S., More S., Pudlik T., Oshima T., Pingel T. J., Robitaille T. P., Spura T., Jones T. R., Cera T., Leslie T., Zito T., Krauss T., Upadhyay U., Halchenko Y. O., Vázquez-Baeza Y., SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat. Methods. 2020;17(3):261–272. doi: 10.1038/s41592-019-0686-2.
- Hanwell M. D., Curtis D. E., Lonie D. C., Vandermeersch T., Zurek E., Hutchison G. R.. Avogadro: An Advanced Semantic Chemical Editor, Visualization, and Analysis Platform. J. Cheminf. 2012;4(1):17. doi: 10.1186/1758-2946-4-17.
- Stewart J. J. P.. Optimization of Parameters for Semiempirical Methods V: Modification of NDDO Approximations and Application to 70 Elements. J. Mol. Model. 2007;13(12):1173–1213. doi: 10.1007/s00894-007-0233-4.
- Stewart, J. J. P. MOPAC2016; Stewart Computational Chemistry: Colorado Springs, CO. http://openmopac.net/ (accessed April 1, 2025).
- Berman H. M., Westbrook J., Feng Z., Gilliland G., Bhat T. N., Weissig H., Shindyalov I. N., Bourne P. E.. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–242. doi: 10.1093/nar/28.1.235.
- Morris G. M., Huey R., Lindstrom W., Sanner M. F., Belew R. K., Goodsell D. S., Olson A. J.. AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. J. Comput. Chem. 2009;30(16):2785–2791. doi: 10.1002/jcc.21256.
- Trott O., Olson A. J.. AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. J. Comput. Chem. 2010;31(2):455–461. doi: 10.1002/jcc.21334.
- Molecular Operating Environment (MOE), 2024.0601; Chemical Computing Group ULC: 910–1010 Sherbrooke Street West, Montreal, QC H3A 2R7, Canada, 2025.
- Dallakyan, S.; Olson, A. J. Small-Molecule Library Screening by Docking with PyRx. In Chemical Biology: Methods and Protocols; Hempel, J. E., Williams, C. H., Hong, C. C., Eds.; Springer New York: New York, NY, 2015; pp 243–250.
Data Availability Statement
All data, KNIME workflows, and Python code needed to reproduce the experiments in this paper are available on GitHub at https://github.com/donny-ramadhan/MTL-Natural-Products.







