mSystems. 2025 Feb 24;10(3):e00844-24. doi: 10.1128/msystems.00844-24

Rationally minimizing natural product libraries using mass spectrometry

Monica Ness 1,2,#, Thilini Peramuna 1,#, Karen L Wendt 1, Jennifer E Collins 3, Jarrod B King 1, Raphaella Paes 3, Natalia Mojica Santos 3, Crystal Okeke 1, Cameron R Miller 1, Debopam Chakrabarti 3, Robert H Cichewicz 4, Laura-Isobel McCall 1,2
Editor: Jack A Gilbert 5
PMCID: PMC11915828  PMID: 39992101

ABSTRACT

Natural products are a critical source of novel chemotypes for drug discovery. However, the implementation of natural product extract libraries in high throughput screening is hampered by natural product structural redundancy and potential for bioactive re-discovery. This challenge and large library sizes drastically increase the time and cost during initial high throughput screens. To address these limitations, we developed a method that leverages liquid chromatography-tandem mass spectrometry spectral similarity to dramatically reduce natural product library size, with minimal bioactive loss, and applied this to a collection of fungal extracts. Importantly, this method also afforded increased bioassay hit rates against microbial targets, with broad applicability across assays and natural product sources. Thus, this method offers a broadly applicable strategy for accelerated and cost-effective natural product drug discovery.

IMPORTANCE

Natural product libraries are large collections of extracts derived from fungi, plants, bacteria, or any other natural sources. These libraries play an important role in the initial phases of drug discovery, providing the basis for bioassays against a target of interest. However, these collections often comprise thousands of extracts with sometimes overlapping chemical structures, which can result in a bottleneck in both time and costs for the initial phases of drug discovery. Here, we have developed a method that uses mass spectrometry to dramatically reduce the size of these libraries, with minimal tradeoffs and improved success rates in bioassays. Ultimately, this will speed up the process of bioactive candidate identification and isolation, and drug development overall.

KEYWORDS: metabolomics, fungi, drug discovery, specialized metabolites, mass spectrometry, computational techniques

INTRODUCTION

Natural products play a major role in the development of novel pharmaceutical agents, accounting for nearly 70% of newly approved drugs in the past 40 years as direct natural molecules or as natural product mimics (1). Natural product discovery pipelines typically begin by screening large libraries of extracts (crude or pre-fractionated), then identifying and isolating bioactive candidates from these extracts. However, large libraries can result in long development times, high costs, and duplicate drug candidate identification (rediscovery). Prior efforts to address these challenges often focused on the initial collection of library extracts, rather than on rationally reducing library size. These prior studies analyzed the impact of factors like geography, phylogenetics, or culturing conditions on small molecule diversity, using approaches like molecular networking, principal coordinate analysis, principal component analysis, and hierarchical clustering, for example, to select optimal culture conditions (e.g., references 2–4). Considerable efforts have also focused on DNA-based approaches to prioritize or select samples, for example, relying on biosynthetic gene clusters (e.g., reference 5). Clark et al. and Costa et al. combined DNA sequencing with matrix-assisted laser desorption ionization–time of flight-based protein analysis to reduce redundancy (6–8). However, none of this prior work evaluated the reduced libraries in terms of high-throughput screening hit rates or retention of potential bioactive candidates, and it often required complex pipelines with multiple forms of data collection.

Instead, we sought to develop a method that only requires MS/MS spectral data to design natural product extract libraries and tested its applicability in real-world drug development activity screening assays. Here, we report the development of a new method to rationally reduce the size of natural product extract screening libraries by directly addressing cross-organismal redundancy in small molecule natural product production, through liquid chromatography-tandem mass spectrometry (LC-MS/MS) and molecular networking (9). This method led to (i) little loss of diversity, (ii) little loss of bioactive candidate molecules, (iii) increased bioactivity hit rate for a variety of whole-organism and purified protein targets, and (iv) greater library size reduction than previously published methods (6, 10).

RESULTS AND DISCUSSION

Our method uses an untargeted LC-MS/MS approach to choose extracts from a large natural product extract library. These extracts are complex and contain large numbers of different small molecules, some of which may be bioactive. Starting with MS/MS fragmentation patterns, data are then processed through GNPS classical molecular networking software to group MS/MS spectra into scaffolds (Fig. 1). These scaffolds are based on MS/MS fragmentation similarity, which correlates to structural similarity (9). Unlike approaches by Anderson et al. or Ito et al. (11, 12), our rational libraries focus on scaffold diversity, as molecules with similar structures often demonstrate similar biological activity (13–15). Adducts and in-source fragments of the same molecule with similar MS/MS fragmentation will group together in the same scaffold in this approach, leading to less duplication in the rational library than if this method were applied to individual molecular signals. Additionally, many drug development pipelines synthetically modify natural product scaffolds to improve structure-activity relationships (16). Therefore, diversifying core scaffolds, rather than individual molecules, can be prioritized in the initial library design. Our method also differs from prior work maximizing the diversity of known chemical structures in libraries (17–19), since our method does not require a priori structure elucidation.

Fig 1. Method conceptual overview. The workflow depicts fungal extracts processed through untargeted MS/MS analysis and automated analysis using GNPS and R, yielding a reduced library from a large library.

Using custom R code, which we make freely available (see Data Availability), our method first selects the natural product extract with the greatest scaffold diversity. Next, the extract that contains the most scaffolds not already accounted for is added to the rational library. This process is automatically iterated until a desired percentage of scaffold diversity is reached in the rational library or maximal scaffold diversity is achieved. Compared to random selection, this accelerates the accumulation of diversity; relative to the full library, it offers an 84.9% reduction in the library size needed to reach maximal scaffold diversity. Specifically, in our evaluation library of 1,439 fungal extracts, random sample selection achieved 80% of maximal scaffold diversity with an average of 109 extracts, whereas our method reached the same level of diversity with only 50 extracts. Similarly, to reach 100% scaffold diversity (representation in the non-reduced library of all detected scaffolds), random selection required an average of 755 extracts, whereas our method required only 216 extracts (Fig. 2; Table S1). Thus, a rational library of 216 extracts would achieve the same scaffold diversity as the full library of 1,439 extracts, a 6.6-fold reduction in size. In a scenario where 80% diversity is acceptable, this would represent a 28.8-fold library size reduction, from 1,439 extracts to 50 extracts. This size reduction far exceeds the size reduction achievable with alternative methods (6).

Fig 2. Rapid aggregation of scaffold diversity using our method, outperforming random selection. The graph depicts percent of total scaffolds versus number of extracts. Our method achieves faster scaffold accumulation, nearing 100% with fewer extracts, compared to random selections, which show slower scaffold accumulation and later saturation.

Such dramatic library size reduction leads to concerns that key bioactive extracts will be lost. To assess bioactive extract loss in our rational libraries, we compared the bioactivity hit rate of the full library and of our minimal rationally designed library on the eukaryotic parasites Plasmodium falciparum and Trichomonas vaginalis, as well as the influenza virus enzyme neuraminidase. Importantly, these assays represent two of the major types of assays in high throughput screening: phenotypic assays (P. falciparum and T. vaginalis), and target-based assays on purified enzymes (neuraminidase). To prevent bias, rational minimal library selection was blinded to bioactivity scores.

In the full library, the hit rate against P. falciparum was 11.3%, whereas the rational library designed to capture 80% scaffold diversity had an increased hit rate of 22%, and the rational library to 100% diversity had a hit rate of 15.7%. This pattern was replicated against T. vaginalis and neuraminidase (Table 1): our method increased the hit rate from 7.64% for the full library against T. vaginalis, to 18% in the 80% scaffold diversity library. A similar increase from 2.57% to 8% was observed for the full library versus the 80% scaffold diversity library against influenza virus neuraminidase, confirming that the increased hit rate is observed across a range of screening campaign designs and baseline hit rates in the full library (Table 1). Therefore, our method chooses extracts more likely to contain bioactivity, potentially because our method reduces the chemical redundancy inherent in natural product libraries (2, 6). The higher hit rates in the 80% maximum diversity library could also be a result of the algorithm adding the most diverse extracts to the rational libraries first, whereas the later additions contain less diversity, but more rare scaffolds. These scaffolds may not be active in the assay tested but could be active in other assays. To confirm that these findings are not merely an artifact of the smaller library size, we compared our method with 1,000 iterations selecting the same number of random extracts. In all cases, our method outperformed random extract selection (Table 1; Table S2).

TABLE 1.

Higher hit rates for rational libraries

Activity assay | Hit rate in full library (1,439 extracts) | Hit rate in the 80% scaffold diversity library (50 extracts) | Lower and upper quartile hit rates for 50 random extracts (1,000 iterations) | Hit rate in the 95% scaffold diversity library (116 extracts) | Hit rate in the 100% scaffold diversity library (216 extracts)
P. falciparum | 11.26% | 22.00% | 8.00–14.00% | 19.83% | 15.74%
T. vaginalis | 7.64% | 18.00% | 4.00–10.00% | 17.24% | 12.50%
Neuraminidase | 2.57% | 8.00% | 0.00–2.00% | 6.03% | 5.09%
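
The random-selection comparison summarized in Table 1 can be illustrated with a short sketch. This is not the authors' R code; it is a minimal Python example that assumes the screening results are available as a mapping from extract name to a hit/non-hit flag. The function name `random_hit_rate_quartiles` and the input layout are hypothetical.

```python
import random
import statistics

def random_hit_rate_quartiles(is_hit, library_size, iterations=1000, seed=0):
    """Return the lower and upper quartile hit rates (%) of random libraries.

    is_hit: dict mapping extract name -> True if the extract was a hit.
    library_size: number of extracts drawn per random library.
    """
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    extracts = list(is_hit)
    rates = []
    for _ in range(iterations):
        sample = rng.sample(extracts, library_size)   # draw without replacement
        hits = sum(is_hit[e] for e in sample)
        rates.append(100 * hits / library_size)
    q1, _median, q3 = statistics.quantiles(rates, n=4)
    return q1, q3
```

A rational library's hit rate can then be compared against the returned interquartile range for random libraries of the same size.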

To further address concerns regarding the loss of bioactive molecules, we identified features correlated with bioactivity in the full library and assessed whether they were retained in the minimal rational library. In this case, “feature” is defined as a unique m/z and retention time combination found in the MS data (see “Bioactivity correlations”). We found that, of the 10 features significantly correlated with anti-Plasmodium activity in the full library (ρ > 0.5, P < 0.05, FDR corrected), 8 were retained in the 80% scaffold diversity library, and all were retained in the 95% and 100% scaffold diversity libraries. A similar conclusion was reached with regard to molecules correlated with anti-T. vaginalis or anti-neuraminidase activity (Table 2). Using this information, we were also able to identify and dereplicate known active molecules, using methods previously described (20, 21) (Fig. S1). No overlap was observed between features significantly correlated to bioactivity in each assay (Fig. S1).

TABLE 2.

High retention of bioactive candidate molecules in our rational libraries, for three different activity assays

Activity assay | Features found to be significantly correlated to activity in full data set | Retained in the 80% scaffold diversity library | Retained in the 95% scaffold diversity library | Retained in the 100% scaffold diversity library
P. falciparum | 10 | 8 | 10 | 10
T. vaginalis | 5 | 5 | 5 | 5
Neuraminidase | 17 | 16 | 16 | 17
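
The retention counts in Table 2 amount to a set-membership check, sketched below in Python. The sketch assumes that a feature counts as "retained" when it is detected in at least one extract of the rational library; that criterion, the function name `retained_features`, and the input layout are illustrative assumptions rather than the authors' implementation.

```python
def retained_features(correlated_features, feature_extracts, library):
    """Return the bioactivity-correlated features still present in a library.

    correlated_features: iterable of feature IDs correlated with activity.
    feature_extracts: dict mapping feature ID -> set of extracts containing it.
    library: iterable of extract names selected for the rational library.
    """
    selected = set(library)
    # Assumption: one selected extract containing the feature is enough.
    return {
        feature
        for feature in correlated_features
        if feature_extracts.get(feature, set()) & selected
    }
```

For example, with features detected in extracts `{"f1": {"A", "B"}, "f2": {"C"}}` and a library of `["A"]`, only `f1` is retained.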

To validate our method and demonstrate its utility beyond our laboratory, we applied the same rational library-building process to LC-MS data collected by independent investigators and tested for bioactivity against another pathogen, the parasite Trypanosoma cruzi (22). Notably, their library consisted of pre-fractionated plant samples, rather than crude fungal samples. After rationally reducing their full library size (1,600 total extracts) down to 104 extracts with our method (Fig. 3; Table S3) to achieve 80% maximal scaffold diversity, we also observed a rapid accumulation of chemical diversity, an increased hit rate in the rational libraries when compared to the full library (from 0.5% to 1.92%), and the retention of 90–100% of the features correlated with bioactivity (Tables 3 and 4). These results confirm the utility of our method across pathogens and natural product sources and also in assays with lower initial hit rates.

Fig 3. Rapid accumulation of scaffold diversity with our rational library building method, applied to publicly available data. The line graph depicts percent of total scaffolds versus number of extracts, with our method accumulating scaffolds more efficiently than random selections.

TABLE 3.

Higher hit rates in rational libraries, applied to publicly available data

Activity assay | Hit rate in full library | Hit rate in the 80% scaffold diversity library | Hit rate in the 95% scaffold diversity library | Hit rate in the 100% scaffold diversity library
T. cruzi | 0.5% | 1.92% | 1.72% | 1.47%

TABLE 4.

High retention of bioactive candidate molecules in rational libraries, applied to publicly available MS/MS data with T. cruzi activity data

Activity assay | Features found to be significantly correlated to activity in full data set | Retained in the 80% scaffold diversity library | Retained in the 95% scaffold diversity library | Retained in the 100% scaffold diversity library
T. cruzi | 21 | 19 | 21 | 21

In addition to the accelerated screening enabled by our rational library development method, there are also many other advantages of the LC-MS/MS data acquired through this pipeline. Any bioactive candidates identified in a screen of these rational libraries will already have LC-MS/MS data readily available for initial structure elucidation. This data can also be input into many convenient platforms for dereplication (23, 24) and structural annotation (9, 25). For example, we readily identified leucinostatin- and altersetin-related molecules in this data, which have known antimicrobial activity (26, 27) (Fig. S1). If their rediscovery is undesirable, the extracts containing these scaffolds could be excluded from the rational library. Alternatively, in under-studied diseases, investigators may wish to retain these extracts, to enable repurposing of these known compounds to other disease models. Likewise, extracts containing molecules previously described as promiscuously toxic in the target biological system can also be excluded, to prevent rediscovery. Additionally, the MS data can be put into other convenient platforms for adduct and in-source fragment identification, further narrowing features of interest for isolation (28, 29). Abundance information generated from LC-MS data can also be leveraged to build bioactive correlations (21) and identify candidate bioactive molecules, as demonstrated in this study. Such data can be used to guide compound isolation and purification for definitive structure elucidation by NMR.

The rational libraries presented above were built with positive ionization data, as more molecules can be detected in positive mode (30) (Fig. S2). We detected 45,601 features and 5,126 scaffolds in positive mode, whereas negative ionization detected 33,149 features and 3,530 scaffolds. Despite this, we reached the same conclusions when analyzing libraries built with negative mode data (Tables S1 and S4; Fig. S3; Data S1). Likewise, adjusting GNPS classical molecular networking parameters (9) led to similar conclusions after each parameter adjustment (Data S2). To test whether our reduction method would also be functional with low-resolution mass spectrometers, we adjusted the classical molecular networking fragment and precursor m/z tolerance parameters to mimic low-resolution data (Table S5). This adjustment had limited impact on rational library hit rate increases and retention of bioactive candidates. This suggests that our minimization method does not require high-resolution MS data acquisition for effective results (Tables S1 and S6; Fig. S4; Data S1). These observations indicate a broad tolerance of our approach to data acquisition and processing parameters.

Certain fungal genera, like Penicillium, Pseudogymnoascus, and Trichoderma, are more likely to be selected for the rational libraries, while others, like Mucor, are less likely to be selected for the rational libraries (Data S3). This suggests that with our fungal culturing methods, some genera are less likely to produce unique scaffolds, as expected, and therefore are not prioritized using our selection method. Additionally, certain genera were more likely to be active in our activity assays, beyond their relative presence in the libraries (Fig. S5). Extracts that are not a hit in a given assay are however still valuable, as there was a minimal overlap of active extracts between activity assays, albeit more between P. falciparum and T. vaginalis (Fig. S6).

Geographically, the distribution of our fungal extracts was very broad, with extracts from almost every state. Active extracts from each assay did not significantly concentrate in one climate or geographic region (Fig. S7 and S8). Many of our active extracts were close to urban areas, notably in the Dallas, Oklahoma City, and Columbus counties. However, this is likely due to the ease of collection for Citizen Science participants and the greater number of samples sent from urban areas, rather than innate bioactivity of urban fungi.

While library screening costs represent a small proportion of the total drug development investment, they remain a preliminary bottleneck that is especially critical in settings where financial resources are limited, such as neglected tropical diseases or rare conditions. We calculated the costs of the supplies used for LC-MS data acquisition on the full library to $1.81 per extract, for an estimated total cost of $2,604.59 to acquire LC-MS data for all 1,439 extracts. While this was more costly than the T. vaginalis and P. falciparum bioactivity assays (estimated at $0.061 per extract and $0.17 per extract, respectively), the neuraminidase assay was much more costly, at $5.35/extract. Dreiman et al. note an average cost per well of $1.50 for phenotypic assays (31). At such an average assay cost per extract, marginal cost-savings accrue from our approach by two assays, with noticeably greater cost savings within three assays (Fig. 4). At a lower assay cost per well of $0.50, our rational method provides cost savings within four assays. Given the challenges of establishing natural product screening libraries, the anticipation is that they will be used for many different diseases, and thus performing multiple assays with the same screening library is standard procedure. In addition, by reducing the size of the library to be screened using our method, additional screening, dose-response assessment, and counter-screening assays can be implemented, leading to reduced false positive rates. Rationally designed library size reduction also enables the implementation of more complex (and thus more costly per-well) assays that better mimic in vivo conditions, increasing translatability and likelihood of clinical success.

Fig 4. Cost savings from the rational library approach accrue after a few screening campaigns. Graphs depict total cost versus number of assays run at $0.50 and $1.50 per-well assay cost: full library costs increase linearly, while the 80% and 95% reduced libraries have lower and slower cost increases. This is a cost analysis of our reduction method with a library size of 1,500 extracts. Assuming an 80% or 95% library size reduction can be achieved (see Table 1; Table S3) and a cost-per-well average of $0.50 or $1.50 per assay, the upfront cost of LC-MS analysis can be rapidly justified.
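
The break-even reasoning behind this cost analysis can be sketched as follows. The sketch assumes a simple linear cost model that is not spelled out in the text: the rational approach pays the LC-MS supply cost once for every extract in the full library ($1.81 per extract, as reported above) and then screens only the reduced library, with one well per extract per assay. The function names are hypothetical.

```python
import math

LCMS_COST_PER_EXTRACT = 1.81  # supply cost per extract reported in the text

def total_cost(n_assays, library_size, cost_per_well, upfront=0.0):
    """Cumulative cost after n_assays, assuming one well per extract per assay."""
    return upfront + n_assays * library_size * cost_per_well

def break_even_assays(full_size, reduction, cost_per_well,
                      lcms_cost=LCMS_COST_PER_EXTRACT):
    """Smallest number of assays at which the rational library is cheaper."""
    reduced_size = round(full_size * (1 - reduction))
    upfront = lcms_cost * full_size                    # LC-MS on the full library
    saving_per_assay = cost_per_well * (full_size - reduced_size)
    # First integer n with upfront + n*reduced cost < n*full cost.
    return math.floor(upfront / saving_per_assay) + 1

# 1,500 extracts, 80% reduction, $1.50 per well: cheaper from the 2nd assay on.
print(break_even_assays(1500, 0.80, 1.50))  # 2
# 1,500 extracts, 95% reduction, $0.50 per well: cheaper from the 4th assay on.
print(break_even_assays(1500, 0.95, 0.50))  # 4
```

Under this simplified model, the 80% reduction at $0.50 per well breaks even at the fifth assay, so the exact break-even point depends on which reduction level is assumed.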

Thus, while we acknowledge the initial up-front cost and time needed for the LC-MS analysis, this is mitigated by the fact that it can enable cost reduction across all subsequent high throughput screening projects using this library (as demonstrated here), prevent rediscovery of known bioactive or pan-cytotoxic compounds, and expedite bioactive candidate structure elucidation. Our method should also be suitable for libraries of pre-fractions, or to prioritize crude extracts for fractionation, leading to further cost reduction and time savings. The method can be further expedited by refining the separation method, multiplexing LC columns or introducing advances such as microfluidic separation (32). Finally, our method was applicable to data sets and screening assays implemented outside our laboratory, demonstrating its broad utility.

MATERIALS AND METHODS

Collection of fungal extracts

The University of Oklahoma’s Citizen Science Soil Collection Program (33) asked citizen scientists to submit soil samples that were then used to isolate fungi. These fungi were grown to generate extracts that are used in various drug discovery efforts. Fungi were selected for the library based on morphological analysis, and the unique fungi from each soil sample were retained. ITS-based taxonomic identifications were also generated. Over 79,000 fungi originating from all 50 states and the District of Columbia are included in the fungal library (Citizen Science Soil Collection Program [shareok.org]). The subset of fungal extracts used in this study was isolated from 443 soil samples with a broad distribution across the United States (Fig. S7) and includes members of 141 different genera (Data S4). Taxonomic identification was accomplished via sequencing of the internal transcribed spacer region of the genomic DNA (11). These fungal isolates were cultured for 3 weeks on a solid-state medium composed of Cheerios breakfast cereal supplemented with a 0.3% sucrose solution containing 0.005% chloramphenicol, as previously optimized (34, 35).

Metabolite sample preparation

Samples for bioactivity assays and LC-MS analysis were prepared on an automated platform that combined both extraction and partitioning steps as previously described (4). Fungal cultures prepared in 16-by-100 mm borosilicate tubes were subjected two times to a vol:vol water:ethyl acetate partitioning. The aqueous portion was discarded. The organic solvent was removed in vacuo and the remaining organic residues were stored at −20°C until resuspended in DMSO for screening. These same extracts were diluted 1:10 in methanol containing 2 µM sulfadimethoxine as an internal standard for LC-MS/MS analysis.

P. falciparum assay

Parasites were cultured following a protocol by Trager and Jensen, with minor modifications (36, 37). Specifically, the multidrug-resistant P. falciparum line Dd2 was grown at 37°C in 5% CO2 with Roswell Park Memorial Institute 1640 media supplemented with 25 mM of 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (pH 7.4), 26 mM NaHCO3, 2% dextrose, 15 mg/L hypoxanthine, 25 mg/L gentamicin, and 0.5% Albumax II in human A+ blood. The impact of fungal extracts on P. falciparum growth was measured via a SYBR Green I fluorescence-based assay as described (36). Extracts were resuspended in DMSO and plated on microtiter plates with asynchronous Dd2 culture at 1% parasitemia, 1% hematocrit, for a final concentration of 2 µg/mL, maintaining a DMSO concentration of <0.25% in all cases to avoid assay interference. Culture plates were then incubated with extracts under standard growth conditions for 72 h. After incubation, assay plates were frozen at −80°C and subsequently thawed to promote lysis. Once thawed, an equal volume of lysis buffer (20 mM Tris-HCl, 0.08% saponin, 5 mM ethylenediaminetetraacetic acid, and 0.8% Triton X-100) with 1× SYBR Green I was added. Plates were incubated for 45 min to 1 h, protected from light, prior to fluorescence reading at excitation 485 nm, emission 530 nm, on a Synergy Neo2 multi-mode reader (BioTek, Winooski, VT, USA). Values were then normalized to 10 mM chloroquine (positive control) and vehicle-only (negative control) wells. Extracts showing more than 75% inhibition of parasite growth were considered hits. Bioactivity assays were performed separately from rational library generation; investigators performing this assay were unaware of the rationally selected extracts, and vice versa.
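
The control-based normalization and hit call described above can be sketched in Python. The exact formula is not given in the text; the sketch assumes the conventional two-point normalization in which the mean positive-control signal corresponds to 100% inhibition and the mean vehicle-only signal to 0% inhibition. The function names are hypothetical.

```python
from statistics import mean

def percent_inhibition(signal, positive_controls, negative_controls):
    """Scale a well's fluorescence so vehicle = 0% and positive control = 100%."""
    pos = mean(positive_controls)   # fully inhibited growth (drug control)
    neg = mean(negative_controls)   # uninhibited growth (vehicle only)
    return 100 * (neg - signal) / (neg - pos)

def is_hit(signal, positive_controls, negative_controls, threshold=75.0):
    """Call a hit when inhibition exceeds the threshold (75% for P. falciparum)."""
    return percent_inhibition(signal, positive_controls, negative_controls) > threshold
```

The same function can be reused with a different threshold for assays that apply another cutoff, such as the 80% growth inhibition used for T. vaginalis.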

T. vaginalis assay

T. vaginalis Donne (PRA-98) from the American Type Culture Collection (ATCC, Bethesda, MD, USA) was grown at 37°C in filter-sterilized Keister’s modified TYI-S33 medium (2% [wt/vol] casein, 1% [wt/vol] yeast extract, 55.6 mM glucose, 34.2 mM NaCl, 4.4 mM KH2PO4, 5.7 mM K2HPO4, 12.7 mM l-cysteine HCl, 1.1 mM l-ascorbic acid, 86.7 µM ferric ammonium citrate, 10% [vol/vol] heat-inactivated fetal bovine serum, 0.052% [wt/vol] bovine bile salts in 1 L of Millipore water). Microaerophilic conditions were maintained with the use of BD GasPak EZ Campy sachets. Samples consisting of 4 × 10⁴ trichomonads per well were treated with DMSO, 25 µM metronidazole, or fungal extracts at 10 µg/mL, not exceeding 0.5% DMSO. After a 16-h incubation, the cells were fixed (1% glutaraldehyde, 5 µM propidium iodide, and 5 µM acridine orange in phosphate-buffered saline). After a 3-h incubation at 37°C, the cells were imaged using a PerkinElmer Operetta high-content imaging system. Data were analyzed using the Harmony 3.5.1 software package as previously described (35). The number of live cells imaged was normalized to the DMSO control (100% growth). Extracts were considered active if they inhibited the growth of T. vaginalis by more than 80%. The bioactivity assay was performed separately from rational library generation; investigators performing this assay were unaware of the rationally selected extracts, and vice versa.

Neuraminidase assay

Fungal extracts plus 0.01% TritonX to break up aggregates were tested in high throughput format at 10 µg/mL using a commercial neuraminidase activity assay, according to the manufacturer’s instructions (Sigma Aldrich, catalog number MAK121). Those identified as reducing neuraminidase activity to less than 30% of vehicle control were retested in a dose-response curve to verify activity. Hits were considered confirmed if they showed dose-dependent inhibition. Briefly, the reaction mixture containing buffer, substrate, cofactors, enzyme, and dye was mixed with the extracts and incubated at 37°C. Absorbance at 570 nm was recorded at 20 and 50 min. The activity of the neuraminidase enzyme was calculated according to the kit instructions. Activity scores were then normalized such that the positive and negative controls were set to 0% and 100%, respectively. Bioactivity assay was performed separately from rational library generation; investigators performing this assay were unaware of the rationally selected extracts, and vice versa.

LC-MS/MS data acquisition

Fungal extract separation was performed with a Thermo Scientific Vanquish ultra-high-performance liquid chromatography instrument equipped with a Kinetex 1.7 µm C18 50 × 2.1 mm LC column with a C18 guard cartridge (Phenomenex). The mobile phases used were water with 0.1% formic acid (mobile phase A), and acetonitrile with 0.1% formic acid (mobile phase B). The following 12.5-min gradient was used: (i) 0–1 min, 5% B; (ii) 1–8 min, linear increase to 100% B; (iii) 8–10 min, 100% B; (iv) 10–10.5 min, linear decrease to 5% B; and (v) 10.5–12 min, 5% B. A flow rate of 0.5 mL/min was maintained, and the column chamber temperature was kept at 40°C.

For MS/MS acquisition, a Q-Exactive Plus (Thermo Scientific) high-resolution mass spectrometer was used. Both positive and negative ionization data were collected, with the parameters shown in Table S7. Data acquisition was performed in randomized order. Injection volume was 5 µL. A blank and pooled quality control samples were analyzed every 12 injections to monitor instrument performance. Retention time shifts were analyzed by using a mix of sulfadimethoxine, sulfachloropyridazine, sulfamethazine, sulfamethizole, amitriptyline, and coumarin-314 standards. This standard mix was analyzed at the beginning of the run, every 100 samples throughout the run, and at the end of the run.

The Q-Exactive Plus was calibrated using Pierce LTQ Velos ESI positive and negative ion calibration solution (ThermoFisher) prior to analysis.

Data processing

Raw files were converted to mzML files using MSConvert (38). Classical molecular networking was performed using GNPS (9). The parameters used are listed in Table S5, and we found that varying the parameters did not significantly impact rational library building (Data S2). Molecular networks were visualized using Cytoscape version 3.9.1 (39). All data processing was done in RStudio version 4.3.0 or Jupyter Notebook 6.5.4. The map of soil sample collection and active extract locations was made in QGIS 3.30.3.

Rational libraries were generated using R code made freely available (see Data Availability). This code uses the node table from the classical molecular networking job, which summarizes each feature, its scaffold family, and which extracts contain the feature. Instructions for downloading the node table are in Fig. S9. To choose extracts to be added to the rational library, the code aggregates the features by scaffold, then chooses the extract that contains the most scaffolds. Then, those scaffolds are deleted from the data set. After that, the extract with the most scaffolds not already accounted for is added to the rational library. This process repeats until a desired percentage of maximum diversity is reached or maximum diversity is achieved.
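
The selection loop described above can be sketched as a greedy set-cover procedure. The following Python example is an illustration only, not the authors' R code: it assumes the node table has already been aggregated into a mapping from extract name to the set of scaffold identifiers detected in that extract, and the function name `build_rational_library` and the alphabetical tie-breaking rule are hypothetical.

```python
def build_rational_library(extract_scaffolds, target_fraction=1.0):
    """Greedily pick extracts until the target share of scaffolds is covered.

    extract_scaffolds: dict mapping extract name -> set of scaffold IDs.
    target_fraction: desired fraction of all detected scaffolds (e.g., 0.8).
    Returns (ordered list of selected extracts, fraction of scaffolds covered).
    """
    all_scaffolds = set().union(*extract_scaffolds.values())
    if not all_scaffolds:
        return [], 0.0
    goal = target_fraction * len(all_scaffolds)
    remaining = dict(extract_scaffolds)
    covered = set()
    selected = []
    while len(covered) < goal and remaining:
        # Extract contributing the most scaffolds not yet accounted for;
        # ties are broken alphabetically (an assumption of this sketch).
        best = max(sorted(remaining), key=lambda e: len(remaining[e] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # no remaining extract adds a new scaffold
        covered |= gain
        selected.append(best)
    return selected, len(covered) / len(all_scaffolds)

# Toy example with four extracts and six scaffolds.
toy = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1, 2}}
print(build_rational_library(toy))        # (['A', 'C'], 1.0)
print(build_rational_library(toy, 0.6))   # (['A'], 0.6666666666666666)
```

In the toy example, extract A is chosen first because it contains the most scaffolds, and extract C is chosen next because it adds two new scaffolds whereas B adds only one.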

Figure 1 was generated using BioRender.com.

Bioactivity correlations

The quantitative feature data were generated using MZmine 2.53 (40) using the parameters listed in Table S8. The positive and negative ionization data were processed separately. To generate the quantitative table, highly repetitive m/z values with close retention times, indicating background or oversplit features, were removed. A fivefold blank removal was applied, removing peaks that can be attributed to the solvent and Cheerios medium. All data were normalized to the total signal in each extract (TIC normalization).
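
The blank removal and TIC normalization steps can be sketched as below. The precise blank-removal rule is not specified in the text; this Python sketch assumes a feature is kept when its maximum intensity across extracts is at least fivefold its intensity in the blank. That rule, the function names, and the nested-dictionary layout are assumptions for illustration.

```python
def remove_blank_features(extracts, blank, fold=5.0):
    """Drop features whose best sample signal is below fold x the blank signal.

    extracts: dict mapping extract name -> {feature ID: intensity}.
    blank: dict mapping feature ID -> intensity measured in the blank.
    """
    features = {f for table in extracts.values() for f in table}
    keep = {
        f for f in features
        if max(t.get(f, 0.0) for t in extracts.values()) >= fold * blank.get(f, 0.0)
    }
    return {
        name: {f: v for f, v in table.items() if f in keep}
        for name, table in extracts.items()
    }

def tic_normalize(extracts):
    """Divide each feature intensity by the total signal of its extract."""
    normalized = {}
    for name, table in extracts.items():
        total = sum(table.values())
        normalized[name] = {f: (v / total if total else 0.0) for f, v in table.items()}
    return normalized
```

Applying `remove_blank_features` before `tic_normalize` follows the order given in the text, so that background signal does not contribute to each extract's total.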

It is possible that the same molecule can produce multiple features, in the form of different adducts and/or in-source fragments (ISFs). This represents a limitation in our bioactivity correlation and retention analysis. However, if choosing features for isolation, this issue can be identified, because different adducts and ISFs often appear in the same scaffold due to similar MS/MS fragmentation.

Bioactivity correlations were done in RStudio with code that is also freely available at https://github.com/mmness/RationalLibraryBuilding, and were loosely adapted from the method described by Nothias et al. (21). In their studies, bioactivity scores of extract features were calculated and imported into a feature-based molecular network (41) to identify bioactive scaffolds. Our method is similar in using quantitative tables generated from MZmine to predict bioactive features. However, there are some key differences. First, we filtered out features present in fewer than three extracts. While these may represent bioactive candidates, an accurate correlation coefficient cannot be calculated with fewer than three points. Second, we applied a Spearman correlation instead of a Pearson correlation. Pearson correlations assume linear relationships and normally distributed data, whereas metabolomics data often exhibit logarithmic or exponential relationships (42). Therefore, Spearman correlations are more suitable for our calculations. Finally, we analyzed the retention of the bioactive molecules in the rational libraries, which was not performed by Nothias et al. (21).

The code requires the activity data for one of the assays as input. It then calculates a Spearman correlation between each feature's signal and the activity, and calculates and corrects the P value for each feature-activity correlation. Features are retained if they have an FDR-adjusted P value less than 0.05 and a Spearman correlation statistic greater than 0.5 (positively correlated with activity).
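The filtering step can be sketched in Python as follows. The text does not name the FDR procedure, so the Benjamini-Hochberg adjustment used here is an assumption, and the feature names, correlation values, and raw P values are hypothetical inputs rather than results from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_features(stats, rho_min=0.5, alpha=0.05):
    """Keep features with rho > rho_min and adjusted P < alpha.

    stats maps a feature name to a (rho, raw P value) pair.
    """
    names = list(stats)
    adjusted = benjamini_hochberg([stats[name][1] for name in names])
    return [
        name
        for name, q in zip(names, adjusted)
        if stats[name][0] > rho_min and q < alpha
    ]


# Hypothetical per-feature results: (Spearman rho, raw P value).
stats = {
    "F1": (0.82, 0.001),
    "F2": (0.61, 0.020),
    "F3": (0.55, 0.200),   # positive correlation but not significant
    "F4": (-0.70, 0.004),  # significant but negatively correlated
}

selected = select_features(stats)
```

With these inputs, only F1 and F2 pass both thresholds: F3 fails the adjusted P value cutoff and F4 fails the correlation cutoff.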

Bioactivity scores were calculated for both positive and negative ionization MS data, and the data represented in Table 2 show the sum of significant features from both ionization methods. For information about whether the feature was detected in positive or negative mode, see Data S1.

For the analysis of the publicly available data, the MZmine quantitative table provided by the authors was used (22). Rather than a Spearman correlation, a Pearson bivariate correlation coefficient was calculated, because the T. cruzi activity data were provided as binary rather than continuous values. The data presented in Table 4 represent only positive ionization data, as negative ionization data were not collected on these samples.
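When one variable is binary, the Pearson coefficient is equivalent to the point-biserial correlation, which compares the mean feature signal of the active and inactive groups. The Python sketch below illustrates this equivalence. It is not the authors' code, and the intensities and activity labels are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def point_biserial(values, labels):
    """Point-biserial correlation between a continuous variable and 0/1 labels."""
    n = len(values)
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    mean_all = sum(values) / n
    sd_pop = sqrt(sum((v - mean_all) ** 2 for v in values) / n)  # population SD
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    return (mean1 - mean0) / sd_pop * sqrt(len(group1) * len(group0) / n ** 2)


# Hypothetical normalized intensities of one feature across six extracts,
# with binary activity calls (1 = active, 0 = inactive).
intensity = [0.10, 0.40, 0.35, 0.05, 0.50, 0.02]
active = [0, 1, 1, 0, 1, 0]

r_pearson = pearson(intensity, active)
r_point_biserial = point_biserial(intensity, active)
```

The two values agree up to floating-point error, so a standard Pearson routine can be applied directly to a 0/1 activity column.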

The code for processing the MZmine quantitative table and for the bioactivity correlation calculations is available at https://github.com/mmness/RationalLibraryBuilding.

ACKNOWLEDGMENTS

This project was supported by NIH awards R01GM145649, R33AI119713, and R01AI154777. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Contributor Information

Robert H. Cichewicz, Email: rcic@med.umich.edu.

Laura-Isobel McCall, Email: lmccall@sdsu.edu.

Jack A. Gilbert, University of California San Diego, La Jolla, California, USA

DATA AVAILABILITY

The .raw and .mzML data files used in this article are available in the MassIVE repository (massive.ucsd.edu) under accession numbers MSV000091950 (positive data) and MSV000091980 (negative data). The classical molecular networking jobs are available at https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=37a75c380d7c464ba278433c6434f7c1 (fungal extracts, positive data, and parameters listed in Table S5), https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=ab2cceee30e347d185e17d3e2dc95092 (fungal extracts, negative data, and parameters listed in Table S5), and https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=8ad5a0af7cbc4859a7b943fcaebb7e9d (low-resolution mimicking data, fungal extracts, positive data, and parameters listed in Table S5). The jobs used to verify parameter sensitivity are listed in Data S2. The publicly available data (22) used to verify our method are available in the MassIVE repository under accession number MSV000087728. The classical molecular networking job is available at https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=a0349e711a214913b0222041e76c2346. The code used, including for library building and bioactivity correlations, is available at https://github.com/mmness/RationalLibraryBuilding. The bioactivity correlation code is loosely based on the technique presented by Nothias et al. (21) (see "Bioactivity correlations"). The GenBank accession numbers used for sequencing are PP664564 to PP665462.

SUPPLEMENTAL MATERIAL

The following material is available online at https://doi.org/10.1128/msystems.00844-24.

Supplemental Data Sheets. msystems.00844-24-s0001.xlsx.

Data S1 to S4.

DOI: 10.1128/msystems.00844-24.SuF1

Supplemental material. msystems.00844-24-s0002.pdf.

Tables S1 to S8, Fig. S1 to S9, and captions for Data S1 to S4.

DOI: 10.1128/msystems.00844-24.SuF2

ASM does not own the copyrights to Supplemental Material that may be linked to, or accessed through, an article. The authors have granted ASM a non-exclusive, world-wide license to publish the Supplemental Material files. Please contact the corresponding author directly for reuse.

REFERENCES

  • 1. Newman DJ, Cragg GM. 2016. Natural products as sources of new drugs from 1981 to 2014. J Nat Prod 79:629–661. doi: 10.1021/acs.jnatprod.5b01055
  • 2. Crüsemann M, O’Neill EC, Larson CB, Melnik AV, Floros DJ, da Silva RR, Jensen PR, Dorrestein PC, Moore BS. 2017. Prioritizing natural product diversity in a collection of 146 bacterial strains based on growth and extraction protocols. J Nat Prod 80:588–597. doi: 10.1021/acs.jnatprod.6b00722
  • 3. Macintyre L, Zhang T, Viegelmann C, Martinez IJ, Cheng C, Dowdells C, Abdelmohsen UR, Gernert C, Hentschel U, Edrada-Ebel R. 2014. Metabolomic tools for secondary metabolite discovery from marine microbial symbionts. Mar Drugs 12:3416–3448. doi: 10.3390/md12063416
  • 4. Anderson VM, Wendt KL, Caughron JB, Matlock HP, Rangu N, Najar FZ, Miller AN, Luttenton MR, Cichewicz RH. 2022. Assessing microbial metabolic and biological diversity to inform natural product library assembly. J Nat Prod 85:1079–1088. doi: 10.1021/acs.jnatprod.1c01197
  • 5. Navarro-Muñoz JC, Selem-Mojica N, Mullowney MW, Kautsar SA, Tryon JH, Parkinson EI, De Los Santos ELC, Yeong M, Cruz-Morales P, Abubucker S, Roeters A, Lokhorst W, Fernandez-Guerra A, Cappelini LTD, Goering AW, Thomson RJ, Metcalf WW, Kelleher NL, Barona-Gomez F, Medema MH. 2020. A computational framework to explore large-scale biosynthetic diversity. Nat Chem Biol 16:60–68. doi: 10.1038/s41589-019-0400-9
  • 6. Clark CM, Nguyen L, Pham VC, Sanchez LM, Murphy B. 2022. Automated microbial library generation using the bioinformatics platform IDBac. Molecules 27. doi: 10.3390/molecules27072038
  • 7. Clark CM, Costa MS, Sanchez LM, Murphy BT. 2018. Coupling MALDI-TOF mass spectrometry protein and specialized metabolite analyses to rapidly discriminate bacterial function. Proc Natl Acad Sci USA 115:4981–4986. doi: 10.1073/pnas.1801247115
  • 8. Costa MS, Clark CM, Ómarsdóttir S, Sanchez LM, Murphy BT. 2019. Minimizing taxonomic and natural product redundancy in microbial libraries using MALDI-TOF MS and the bioinformatics Pipeline IDBac. J Nat Prod 82:2167–2173. doi: 10.1021/acs.jnatprod.9b00168
  • 9. Wang M, Carver JJ, Phelan VV, Sanchez LM, Garg N, Peng Y, Nguyen DD, Watrous J, Kapono CA, Luzzatto-Knaan T, et al. 2016. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat Biotechnol 34:828–837. doi: 10.1038/nbt.3597
  • 10. Hernandez A, Nguyen LT, Dhakal R, Murphy BT. 2021. The need to innovate sample collection and library generation in microbial drug discovery: a focus on academia. Nat Prod Rep 38:292–300. doi: 10.1039/d0np00029a
  • 11. Anderson VM, Wendt KL, Najar FZ, McCall L-I, Cichewicz RH. 2021. Building natural product libraries using quantitative Clade-Based and chemical clustering strategies. mSystems 6:e00644-21. doi: 10.1128/mSystems.00644-21
  • 12. Ito T, Odake T, Katoh H, Yamaguchi Y, Aoki M. 2011. High-throughput profiling of microbial extracts. J Nat Prod 74:983–988. doi: 10.1021/np100859a
  • 13. Kim S, Han L, Yu B, Hähnke VD, Bolton EE, Bryant SH. 2015. PubChem structure-activity relationship (SAR) clusters. J Cheminform 7:33. doi: 10.1186/s13321-015-0070-x
  • 14. Lassalas P, Gay B, Lasfargeas C, James MJ, Tran V, Vijayendran KG, Brunden KR, Kozlowski MC, Thomas CJ, Smith AB 3rd, Huryn DM, Ballatore C. 2016. Structure property relationships of carboxylic acid isosteres. J Med Chem 59:3183–3203. doi: 10.1021/acs.jmedchem.5b01963
  • 15. Pinheiro P de SM, Franco LS, Fraga CAM. 2023. The magic methyl and its tricks in drug discovery and development. Pharmaceuticals (Basel) 16:1157. doi: 10.3390/ph16081157
  • 16. Hughes JP, Rees S, Kalindjian SB, Philpott KL. 2011. Principles of early drug discovery. Br J Pharmacol 162:1239–1249. doi: 10.1111/j.1476-5381.2010.01127.x
  • 17. Sukuru SCK, Jenkins JL, Beckwith REJ, Scheiber J, Bender A, Mikhailov D, Davies JW, Glick M. 2009. Plate-based diversity selection based on empirical HTS data to enhance the number of hits and their chemical diversity. J Biomol Screen 14:690–699. doi: 10.1177/1087057109335678
  • 18. Crisman TJ, Jenkins JL, Parker CN, Hill WAG, Bender A, Deng Z, Nettles JH, Davies JW, Glick M. 2007. “Plate cherry picking”: a novel semi-sequential screening paradigm for cheaper, faster, information-rich compound selection. J Biomol Screen 12:320–327. doi: 10.1177/1087057107299427
  • 19. O’Hagan S, Kell DB. 2019. Generation of a small library of natural products designed to cover chemical space inexpensively. Pharm Front 1:e190005. doi: 10.20900/pf20190005
  • 20. Olivon F, Allard P-M, Koval A, Righi D, Genta-Jouve G, Neyts J, Apel C, Pannecouque C, Nothias L-F, Cachet X, Marcourt L, Roussi F, Katanaev VL, Touboul D, Wolfender J-L, Litaudon M. 2017. Bioactive natural products prioritization using massive multi-informational molecular networks. ACS Chem Biol 12:2644–2651. doi: 10.1021/acschembio.7b00413
  • 21. Nothias L-F, Nothias-Esposito M, da Silva R, Wang M, Protsyuk I, Zhang Z, Sarvepalli A, Leyssen P, Touboul D, Costa J, Paolini J, Alexandrov T, Litaudon M, Dorrestein PC. 2018. Bioactivity-based molecular networking for the discovery of drug leads in natural product bioassay-guided fractionation. J Nat Prod 81:758–767. doi: 10.1021/acs.jnatprod.7b00737
  • 22. Allard P-M, Gaudry A, Quirós-Guerrero L-M, Rutz A, Dounoue-Kubo M, Walker TWN, Defossez E, Long C, Grondin A, David B, Wolfender J-L. 2022. Open and reusable annotated mass spectrometry dataset of a chemodiverse collection of 1,600 plant extracts. Gigascience 12:giac124. doi: 10.1093/gigascience/giac124
  • 23. Smith CA, O’Maille G, Want EJ, Qin C, Trauger SA, Brandon TR, Custodio DE, Abagyan R, Siuzdak G. 2005. METLIN: a metabolite mass spectral database. Ther Drug Monit 27:747–751. doi: 10.1097/01.ftd.0000179845.53213.39
  • 24. Qin G-F, Zhang X, Zhu F, Huo Z-Q, Yao Q-Q, Feng Q, Liu Z, Zhang G-M, Yao J-C, Liang H-B. 2022. MS/MS-based molecular networking: an efficient approach for natural products dereplication. Molecules 28:157. doi: 10.3390/molecules28010157
  • 25. Morehouse NJ, Clark TN, McMann EJ, van Santen JA, Haeckl FPJ, Gray CA, Linington RG. 2023. Annotation of natural product compound families using molecular networking topology and structural similarity fingerprinting. Nat Commun 14:308. doi: 10.1038/s41467-022-35734-z
  • 26. Niu G, Wang X, Li J. 2024. Leucinostatins target plasmodium mitochondria to block malaria transmission. Parasit Vectors 17:524. doi: 10.1186/s13071-024-06608-8
  • 27. Peramuna T, et al. 2024. Semisynthetic tetramate-containing fungal metabolites with activity against and. ACS Med Chem Lett 15:1933–1939. doi: 10.1021/acsmedchemlett.4c00386
  • 28. Schmid R, Petras D, Nothias L-F, Wang M, Aron AT, Jagels A, Tsugawa H, Rainer J, Garcia-Aloy M, Dührkop K, et al. 2021. Ion identity molecular networking for mass spectrometry-based metabolomics in the GNPS environment. Nat Commun 12:3832. doi: 10.1038/s41467-021-23953-9
  • 29. Guo J, Shen S, Xing S, Yu H, Huan T. 2021. ISFrag: de Novo recognition of in-source fragments for liquid chromatography-mass spectrometry data. Anal Chem 93:10243–10250. doi: 10.1021/acs.analchem.1c01644
  • 30. Hossain E, Khanam S, Dean DA, Wu C, Lostracco-Johnson S, Thomas D, Kane SS, Parab AR, Flores K, Katemauswa M, Gosmanov C, Hayes SE, Zhang Y, Li D, Woelfel-Monsivais C, Sankaranarayanan K, McCall L-I. 2020. Mapping of host-parasite-microbiome interactions reveals metabolic determinants of tropism and tolerance in Chagas disease. Sci Adv 6:eaaz2015. doi: 10.1126/sciadv.aaz2015
  • 31. Dreiman GHS, Bictash M, Fish PV, Griffin L, Svensson F. 2021. Changing the HTS paradigm: AI-driven iterative screening for hit finding. SLAS Discov 26:257–262. doi: 10.1177/2472555220949495
  • 32. Miggiels P, Wouters B, van Westen GJP, Dubbelman A-C, Hankemeier T. 2019. Novel technologies for metabolomics: more for less. TrAC Trends in Analytical Chemistry 120:115323. doi: 10.1016/j.trac.2018.11.021
  • 33. Du L, Robles AJ, King JB, Powell DR, Miller AN, Mooberry SL, Cichewicz RH. 2014. Crowdsourcing natural products discovery to access uncharted dimensions of fungal metabolite diversity. Angew Chem Int Ed Engl 53:804–809. doi: 10.1002/anie.201306549
  • 34. Du L, King JB, Morrow BH, Shen JK, Miller AN, Cichewicz RH. 2012. Diarylcyclopentendione metabolite obtained from a Preussia typharum isolate procured using an unconventional cultivation approach. J Nat Prod 75:1819–1823. doi: 10.1021/np300473h
  • 35. King JB, Carter AC, Dai W, Lee JW, Kil Y-S, Du L, Helff SK, Cai S, Huddle BC, Cichewicz RH. 2019. Design and application of a high-throughput, high-content screening system for natural product inhibitors of the human parasite Trichomonas vaginalis. ACS Infect Dis 5:1456–1470. doi: 10.1021/acsinfecdis.9b00156
  • 36. Collins JE, Lee JW, Rocamora F, Saggu GS, Wendt KL, Pasaje CFA, Smick S, Santos NM, Paes R, Jiang T, Mittal N, Luth MR, Chin T, Chang H, McLellan JL, Morales-Hernandez B, Hanson KK, Niles JC, Desai SA, Winzeler EA, Cichewicz RH, Chakrabarti D. 2024. Antiplasmodial peptaibols act through membrane directed mechanisms. Cell Chem Biol 31:312–325. doi: 10.1016/j.chembiol.2023.10.025
  • 37. Trager W, Jensen JB. 1976. Human malaria parasites in continuous culture. Science 193:673–675. doi: 10.1126/science.781840
  • 38. Chambers MC, Maclean B, Burke R, Amodei D, Ruderman DL, Neumann S, Gatto L, Fischer B, Pratt B, Egertson J, et al. 2012. A cross-platform toolkit for mass spectrometry and proteomics. Nat Biotechnol 30:918–920. doi: 10.1038/nbt.2377
  • 39. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504. doi: 10.1101/gr.1239303
  • 40. Pluskal T, Castillo S, Villar-Briones A, Oresic M. 2010. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11:395. doi: 10.1186/1471-2105-11-395
  • 41. Nothias L-F, Petras D, Schmid R, Dührkop K, Rainer J, Sarvepalli A, Protsyuk I, Ernst M, Tsugawa H, Fleischauer M, et al. 2020. Feature-based molecular networking in the GNPS analysis environment. Nat Methods 17:905–908. doi: 10.1038/s41592-020-0933-6
  • 42. Yu H, Xing S, Nierves L, Lange PF, Huan T. 2020. Fold-change compression: an unexplored but correctable quantitative bias caused by nonlinear electrospray Ionization responses in untargeted metabolomics. Anal Chem 92:7011–7019. doi: 10.1021/acs.analchem.0c00246


Articles from mSystems are provided here courtesy of American Society for Microbiology (ASM)
