Skip to main content
BMC Bioinformatics logoLink to BMC Bioinformatics
. 2026 Jan 7;27:36. doi: 10.1186/s12859-025-06358-z

Predicting the pathway involvement of metabolites annotated in the MetaCyc knowledgebase

Erik D Huckvale 1, Hunter N B Moseley 1,2,3,4,5,
PMCID: PMC12870939  PMID: 41495634

Abstract

Background

The associations of metabolites with biochemical pathways are highly useful information for interpreting molecular datasets generated in biological and biomedical research. However, such pathway annotations are sparse in most molecular datasets, limiting their utility for pathway level interpretation. To address these shortcomings, several past publications have presented machine learning models for predicting the pathway association of small biomolecule (metabolite and xenobiotic) using data from the Kyoto Encyclopedia of Genes and Genomes (KEGG). But other similar knowledgebases exist, for example MetaCyc, which has more compound entries and pathway definitions than KEGG.

Results

As a logical next step, we trained and evaluated multilayer perceptron models on compound entries and pathway annotations obtained from MetaCyc. From the models trained on this dataset, we observed a mean Matthews correlation coefficient (MCC) of 0.845 with 0.0101 standard deviation, compared to a mean MCC of 0.847 with 0.0098 standard deviation for the KEGG dataset. However, KEGG’s 184 metabolic-only pathway predictions (out of 502 total pathways) have a mean MCC of 0.800 with 0.021 standard deviation. Since MetaCyc pathways are metabolic focused, the MetaCyc results represent over a 5.6% improvement in metabolic pathway prediction performance.

Conclusions

These performance results are pragmatically the same, demonstrating that in aggregate, the 4055 MetaCyc pathways can be effectively predicted at the current state-of-the-art performance level.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-025-06358-z.

Keywords: Pathways, Metabolites, Machine learning, Biochemistry, Neural networks

Background

Metabolism is the set of biochemical reactions and related processes for sustaining life. These biochemical reactions convert environmentally-derived and endogenous molecular and energy resources into useful forms of chemical energy and molecular building blocks that drive biological processes, while converting metabolic waste into forms that can be disposed of. Metabolism can also be viewed as the entirety of these biochemical reactions organized as a metabolic network. Related biological processes can be included into a more general molecular interaction network that is centered on metabolism. Such networks are representable as a hypergraph with biomolecules as nodes and chemical reactions and non-covalent molecular interactions as edges. Certain connected parts or subgraphs of the metabolic-centric molecular interaction network are then defined as distinct “pathways” [13].

Compounds connected within a pathway are defined as being associated with that pathway, which is typically represented as a pathway annotation for the compound. Pathway annotations are highly useful for mechanistic interpretation of complex molecular datasets. These annotations are often used in pathway annotation enrichment analysis to identify perturbed pathways based on large datasets of perturbed biomolecules, providing pathway-level insight and interpretation of such datasets. While knowledgebases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) [4] and MetaCyc [5] contain useful pathway annotations for several thousand compounds, the majority of detected compounds in metabolomics datasets have no pathway annotations, limiting their use in pathway annotation enrichment analysis. Considering the costly and cumbersome work involved in experimentally determining these annotations in vitro, many researchers could benefit from predicting pathway annotations for compounds in silico. Several prior publications present machine learning models that perform this pathway prediction task using a novel feature and dataset engineering approach that were trained and tested with data from KEGG [610]. This novel modeling approach generated features for every compound entry representing a full enumeration of all chemical subgraph neighborhoods within 3 bonds generated for every non-hydrogen atom in the chemical structure. Compound feature vectors associated with the same pathway are aggregated to produce pathway feature vectors. All compound feature vectors are cross joined with all pathway feature vectors so that a single binary classifier is trained to predict whether a given compound is associated with a given pathway. This novel multitask classification approach simplifies the problem to using only a single binary classification model while greatly increasing the number of usable entries in the dataset. All prior approaches used a one-vs-rest or multiclass strategy to classify compound feature vectors to specific pathway classes and only attempted classifying associations with the 12 Level 2 metabolic pathway categories defined in KEGG.

Based on the success of these models trained on KEGG-derived datasets, a logical next step is to construct a similar machine learning dataset from MetaCyc compounds and pathway annotations. The MetaCyc dataset focuses on metabolic pathways as compared to KEGG which contains a wider variety of types of pathways: only 184 out of 502 pathways in KEGG are metabolic pathways. Moreover, MetaCyc contains far more pathways (4,055 pathways) and over 50% more compound entries (9,847 vs. 6,485). In this work, we demonstrate that MetaCyc pathways can be predicted as effectively as KEGG pathways, while providing finer pathway granularity.Background

Methods

The construction of a dataset for model training and evaluation requires the molecular structures of the compounds along with their pathway annotations. Our molecular structure parsing methods expect molfiles [11]. On July 9th 2024, we downloaded the molfiles from MetaCyc’s website here https://metacyc.org/download.shtml. The MetaCyc pathways, similar to KEGG, are organized into a hierarchy as seen here: https://metacyc.org/META/class-tree?object=Pathways. The compound to pathway mappings were downloaded on July 9th, 2024. We accessed them by traversing the pathway hierarchy using MetaCyc’s web API at the following base URL: https://biocyc.org/META/ajax-direct-subs. With the pathway ID query parameter, you can access the children of a given pathway node. Starting with the base ID ‘Pathways’, the first requestion URL is https://biocyc.org/META/ajax-direct-subs?object=Pathways. Next, you obtain children pathway IDs and recursively request children pathways. Traversing the hierarchy in this way enables access to all the MetaCyc pathway IDs from the root node down to the leaf nodes. Compound associations were obtained from MetaCyc’s web API using this base URL: https://websvc.biocyc.org/apixml?fn=compounds-of-pathway. Again, a pathway ID query parameter is specified. For example, the URL https://websvc.biocyc.org/apixml?fn=compounds-of-pathway%26id=META:PWY-7723%26detail=low obtains the compounds associated with the pathway corresponding to pathway ID META: PWY-7723.

After obtaining the MetaCyc molfiles and compound-pathway mappings, we constructed the MetaCyc dataset with the methods used for the KEGG dataset [10]. This involved creating feature vectors representing compounds, feature vectors representing pathways, and concatenating them together in a cross join. The concatenated compound-pathway paired feature vector is further concatenated with a boolean label indicating whether the given compound is associated with the given pathway [8], which creates a feature vector that represents a unique classification task in a multitask classification approach. Since the resulting list of concatenated metabolite-pathway feature vectors represent unique classification tasks, there is no data leakage issue when they are split into separate training and test sets. We created the compound features using an atom graph neighborhood coloring technique introduced in our lab [7, 1214]. The atom colors represent specific molecular substructures 1, 2, or 3 bonds away from a central atom node and the corresponding feature values are the number of times that these chemical substructures appear in the compound. With these atom color counts acting as features for an individual compound, the pathway features are an aggregation of the features of the compounds associated with a given pathway [8]. Both the compound features and pathway features were deduplicated and then normalized. Specifically, we performed feature-wise normalization and then an entry-wise normalization. The feature-wise normalization was a softmax, dividing the feature values by the sum of all features for each compound or pathway feature vector. This softmax for the pathway feature vectors in particular allows the model to learn differences within a pathway rather than differences in pathway size across pathways. The entry-wise normalization was a min-max normalization [10]. Figure 1 displays an overview of this dataset creation process.

Fig. 1.

Fig. 1

Overview of the dataset creation, from the molfiles and pathway annotations to the metabolite and pathway features. In this figure, ac stands for atom color i.e. the atom color feature headers, n represents the number of atom colors enumerated across all molfiles, c represents the atom color count within the feature vector of a compound (metabolite), p represents that within a pathway feature vector, m represents the number of metabolites in the dataset, and o represents the number of pathways in the dataset

Table 1 compares the MetaCyc dataset to the KEGG dataset. We see that with nearly 40 million entries, the MetaCyc-derived dataset is the largest published at the date of submission, having more than 10-fold the number of paired compound-pathway entries (39,929,585 in bold) than the KEGG-derived dataset (3,255,470). This is mostly due to the 8-fold larger number of pathway definitions in MetaCyc vs. KEGG.

Table 1.

Comparing the KEGG dataset to the metacyc dataset

Dataset #Compounds #Pathways #Compound
Features
#Pathway
Features
#Concatenated
Features
#Entries Proportion
Positive
Reference
KEGG 6,485 502 16,509 11,321 27,830 3,255,470 1.38% [10]
MetaCyc 9,847 4,055 19,081 15,349 34,430 39,929,585 0.33% Current Study

With the pathways being hierarchically organized, there are pathways at different hierarchical levels. We will refer to the top-level pathways, e.g. ‘Biosynthesis’, ‘Detoxification’, ‘Glycan Pathways’, ‘Transport’, etc. as level 1 or L1. Likewise, we refer to the level 2 pathways as L2, and the remaining as L3, L4, L5, L6, L7, and L8. To determine the impact of excluding certain pathway levels from the training set on the performance of the remaining pathways, we constructed two subsets of the full MetaCyc dataset i.e. L2+ and L3+. The L1+ dataset contains all of the pathways in the hierarchy, while the L2+ excludes the L1 pathways and contains the L2 pathways and all pathways underneath them in the hierarchy. Likewise, the L3+ dataset excludes the L1 and L2 pathways. Table 2 shows stats for each of these datasets.

Table 2.

Comparing the subsets of the metacyc dataset where L1+ is the full metacyc dataset, L2+ excludes L1 pathways, and L3+ excludes L1 and L2 pathways

Dataset #Entries #Pathways
L1+ 39,929,585 4,055
L2+ 39,811,421 4,043
L3+ 38,541,158 3,914

For our analyses related to the hierarchy levels, we rolled L8 into L7 to form the L7 + L8 hierarchy level for more statistically meaningful results, since L8 only had 11 pathways. The reason we did not roll the L1 pathways into L2, is because the L1 pathways are much larger than the L8 pathways, even though there are only 12 L1 pathways. Figure 2 shows the number of pathways in each hierarchy level from L1 to L7 + L8. Figure S1 shows this same info, but it shows the number of L7 and L8 pathways separately.

Fig. 2.

Fig. 2

The number of pathways within each hierarchy level

We used a cross validation (CV) analysis with characteristics of both a bootstrap and jackknife analysis, involving 200 or 50 separate iterations of 9:1 train / test splits. On each CV iteration, we stochastically split the dataset into a 90% train and 10% test set in a stratified manner [15], ensuring that the proportion of positive entries in the train set is as close as possible to the proportion in the test set. Table S1 details information about these proportions, demonstrating that the train and test set positive entry proportions are nearly identical to that of the respective full datasets. Remember that identical proportions are only possible if the number of positive and negative are both divisible by 10 for the 90:10 train: test splits. To handle the class imbalance problem, the positive entries were duplicated in the training set only. All positive samples were duplicated a calculated number of times such that the number of positive entries is maximally close to the number of negative entries without exceeding the number of negative entries. We trained a multi-layer perceptron binary classifier on the train set and evaluated on the test set, collecting the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The hyperparameters of the multi-layer perceptron are detailed in Table S2. Additionally we used a binary cross entropy loss function, a batch size of 25,000, a maximum number of epochs of 250 with early stopping based on average training loss across batches, the Adam optimizer [16], and random seeding for reproducing the model weight initialization and train / test splits. This enabled us to calculate metrics including the Matthews correlation coefficient (MCC) [17] on each CV iteration and then to calculate a mean, median, and standard deviation of each metric across CV iterations. It also enabled us to calculate the MCC per compound and per pathway by counting the TP, TN, FP, and FN in the test set for a given compound or pathway (since each entry is a compound-pathway pair) and then summing those counts across all CV iterations. An MCC cannot be calculated on a single CV iteration since there may not be enough positive entries in the test set for a single compound or pathway to avoid a division by zero. However, we can calculate an overall MCC for an individual compound or pathway by summing TP, TN, FP, and FN across all CV iterations. Likewise, we calculated the MCC for each hierarchy level by summing the TP, TN, FP, and FN across all pathways in a given hierarchy level across all CV iterations. We calculated the overall MCCs of individual compounds and pathways from the L1+ dataset (i.e. the full dataset) and ran the L1+ dataset for 200 CV iterations. We ran the L2+ and L3+ datasets for 50 CV iterations and used the results from those datasets to compare overall MCC across different hierarchy levels. We ran the L1+ dataset for more iterations to ensure a valid MCC calculation for individual compounds or pathways, some of which might not have many positive entries, necessitating more CV iterations to have enough TP or FP. Figure 3 displays an overview of these model training and evaluation procedures.

Fig. 3.

Fig. 3

Overview of the model training and evaluation procedures. In this figure, L represents the binary label indicating whether the given metabolite is part of the given pathway, m represents the number of metabolites in the dataset after deduplication, and o represents that for the number of pathways

The hardware used for this work included machines with up to 2 terabytes (TB) of random-access memory (RAM) and central processing units (CPUs) of 3.8 GHz (GHz) of processing speed. The name of the CPU chip was ‘Intel(R) Xeon(R) Platinum 8480CL’. The graphic processing units (GPUs) used had 81.56 gigabytes (GB) of GPU RAM, with the name of the GPU card being ‘NVIDIA H100 80GB HBM3’.

All code for this work was written in major version 3 of the Python programming language [18]. Data processing and storage was done using the Pandas [19], NumPy [20], and H5Py [21] packages. Models were constructed and trained using the PyTorch Lightning [22] package built upon the PyTorch [23] package. The stratified train test splits were computed using the Sci-Kit Learn package [24]. Hyperparameters were tuned using the Optuna Python Package [25]. Results were initially stored in an SQL database [26] using the DuckDB [27] package. Results were processed and visualized using Jupyter notebooks [28], the seaborn package [29] built upon the MatPlotLib [30] package, and the Tableau business intelligence application [31].

Results

Main results

Table 3 includes the mean, median, and standard deviation of the MCC calculated across CV iterations of the L1+ (full), L2+, and L3+ datasets. The mean and median MCCs are very close, even though distributions of MCC do not look unimodal nor very symmetric. Table S3 shows these results for other metrics i.e. accuracy, precision, recall, F1 score, and specificity. We observe a decrease in performance when we exclude higher level pathways.

Table 3.

Matthew’s correlation coefficient statistics for models trained on the L1+, L2+, and L3+ datasets

Hierarchy levels included Mean MCC Median MCC Standard Deviation
L1+ Dataset 0.8446 0.8454 0.0101
L2+ Dataset 0.8337 0.8347 0.0113
L3+ Dataset 0.8199 0.8214 0.0080

While Table 3 provides summary statistics describing the center and spread of the CV performances of the best performing dataset (L1+), Fig. 4 shows the distribution of MCC across the 200 CV iterations of the L1+ dataset.

Fig. 4.

Fig. 4

Distribution of the MCC across all CV iterations on the L1+ (full) MetaCyc dataset

While Table 3 suggests a decrease in performance when excluding higher level pathways, it’s not clear how much this is attributed to the quality of model training decreasing due to excluding these pathways in the training or to the predictive accuracy decreasing due to excluding them in the test set. Figure 5 shows the overall MCC across pathways at certain hierarchy levels and across all CV iterations of each dataset i.e. L1+, L2+, and L3+. When keeping the training set consistent, we still see that performance consistently declines at deeper hierarchy levels, which is to be expected since the size of the pathways decrease with depth as illustrated in the “MCC and compound/pathway size” subsection.

Fig. 5.

Fig. 5

Comparing the overall MCC of pathway hierarchy levels within each dataset

Figure 6 shows the same information as Fig. 5, but compares performance across the datasets within hierarchy levels rather than the performance of hierarchy levels within the datasets. We see that inclusion of the L1 pathways in the L1+ dataset improved the performance of the L2 pathways. Similarly, inclusion of the L2 pathways in the L2+ dataset improved performance of the L3 pathways, though inclusion of the L1 pathways did not provide further improvement of the L3 pathways. Meanwhile the pathways in the L4, L5, L6, and L7 + L8 hierarchy levels did not experience significant change in performance regardless of whether the L1 or L2 pathways were included in training. We interpret these results to indicate a slight transfer learning effect between higher level pathways and pathways at one step deeper in the hierarchy.

Fig. 6.

Fig. 6

Comparing the overall MCC between datasets within each hierarchy level

MCC and compound/pathway size

We’ve described how MCC is related to pathway hierarchy level, so the next logical questions are how MCC relates to the size of pathways, how it relates to compound size, and how pathway size correlates with hierarchy level. We define compound size as the number of non-hydrogen atoms in the compound. We define pathway size as the sum of the sizes of the compounds associated with that pathway. Figure 7 shows the distribution of the size of compounds and pathways in the full MetaCyc dataset.

Fig. 7.

Fig. 7

Distribution of compound size (number of non-hydrogen atoms in the molecule) and pathway size (sum of the sizes of compounds associated with a pathway)

While Fig. 7 shows the distribution of the size of all pathways, Fig. 8 shows the distribution of pathway size within each hierarchy level. We see an overall trend of pathway size declining deeper into the hierarchy. Note that the y-axis is on a log scale to illustrate the wide range of pathway sizes that span multiple orders of magnitude.

Fig. 8.

Fig. 8

Violinplot showing the distribution of the sizes of the pathways within each hierarchy level

While Fig. 7 shows the distribution of the size of individual pathways and compounds, Fig. 9 shows the distribution of the overall MCC of these pathways and compounds. There were 9,847 compounds and 4,055 pathway entries in the full MetaCyc dataset. The overall MCC distributions are negatively skewed for both pathways and compounds, with the compounds highly negatively skewed, likely due to truncation effects from boundary conditions.

Fig. 9.

Fig. 9

Distribution of the overall MCC of individual pathways and compounds

Now that we’ve seen the distribution of overall MCC and size of pathways and compounds, Fig. 10 contains scatterplots comparing the overall MCC of pathways and compounds to their size. Figure 10a and c compare pathway size to overall MCC and compound size to overall MCC respectively, while Fig. 10b and d are the same plots but with the x axis on a log scale for better visibility. We see that the pathway to overall MCC plot exhibits a funnel shape indicating that the variance of pathway MCC decreases as pathway size increases. Additionally, we see that the maximum overall MCC is lower for smaller pathways. Likewise, the maximum compound overall MCC does not reach 1.0 until reaching a compound size of 7 and we see before that, maximum compound overall MCC decreases as compound size decreases. The spearman correlation comparing overall MCC to pathway size and overall MCC to compound size respectively results in a coefficient of 0.361 and a p value of 6.37 × 10−125 and a coefficient of 0.190 and a p value of 8.04 × 10−81. The upward trends are not linear and the low correlation coefficients are considered small (0.1 ≤ 0.19 < 0.3) and medium (0.3 ≤ 0.361 < 0.5) in terms of effect size; however, the very low p values clearly indicate that the trends are real. Additionally, Fig. S2 shows the MCC standard deviation versus pathway size calculated from a sliding window of pathway size. The downward trend in MCC standard deviation versus pathway size has a Spearman correlation coefficient of − 0.6741, with a p value of 0 due to the numerical limit on precision.

Fig. 10.

Fig. 10

Scatterplots comparing the size of pathways and compounds to their overall MCC

Comparing MetaCyc to KEGG

Now that we’ve seen the results for the novel MetaCyc dataset, we will compare it to the results of the KEGG datasets of past publications. Table 4 compares the performance of the full MetaCyc dataset, which primarily contains metabolic pathways, to the performance of the 184 KEGG pathways under the ‘Metabolism’ category as well as the full KEGG dataset containing all 502 pathways. We see that the KEGG metabolic pathways alone have the lowest mean MCC and the highest standard deviation. The full KEGG dataset has the highest mean MCC and lowest standard deviation. However, the difference in mean MCCs between the MetaCyc and full KEGG datasets is 0.002. While the 0.002 difference between the two models appears real, as indicated by a p value of 0.048 from a Welch’s t test, the percent mean difference is only 0.24%, which represents a very small mean-normalized effect size. Even a variance-normalized effect size like a Cohen’s d is only 0.2, which is also considered small. Thus, the average performance of models trained on the MetaCyc dataset is pragmatically comparable to that of models trained on the full KEGG dataset. However, if the comparison is limited to metabolic pathways in KEGG, then MetaCyc represents over a 5.6% improvement in prediction performance. The results of a one-sided Welch’s t-test, with the hypothesis being that the MetaCyc mean MCC is greater than the metabolism-only KEGG MCC, resulted in a p value of 1.76 × 10−41.

Table 4.

Comparing the overall MCC of the metacyc dataset to past KEGG datasets

Dataset Mean MCC Median MCC Standard deviation References
KEGG (‘metabolism’ pathways only) 0.800 0.021 [8]
MetaCyc 0.845 0.845 0.0101 Current Study
KEGG 0.847 0.848 0.0098 [10]

Figure 11 compares the distribution of MCC for the full KEGG and MetaCyc datasets. Visually, the variance and median are comparable for the KEGG and MetaCyc datasets, which is quantitatively corroborated in Table 4.

Fig. 11.

Fig. 11

Violinplot comparing the distribution of the MetaCyc dataset’s MCC to that of KEGG

Figure 12a shows the distribution of pathway size (sum of the number of non-hydrogen atoms across all associated compounds) in an overlapping histogram. Figure 12b shows the corresponding smoothed density plot. Metacyc clearly has a unimodal distribution, while KEGG has a bimodal distribution. KEGG has a higher proportion of larger pathways as indicated by the higher blue mode, while MetaCyc has several times the number of larger pathways, even though its proportion is smaller. MetaCyc also has a higher proportion of pathways with 100 to 1000 pathway size, indicated by the single orange mode.

Fig. 12.

Fig. 12

Distribution and density plot comparing pathway size between the MetaCyc and KEGG datasets

Discussion

While the majority of publications on pathway prediction have used KEGG data, the methods clearly are applicable to data from other knowledgebases. In this work, we demonstrate effective training with the MetaCyc knowledgebase for prediction of MetaCyc pathway involvement to a level comparable to the performance on the KEGG knowledgebase. There is less than a 0.24% difference in performance and the Cohen’s d effect size is only 0.2. Pragmatically, the overall performance is equivalent; however, if the comparison is limited to metabolic pathways in KEGG, then MetaCyc represents over a 5.6% improvement in prediction performance. Given the 10-fold larger size of the MetaCyc dataset, we may have reached asymptotic performance level of metabolic pathways with increasing dataset size. Moreover, the MetaCyc results expand the number of pathway definitions that can be predicted by 8-fold. The MetaCyc dataset also contains roughly 50% more compounds than the KEGG dataset, expanding the diversity of compounds available for model training and testing. Another key difference between the MetaCyc and KEGG datasets is the depth of the pathway hierarchy. KEGG pathways only span 3 levels, while MetaCyc pathways span 8 levels. The wider range of pathway granularity may have utility for biological and biomedical interpretation.

In addition, the results of the MetaCyc dataset are consistent with past results from the KEGG dataset pertaining to compound size, pathway size, hierarchy depth, and MCC. The deeper you go into the pathway hierarchy, the more granular the pathways become, having fewer associated compounds. The lower number of associated compounds results in smaller pathway size. Pathway performance declines with shrinking pathway size and with increasing depth in the pathway hierarchy. Additionally, the variance of pathway performance decreases as pathway size increases, suggesting that larger pathways have more robust prediction performance. We observe similar results with compound size, with the maximum MCC increasing as compound size increases. We recommend that researchers take compound and pathway size into account when using predicted pathway annotations.

The results of the MetaCyc dataset are also consistent with past results regarding the inclusion of higher-level pathways. Even if researchers are primarily interested in lower-level pathways, inclusion of higher-level pathways in the training set results in better performance of the lower-level pathways, but only to the pathways one step deeper. However, the larger transfer learning effects occurs between pathways of the same depth and from deeper pathways to shallower pathways [810]. Thus, we recommend that the full dataset be used to train a model, even if only a subset of the pathways is used in a downstream analysis.

The results presented here are very promising and these metabolite-pathway annotation prediction models have high potential for a variety of applications, including in pathway enrichment analysis [32, 33] or metabolic pathway reconstruction [34, 35]. But first, the generalizability of these models with respect to chemical representation has not been evaluated, which represents a logical next step in this work, which will include comparisons between prediction performance for compound entries with cross-references between knowledgebases [36].

Conclusions

The comparable performance on the MetaCyc dataset to the KEGG dataset demonstrates that MetaCyc pathways can be predicted just as effectively. Now thousands of additional pathway definitions can be predicted using this approach. Average performance is effective even when incorporating more granular pathway definitions, which are more difficult to predict. The results of this work are consistent with past work showing that smaller pathways and smaller compounds are both more difficult to predict and prediction improves with more chemical information.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 2 (643.7KB, csv)
Supplementary Material 3 (248KB, docx)

Acknowledgements

We thank the University of Kentucky Institute for Biomedical Informatics, the University of Kentucky Center for Computational Sciences, and National Science Foundation grant number 1626364 for their support and associated computing resources.

Author contributions

EDH obtained the data, wrote the code for the data engineering and analysis, finalized the results, and wrote the first draft of the manuscript. HNBM conceptualized the project, provided funding, worked in a supervisory role, reviewed the results, revised the manuscript, and provided mentorship. Both EDH and HNBM worked together on the methodology.

Funding

This research was funded by the National Science Foundation, grant number 2020026 (PI Moseley), and by the National Institutes of Health, grant number P42 ES007380 (University of Kentucky Superfund Research Program Grant; PI Pennell). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Science Foundation nor the National Institute of Environmental Health Sciences.

Data availability

All data and code for reproducing the results of this manuscript are available in the following Figshare item: 10.6084/m9.figshare.27317163.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Voet D, Voet JG, Pratt CW. Fundamentals of biochemistry: life at the molecular. 5th ed. Hoboken: Wiley; 2016. [Google Scholar]
  • 2.Berg JM, Tymoczko JL, Gatto GJ, Stryer L. Biochemistry. 9th ed. New York: W. H. Freeman; 2019. [Google Scholar]
  • 3.Nelson DL, Cox MM. Principles of biochemistry. 8th ed. New York: W. H. Freeman; 2021. [Google Scholar]
  • 4.Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Caspi R, Billington R, Keseler IM, Kothari A, Krummenacker M, Midford PE, et al. The metacyc database of metabolic pathways and enzymes—a 2019 update. Nucleic Acids Res. 2020;48(D1):D445–53. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Huckvale ED, Moseley HNB. A cautionary Tale about properly vetting datasets used in supervised learning predicting metabolic pathway involvement. PLoS ONE. 2024;19(5):e0299583. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Huckvale ED, Powell CD, Jin H, Moseley HNB. Benchmark dataset for training machine learning models to predict the pathway involvement of metabolites. Metabolites. 2023;13(11):1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Huckvale ED, Moseley HNB. Predicting the pathway involvement of metabolites based on combined metabolite and pathway features. Metabolites. 2024;14(5):266. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Huckvale ED, Moseley HNB. Predicting the association of metabolites with both pathway categories and individual pathways. Metabolites. 2024;14(9):510. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Huckvale ED, Moseley HNB. Predicting the pathway involvement of all pathway and associated compound entries defined in the Kyoto encyclopedia of genes and genomes. Metabolites. 2024;14(11):582. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Dalby A, Nourse JG, Hounshell WD, Gushurst AKI, Grier DL, Leland BA, et al. Description of several chemical structure file formats used by computer programs developed at molecular design limited. J Chem Inf Model. 1992;32(3):244–55. [Google Scholar]
  • 12.Jin H, Moseley HNB. Hierarchical harmonization of atom-resolved metabolic reactions across metabolic databases. Metabolites. 2021;11(7):431. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Jin H, Mitchell JM, Moseley HNB. Atom identifiers generated by a neighborhood-specific graph coloring method enable compound harmonization across metabolic databases. Metabolites. 2020;10(9):368. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Jin H, Moseley HNB. md_harmonize: a python package for atom-level harmonization of public metabolic databases. Metabolites. 2023;13(12):1199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Verstraeten G, Van den Poel D. Using predicted outcome stratified sampling to reduce the variability in predictive performance of a One-Shot Train-and-Test split for individual customer predictions. ICDM (Posters). 2006;214:1–10.
  • 16.Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv. 2014.
  • 17.Chicco D, Starovoitov V, Jurman G. The benefits of the Matthews correlation coefficient (MCC) over the diagnostic odds ratio (DOR) in binary classification assessment. IEEE Access. 2021;9:47112–24. [Google Scholar]
  • 18.Rossum GV, Drake FL. Python 3 reference manual. CreateSpace; 2009.
  • 19.The pandas development team. pandas-dev/pandas: pandas 1.0.3. Zenodo. 2020.
  • 20.Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with numpy. Nature. 2020;585(7825):357–62. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Collette A. Python and HDF5. O’Reilly; 2013.
  • 22.Falcon W, Borovec J, Wälchli A, Eggert N, Schock J, Jordan J et al. PyTorchLightning/pytorch-lightning: 0.7.6 release. Zenodo. 2020.
  • 23.Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv. 2019.
  • 24.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O et al. Scikit-learn: Machine Learning in Python. arXiv. 2012.
  • 25.Akiba T, Sano S, Yanase T, Ohta T, Koyama M, Optuna. A Next-generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining—KDD ’19. New York: ACM Press; 2019. pp. 2623–31.
  • 26.Chamberlin D. SQL. In: Liu L, Özsu MT, editors. Encyclopedia of database systems. Boston: Springer US; 2009. pp. 2753–60. [Google Scholar]
  • 27.Raasveldt M, Mühleisen H. Duckdb: an embeddable analytical database. Proceedings of the 2019 International Conference on Management of Data. New York: ACM; 2019. pp. 1981–4.
  • 28.Kluyver T, Ragan-Kelley B, Pérez F, Granger B, Bussonnier M, Frederic J, et al. Jupyter Notebooks—a publishing format for reproducible computational workflows. In: Loizides F, Scmidt B, editors. Positioning and power in academic publishing: Players, agents and agendas. Netherlands: IOS; 2016. pp. 87–90. [Google Scholar]
  • 29.Waskom M. Seaborn: statistical data visualization. JOSS. 2021;6(60):3021. [Google Scholar]
  • 30.Hunter JD, Matplotlib: A 2D graphics environment. Comput Sci Eng. 2007;9(3):90–5. [Google Scholar]
  • 31.Salesforce. Tableau Public. Salesforce; 2024.
  • 32.Marco-Ramell A, Palau-Rodriguez M, Alay A, Tulipani S, Urpi-Sarda M, Sanchez-Pla A, et al. Evaluation and comparison of bioinformatic tools for the enrichment analysis of metabolomics data. BMC Bioinform. 2018;19(1):1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Lu Y, Pang Z, Xia J. Comprehensive investigation of pathway enrichment methods for functional interpretation of LC-MS global metabolomics data. Brief Bioinf. 2023;24(1):bbac553. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Wang L, Dash S, Ng CY, Maranas CD. A review of computational tools for design and reconstruction of metabolic pathways. Synth Syst Biotechnol. 2017;2(4):243–52. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Shah HA, Liu J, Yang Z, Feng J. Review of machine learning methods for the prediction and reconstruction of metabolic pathways. Front Mol Biosci. 2021;8:634141. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Huckvale ED, Moseley HNB. Chemical representation standardization needed to generalize metabolic pathway involvement prediction across the Kyoto Encyclopedia of Genes and Genomes, Reactome, and MetaCyc knowledgebases. BioRxiv. 2025.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Material 2 (643.7KB, csv)
Supplementary Material 3 (248KB, docx)

Data Availability Statement

All data and code for reproducing the results of this manuscript are available in the following Figshare item: 10.6084/m9.figshare.27317163.


Articles from BMC Bioinformatics are provided here courtesy of BMC

RESOURCES