Abstract
Metabolomics, the large-scale study of metabolites, has significant appeal as a source of information for metabolic modeling and other scientific applications. One common approach for measuring metabolomics data is gas chromatography-mass spectrometry (GC-MS). However, GC-MS metabolomics data are typically reported as relative abundances, precluding their use with approaches and tools where absolute concentrations are necessary. While chemical standards can be used to help provide quantification, their use is time-consuming, expensive, or even impossible due to their limited availability. The ability to infer absolute concentrations from GC-MS metabolomics data without chemical standards would have significant value. We hypothesized that when analyzing time-course metabolomics datasets, the mass balances of metabolism and other biological information could provide sufficient information towards inference of absolute concentrations. To demonstrate this, we developed and characterized MetaboPAC, a computational framework that uses two approaches—one based on kinetic equations and another using biological heuristics—to predict the most likely response factors that allow translation between relative abundances and absolute concentrations. When used to analyze noiseless synthetic data generated from multiple types of kinetic rate laws, MetaboPAC performs significantly better than negative control approaches when 20% of kinetic terms are known a priori. Under conditions of lower sampling frequency and high noise, MetaboPAC is still able to provide significant inference of concentrations in 3 of 4 models studied. This provides a starting point for leveraging biological knowledge to extract concentration information from time-course intracellular GC-MS metabolomics datasets, particularly for systems that are well-studied and have partially known kinetic structures.
Introduction
Metabolomics, the systems-scale study of the small-molecule intermediates of metabolism, has been a valuable tool in a wide variety of applications, ranging from elucidation of biological pathways to identification of disease biomarkers and drug targets [1, 2]. While genomics, proteomics, and transcriptomics provide an upstream view of cellular function, metabolomics is a direct readout of a system’s current metabolic state that provides unique insight into what is occurring in a biological system [3]. Metabolomics data also have the potential to be integrated into metabolic modeling frameworks to better understand how cellular systems function and how they respond to perturbations [4]. To enable broader development of such metabolic models, a significant amount of metabolomics data will be necessary.
Depending on the scope of a study and the metabolites of interest, metabolomics researchers commonly use a number of different analytical techniques with their own respective limitations, including nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS) techniques that are often coupled to gas chromatography (GC) or liquid chromatography (LC). While NMR is nondestructive [5] and quantification of metabolites is less difficult, it is significantly less sensitive than mass spectrometry techniques that can be used to measure hundreds or thousands of metabolites at low concentrations in a single sample [6]. The relative utility of MS varies with the accompanying chromatographic technique: LC-MS typically enables detection of a wider variety of metabolites without any sample derivatization [5, 7], while GC-MS is less expensive and its ionization method typically facilitates metabolite identification via extensive mass spectral libraries [5, 8] without requiring additional dimensions of mass fragmentation data (MS/MS).
One of the greatest weaknesses of the GC-MS approach, though, is quantification [9]: the metabolomics data these instruments yield are typically relative abundances and not absolute concentrations. Relative abundances still allow for some types of analysis, including principal component analysis (PCA) [10] and univariate statistical tests, as relative abundance measurements of the same analyte can be compared from sample to sample. However, relative abundances of different metabolites cannot be compared [11]: even if two metabolites have similar absolute concentrations, their relative abundances can be very different depending on their chemical properties such as ionization efficiency, derivatization, and degree of fragmentation [12, 13]. Similarly, peaks with comparable intensities do not necessarily have equal absolute concentrations. Other methods like LC-MS have even more confounders to quantification, such as ion suppression. This precludes the use of raw GC-MS or LC-MS metabolomics data in many computational tools used to study metabolism. For example, several metabolic modeling platforms, such as MetDFBA, TMFA, and LK-DFBA [14–16] can directly integrate metabolite data into their frameworks, but they all use absolute concentrations.
One approach to address this issue is to use chemical standards to quantify metabolites. However, these standards can be costly and time-consuming to use and are in fact unavailable for many metabolites [12, 17, 18]. Moreover, while using standards may be feasible for quantification of a few metabolites, for the purposes of untargeted metabolomics where one attempts to measure all metabolites [19], it quickly becomes infeasible. As a result of these challenges in quantification, untargeted metabolomics data often cannot be used with metabolic models [11], and are instead restricted to analysis with methods that are agnostic to whether the data are absolutely quantitative. A method for determining absolute concentrations at a metabolism-wide scale without the use of chemical standards would greatly expand the usability of metabolomics data in computational tools and would thus be quite beneficial to the metabolomics community.
Previous efforts to address this critical challenge have made some progress. Much of the related work has focused on LC-MS metabolomics data, linking relative abundance to absolute concentrations by predicting ionization efficiencies of different chemicals. It has been shown that intrinsic thermodynamic properties, electrokinetic properties, structural properties, and solvent properties are all key factors that contribute to the prediction of ionization efficiencies [20, 21]. Recently, Liigand et al. developed a method for predicting ionization efficiencies using random forest machine learning [22]. On the other hand, quantification of GC-MS metabolomics data still mostly relies on chemical or internal standards. For example, Tumanov et al. developed MetabQ, which allows quantification of amino and organic acids using isotope-coded derivatization without traditional calibration curves with chemical standards [23]. However, it relies on standards to be run and calibrated against the internal standard and is restricted to a specific subset of metabolites.
We hypothesized that for metabolic systems, additional biological information—rather than just chemical properties—could provide sufficient mathematical constraints to move towards inference of absolute concentrations from relative abundances. The calculation of response factors is defined by an underdetermined system of equations with many degrees of freedom, but including additional requirements or constraints inferred from biological principles could reduce the degrees of freedom in the mathematical system. For metabolomics data derived from intracellular analyses, there are many stoichiometric constraints and biological heuristics that describe the relationships between different metabolites. When coupled with time-course metabolomics data, these relationships serve as a promising starting point for the additional information needed to move toward inference of absolute concentrations.
Accordingly, we developed a new computational framework for inferring the most likely absolute concentrations from relative abundance GC-MS metabolomics data for cellular metabolism, which we have named Metabolomics Prediction of Absolute Concentrations (MetaboPAC). MetaboPAC attempts to avoid the need for chemical standards by leveraging the mass balances and other characteristics of a metabolic system and determining the most biologically likely metabolic profiles. To the best of our knowledge, this is the first computational platform for standard-free inference of absolute concentrations using metabolic mass balances rather than physicochemical properties of individual metabolites. The underlying principles established in the development of MetaboPAC could play a significant role in improving the ability to measure absolute concentrations in metabolomics and thus more readily integrate metabolomics data with metabolic modeling and other -omics data in the future.
Methods
Synthetic models
To assess MetaboPAC on different types of possible metabolic systems, two synthetic models were created. The first synthetic model (Figure S1A) contains four metabolites and five fluxes, where the initial influx is a known, constant reaction rate. With four metabolites and four unknown fluxes, the system is determined, which allows the fluxes to be solved trivially using Equation 1, where $d\mathbf{x}/dt$ is the vector of the change in concentration over time for each metabolite, $S$ is the stoichiometric matrix of the system, and $\mathbf{v}$ is the vector of fluxes. The fact that a determined system typically has a single unique flux profile solution makes this a particularly useful system for initial testing.
$$\frac{d\mathbf{x}}{dt} = S\,\mathbf{v} \tag{1}$$
The second synthetic model (Figure S1B) contains four metabolites and eight fluxes. Once again, the influx is assumed to have a known, constant reaction rate. Unlike the determined model, the second model contains more unknown fluxes than metabolites and is therefore underdetermined. Furthermore, two allosteric regulatory interactions are included, inhibiting flux v3 and promoting flux v8. Most biological systems are underdetermined and include metabolite-dependent regulatory interactions, making this model a more complex and more relevant test for MetaboPAC. Both the determined and underdetermined with regulation synthetic models were constructed using Michaelis-Menten kinetics to model each reaction, and noiseless time-course data were generated with these in silico models to be used for testing.
These synthetic models certainly have some limitations in terms of how accurately they represent the typical metabolic systems to be analyzed. They each include only a simple subnetwork rather than a larger, more complex network; however, we chose to use these reductionist models for smaller-scale, interpretable analysis of the potential and limitations of our approach. In addition, it is important to note that the availability of time-course metabolomics data for these models is assumed to be sufficient for analysis and that the absolute time scale for these models is not explicitly defined. In practice, acquisition of many samples over a very short period of time, such as a few minutes, would be experimentally impractical for most organisms or systems. Fortunately, many biological systems have culture time scales, and thus variations in metabolism, on the order of hours, so it is more reasonable to assume that sufficient time-course data can be acquired such that the use of simplified models like these is a reasonable first-order approximation.
Biological models
While synthetic models are pragmatic for initially developing and testing MetaboPAC, they do not resemble biological models sufficiently to allow generalization of initial results. Accordingly, models of Escherichia coli [24] and Saccharomyces cerevisiae [25] metabolism were also studied. Both of these systems are underdetermined and include numerous allosteric regulatory interactions, with the E. coli model containing 18 metabolites and 48 fluxes, and the S. cerevisiae model containing 22 metabolites and 24 fluxes (Figure S2). The kinetic reaction terms for both models include a mixture of Michaelis-Menten, Hill, and mass action kinetics, and noiseless time-course data were generated with these in silico models to be used for testing.
Response Factors
To emulate relative abundance data, 5 sets of response factors were generated for each metabolite in each of the four models. Response factors are the linear proportionality constants between relative abundances and absolute concentrations (Equation 2). While the assumption of a constant linear relationship between relative abundances and absolute concentrations may not always be true, experimental efforts to quantify metabolite levels typically make the same implicit assumption when using calibration curves, and it is generally a valid assumption over the linear dynamic range of an instrument for a particular analyte [26]. Metabolite responses are typically linear over two to four orders of magnitude [12, 27], making it reasonable to assume a linear relationship between relative abundances and absolute concentrations. Each response factor was randomly selected from a uniform distribution between 1 and 1000. These sets of response factors (RFT) were multiplied by the true absolute concentration values simulated by the kinetic models to calculate the relative abundances. The relative abundance data then serve as the input to the MetaboPAC algorithm. To infer absolute concentrations, relative abundances would be divided by the response factors predicted by MetaboPAC. The absolute concentrations for the systems used in this work ranged from 1e-4 mM to 20 mM, with changes over time courses typically spanning less than an order of magnitude.
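$$RA_i(t) = RF_i \cdot x_i(t) \tag{2}$$
where $RA_i(t)$ is the relative abundance, $RF_i$ is the response factor, and $x_i(t)$ is the absolute concentration of metabolite i at time t.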
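The emulation of relative abundances described above can be illustrated with a minimal sketch. It assumes only what the text states (response factors drawn uniformly between 1 and 1000, multiplied by the true concentrations, and later divided out); the function names, metabolite labels, and concentration values are hypothetical and are not taken from the MetaboPAC code.

```python
import random


def emulate_relative_abundances(concentrations, rng, low=1.0, high=1000.0):
    """Draw one response factor per metabolite and scale its time course.

    concentrations: dict mapping metabolite name -> list of absolute
    concentrations over time (hypothetical values, in mM).
    """
    response_factors = {m: rng.uniform(low, high) for m in concentrations}
    relative_abundances = {
        m: [response_factors[m] * c for c in series]
        for m, series in concentrations.items()
    }
    return response_factors, relative_abundances


def infer_concentrations(relative_abundances, response_factors):
    """Divide relative abundances by (predicted) response factors."""
    return {
        m: [ra / response_factors[m] for ra in series]
        for m, series in relative_abundances.items()
    }


rng = random.Random(0)
true_conc = {"A": [0.5, 0.8, 1.1], "B": [2.0, 1.6, 1.3]}  # illustrative only
rf_true, ra = emulate_relative_abundances(true_conc, rng)
recovered = infer_concentrations(ra, rf_true)
```

When the true response factors are used, the division recovers the original concentrations up to floating-point error; with predicted response factors, the error in each concentration scales directly with the error in the corresponding response factor.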
Noise-added data
To generate noisy data that more accurately represent experimental metabolomics data, two sampling frequencies and two coefficients of variation (CoV) for randomly added noise were used, for a total of four conditions. Sampling frequencies of 50 and 15 timepoints (nT) and CoVs of 0.05 and 0.15 were tested, where a higher CoV represents more noise (experimental error). (A higher CoV of 0.25 was also tested for the two synthetic models.) Starting from the data generated by the ODEs defining the systems, each concentration value in each metabolomics dataset was replaced with a random value drawn from a normal distribution, $N_{i,k} \sim \mathcal{N}(y_i(t_k),\, \mathrm{CoV} \cdot y_i(t_k))$, where $y_i(t_k)$ is the value of metabolite i at timepoint k and the second argument is the standard deviation. Noisy data were smoothed by fitting to an impulse function [28]. For our initial proof-of-concept work, it was assumed that there are no missing values for the measured metabolites. We briefly addressed the impact of missing values in the datasets using models of missingness that have been previously developed to better capture the characteristics of real experimental missing data [29].
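As a rough illustration of the noise model, the sketch below replaces each value with a draw from a normal distribution whose standard deviation is the CoV times the noiseless value. This reading of the noise definition is an assumption, the helper name `add_noise` is hypothetical, and the impulse-function smoothing step is not reproduced here.

```python
import random
import statistics


def add_noise(series, cov, rng):
    """Replace each value y with a draw from Normal(mean=y, sd=cov * y)."""
    return [rng.gauss(y, cov * y) for y in series]


rng = random.Random(1)
clean = [0.5, 0.8, 1.1, 1.3, 1.4]          # hypothetical noiseless time course
noisy = add_noise(clean, cov=0.15, rng=rng)

# Sanity check on many draws of a single value: the empirical CoV should be
# close to the requested one.
draws = add_noise([2.0] * 10000, cov=0.15, rng=rng)
empirical_cov = statistics.stdev(draws) / statistics.mean(draws)
```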
MetaboPAC approaches for estimating response factors
Kinetic Equation Approach
MetaboPAC uses a combination of time-course metabolomics data with biological and biochemical constraints to enable estimation of response factors. Based on mass balances, the rate of change in concentration of a metabolite must always equal the sum of stoichiometrically balanced influxes and effluxes of the metabolite (Equation 1). When the kinetics of the reaction fluxes are known, each influx and efflux can be represented by a mathematical term containing kinetic parameters and the concentration of the metabolites that participate in the reaction, either as substrates or allosteric regulators. The concentrations of the metabolites can be replaced by their respective relative abundances divided by their response factors (Equation 2). Across the time-course data that are used for the basis of this analysis, the response factors are expected to remain constant for each metabolite. Together, the mass balances at each timepoint create a system of non-linear equations. A non-linear least-squares solver is used to determine the set of response factors that minimizes mass balance violations; however, the optimization landscape is complex, with many local optima. Accordingly, the system of non-linear equations is solved 48 times (chosen in initial development based on the maximum number of local workers—12 workers, each used 4 times—when performing parallel computations) with different initial seeds selected from a uniform distribution; the medians of the predicted response factors for each metabolite are calculated at the conclusion of all the runs and are considered to be the most likely set of response factors. We refer to this procedure as the kinetic equation approach.
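The idea behind the kinetic equation approach can be sketched on a toy system. Everything below is an illustrative assumption rather than the authors' implementation: a single metabolite with a known constant influx and a Michaelis-Menten efflux with known parameters, a central-difference estimate of the rate of change, and a simple step-halving local search that stands in for the non-linear least-squares solver. The multi-start loop (48 seeds) and the median over runs follow the description in the text.

```python
import math
import random
import statistics

V_IN, VMAX, KM = 1.0, 2.0, 1.0   # hypothetical known kinetic parameters
DT = 0.1                          # hypothetical sampling interval


def dxdt(x):
    """Mass balance: known influx minus Michaelis-Menten efflux."""
    return V_IN - VMAX * x / (KM + x)


def simulate(x0, n_steps):
    """Fourth-order Runge-Kutta integration of the toy model."""
    xs = [x0]
    for _ in range(n_steps):
        x = xs[-1]
        k1 = dxdt(x)
        k2 = dxdt(x + 0.5 * DT * k1)
        k3 = dxdt(x + 0.5 * DT * k2)
        k4 = dxdt(x + DT * k3)
        xs.append(x + DT * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0)
    return xs


def mass_balance_cost(log10_rf, ra):
    """Sum of squared mass-balance violations for a candidate response factor."""
    rf = 10.0 ** log10_rf
    cost = 0.0
    for k in range(1, len(ra) - 1):
        x_k = ra[k] / rf                                   # putative concentration
        dx_k = (ra[k + 1] - ra[k - 1]) / (2.0 * DT) / rf   # putative rate of change
        cost += (dx_k - dxdt(x_k)) ** 2
    return cost


def local_search(p, ra, lo=0.0, hi=3.0):
    """Step-halving descent on log10(RF); stand-in for a least-squares solver."""
    step, best = 0.5, mass_balance_cost(p, ra)
    while step > 1e-6:
        moved = False
        for cand in (p - step, p + step):
            cand = min(hi, max(lo, cand))
            c = mass_balance_cost(cand, ra)
            if c < best:
                p, best, moved = cand, c, True
                break
        if not moved:
            step /= 2.0
    return p


def estimate_response_factor(ra, n_starts=48, seed=0):
    """Run the local search from random seeds and take the median estimate."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_starts):
        start = math.log10(rng.uniform(1.0, 1000.0))
        fits.append(10.0 ** local_search(start, ra))
    return statistics.median(fits)


rf_true = 250.0
ra = [rf_true * x for x in simulate(0.1, 50)]   # emulated relative abundances
rf_est = estimate_response_factor(ra)
```

In this toy case the known influx fixes the absolute scale, which is why a single response factor is identifiable; the small remaining error comes from the finite-difference approximation of the rate of change.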
Optimization approach using Heuristic Penalties
When the kinetics of the reaction fluxes are not known, additional information is needed to estimate the response factors. Unknown kinetic flux terms must instead be replaced with fluxes inferred (see “Solving for flux distributions” below) using putative absolute concentrations calculated from putative response factors, but the removal of the kinetic terms introduces more degrees of freedom to the system that must be counteracted to allow inference of response factors and concentrations. Several characteristics that are typical or uncommon in true metabolic systems were identified [30], and objective function penalties based on those characteristics were created to help eliminate sets of response factors that lead to biologically unlikely absolute concentrations (Table S1). For example, if a metabolite is the sole substrate of an enzyme, the reaction rate is generally expected to increase as the concentration of the metabolite increases. A set of candidate response factors and its corresponding concentration profiles may result in putative flux profiles that do not obey this biological heuristic, in which case the set of response factors would be heavily penalized. These penalties are implemented in the objective function of the optimization problem to serve as soft constraints, and the penalty weight was optimized for each model using no a priori information on response factors (See Supplementary Methods ‘Penalty weight optimization’). As in the kinetic equation approach, the optimization problem is performed 48 times with different initial seeds, and the medians of the predicted response factors from all the runs are calculated to determine the most likely set of response factors. We refer to this procedure using heuristic penalties as soft constraints as the optimization approach, though it is worth noting that there is still optimization involved in the kinetic equation approach described above.
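One way to picture a heuristic penalty is the sole-substrate example given above. The sketch below is hypothetical: the actual penalties are defined in Table S1, which is not reproduced here, and the function name, the counting rule, and the weight are assumptions made only for illustration. It counts timestep pairs in which the putative substrate concentration and the putative flux move in opposite directions.

```python
def monotonicity_penalty(concentration, flux, weight=1.0):
    """Penalize timesteps where a sole substrate and its flux change in opposite directions.

    concentration: putative absolute concentrations over time
    flux: putative fluxes of the reaction over the same timepoints
    """
    violations = 0
    for k in range(1, len(concentration)):
        d_conc = concentration[k] - concentration[k - 1]
        d_flux = flux[k] - flux[k - 1]
        if d_conc * d_flux < 0:      # opposite signs: heuristic violated
            violations += 1
    return weight * violations


# Flux rises with substrate: no penalty.
ok = monotonicity_penalty([1.0, 2.0, 3.0], [0.5, 0.8, 0.9])
# Flux falls while substrate rises: penalized at both steps.
bad = monotonicity_penalty([1.0, 2.0, 3.0], [0.9, 0.8, 0.5], weight=10.0)
```

In an optimization setting, a term like this would be added to the objective so that candidate response factors producing such flux profiles score worse without being excluded outright, which is what makes it a soft constraint.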
Overall workflow
The kinetic equation approach and the optimization approach are then combined to establish the MetaboPAC workflow for analysis of time-course intracellular metabolomics data, as shown in Figure 1. First, the kinetic equation approach is used for the metabolite mass balance equations where all the kinetic terms of the influxes and effluxes of that metabolite are known. If only some kinetic terms of the fluxes in the mass balance of a metabolite are known, the response factor of the metabolite cannot be predicted in the kinetic equation approach; this is common when only a small percentage of the kinetic structure of the model is known. Only the response factors associated with the metabolites present in the useable mass balances can be identified in this step. After predicting all the possible response factors using the kinetic equation approach, the optimization approach proceeds as described above, except the response factors that have already been identified are fixed within the optimization problem and the remaining response factors are predicted. Compared to the optimization approach, this combined approach reduces the degrees of freedom for the optimization problem by setting the predicted response factors from the kinetic equation approach as fixed rather than variables. The optimization approach and the combined approach are studied in the context of varying degrees of known kinetics. The number of known kinetic terms for fluxes is the percentage of known kinetic terms multiplied by the total number of kinetic terms, rounded up to the next integer.
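The bookkeeping in this workflow can be sketched as follows: the number of known kinetic terms is the percentage times the total number of terms, rounded up, and a mass balance is useable in the kinetic equation step only if every flux appearing in it has known kinetics. The function names and the three-metabolite linear pathway are hypothetical and serve only as an example.

```python
def n_known_terms(percent_known, n_terms):
    """Ceiling of percent_known% of n_terms, using integer arithmetic."""
    return -(-percent_known * n_terms // 100)


def usable_mass_balances(stoich, known_fluxes):
    """Indices of metabolites whose mass balance involves only known-kinetics fluxes.

    stoich: list of rows (one per metabolite) of stoichiometric coefficients
    known_fluxes: set of flux indices with known kinetic terms
    """
    usable = []
    for i, row in enumerate(stoich):
        involved = {j for j, coeff in enumerate(row) if coeff != 0}
        if involved and involved <= known_fluxes:
            usable.append(i)
    return usable


# Hypothetical linear pathway: v0 -> A -> v1 -> B -> v2 -> C -> v3
stoich = [
    [1, -1, 0, 0],   # A
    [0, 1, -1, 0],   # B
    [0, 0, 1, -1],   # C
]
first_step = usable_mass_balances(stoich, {0, 1})       # only A is fully known
more_known = usable_mass_balances(stoich, {0, 1, 2})    # A and B
```

Metabolites outside the returned list would be left to the optimization step, with the response factors from the useable balances held fixed.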
Figure 1. MetaboPAC workflow for inferring absolute concentrations from relative abundances in time-course intracellular metabolomics datasets.

In the kinetic equation approach, the mass balances at each timepoint are used to create a system of non-linear equations where the response factors in the useable mass balances are predicted. These initial predicted response factors are transferred and fixed in the optimization approach, where penalties are used to eliminate biologically unlikely sets of the remaining response factors. The final predicted response factors are used to infer the absolute concentrations of the data. RAi is the relative abundance and RFi is the unknown response factor of the ith metabolite, tn is a particular timepoint in the data, Sxi is the stoichiometric mass balance coefficients of the ith metabolite, and vtn is a vector of the fluxes at timepoint tn. The kinetic terms (if known) of vtn also contain relative abundances and response factors and are as shown in the inset (Michaelis-Menten kinetics are used as an example, where Vmaxf and KMf are kinetic parameters of the fth flux).
Solving for flux distributions
In the optimization approach, the flux profiles of the reactions in the model are used to calculate some penalties that describe the relationship between inferred absolute concentrations and the reactions they control. Because fluxomics data are not assumed to be available, the fluxes must be inferred by solving Equation 1. The rate of change is determined by calculating the difference in relative abundance between two time points and dividing by the time difference. This rate of change is divided by the corresponding response factor to infer the rate of change of absolute concentration for each metabolite. While the fluxes of a determined system can be trivially calculated, underdetermined systems have an infinite number of flux solutions, making this inverse problem challenging. The Moore-Penrose pseudoinverse can potentially be used to minimize the norm of the flux solution, though its flux estimation can be inaccurate. However, if the kinetic equation terms or the values of some of the fluxes are known, they can be used to create a more constrained system that could possibly be determined or even overdetermined, which would allow a unique and accurate flux solution to be found. For all underdetermined systems used in this study, we assumed that enough kinetic terms were known to yield a determined system such that fluxes could be estimated accurately (See Supplementary Methods ‘Accurate flux estimation assumption for underdetermined system’). Significant progress has been made toward the development of a complete algorithm for accurate flux estimation based on time-series metabolomics data [31, 32], suggesting this assumption may be easily satisfied in the future; alternatively, measurement of metabolic fluxes could also complement the flux estimation [33].
The results of MetaboPAC on noiseless data for the 2 biological models without assuming sufficient kinetics to ensure a determined system for accurate flux inference are shown in Figure S3; performance is often still significantly better than the baseline comparators, though lower than when accurate flux inference is assumed.
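The flux inference step can be illustrated with a small numerical sketch. It assumes NumPy is available and uses a hypothetical two-metabolite, three-flux network with made-up values; it is not the authors' code. It shows the finite-difference rate of change, the division by response factors, the minimum-norm pseudoinverse solution for the underdetermined system, and how fixing one known flux makes the system determined.

```python
import numpy as np

# Hypothetical network: v0 -> A, A -> B (v1), A -> out (v2)
S = np.array([[1.0, -1.0, -1.0],
              [0.0,  1.0,  0.0]])
v_true = np.array([3.0, 1.0, 2.0])      # illustrative constant fluxes
rf = np.array([200.0, 50.0])            # putative response factors
dt = 0.5

# Relative abundances at two consecutive timepoints.
x_t0 = np.array([1.0, 2.0])
x_t1 = x_t0 + dt * (S @ v_true)
ra_t0, ra_t1 = rf * x_t0, rf * x_t1

# Finite-difference rate of change of relative abundance, then
# division by response factors to get rates in concentration units.
dra_dt = (ra_t1 - ra_t0) / dt
dx_dt = dra_dt / rf

# Underdetermined system: the pseudoinverse returns the minimum-norm
# flux vector, which satisfies the mass balances but need not be correct.
v_min_norm = np.linalg.pinv(S) @ dx_dt

# If v0 is known, move it to the right-hand side; the remaining
# 2x2 system is determined and has a unique solution.
known, unknown = [0], [1, 2]
rhs = dx_dt - S[:, known] @ v_true[known]
v_unknown = np.linalg.solve(S[:, unknown], rhs)
```

Here the minimum-norm solution reproduces the observed rates of change but differs from the fluxes used to generate the data, whereas the constrained system recovers the two unknown fluxes.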
Evaluation metrics and comparator methods
To measure the performance of MetaboPAC, the relative difference between the true and predicted values of the response factors using a logarithmic scale was calculated and binned based on whether it was within a range of log2(1.1), log2(1.3), and log2(1.5) error, as shown in Equation 3, where RFT is the true response factor, RFP is the predicted response factor, and x is the value that determines the log2 error range (i.e. 1.1, 1.3, or 1.5). A logarithmic scale was chosen because using absolute percent error led to large error ranges and less meaningful interpretation (e.g., 100% error for a response factor of 500 would cover the entire space of response factors from 0 to 1000, whereas using a logarithmic scale of two-fold error would cover a range from only 250 to 1000).
Since there exists no other published approach to accomplish the task of inference of response factors for arbitrary metabolites, we identified two baseline comparators for performance. The first method randomly predicts response factors using a uniform distribution between 1 and 1000 for each metabolite. The second method uses a response factor of 500 for each metabolite, as predicted response factors close to the middle of the search space will have the greatest chance of being contained within the error range of true response factors drawn from a uniform distribution (maximizing the number of predictions that are within a factor of 2 of the real values).
$$\left|\log_2\!\left(\frac{RF_P}{RF_T}\right)\right| \le \log_2(x) \tag{3}$$
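The evaluation metric and the two baseline comparators can be sketched as below. The sketch assumes the binning criterion is that the absolute log2 ratio of predicted to true response factor is at most log2(x), as the text describes; the function names and example values are hypothetical.

```python
import math
import random


def within_log2_error(rf_true, rf_pred, x):
    """True if the prediction is within a factor of x of the true value."""
    return abs(math.log2(rf_pred / rf_true)) <= math.log2(x)


def percent_within(rf_true, rf_pred, x):
    """Percentage of response factors predicted within the log2(x) error range."""
    hits = sum(within_log2_error(t, p, x) for t, p in zip(rf_true, rf_pred))
    return 100.0 * hits / len(rf_true)


def random_baseline(n, rng):
    """Comparator 1: uniform random response factors between 1 and 1000."""
    return [rng.uniform(1.0, 1000.0) for _ in range(n)]


def constant_baseline(n, value=500.0):
    """Comparator 2: a response factor of 500 for every metabolite."""
    return [value] * n


rf_true = [100.0, 400.0, 900.0]          # illustrative values
rf_pred = [105.0, 500.0, 500.0]
scores = {x: percent_within(rf_true, rf_pred, x) for x in (1.1, 1.3, 1.5)}
baseline_500 = percent_within(rf_true, constant_baseline(3), 1.5)
baseline_random = percent_within(rf_true, random_baseline(3, random.Random(0)), 1.5)
```

Because the criterion is symmetric on a log scale, over- and under-prediction by the same fold change are treated equally.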
Results
MetaboPAC performance on noiseless data
To test our hypothesis that additional biological information could provide constraints to move towards inference of absolute concentrations from relative abundances for metabolic systems, we first began with a simplified case that would be applicable to well-characterized systems. If the kinetic rate law of each reaction in the model is known, then the mass balances of the model and the dynamic nature of time-course metabolomics data can be leveraged to move towards identifying the response factors necessary to infer absolute concentrations from time-course intracellular metabolomics data; we refer to this as the “kinetic equation approach”, which is described in more detail in “MetaboPAC approaches for estimating response factors” in the Methods. The performance of the kinetic equation approach was assessed on 2 synthetic models and 2 biological models (Figure 2). For all models, the kinetic equation approach showed excellent performance on noiseless data. For the 2 synthetic models, 100% of the response factors are predicted within a log2(1.1) error. For the 2 biological models, 100% of the response factors are predicted within a log2(1.3) error and over 90% are predicted within a log2(1.1) error. The slightly worse performance in biological models is due to discretization error in calculation of fluxes from metabolite measurements; when this discretization error is eliminated, all response factors can be predicted accurately (Figure S4).
Figure 2. Kinetic equation approach performance on noiseless data with known kinetics.

Bars indicate the percentage of response factors predicted within the error ranges of log2(1.1), log2(1.3), and log2(1.5). Error bars represent the standard error of the mean (n = 5 for different sets of true response factors).
Although the success of the kinetic equation approach shows proof-of-principle that leveraging mass balances and time-course intracellular metabolomics data can lead to absolute concentrations from relative abundances, it is much more common not to know all kinetic terms and parameters in a metabolic pathway. Accordingly, we developed the optimization approach and studied the performance with varying degrees of known kinetic information, where unknown kinetic terms were replaced with fluxes inferred from the putative concentration profiles, which are in turn dependent on the putative response factors. The optimization approach performs quite well on noiseless data in the four tested models (Figures S5 and S6).
While the optimization approach gives significantly better results than the negative control comparator methods, we predicted that by combining these two approaches and determining some response factors first using the kinetic equation approach and then the rest using the optimization approach, the overall prediction accuracy as well as the prediction accuracy of the optimization approach step itself may be further improved. However, only the first of these predictions was seen to be true.
Overall, the performance of the combined approach is improved compared to the optimization approach for the two synthetic models (Figure 3) when higher percentages of kinetic terms are known. At lower percentages of known kinetic terms there is not a significant improvement, likely due to the less accurate prediction at the lower percentages by the kinetic equation approach, which passes incorrect information into the optimization approach (Figures S7 and S8). The combined approach was able to predict 50% of the response factors for the determined model and 83% for the underdetermined with regulation model within the log2(1.3) error range when only 20% of the kinetic equation terms are known. When 80% of the kinetic terms are known, which is possible for well-studied pathways, over 90% of the response factors can be predicted within the log2(1.3) error range.
Figure 3. Performance of the combined approach on noiseless data for synthetic models.

The combined approach is compared to random response factors and response factors of 500 for the A. determined and B. underdetermined with regulation models using error ranges of log2(1.1), log2(1.3), and log2(1.5). Lines represent the mean percent of predicted response factors within the error ranges for each method. Error bars represent the standard error of the mean (n = 50 for different sets of true response factors and different sets of known kinetic equation terms for 20–80% known kinetics, n=5 for different sets of true response factors for 0% and 100%). Asterisks denote when the optimization approach performed significantly better at predicting response factors than both of the other two comparator control methods (two-sample t-test with α = 0.05).
For the two biological models, a similar trend is evident where the combined approach improves overall performance when a higher percentage of kinetic terms is known and does not when a lower percentage of kinetic terms is known (Figure 4). Again, the response factors predicted from the kinetic equation approach are less accurate when fewer kinetic terms are known compared to when more are known (Figures S9 and S10). For the S. cerevisiae model, we believe that the combined approach performs worse than the optimization approach when fewer than 80% of kinetic terms are known because of the bias introduced by low prediction accuracy in the kinetic equation approach (Figure S9).
Figure 4. Performance of the combined approach on noiseless data for biological models.

The combined approach is compared to random response factors and response factors of 500 for the A. S. cerevisiae and B. E. coli models using error ranges of log2(1.1), log2(1.3), and log2(1.5). Lines represent the mean percent of predicted response factors within the error ranges for each method. Error bars represent the standard error of the mean (n = 50 for different sets of true response factors and different sets of known kinetic equation terms for 20–80% known kinetics, n=5 for different sets of true response factors for 0% and 100%). Asterisks denote when the optimization approach performed significantly better at predicting response factors than both of the other two methods (two-sample t-test with α = 0.05).
Although the combined approach is less accurate on the two biological models than on the two synthetic models, it is able to predict 52% of the response factors for the S. cerevisiae model and 77% of the response factors for the E. coli model within the log2(1.3) error range when at least 80% of the kinetic equation terms are known. This indicates that the combined approach can provide valuable insights into response factors when the model has well-characterized kinetics.
MetaboPAC performance on noisy data
While noiseless data provide a good benchmark for the performance of MetaboPAC under ideal conditions, real experimental metabolomics data will have some degree of noise. To test the robustness of MetaboPAC under more realistic conditions, we assessed MetaboPAC on datasets with different sampling frequencies (nT = 50 or 15) and different amounts of added noise (CoV = 0.05 or 0.15).
In the synthetic models, MetaboPAC was significantly better than both the random and 500 response factor approaches for almost all log2 error ranges when 80 to 100% of kinetic equation terms were known under the different sampling frequency and noise conditions (Figure 5 shows log2(1.3); Figures S11 and S12 show all error ranges). MetaboPAC exhibited varying degrees of robustness against noise and sampling frequency for the different models, particularly at low percentages of known kinetic terms. Specifically, MetaboPAC performance on the determined model is robust to the decrease in sampling rate but sensitive to the addition of noise (Figure S13). On the other hand, MetaboPAC performance on the underdetermined with regulation model is more robust to high noise than to a decrease in sampling frequency. The decrease in performance at 40% and 60% additional known kinetic equation terms at low sampling rate appears to be due to a decrease in prediction accuracy of the kinetic equation approach in those conditions (Figure S14). By including potentially inaccurate response factors from an underperforming kinetic equation approach in the optimization approach, the accuracy of the optimization approach decreases.
Figure 5. MetaboPAC performance on various conditions of noisy data for the synthetic models.

MetaboPAC is compared to random response factors and response factors of 500 for the determined and underdetermined with regulation models using an error range of log2(1.3) on data with different sampling frequencies (nT = 50 or 15) and noise added (CoV = 0.05 or 0.15). Lines represent the mean percent of predicted response factors within the error ranges for each method. Error bars represent the standard error of the mean (n = 9 for 3 different sets of true response factors and 3 replicates of noisy data for 0% and 100% known kinetic equation terms, n = 27 for 3 different sets of true response factors, 3 different subsets of known kinetic equation terms, and 3 replicates of noisy data for the rest). Asterisks denote when MetaboPAC performed significantly better at predicting response factors than both control comparator methods (two-sample t-test with α = 0.05).
The biological models yielded different trends (Figures 6 and S15–S18). For example, MetaboPAC performance on the S. cerevisiae model is robust to both noise and sampling frequency. MetaboPAC performance on the E. coli model, however, suffers significantly from both the addition of noise and low sampling frequency: when evaluated within the log2(1.3) and log2(1.5) error ranges, it outperforms the negative controls only when 100% of kinetic equation terms are known (Figure S16). The underlying reason that these models differ in their robustness to noise and sampling frequency remains to be explored.
Figure 6. MetaboPAC performance on all conditions of noisy data for the biological models.

MetaboPAC is compared to random response factors and response factors of 500 for the E. coli and S. cerevisiae models using an error range of log2(1.3) on data with different sampling frequencies (nT = 50 or 15) and noise added (CoV = 0.05 or 0.15). Lines represent the mean percent of predicted response factors within the error ranges for each method. Error bars represent the standard error of the mean (n = 9 for 3 different sets of true response factors and 3 replicates of noisy data for 0% and 100% known kinetic equation terms, n = 27 for 3 different sets of true response factors, 3 different subsets of known kinetic equation terms, and 3 replicates of noisy data for the rest). Asterisks denote when MetaboPAC performed significantly better at predicting response factors than both of the other two methods (two-sample t-test with α = 0.05).
Discussion
The ability to measure absolute metabolite concentrations via metabolomics without extensive use of chemical standards could be broadly impactful. This significant potential value has motivated numerous previous efforts in the field, including the use of specific experimental reagents and the use of machine learning techniques to predict physicochemical properties of metabolites. However, achieving this ability remains an open challenge.
Here, we have explored a new approach to inferring absolute concentrations from relative abundance data: leveraging mass balances and other biological insight to analyze time-course intracellular data. Mass balances are widely used as the basis for important techniques and concepts in fields including metabolic modeling and metabolic engineering. Mass balances have also been used to determine quenching leakage in metabolomics sample processing protocols [34]. We hypothesized that for intracellular metabolomics data, where all metabolic reactions happen in the context of a typically well-defined metabolic network, the stoichiometric balances of that network would apply and could serve as the basis for additional information necessary to allow inference of absolute concentrations from relative abundance data. To the best of our knowledge, this is the first time mass balances have been used in the context of improving quantitative metabolomics data analysis.
The results presented in this manuscript provide evidence that, in principle, these approaches could be useful for improving the data acquired in metabolomics experiments. When combined with time-course metabolomics analysis of intracellular samples, mass balances place additional constraints on the relationships between relative abundances of different metabolites, decreasing the number of degrees of freedom for the response factors for a given set of metabolites. Additional quantitative relationships between metabolite concentration and flux profiles provide further features to distinguish between biologically reasonable and unlikely sets of response factors. Taken together, this yields statistically significant inference of absolute metabolite concentrations from just relative abundances and a priori biological knowledge. Nonetheless, there remains significant room for improvement in the concentration predictions provided by this computational framework, as different metabolic networks, different aspects of data quality, and different error tolerances can lead to varying levels of prediction, from 40% to nearly 100% accuracy.
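The way mass balances reduce the degrees of freedom can be written out explicitly. The following is a sketch under stated assumptions rather than the exact formulation used by MetaboPAC: it assumes that each relative abundance $a_i(t)$ equals a constant response factor $RF_i$ times the absolute concentration $c_i(t)$, and the symbols $S$ (stoichiometric matrix), $v_j$ (fluxes), and $f_j$ (a known kinetic rate law with parameters $\theta_j$) are notation introduced here for illustration.

```latex
\begin{align}
a_i(t) &= RF_i \, c_i(t)
  && \text{(assumed definition of the response factor)} \\
\frac{dc_i}{dt} &= \sum_j S_{ij}\, v_j(t)
  && \text{(mass balance on metabolite } i\text{)} \\
\frac{1}{RF_i}\,\frac{da_i}{dt} &= \sum_j S_{ij}\, v_j(t)
  && \text{(mass balance in terms of measured abundances)} \\
v_j(t) &= f_j\!\left(\frac{a_1(t)}{RF_1}, \ldots, \frac{a_n(t)}{RF_n};\, \theta_j\right)
  && \text{(flux with a known rate law)}
\end{align}
```

Under these assumptions, every time point contributes one equation per metabolite that couples the unknown response factors, and each flux with a known rate law adds a further relationship. This is consistent with the observation that knowing more kinetic terms narrows the set of response factors compatible with the data.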
While we envision the longer-term application of future improved versions of this framework to be in the context of quantifying all metabolites in a metabolomics sample, it could still have significant immediate value under certain conditions. For example, while metabolomics scientists may try to measure all analytes, there is often significant focus on and interest in a subset of metabolic pathways (e.g., central carbon metabolism). The reactions in these pathways, especially for model species like E. coli and S. cerevisiae, are extremely well-studied, and are likely to have kinetic reaction terms defined and parameterized for them. In these cases, the kinetic equation step of MetaboPAC could allow inference of absolute concentrations of this subset of metabolites, which could have substantial value for modelers or for general analysis of metabolic behavior.
One of the strengths of MetaboPAC is that additional information can be easily integrated into the framework to reduce the number of possible sets of response factors and further improve the optimization approach. If the minimum or maximum possible or predicted concentrations of each (or a few) metabolites are known, this can greatly reduce the search space of possible sets of response factors. Other a priori knowledge can also be integrated into the framework such as thermodynamic constraints to further constrain the variable space. During the development of the optimization approach, there were cases where multiple sets of response factors had similar or even smaller penalty values than the true response factors, suggesting that the penalties need to be refined or additional penalties that reflect biological feasibility need to be developed.
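As an illustration of how concentration bounds shrink the search space, the sketch below converts a known concentration range for one metabolite into bounds on its response factor. It assumes that relative abundance equals a constant response factor times absolute concentration; the function name and the numbers are hypothetical and not part of MetaboPAC.

```python
def response_factor_bounds(abundances, c_min, c_max):
    """Bounds on a response factor (RF) implied by concentration bounds.

    Assumption: abundance = RF * concentration with a constant RF, so
    c_min <= abundance / RF <= c_max must hold at every time point.
    Returns (rf_low, rf_high); the range is empty if rf_low > rf_high.
    """
    if not 0 < c_min <= c_max:
        raise ValueError("require 0 < c_min <= c_max")
    if any(a <= 0 for a in abundances):
        raise ValueError("abundances must be positive")
    rf_low = max(abundances) / c_max   # largest abundance must not exceed c_max
    rf_high = min(abundances) / c_min  # smallest abundance must not fall below c_min
    return rf_low, rf_high

# Hypothetical time course of relative abundances for one metabolite,
# with the concentration assumed to lie between 0.5 and 5 (arbitrary units)
abundances = [200.0, 500.0, 1000.0]
rf_low, rf_high = response_factor_bounds(abundances, c_min=0.5, c_max=5.0)
print(rf_low, rf_high)  # 200.0 400.0
```

The wider the dynamic range of the measured time course relative to the allowed concentration range, the tighter the resulting interval becomes.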
Along with incorporating additional information directly into the framework, MetaboPAC could also be used concurrently with other metabolomics data analysis methods to improve the accuracy of response factor predictions. The use of methods to predict ionization efficiencies [20–22] in conjunction with MetaboPAC could provide further insight about which response factors are most biologically and biochemically reasonable. Alternatively, if minimum or maximum values of individual fluxes are known, this could also reduce the number of possible flux solutions and benefit the calculation of penalties in the optimization approach.
As already discussed, there are limitations to the utility and accuracy of the proposed framework. Perhaps most obviously, since MetaboPAC assumes that all of the metabolites in a sample are directly connected via chemical reactions (whether via mass balances or other characteristic features), it can only be used for intracellular metabolomics samples in a single compartment. For example, MetaboPAC could not infer the absolute concentrations of metabolites in a blood sample because blood metabolite profiles are determined from metabolic contributions from across organs or systems in an organism with no stoichiometric basis to connect concentrations [35]. Further, MetaboPAC does not take into consideration the transport reactions across different intracellular compartments in eukaryotic systems. Additionally, the response factors were assumed to be constant with concentration, but this assumption will not always hold, whether due to the linear dynamic range of an analyte or instrumentation challenges like ion suppression. Considering possible nonlinearities may be important to improving MetaboPAC predictions, though it will not be evident which specific metabolites would need such allowances, and it would undoubtedly be computationally challenging. Finally, it was assumed that fluxes could be reasonably calculated based on time-series metabolomics data using existing and in-development computational approaches [31, 32]; the accuracy of MetaboPAC inferences will be limited by the accuracy of those approaches.
One other limitation worth considering is that data quality, whether noise levels or temporal frequency, can have a significant impact on MetaboPAC’s accuracy. Since a key principle of MetaboPAC is its reliance on mass balances, high noise levels in the data can lead to violations of those mass balances and thus complicate the assessment of the biological feasibility of a set of response factors. Issues like variability in sample preparation that could add time-specific, metabolome-wide variability to measured levels could further complicate these calculations, though with appropriate experimental design via randomization of samples across batches, these effects should instead manifest as an apparent increase in random, rather than systematic, noise. Nonetheless, even under the noisiest condition analyzed for all networks (CoV = 0.15), MetaboPAC generally still performs significantly better than baseline comparators for predicting response factors when all kinetic equation terms are known, indicating utility for the framework at least in certain circumstances. A noisier condition was also analyzed for two networks as shown in Figure S19, with the performance on the determined pathway deteriorating even further but the performance on the underdetermined with regulation pathway remaining somewhat robust. Here, we have used an impulse function fitting approach to smooth the noisy data, which was useful, but other approaches to mitigating the effects of noise, such as filtering, normalization, scaling, other smoothing methods [36–38], or using triplicate samples, should be considered. Other aspects of data quality, like the fraction of missing values in the dataset, can also have negative impacts on MetaboPAC performance. We found that if 10% of the values in a dataset are missing and subsequently estimated using a missing value imputation method, the inferred concentrations are less accurate but qualitatively similar in behavior to those obtained when no values are missing (Figure S20).
Conclusion
While analytical techniques like mass spectrometry may not be able to yield absolute concentrations on a metabolome-wide scale without significant and often infeasible experimental efforts, there is a priori biological information and insight that has to date not been exploited to help address this issue. For intracellular metabolomics data, mass balances and other relationships between metabolites and fluxes can help provide constraints that limit the space of feasible response factors that relate relative abundances to absolute concentrations. Here we establish that exploiting this information can lead to statistically significant inference of absolute concentrations of metabolites without the use of chemical standards. Under certain circumstances, the accuracy of this inference can be extremely high, though there are significant current limitations that remain to be overcome. Nonetheless, this proof of principle that relative abundances could someday be directly transformed into absolute concentrations is a promising development in metabolomics data analysis that warrants further investigation and algorithm development.
Supplementary Material
Acknowledgments
The authors acknowledge financial support from the National Institutes of Health (R35GM119701).
Footnotes
Conflicts of Interest
There are no conflicts of interest to declare.
Availability
The source code for MetaboPAC is available at https://github.com/gtStyLab/MetaboPAC_v2
References
1. Chen L, Zhong F, and Zhu J, Bridging Targeted and Untargeted Mass Spectrometry-Based Metabolomics via Hybrid Approaches. Metabolites, 2020. 10(9).
2. Wishart DS, Applications of metabolomics in drug discovery and development. Drugs R D, 2008. 9(5): p. 307–22.
3. Reiekeberg E and Powers R, New frontiers in metabolomics: from measurement to insight. F1000Res., 2017. 6(1148).
4. Volkova S, et al., Metabolic Modelling as a Framework for Metabolomics Data Integration and Analysis. Metabolites, 2020. 10(8).
5. Emwas AH, The strengths and weaknesses of NMR spectroscopy and mass spectrometry with particular focus on metabolomics research. Methods Mol Biol, 2015. 1277: p. 161–93.
6. Emwas AH, et al., NMR Spectroscopy for Metabolomics Research. Metabolites, 2019. 9(7).
7. Wishart DS, Emerging applications of metabolomics in drug discovery and precision medicine. Nat Rev Drug Discov, 2016. 15(7): p. 473–84.
8. Fiehn O, Metabolomics by Gas Chromatography-Mass Spectrometry: the combination of targeted and untargeted profiling. Curr Protoc Mol Biol., 2016. 114.
9. Veenstra TD, Metabolomics: the final frontier? Genome Med, 2012. 4(4): p. 40.
10. Worley B and Powers R, Multivariate Analysis in Metabolomics. Curr Metabolomics, 2013. 1(1): p. 92–107.
11. Kapoore RV and Vaidyanathan S, Towards quantitative mass spectrometry-based metabolomics in microbial and mammalian systems. Philos Trans A Math Phys Eng Sci, 2016. 374(2079).
12. Lu W, et al., Metabolite Measurement: Pitfalls to Avoid and Practices to Follow. Annu Rev Biochem, 2017. 86: p. 277–304.
13. Moldoveanu SC and David V, Derivatization Methods in GC and GC/MS, in Gas Chromatography - Derivatization, Sample Preparation, Application. 2018, Books on Demand.
14. Bennett BD, et al., Absolute metabolite concentrations and implied enzyme active site occupancy in Escherichia coli. Nat Chem Biol, 2009. 5(8): p. 593–9.
15. Dromms RA, Lee JY, and Styczynski MP, LK-DFBA: a linear programming-based modeling strategy for capturing dynamics and metabolite-dependent regulation in metabolism. BMC Bioinformatics, 2020. 21(1): p. 93.
16. Willemsen AM, et al., MetDFBA: incorporating time-resolved metabolomics measurements into dynamic flux balance analysis. Molecular BioSystems, 2015. 11(137).
17. Fernie AR, et al., Recommendations for reporting metabolite data. Plant Cell, 2011. 23(7): p. 2477–82.
18. Khodadadi M and Pourfarzam M, A review of strategies for untargeted urinary metabolomic analysis using gas chromatography–mass spectrometry. Metabolomics, 2020. 16.
19. Schrimpe-Rutledge AC, et al., Untargeted metabolomics strategies – Challenges and Emerging Directions. J Am Soc Mass Spectrom, 2016. 27(12): p. 1897–1905.
20. Chalcraft KR, et al., Virtual quantification of metabolites by capillary electrophoresis-electrospray ionization-mass spectrometry: predicting ionization efficiency without chemical standards. Analytical Chemistry, 2009. 81(7).
21. Wu L, et al., Quantitative structure–ion intensity relationship strategy to the prediction of absolute levels without authentic standards. Analytica Chimica Acta, 2013. 794: p. 67–75.
22. Liigand J, et al., Quantification for non-targeted LC/MS screening without standard substances. Scientific Reports, 2020. 10.
23. Tumanov S, et al., Calibration curve-free GC–MS method for quantitation of amino and non-amino organic acids in biological samples. Metabolomics, 2016. 16(64).
24. Chassagnole C, et al., Dynamic modeling of the central carbon metabolism of Escherichia coli. Biotechnol Bioeng, 2002. 79(1): p. 53–73.
25. Hynne F, Dano S, and Sorensen PG, Full-scale model of glycolysis in Saccharomyces cerevisiae. Biophys Chem, 2001. 94(1–2): p. 121–63.
26. Response Factors, Determination, Accuracy and Precision, in Quantitative Analysis By Gas Chromatography, Guiochon G and Guillemin CL, Editors. 1988. p. 587–627.
27. Koek MM, et al., Quantitative metabolomics based on gas chromatography mass spectrometry: status and perspectives. Metabolomics, 2011. 7(3): p. 307–328.
28. Dromms RA and Styczynski MP, Improved metabolite profile smoothing for flux estimation. Mol Biosyst, 2015. 11(9): p. 2394–405.
29. Lee JY and Styczynski MP, NS-kNN: a modified k-nearest neighbors approach for imputing metabolomics data. Metabolomics, 2018. 14(12): p. 153.
30. Lee JY, et al., SCOUR: a stepwise machine learning framework for predicting metabolite-dependent regulatory interactions. BMC Bioinformatics, 2021. 22(1): p. 365.
31. Chou IC and Voit EO, Estimation of dynamic flux profiles from metabolic time series data. BMC Syst Biol, 2012. 6: p. 84.
32. Goel G, Chou IC, and Voit EO, System estimation from metabolic time-series data. Bioinformatics, 2008. 24(21): p. 2505–11.
33. Antoniewicz MR, A guide to metabolic flux analysis in metabolic engineering: Methods, tools and applications. Metab Eng, 2021. 63: p. 2–12.
34. Canelas AB, et al., Leakage-free rapid quenching technique for yeast metabolomics. Metabolomics, 2008. 4.
35. Stringer KA, et al., Whole Blood Reveals More Metabolic Detail of the Human Metabolome than Serum as Measured by 1H-NMR Spectroscopy: Implications for Sepsis Metabolomics. Shock., 2015. 44(3).
36. Thonusin C, et al., Evaluation of intensity drift correction strategies using MetaboDrift, a normalization tool for multi-batch metabolomics data. J Chromatogr A, 2017. 1523: p. 265–274.
37. Wei X, et al., Data preprocessing method for liquid chromatography-mass spectrometry based metabolomics. Anal Chem, 2012. 84(18): p. 7963–71.
38. Yang J, et al., A data preprocessing strategy for metabolomics to reduce the mask effect in data analysis. Front Mol Biosci, 2015. 2: p. 4.
