Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2024 Nov 4;40(11):btae652. doi: 10.1093/bioinformatics/btae652

ENKIE: a package for predicting enzyme kinetic parameter values and their uncertainties

Mattia G Gollub 1, Thierry Backes 2, Hans-Michael Kaltenbach 3, Jörg Stelling 4,
Editor: Anthony Mathelier
PMCID: PMC11588206  PMID: 39495107

Abstract

Motivation

Relating metabolite and enzyme abundances to metabolic fluxes requires reaction kinetics, core elements of dynamic and enzyme cost models. However, kinetic parameters have been measured only for a fraction of all known enzymes, and the reliability of the available values is unknown.

Results

The ENzyme KInetics Estimator (ENKIE) uses Bayesian Multilevel Models to predict value and uncertainty of KM and kcat parameters. Our models use five categorical predictors and achieve prediction performances comparable to deep learning approaches that use sequence and structure information. They provide calibrated uncertainty predictions and interpretable insights into the main sources of uncertainty. We expect our tool to simplify the construction of priors for Bayesian kinetic models of metabolism.

Availability and implementation

Code and Python package are available at https://gitlab.com/csb.ethz/enkie and https://pypi.org/project/enkie/.

1 Introduction

To construct dynamic mathematical models for metabolic networks (Kim et al. 2018) or to relate reaction fluxes to metabolite and enzyme abundances in protein cost models (Noor et al. 2016), a main challenge is to find kinetic parameter values for kcat (rate constants) and KM (Michaelis constants). Kinetics databases such as BRENDA (Chang et al. 2021) (parameter values for ∼90 000 enzymes from >150 000 publications) and SABIO-RK (Wittig et al. 2018) (7500 publications) cover only a minority of known enzymes (e.g. in few species).

To predict parameter values from existing databases, a recent combination of machine- and deep-learning models achieved a coefficient of determination (R2) of 0.53 for affinities (Kroll et al. 2021) and of 0.44 for catalytic rates (Kroll et al. 2023) on independent test sets. Unfortunately, R2 (or RMSE) values are dataset-specific, and accuracy is not guaranteed for other data, e.g. with different overlap with the training data. The uncertainty of measured values is also poorly characterized: most parameters have few measurements. For reliable uncertainty quantification in kinetic models of various pathways and organisms, uncertainties of measured and predicted values should be computed per parameter.

Here, we develop Bayesian Multilevel Models (BMMs) that capture a simple hierarchy of coarse-grained enzyme properties to predict KM and kcat values as well as their uncertainties without considering structural and physical properties of enzymes, reactions, and metabolites explicitly. The models can be easily used through ENKIE, a python package for the estimation of thermodynamically consistent enzyme kinetic parameter values and their uncertainties.

2 Materials and methods

Statistical models. We use hierarchical BMMs to express the relationship between the properties of reactions/enzymes/metabolites and the value of a kinetic parameter KM or kcat (Supplementary Methods). The BMMs linearly combine population-level effects (with a single value for the entire population) and group-level effects (with different values across groups and a common normal distribution). They capture the reduction in variability as the classification of a parameter value becomes more precise. For KM and kcat values, we assume normally distributed residuals with mean μ and standard deviation σ ({M,cat}). We describe μ by a population average and a series of nested group-level effects. For KM, the substrate is the first grouping level because parameter values for a substrate are conserved across reactions (Park et al. 2016). We then nest the Enzyme Commission (EC)-reaction pair, the protein family, and the protein identifier denoting a specific enzyme of a specific organism. For kcat, we replace the substrate with a nested hierarchy of the first three components of the EC number and include the reaction direction to distinguish between forward and backward rates. To model σ, we use two different population averages for entries with/without an annotated protein identifier because entries with unknown identifier are presumably harder to fit. We use a group-level effect to incorporate that the size of the residuals can differ across reactions because residuals capture many unmodeled factors, such as experimental errors, different experimental conditions, and annotation errors.

Fitting and evaluation. The BMMs capture model uncertainty (confidence in the estimated model parameters given the data) and the residuals σ (discrepancy between measurements of the same parameter, possibly because of unmodeled effects). Because these can be strongly correlated, we fitted the models in a Bayesian framework. Specifically, we implemented the models in R and fitted them using the Markov Chain Monte Carlo (MCMC)-based brms package (Bürkner 2017) to 94’968 entries extracted from BRENDA and SABIO-RK (Supplementary Methods). The simulated chains had no divergent transitions and a potential scale reduction factor R^1.02, suggesting good fit quality (Gelman et al. 2013) (Supplementary Table S7).

Parameter balancing. Constraints by e.g. thermodynamics induce dependencies between kinetic parameters. Optionally, ENKIE estimates free energies with eQuilibrator (Beber et al. 2022). Parameter balancing (Lubitz et al. 2010) combines them with samples from the joint distribution of predicted parameters to obtain a multivariate normal distribution of thermodynamically consistent estimates and uncertainties of free energies and kinetic parameters. For model analysis, we omitted this step to simplify benchmarking. (see Supplementary Methods).

Implementation. ENKIE (Fig. 1a) uses generally available inputs: reaction stoichiometries, metabolite/reaction identifiers in any namespace supported by MetaNetX (Moretti et al. 2021), ECs, protein identifiers and, optionally, physiological properties of the reaction compartments (pH, pMg, temperature and ionic strength). It uses MetaNetX to standardize identifiers and Uniprot (The UniProt Consortium 2021) to obtain protein family annotations. Inputs are passed to the brmspredict method via the rpy2 package.

Figure 1.

Figure 1.

ENKIE design and performance. (a) Overview. We gather kinetic parameter values from BRENDA and SABIO-RK, use Uniprot and MetaNetX to consistently annotate them, and train models for value and uncertainty of KM and kcat parameters. The ENKIE package uses the models to predict value and uncertainty for query parameters, optionally with eQuilibrator (Beber et al. 2022) and parameter balancing (Lubitz et al. 2010) to construct thermodynamically consistent parameters. (b) R2 for ENKIE predictions of parameter values for measurements, proteins, protein families, and reactions outside of the training set. (c, d) Distribution of predicted parameter uncertainties on different dataset splits. Dashed lines: SD of parameter values over the entire dataset. (e) Improvement in the Negative Expected Log Predictive Density (NELPD) score for uncertainties predicted by ENKIE over the test RMSEs. (f, g) Predicted versus test KM (f) or kcat (g) RMSEs binning the test data by metabolite (f, blue), reaction (g, blue) or predicted uncertainty (f, g, red). Correlations: Pearson’s ρ; dashed lines: RMSEs of the entire test dataset.

3 Results

We first analyzed the features’ contributions to the predicted KM and kcat values (Supplementary Table S1). All population- and group-level effects were estimated with small uncertainty. The strongest determinants (i.e. the group-level effects with the largest average size) for KM and kcat were the substrate and reaction identifier respectively. Importantly, the standard deviations of the only organism specific group-level effects (the Uniprot identifier, Supplementary Table S1) were significantly smaller than the standard deviations over the entire dataset (Supplementary Table S5). Hence, measurements from characterized organisms are informative for predicting values in new organisms.

In terms of training data quality, the average values of the estimated residuals σM and σcat were slightly larger with missing protein identifier annotations (Supplementary Table S1), justifying the split into two groups. For entries with a protein identifier (which we expect to be always available in practical applications), estimated upper bounds of the 95% Confidence Intervals above one imply that, for certain parameters, multiple measurements with the same protein span several orders of magnitude.

Next, we characterized the predictive performance of our BMMs for completely unknown measurements, proteins, protein families, and reactions. Specifically, we performed 5-fold cross-validation with different types of folds (equally sized partitions of the input dataset; Supplementary Methods and Supplementary Fig. S2). The models remain predictive even when extrapolating to new reactions (Fig. 1b). Grouped folds are notoriously challenging and a drop in performance is expected. KM models generally achieve a better R2 than kcat models. This could be because affinities are more conserved across organisms than rates, as shown by the lower variance explained by protein effects in KM models (13.2%) than in kcat models (23.9%) (Supplementary Table S1). We compared the R2 for predicting KM and kcat achieved by BMMs to state of the art Gradient Boosting (GB) approaches (Kroll et al. 2021,2023) using the training and test data of these GB publications. Our BMMs’ R2 scores (KM: 0.46, kcat: 0.36) are only marginally lower than those of GB methods (KM:0.53, kcat: 0.44). This is surprising because our models only use ECs, identifiers, and protein family annotations.

We characterized the BMMs’ predicted uncertainties first by their distributions (Fig. 1c and d). Uncertainties are low when the training set contains measurements for a parameter; they increase when training set and predicted parameter share less information. Without measurement, multiple peaks arise when only some test parameters had information about lower levels of the hierarchy (e.g. protein family); relying on higher-level information (e.g. reaction) increases uncertainties. Thus, ENKIE predicts uncertainties for varying overlaps with the training data.

To measure the quality of combined predictions of parameter values and uncertainties, we used the Negative Expected Log Predictive Density (NELPD) (Supplementary Methods), which quantifies the likelihood of the test data in the predicted distributions. Notably, the BMMs’ predicted uncertainties were generally more accurate than RMSEs computed on the test data, which represent the true error distribution in a set of predictions by an average (Fig. 1e). To make such NELPD improvements more tangible, we re-trained the models using random folds. This reflects real use cases where query parameters have different degrees of overlap with the training data (Supplementary Table S8). After binning the test data points by predicted uncertainties, the predicted and effective RMSEs match well for both KM and kcat (Fig. 1f and g). Also, predicted uncertainties and actual errors for each metabolite (reaction) for KM (kcat) show strong correlations. By contrast, the constant test RMSE frequently over- or under-estimates prediction errors. Overall, ENKIE’s predicted uncertainties are well calibrated.

Lastly, we examined the group-level effects of the residuals to determine if there are systematic differences across reactions and organisms, hypothetically due to difficulties in quantifying specific compounds, the speed of reactions, differences across kingdoms, or annotation inaccuracies. We used the already fitted models and additionally fitted models for KM and kcat where the group-level effects of the residuals were based on the organism (Supplementary Methods). Supplementary Figure S4 shows clear differences among reaction groups. For example, pyruvate kinase (MNXR103371) had a large positive effect for the residuals of kcat, caused by a single publication contributing 52 measurements for the same protein with strong variations in the concentrations of several ions and sugar phosphates. In contrast, the RuBisCO carboxylase and oxygenase reactions (MNXR191185 and MNXR191183) had negative effects for both KM and kcat, suggesting that the enzyme has been characterized with particular precision. For organism effects, we did not observe systematic differences across kingdoms (Supplementary Fig. S4).

4 Discussion

We build on linear regression to predict KM values (Borger et al. 2006), but introduce group-level effects. This enables prediction qualities comparable to current deep-learning methods that additionally use enzyme sequence and structure information. Importantly, ENKIE requires only the training data to provide reliable uncertainty estimates independent of the composition of the query. This makes it suitable for predicting parameter distributions for kinetic models of different pathways and organisms.

Compared to deep learning approaches, our models are easily interpretable. The effect sizes of the fitted models directly inform us about how much variance is explained by simple features such as substrate, reaction and protein family. We also showed how the effects on the residuals can be directly explained with the data. This is important because, in addition to ambiguously defined parameters, we frequently encountered challenges in the standardization and interpretation of database values. The residuals estimated by our Bayesian Multilevel Models (BMMs) highlight inconsistencies and can thus help improve the structure and accuracy of enzyme databases.

We addressed the challenge of uncertainty quantification through BMMs because their estimation methodology is well established. Uncertainties can in principle be computed in Bayesian deep learning as well, but this requires developments beyond the current approximate methods (Papamarkou et al. 2024). Compromising on prediction accuracy, ENKIE provides calibrated uncertainties that can increase the robustness of applications relying on kinetic parameters.

Supplementary Material

btae652_Supplementary_Data

Acknowledgements

We thank Janina Linnik, Lukas Widmer, and Sebastian Weber for helpful discussions.

Contributor Information

Mattia G Gollub, Department of Biosystems Science and Engineering and SIB Swiss Institute of Bioinformatics, ETH Zurich, 4056 Basel, Switzerland.

Thierry Backes, Department of Biosystems Science and Engineering and SIB Swiss Institute of Bioinformatics, ETH Zurich, 4056 Basel, Switzerland.

Hans-Michael Kaltenbach, Department of Biosystems Science and Engineering and SIB Swiss Institute of Bioinformatics, ETH Zurich, 4056 Basel, Switzerland.

Jörg Stelling, Department of Biosystems Science and Engineering and SIB Swiss Institute of Bioinformatics, ETH Zurich, 4056 Basel, Switzerland.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest: None declared.

Funding

This work was supported by the Swiss National Science Foundation Sinergia Project #177164.

References

  1. Beber ME, Gollub MG, Mozaffari D. et al. eQuilibrator 3.0: a database solution for thermodynamic constant estimation. Nucleic Acids Res 2022;50:D603–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Borger S, Liebermeister W, Klipp E. et al. Prediction of enzyme kinetic parameters based on statistical learning. Genome Inform 2006;17:80–7. [PubMed] [Google Scholar]
  3. Bürkner P-C. BRMS: an R package for Bayesian multilevel models using Stan. J Stat Soft 2017;80:28. [Google Scholar]
  4. Chang A, Jeske L, Ulbrich S. et al. BRENDA, the ELIXIR core data resource in 2021: new developments and updates. Nucleic Acids Res 2021;49:D498–508. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Gelman A, Carlin JB, Stern HS et al. Bayesian Data Analysis. New York: Chapman and Hall/CRC, 2013. [Google Scholar]
  6. Kim OD, Rocha M, Maia P. et al. A review of dynamic modeling approaches and their application in computational strain optimization for metabolic engineering. Front Microbiol 2018;9:1690. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Kroll A, Engqvist MKM, Heckmann D. et al. Deep learning allows genome-scale prediction of Michaelis constants from structural features. PLoS Biol 2021;19:e3001402. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Kroll A, Rousset Y, Hu X-P. et al. Turnover number predictions for kinetically uncharacterized enzymes using machine and deep learning. Nat Commun 2023;14:4139. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Lubitz T, Schulz M, Klipp E. et al. Parameter balancing in kinetic models of cell metabolism. J Phys Chem B 2010;114:16298–303. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Moretti S, Tran VDT, Mehl F. et al. MetaNetX/MNXref: unified namespace for metabolites and biochemical reactions in the context of metabolic models. Nucleic Acids Res 2021;49:D570–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Noor E, Flamholz A, Bar-Even A. et al. The protein cost of metabolic fluxes: prediction from enzymatic rate laws and cost minimization. PLoS Comput Biol 2016;12:e1005167. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Papamarkou T, Skoularidou M, Palla K et al. Position: Bayesian deep learning is needed in the age of large-scale AI. In: Proceedings of the 41st International Conference on Machine Learning. Vienna, Austria: PMLR 235, 2024.
  13. Park JO, Rubin SA, Xu Y-F. et al. Metabolite concentrations, fluxes and free energies imply efficient enzyme usage. Nat Chem Biol 2016;12:482–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 2021;49:D480–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Wittig U, Rey M, Weidemann A. et al. SABIO-RK: an updated resource for manually curated biochemical reaction kinetics. Nucleic Acids Res 2018;46:D656–60. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

btae652_Supplementary_Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES