PLoS Biol. 2021 Oct 19;19(10):e3001402. doi: 10.1371/journal.pbio.3001402

Deep learning allows genome-scale prediction of Michaelis constants from structural features

Alexander Kroll 1, Martin K M Engqvist 2, David Heckmann 1,*, Martin J Lercher 1,*
Editor: Jason W Locasale
PMCID: PMC8525774  PMID: 34665809

Abstract

The Michaelis constant KM describes the affinity of an enzyme for a specific substrate and is a central parameter in studies of enzyme kinetics and cellular physiology. As measurements of KM are often difficult and time-consuming, experimental estimates exist for only a minority of enzyme–substrate combinations even in model organisms. Here, we build and train an organism-independent model that successfully predicts KM values for natural enzyme–substrate combinations using machine and deep learning methods. Predictions are based on a task-specific molecular fingerprint of the substrate, generated using a graph neural network, and on a deep numerical representation of the enzyme’s amino acid sequence. We provide genome-scale KM predictions for 47 model organisms, which can be used to approximately relate metabolite concentrations to cellular physiology and to aid in the parameterization of kinetic models of cellular metabolism.


To understand the action of an enzyme, we need to know its affinity for its substrates, quantified by Michaelis constants, but these are difficult to measure experimentally. This study presents a deep learning model that can predict them from structural features of the enzyme and substrate, providing KM predictions for all enzymes across 47 model organisms.

Introduction

The Michaelis constant, KM, is defined as the concentration of a substrate at which an enzyme operates at half of its maximal catalytic rate; it hence describes the affinity of an enzyme for a specific substrate. Knowledge of KM values is crucial for a quantitative understanding of enzymatic and regulatory interactions between enzymes and metabolites: It relates the intracellular concentration of a metabolite to the rate of its consumption, linking the metabolome to cellular physiology.

As experimental measurements of KM and kcat are difficult and time-consuming, no experimental estimates exist for many enzymes even in model organisms. For example, in Escherichia coli, the biochemically best characterized organism, in vitro KM measurements exist for less than 30% of natural substrates (see Methods, “Download and processing of KM values”), and turnover numbers have been measured in vitro for only about 10% of the approximately 2,000 enzymatic reactions [1].

KM values, together with enzyme turnover numbers, kcat, are required for models of cellular metabolism that account for the concentrations of metabolites. The current standard approach in large-scale kinetic modeling is to estimate kinetic parameters in an optimization process [2–4]. These optimizations typically attempt to estimate many more unknown parameters than they have measurements as inputs, and, hence, the resulting KM and kcat values have wide confidence ranges and show little connection to experimentally observed values [2]. Therefore, predictions of these values from artificial intelligence, even if only up to an order of magnitude, would represent a major step toward more realistic models of cellular metabolism and could drastically increase the biological understanding provided by such models.

Only a few previous studies have attempted to predict kinetic parameters of natural enzymatic reactions in silico. Heckmann and colleagues [5] successfully employed machine learning models to predict unknown turnover numbers for reactions in E. coli. They found that the most important predictors of kcat were the reaction flux catalyzed by the enzyme, estimated computationally through parsimonious flux balance analysis, and structural features of the catalytic site. While many E. coli kcat values could be predicted successfully with this model, active site information was not available for a sizeable fraction of enzymes [5]. Moreover, neither active site information nor reaction flux estimates are broadly available beyond a small number of model organisms, preventing the generalization of this approach.

Borger and colleagues [6] trained a linear model to predict KM values based on other KM measurements for the same substrate paired with different enzymes in the same organism and with the same enzymes in other organisms; they fitted an independent model for each of 8 different substrates. Yan and colleagues [7] later followed a similarly focused strategy, predicting KM values of beta-glucosidases for the substrate cellobiose based on a neural network. These 2 previous prediction approaches for KM targeted individual, well-studied enzyme–substrate combinations with ample experimental KM data for training and testing. Their strategies are thus unsuitable for less well-studied reactions and cannot be applied to genome-scale predictions.

A related problem to the prediction of KM is the prediction of drug–target interactions, an important task in drug development. Multiple approaches for the prediction of drug–target binding affinities (DTBAs) have been developed (reviewed in [8]). Most of these approaches are either similarity-based, structure-based, or feature-based. Similarity-based methods rely on the assumption that similar drugs tend to interact with similar targets; these methods use known drug–target interactions to learn a prediction function based on drug–drug and target–target similarity measures [9,10]. Structure-based models for DTBA prediction utilize information on the target protein’s 3D structure [11,12]. Neither of these 2 strategies can easily be generalized to genome-scale, organism-independent predictions, as many enzymes and substrates share only distant similarities with well-characterized molecules, and 3D structures are only available for a minority of enzymes.

In contrast to these first 2 approaches, feature-based models for drug–target interaction predictions use numerical representations of the drug and the target as the input of fully connected neural networks (FCNNs) [13–16]. The drug feature vectors are most often either SMILES representations [17], expert-crafted fingerprints [18–20], or fingerprints created with graph neural networks (GNNs) [21,22], while those of the targets are usually sequence-based representations. As this information can easily be generated for most enzymes and substrates, we here use a similar approach to develop a model for KM prediction.

An important distinction between the prediction of KM and DTBA prediction is that the former aims to predict affinities for known, natural enzyme–metabolite combinations. These affinities evolved under natural selection for the enzymes’ functions, an evolutionary process strongly constrained by the metabolite structure. In contrast, wild-type proteins did not evolve in the presence of a drug, and, hence, molecular structures are likely to contain only very limited information about the binding affinity for a target without information about the target protein.

Despite the central role of the metabolite molecular structure for the evolved binding affinity of its consuming enzymes, important information on the affinity must also be contained in the enzyme structure and sequence. To predict KM, it would be desirable to employ detailed structural and physicochemical information on the enzyme’s substrate binding site, as done by Heckmann and colleagues for their kcat predictions in E. coli [5]. However, these sites have only been characterized for a minority of enzymes [23]. An alternative approach is to employ a multidimensional numerical representation of the entire amino acid sequence of the enzyme, as provided by UniRep [24]. UniRep vectors are based on a deep representation learning model and have been shown to retain structural, evolutionary, and biophysical information.

Here, we combine UniRep vectors of enzymes and diverse molecular fingerprints of their substrates to build a general, organism-, and reaction-independent model for the prediction of KM values, using machine and deep learning models. In the final model, we employ a 1,900-dimensional UniRep vector for the enzyme together with a task-specific molecular fingerprint of the substrate as the input of a gradient boosting model. Our model reaches a coefficient of determination of R2 = 0.53 between predicted and measured values on a test set, i.e., the model explains 53% of the variability in KM values across different, previously unseen natural enzyme–substrate combinations. In S1 Data, we provide complete KM predictions for 47 genome-scale metabolic models, including those for Homo sapiens, Mus musculus, Saccharomyces cerevisiae, and E. coli.

Results

For all wild-type enzymes in the BRENDA database [25], we extracted organism name, Enzyme Commission (EC) number, UniProt ID, and amino acid sequence, together with information on substrates and associated KM values. If multiple KM values existed for the same combination of substrate and enzyme amino acid sequence, we took the geometric mean. This resulted in a dataset with 11,675 complete entries, which was split into a training set (80%) and a test set only used for the final validation (20%). All KM values were log10-transformed.

Predicting KM from molecular fingerprints

To train a prediction model for KM, we first had to choose a numerical representation of the substrate molecules. For each substrate in our dataset, we calculated 3 different expert-crafted molecular fingerprints, i.e., bit vectors where each bit represents a fragment of the molecule. The expert-crafted fingerprints used are extended connectivity fingerprints (ECFPs), RDKit fingerprints, and MACCS keys. We calculated them with the python package RDKit [19] based on MDL Molfiles of the substrates (downloaded from KEGG [26]; a Molfile lists a molecule’s atom types, atom coordinates, and bond types [27]).

MACCS keys are 166-dimensional binary fingerprints, where each bit indicates whether a certain chemical structure is present in a molecule, e.g., whether the molecule contains a ring of size 4 or fewer than 3 oxygen atoms [20]. RDKit fingerprints are generated by identifying all subgraphs in a molecule within a predefined size range. These subgraphs are converted into numerical values using hash functions, which are then used to indicate which bits in a 2,048-dimensional binary vector are set to 1 [19]. Finally, to calculate ECFPs, molecules are represented as graphs by interpreting the atoms as nodes and the chemical bonds as edges. Bond types and feature vectors with information about every atom (types, masses, valences, atomic numbers, charges, and numbers of attached hydrogen atoms) are calculated [18]. Afterwards, these identifiers are updated for a predefined number of steps by iteratively applying predefined functions to summarize aspects of neighboring atoms and bonds. After the iteration process, all identifiers are used as the input of a hash function to produce a binary vector with structural information about the molecule. The number of iterations and the dimension of the fingerprint can be chosen freely. We set them to the default values of 3 and 1,024, respectively; lower or higher dimensions led to inferior predictions.
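
For illustration, a minimal Python sketch of how these 3 fingerprints can be computed with RDKit ("substrate.mol" is a placeholder Molfile path; note that RDKit's MACCS implementation returns a 167-bit vector whose bit 0 is unused):

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

mol = Chem.MolFromMolFile("substrate.mol")  # parse the MDL Molfile

# 1,024-dimensional ECFP with 3 update iterations (radius 3)
ecfp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=1024)

# 2,048-dimensional RDKit fingerprint from hashed subgraphs
rdkit_fp = Chem.RDKFingerprint(mol, fpSize=2048)

# MACCS keys (RDKit returns 167 bits, of which bit 0 is unused)
maccs = MACCSkeys.GenMACCSKeys(mol)
```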

To compare the information on KM contained in the different molecular fingerprints independent of protein information, we used the molecular fingerprints as the sole input to elastic nets, FCNNs, and gradient boosting models. To the fingerprints, we added the 2 features molecular weight (MW) and octanol–water partition coefficient (LogP), which were shown to be correlated with the KM value [28]. The models were then trained to predict the KM values of enzyme–substrate combinations (Fig 1A). The FCNNs consisted of an input layer with the dimension of the fingerprint (including the additional features MW and LogP), 2 hidden layers, and a 1D output layer (for more details, see Methods). Gradient boosting is a machine learning technique that creates an ensemble of many decision trees to make predictions. Elastic nets are regularized linear regression models, where the regularization term is a linear combination of the L1- and L2-norms of the model parameters. For each combination of the 3 model types and the 3 fingerprints, we performed a hyperparameter optimization with 5-fold cross-validation on the training set, measuring performance through the mean squared error (MSE). For all 3 types of fingerprints, the gradient boosting model outperformed the FCNN and the elastic net (S1–S3 Tables).

Fig 1. Model overview.

Fig 1

(a) Predefined molecular fingerprints. Molecular fingerprints are calculated from MDL Molfiles of the substrates and then passed through machine learning models like the FCNN together with 2 global features of the substrate, the MW and LogP. (b) GNN fingerprints. Node and edge feature vectors are calculated from MDL Molfiles and are then iteratively updated for T time steps. Afterwards, the feature vectors are pooled together into a single vector that is passed through an FCNN together with the MW and LogP. FCNN, fully connected neural network; GNN, graph neural network; LogP, octanol–water partition coefficient; MW, molecular weight.

The KM predictions with the gradient boosting model based solely on the substrate ECFP, MACCS keys, and RDKit molecular fingerprints showed very similar performances on the test set, with MSE = 0.83 and coefficients of determination R2 = 0.40 (Fig 2).

Fig 2. When using only substrate features as inputs, task-specific molecular fingerprints (GNN) lead to better KM predictions than predefined, expert-crafted fingerprints.

Fig 2

(a) MSE on log10-scale. (b) Coefficients of determination R2. Boxplots summarize the results of the 5-fold cross-validations on the training set; blue dots show the results on the test set. The data underlying the graphs shown in this figure can be found at https://github.com/AlexanderKroll/KM_prediction/tree/master/figures_data. ECFP, extended connectivity fingerprint; GNN, graph neural network; MSE, mean squared error.

Best KM predictions from metabolite fingerprints using graph neural networks and gradient boosting

Recent work has shown that superior prediction performance can be achieved through task-specific molecular fingerprints, where a deep neural network simultaneously optimizes the fingerprint and uses it to predict properties of the input. In contrast to conventional neural networks, these GNNs can process non-Euclidean inputs, such as molecular structures. This approach led to state-of-the-art performances on many biological and chemical datasets [21,22].

As an alternative to the predefined, expert-crafted molecular fingerprints, we thus also tested how well we can predict KM from a task-specific molecular fingerprint based on a GNN (Fig 1; for details, see Methods, “Architecture of the graph neural network”). As for the calculations of the ECFPs, each substrate molecule is represented as a graph by interpreting the atoms as nodes and the chemical bonds as edges, for which feature vectors are calculated from the MDL Molfiles. These are updated iteratively for a fixed number of steps, in each step applying functions with learnable parameters to summarize aspects of neighboring atoms and bonds. After the iterations, the feature vectors are pooled into 1 molecular fingerprint vector. In contrast to ECFPs, the parameters of the update functions are not fixed but are adjusted during the training of the FCNN that predicts KM from the pooled fingerprint vector (Methods). As for the predefined molecular fingerprints, we defined an extended GNN fingerprint by adding the 2 global molecular features LogP and MW to the model before the KM prediction step.

To compare the learned substrate representation with the 3 predefined fingerprints, we extracted the extended GNN fingerprint for every substrate in the dataset and fitted an elastic net, an FCNN, and a gradient boosting model to predict KM. As before, we performed a hyperparameter optimization with 5-fold cross-validation on the training set for all models. The gradient boosting model again achieved better results than the FCNN and the elastic net (S1–S3 Tables). The performance of our task-specific fingerprints is better than that of the predefined fingerprints, reaching an MSE = 0.80 and a coefficient of determination R2 = 0.42 on the test set, compared to an MSE = 0.83 and R2 = 0.40 for the other fingerprints (Fig 2). To compare the performances statistically, we used a one-sided Wilcoxon signed-rank test for the absolute errors of the predictions for the test set, resulting in p = 0.0080 (ECFP), p = 0.073 (RDKit), and p = 0.062 (MACCS keys). While the differences in the error distributions are only marginally statistically significant for RDKit and MACCS keys at the 5% level, these analyses support the choice of the task-specific GNN molecular fingerprint for predicting KM.

It is noteworthy that the errors on the test set are smaller than the errors achieved during cross-validation. We found that the number of training samples has a great influence on model performance (see below, “Model performance increases linearly with training set size”). Hence, the improved performance on the test set may result from the fact that before validation on the test set, models are trained with approximately 2,000 more samples than before each cross-validation.

Effects of molecular weight and octanol–water partition coefficient

Before predicting KM from the molecular fingerprints, we added the MW and the LogP. Do these extra features contribute to improved predictions by the task-specific GNN fingerprints? To answer this question, we trained GNNs without the additional features LogP and MW, as well as with only one of those additional features. Fig 3 displays the performance of gradient boosting models that are trained to predict KM with GNN fingerprints with and without extra features, showing that the additional features have only a small effect on performance: Adding both features reduces MSE from 0.82 to 0.80, while increasing R2 from 0.41 to 0.42. The difference in model performance is not statistically significant (p = 0.13, one-sided Wilcoxon signed-rank test for the absolute errors of the predictions for the test set). This indicates that most of the information used to predict KM can be extracted from the graph of the molecule itself. However, since the addition of the 2 additional features slightly improves KM predictions on the test dataset, we include the features MW and LogP in our further analyses.

Fig 3. Adding MW and LogP as features has only a minor effect on the performance of the GNN in predicting KM.

Fig 3

(a) MSE on log10-scale. (b) Coefficients of determination R2. Models use the GNN with additional features LogP and MW; with only one of the additional features; and without the 2 features. Boxplots summarize the results of the 5-fold cross-validations on the training set; blue dots show the results on the test set. The data underlying the graphs shown in this figure can be found at https://github.com/AlexanderKroll/KM_prediction/tree/master/figures_data. GNN, graph neural network; LogP, octanol–water partition coefficient; MSE, mean squared error; MW, molecular weight.

UniRep vectors as additional features

So far, we have only considered substrate-specific information. As KM values are features of specific enzyme–substrate interactions, we now need to add input features that represent enzyme properties. Important information on substrate binding affinity is contained in molecular features of the catalytic site; however, active site identities and structures are available only for a small minority of enzymes in our dataset.

We thus restrict the enzyme information utilized by the model to a deep numerical representation of the enzyme’s amino acid sequence, calculating a UniRep vector [24] for each enzyme. UniRep vectors are 1,900-dimensional statistical representations of proteins, created with an mLSTM, a recurrent neural network architecture for sequence modeling that combines the long short-term memory and multiplicative recurrent neural network architectures. The model was trained on 24 million unlabeled amino acid sequences to predict the next amino acid in a sequence, given the previous amino acids [24]. In this way, the mLSTM learns to store important information about the previous amino acids in a numerical vector, which can later be extracted and used as a representation of the protein. It has been shown that these representations lead to good results when used as input features in prediction tasks concerning protein stability, function, and design [24].

Predicting KM using substrate and enzyme information

To predict the KM value, we concatenated the 52-dimensional task-specific extended fingerprint learned with the GNN and the 1,900-dimensional UniRep vector with information about the enzyme’s amino acid sequence into a global feature vector. This vector was then used as the input for a gradient boosting model for regression in order to predict the KM value. We also trained an FCNN and an elastic net; however, predictions were substantially worse (S4–S6 Tables), consistent with the results obtained when using only the substrate fingerprints as inputs.

The gradient boosting model that combines substrate and enzyme information achieves an MSE = 0.65 on a log10-scale and results in a coefficient of determination R2 = 0.53, substantially superior to the above models based on substrate information alone. We also validate our model with an additional metric, rm2, a commonly used performance measure for quantitative structure–activity relationship (QSAR) prediction models. It is defined as rm2 = r2 × (1 − √(r2 − r02)), where r2 and r02 are the squared correlation coefficients with and without intercept, respectively [29,30]. Our model achieves a value of rm2 = 0.53 on the test set.
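
A minimal numpy sketch of this metric (the regression-through-origin definition of r02 follows the QSAR literature [29,30] and is an assumption about the exact implementation):

```python
import numpy as np

def rm_squared(y_true, y_pred):
    # r^2: squared Pearson correlation (regression with intercept)
    r2 = np.corrcoef(y_true, y_pred)[0, 1] ** 2
    # r0^2: coefficient of determination of a least-squares fit
    # through the origin, as commonly defined in the QSAR literature
    k = np.sum(y_true * y_pred) / np.sum(y_pred ** 2)
    ss_res = np.sum((y_true - k * y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r02 = 1 - ss_res / ss_tot
    # abs() guards against small negative differences from rounding
    return r2 * (1 - np.sqrt(abs(r2 - r02)))
```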

Fig 4A and 4B compare the performance of the full model to models that use only substrate or only enzyme information as inputs, applied to the BRENDA test dataset (which only contains previously unseen enzyme–substrate combinations). To predict the KM value from only the enzyme UniRep vector, we again fitted a gradient boosting model, leading to MSE = 1.01 and R2 = 0.27. To predict the KM value from substrate information only, we chose the gradient boosting model with extended task-specific fingerprints as its inputs, which was used for the comparison with the other molecular fingerprints (Fig 2).

Fig 4. Performance of the optimized models.

Fig 4

(a) MSE. (b) Coefficients of determination (R2). Values in (a) and (b) are calculated using the gradient boosting model with different inputs: substrate and enzyme information; substrate information only (GNN); and enzyme information only (UniRep). Boxplots summarize the results of the 5-fold cross-validations on the training set; blue dots show the results on the test set. For comparison, we also show results on the test set from a naïve model using the mean of the KM values in the training set for all predictions. (c) Scatter plot of log10-transformed KM values of the test set predicted with the gradient boosting model with substrate and enzyme information as inputs versus the experimental values downloaded from BRENDA. Red dots are for combinations where neither enzyme nor substrate were part of the training set. The data underlying the graphs shown in this figure can be found at https://github.com/AlexanderKroll/KM_prediction/tree/master/figures_data. GB, gradient boosting; GNN, graph neural network; MSE, mean squared error.

Fig 4A and 4B also compare the 3 models to the naïve approach of simply using the mean over all KM values in the training set as a prediction for all KM values in the test set, resulting in MSE = 1.38 and R2 = 0. Fig 4C compares the values predicted using the full model with the experimental values of the test set obtained from BRENDA.

Predicting KM for an independently acquired test dataset

Our model was trained and tested on data from BRENDA. To confirm its prediction power, it is desirable to test it on data from other sources. We thus created an additional, independent test set by obtaining the same type of information from the Sabio-RK database [31], keeping only entries that were not already included in the BRENDA dataset. This resulted in a second test set with 274 entries. The model trained on the BRENDA data achieves a very similar performance (MSE = 0.67, R2 = 0.49) on the independent Sabio-RK test data (orange dots in S1 Fig).

Predicting KM for enzymes and substrates not represented in the training data

Homologous enzymes that catalyze the same reaction tend to have broadly similar kinetic parameters. To test to what extent such similarities affect our results, we investigated how well our model performs for the 664 data points in the test set that have substrate–EC number combinations not found in the training set (violet dots in S1 Fig). The KM predictions for these data points resulted in an MSE = 0.79 and R2 = 0.45, compared to MSE = 0.65 and R2 = 0.53 for the full test data.

It is conceivable that predictions are substantially better if the training set contains entries with the same substrate or with the same enzyme, even if not in the same combination. In practice, one may however want to also make predictions for combinations where the enzyme and/or the substrate are not represented in the training data at all. To test how our model performs in such cases, we separately analyzed those 57 entries in the test data where neither enzyme nor substrate occurred in the training data, resulting in MSE = 0.74 and R2 = 0.26, compared to MSE = 0.65 and R2 = 0.53 for the full test data (red points in Fig 4C). At least in part, the smaller R2 value can be explained by the poor predictions for KM values below 10^−2 mM (see the residuals in panel a in S2 Fig). The training dataset contained few KM values in this region (panel b in S2 Fig); there may have simply been too little training data here for the challenging task of predicting KM for unseen enzymes and substrates. In contrast, the model performs substantially better for unseen substrates and enzymes with KM values between 10^−2 and 10^0 mM, where much more training data were available. We conclude that given enough training data, the proposed model appears capable of predicting KM values also for data points where substrate and/or enzyme are not in the training set.

Model performance increases linearly with training set size

The last analysis indicates that prediction performance may be strongly affected by the amount of relevant training data. Indeed, the training datasets employed for AI prediction tasks are typically vastly larger than those available for predicting KM. To test if the size of the training set has a substantial, general effect on prediction quality, we trained the final gradient boosting model with different amounts of the available training samples. For this analysis, we randomly excluded data points from the original training set, creating 6 different training sets with sizes ranging from about 4,500 to approximately 9,500 data points. Fig 5 shows that model performance (measured either in terms of MSE or R2) increases approximately linearly with the size of the training set. This result indicates that our models are still far from overfitting and that increasing availability of data will allow more accurate predictions in the future.
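
A sketch of this experiment (assuming numpy arrays X_train, y_train, X_test, and y_test hold the concatenated input features and log10-transformed KM values; hyperparameters omitted for brevity):

```python
import numpy as np
import xgboost as xgb
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(0)
for n in [4500, 5500, 6500, 7500, 8500, 9500]:
    # randomly retain n of the available training samples
    idx = rng.choice(len(y_train), size=n, replace=False)
    model = xgb.XGBRegressor()  # hyperparameters as in Methods, omitted here
    model.fit(X_train[idx], y_train[idx])
    y_hat = model.predict(X_test)
    print(n, mean_squared_error(y_test, y_hat), r2_score(y_test, y_hat))
```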

Fig 5. Effect of the training set size on model performance.

Fig 5

(a) MSE. (b) Coefficients of determination (R2). Values in (a) and (b) are calculated for the test sets, using the gradient boosting model with substrate and enzyme information as the input. The gradient boosting model is trained with different amounts of the available training samples. The data underlying the graphs shown in this figure can be found at https://github.com/AlexanderKroll/KM_prediction/tree/master/figures_data. MSE, mean squared error.

KM predictions for enzymatic reactions in genome-scale metabolic models

Above, we have described the development and evaluation of a pipeline for genome-scale, organism-independent prediction of KM values. This pipeline and its parameterization can be used, for example, to obtain preliminary KM estimates for enzyme–substrate combinations of interest or to parameterize kinetic models of enzymatic pathways or networks. To facilitate such applications, we predicted KM values for all enzymes in 47 curated genome-scale metabolic models (S1 Data), including models for E. coli, S. cerevisiae, M. musculus, and H. sapiens.

These models are for organisms from different domains, while the training and test data are dominated by bacteria. To test if this uneven training data distribution leads to biases, we divided our test set into subsets belonging to the domains Archaea, Bacteria, and Eukarya, calculating separate MSE and R2 values for each domain. The test set contained 142 data points from Archaea, with MSE = 0.71 and R2 = 0.37; 1,439 data points from Bacteria, with MSE = 0.65 and R2 = 0.51; and 749 data points from Eukarya, with MSE = 0.64 and R2 = 0.56. We therefore conclude that our model can predict KM values for different domains approximately equally well.

The predictions for the genome-scale metabolic models in S1 Data are based on a machine learning model trained with all of the available data, including all data points from the test set. For 73% of the reactions across all 47 metabolic models, substrate and enzyme information were available, such that the full prediction model could be applied. For 15% only substrate information, for 10% only enzyme information, and for 2% neither substrate nor enzyme information were available. We treated situations with missing information as follows: If information on only one of the 2 molecules (enzyme or substrate) was available, we used the corresponding reduced prediction model (with either only UniRep vector or only extended GNN representation as input, respectively). If both substrate and enzyme information were missing, we predicted the KM value as the geometric mean of all KM values in our dataset.
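
The resulting dispatch logic can be summarized in a short sketch (function and variable names are hypothetical; the 3 fitted gradient boosting models and the precomputed log10-scale mean are assumed to exist):

```python
import numpy as np

def predict_km(fingerprint, unirep, full_model, substrate_model,
               enzyme_model, log10_km_mean):
    # full model if both substrate and enzyme information are available
    if fingerprint is not None and unirep is not None:
        features = np.concatenate([fingerprint, unirep])[None, :]
        return full_model.predict(features)[0]
    # reduced models if only one kind of information is available
    if fingerprint is not None:
        return substrate_model.predict(fingerprint[None, :])[0]
    if unirep is not None:
        return enzyme_model.predict(unirep[None, :])[0]
    # fallback: geometric mean of all KM values
    # (= arithmetic mean on the log10 scale)
    return log10_km_mean
```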

Discussion

In conclusion, we found that Michaelis constants of enzyme–substrate pairs, KM, can be predicted through artificial intelligence with a coefficient of determination of R2 = 0.53: More than half of the variance in KM values across enzymes and organisms can be predicted from deep numerical representations of enzyme amino acid sequence and substrate molecular structure. This performance is largely organism-independent and does not require that either enzyme or substrate are covered by the dataset used for training; the good performance was confirmed using a second, independent and nonoverlapping test set from Sabio-RK (R2 = 0.49). To obtain this predictive performance, we used task-specific fingerprints of the substrate (GNN) optimized for the KM prediction, as these appear to contain more information about KM values than predefined molecular fingerprints based on expert-crafted transformations (ECFP, RDKit fingerprint, MACCS keys). The observed difference between GNNs and predefined fingerprints is in line with the results of a previous study on the prediction of chemical characteristics of small molecules [22].

Fig 4, which compares KM predictions across different input feature sets, indicates that the relevant information contained in an enzyme’s amino acid sequence may be less important for its evolved binding affinity to a natural substrate than the substrate’s molecular structure: Predictions based only on substrate structures explain almost twice as much variance in KM compared to predictions based only on enzyme representations. It is possible, though, that improved (possibly task-specific) enzyme representations will modify this picture in the future.

A direct comparison of the prediction quality of our model to the results of Yan and colleagues [7] would not be meaningful, as the scope of their model is very different from that of ours. Yan and colleagues trained a model specific to a single enzyme–substrate pair with only 36 data points, aiming to distinguish KM values between different sequences of the same enzyme (beta-glucosidase) for the same substrate (cellobiose). However, the performance of our general model, with MSE = 0.65, compares favorably to that of the substrate-specific statistical models of Borger and colleagues [6], which resulted in an overall MSE = 1.02.

We compare our model to 2 different models for DTBA prediction, DeepDTA and SimBoost [10,16]. These 2 models, which were trained and tested on the same 2 datasets, achieved rm2 values ranging from 0.63 to 0.67 on test sets. This compares to rm2 = 0.53 achieved for KM predictions with our approach. It is generally difficult to compare prediction performance between models trained and tested on different datasets. Here, this difficulty is exacerbated by the different prediction targets (DTBA versus KM). Crucially, the datasets used for DTBA and KM prediction differ substantially with respect to their densities, i.e., the fraction of possible protein–ligand combinations covered by the training and test data. One of the datasets used for DTBA prediction encompasses experimental data for all possible drug–target combinations between 68 drugs and 442 target proteins (68×442 = 30,056). The second dataset contains data for approximately 25% of all possible combinations between 2,111 drugs and 229 proteins (118,254 out of 2,111×229 = 483,419). In contrast, our KM dataset features 7,001 different enzymes and 1,582 substrates but comprises only about 0.1% of their possible combinations (11,600 out of 7,001×1,582 = 11,075,582). Thus, our dataset is not only much smaller, but also has an extremely low coverage of possible protein–ligand combinations compared to the DTBA datasets used in [10,16]. As shown in Fig 5, the number of available training samples has a strong impact on model performance, and the same is likely true for the data density. Against this background, the performance of our KM prediction model could be seen as surprisingly good. Fig 5 indicates that KM predictions can be improved substantially once more training data become available.

To provide the model with information about the enzyme, we used statistical representations of the enzyme amino acid sequence. We showed that these features provide important enzyme-specific information for the prediction of KM. It appears likely that predictions could be improved further by taking features of the enzyme active site into account—such as hydrophobicity, depth, or structural properties [5]—once such features become widely available [23]. Adding organism-specific information, such as the typical intracellular pH or temperature, may also increase model performance.

We wish to emphasize that our model is trained to predict KM values for enzyme–substrate pairs that are known to interact as part of the natural cellular physiology, meaning that their affinity has evolved under natural selection. The model should thus be used with care when making predictions for enzyme interactions with other substrates, such as nonnatural compounds or substrates involved in moonlighting activities. In such cases, DTBA prediction models (with their higher data density) may be better suited, and estimates with our model should be regarded as a lower bound for KM that might be reached under appropriate natural selection.

To put the performance of the current model into perspective, we consider the mean relative prediction error MRPE = 4.1, meaning that our predictions deviate from experimental estimates on average by 4.1-fold. This compares to a mean relative deviation of 3.4-fold between a single KM measurement and the geometric mean of all other measurements for the same enzyme–substrate combination in the BRENDA dataset (the geometric means of enzyme–substrate combinations were used for training the models). Part of the high variability across values in BRENDA is due to varying assay conditions in the in vitro experiments [28]. Moreover, entries in BRENDA are not free from errors; on the order of 10% of the values in the database do not correspond to values in the original papers, e.g., due to errors in unit conversion [28].
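
One plausible way to compute such a fold-deviation from log10-scale predictions (an assumption about the exact definition of the MRPE, not the authors' published code):

```python
import numpy as np

def mean_fold_deviation(y_true_log10, y_pred_log10):
    # average absolute deviation on the log10 scale, converted back
    # to a fold-change; a value of 4.1 means predictions deviate from
    # measurements by 4.1-fold on average
    return 10 ** np.mean(np.abs(y_pred_log10 - y_true_log10))
```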

Especially against the background of this variation, the performance of our enzyme–substrate specific KM model appears remarkable. In contrast to previous approaches [6,7,13–16], the model requires no previous knowledge about measured KM values for the considered substrate or enzyme. Furthermore, only one general-purpose model is trained; it is not necessary to obtain training data and to fit new models for individual substrates, enzyme groups, or organisms. Once the model has been fitted, it can provide genome-scale KM predictions from existing features within minutes. We here provide such predictions for a broad set of model organisms, including mouse and human; these data can provide base estimates for unknown kinetic constants, e.g., to relate metabolomics data to cellular physiology, and can help to parameterize kinetic models of metabolism. Future work may develop similar prediction frameworks for enzyme turnover numbers (kcat), which would facilitate the completion of such parameterizations.

Methods

Software and code availability

We implemented all code in Python [32]. The neural networks were implemented using the deep learning libraries TensorFlow [33] and Keras [34], and the gradient boosting models were fitted using the library XGBoost [35].

All datasets generated and the Python code used to produce the results (in Jupyter notebooks) are available from https://github.com/AlexanderKroll/KM_prediction. One of the Jupyter notebooks contains all the necessary steps to download the data from BRENDA and Sabio-RK and to preprocess it. Execution of a second notebook performs training and validation of our final model. Two additional notebooks contain code to train the models with molecular fingerprints as inputs and to investigate the effect of the 2 additional features, MW and LogP, for the GNN.

Download and processing of KM values from BRENDA

We downloaded KM values together with organism and substrate name, EC number, UniProt ID of the enzyme, and PubMed ID from the BRENDA database [25]. This resulted in a dataset with 156,387 entries. We mapped substrate names to KEGG Compound IDs via a synonym list from KEGG [26]. For all substrate names that could not be mapped to a KEGG Compound ID directly, we tried to map them first to PubChem Compound IDs via a synonym list from PubChem [36] and then mapped these IDs to KEGG Compound IDs using the web service of MBROLE [37]. We downloaded amino acid sequences for all data points via the UniProt mapping service [38] if the UniProt ID was available; otherwise, we downloaded the amino acid sequence from BRENDA via the organism name and EC number.

We then removed (i) all duplicates (i.e., entries with identical values for KM, substrate, and amino acid sequence as another entry); (ii) all entries with non-wild-type enzymes (i.e., with a commentary field in BRENDA labeling it as mutant or recombinant); (iii) entries for nonbacterial organisms without a UniProt ID for the enzyme; and (iv) entries with substrate names that could not be mapped to a KEGG Compound ID. This resulted in a filtered set of 34,526 data points. Point (iii) was motivated by the expectation that isoenzymes are frequent in eukaryotes but rare in bacteria, such that organism name and EC number are sufficient to unambiguously identify an amino acid sequence in the vast majority of cases for bacteria but not for eukaryotes. If multiple KM values existed for 1 substrate and 1 amino acid sequence, we took the geometric mean across these values. For 11,737 of these data points, we could find an entry for the EC number–substrate combination in the KEGG reaction database. Since we are only interested in KM values for natural substrates, we only kept these data points [28]. We log10-transformed all KM values in this dataset. We split the final dataset with 11,737 entries randomly into training data (80%) and test data (20%). We further split the training set into 5 subsets, which we used for 5-fold cross-validations for the hyperparameter optimization of the machine learning models. We used the test data to evaluate the final models after hyperparameter optimization.
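
A minimal pandas sketch of the aggregation and split (the DataFrame `df` and its column names "sequence", "substrate", and "KM" are hypothetical):

```python
import numpy as np
import pandas as pd

# df holds one row per measurement
df["log10_KM"] = np.log10(df["KM"])

# geometric mean over replicates = arithmetic mean of the log10 values
agg = df.groupby(["sequence", "substrate"], as_index=False)["log10_KM"].mean()

# random 80/20 split into training and test data
agg = agg.sample(frac=1.0, random_state=0).reset_index(drop=True)
n_train = int(0.8 * len(agg))
train, test = agg.iloc[:n_train], agg.iloc[n_train:]
```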

To estimate the proportion of metabolic enzymes with KM values measured in vitro for E. coli, we mapped the E. coli KM values downloaded from BRENDA to reactions of the genome-scale metabolic model iML1515 [39], which comprises over 2,700 different reactions. To do this, we extracted all enzyme–substrate combinations from the iML1515 model for which the model annotations listed an EC number for the enzyme and a KEGG Compound ID for the substrate, resulting in 2,656 enzyme–substrate combinations. For 795 of these combinations (i.e., 29.93%), we were able to find a KM value in the BRENDA database.

Download and processing of KM values from Sabio-RK

We downloaded KM values together with the name of the organism, substrate name, EC number, UniProt ID of the enzyme, and PubMed ID from the Sabio-RK database. This resulted in a dataset with 8,375 entries. We processed this dataset in the same way as described above for the BRENDA dataset. We additionally removed all entries with a PubMed ID that was already present in the BRENDA dataset. This resulted in a final dataset with 274 entries, which we used as an additional test set for the final model for KM prediction.

Calculation of predefined molecular fingerprints

We first represented each substrate through 3 different molecular fingerprints (ECFP, RDKit fingerprint, MACCS keys). For every substrate in the final dataset, we downloaded an MDL Molfile with 2D projections of its atoms and bonds from KEGG [26] via the KEGG Compound ID. We then used the package Chem from RDKit [19] with the Molfile as the input to calculate the 2,048-dimensional binary RDKit fingerprints [19], the 166-dimensional binary MACCS keys [20], and the 1,024-dimensional binary ECFPs [18] with a radius of 3.

Architecture of the fully connected neural network with molecular fingerprints

We used an FCNN to predict KM values using only representations of the substrates as input features. We performed a 5-fold cross-validation on the training set for each of the 4 substrate representations (ECFP, RDKit fingerprints, MACCS keys, and task-specific fingerprints) for the hyperparameter optimization. The FCNN consisted of 2 hidden layers, and we used rectified linear units (ReLUs), defined as ReLU(x) = max(x, 0), as activation functions in the hidden layers to introduce nonlinearity. We applied batch normalization [40] after each hidden layer. Additionally, we used L2-regularization in every layer to prevent overfitting. Adding dropout [41] did not improve the model performance. We optimized the model by minimizing the MSE, using stochastic gradient descent with Nesterov momentum as the optimizer. The hyperparameters regularization factor, learning rate, learning rate decay, dimension of hidden layers, batch size, number of training epochs, and momentum were optimized by performing a grid search. We selected the set of hyperparameters with the lowest mean MSE during cross-validation. The results of the cross-validations and the best set of hyperparameters for each fingerprint are displayed in S1 Table.
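
A minimal Keras sketch of this architecture (layer widths, regularization factor, and optimizer settings below are placeholders; the actual values were determined by the grid search):

```python
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_fcnn(input_dim, hidden_dim=256, l2=0.01, lr=0.01, momentum=0.9):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        layers.Dense(hidden_dim, activation="relu",
                     kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),  # batch norm after each hidden layer
        layers.Dense(hidden_dim, activation="relu",
                     kernel_regularizer=regularizers.l2(l2)),
        layers.BatchNormalization(),
        layers.Dense(1),  # 1D output: predicted log10(KM)
    ])
    opt = tf.keras.optimizers.SGD(learning_rate=lr, momentum=momentum,
                                  nesterov=True)
    model.compile(optimizer=opt, loss="mse")
    return model
```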

Fitting of the gradient boosting models with molecular fingerprints

We used gradient boosting models to predict KM values using only representations of the substrates as input features. As for the FCNNs, we performed a 5-fold cross-validation on the training set for each of the 4 substrate representations (ECFP, RDKit fingerprints, MACCS keys, and task-specific fingerprints) for hyperparameter optimization. We fitted the models using the gradient boosting library XGBoost [35] for Python. The hyperparameters regularization coefficients, learning rate, maximal tree depth, maximum delta step, number of training rounds, and minimum child weight were optimized by performing a grid search. We selected the set of hyperparameters with the lowest mean MSE during cross-validation. The results are displayed in S2 Table.

Fitting of the elastic nets with molecular fingerprints

We used elastic nets to predict KM values with representations of the substrates as input features. Elastic nets are linear regression models with additional L1- and L2-penalties for the coefficients of the model in order to apply regularization. We performed 5-fold cross-validations on the training set for all 4 substrate representations (ECFP, RDKit fingerprints, MACCS keys, and task-specific fingerprints) for hyperparameter optimization. During hyperparameter optimization, the coefficients for L1-regularization and L2-regularization were optimized by performing a grid search. The models were fitted using the machine learning library scikit-learn [42] for Python. The results of the hyperparameter optimizations are displayed in S3 Table.
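
A sketch of such a grid search with scikit-learn (the parameter grid is illustrative; note that scikit-learn expresses the 2 penalties through the overall strength alpha and the mixing parameter l1_ratio):

```python
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

param_grid = {"alpha": [0.01, 0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]}
search = GridSearchCV(ElasticNet(max_iter=10000), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
search.fit(X_train, y_train)  # fingerprints and log10(KM) values
```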

Calculation of molecular weight (MW) and the octanol–water partition coefficient (LogP)

We calculated the additional 2 molecular features, MW and LogP, with the package Chem from RDKit [19], with the MDL Molfile of the substrate as the input.

Calculation of the input of the graph neural network

Graphs in GNNs are represented with tensors and matrices. To calculate the input matrices and tensors, we used the package Chem from RDKit [19] with MDL Molfiles of the substrates as inputs to calculate 8 features for every atom v (atomic number, number of bonds, charge, number of hydrogen bonds, mass, aromaticity, hybridization type, chirality) and 4 features for every bond between 2 atoms v and w (bond type, part of ring, stereo configuration, aromaticity). Converting these features (except for atom mass) into one-hot encoded vectors resulted in a feature vector with Fb = 10 dimensions for every bond and in a feature vector with Fa = 32 dimensions for every atom.

For a substrate with N atoms, we stored all bonds in an N×N-dimensional adjacency matrix A, i.e., entry Avw is equal to 1 if there is a bond between the 2 atoms v and w and 0 otherwise. We stored the bond features in an (N×N×Fb)-dimensional tensor E, where entry Evw ∈ R^Fb contains the feature vector of the bond between atom v and atom w. Afterwards, we expanded tensor E by concatenating the feature vector of atom v to the feature vector Evw. If there was no bond between the atoms v and w, i.e., Avw = 0, we set all entries of Evw to zero. We then used the resulting (N×N×(Fa+Fb))-dimensional tensor E, together with the adjacency matrix A, as the input of the GNN.
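
A sketch of this input construction (the helpers atom_features and bond_features are hypothetical and stand for the one-hot encodings described above; "substrate.mol" is a placeholder path):

```python
import numpy as np
from rdkit import Chem

F_A, F_B = 32, 10  # atom and bond feature dimensions
mol = Chem.MolFromMolFile("substrate.mol")
N = mol.GetNumAtoms()

A = np.zeros((N, N))             # adjacency matrix
E = np.zeros((N, N, F_B + F_A))  # expanded edge feature tensor

for bond in mol.GetBonds():
    v, w = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
    for i, j in [(v, w), (w, v)]:
        A[i, j] = 1
        # bond features concatenated with the features of atom i
        E[i, j] = np.concatenate([bond_features(bond),
                                  atom_features(mol.GetAtomWithIdx(i))])
```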

During training, the number of atoms N in a graph has to be restricted to a maximum. We set the maximum to 70, which allowed us to include most of the substrates in the training. After training, the GNN can process substrates of arbitrary sizes.

Architecture of the graph neural network

In addition to the predefined fingerprints, we also used a GNN to represent the substrate molecules. We first give a brief overview over such GNNs, before detailing our analysis.

As in the calculations of the ECFPs, a molecule is represented as a graph by interpreting the atoms as nodes and the chemical bonds as edges. Before a graph is processed by a GNN, feature vectors xv for every node v and feature vectors evw for every edge between 2 nodes v and w are calculated. We calculated 8 features for every atom and 4 features for every bond of a substrate, including mass, charge, and type of atom as well as type of bond (see Methods, “Calculation of the input of the graph neural network”). The initial representations xv = xv(0) and evw = evw(0) are updated iteratively for a predefined number of steps T using the feature vectors of the neighboring nodes and edges (Fig 1B). During this process, the feature vectors are multiplied with matrices with trainable entries, which are fitted during the optimization of the GNN. After k iterations, each node representation xv(k) contains information about its k-hop neighborhood graph. After completing T iteration steps, all node representations are averaged to obtain a single vector x, which represents the entire graph [43,44]. The vector x can then be used as an input of an FCNN to predict properties of the graph (the KM value of the molecule in our case; Fig 1).

The described processing of a graph with a GNN can be divided into 2 phases. The first, message passing phase consists of the iteration process. The second, readout phase comprises the averaging of the node representations and the prediction of the target graph property [43]. During the training, both phases are optimized simultaneously. The vector x can thus be viewed as a task-specific fingerprint of the substrate. Since the model is trained end to end, the GNN learns to store all information necessary to predict KM in this vector [44,45].

We use a variant of GNNs called directed message passing neural network (D-MPNN) [22,46]. In D-MPNNs, every edge is viewed as 2 directed edges pointing in opposite directions. During the iteration process (the message passing phase), feature vectors of nodes and edges are iteratively updated. To update them, feature vectors of neighboring nodes and edges are multiplied by matrices with learnable parameters and the results are summed. Then, an activation function, the ReLU, is applied to the resulting vector to introduce nonlinearities.

We set the number of iterations for updating the feature vector representations to T = 2. The dimension of the feature vectors during the message passing phase is set to D = 50. We apply batch normalization before every activation function. Additionally, we tried applying dropout at the end of the message passing phase, but this did not improve model performance.

After the message passing phase, the readout phase starts, and the feature vectors of all nodes and edges are pooled together using an order-invariant function to obtain a single vector x ∈ R^D, which is a representation of the input. The pooling is done using the element-wise mean of the feature vectors. We then concatenate x with the MW and the LogP, which are global molecular features that are correlated with the KM value [28]. This results in an extended fingerprint x^ = (x, MW, LogP) ∈ R^(D+2).

Afterwards, x^ is used as the input of an FCNN with 2 layers with dimensions 32 and 16, again using ReLUs as activation functions. Batch normalization and L2-regularization are applied to the fully connected layers to avoid overfitting.

During training, the values of the matrices from the message passing phase and the parameters of the FCNN from the readout phase are fitted simultaneously. We trained the model by minimizing the MSE with the optimizer Adadelta [47], using a decaying learning rate (decay rate ρ = 0.95) starting at 0.05, for 50 epochs. We used a batch size of 64, a regularization parameter λ = 0.01 for the parameters in the message passing phase, and a regularization parameter λ = 1 for the parameters in the readout phase. The hyperparameters regularization factor, learning rate, batch size, dimension of feature vectors D, and decay rate were optimized with a 5-fold cross-validation on the training set by performing a grid search. We selected the set of hyperparameters with the lowest mean MSE during cross-validation.
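
A deliberately simplified numpy sketch of the 2 phases (undirected message passing with a single weight matrix and a mean-pooling readout; the actual D-MPNN uses directed edges, edge features, and several trained matrices, so all names and values here are illustrative assumptions):

```python
import numpy as np

def message_passing_step(H, A, W_msg):
    # sum the representations of neighboring nodes (A @ H), then apply a
    # learnable linear transform and a ReLU nonlinearity
    return np.maximum((A @ H) @ W_msg, 0)

def readout(H, mw, logp):
    x = H.mean(axis=0)  # order-invariant mean pooling over all nodes
    return np.concatenate([x, [mw, logp]])  # extended fingerprint x^

N, D, T = 5, 50, 2  # toy molecule with 5 atoms; D and T as in the text
rng = np.random.default_rng(0)
H = rng.random((N, D))                 # initial node representations
A = np.eye(N, k=1) + np.eye(N, k=-1)   # chain-shaped toy adjacency matrix
W_msg = rng.random((D, D)) * 0.1       # stands in for trained parameters

for _ in range(T):                     # T message passing iterations
    H = message_passing_step(H, A, W_msg)

x_hat = readout(H, mw=180.16, logp=-3.2)  # 52-dimensional extended fingerprint
```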

UniRep vectors

To obtain a 1,900-dimensional UniRep vector for every amino acid sequence in the dataset, we used Python code that is a simplified and modified version of the original code from the George Church group [24] and that contains the already trained UniRep model (available from https://github.com/EngqvistLab/UniRep50). The UniRep vectors were calculated from a file in FASTA format [48], which contained all amino acid sequences of our dataset.

Fitting of the gradient boosting model with substrate and enzyme information

We concatenated the task-specific substrate fingerprint x^R52 and the 1,900-dimensional UniRep vector with information about the enzyme’s amino acid sequence. We used the resulting 1,952-dimensional vector as the input for a gradient boosting model for regression, which we trained to predict the KM value. We set the maximal tree depth to 7, minimum child weight to 10.6, maximum delta step to 4.24, the learning rate to 0.012, the regularization coefficient λ to 3.8, and the regularization coefficient α to 3.1. We trained the model for 1,381 iterations. The hyperparameters regularization coefficients, learning rate, maximal tree depth, maximum delta step, number of training iterations, and minimum child weight were optimized by performing a grid search during a 5-fold cross-validation on the training set. We selected the set of hyperparameters with the lowest mean MSE during cross-validation.
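
The stated hyperparameters translate into XGBoost's Python API roughly as follows (X_train and y_train stand for the 1,952-dimensional input vectors and the log10-transformed KM values; the exact training call is a sketch):

```python
import xgboost as xgb

# the hyperparameters stated above, in XGBoost's native parameter names
params = {
    "max_depth": 7,
    "min_child_weight": 10.6,
    "max_delta_step": 4.24,
    "learning_rate": 0.012,
    "reg_lambda": 3.8,
    "reg_alpha": 3.1,
    "objective": "reg:squarederror",
}
dtrain = xgb.DMatrix(X_train, label=y_train)
model = xgb.train(params, dtrain, num_boost_round=1381)
```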

Model comparison

To test if the differences in performance between the models with predefined fingerprints as input and the model with the task-specific fingerprint as input are statistically significant, we applied a one-sided Wilcoxon signed-rank test. The test evaluates the null hypothesis that the median of the absolute errors on the test set for predictions made with the model with task-specific fingerprints, ē1, is greater than or equal to the corresponding median for predictions made with a model with predefined fingerprints, ē2 (H0: ē1 ≥ ē2 versus H1: ē1 < ē2). We could reject H0 at the 5% level for ECFP (p = 0.0022) and MACCS keys (p = 0.030), accepting the alternative hypothesis H1; for the RDKit fingerprints, the difference was only marginally significant (p = 0.0515).

Analogous to the described procedure, we tested if the difference in model performance between the GNNs with and without the 2 additional features, MW and LogP, is statistically significant. We could reject the null hypothesis H0 that the median of the absolute errors on the test set for predictions made with the GNN with MW and LogP is greater than or equal to the corresponding median for predictions made with the GNN without the additional features (p = 0.0454). To execute the tests, we used the Python library SciPy [49].
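
Such a test can be run with SciPy as follows (abs_err_gnn and abs_err_predefined are hypothetical arrays of paired per-sample absolute errors on the test set):

```python
from scipy.stats import wilcoxon

# one-sided test; H1: the GNN-based absolute errors are smaller
stat, p = wilcoxon(abs_err_gnn, abs_err_predefined, alternative="less")
```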

Prediction of KM values for genome-scale models

We downloaded 46 genome-scale models from BiGG [50] and the genome-scale model yeast8 for S. cerevisiae [51]. We extracted all enzymatic reactions from these models and created 1 entry for every substrate in an enzymatic reaction. We extracted the KEGG Compound IDs for every substrate from the annotations of the model, if available; otherwise, we mapped the substrate names to KEGG Compound IDs via synonym lists from KEGG and PubChem in the same way as described for the substrate names in the BRENDA and Sabio-RK datasets. To obtain the enzyme information, we used the gene reaction rules, which contain the names of the involved genes. To obtain the amino acid sequence and the UniProt ID for every enzyme, we used the UniProt mapping service [38]. If multiple enzymes were given for one reaction, we made a prediction for each of them. If an enzyme complex consisted of multiple genes, we attempted to determine which of the genes has binding activity. To do so, we downloaded the GO annotations for all of the associated UniProt IDs via QuickGO [52] and checked, for every UniProt ID, if a binding activity was stated in the annotations. If we found a binding activity for more than 1 UniProt ID or for none of the UniProt IDs in the enzyme complex, we did not use any enzyme information.

If enzyme and substrate information was available, we used the full model to predict KM. If only substrate or only enzyme information was available, we used a gradient boosting model that only uses substrate or enzyme information as its input. If neither substrate nor enzyme information were available, we used the geometric mean over all KM values in the BRENDA dataset as a prediction.

To train the gradient boosting model to predict KM values, we used the whole BRENDA dataset for model training, including the test set.

Supporting information

S1 Table. Results of the hyperparameter optimizations of fully connected neural networks (FCNNs), which were trained to predict KM from substrate information only.

The hyperparameter optimizations were performed for each of 4 different fingerprints of the substrates with a 5-fold cross-validation on the training set.

(TIF)

S2 Table. Results of the hyperparameter optimizations of gradient boosting models, which were trained to predict KM from substrate information only.

The hyperparameter optimizations were performed for each of 4 different fingerprints of the substrates with a 5-fold cross-validation on the training set.

(TIF)

S3 Table. Results of the hyperparameter optimizations of elastic nets, which were trained to predict KM from substrate information only.

The hyperparameter optimizations were performed for each of 4 different fingerprints of the substrates with a 5-fold cross-validation on the training set.

(TIF)

S4 Table. Result of the hyperparameter optimization of a fully connected neural network (FCNN), which was trained to predict KM from substrate and enzyme information (GNN fingerprint and UniRep vector).

The hyperparameter optimization was performed with a 5-fold cross-validation on the training set.

(TIF)

S5 Table. Result of the hyperparameter optimization of the gradient boosting model, which was trained to predict KM from substrate and enzyme information (GNN fingerprint and UniRep vector).

The hyperparameter optimization was performed with a 5-fold cross-validation on the training set.

(TIF)

S6 Table. Result of the hyperparameter optimization of an elastic net, which was trained to predict KM from substrate and enzyme information (GNN fingerprint and UniRep vector).

The hyperparameter optimization was performed with a 5-fold cross-validation on the training set.

(TIF)

S1 Fig. Scatter plot of log10-transformed KM values predicted with the gradient boosting model with substrate and enzyme information as inputs versus the experimental values downloaded from BRENDA and Sabio-RK.

The scatter plot displays all data points of the Sabio-RK test set (orange) and all data points from the BRENDA test set with an EC number–substrate combination not present in the training set (violet). The data underlying the graphs shown in this figure can be found at https://github.com/AlexanderKroll/KM_prediction/tree/master/figures_data.

(TIF)

S2 Fig

(a) Scatter plot of measured KM values and the absolute prediction errors of the BRENDA test data points for which neither the substrate nor the enzyme occurs in the training set. (b) Histogram with the distribution of the KM values in the training set. The data underlying the graphs shown in this figure can be found at https://github.com/AlexanderKroll/KM_prediction/tree/master/figures_data.

(TIF)

S1 Data. Dataset in xlsx format containing complete KM predictions for 47 genome-scale metabolic models, including those for Homo sapiens, Mus musculus, Saccharomyces cerevisiae, and Escherichia coli.

(XLSX)

Acknowledgments

We thank Hugo Dourado, Markus Kollmann, and Kyra Mooren for helpful discussions. Computational support and infrastructure were provided by the Centre for Information and Media Technology (ZIM) at the University of Düsseldorf (Germany).

Abbreviations

D-MPNN, directed message passing neural network; DTBA, drug–target binding affinity; EC, Enzyme Commission; ECFP, extended connectivity fingerprint; FCNN, fully connected neural network; GNN, graph neural network; LogP, octanol–water partition coefficient; MRPE, mean relative prediction error; MSE, mean squared error; MW, molecular weight; QSAR, quantitative structure–activity relationship; ReLU, rectified linear unit.

Data Availability

All datasets generated and the Python code used to produce the results (in Jupyter notebooks) are available from https://github.com/AlexanderKroll/KM_prediction.

Funding Statement

This work was funded through grants to M.J.L. by the Volkswagenstiftung (in the "Life?" program) and by the Deutsche Forschungsgemeinschaft (CRC 1310 and, under Germany’s Excellence Strategy, EXC 2048/1, Project ID: 390686111). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Davidi D, Noor E, Liebermeister W, Bar-Even A, Flamholz A, Tummler K, et al. Global characterization of in vivo enzyme catalytic rates and their correspondence to in vitro kcat measurements. Proc Natl Acad Sci. 2016;113(12):3401–6. doi: 10.1073/pnas.1514240113
2. Khodayari A, Maranas CD. A genome-scale Escherichia coli kinetic metabolic model k-ecoli457 satisfying flux data for multiple mutant strains. Nat Commun. 2016;7(1):13806. doi: 10.1038/ncomms13806
3. Saa PA, Nielsen LK. Formulation, construction and analysis of kinetic models of metabolism: A review of modelling frameworks. Biotechnol Adv. 2017;35(8):981–1003. doi: 10.1016/j.biotechadv.2017.09.005
4. Strutz J, Martin J, Greene J, Broadbelt L, Tyo K. Metabolic kinetic modeling provides insight into complex biological questions, but hurdles remain. Curr Opin Biotechnol. 2019;59:24–30. doi: 10.1016/j.copbio.2019.02.005
5. Heckmann D, Lloyd CJ, Mih N, Ha Y, Zielinski DC, Haiman ZB, et al. Machine learning applied to enzyme turnover numbers reveals protein structural correlates and improves metabolic models. Nat Commun. 2018;9(1):5252. doi: 10.1038/s41467-018-07652-6
6. Borger S, Liebermeister W, Klipp E. Prediction of enzyme kinetic parameters based on statistical learning. Genome Inform. 2006;17(1):80–7.
7. Yan SM, Shi DQ, Nong H, Wu G. Predicting KM values of beta-glucosidases using cellobiose as substrate. Interdiscip Sci. 2012;4(1):46–53.
8. Thafar M, Raies AB, Albaradei S, Essack M, Bajic VB. Comparison study of computational prediction tools for drug-target binding affinities. Front Chem. 2019;7:782. doi: 10.3389/fchem.2019.00782
9. Pahikkala T, Airola A, Pietilä S, Shakyawar S, Szwajda A, Tang J, et al. Toward more realistic drug–target interaction predictions. Brief Bioinform. 2015;16(2):325–37. doi: 10.1093/bib/bbu010
10. He T, Heidemeyer M, Ban F, Cherkasov A, Ester M. SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. J Cheminform. 2017;9(1):1–14. doi: 10.1186/s13321-017-0209-z
11. Jiménez J, Skalic M, Martinez-Rosell G, De Fabritiis G. KDEEP: protein–ligand absolute binding affinity prediction via 3D-convolutional neural networks. J Chem Inf Model. 2018;58(2):287–96. doi: 10.1021/acs.jcim.7b00650
12. Trott O, Olson AJ. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J Comput Chem. 2010;31(2):455–61. doi: 10.1002/jcc.21334
13. Öztürk H, Ozkirimli E, Özgür A. WideDTA: prediction of drug-target binding affinity [preprint]. arXiv. 2019:1902.04166.
14. Feng Q, Dueva E, Cherkasov A, Ester M. PADME: A deep learning-based framework for drug-target interaction prediction [preprint]. arXiv. 2018:1807.09741.
15. Karimi M, Wu D, Wang Z, Shen Y. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics. 2019;35(18):3329–38. doi: 10.1093/bioinformatics/btz111
16. Öztürk H, Özgür A, Ozkirimli E. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics. 2018;34(17):i821–9. doi: 10.1093/bioinformatics/bty593
17. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci. 1988;28(1):31–6.
18. Rogers D, Hahn M. Extended-connectivity fingerprints. J Chem Inf Model. 2010;50(5):742–54. doi: 10.1021/ci100050t
19. Landrum G. RDKit: Open-source cheminformatics. 2006. Available from: http://www.rdkit.org.
20. Durant JL, Leland BA, Henry DR, Nourse JG. Reoptimization of MDL keys for use in drug discovery. J Chem Inf Comput Sci. 2002;42(6):1273–80. doi: 10.1021/ci010132r
21. Zhou J, Cui G, Zhang Z, Yang C, Liu Z, Wang L, et al. Graph neural networks: A review of methods and applications [preprint]. arXiv. 2018:1812.08434.
22. Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, et al. Analyzing learned molecular representations for property prediction. J Chem Inf Model. 2019;59(8):3370–88. doi: 10.1021/acs.jcim.9b00237
23. Furnham N, Holliday GL, de Beer TA, Jacobsen JO, Pearson WR, Thornton JM. The Catalytic Site Atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Res. 2014;42(D1):D485–9. doi: 10.1093/nar/gkt1243
24. Alley EC, Khimulya G, Biswas S, AlQuraishi M, Church GM. Unified rational protein engineering with sequence-based deep representation learning. Nat Methods. 2019;16(12):1315–22. doi: 10.1038/s41592-019-0598-1
25. Jeske L, Placzek S, Schomburg I, Chang A, Schomburg D. BRENDA in 2019: a European ELIXIR core data resource. Nucleic Acids Res. 2019;47(D1):D542–9. doi: 10.1093/nar/gky1048
26. Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30. doi: 10.1093/nar/28.1.27
27. Dalby A, Nourse JG, Hounshell WD, Gushurst AK, Grier DL, Leland BA, et al. Description of several chemical structure file formats used by computer programs developed at Molecular Design Limited. J Chem Inf Comput Sci. 1992;32(3):244–55.
28. Bar-Even A, Noor E, Savir Y, Liebermeister W, Davidi D, Tawfik DS, et al. The moderately efficient enzyme: evolutionary and physicochemical trends shaping enzyme parameters. Biochemistry. 2011;50(21):4402–10. doi: 10.1021/bi2002289
29. Pratim Roy P, Paul S, Mitra I, Roy K. On two novel parameters for validation of predictive QSAR models. Molecules. 2009;14(5):1660–701. doi: 10.3390/molecules14051660
30. Roy K, Chakraborty P, Mitra I, Ojha PK, Kar S, Das RN. Some case studies on application of "rm2" metrics for judging quality of quantitative structure–activity relationship predictions: emphasis on scaling of response data. J Comput Chem. 2013;34(12):1071–82. doi: 10.1002/jcc.23231
31. Wittig U, Kania R, Golebiewski M, Rey M, Shi L, Jong L, et al. SABIO-RK—database for biochemical reaction kinetics. Nucleic Acids Res. 2012;40(D1):D790–6. doi: 10.1093/nar/gkr1046
32. Van Rossum G, Drake FL. Python 3 Reference Manual. Scotts Valley, CA: CreateSpace; 2009.
33. Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available from: https://www.tensorflow.org/.
34. Chollet F. Keras. 2015. Available from: https://keras.io.
35. Chen T, Guestrin C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 785–794.
36. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, et al. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019;47(D1):D1102–9. doi: 10.1093/nar/gky1033
37. López-Ibáñez J, Pazos F, Chagoyen M. MBROLE 2.0—functional enrichment of chemical compounds. Nucleic Acids Res. 2016;44(W1):W201–4. doi: 10.1093/nar/gkw253
38. The UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49(D1):D480–9. doi: 10.1093/nar/gkaa1100
39. Monk JM, Lloyd CJ, Brunk E, Mih N, Sastry A, King Z, et al. iML1515, a knowledgebase that computes Escherichia coli traits. Nat Biotechnol. 2017;35(10):904–8. doi: 10.1038/nbt.3956
40. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift [preprint]. arXiv. 2015:1502.03167.
41. Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.
42. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J Mach Learn Res. 2011;12:2825–30.
43. Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE. Neural message passing for quantum chemistry [preprint]. arXiv. 2017:1704.01212.
44. Kearnes S, McCloskey K, Berndl M, Pande V, Riley P. Molecular graph convolutions: moving beyond fingerprints. J Comput Aided Mol Des. 2016;30(8):595–608. doi: 10.1007/s10822-016-9938-8
45. Duvenaud DK, Maclaurin D, Iparraguirre J, Bombarell R, Hirzel T, Aspuru-Guzik A, et al. Convolutional networks on graphs for learning molecular fingerprints. Adv Neural Inf Process Syst. 2015:2224–32.
46. Dai H, Dai B, Song L. Discriminative embeddings of latent variable models for structured data. International Conference on Machine Learning; 2016. p. 2702–2711.
47. Zeiler MD. ADADELTA: an adaptive learning rate method [preprint]. arXiv. 2012:1212.5701.
48. Lipman DJ, Pearson WR. Rapid and sensitive protein similarity searches. Science. 1985;227(4693):1435–41. doi: 10.1126/science.2983426
49. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nat Methods. 2020;17:261–72. doi: 10.1038/s41592-019-0686-2
50. Norsigian CJ, Pusarla N, McConn JL, Yurkovich JT, Dräger A, Palsson BO, et al. BiGG Models 2020: multi-strain genome-scale models and expansion across the phylogenetic tree. Nucleic Acids Res. 2020;48(D1):D402–6. doi: 10.1093/nar/gkz1054
51. Lu H, Li F, Sánchez BJ, Zhu Z, Li G, Domenzain I, et al. A consensus S. cerevisiae metabolic model Yeast8 and its ecosystem for comprehensively probing cellular metabolism. Nat Commun. 2019;10(1):1–13. doi: 10.1038/s41467-018-07882-8
52. Binns D, Dimmer E, Huntley R, Barrell D, O'Donovan C, Apweiler R. QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics. 2009;25(22):3045–6. doi: 10.1093/bioinformatics/btp536

Decision Letter 0

Roland G Roberts

17 Dec 2020

Dear Dr Lercher,

Thank you for submitting your manuscript entitled "Prediction of Michaelis constants from structural features using deep learning" for consideration as a Short Report by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.

IMPORTANT: We note that you submitted your paper as a Short Report, but we think it would be better considered as a Methods and Resources paper. No re-formatting is needed, but please could you change the article type to "Methods and Resources" when you upload your metadata (see next paragraph)?

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Dec 21 2020 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor

PLOS Biology

Decision Letter 1

Roland G Roberts

10 Mar 2021

Dear Dr Lercher,

Thank you very much for submitting your manuscript "Prediction of Michaelis constants from structural features using deep learning" for consideration as a Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by five independent reviewers. I must first apologise for the extraordinary time that it has taken us to return a decision to you, and for the unusual number of reviewers (we usually aim for three or four). These two issues are both related to difficulties that we encountered in recruiting reviewers during these difficult times.

However, we think that despite the five reviews, there is considerable overlap between reviewers #1 and #2 and between reviewers #3, #4 and #5, so we think that the revision task will not be proportionately arduous. After discussion with the Academic Editor, we provide the following guidance:

IMPORTANT: While we recognise the reasons behind the negative assessments of reviewers #1 and #2, we also see the potential that is perceived by reviewer #3, #4 and #5. On that basis, we have decided to invite a revision. However, you should ensure that you address some of the points raised by reviewers #1 and #2 by clarifying who/what the target market for this method/resource is, and why the large-scale but approximate prediction of Km is useful; you should also ensure that your coverage of the prior literature is balanced.

In terms of reviewers #3, #4 and #5, we see significant overlap, with the requests largely falling into the following categories:

a) perform more extensive cross-validation and sensitivity testing (#4, #5).

b) perform additional analyses (domain order for #3, use multiple empirical values rather than means for #4, different database and "fingerprint" method for #5).

c) provide a handy user interface (updatable website for #3 or Shiny app for #5).

d) make the article more accessible to a broader readership (#5).

We feel that categories "a" and "d" are essential, "c" is advisable (to maximise utility to the community, and uptake), and you should either attempt or reasonably rebut the analyses requested under "b" (most of which sound sensible to us).

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 3 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

*****************************************************

REVIEWERS' COMMENTS:

Reviewer #1:

This paper describes several machine learning approaches to predict Km values. The most successful of these has an R^2 value of 0.42 between predicted and measured values for a test dataset. The predictions do not take into account experimental conditions (pH, temperature, ionic strength). They also assume that each enzyme has just one Km, but allosteric enzymes, which are very common, have multiple Km values (1 per site), although a midpoint concentration corresponding to 0.5 Vmax can still be defined. Consequently, I'm skeptical whether the method is of value for researchers studying a specific system (e.g., as in drug development), but it could be helpful for large-scale modelling of metabolic fluxes, for example, when the details might matter less.

Reviewer #2:

In the manuscript "Prediction of Michaelis constants from structural features using deep learning" by Kroll et al., a graph-based method of predicting Km for a ligand is illustrated, utilizing both chemical descriptors and a binary vector of protein domains. Broadly, the work is executed well, but the conclusions are significantly overstated. As the authors state around line 70, "a machine learning model trained on domain structures will not be able to distinguish [specific molecule binding affinity]; the model will assign the same predictions for both substrates to each of the two proteins". The authors then go and make a model based on domains. While I agree this model will provide a generic estimate for a likely biologically relevant ligand based on the chemical structure (which is understandable given the dominance of chemical structure in feature importance), it is not going to be predictive of a molecule's affinity to the protein, as it will not account for the effects mutations have within a binding site or for tertiary effects. It will simply predict the average likely Km level for the class of proteins, which can generally be extracted from a simple look-up table in which the average Km for a domain is annotated (or a slightly more elaborate version in which MW, LogP, or the like are incorporated). Furthermore, the authors seem to have either ignored or been unaware of the decades of research in the field of affinity prediction, which the following review (I am not an author or affiliated with any of the authors) nicely summarizes:

Comparison Study of Computational Prediction Tools for Drug-Target Binding Affinities

https://www.frontiersin.org/articles/10.3389/fchem.2019.00782/full

A simple Google search for "predicting ligand affinity", "predicting small molecule binding", or related terms identifies hundreds of manuscripts using ML to predict Ki/Kd/IC50, which in essence are equivalent to Km, the pseudo-binding constant of an enzyme.

Therefore, while this article is interesting and contributes to the field, it should do a better job of placing the advance within the field and make more accurate claims about its utility and novelty; it is likely more appropriate for a trade journal. I do apologize if this review seems overly harsh; however, I was truly excited when I read the title and abstract and found myself deeply frustrated by the time I had concluded reading the article.

Reviewer #3:

In this manuscript, the authors aimed to predict the values of Km from the structures of the substrate and the enzyme. They represented the substrate structure with a graph neural network, accounting for comprehensive information on the atoms and their interactions with neighbors. For the enzyme structure, however, they used a much simpler representation: which functional domains exist in the enzyme. Using a training dataset from the BRENDA database, the authors were able to make predictions of Km values. The results were better than those of a previous work (Borger et al. 2006).

Given the huge variation of Km values in the BRENDA database - for a typical enzyme in BRENDA, the Km value deviates 3.7-fold from the mean of all the values for that enzyme - the predicted values here on average deviate 4.3-fold and thus seem valuable. Furthermore, with the readily obtainable results from the computational model and the expandability of the model, I think this manuscript is an important contribution to understanding enzymatics at the systems level. I do, however, have some issues that I would like the authors to address.

1. The authors present the model as the main result. However, to most biologists, including me, the predicted values of Km are probably most valuable. I thus suggest the authors apply the model to all possible reactions in popular organisms such as human and mouse. Ideally, the results could be made available in the form of a website (which can be updated as the model is updated in the future). At least an Excel file containing the numbers should be included in the supplementary materials.

2. The representation of the enzyme structure seems over-simplified. The authors discussed a number of enzyme properties that could be considered when they become more widely available in the future. The order of the domains (along the protein sequence), however, is available. I wonder if this structurally important information could be incorporated into the model?

3. Can the authors discuss the applicability of the model to the prediction of Kcat?

4. The origins of the Y-axes of the MSE and R2 figures seem to have been chosen arbitrarily. I suggest setting all of them to 0.

Reviewer #4:

[please see attachment for correctly formatted version]

The manuscript by Kroll et al. uses a combination of machine and deep learning to predict Km values based on available information on substrate fingerprints. Currently, machine and deep learning are very “hot” topics in all fields. Their contribution to parameter prediction is of particular value to the systems biology community as it offers significant assistance in model parametrization.

The framework seems to be well researched and structured and the presentation of the paper is very good. The results also look very promising for Km prediction and it would indeed be very interesting to see results for Kcat parameters in the future. Also the github page with the model information is very practical and very well detailed.

A few points should be addressed before acceptance.

Major issues

1) The test procedure seems a little basic, as there is only one test dataset. A full k-fold cross-validation would be better.

2) It would also be nice to get some empirical investigation as to how robust the results are to different choices of train/validation/test data. It would be interesting to see how much the hyperparameters change, whether the errors are stable etc.

Both of these issues could be addressed fairly simply by re-running the analysis on different splits.

3) The choice about processing the input data (averaging log Km values for enzyme/substrate combinations with the same functional domain) isn't well argued for and may have biased the results somewhat. As mentioned in lines 83–86: "Since we are only interested in one KM value for every substrate in combination with a group of enzymes with the same functional domain content (see below), we took the geometric mean if multiple values existed for the same combination of substrate and enzyme domain content."

Not sure if this is a good idea - why throw away information like this? The mean is only a good summary statistic if the error distribution is approximately normal (which probably isn't the case here, given that BRENDA data tend to be quirky), and even in that case, the fact that certain data points represent more observations should be taken into account. Based on the arguments in the paper, I think it would have been better to omit this step and allow multiple measurements of the same enzyme/substrate combination.

4) *** line 375: “We tested the null hypothesis that the median of the absolute errors on the test set...”

Why the median absolute error and not the MSE?

Minor issues

It would be better to package the dataset that was used for the analysis in the paper instead of downloading it from BRENDA for reproducibility purposes. As BRENDA changes periodically, some of the information may be different and it might not be possible to obtain the same results.

Reviewer #5:

In this manuscript, Kroll et al. describe and train a model to predict Km using machine learning. The authors are correct in presenting and highlighting the problems around the availability of Km values and the challenges of calculating Km comprehensively, consistently, and at scale. I also agree that the correct prediction of Km at scale would enormously benefit other fields, so the impact could be huge. I do, however, have significant methodological questions, and I am not fully convinced that PLoS Biology is the ideal home for this manuscript. I think the manuscript in its present form is very technical, and I am not sure it is of interest to the broad readership of PLoS Biology. My advice would be to perform major changes to adapt it to a wider audience or to submit to a more technical journal such as PLoS Computational Biology or J Chem Inf Model. The article is well written and relatively easy to follow. Below I detail my comments and questions:

Regarding the training/test data, the authors use BRENDA, which is quite a standard database. I don't, however, see the data being made available as supplementary material. I understand that there is a script on GitHub to do the preprocessing, but since BRENDA can be updated, I would make the preprocessed data available as supplementary material for better reproducibility and future comparisons. This would also be of help to the non-computational readers of PLoS Biology. I was a bit surprised to see that the authors don't also use the SABIO-RK database, which seems quite a standard in the field. On a quick search, the two don't seem to include exactly the same information, so I wonder why the authors did not use it for model building or testing (see comment below). Finally, I was a bit surprised that the authors don't do 5- or 10-fold cross-validation, although I have more questions regarding model testing below.

I think it was a wise decision to use parameters that are broadly applicable, to aim for a method that works for any enzyme of any species; this is a strength compared to other models that use information that is not broadly available. I liked the idea of using ECFPs, and RDKit is indeed a very good standard in chemoinformatics. However, there are many different types of fingerprints, so it seems a wasted opportunity not to look a little at other fingerprint approaches. I would use at least one other fingerprint to compare and see if there are any differences. In my opinion, 2D-pharmacophore fingerprints have shown quite robust results, and I think it would be interesting to see if better results are achieved... But if the authors would prefer another fingerprint to compare, I would have no objections; I just think it would be nice to see at least one other fingerprint tested.

Regarding model building, the authors jump straight to using FCNNs. I think it would be very interesting to see how NNs compare to simpler models. So I would start by implementing a simpler model before jumping to neural networks. Maybe linear modelling (e.g., an elastic net) would be a good place to start, to be able to assess the added value of using more complex models.

The description of the results on graph neural networks was a bit technical. I wonder if it would be too much for a wet-lab reader of PLoS Biology... I would consider simplifying it and/or moving part of it to the results and in the main text focus more on the outcomes of the model and its applicability.

I liked how information was subsequently added on molecular weight, partition coefficient and functional domain to assess the effects in the model.

The authors then describe how to predict Km for substrates and enzymes not present in the training data. Here I get slightly confused, and this highlights the importance of providing the data used for model building and testing as supplementary material (see above). The numbers discussed seem really low. Only 3 homologous enzymes with the same substrate were present in the training and test sets? And only 9 where the enzyme and substrate are not in the training set? Then additional training sets were created, but again 45 points seems low to me, given there are 5,000 data points. So I think this needs careful design from the outset and a clearer explanation. I think the authors should do x-fold cross-validation of all the models, and these test/training sets should be designed to prevent the same enzymes/homologues and substrates from being present in both training and test sets. That would be the fair approach to testing it, in my opinion.

I appreciate the authors discuss the MSE differences with previous approaches despite not being possible to do a comprehensive comparison. I wonder why they don't discuss R2 too and there are a couple of previous works that are also not included in this discussion. I think this part should be elaborated more.

In an ideal world, I would like to see a prospective experimental validation at the end of the paper. I think this is important, particularly if the aim is a general biology journal such as PLoS Biology as opposed to a computational journal. I appreciate this may be difficult, as measuring Km is not simple and the authors are fully computational. But I wonder if the authors could alleviate this somehow. Maybe by using the SABIO-RK data that is not covered by BRENDA as an unseen dataset? Or maybe by curating some recent Km literature values that have not made it into BRENDA yet? I think this would be important.

Finally, I think there are two aspects that would make this far more appealing for a broad biology journal. These would not be necessary for a computational journal, but I think they are for PLoS Biology. The first one is data availability. I know the code is on GitHub, but experimental scientists who could benefit from your model can't access this easily. I think you should provide the Km values predicted with your best model for all enzymes. Of course, this is not easy to do, but maybe you could calculate them for all the enzymes in the most used models and release these data. Alternatively, my preferred option would be that you somehow enable wet-lab scientists to calculate the Km themselves using your method with an easy online application, for example by deploying a Shiny app. The second aspect is that the paper is very dry in its current form. Could you maybe end with a simple example of use? Maybe an example where the Km is not known for an enzyme, and this makes it very challenging to generate a simple model of cellular metabolism; but when you estimate the Km using your best model, you can easily construct a simple metabolism model for this enzyme that explains experimental observations. This is just a suggestion, but ending the paper with an example of use would make it far less dry and more palatable for non-expert experimental scientists.

In summary, I have enjoyed reading the manuscript, and the article is clearly a step forward in Km prediction. I therefore strongly encourage the authors to perform the major revisions suggested or transfer the paper to a computational journal with fewer revisions needed.

Attachment

Submitted filename: Review Plos 2-2021.docx

Decision Letter 2

Roland G Roberts

9 Aug 2021

Dear Dr Lercher,

Thank you for submitting your revised Methods and Resources entitled "Prediction of Michaelis constants from structural features using deep learning" for publication in PLOS Biology. I've now obtained advice from three of the original reviewers and have discussed their comments with the Academic Editor. 

Based on the reviews, we will probably accept this manuscript for publication, provided you satisfactorily address the following data and other policy-related requests.

IMPORTANT:

a) Please make your title more appealing by including an active verb and alluding to the scale of prediction that your approach allows. We suggest something like "Deep learning allows genome-scale prediction of Michaelis constants from structural features"

b) Please address my Data Policy requests below; specifically, please supply numerical values underlying Figs 2AB, 3AB, 4ABC, 5AB, S1, and cite the location of the data clearly in each relevant Fig legend.

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

-  a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

-  a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

-  a track-changes file indicating any changes that you have made to the manuscript. 

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information  

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797 

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication. 

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 2AB, 3AB, 4ABC, 5AB, S1. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

DATA NOT SHOWN?

- Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please either remove mention of these data or provide figures presenting the results and the data underlying the figure(s).

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #3:

The authors have satisfactorily addressed all my concerns.

Reviewer #4:

All points were answered successfully and I have no further comments.

Reviewer #5:

I thank the authors for their hard work addressing nearly all the comments from the 5(!) reviewers. I think they have done a good job; the paper is now much stronger, and I have enjoyed (and learned!) reading the revised version of the manuscript. I therefore recommend this paper for publication. I would finally encourage the authors to apply the model that they have developed, maybe by starting collaborations with metabolite flux scientists, whether dry or wet lab. We need good models to get out there and be used! Congratulations.

Decision Letter 3

Roland G Roberts

26 Aug 2021

Dear Dr Lercher,

On behalf of my colleagues and the Academic Editor, Jason Locasale, I'm pleased to say that we can in principle offer to publish your Methods and Resources paper "Deep learning allows genome-scale prediction of Michaelis constants from structural features" in PLOS Biology, provided you address any remaining formatting and reporting issues. These will be detailed in an email that will follow this letter and that you will usually receive within 2-3 business days, during which time no action is required from you. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have made the required changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely, 

Roli Roberts

Roland G Roberts, PhD 

Senior Editor 

PLOS Biology

rroberts@plos.org
