Version Changes
Revised. Amendments from Version 1
We thank the reviewers for their comments and have amended the manuscript accordingly. This included suggestions for the optimal spectral resolution required for the presented algorithm to be able to accurately determine multiplet components and how this changes with increasing amounts of enhancement of multiplet splittings due to J-coupling. We also added an analysis of a fully sampled HSQC spectrum from the qtmge dataset to the extended data repository, tables S2 and S5. We clarified that there are three HSQC spectra overlaid in figure 1. We also commented on reduction of data acquisition times through use of splitting enhancement due to J-coupling. We added a mathematical expression to clarify our vector normalisation. We also added a comment on how to minimise pH variations between different samples during sample preparation. Finally, we corrected several typing errors.
Abstract
Background: Metabolism is essential for cell survival and proliferation. A deep understanding of the metabolic network and its regulatory processes is often vital to understand and overcome disease. Stable isotope tracing of metabolism using nuclear magnetic resonance (NMR) and mass spectrometry (MS) is a powerful tool to derive mechanistic information of metabolic network activity. However, to retrieve meaningful information, automated tools are urgently needed to analyse these complex spectra and eliminate the bias introduced by manual analysis. Here,
we present a data-driven algorithm to automatically annotate and analyse NMR signal multiplets in 2D- 1H, 13C-HSQC NMR spectra arising from 13C - 13C scalar couplings. The algorithm minimises the need for user input to guide the analysis of 2D- 1H, 13C-HSQC NMR spectra by performing automated peak picking and multiplet analysis. This enables non-NMR specialists to use this technology. The algorithm has been integrated into the existing MetaboLab software package.
Methods: To evaluate the algorithm performance two criteria are tested: is the peak correctly annotated and secondly how confident is the algorithm with its analysis. For the latter a coefficient of determination is introduced. Three datasets were used for testing. The first was to test reproducibility with three biological replicates, the second tested the robustness of the algorithm for different amounts of scaling of the apparent J-coupling constants and the third focused on different sampling amounts.
Results: The algorithm annotated overall >90% of NMR signals correctly with average coefficient of determination ρ of 94.06 ± 5.08%, 95.47 ± 7.20% and 80.47 ± 20.98% respectively.
Conclusions: Our results indicate that the proposed algorithm accurately identifies and analyses NMR signal multiplets in ultra-high resolution 2D- 1H, 13C-HSQC NMR spectra. It is robust to signal splitting enhancement and up to 25% of non-uniform sampling.
Keywords: Metabolism, metabolic tracing, NMR spectroscopy, independent component analysis, machine learning, automated
Introduction
An efficient metabolism is the basis for the survival of any living organism 1, 2 . Inside each cell, the network of metabolic reactions is organised into discrete but intrinsically linked metabolic pathways. Inside the cell, these pathways then function within different cellular compartments. Just as healthy metabolism requires the efficient functioning of the metabolic network, perturbations within one compartment can result in disease by spreading through and altering the entire metabolic network 3, 4 . This is a direct consequence of inherent redundancy of the metabolic network, which means that many metabolites can be anabolised or catabolised by several different metabolic pathways.
Metabolomics studies allow the measurement of the concentrations of metabolites present at a given time but provide no direct information as to the metabolic source of these metabolites. Feeding metabolic precursors enriched with low-abundance stable isotopes such as 13C, for example [1,2- 13C] glucose or [U- 13C] glutamine, enables the tracing of specific metabolic pathways usage. The contribution of the different metabolic pathways leading to the production of this metabolite can be untangled through the analysis of the 13C distribution within the metabolite 5– 7 .
While technologies such as gas chromatography–mass spectrometry (GC-MS) are highly sensitive but low-resolution, nuclear magnetic resonance (NMR) spectroscopy is a high-resolution technique, albeit with relatively low sensitivity. In our previous work, we developed and published a method for the combined analysis of NMR spectroscopic and MS data that harnesses the strengths of both technologies to produce highly-resolved metabolism information in the form of metabolite isotopomers 8 . The 2D- 1H, 13C-HSQC NMR spectrum contains information about the relative incorporation of 13C labelled nuclei into metabolites. This information is found in the multiplets of each resonance, with the pattern of the multiplet being due to the presence or absence of 13C nuclei in positions adjacent to the detected nuclei. The extent of peak splitting due to isotope incorporation (known as J-coupling) is dependent on the nuclei involved and the local chemical structure. Each multiplet is therefore virtually unique for a specific resonance of each metabolite. As many different combinations of labelling patterns may be present in each sample, the overall peaks pattern can be complex but can always be derived as a linear superposition of the individual multiplet components.
Signal annotation and multiplet analysis are the biggest challenges in the analysis of such 2D- 1H, 13C-HSQC NMR spectra, as precise resonance positions of metabolite resonances can change as a result of minor pH or sample composition changes. This can provide significant challenges in crowded areas of the NMR spectrum, especially for multiplet signals with similar, albeit distinct 13C/ 13C-Jconstants. In addition, the overall shape of a specific multiplet differs quite drastically depending on the choice of tracer (e.g. [U- 13C] vs [1,2- 13C] glucose), as well as the specific activity of metabolic pathways. An example of these challenges is depicted in Figure 1, where the signal for HC(2) of lactate is shown for three different samples of different tracers. In addition, the chemical shifts of the three lactate signals are different for the three samples because of differences in pH. It becomes quite clear why the analysis of such 2D- 1H, 13C-HSQC NMR spectra requires an in-depth knowledge and expertise in 2D- 1H, 13C-HSQC NMR spectroscopy of complex mixtures in order to perform an efficient and an unbiased analysis of the data. Automated data analysis of 2D- 1H, 13C-HSQC NMR tracing data provides an avenue free of operator bias and allows researchers with limited NMR and data analysis expertise to use this powerful analytical technique with ease.
Figure 1.
Overlay of the HC(2) resonance of lactate in 2D- 1H, 13C-HSQC NMR spectra from three different samples demonstrating the variation that can occur between samples even when observing the 13C multiplet of the same carbon nucleus (panel A). For the sample producing the red spectrum [1,2- 13C] glucose was used, for the green spectrum [U- 13C] glucose and for the sample of the blue spectrum, a mixture of [1,2- 13C] glucose and [U- 13C] glutamine were used as metabolic tracer. 1D 13C slices, extracted from the middle of each 2D spectrum are also shown to clearly demonstrate the differences in the appearance of the multiplets resulting from the different choice of tracers (panels B, C & D).
The necessity for automated algorithms has been recognised in other NMR spectroscopy areas such as automatic assignment and data analysis for metabolomics applications 9, 10 and machine learning applications for HSQC data analysis for drug discovery 11, 12 . However, no such approaches have been implemented so far for the automated analysis of 13C -multiplets present in 2D- 1H, 13C-HSQC NMR spectra used for metabolic tracing NMR experiments.
Here, we present a machine-learning based algorithm, which automatically annotates and analyses multiplets in 2D- 1H, 13C-HSQC NMR. After signal annotation and multiplet decomposition, the algorithm assesses quality of data analysis, which will inform the user about how much trust they can put into the result. The software developed here is fully integrated in the MetaboLab software package 13 and can be run as a fully automated procedure to be used by researchers with biological expertise without the necessity of expert NMR or data analysis knowledge.
Algorithm development
Outline
Mathematically, the identification of a metabolite signal presents itself as a signal unmixing problem. The aim is the decomposition of the NMR spectrum into statistically independent components by using a data-driven approach and without prior knowledge on how the multiplet components mix in the 2D spectrum. A 2D- 1H, 13C-HSQC NMR spectrum is a linear superposition of all resonances from each 13CH pair in each metabolite within the sample. The NMR multiplets of each metabolite are independent of the resonances of all other metabolites. Thus, it is theoretically possible to disentangle major spectrum components in a data-driven manner. The problem can be formulated as an inverse modelling problem, where multivariate observations (e.g. the 2D- 1H, 13C-HSQC NMR spectrum) relate to lower-dimensional vectors of statistically independent variables (e.g. the contribution of each metabolite to the 2D- 1H, 13C-HSQC NMR spectrum), using a linear model. Once a metabolite multiplet has been identified and localised within the spectrum, the contribution of each multiplet component to the entire signal must be quantified. The algorithm presented here uses linear least square regression (LS) with parameters constrained to non-negative values only 14 to estimate the relative contribution of each multiplet component to the experimentally found resonance.
As a result, in order to implement an algorithm capable of automated signal annotation and multiplet decomposition, the computational problem for each resonance of each metabolite was subdivided into several steps ( Figure 2):
Figure 2. Schematic representation of the main steps of the algorithm for automated 2D- 1H, 13C-HSQC NMR spectrum analysis.
The computational routine uses a list of metabolite names as input and outputs a metabolite composition as well as the software reliability for each resonance analysis. Optionally, the user can choose to produce a graphical report of the analysis in form of a pdf file.
-
1.
Restrict the searching area of the resonance of interest in the NMR spectrum based on a 2D- 1H, 13C-HSQC NMR spectral database built into the MetaboLab software
-
2.
Run the Independent Component Analysis (ICA) to obtain the latent factors (e.g. 13C spectra of any resonance present in the restricted area) and localise each latent component along the proton dimension
-
3.
Identify the independent component representative of the multiplet of interest and precisely determine the 1H and 13C chemical shifts
-
4.
Perform the multiplet analysis by constrained linear regression and compute the coefficient of determination ρ (%) to guide the user in the understanding of the software outputs.
In the following sections, the pipeline of the algorithm developed in this work is presented, each section explains a principal element of the algorithm ( Figure 2). To provide an overview, a pseudo-code representation of the implemented algorithm can be found in the extended data repository (githubRef, Table S1 and Algorithm 1).
Metabolite localisation in the HSQC NMR spectra
The assignment of NMR resonances to specific metabolite nuclei is based on a peak list obtained from a spectral library included in the MetaboLab software which was derived from publicly available databases 15– 18 .
For a given metabolite, the algorithm loops through each resonance and derives expected 1H and 13C chemical shifts from the built-in database. The next logical step would be to identify the precise location of the spectral resonance. However, because this algorithm works without any a priori knowledge about metabolic pathways or which metabolic tracer was used, we do not have any knowledge about neither the exact chemical shift values of the resonance nor the multiplet composition. In order to identify the correct multiplet in the 2D- 1H, 13C-HSQC NMR spectrum, we can use information about 13C/ 13C-Jcouplings, stored in the spectral database included with the software. While differences in pH and sample matrix composition may lead to a change in the precise resonance location in the 2D- 1H, 13C-HSQC NMR spectrum, the 1H and 13C chemical shift values will still be similar to the values given by the database. To improve the software’s computational efficiency, the search space (i.e. the portion of 2D- 1H, 13C-HSQC NMR spectrum to be examined) for each multiplet signal is restricted from the entire spectrum to an area of ±maxWidth1H ppm ( ±0.15 ppm) and ±maxWidth13C ppm ( ±3.0 ppm) around the 1H and 13C library shift of the metabolite respectively. The maxWidth parameters refer to user changeable parameters with the standard values indicated if the user does not provide these values. This range allows the algorithm to accommodate changes in chemical shift due to the aforementioned factors.
Multiplet identification: Independent Component Analysis (ICA)
As mentioned earlier, the computational problem to identify a particular metabolite multiplet in the restricted spectral area can be understood to be a signal unmixing problem. One particularly efficient way to solve this problem is Independent Component Analysis (ICA). Mathematically, we perceive the NMR spectrum columns (i.e. 13C 1D sub-spectra) of the chosen spectral area to be N observations ( y 1( ω), y 2( ω),.., y i ( ω),.., y N ( ω)), where each observation y i ( ω) ∈ ℝ T contains contributions of a smaller number of K NMR signals and noise. The K number of signals are also known as latent variables ( s 1( ω), s 2( ω),.., s i ( ω),.., s K ( ω)). While the noise of each spectral column is assumed to be independent, we assume each observation y i ( ω) to be a mixture of K latent variables weighted by unknown coefficients. We can then define a mathematical model as follows, using a vector-matrix notation and dropping the frequency dependence ω without loss of generality:
where Y ∈ ℝ N ×T is the observation matrix, W ∈ ℝ N ×K is the unknown linear mapping matrix from the latent space S ∈ ℝ K ×T to the observation space.
In the case of a 2D- 1H, 13C-HSQC NMR spectrum of a metabolite mixture, all the multiplets of the metabolites within the restricted spectrum area are latent variables ( s 1, ..., s T ) and the restricted area of the 2D- 1H, 13C-HSQC NMR spectrum is the observation. While a 1D-ICA generally requires multiple 1-dimensional observations, here one 2-dimensional observation is available (i.e. the search area of the 2D- 1H, 13C-HSQC NMR spectrum).
The shortcoming of having only a single 2D observation is addressed by utilising the proton dimension as the observation dimension. This means that all the 13C NMR spectra at different proton chemical shifts of the 2D-spectrum are considered as observations, therefore the proton dimension is the observation dimension, and the problem can be solved using 1D-ICA. The K latent factors are the underlying spectral components (e.g. the mutiplets of the metabolites). The K components are assumed to be statistically independent as required for the ICA algorithm 19 . This assumption is also physically/chemically plausible, as there is no interaction between metabolites at different 1H-shifts 20 . Moreover, the linearity of the model can be justified, as the metabolites mix in an additive way in the spectrum 20 . These assumptions justify the choice of using ICA among other unmixing algorithms such as non-negative matrix decomposition (NMF), since ICA finds a decomposition of the observed data to retrieve latent components which are as independent as possible. For the practical implementation of this algorithm, we chose to use an implementation of the FastICA 21 algorithm. Details on the working principle of 1D-ICA and its adaptation to solve the proposed problem are described in the Supporting Information (Section: Independent Component Analysis for mutiplets identification). Figure 3 provides an example of some of the ICA components (independent components are shown in green and experimental 13C NMR sub-spectra are plotted in blue) provided by the ICA algorithm during the analysis for HC(2) of Lactate. The 1H chemical shift range for the shown independent components was 3.95 to 4.2 ppm.
Figure 3. Independent components from the region around the expected shifts of HC(2) of lactate.
These independent components are then matched back to the experimental data with the highest correlation component being selected.
Localisation along the proton dimension by cross-correlation
While ICA is very successful in identifying statistically independent resonances in the 13C dimension, information about the 1H localisation is lost. In order to assign a specific 1H chemical shift to each latent component a cross-correlation between each component and all columns from the search area of the experimental NMR spectrum is required. Because each of the NMR columns is associated with a specific 1H ppm value, the column which receives the largest cross-correlation value determines the assigned 1H chemical shift of that component. To avoid selection of latent components representing spectral noise only, the algorithm uses a minimum threshold for the correlation value ( minCorr) below which the latent component will be discarded from further evaluation. The standard value of this minimum correlation is 0.8, which can be adjusted by the user if needed.
Latent component identification
The latent components localised within the spectrum might have captured multiplet patterns of metabolites within the searching area, which will contain the resonance of the metabolites of interest if this actually exists in the current spectrum. However, these components will also contain resonances belonging to CH pairs of other metabolites. To determine if the resonance of interest exists in one of the estimated latent components, the following steps are necessary:
-
1.
The multiplet of interest is simulated using chemical shift values and 13C/ 13C-J-coupling constants stored in MetaboLab’s database are used assuming equal intensities for each of the multiplet components as shown in Figure 4 panel A. The pygamma NMR simulation library (version 4.3.4) is used for multiplet simulations 22 .
-
2.
This multiplet is then aligned with each latent component through cross-correlation (of the simulated multiplet with the latent components) and the 13C chemical shift value of the simulated multiplet for the latent component with the highest cross-correlation value is chosen.
Figure 4. Multiplet analysis of Lactate C(2).
( A) The individual components that make up the multiplet of C(2) of lactate. ( B) Simulated individual components of C(2) of lactate with the percentage contribution of each component to the multiplet signal of 13C-labelled Lactate C(2). ( C) An overlay of the experimental data and the fitted simulated data based on the calculated percentage contributions of each labelled form of lactate. ( D) Region of the 2D- 1H, 13C-HSQC spectrum showing the multiplet peak for lactate C(2).
-
-
This alignment is not trivial since the magnitude of the multiplet components is unknown at this point of the pipeline and alignment methods based on cross-correlation are sensitive to difference in signals’ magnitude. Moreover, presence of "outlier" peaks (peaks belonging to another metabolite or noise) pose additional challenges.
-
-
To solve this problem, all possible combinations of simulated multiplets are considered. For example, if we consider a total of p = 4 multiplet components (i.e. we neglect long-range 13C/ 13C-J-couplings) and we choose a subset of q component at a time, this computation will result in combinations. Each of these combinations is aligned with the latent component and a regression analysis is performed to obtain the coefficient of determination ρ scores, which quantifies the goodness of the fitting. The combination of multiplet components which gives the highest ρ value is chosen and the ρ score saved. Theoretically, this should result in the highest ρ score for the correct multiplet simulation at the right 13C chemical shift.
-
-
However, the ρ value is sensitive to underlying noise and NMR resonances close to the resonance of interest. Therefore, in each latent components only NMR signals with maxima greater than 50% of the global maximum within the search area are retained and noise related issues can be avoided. In addition, the score ρ is weighted by the squared deviation ∆ ppm of the determined 13C chemical shift ( x H , x C ) from the library 13C chemical shift ( x Hlib , x Clib ) computed as Euclidean distance ( Equation 2 – Equation 4) 23 .
where γ H and γ C are the 1H and 13C gyromagnetic ratios.
Multiplet analysis: contribution of multiplet components to overall signal
As the last step, the relative intensity of each multiplet component is then determined by a constrained linear least square regression. Prior to this step we optimize the alignment along the proton dimension using the hill climbing heuristic optimization algorithm 24 to ensure that the 1H chemical shift obtained in the previous localisation step corresponds to the maximum of the experimental NMR resonance in the 1H dimension. The regression is defined as follows 5:
where F is the simulated multiplet with adjusted intensities of all multiplet components, X is a matrix containing the different multiplet sub-spectra as input and B ≥ 0 is a vector containing the relative contribution values, which is the vector of unknowns that needs to be determined. After the constrained linear regression, vector B is normalised, so that all relative contributions add up to be 1 ( B = B/∑ ib(i), where b(i) are the vector components).
Experimental methods
Sample preparation
For the test NMR samples, cells were cultured in the presence of isotopically labelled tracer, chosen dependent on the pathway of interest. For instance, cells cultured in the presence of [1,2- 13C 2]-D-glucose can provide information regarding flux through pyruvate dehydrogenase (PDH) and into the tricarboxylic acid (TCA) cycle via the isotomomer composition of C4 of glutamate. Metabolites were extracted using a biphasic system to remove proteins and lipophilic molecules whilst retaining the polar molecules 25 .
NMR methodology
2D- 1H, 13C-HSQC NMR spectra were collected using the standard procedures described previously 8, 26, 27 . Spectra were acquired on either a Bruker NEO 800-MHz or a Bruker Avance 600-MHz spectrometer. Spectrometers were equipped with 1.7mm z-PFG TCI CryoProbes. All 2D-HSQC NMR spectra were acquired using echo/anti-echo gradient selection with a presaturation pulse during the 1.5s interscan delay to suppress the water resonance. Typically, spectra were acquired using 2 scans, 512 complex data points and a spectral width of 15.6ppm in the 1H dimension. The 13C dimension was acquired using 25% of 8,192 complex data points and a spectral width of 189.8 ppm resulting in an experiment time of approximately 4 hours. To evaluate the efficacy of the algorithm with regards to the scaling of the splitting due to J-couplings experiments using scaling of 0, 2, 4 or 8 were collected. The non-uniformly collected spectra were reconstructed using the IRLS algorithm using 20 iterations with MDDNMR (version 2.5) 28, 29 and then processed using NMRPipe (Version 9.2) 30 . A polynomial baseline correction was applied following manual phase correction.
Implementation
While the biggest part of the algorithm is implemented in MATLAB (source code is available here: https://zenodo.org/record/7120367#.YzR6FC0w30o), the multiplet simulation is performed within python (version 3.9) using the pygamma NMR simulation library (version 4.3.4, https://pypi.org/project/pygamma/) 22 . In addition the fastICA package is needed, which is available at https://research.ics.aalto.fi/ica/fastica/. The MATLAB version used was 9.10.0.1739362 (R2021a). The MetaboLab version was 2022.0726.1733 (available at https://www.ludwiglab.org/software-development). To perform the automated HSQC multiplet analysis, the MetaboLab scripting interface is used. We provide examples of all the scripts used to analyse the three different data sets presented here as part of the extended data pdf document, available here: 10.5281/zenodo.7867854 31 . The datasets used as discussed in the Underlying data section 32, 33 .
Operation
The algorithm is tested on a laptop with an Intel i7-6700HQ 2.60GHz CPU and a GTX 960M GPU. The analysis of a single spectrum is performed in less than 5 minutes.
Results and discussion
Program output
While all resulting information is stored inside the MetaboLab data structure, the user has the option to create a clear text report in the form of a pdf file ( report: on). This report (see Supporting Information) contains sample information originally stored inside TopSpin’s title file, a list of the metabolites analysed and the user defined variables for the multiplet assignment and analysis ( maxWidth1H, maxWidth13C, minCorr, maxRange, nReps, R2, R2 (scaled), dataSets, experiments). The following information is output for each observable carbon of each selected metabolite: coefficient of determination ρ for each multiplet, the percentage contribution for each component of the multiplet, the distance of the assigned multiplet from the library shift, a zoomed region of the 2D- 1H, 13C-HSQC NMR spectrum showing the multiplet of interest and the 1H shift of the independent component, and an overlay of the experimental 1D- 13C column and simulated data. The coefficient of determination ρ and distances from the library chemical shifts are colour coded. The colours indicate whether the software considers the assignment and multiplet estimation to be trustworthy (coefficient of determination ≥ 80%), borderline (70% > coefficient of determination < 80%) or not trustworthy (coefficient of determination ≤ 70%).
Evaluation of algorithm performance
To evaluate the algorithm performance, the following metric is used. The coefficient of determination ρ quantifies the percentage of observed variance that can be explained by the linear regression model. Considering that the model structure is supported by the theoretical assumption that each observed multiplet is a linear combination of single multiplet components, ρ is indicative of the goodness of the fit. In this work, ρ is used to indicate the analysis reliability. If the experimental 1D spectrum is given as t( M × 1) and the estimated spectrum as ( M × 1), the coefficient of determination ρ can be calculated as follows:
where is the mean of t, is the total sum of squares, and is the unexplained sum of the squared distance between the observations and predictions. The coefficient of determination ρ indicates not only the quality of multiplet estimation but incorporate also information on the correctness of the multiplet assignment. For this reason, ρ evaluates the overall performance of the algorithm: it indicates if the multiplet has been correctly localised and consequently the linear regression method is able to closely fit the experimental data. If a signal has a poor signal-to-noise ratio or cannot be found at all, the corresponding ρ value will be set to 0. Sample reports are provided in the supporting information, including examples of metabolites with very low concentrations where the algorithm fails to assign and analyse the multiplet correctly.
Validation of the model and analysis of the results
Overall algorithm evaluation. The performance of the algorithm developed in this work is evaluated by reporting the results of multiple testing scenarios using different datasets. We manually inspected the datasets to evaluate the software output and in particular to confirm the following outcomes: a) the multiplet is not correctly assigned and the software warns the user about the unreliability of the output (i.e. low coefficient of determination, the distance from the library shifts exceeds the set threshold) b) the multiplet is not correctly identified and the software fails in flagging the issue to the user, giving as output a high coefficient of determination. The coefficient of determination ρ is generally affected by the noise and peaks surrounding the multiplet being analysed, as demonstrated in Figure 5 and Supporting Information Figures S1–S5. Therefore, ρ is thought to provide not only a measure of the goodness of the fit but also information about the signal-to-noise ratio (SNR). In heavily congested areas or for signals with poor SNR the computed value of ρ is likely to be low. In this case, the user is also warned and the software decision making may be checked manually if desired.
Figure 5. Region of the spectrum showing the multiplet peak for lactate C(2).
The grey area underlying the spectrum indicates the highest coefficient of determination computed for lactate C(2) as a function of the chemical shift offset.
Analysis was performed on four metabolites (alanine, aspartate, glutamate and lactate) on all 13 test 2D- 1H, 13C-HSQC NMR spectra. In each spectrum we expect 11 multiplets consisting of two for lactate, two for alanine, four for glutamate and three for aspartate. In total three different and previously published datasets were used. The three datasets contain a total of 13 2D- 1H, 13C-HSQC NMR spectra. All NMR data is available via public data repositories. The first dataset contains 2D- 1H, 13C-HSQC NMR spectra with [1,2- 13C] glucose as metabolic tracer (MTBLS241 8 ). This dataset contains spectra from three biological replicates. The second dataset contains 4 different 2D- 1H, 13C-HSQC NMR spectra from one sample, but with different amounts of apparent signal-splitting due to J-coupling (eqhn3 27 ). The third dataset contains 6 2D- 1H, 13C-HSQC NMR spectra from one sample using different amounts of sampling in a non-uniform sampling scheme (qtmge 26 ). The characteristics of each dataset are summarised in Supporting Information Table S2
The proton and carbon chemical shifts ρ values and percentage contribution of each multiplet component for these are given in Supporting Information Tables S3–S5. A quantitative analysis of the results is provided below for each dataset.
Dataset 1 - CANMS ( MTBLS241) Manual inspection of the automated analysis reveals that all the 33 metabolite multiplets (for lactate, alanine, glutamate and aspartate) are correctly identified and localised. The average coefficient of determination ρ is 94.06 ± 5.08%.
The automatic peak-picking is compared against the manual procedure of dataset MTBLS241 to assess the quality of the automated algorithm and compute the accuracy (see figure S6).
Dataset 2 - Scaling of apparent scalar couplings ( qtmge) Manual inspection of the automated analysis reveals that all the 44 metabolite multiplets are correctly identified and localised. The average coefficient of determination ρ is 95.47 ± 7.20%. Compared to the previous dataset, the slightly higher standard deviation is due to alanine (C2) ( ρ = 56.7%). The algorithm is compatible with the technique of enhancement of splitting due to J-coupling 27 meaning that the automated assignment can be used in conjunction with rapidly collected data on high field spectrometers. The degree of splitting enhancement is detected by the software and algorithm parameters automatically altered. The average coefficients of determination for the different values of enhancement of splitting due to J-coupling (1, 2, 4, 8) are ρ = 96.14 ± 3.96%, ρ = 97.92 ± 1.67%, ρ = 96.54 ± 3.65%, ρ = 91.30 ± 12.78% respectively. The results show that a splitting enhancement of 4 is dealt with effectively by the algorithm but an enhancement of 8 is not recommended as it can lead to poorer estimations on less intense multiplets. With an enhancement of splitting due to J-scaling of 1 the number of complex data points collected in the indirect dimension should be 8192. With increasing enhancement, the number of data points acquired can be reduced proportionately. The enhancement of splitting due to J-coupling can also be utilised to reduce the experiment time from 4 hours to 1 hour 26 .
Dataset 3 - NUS-sampling ( eqhn3) The algorithm gives good results with data collected with reduced sampling down to 25%. At sampling of below 25% the signal to noise is decreased such that only intense peaks can be analysed with confidence and the resulting coefficient of determination of weaker peaks is low. We therefore do not recommend collected data with sampling of less than 25%.
50 out of 66 metabolite multiplets have been picked correctly by the software. The average coefficient of determination (80.47 ± 20.98%) is significantly lower compared to the previous results. This is due to aspartate, whose co-efficient of determination is low in all the replicates due to poor signal-to-noise ratio because of very low aspartate concentrations. Considering the 16 multiplets that have not been correctly picked, 12 out of 16 have corresponding low coefficient (red flag). In this case the user is warned about the software not being able to provide a reliable analysis and we therefore consider this as a positive outcome for the software (true negative). Instead, 4 multiplets have wrongly been picked by the software and the corresponding coefficients of determination are > 85% (false positive). This last case arises for glutamate. The method described here localises a metabolite along the proton dimension by relying on the cross-correlation metric, as explained in the Algorithm development section. With well defined and resolved peaks the peak is easily identified by the algorithm as shown by the large area with a high coefficient of determination for instance with Lactate C(3) (Supporting Information Figure S2). However, if two or more metabolites have multiplets with almost identical patterns and carbon shift, as seen with C(2) (Supporting Information Figure S3) and C(3) (Supporting Information Figure S4) of glutamine and glutamate, the algorithm occasionally fails to correctly pick the metabolite as each metabolite assignment is carried out without knowledge of other metabolites’ assignments. This can almost always be identified by the larger than expected deviation of the chemical shifts from the library values. In one of our test cases the downfield 1H of glutamate C(3) is misassigned to glutamine C(3) while the upfield 1H is correctly assigned. The misassigned resonance can again be identified by the larger deviation from the library shift and thus analysis can be limited to the correctly assigned 1H resonance. The misassignment is due to the glutamine being unlabelled and thus the ease of the algorithm fitting a singlet in close proximity to the library shifts. Care must be taken in the analysis of spectra with significant signals from glutamate and glutamine. Where a carbon has two attached protons, as with aspartate C(3) and glutamate C(3), the algorithm may select the same multiplet independent of the actual 1H chemical shift. As the 13C/ 13C-J coupling constants and the 13C chemical shift and the multiplet composition are all identical for the two protons, this can be ignored. If the sample preparation results in the collected data differing significantly from the library shifts the algorithm can adapt in most cases except where the coupling constants of a nearby resonance match that of the actual metabolite. This can be minimised by correctly the pH of the polar phase (methanol:water) prior to drying the sample and resuspension in NMR buffer. This is shown in Supporting Information Figure S3 where if the sample conditions results in a 0.1 ppm 1H shift then a glutamate (C2) can be misassigned as creatine with a high coefficient of determination. However, the rest of the dataset would then show very poor algorithm results allowing the user to realise that there may be an issue with the sample. If proper sample preparation is performed, then such problems should not be an issue.
Conclusion
In this article, we have described a data-driven algorithm which functions within the existing MetaboLab software. The algorithm automatically assigns, annotates and analyses multiplets arising from 13C - 13C scalar couplings present in 2D- 1H, 13C-HSQC spectra collected to study metabolism using isotopic tracers. The method is robust and operates with a high degree of accuracy. We have tested the algorithm on three different datasets and show that the algorithm can analyse data collected with a range of tracers with different labelling, at different frequencies, with multiple biological repeats and with different sampling amounts. All spectra were successfully analysed and the algorithm is also able to correctly identify when enhancement of splitting due to J-coupling has been used and to automatically compensate for this effect prior to analysis. The coefficient of determination of the algorithm in the analysis is displayed conveniently in a colour coded form to aid the user in determining the reliability of the result. The algorithm functions successfully in all but the most challenging of situations allowing extensive analysis to be performed even by those with limited knowledge of the field.
Acknowledgements
The authors thank HWB-NMR staff at the University of Birmingham for providing open access to their Wellcome Trust-funded 800 MHz spectrometer. We also acknowledge support by the metabolic tracer analysis core (MTAC) at the Institute of Metabolism and Systems Research (IMSR) at the University of Birmingham.
Funding Statement
This work was supported by Wellcome [208400, <a href=https://doi.org/10.35802/208400>https://doi.org/10.35802/208400</a>]. We gratefully acknowledge financial support for LF from the Engineering and Physical Sciences Research Council (EPSRC) through a studentship from the Physical Sciences for Health Centre for Doctoral Training (EP/L016346/1).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
[version 2; peer review: 2 approved]
Data availability
Underlying data
Underlying data can be accessed at the following data repositories: https://www.ebi.ac.uk/metabolights/MTBLS241/files (MTBLS241 dataset), https://doi.org/10.17605/OSF.IO/93BTZ (qtmge dataset) 32 and https://doi.org/10.17605/OSF.IO/BD54T (EQHN3 dataset) 33 .
Extended data
Extended data is available at: http://dx.doi.org/10.5281/zenodo.7867854.
Software availability
• Software available from: https://www.ludwiglab.org/software-development
• Source code for the presented algorithm is available from: https://github.com/ludwigc/Automated-HSQC-Multiplet-Analysis
• Archived source code at time of publication: https:doi.org/10.5281/zenodo.7120367 31
Data and software are available under the terms of the Creative Commons Zero "No rights reserved" data waiver (CC0 1.0 Public domain dedication).
References
- 1. Holmes E, Wilson ID, Nicholson JK: Metabolic phenotyping in health and disease. Cell. 2008;134(5):714–717. 10.1016/j.cell.2008.08.026 [DOI] [PubMed] [Google Scholar]
- 2. Metallo CM, Vander Heiden MG: Understanding metabolic regulation and its influence on cell physiology. Mol Cell. 2013;49(3):388–398. 10.1016/j.molcel.2013.01.018 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Ward PS, Lu C, Cross JR, et al. : The potential for isocitrate dehydrogenase mutations to produce 2-hydroxyglutarate depends on allele specificity and subcellular compartmentalization. J Biol Chem. 2013;288(6):3804–3815. 10.1074/jbc.M112.435495 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Lewis CA, Parker SJ, Fiske BP, et al. : Tracing compartmentalized nadph metabolism in the cytosol and mitochondria of mammalian cells. Mol Cell. 2014;55(2):253–263. 10.1016/j.molcel.2014.05.008 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Metallo CM, Walther JL, Stephanopoulos G: Evaluation of 13c isotopic tracers for metabolic flux analysis in mammalian cells. J Biotechnol. 2009;144(3):167–174. 10.1016/j.jbiotec.2009.07.010 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Walther JL, Metallo CM, Zhang J, et al. : Optimization of 13c isotopic tracers for metabolic flux analysis in mammalian cells. Metab Eng. 2012;14(2):162–171. 10.1016/j.ymben.2011.12.004 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Wiechert W, Nöh K: Isotopically nonstationary metabolic flux analysis: complex yet highly informative. Curr Opin Biotechnol. 2013;24(6):979–986. 10.1016/j.copbio.2013.03.024 [DOI] [PubMed] [Google Scholar]
- 8. Chong M, Jayaraman A, Marin S, et al. : Combined analysis of nmr and ms spectra (canms). Angew Chem Int Ed Engl. 2017;56(15):4140–4144. 10.1002/anie.201611634 [DOI] [PubMed] [Google Scholar]
- 9. Xia J, Bjorndahl TC, Tang P, et al. : MetaboMiner--semi-automated identification of metabolites from 2D NMR spectra of complex biofluids. BMC Bioinformatics. 2008;9:507. 10.1186/1471-2105-9-507 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10. Wang C, Timári I, Zhang B, et al. : COLMAR Lipids Web Server and Ultrahigh-Resolution Methods for Two-Dimensional Nuclear Magnetic Resonance- and Mass Spectrometry-Based Lipidomics. J Proteome Res. 2020;19(4):1674–1683. 10.1021/acs.jproteome.9b00845 [DOI] [PubMed] [Google Scholar]
- 11. Fino R, Byrne R, Softley CA, et al. : Introducing the csp analyzer: A novel machine learning-based application for automated analysis of two-dimensional nmr spectra in nmr fragment-based screening. Comput Struct Biotechnol J. 2020;18:603–611. 10.1016/j.csbj.2020.02.015 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12. Kuhn S, Tumer E, Colreavy-Donnelly S, et al. : A pilot study for fragment identification using 2d nmr and deep learning. arXiv preprint arXiv:2103.12169,2021. 10.48550/arXiv.2103.12169 [DOI] [PubMed] [Google Scholar]
- 13. Ludwig C, Günther UL: MetaboLab--advanced NMR data processing and analysis for metabolomics. BMC Bioinformatics. 2011;12(1):366. 10.1186/1471-2105-12-366 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14. Alpaydin E: Introduction to machine learning.MIT press,2014. Reference Source [Google Scholar]
- 15. Wishart DS, Tzur D, Knox C, et al. : Hmdb: the human metabolome database. Nucleic Acids Res. 2007;35(Database issue):D521–D526. 10.1093/nar/gkl923 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16. Wishart DS, Knox C, Guo AC, et al. : Hmdb: a knowledgebase for the human metabolome. Nucleic Acids Res. 2009;37(Database issue):D603–D610. 10.1093/nar/gkn810 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17. Wishart DS, Jewison T, Guo AC, et al. : Hmdb 3.0—the human metabolome database in 2013. Nucleic Acids Research. 2013;41(Database issue):D801–D807. 10.1093/nar/gks1065 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Wishart DS, Feunang YD, Marcu A, et al. : Hmdb 4.0: the human metabolome database for 2018. Nucleic Acids Res. 2018;46(D1):D608–D617. 10.1093/nar/gkx1089 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19. Comon P: Independent component analysis, a new concept? Signal Process. 1994;36(3):287–314. 10.1016/0165-1684(94)90029-9 [DOI] [Google Scholar]
- 20. Levitt MH: Spin dynamics: basics of nuclear magnetic resonance.John Wiley & Sons,2001. Reference Source [Google Scholar]
- 21. Hyvärinen A: Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans Neural Netw. 1999;10(3):626–634. 10.1109/72.761722 [DOI] [PubMed] [Google Scholar]
- 22. Smith SA, Levante TO, Meier BH, et al. : Computer simulations in magnetic resonance. an object-oriented programming approach. J Magn Reson A. 1994;106(1):75–105. 10.1006/jmra.1994.1008 [DOI] [Google Scholar]
- 23. Williamson MP: Using chemical shift perturbation to characterise ligand binding. Prog Nucl Magn Reson Spectrosc. 2013;73:1–16. 10.1016/j.pnmrs.2013.02.001 [DOI] [PubMed] [Google Scholar]
- 24. Johnson AW, Jacobson SH: A class of convergent generalized hill climbing algorithms. Appl Math Comput. 2002;125(2–3):359–373. 10.1016/S0096-3003(00)00137-5 [DOI] [Google Scholar]
- 25. Sellick CA, Hansen R, Stephens GM, et al. : Metabolite extraction from suspension-cultured mammalian cells for global metabolite profiling. Nat Protoc. 2011;6(8):1241–1249. 10.1038/nprot.2011.366 [DOI] [PubMed] [Google Scholar]
- 26. Jeeves M, Roberts J, Ludwig C: Optimised collection of non-uniformly sampled 2d-hsqc nmr spectra for use in metabolic flux analysis. Magn Reson Chem. 2021;59(3):287–299. 10.1002/mrc.5089 [DOI] [PubMed] [Google Scholar]
- 27. Smith TB, Patel K, Munford H, et al. : High-Speed Tracer Analysis of Metabolism (HS-TrAM) [version 2; peer review: 4 approved]. Wellcome Open Res. 2018;3:5. 10.12688/wellcomeopenres.13387.2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Orekhov VY, Jaravine VA: Analysis of non-uniformly sampled spectra with multi-dimensional decomposition. Prog Nucl Magn Reson Spectrosc. 2011;59(3):271–292. 10.1016/j.pnmrs.2011.02.002 [DOI] [PubMed] [Google Scholar]
- 29. Kazimierczuk K, Orekhov VY: Accelerated nmr spectroscopy by using compressed sensing. Angew Chem Int Ed Engl. 2011;50(24):5556–5559. 10.1002/anie.201100370 [DOI] [PubMed] [Google Scholar]
- 30. Delaglio F, Grzesiek S, Vuister GW, et al. : Nmrpipe: a multidimensional spectral processing system based on unix pipes. J Biomol NMR. 1995;6(3):277–293. 10.1007/BF00197809 [DOI] [PubMed] [Google Scholar]
- 31. Ludwig C: ludwigc/AutomatedHSQC-Multiplet-Analysis: Extended Data and Source Code for Automated HSQC Multiplet Analysis (v1.0). Zenodo. 2022. 10.5281/zenodo.7120367 [DOI] [Google Scholar]
- 32. Ludwig C, Jeeves M, Roberts J: Rapid data collection of spectra for the use in tracer based metabolism studies.[Dataset].2020. 10.17605/OSF.IO/93BTZ [DOI]
- 33. Ludwig C: HS-TrAM.[Dataset].2018. 10.17605/OSF.IO/BD54T [DOI]





