Author manuscript; available in PMC: 2025 Oct 1.
Published in final edited form as: NMR Biomed. 2025 Oct;38(10):e70131. doi: 10.1002/nbm.70131

Automatic Identification of Potential Cellular Metabolites for Untargeted NMR Metabolomics

Jiashang Chen 1,*, Angela Rao 1,*, Rajshree Ghosh Biswas 1,*, Ella J Zhang 1,*, Jonathan Xin Zhou 1, Evan Zhang 1, Zuzanna Kobus 1, Marta Kobus 1, Li Su 2, David C Christiani 2,3, David S Wishart 4,§, Leo L Cheng 1,2,5,§
PMCID: PMC12445015  NIHMSID: NIHMS2105265  PMID: 40887896

Abstract

An organism’s metabolic profile provides vital information pertaining to its physiology or pathology. To monitor these biochemical changes, Nuclear Magnetic Resonance (NMR) spectroscopy has found success in non-invasively observing metabolite changes within intact samples in an untargeted manner. However, biological samples are chemically complex, comprised of many different constituents (amino acids, carbohydrates, lipids) at varying concentrations depending on physiological and pathological conditions. Due to the narrow spectral window of proton NMR, compound resonance frequencies can often overlap, making the identification and monitoring of metabolites difficult and time-consuming, particularly when dealing with large numbers of samples. Here, we introduce a Python program (ROIAL-NMR) to systematically identify potential metabolites from defined proton NMR spectral regions-of-interest (ROIs), which are identified from complex biological samples (i.e., human serum, saliva, sweat, urine, CSF, and tissues) using the Human Metabolome Database (HMDB) as a reference platform. Briefly, for disease-versus-control studies, the program considers disease types and utilizes study-defined ROIs together with their differing intensity levels, according to sample types, in differentiating disease from control to propose potential metabolites represented by these ROIs in an output table. In this report, we illustrate the utility of the program with one of our recent studies, where we measured proton NMR spectra of serum samples taken from lung cancer (LC) patients, with and without Alzheimer’s disease and related dementia (ADRD). The program successfully identified 88 metabolites, with 66 differentiating LC from control patients, and 80 distinguishing LC patients with ADRD from those without ADRD to provide important information regarding pathophysiology in complex biological samples.

Keywords: Automated Metabolite Identification, Nuclear Magnetic Resonance Spectroscopy, Python Program

1. Introduction

As an individual’s physiology and pathology develop throughout one’s lifespan, numerous biological events occur across several omic realms: genomics, transcriptomics, proteomics, and metabolomics. While genomics, transcriptomics, and proteomics are useful for understanding the etiology of disease, they are less informative for assessing disease status or real-time biological activities; in contrast, metabolomics provides a more accurate reflection of both disease status and real-time bio-activities due to its close relationship to the observed phenotype.1 By providing a snapshot of all detectable small molecules in a biological system, metabolomics can assist a variety of clinical evaluations, from disease detection and diagnosis to patient prognostication and treatment assessment.2

There are two main approaches to metabolomics studies: targeted and untargeted. Targeted approaches are used to quantitatively study known metabolites or pathways that are associated with a disease or condition of interest and can be applied, for example, to study the pharmacokinetics of drug metabolism or therapeutic effects on a specific enzyme.3 Untargeted approaches focus on the unbiased identification and quantification of all measurable metabolites and/or metabolic pathways within a given biosample.1 As a global, hypothesis-generating method, untargeted metabolomics can be powerful in disease-related studies because it offers opportunities for novel biomarker and target discoveries. Nuclear magnetic resonance (NMR) spectroscopy is considered a robust technique for untargeted metabolomics studies due to its unbiased nature, excellent reproducibility, minimal sample preparation requirements, and non-destructive quality.4

When NMR is applied to the analysis of biological samples, metabolite identifications are normally achieved through reference to biological NMR databases such as the Human Metabolome Database (HMDB),5 the Biological Magnetic Resonance Bank (BMRB),6 and others. While these databases can report possible metabolites based on a given NMR chemical shift value, there is no mechanism built into these databases that can collectively identify potential metabolites that may exist in multiple spectral regions-of-interest (ROIs) simultaneously.

To address this challenge, we developed a Python program (ROIAL-NMR: Region Of Interest Assessment of Liquids by NMR) to systematically identify potential metabolites within biological samples. Since the current program is focused on the identification of potential metabolites within pre-determined spectral ROIs, details on NMR data acquisitions, spectral processing, and ROI determination will not be discussed in this report. To demonstrate the program’s efficacy, we applied ROIAL-NMR to evaluate the proton (1H) NMR spectra of a total of 36 serum samples collected from a healthy control group (n=18) that were matched with lung cancer (LC) patients, with and without Alzheimer’s disease and related dementia (ADRD). These patients were defined as LC-ADRD (n=9) and LC-non-ADRD (n=9). ROIAL-NMR successfully identified 88 potential metabolites, with 66 metabolites capable of differentiating between LC patients and healthy controls and 80 differentiating between LC patients, with and without ADRD. ROIAL-NMR is unique in that it enables the rapid (within 5 seconds) identification of a broad range of metabolites from ROIs by utilizing chemical shift data for 891 metabolites from the HMDB NMR metabolite database. The final output can reveal metabolomic differences between test groups through pre-identification and statistical analysis of ROIs from a set of spectra, a capability that most previous programs lack.

2. Methods

Program design

ROIAL-NMR is designed to work under the assumption that the following data sets are available:

  1. A table summarizing spectral ROIs for measured biological samples of either a single or multiple group(s) of subjects, such as disease and control groups with optional statistical parameters (i.e., the disease-associated ROI trends, mainly increasing or decreasing signal intensities, and their significance levels for comparisons among groups for each ROI or after corrections for Type I errors, such as False Discovery Rate corrections (FDR)).

  2. An NMR database containing cellular metabolites with their chemical shift assignments for different types of biological samples or biofluids (serum, plasma, urine, etc.). Here we used the data available in the HMDB, as a demonstration platform.
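As an illustration of the first data set, the summarization table can be thought of as a list of ROI records. The sketch below is only one possible in-memory layout, not the file format that ROIAL-NMR actually reads. The class name `ROI`, its field names, and the statistical values are hypothetical; the two ppm ranges are the lactate ROIs quoted in the "Identification of metabolites" section.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ROI:
    """One row of a hypothetical ROI summarization table."""
    low_ppm: float
    high_ppm: float
    trend: Optional[str] = None          # "increase" or "decrease" (optional)
    p_value: Optional[float] = None      # uncorrected significance (optional)
    fdr_p_value: Optional[float] = None  # FDR-corrected significance (optional)

    def contains(self, shift_ppm: float) -> bool:
        """True if a chemical shift value lies within this ROI."""
        return self.low_ppm <= shift_ppm <= self.high_ppm


# Statistical values below are made up for illustration only.
summary_table = [
    ROI(1.31, 1.34, trend="increase", p_value=0.01, fdr_p_value=0.04),
    ROI(4.10, 4.12),  # single-group study: no statistics supplied
]

print(summary_table[0].contains(1.324))
```

Keeping the statistical fields optional mirrors the text: ROIs alone suffice for a single-group study, while trends and significance levels are only needed for group comparisons.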

An overview of the ROIAL-NMR data collection and processing workflow is shown in Fig. 1.

Fig. 1. Overview of the ROIAL-NMR data collection and processing workflow.

Step 1: Obtaining ROIs from NMR data analysis of the chosen biosamples, collected from subject group(s). These ROIs may be capable of differentiating groups. Levels of statistical significance for each ROI are specified in a “Summarization table.” Step 2: With input of the “Summarization table”, ROIAL-NMR identifies and categorizes metabolites based on HMDB chemical shift data. Step 3: ROIAL-NMR outputs a final table of potential metabolites, organized according to disease group, significance categories, and trend.

Availability of ROIAL-NMR

The ROIAL-NMR program is available at the following online website: https://github.com/Leo-Cheng-Lab/ROIAL-NMR.git

Detailed instructions for using ROIAL-NMR are stated in the supporting document. A video demonstrating the use of ROIAL-NMR is also included in the supporting document (Video S1).

Of note, ROIAL-NMR is a post-processing program that identifies potential metabolites from user-defined ROIs based on a user-selected metabolic database, such as the HMDB. Its use therefore assumes that the experimental conditions of the user’s measurements (used to define these ROIs) agree with those used to generate the database. This includes agreement in values such as pH, NMR acquisition parameters, magnetic field strength, etc. For multi-group comparisons, ROIAL-NMR also assumes that samples from all groups are prepared, measured, and processed under the same experimental conditions.

Data availability

Data sharing is not applicable to this article as no new data were created or analyzed in this study.

Identification of metabolites

Details of the ROIAL-NMR program workflow are summarized in Fig. 2, including inputs, outputs, and categorization procedures. In Fig. 2A, the first input, consisting of user-specified ROIs, is shown in pink. NMR measurable metabolites consisting of a list of metabolites specific to the sample type (serum, urine, etc.) obtained from HMDB are shown in light blue. Using this information, ROIAL-NMR identifies potential metabolites that appear in the user-specified ROIs. To evaluate metabolites related to a particular disease, a list of disease-specific metabolites available from the HMDB can be used as the second input, shown in red in Fig. 2A.

Fig. 2. The design of the ROIAL-NMR program.

A) An overview of the ROIAL-NMR program, from top to bottom: import ROI data and HMDB datasets for metabolite identification, calculate parameters for metabolite categorization, include optional filtering by disease-relevance, and output the final list. B) A decision-making tree illustrates the specific criteria evaluated during the categorization step in A).

Chemical shift data from the HMDB were organized according to metabolic spectral information and tissue types. Two program-specific metabolic databases were generated: one containing the chemical shift values for metabolites having NMR assignments available in the HMDB, and another containing spectral region resonance multiplets (SRRMs), i.e. multiplicities due to J-couplings, for each of these identified metabolites. In this study, as seen in the supporting document Figure S1, ROI refers to the chemical shift range determined by curve-fitting the overall NMR spectrum into multiple ‘resonance peaks’ and may contain a number of individual metabolites (in each ROI), which are then identified using a metabolite-specific SRRM (i.e., resonance frequencies, and their peak multiplicities).

Separate biosample-specific databases for each sample type available in the HMDB can also be created, including plasma, serum, cerebrospinal fluid (CSF), feces, saliva, sweat, and urine. Finally, taking the intersection of all inputs (ROI specifications, metabolic NMR information, and sample-specific information), all metabolites with chemical shift values lying within the determined ROIs can be identified by ROIAL-NMR.

Including both chemical shift values and metabolite-specific SRRMs for a given metabolite may seem redundant; however, we found that both are required for accurate metabolite identification. The SRRMs corresponding to a metabolite of interest are used by ROIAL-NMR to calculate the “match ratio,” which describes the degree to which a metabolite’s NMR spectral multiplicity can be represented in the determined ROIs. However, HMDB-defined SRRMs may partially or inconsistently overlap with experimentally determined ROIs, which renders some of the SRRMs for a metabolite of interest incompatible with identified ROIs. For example, lactate has two spectral regions or SRRMs (as defined by HMDB) at 1.30-1.33 and 4.08-4.13 ppm, and the ROIs experimentally identified via optimized curve-fitting are 1.31-1.34 and 4.10-4.12 ppm. As such, due to the slight variations between our ROIs and the HMDB SRRMs, using only metabolite spectral regions can make accurate comparisons difficult.

To overcome this difficulty, ROIAL-NMR also evaluates metabolite matches according to whether a metabolite’s peaks, listed as chemical shift values, fall within the determined ROIs. If at least one peak is included within the regions defined by a ROI, then the SRRM that encompasses the included peak is considered a “match” by the program. Referring to lactate as an example, the HMDB defines six peaks within its two spectral regions: peaks at 1.310 and 1.324 ppm correspond to spectral region or SRRM #1 [1.30-1.33], and peaks at 4.084, 4.098, 4.111, and 4.125 ppm correspond to SRRM #2 [4.08-4.13]. The peak at 4.111 ppm lies within the experimentally determined ROI [4.10-4.12], and the peak at 1.324 ppm lies within ROI [1.31-1.34] ppm. Therefore, ROIAL-NMR considers both of lactate’s SRRMs to be represented in the experimental data and reports a “match ratio” of 2/2, even though the HMDB-reported SRRM did not align precisely with either ROI. The more SRRMs from a given metabolite that are matched to ROIs, the higher the match ratio and the greater the likelihood that a particular metabolite is present in the tested samples.
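The lactate example above can be reproduced with a few lines of Python. This is a minimal sketch of the peak-in-ROI matching rule as described in the text, not the ROIAL-NMR source code; the function names and the dictionary layout are assumptions, while the SRRM ranges, peak positions, and ROIs are the values quoted above.

```python
from typing import Dict, List, Tuple

Region = Tuple[float, float]  # (low_ppm, high_ppm)


def peak_in_any_roi(peak: float, rois: List[Region]) -> bool:
    """True if the peak lies inside at least one ROI (bounds inclusive)."""
    return any(low <= peak <= high for low, high in rois)


def match_ratio(srrm_peaks: Dict[Region, List[float]],
                rois: List[Region]) -> Tuple[int, int]:
    """Return (matched SRRMs, total SRRMs) for one metabolite.

    An SRRM counts as matched when at least one of its peaks falls
    within an ROI, even if the SRRM range itself does not align
    exactly with that ROI.
    """
    matched = sum(
        1 for peaks in srrm_peaks.values()
        if any(peak_in_any_roi(p, rois) for p in peaks)
    )
    return matched, len(srrm_peaks)


# HMDB-defined SRRMs and peaks for lactate, as quoted in the text.
lactate = {
    (1.30, 1.33): [1.310, 1.324],
    (4.08, 4.13): [4.084, 4.098, 4.111, 4.125],
}
# Experimentally determined ROIs, as quoted in the text.
rois = [(1.31, 1.34), (4.10, 4.12)]

matched, total = match_ratio(lactate, rois)
print(f"{matched}/{total}")  # prints "2/2"
```

Matching on individual peaks rather than on whole SRRM ranges is what makes the rule tolerant of small offsets between database regions and fitted ROIs.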

However, when monitoring biological samples, usually only a small segment of the NMR spectrum, such as the up-field region from the water resonance (e.g., 0.5-4.5 ppm), is analyzed, due to the low signal-to-noise ratio in the down-field, or aromatic, region. As such, ROIAL-NMR excludes regions/peaks outside the segment of interest for each metabolite prior to calculating the match ratio. If more than 50% of a potential metabolite’s peaks lie beyond the analyzed region of interest, ROIAL-NMR excludes those peaks from the calculation of the match ratio.

For a single population study, ROIAL-NMR will complete its analysis and output the final list of potential metabolites in about 5 seconds. However, for a multi-group study, such as disease vs. healthy control investigations, further categorizations or filtering of the resulting list of potential metabolites will be performed by ROIAL-NMR. In particular, the metabolites will be further organized in terms of the integrated peak areas and peak intensities of each metabolite in each group, the disease-based (or test-condition-based) variation expressed as a trend, and the levels of significance of these trends (if this information is provided by the user). Therefore, a multi-group study will typically take less than 10 seconds for ROIAL-NMR.

Categorization of metabolites

The list of possible metabolites that are identified in the evaluated spectral regions can be further categorized with user-generated information about each ROI, including intensity or concentration trends and significance levels. This can be done with or without corrections for multiple comparisons (e.g. false discovery rate, FDR, or Bonferroni correction for Type I errors). ROIAL-NMR uses the procedures shown in Fig. 2B (and described below) to define inclusion and significance thresholds for each metabolite.

First, ROIAL-NMR will evaluate the “match ratio” parameter. This parameter is the ratio of the number of a metabolite’s SRRMs present in the input ROIs to the total number of SRRMs for that metabolite. Only metabolites with a match ratio ≥ 50%, i.e., with at least half of their SRRMs within the analyzed ROIs, are kept for further categorization. Furthermore, if half of a metabolite’s resonance regions, or SRRMs, lie outside of the analyzed spectral region (i.e., outside of the 0.5-4.5 ppm used for biofluid analysis), this metabolite will be eliminated before calculating the match ratio to reduce overrepresentation of questionable metabolites.
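The inclusion rule can be sketched as a small predicate. This is an illustration under stated assumptions rather than the program's actual logic: "half" is read here as "at least half", an SRRM counts as inside the window only if its whole range is inside, and the function and parameter names are hypothetical. The SRRM ranges in the usage lines are made up.

```python
from typing import List, Tuple

Region = Tuple[float, float]  # (low_ppm, high_ppm)


def inside_window(region: Region, window: Region) -> bool:
    """True if the whole SRRM range lies within the analyzed window."""
    return window[0] <= region[0] and region[1] <= window[1]


def keep_metabolite(srrms: List[Region],
                    matched: List[bool],
                    window: Region = (0.5, 4.5),
                    min_ratio: float = 0.5) -> bool:
    """Apply the window check, then the match-ratio threshold.

    matched[i] says whether SRRM i was matched to an ROI.
    """
    in_window = [inside_window(r, window) for r in srrms]
    n_in = sum(in_window)
    n_out = len(srrms) - n_in
    # Assumption: eliminate when at least half of the SRRMs are outside.
    if n_in == 0 or 2 * n_out >= len(srrms):
        return False
    # Only SRRMs inside the window enter the match ratio.
    n_matched = sum(m for m, w in zip(matched, in_window) if w)
    return n_matched / n_in >= min_ratio


# One of three SRRMs is aromatic (outside 0.5-4.5 ppm); 1 of the 2
# remaining SRRMs is matched, so the ratio is 0.5 and it is kept.
print(keep_metabolite([(1.30, 1.33), (4.08, 4.13), (7.30, 7.40)],
                      [True, False, False]))
```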

Second, ROIAL-NMR will consider the “significance ratio”. This parameter is the proportion of a metabolite’s SRRMs that fall within statistically significant ROIs (p<0.05). If the significance ratio is > 50%, then the metabolite is considered significant. FDR (or Bonferroni correction) is also considered to determine the level of significance for each identified metabolite. Specifically, if at least one of the SRRMs of a given metabolite remains significant after FDR correction (p<0.05), then that metabolite is considered to be FDR-significant.
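A possible way to turn this rule into code is shown below. It is a sketch of the rule as worded in this paragraph, with hypothetical function and argument names; in particular, it assumes that a single FDR-significant SRRM is enough for the "FDR-significant" label regardless of the significance ratio, and that unmatched SRRMs are passed as `None`. The p-values in the usage lines are invented.

```python
from typing import List, Optional


def significance_level(p_values: List[Optional[float]],
                       fdr_p_values: List[Optional[float]],
                       alpha: float = 0.05) -> str:
    """Classify one metabolite from the p-values of its SRRMs.

    p_values[i] / fdr_p_values[i] are the raw and FDR-corrected p-values
    of the ROI matched to SRRM i, or None if SRRM i has no matching ROI.
    """
    n_total = len(p_values)
    n_sig = sum(1 for p in p_values if p is not None and p < alpha)
    any_fdr = any(q is not None and q < alpha for q in fdr_p_values)

    if any_fdr:
        return "FDR-significant"
    # Significance ratio: significant SRRMs over all SRRMs, strictly > 50%.
    if n_total and n_sig / n_total > 0.5:
        return "significant"
    return "not significant"


print(significance_level([0.01, 0.20], [0.04, 0.50]))              # FDR-significant
print(significance_level([0.01, 0.03, 0.20], [0.08, 0.09, 0.60]))  # significant
print(significance_level([0.01, 0.20], [0.10, 0.60]))              # not significant
```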

Third, to identify the relative intensity or concentration change of metabolites, ROIAL-NMR compares each metabolite’s SRRMs based on the integral intensities (trends) within ROIs between the tested (e.g., disease) group and the control (e.g., healthy) group. These signal intensity trends may be classified into three categories: increasing, decreasing, or no trend. An increasing trend means that all SRRM peaks lying within ROIs always exhibit increased intensity in the tested (e.g., diseased) group compared to the control condition. Decreasing trends describe metabolites whose SRRM peaks lie in ROIs where the peak intensities are always decreasing (again, compared to controls), and no trend refers to metabolites whose SRRMs both increase and decrease within different spectral ROIs. In the end, only metabolites whose regions all follow the same trend, that is increasing or decreasing, are reported as shown in Fig. 2B.

However, to avoid eliminating potentially significant metabolites, we allow signal intensity trends in the FDR-significant ROIs of a metabolite to override trends in non-significant ROIs. For instance, if a metabolite resides in two regions with one defined as “FDR-significant increase” (i.e., increased significantly after FDR correction), while the other appears as a “non-significant decrease”, then this metabolite is categorized as having an overall “FDR-significant increase”. On the other hand, if a metabolite has both an “FDR-significant increase” and a “significant decrease” set of ROIs, then it is categorized as having no trend and will not be included in the output table.
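The trend rule and its override can be sketched as follows. This is one strict reading of the two paragraphs above, not the program's own code: only FDR-significant regions may override, and only when every opposing region is non-significant. The function name, the tuple layout, and the label strings are assumptions.

```python
from typing import List, Optional, Tuple

# Each matched region is (trend, level), with trend in {"increase", "decrease"}
# and level in {"not significant", "significant", "FDR-significant"}.
RegionCall = Tuple[str, str]


def overall_trend(regions: List[RegionCall]) -> Optional[str]:
    """Return "increase", "decrease", or None when there is no trend."""
    trends = {trend for trend, _ in regions}
    if len(trends) == 1:
        return next(iter(trends))  # all regions agree

    # Mixed trends: FDR-significant regions may override.
    fdr_trends = {trend for trend, level in regions
                  if level == "FDR-significant"}
    if len(fdr_trends) != 1:
        return None  # no FDR-significant region, or they disagree

    winner = next(iter(fdr_trends))
    opposing_levels = {level for trend, level in regions if trend != winner}
    # Override only if every opposing region is non-significant.
    if opposing_levels == {"not significant"}:
        return winner
    return None


# The two cases described in the text:
print(overall_trend([("increase", "FDR-significant"),
                     ("decrease", "not significant")]))  # increase
print(overall_trend([("increase", "FDR-significant"),
                     ("decrease", "significant")]))      # None
```

A metabolite for which this returns `None` corresponds to the "no trend" case and would be left out of the output table.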

Once the match ratio, significance ratio, and signal intensity trends are calculated, the metabolites can finally be categorized. We define six different categories as follows: 1) increase, not significant; 2) increase, significant; 3) increase, FDR-significant; 4) decrease, not significant; 5) decrease, significant; and 6) decrease, FDR-significant. A summary of the categorization procedure is shown in Fig. 2B.

Due to the extensive collections of metabolites with NMR data in the HMDB, many putative metabolites are less likely to appear in the NMR spectra of certain biological samples at measurable concentrations. To minimize this complication, ROIAL-NMR also applies a concentration filter to exclude and annotate metabolites based on their reported concentrations in specific biofluids, as listed in the HMDB. Specifically, by referencing the normal concentration for each metabolite in each biosample type from adult subjects listed in the HMDB, we use the highest and lowest reported values as their potential concentration limits. The program then excludes metabolites with concentration limits below 5 μM (the lower limit of detection by most NMR instruments) and classifies the remaining metabolites into three groups: those between 5 μM and 500 μM, those greater than 500 μM, and those without quantified reports. Metabolites without available concentration data in the HMDB are marked as “not quantified” and are included in the results for their potential investigative value.
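The concentration filter can be illustrated with a short function. The text does not say which of the two limits is compared against each threshold, so this sketch assumes the highest reported value decides both exclusion (< 5 μM) and the "> 500 μM" class. The function name, the label strings, and the example concentrations are hypothetical.

```python
from typing import Optional, Sequence


def concentration_class(reported_uM: Optional[Sequence[float]],
                        detection_limit: float = 5.0,
                        high_cutoff: float = 500.0) -> Optional[str]:
    """Classify a metabolite by its reported normal adult concentrations.

    Returns None when the metabolite should be excluded, otherwise one of
    "5-500 uM", "> 500 uM", or "not quantified".
    """
    if not reported_uM:
        # No concentration data available: keep, but annotate.
        return "not quantified"
    highest = max(reported_uM)
    if highest < detection_limit:
        return None  # assumed to be below the NMR detection limit
    if highest > high_cutoff:
        return "> 500 uM"
    return "5-500 uM"


# Made-up concentration reports (in uM) for illustration only.
print(concentration_class([1200.0, 1500.0]))  # > 500 uM
print(concentration_class([3.0, 40.0]))       # 5-500 uM
print(concentration_class([0.5, 2.0]))        # None (excluded)
print(concentration_class([]))                # not quantified
```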

An additional, optional input for disease relevance may be applied at this stage. Disease-specific metabolite lists may be prepared by the user, using the HMDB or other reliable sources. If a disease-specific metabolite list is used, the final output will “tag” metabolites that have been previously identified and/or hypothesized to be relevant for a particular disease or condition. If omitted, the final output will include all potential metabolites identified from the given ROIs, without any indication of disease relevance.
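The optional tagging step amounts to a set-membership lookup. The sketch below assumes user-prepared lists supplied as a dictionary from disease label to a set of metabolite names. The function name and the list contents are hypothetical placeholders, not the disease lists shipped with ROIAL-NMR, and they make no claim about which metabolites are actually disease-relevant.

```python
from typing import Dict, Iterable, List, Optional, Set


def tag_metabolites(candidates: Iterable[str],
                    disease_lists: Optional[Dict[str, Set[str]]] = None
                    ) -> Dict[str, List[str]]:
    """Attach disease tags to each candidate metabolite.

    With no disease lists, every metabolite is kept with an empty tag
    list, i.e. the output carries no indication of disease relevance.
    """
    disease_lists = disease_lists or {}
    return {
        name: sorted(disease for disease, members in disease_lists.items()
                     if name in members)
        for name in candidates
    }


# Placeholder names and lists, for illustration only.
candidates = ["metabolite A", "metabolite B", "metabolite C"]
user_lists = {
    "LC": {"metabolite A", "metabolite B"},
    "ADRD": {"metabolite B"},
}

print(tag_metabolites(candidates, user_lists))
print(tag_metabolites(candidates))  # no lists: all tags empty
```

Tagging rather than filtering keeps untagged metabolites in the output, which matches the text's statement that all potential metabolites are reported either way.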

The final output of ROIAL-NMR is represented by the integration of all datasets used by the program, as shown in Fig. 3. With this information, the user may organize the provided list of metabolites, along with their concentrations, significance, peak intensity trends, and disease-relevance indicators, in any manner they wish.

Fig. 3. Relationships among databases and results generated from ROIAL-NMR.

A) The Venn diagram illustrates the intersection of: metabolites found in the NMR spectral data from a study (the only required user input for the program); metabolites associated with the investigated condition (disease), which may be found in the HMDB or in databases established by the user; metabolites derived from the analyzed sample type, such as urine, serum, CSF, saliva, feces, or sweat; and metabolites in that sample type with concentrations ≥5 μM. The central area, labeled as “The final metabolites,” represents metabolites identified by the program. B) A simplified flowchart of ROIAL-NMR.

NMR ROIs – Sample Set and Data Collection

Data used in this report were obtained from our recent study evaluating lung cancer (LC) patients, with and without Alzheimer’s disease and related dementia (ADRD), using proton (1H) NMR spectra collected on serum samples. Since the purpose of using these data is to demonstrate the utility of the program, we will not extensively focus on the NMR experimental details, as they are presented in a separate report. Here, we will only describe parameters pertinent to the input for the ROIAL-NMR program. Serum NMR was analyzed on a Bruker Avance III 600 MHz spectrometer, using a 4 mm high resolution magic angle spinning (HRMAS) probe. Chemical shift reference calibrations were achieved by using the up-field peak of the lactic acid doublet (1.32 ppm, relative to DSS). Spectral intensities based on curve-fitting were calculated from all deconvoluted spectral regions within 0.5–4.5 ppm, from which 18 ROIs were determined, and subjected to statistical analyses.

3. Results – An application of ROIAL-NMR

The utility and functionality of ROIAL-NMR are illustrated using 1H NMR data from the clinical study described above. In that study, we investigated the metabolomics of lung cancer (LC) patients with or without Alzheimer’s disease and related dementia (ADRD), and compared them with healthy controls, using serum HRMAS NMR spectroscopy.

Based on our spectral analyses, 18 spectral ROIs were identified and statistically analyzed to determine metabolite changes under diseased conditions relative to the healthy controls. Following statistical analysis, levels of significance, with and without FDR corrections, were summarized for all 18 ROIs in a table (Fig. 4). Additionally, peak intensity trend information, i.e., whether a particular spectral region was observed to increase or decrease in the tested condition relative to the control condition, was recorded for each ROI. Although trends are defined as relative to the control here, they may describe a comparison of any two groups. For our study, we made two comparisons: LC vs. Healthy Control, and LC-ADRD vs. LC-non-ADRD.

Fig. 4. ROI identification in the LC-ADRD study.

Hierarchical clustering identified 18 ROIs that showed differences between patient groups. Wilcoxon/Kruskal-Wallis tests and false discovery rate (FDR) tests (with p-value ≤ 0.05 representing significance) were conducted to evaluate the statistical significance of each ROI.

For this study, metabolites previously associated with LC or ADRD were manually identified through a detailed literature search and cross-referenced with data from the HMDB. For other diseases, users can input their own disease-relevant metabolite lists into the program to automatically highlight them. Currently, ROIAL-NMR includes relevant metabolites for three conditions: lung cancer, Alzheimer’s Disease, and prostate cancer. Fig. 5 illustrates one possible method to present the final output of all potential LC- and ADRD-related metabolites along with their estimated significance levels. Metabolites identified and categorized according to their ROIs can also be illustrated in a different table format as shown in the supplementary material Table S1. Fig. 5 exhibits all ROIAL-NMR assigned metabolites using data from the HMDB. Of note, there are a few metabolites (e.g., Canrenone (Can), Cytarabine (Ara-C), Perillyl alcohol (POH)) which are not endogenously produced, but potentially taken by the patients as a part of their treatment regimen.7–9 In this manner, ROIAL-NMR can identify and assign all metabolites in a complex sample without bias for further evaluation. However, users need to evaluate the final results based on their knowledge of physiology, pathology, pharmacology, diet, etc. of the groups they are measuring.

Fig. 5. ROIAL-NMR output.

A) Potential metabolites associated with LC and ADRD. LC vs. Healthy Control refers specifically to LC-non-ADRD vs. Ctrl-LC-non-ADRD. Metabolites in red indicate a possible concentration greater than 500 μM according to the HMDB, while those in black represent concentrations between 5 μM and 500 μM, and blue denotes metabolites that are not quantified. Metabolites with a double underline present the same trends for both the LC vs. Ctrl and LC-ADRD vs. LC-non-ADRD comparison groups, with at least one comparison demonstrating “FDR significance” (dark grey) or “significance without FDR” (grey), while single-underlined metabolites present opposite trends for the two comparison groups. *: metabolites identified to be relevant to LC; #: metabolites relevant for ADRD. B) Legend for metabolite categorization. Three categories for metabolite grouping, based on levels of significance, are denoted. Significance thresholds are described in the text. Dark grey represents metabolites where half the SRRMs fall into significant ROIs and at least one SRRM remains significant after FDR correction, grey represents metabolites where half the SRRMs fall into the significant ROIs but no SRRM remains significant after FDR correction, and light grey represents metabolites where less than half of the SRRMs are in the significant ROIs. A list of metabolite abbreviations is provided in Table S2 in the supporting documentation.

4. Discussion

Here, we present a newly developed Python program, ROIAL-NMR, that is capable of identifying and categorizing potential metabolites from user-defined NMR spectral ROIs. We recognize that using user-defined NMR spectral ROIs can lead to variable results obtained by different users. However, the main purpose of ROIAL-NMR is to assist with the identification and organization of metabolites within chemically complex samples assuming the user has correctly identified their ROIs. Automated identification of ROIs is beyond the scope of the current version of the program. Given this assumption, users should be consistent in the manner they create their ROIs. If this is done, ROIAL-NMR will be able to use the HMDB SRRM information to identify and organize the metabolites.

Fundamentally, ROIAL-NMR is designed to be a tool that can assist users in the compound identification process following NMR spectral data processing. Compounds are identified based on publicly available metabolite databases containing metabolite NMR assignments and can be further organized with various considerations of trends (increasing, decreasing), their significance levels, and disease relevance. While individual metabolites may be identified using specific chemical shift values from existing NMR metabolite databases, spectral ROIs, which are key to identifying metabolites in untargeted NMR-based metabolomics, are not currently available in these databases. Here, we used HMDB as a demonstrating platform; however, ROIAL-NMR can utilize any metabolite database containing NMR assignments. With access to extensive and free-to-use online resources, ROIAL-NMR can incorporate all metabolites from databases such as the HMDB, BMRB, and more.

For untargeted 1H NMR-based metabolomics, a common problem is the spectral overlap of various metabolite signals in a specific 1D spectral region. In a single ROI, ROIAL-NMR reports all possible metabolites within it, whether or not there is peak overlap. However, by applying a match ratio calculated across all ROIs, ROIAL-NMR effectively avoids over-identifying metabolites. The large annotation coverage of this method makes ROIAL-NMR more applicable for untargeted metabolomic studies in metabolically complex samples, such as cell growth media, cell lysates, and urine, compared to more traditional spectral peak fitting techniques where crowded spectra make analyses difficult to conduct and metabolite identities difficult to verify. Furthermore, numerous technical advancements made to address the concern of dominating signals, such as water and lipids, in complex biological samples can improve the accuracy of ROIAL-NMR results. For example, water is ubiquitous in most biological samples, where specialized water suppression techniques may be applied to enhance metabolite resonances, thus achieving high signal-to-noise ratios with varying performance across samples.10–12 Similarly, sugars and lipids may also be prevalent, where recent advancements have focused on suppressing their signal for more sensitive and selective metabolite measurements.13–15 These advanced suppression techniques not only improve NMR sensitivity and resolution but can also help ROIAL-NMR to identify low-concentration metabolites that may otherwise be masked under dominating solvent and other metabolite signals.

The most important advantage of ROIAL-NMR is its ability to systematically and collectively consider multiple metabolites from multiple spectral ROIs, simultaneously. While manual or semi-automated metabolite identification based on available NMR databases is possible, this process can be tedious, labor-intensive, and prone to error.16,17 This is especially true given the rapid advancement of NMR technologies, and the development of increasingly large and complex reference spectral databases. As such, the identification of metabolites in complex biological mixtures is a challenge and is only possible for the most abundant metabolites.16 Even with peak-picking, binning algorithms,18 and other approaches to reduce overlap and enhance spectral resolution,19 deconvolution of NMR data remains a challenge in complex mixture analysis.17

Although several automatic metabolite searching programs exist for NMR metabolomics, they still have some limitations.20,21 For example, some current machine learning approaches are still “black boxes” with regard to their interpretability,22 and may not be entirely reliable for accurate and representative metabolite identification from experimental NMR spectra. Other types of metabolite identification programs, such as NMR-TS, focus on elucidating unknown metabolites from a homogenous single-compound sample.21 However, these conditions are unrealistic for untargeted NMR metabolomic studies, since biological samples contain a heterogeneous mix of well-known metabolites.

For metabolomic analyses in complex mixtures, current molecular identification programs such as AMIX (Bruker BioSpin)23 and the Chenomx NMR Suite24 focus on model-based metabolite fitting, with peak-picking software for spectral processing and manual metabolite identification through visual comparison or visual assessment of similarity scores. Manual operations, however, remain time-consuming (often more than 1 hour per sample) and are prone to errors. Several automated programs have been developed to deconvolve and quantify metabolites via 1D NMR spectra, such as MagMet,25 BATMAN, which uses a Bayesian model,26 Bayesil, which uses a probabilistic graphical model,27,28 and other sample-specific metabolite identification programs.29 BATMAN includes a database of chemical shift and J-coupling data from the HMDB, which can also be amended by users. Similarly, Bayesil references the HMDB but contains several sub-libraries for different biofluids, namely serum (containing 47 metabolites), plasma (50 metabolites), and CSF (43 metabolites). Likewise, MagMet automatically performs spectral processing (Fourier transformation, phase and baseline correction, post-processing water suppression, chemical shift calibration) and uses metabolite-specific J-coupling information from a curated library of 85 serum/plasma metabolites for high-speed identification and quantification.

These programs provide accurate spectral profiling but have their own limitations. Overlapping signals from chemically similar compounds, or from compounds with only a single resonance, may confuse these programs,21 causing inaccurate identification and quantification. Additionally, because accurate peak fitting in these programs depends on high-quality datasets, the number of detectable compounds may be limited. In other cases, the NMR reference datasets used in these programs are quite small (<100 compounds) and are often not publicly available or need to be custom-built. Rather than relying on small, inaccessible databases and complex scoring schemes, ROIAL-NMR uses large, openly accessible databases and a simpler scoring scheme that reports a clear match ratio of each database metabolite to the users’ data, thereby enabling further metabolite evaluation.
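The idea of a match ratio can be illustrated with a minimal sketch. The function names, the data layout, and the exact definition used here (the fraction of a metabolite's reference peaks that fall inside any user-defined ROI) are assumptions made for illustration; they are not taken from the ROIAL-NMR source code, and the chemical shifts are placeholders rather than HMDB values.

```python
from typing import Dict, List, Tuple

Roi = Tuple[float, float]  # (lower_ppm, upper_ppm)


def match_ratio(peaks_ppm: List[float], rois: List[Roi]) -> float:
    """Return the fraction of reference peaks that fall inside any ROI."""
    if not peaks_ppm:
        return 0.0
    matched = sum(
        1 for peak in peaks_ppm
        if any(low <= peak <= high for low, high in rois)
    )
    return matched / len(peaks_ppm)


def rank_metabolites(
    library: Dict[str, List[float]], rois: List[Roi]
) -> List[Tuple[str, float]]:
    """Score every library entry and keep those with at least one match."""
    scored = [(name, match_ratio(peaks, rois)) for name, peaks in library.items()]
    hits = [(name, ratio) for name, ratio in scored if ratio > 0]
    # Highest ratio first; ties are broken alphabetically for a stable table.
    return sorted(hits, key=lambda item: (-item[1], item[0]))


# Placeholder shifts for illustration only; these are not HMDB values.
toy_library = {
    "metabolite_A": [1.32, 4.10],
    "metabolite_B": [1.47, 3.77],
    "metabolite_C": [2.00],
}
toy_rois = [(1.30, 1.35), (3.75, 3.80)]

ranking = rank_metabolites(toy_library, toy_rois)
for name, ratio in ranking:
    print(f"{name}: {ratio:.2f}")
```

In this toy case, metabolite_A and metabolite_B each have one of two peaks inside an ROI (ratio 0.50), while metabolite_C has none and is dropped. A ratio of this kind is easy to inspect by hand, which is the interpretability advantage described above.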

With chemical shift data for a total of 891 metabolites from the HMDB (637 serum, 598 urine, 238 CSF, 299 saliva, 502 feces, 99 sweat), ROIAL-NMR is able to identify and categorize a large number of potential metabolites, and this number will continue to grow as the HMDB grows. Provided that users identify the ROIs for all spectra, there is no limit to how many spectra can be analyzed using ROIAL-NMR. This is advantageous when analyzing large numbers of samples, as is typically done for pre-clinical and clinical biomarker assessment. In our example, we identified 88 potential metabolites from 18 ROIs, with 66 metabolites identified as relevant to LC and 80 metabolites relevant to ADRD. The total time taken by ROIAL-NMR to analyze a set of identified ROIs (from 36 spectra) for 88 potential metabolites was 5 seconds. Manual or semi-automatic deconvolution and other basic operations needed to produce ROIs from a 1D spectrum typically take 7 to 8 minutes, and experienced operators may complete them more quickly. Even with this ROI preparation step included, the workflow is significantly faster than the approximately one hour per spectrum that manual analysis of the same number of spectra with AMIX (Bruker BioSpin)17 or the Chenomx NMR Suite24 would have taken.
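The timing comparison can be made concrete with a short back-of-the-envelope calculation. It assumes that the quoted per-spectrum figures apply uniformly to all 36 spectra and that ROI preparation is done serially; both are simplifying assumptions for illustration, not measurements reported in this study.

```python
N_SPECTRA = 36                    # spectra in the example study
ROI_MINUTES = (7.0, 8.0)          # quoted range per spectrum for producing ROIs
ROIAL_SECONDS = 5.0               # quoted ROIAL-NMR run time for the ROI set
MANUAL_HOURS_PER_SPECTRUM = 1.0   # approximate manual identification time


def roi_workflow_hours(minutes_per_spectrum: float) -> float:
    """Total hours for ROI preparation of all spectra plus one ROIAL-NMR run."""
    total_seconds = N_SPECTRA * minutes_per_spectrum * 60.0 + ROIAL_SECONDS
    return total_seconds / 3600.0


fast_hours = roi_workflow_hours(ROI_MINUTES[0])
slow_hours = roi_workflow_hours(ROI_MINUTES[1])
manual_hours = N_SPECTRA * MANUAL_HOURS_PER_SPECTRUM

print(f"ROI workflow: {fast_hours:.1f} to {slow_hours:.1f} h")
print(f"Manual identification: {manual_hours:.0f} h")
print(f"Speed-up: {manual_hours / slow_hours:.1f}x to {manual_hours / fast_hours:.1f}x")
```

Under these assumptions the ROI-based workflow takes roughly 4.2 to 4.8 hours for 36 spectra, against about 36 hours of manual identification, and almost all of that time is spent on ROI preparation rather than on the ROIAL-NMR search itself.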

ROIAL-NMR generates highly interpretable output that can be analyzed by users of any skill level. Python is a widely used and easy-to-learn language that runs on most operating systems, making the program easily accessible and amenable to adaptation. Additionally, ROIAL-NMR can analyze multiple NMR spectra of both simple and chemically complex samples. Another key strength of this program is its high degree of flexibility, allowing users to customize and adjust filter parameters according to the specific needs of their studies. The methodology presented here is also generalizable to other spectral analysis modalities, such as Raman spectroscopy or mass spectrometry.

Since ROIAL-NMR was designed to streamline the manual metabolite-searching and identification process, it relies upon the user to provide accurate spectral ROIs, levels of significance, and other information. As a word of caution, ROIAL-NMR was not built to test the accuracy of the user-defined ROIs, but rather to identify potential metabolites within these user-defined ROIs. Moreover, changes in metabolite chemical shifts caused by sample matrix effects (e.g., pH, salinity) can make matching generated ROIs to HMDB spectral regions challenging. In some cases, these matrix effects can change chemical shifts by up to 1 ppm (especially for imidazole-containing compounds). Because ROIAL-NMR uses chemical shift values from existing databases, the sample conditions must match those of the database. For the HMDB,30 this means that samples must be maintained at a pH of 7.
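One common way to absorb small matrix-induced shift differences is to widen each ROI by a tolerance before comparing it with database shifts. The sketch below illustrates that idea only; the function `peak_in_roi` and the parameter `tolerance_ppm` are hypothetical names, the shift values are placeholders, and nothing here is claimed to reflect how ROIAL-NMR itself handles tolerances.

```python
from typing import Tuple


def peak_in_roi(
    peak_ppm: float, roi: Tuple[float, float], tolerance_ppm: float = 0.0
) -> bool:
    """Check whether a database peak lies inside an ROI widened by a tolerance."""
    if tolerance_ppm < 0:
        raise ValueError("tolerance_ppm must be non-negative")
    # Accept ROI bounds in either order (NMR axes are often written high to low).
    low, high = min(roi), max(roi)
    return (low - tolerance_ppm) <= peak_ppm <= (high + tolerance_ppm)


# Placeholder values: a database peak sitting just outside a user-defined ROI.
database_peak = 7.08
user_roi = (7.10, 7.20)

strict_match = peak_in_roi(database_peak, user_roi)
relaxed_match = peak_in_roi(database_peak, user_roi, tolerance_ppm=0.05)
print(strict_match, relaxed_match)
```

A wider tolerance recovers peaks displaced by pH or salinity but also admits more candidate metabolites per ROI, so any such setting trades sensitivity against specificity. It cannot compensate for shifts approaching 1 ppm, which is why matching the sample conditions to those of the database remains the primary safeguard.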

Because ROIAL-NMR is designed to identify metabolites from complex samples and organize them according to their relative changes (compared to controls or other treatment parameters), absolute quantification of the compounds is not the primary aim of the program. However, information about their relative concentration trends can still be interpreted. Furthermore, as ROIAL-NMR relies on NMR databases for accurate metabolite identification, it is crucial to ensure that these databases are consistently updated. This necessitates regular checks and validations to confirm that ROIAL-NMR is accessing the most up-to-date spectral data from whichever NMR database is chosen by the user. Future developments with ROIAL-NMR are planned to incorporate periodic and automatic synchronization of NMR chemical shift databases to reduce manual input.

5. Conclusion

We have developed and tested a Python program, ROIAL-NMR, that systematically identifies potential metabolites measured from user-defined 1D 1H NMR spectral regions of interest using several modes of categorization. These include levels of significance, concentration trends, and disease relevance. We have shown that implementation of ROIAL-NMR can rapidly produce metabolite identifications for an untargeted metabolomics NMR study. By using an easily accessible and flexible program to automate the traditionally tedious and manual database-searching process, we have shown that it is possible to accelerate, simplify and enhance the metabolite identification process for NMR-based metabolomics research. The versatility of ROIAL-NMR should enable more facile exploration of disease-related metabolites and facilitate the broader implementation of untargeted NMR-based metabolomics.

Supplementary Material

GUI video: How to use ROIAL-NMR
Supplementary document

Acknowledgements:

We would like to thank the National Institutes of Health (NIH), National Institute on Aging (NIA) and National Cancer Institute (NCI) for funding support: R01 AG070257 and CA273010, as well as the Massachusetts General Hospital Athinoula A. Martinos Center for Biomedical Imaging. We would also like to thank the Canadian Institutes of Health Research (CIHR) for the following postdoctoral training fellowship: MFE-194064. Authors JC, AR, RGB and EJZ contributed equally.

Abbreviations:

ADRD: Alzheimer’s Disease and Related Dementia
BMRB: Biological Magnetic Resonance Bank
CSF: Cerebrospinal Fluid
FDR: False Discovery Rate
HMDB: Human Metabolome Database
ROIAL-NMR: Regions of Interest Assessment of Liquids by NMR
ROI: Spectral Regions of Interest
SRRM: Spectral Region Resonance Multiples

Footnotes

Conflict of Interest:

The authors report no conflicts of interest.

References

  • 1. Schrimpe-Rutledge AC, Codreanu SG, Sherrod SD, McLean JA. Untargeted Metabolomics Strategies-Challenges and Emerging Directions. J Am Soc Mass Spectrom. 2016;27(12):1897–1905.
  • 2. Bujak R, Struck-Lewicka W, Markuszewski MJ, Kaliszan R. Metabolomics for laboratory diagnostics. J Pharm Biomed Anal. 2015;113:108–120.
  • 3. Cheung PK, Ma MH, Tse HF, et al. The applications of metabolomics in the molecular diagnostics of cancer. Expert Rev Mol Diagn. 2019;19(9):785–793.
  • 4. Emwas AH. The strengths and weaknesses of NMR spectroscopy and mass spectrometry with particular focus on metabolomics research. Methods Mol Biol. 2015;1277:161–193.
  • 5. Wishart DS, Guo A, Oler E, et al. HMDB 5.0: the Human Metabolome Database for 2022. Nucleic Acids Res. 2022;50(D1):D622–D631.
  • 6. Ulrich EL, Akutsu H, Doreleijers JF, et al. BioMagResBank. Nucleic Acids Res. 2008;36(Database issue):D402–408.
  • 7. White DR, Powell BL, Craig JB, et al. A phase II trial of high-dose cytarabine and cisplatin in previously untreated non-small cell carcinoma of the lung. A Piedmont Oncology Association Study. Cancer. 1990;65(8):1700–1703.
  • 8. Koltai T, Reshkin SJ, Harguindey S. Chapter 15 - Pharmacological interventions part III. In: Koltai T, Reshkin SJ, Harguindey S, eds. An Innovative Approach to Understanding and Treating Cancer: Targeting pH. Academic Press; 2020:335–359.
  • 9. Sanomachi T, Suzuki S, Togashi K, et al. Spironolactone, a Classic Potassium-Sparing Diuretic, Reduces Survivin Expression and Chemosensitizes Cancer Cells to Non-DNA-Damaging Anticancer Drugs. Cancers (Basel). 2019;11(10).
  • 10. Singh U, Alsuhaymi S, Al-Nemi R, Emwas AH, Jaremko M. Compound-Specific 1D (1)H NMR Pulse Sequence Selection for Metabolomics Analyses. ACS Omega. 2023;8(26):23651–23663.
  • 11. Mobarhan YL, Struppe J, Fortier-McGill B, Simpson AJ. Effective combined water and sideband suppression for low-speed tissue and in vivo MAS NMR. Anal Bioanal Chem. 2017;409(21):5043–5055.
  • 12. Chen JH, Sambol EB, Kennealey PT, et al. Water suppression without signal loss in HR-MAS 1H NMR of cells and tissues. J Magn Reson. 2004;171(1):143–150.
  • 13. Singh U, Al-Nemi R, Alahmari F, Emwas AH, Jaremko M. Improving quality of analysis by suppression of unwanted signals through band-selective excitation in NMR spectroscopy for metabolomics studies. Metabolomics. 2023;20(1):7.
  • 14. Hassan Q, Dutta Majumdar R, Wu B, et al. Improvements in lipid suppression for (1)H NMR-based metabolomics: Applications to solution-state and HR-MAS NMR in natural and in vivo samples. Magn Reson Chem. 2018.
  • 15. Singh U, Emwas AH, Jaremko M. Enhancement of weak signals by applying a suppression method to high-intense methyl and methylene signals of lipids in NMR spectroscopy. RSC Adv. 2024;14(37):26873–26883.
  • 16. van der Hooft JJJ, Rankin N. Metabolite Identification in Complex Mixtures Using Nuclear Magnetic Resonance Spectroscopy. In: Webb GA, ed. Modern Magnetic Resonance. Cham: Springer International Publishing; 2018.
  • 17. Judge MT, Ebbels TMD. Problems, principles and progress in computational annotation of NMR metabolomics data. Metabolomics. 2022;18(12):102.
  • 18. Sousa SAA, Magalhães A, Ferreira MMC. Optimized bucketing for NMR spectra: Three case studies. Chemometrics and Intelligent Laboratory Systems. 2013;122:93–102.
  • 19. Zeng Q, Chen J, Lin Y, Chen Z. Boosting resolution in NMR spectroscopy by chemical shift upscaling. Anal Chim Acta. 2020;1110:109–114.
  • 20. Migdadi L, Telfah A, Hergenroder R, Wohler C. Novelty detection for metabolic dynamics established on breast cancer tissue using 2D NMR TOCSY spectra. Comput Struct Biotechnol J. 2022;20:2965–2977.
  • 21. Zhang J, Terayama K, Sumita M, et al. NMR-TS: de novo molecule identification from NMR spectra. Sci Technol Adv Mater. 2020;21(1):552–561.
  • 22. Zhang X, Xu J, Yang J, et al. Understanding the learning mechanism of convolutional neural networks in spectral analysis. Anal Chim Acta. 2020;1119:41–51.
  • 23. Ellinger JJ, Chylla RA, Ulrich EL, Markley JL. Databases and Software for NMR-Based Metabolomics. Curr Metabolomics. 2013;1(1).
  • 24. Mercier P, Lewis MJ, Chang D, Baker D, Wishart DS. Towards automatic metabolomic profiling of high-resolution one-dimensional proton NMR spectra. J Biomol NMR. 2011;49(3-4):307–323.
  • 25. Rout M, Lipfert M, Lee BL, et al. MagMet: A fully automated web server for targeted nuclear magnetic resonance metabolomics of plasma and serum. Magn Reson Chem. 2023;61(12):681–704.
  • 26. Hao J, Liebeke M, Astle W, De Iorio M, Bundy JG, Ebbels TM. Bayesian deconvolution and quantification of metabolites in complex 1D NMR spectra using BATMAN. Nat Protoc. 2014;9(6):1416–1427.
  • 27. Ravanbakhsh S, Liu P, Bjorndahl TC, et al. Accurate, fully-automated NMR spectral profiling for metabolomics. PLoS One. 2015;10(5):e0124219.
  • 28. Ravanbakhsh S, Liu P, Bjorndahl TC, et al. Correction: Accurate, Fully-Automated NMR Spectral Profiling for Metabolomics. PLoS One. 2015;10(7):e0132873.
  • 29. Rohnisch HE, Eriksson J, Tran LV, Mullner E, Sandstrom C, Moazzami AA. Improved Automated Quantification Algorithm (AQuA) and Its Application to NMR-Based Metabolomics of EDTA-Containing Plasma. Anal Chem. 2021;93(25):8729–8738.
  • 30. Wishart DS, Tzur D, Knox C, et al. HMDB: the human metabolome database. Nucleic Acids Res. 2007;35(suppl_1):D521–D526.

Data Availability Statement

Data sharing is not applicable to this article as no new data were created or analyzed in this study.
