Shifting-corrected regularized regression for 1H NMR metabolomics identification and quantification

Thao Vu; Yuhang Xu; Yumou Qiu; Robert Powers

doi:10.1093/biostatistics/kxac015

Biostatistics. 2022 May 10;24(1):140–160.

PMCID: PMC9748598 PMID: 36514939

Summary

The process of identifying and quantifying metabolites in complex mixtures plays a critical role in metabolomics studies to obtain an informative interpretation of underlying biological processes. Manual approaches are time-consuming and heavily reliant on the knowledge and assessment of nuclear magnetic resonance (NMR) experts. We propose a shifting-corrected regularized regression method, which identifies and quantifies metabolites in a mixture automatically. A detailed algorithm is also proposed to implement the proposed method. Using a novel weight function, the proposed method is able to detect and correct peak shifting errors caused by fluctuations in experimental procedures. Simulation studies show that the proposed method performs better with regard to the identification and quantification of metabolites in a complex mixture. We also demonstrate real data applications of our method using experimental and biological NMR mixtures.

Keywords: Chemical shift, NMR metabolomics, Regularized regression, Spectral data

1. Introduction

Over the last several decades, the field of metabolomics has increasingly gained attention among postgenomics technologies (Dieterle and others, 2006) due to its ability to study the state of a biological system at the molecular level. In particular, metabolites are the direct outcomes of all genomic, transcriptomic, and proteomic responses to environmental stimuli, stress, or genetic mutations (Fiehn, 2002). Small changes in metabolite concentration levels might reveal crucial information that is closely related to disease status (Gowda and others, 2008), drug resistance (Thulin and others, 2017), and the biological activity of chemicals derived from diet and/or environment (Daviss, 2005). Therefore, metabolomics has become an increasingly attractive approach for researchers in many scientific areas such as toxicology (Ramirez and others, 2013), food science and nutrition (Wishart, 2008a), and medicine (Putri and others, 2013).

Nuclear magnetic resonance (NMR) spectroscopy is one of the premier analytical platforms for acquiring data in metabolomics. It is renowned for its richness of information, rapid and straightforward measurements, high level of reproducibility, and minimal sample preparation (Wishart, 2008b). Each metabolite is uniquely characterized by its own resonance signature, namely its ¹H NMR chemical shift fingerprint. Every spectral peak is generated by a distinct hydrogen nucleus resonating at a particular frequency, which is measured in parts per million (ppm) relative to a standard compound (Dona and others, 2016). For a particular metabolite, depending on its chemical structure, one or more peaks can appear at specific locations on the chemical shift axis. At the same time, the height of every spectral peak is directly proportional to the concentration of the corresponding metabolite in the mixture.

As an illustration, Figure 1 shows individual ¹H NMR spectra of three metabolites (Figure 1(a)–(c)) under ideal experimental conditions. In each panel, the x-axis denotes the chemical shift, measured in ppm, while the y-axis represents the relative peak intensity at each chemical shift. Whenever a peak is mentioned, it refers to a small symmetrical segment of the spectrum, and the chemical shift corresponding to the center of the peak is known as the peak location. Ideally, given a mixture spectrum composed of several metabolites as shown in Figure 1(d), one could overlay the figure with each individual reference spectrum, such as Figure 1(a)–(c), to identify each metabolite in the mixture if the signals match. The process of identifying individual metabolites in a complex mixture is called identification. Simultaneously, how much each metabolite contributes to the mixture is quantified by its corresponding peak intensities in the mixture spectrum. The process of estimating the concentration of each metabolite in the mixture is called quantification. Therefore, the NMR fingerprint and corresponding peak intensities are key to any approach for identifying and quantifying metabolites present in complex biological mixtures.

Fig. 1. — Three reference spectra of L-alanine (a), glycine (b), and 3-aminoisobutanoic acid (c) combine to form a mixture spectrum (d) under ideal conditions. Another mixture spectrum (e) has the glycine peak shifted to the right of its referenced location (dashed line).

A conventional approach, which involves manual assignment protocols, has been previously reported (Dona and others, 2016). The manual approach relies on experienced spectroscopists to overlay the observed spectrum with reference spectra of pure compounds and decide which metabolites are present in the mixture, so the whole process is time-consuming, labor-intensive, and prone to bias toward operator knowledge and expectations (Tulpan and others, 2011). Automating the process of metabolite identification and quantification is desirable, but there exist two major obstacles. First, uncontrollable sample perturbations are inherent to every metabolomics study; they arise from a variety of sources such as variation in experimental factors (e.g., pH, temperature, and ionic strength), instrument instability, and inconsistency in sample handling and preparation. As a result, NMR signals of a metabolite may deviate from their referenced positions, which in turn makes any matching procedure harder. Figure 1(e) illustrates such shifting errors in signal positions, where the glycine peak is shifted to the right of its referenced location at 3.54 ppm (i.e., the dashed line). Second, the number of candidate metabolites in the database always exceeds the number of actual sources of signals in the spectra, which raises a sparsity issue. For example, the number of metabolites detected from intact serum/plasma is about 30 or fewer, far below the 4229 blood metabolites in the Human Serum Metabolome (Psychogios and others, 2011). The combination of these two factors makes the detection and interpretation of metabolite-specific signals challenging in practice.

Regularized regression approaches such as Lasso, elastic net, and adaptive Lasso seem to be intuitive choices for handling the sparsity problem because of their built-in regularization capability. However, they are not capable of addressing peak shifting errors. High-dimensional regression with measurement errors in covariates has recently emerged as a statistical research area. For example, to deal with the measurement error problem, Sørensen and others (2015) and Datta and Zou (2017) proposed different modifications of Lasso, and Sørensen and others (2018) introduced methods based on the matrix uncertainty selector (Rosenbaum and Tsybakov, 2010). However, these approaches cannot be applied to the problem of shifting errors because of two key differences. First, these works assume that the responses and covariates are correctly matched, but the covariates are subject to additive measurement errors. In the problem discussed here, by contrast, the observed spectral intensities of a mixture are assumed to be generated from mismatched covariates, i.e., the intensities of compounds with shifting errors. Second, replicates of covariate measurements or an external validation sample are traditionally required to calibrate the models that deal with measurement errors in covariates (Carroll and others, 2006). Neither of them is available for the type of NMR data being considered. In a different manner, Bayesil (Ravanbakhsh and others, 2015), Chenomx (Chenomx, 2015), and ASICS (Tardivel and others, 2017; Lefort and others, 2019) each develop their own methodology to deal with both problems. More precisely, Bayesil partitions the sample spectrum into disjoint regions before applying a probabilistic approach that assigns a low probability to an undesirable match and vice versa. Additionally, the automated Profiler module of the popular proprietary software Chenomx uses a linear combination of Lorentzian peak shape models of reference metabolites to reconstruct the observed mixture spectrum (Weljie and others, 2006). Uniquely, ASICS learns warping functions to minimize the difference between the observed and reconstructed spectra before quantifying individual metabolite concentrations. However, none of these methods has yet been demonstrated to be a gold standard in practice.

Herein, we introduce a new approach to automatically identify and quantify metabolites in complex biological mixtures. This parsimonious proposed method is shown to be efficient by simultaneously addressing both problems of shifting errors and the sparsity of some abundant metabolites present in mixtures. Specifically, the method first conducts the variable selection to identify correct metabolites in a mixture with nonzero coefficients. Second, the method performs a postselection coefficient estimation to quantify metabolite concentration after correcting for shifting errors using an embedded novel weight function. We demonstrate the effectiveness of the proposed model using simulated data, experimental NMR mixtures, and biological serum samples. Interesting findings are further emphasized when the method is compared with popular regularized regression models including Lasso (Tibshirani, 1996), elastic net (Zou and Hastie, 2005), and adaptive Lasso (Zou, 2006), and other existing fitting models including Bayesil, Chenomx, and ASICS.

2. Model and methodology

2.1. Backgrounds

Each NMR spectrum, after being preprocessed by apodization, phasing, and baseline correction, can be represented as a pair of vectors: a vector of equally spaced chemical shifts (in ppm) and a vector of the same length containing the corresponding relative resonance intensities. The total number of features in a spectrum depends on the resolution of the instrument or the spectrum (Astle and others, 2012). However, some NMR signals with low intensities might correspond to instrumental noise and are not reliable for identifying metabolites in a complex mixture. Thus, we define each NMR spectrum of interest as a pair $(x, y)$, where $y = (y_1, \ldots, y_n)^\top$ is the observed collection of signal intensities that exceed a positive threshold, and $x = (x_1, \ldots, x_n)^\top$ are the corresponding chemical shifts. The threshold removes low-intensity signals that are likely to be noise while reducing the number of features considered in our model. Details about the selection of the threshold are described in Section 4.

2.2. Spectrum model with shifting errors

A major underlying assumption in NMR-based quantitative metabolomics is that any given mixture spectrum is the accumulated sum of individual metabolite spectra (Wishart, 2008b). As illustrated in Figure 1, the peaks of the mixture in Figure 1(d) are composed of the three spectra in Figure 1(a)–(c). In this regard, the abundance of an individual metabolite is reflected by its relative peak heights. Consequently, a spectral representation of a mixture consisting of individual metabolites can be considered as a linear combination of the spectral functions of each individual metabolite in the reference library. At a chemical shift $x_i$, the corresponding intensity of the true mixture spectrum under ideal experimental conditions, denoted by $y_i^*$, can be modeled as follows:

$$y_i^* = \beta_0 + \sum_{j=1}^{p} \beta_j f_j(x_i) + \varepsilon_i, \qquad (2.1)$$

where $i = 1, \ldots, n$ indexes the chemical shifts of the peaks along the mixture spectrum; $p$ is the number of known compounds in the reference library; $f_j(x_i)$, $j = 1, \ldots, p$, is the intensity function of the $j$th reference spectrum; $\varepsilon_i$ represents random noise with mean zero and variance $\sigma_\varepsilon^2$; and the non-negative $\beta_j$ represents the concentration of the $j$th metabolite in the complex mixture. Accordingly, the $j$th metabolite is considered to be present in the mixture if the coefficient $\beta_j$ is greater than 0. By mean-centering $y_i^*$ and $f_j(x_i)$ such that $\sum_{i=1}^{n} y_i^* = 0$ and $\sum_{i=1}^{n} f_j(x_i) = 0$, we can remove the intercept term $\beta_0$ from (2.1) (Tibshirani, 1996).

Each reference spectrum is considered as a collection of peaks with different chemical shift locations and peak intensities. Since NMR peaks are sharp, it is common to represent each NMR peak as a Lorentzian curve (i.e., Cauchy distribution function) (Hollas, 2004). Depending on the molecular environment and the size of the molecule, the number of peaks in a ¹H NMR spectrum can range from 1 (e.g., methanol) to more than 47 (e.g., D-glucose). For the $j$th metabolite with chemical shift positions $l_j$ in the reference library, its spectrum can be modeled as follows:

$$f_j(x; l_j, s_j, a_j) = \sum_{k=1}^{q_j} \frac{a_{jk}}{\pi s_{jk} \left[ 1 + \left( (x - l_{jk}) / s_{jk} \right)^2 \right]}, \qquad (2.2)$$

where $x$ is an input which can take any value along the chemical shift (ppm) axis; $q_j$ is the total number of peaks of the $j$th metabolite; $l_j = (l_{j1}, \ldots, l_{jq_j})$ is the vector of all peak locations of the $j$th metabolite; and $s_j = (s_{j1}, \ldots, s_{jq_j})$ is the vector of shape parameters for each of the $q_j$ peaks, whose values are set at 0.002 to maintain the sharp shape of an NMR peak (Vu and others, 2019). Finally, $a_j = (a_{j1}, \ldots, a_{jq_j})$ is the vector of multiplier factors for each of the $q_j$ peaks, chosen such that the relative ratios between peak heights are maintained. For notational simplicity, we suppress $(l_j, s_j, a_j)$ and write $f_j(x)$ for the rest of the paper. For each metabolite, we obtain a list of peak locations and corresponding relative peak heights directly from the Human Metabolome Database (Wishart and others, 2018). Here, the vector of multiplier factors $a_j$ is calculated by solving linear equations of Cauchy densities evaluated at each peak location and the corresponding peak heights; see Vu and others (2019) for details. Using reference spectra generated directly from (2.2) has an advantage over in-house spectra of pure chemical compounds in that it minimizes some undesirable experimental perturbations. From (2.1), the peaks of the reference spectra with $\beta_j > 0$ should also be peaks among the intensities $y_i^*$ of the target mixture. However, unavoidable fluctuations in sample pH, temperature, and instrument stability can cause peaks of the mixture to shift from their referenced locations. As a result, the observed spectrum intensities may incur location shifting errors. In this regard, the observed intensity $y_i$ at $x_i$ is subject to a location shift such that

(2.3)

where the shifting errors form a vector associated with the referenced peak locations of each metabolite. In other words, when a particular peak is shifted, the neighboring signals are shifted by the same amount. Each shifting error follows a distribution with bounded support determined by a positive constant. The bounded support ensures the locality of shifting errors associated with signals in the mixture spectrum. For a given reference spectrum, we know the peak locations, shape parameters, and multiplier factors in (2.2). However, the shifting errors are not observable. Hence, a direct regression of the observed intensities on the shifted reference intensities to estimate the concentrations based on model (2.3) is not practical.

Models (2.1) and (2.3) imply a mismatch between the observed and referenced intensities of the mixture. Such shifting deviations need to be corrected to ensure consistent estimation of the metabolite concentrations and accurate identification of the compounds present in the mixture. In Section 2.3, we propose a shifting-corrected regularized regression estimation procedure to correct for the positional errors in the spectral signals.
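To make the spectrum model of this section concrete, the following minimal Python sketch builds a reference spectrum as a sum of Lorentzian (Cauchy) peaks and forms a noise-free mixture with an optional location shift. The function names, the single shift per metabolite, and the toy one-peak compound at 3.54 ppm are illustrative assumptions rather than the authors' implementation; only the shape parameter 0.002 and the additive-mixture assumption come from the text.

```python
import math

def lorentzian(x, loc, scale):
    """Cauchy (Lorentzian) density centred at `loc` with shape parameter `scale`."""
    return scale / (math.pi * ((x - loc) ** 2 + scale ** 2))

def reference_intensity(x, locs, mults, scale=0.002):
    """Intensity of one reference spectrum at chemical shift x (sum over its peaks)."""
    return sum(a * lorentzian(x, l, scale) for l, a in zip(locs, mults))

def mixture_intensity(x, library, betas, shifts=None):
    """Noise-free mixture intensity: concentration-weighted sum of reference spectra.

    Assumption for illustration: `shifts[j]` (in ppm) moves every peak of
    metabolite j by the same amount; the paper allows one shift per peak.
    """
    shifts = shifts or [0.0] * len(library)
    total = 0.0
    for (locs, mults), beta, delta in zip(library, betas, shifts):
        moved = [l + delta for l in locs]
        total += beta * reference_intensity(x, moved, mults)
    return total

# Toy example (hypothetical values): one single-peak compound at 3.54 ppm.
grid = [round(3.50 + 0.001 * i, 3) for i in range(101)]
library = [([3.54], [1.0])]
ideal = [mixture_intensity(x, library, [1.0]) for x in grid]
shifted = [mixture_intensity(x, library, [1.0], shifts=[0.01]) for x in grid]
ideal_peak = grid[ideal.index(max(ideal))]
shifted_peak = grid[shifted.index(max(shifted))]
```

In this toy setting the shifted mixture peaks 0.01 ppm to the right of the reference location, which is the kind of mismatch the method in Section 2.3 is designed to correct.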

2.3. Methodology

The total number $p$ of metabolites in the reference library used for spectral fitting depends on the type of sample mixture, while the number of abundant metabolites actually present in a mixture is a small subset of the reference library. Namely, most of the coefficients, those of the compounds not contributing to the mixture, should ideally be zero in the regression model (2.1). Given this sparsity, we apply the Least Absolute Shrinkage and Selection Operator (Lasso) regularization to obtain a sparse estimate of the regression coefficients (Tibshirani, 1996). Let $\beta = (\beta_1, \ldots, \beta_p)^\top$ denote the vector of regression coefficients. If the spectral intensities $y_i^*$ without shifting errors could be observed, we could estimate $\beta$ by minimizing the following objective function

$$\frac{1}{2n} \sum_{i=1}^{n} \Big( y_i^* - \sum_{j=1}^{p} \beta_j f_j(x_i) \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|, \qquad (2.4)$$

where $f_j(x_i)$ is the intensity of the $j$th reference spectrum at chemical shift $x_i$ and $\lambda \ge 0$ is a penalty parameter. With proper selection of $\lambda$, Lasso is capable of obtaining a sparse estimate that is consistent for $\beta$ (Bickel and others, 2009; Bühlmann and Van de Geer, 2011). When $\lambda = 0$, i.e., no penalty is applied, Lasso-type estimates are simply ordinary least squares estimates. As $\lambda$ increases, more coefficients $\beta_j$ are shrunk to exactly zero (James and others, 2013). However, (2.4) cannot be implemented because the intensities $y_i^*$ without shifting errors are not observable.

Let Inline graphic be the squared distance between the observed signal intensity at and the reconstructed intensity from reference spectra at . If the observed peak at is in fact generated from , i.e., , there exists a shifting error in signal locations of . Then, the residual term should be small. Otherwise, the value of Inline graphic should be relatively large. For each , we calculate such pairwise residuals for the interval , where is a predefined, positive constant. In practice, may be empirically chosen and details about its selection are discussed in Section 4. The residuals can be used to construct weights for each feature pair Inline graphic . Let be the kernel of the normal density function with mean and variance . Define the weight function as follows:

(2.5)

where Inline graphic , ; serves as a tuning parameter which controls the distribution of weights in each search window . For simplicity, we will use and as defined above for the rest of the article. Notice that is a smooth and decreasing function of with for each . For each search window , the weight reaches its maximum if the observed signal at Inline graphic is shifted from its reference counterparts at and hence is the smallest. For the rest of the article, will be used in place of to simplify the notation.
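As a rough illustration of the weight construction described above, the sketch below computes the pairwise squared distances inside one search window and turns them into normalized weights with a Gaussian kernel. The exact form of the kernel argument, the names `half_window` and `sigma`, and the toy spectra are assumptions made for this example; the source only states that the weights are built from a normal-density kernel, decrease with the pairwise distance, and are controlled by a tuning parameter.

```python
import math

def pairwise_distances(i, y, recon, half_window):
    """Squared distances between y[i] and the reconstructed intensities recon[k]
    for all grid indices k within `half_window` of i."""
    lo = max(0, i - half_window)
    hi = min(len(recon) - 1, i + half_window)
    return {k: (y[i] - recon[k]) ** 2 for k in range(lo, hi + 1)}

def window_weights(dists, sigma):
    """Normalized Gaussian-kernel weights: the smallest distance gets the largest weight.

    Assumption: the kernel is evaluated at the squared distance itself.
    Subtracting the minimum before exponentiating avoids a 0/0 when every
    kernel value underflows; it does not change the normalized weights.
    """
    squared = {k: d * d for k, d in dists.items()}
    floor = min(squared.values())
    kern = {k: math.exp(-(s - floor) / (2.0 * sigma ** 2)) for k, s in squared.items()}
    total = sum(kern.values())
    return {k: v / total for k, v in kern.items()}

# Toy example: the observed peak sits at index 2, the reference peak at index 3.
y = [0.0, 0.0, 1.0, 0.0, 0.0]
recon = [0.0, 0.0, 0.0, 1.0, 0.0]
dists = pairwise_distances(2, y, recon, half_window=2)
weights = window_weights(dists, sigma=0.5)
best_k = max(weights, key=weights.get)
```

Here the maximum weight falls on index 3 rather than on the window center, which is how a noncentered maximum signals a shifted peak. A smaller `sigma` concentrates the weight further on that index.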

Using the weights in (2.5), we propose a shifting-weighted regularized estimation approach that minimizes the following objective function:

(2.6)

Here, (2.6) is a constrained regularized optimization, where the non-negativity constraint on Inline graphic is due to the non-negativity of metabolite concentrations in our problem. For general problems without constraints, we may impose the penalty in (2.6). Note that this optimization problem is more complex than the classical Lasso optimization and may not be convex, since the weights also depend on the regression coefficients Inline graphic .

Compared to (2.4), the loss function Inline graphic takes into account any potential signal shifting for each location by including the pairwise distance corresponding to each element in the search window . These pairwise distances are weighted by such that the large is multiplied by a small value and vice versa, where the weight decays exponentially as Inline graphic increases.
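The weighted loss can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the data term is normalized by 2n, the penalty is the sum of the non-negative coefficients, the Gaussian kernel is applied to the squared distance, and `F[k][j]` (a hypothetical layout) holds the intensity of reference j at grid index k. None of these details is confirmed by the text above.

```python
import math

def window_weights(dists, sigma):
    """Normalized Gaussian-kernel weights for one search window (assumed form)."""
    squared = [d * d for d in dists]
    floor = min(squared)  # stabilizes the exponentials, leaves the ratios unchanged
    kern = [math.exp(-(s - floor) / (2.0 * sigma ** 2)) for s in squared]
    total = sum(kern)
    return [v / total for v in kern]

def shift_weighted_loss(y, F, beta, lam, sigma, half_window):
    """Weighted pairwise squared distances plus a penalty on non-negative beta."""
    n = len(y)
    recon = [sum(b * f for b, f in zip(beta, row)) for row in F]
    data_term = 0.0
    for i in range(n):
        window = range(max(0, i - half_window), min(n, i + half_window + 1))
        dists = [(y[i] - recon[k]) ** 2 for k in window]
        weights = window_weights(dists, sigma)
        data_term += sum(w * d for w, d in zip(weights, dists))
    return data_term / (2.0 * n) + lam * sum(beta)

# Toy example: the mixture peak is shifted one grid step from the reference peak.
y = [0.0, 0.0, 1.0, 0.0, 0.0]
F = [[0.0], [0.0], [0.0], [1.0], [0.0]]
shifted_loss = shift_weighted_loss(y, F, [1.0], lam=0.0, sigma=0.1, half_window=1)
plain_loss = sum((yi - row[0]) ** 2 for yi, row in zip(y, F)) / (2.0 * len(y))
```

With the correct concentration, the ordinary least squares loss is penalized for the one-step shift (0.2 in this toy case), whereas the weighted loss is essentially zero because each observed signal is matched to its best counterpart inside the window.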

Let Inline graphic be a minimizer of (2.6) where . Additionally, let be the active set of the metabolites which are identified as present in the target mixture. We define , and . At each peak location of the mixture, the value together with the reference peak locations around can be used to estimate and correct for the shifting errors.

The tuning parameter of the weight function can be considered a weight distributor for each search window. A smaller value yields a narrower weight distribution, which results in more weights close to 0. In this regard, an extremely small value would assign a weight of 1 to the smallest pairwise distance while the remaining weights are essentially 0. On the other hand, a large value would flatten the weight distribution, which in turn loses the ability to detect signal shifting. In general, we suggest choosing the tuning parameter so that smoothness in the weight distribution is maintained. More discussion of the sensitivity of the proposed method to this tuning parameter is included in Section 4.

3. Implementation

In this section, we provide the computation algorithms to solve the shifting-corrected regularized estimation (2.6) proposed in Section 2.3.

3.1. Coordinate descent

As both Inline graphic and in the objective function in (2.6) depend on , it might not be a convex function of . However, for any fixed positive weights , is a weighted least squares loss of the augmented paired data ; hence, it is a convex function. Therefore, to minimize , we utilize the coordinate descent approach (Friedman and others, 2010). This optimization process minimizes the objective function with respect to each Inline graphic at a time while fixing the weights and the remaining coefficients . Specifically, the gradient of with respect to , given fixed weights , is , where , and . Details of the derivation are provided in Section S1 of the Supplementary material available at Biostatistics online. Since the penalty term Inline graphic in (2.6) is separable in , for each component , can be expressed as

where Inline graphic and are two functions independent of , and denotes the regression coefficients without the th component. Therefore, the objective function is a quadratic convex function of given all other coefficients. The coordinate descent algorithm essentially minimizes a quadratic convex function Inline graphic of with the constraint . Since , given and fixed, the minimum of over occurs at .

Specifically, at the current estimate Inline graphic , we obtain the weight functions as

(3.7)

where Inline graphic . Sequentially, we obtain

(3.8)

Note that at each Inline graphic th iteration the process is done for all ’s (). Then, we obtain the th update of by

(3.9)

It is worth noting that if there is no non-negativity constraint on Inline graphic for all , and the Lasso penalty is used in (2.6), the coordinate descent algorithm updates by the soft-thresholding operator as done by Friedman and others (2007).

At each iteration, the algorithm updates each regression coefficient Inline graphic separately, which requires computation steps. Meanwhile, there are weights to update for each updated . The total computational complexity is per iteration. Additionally, the process of looping through all regression coefficients is iterated until the convergence criterion is met or when the maximum number of iterations, which is set at 1000, is reached. Intuitively, as Inline graphic is a convex function given the weights , the algorithm would converge if the initial weights are close to the ones with the true . While is a nonconvex function of as changes with , and our proposed algorithm is not guaranteed to converge to a global optimum, we find that the results are not sensitive to the initial values in the simulation studies and the real data analysis. The theoretical convergence properties of the proposed method will be investigated in future work.

Let the reference library data matrix be the $n \times p$ matrix whose $j$th column consists of the spectrum of the $j$th metabolite in (2.2). As before, $y$ is the $n$-dimensional vector representing the spectrum of the target mixture. Given the penalty parameter $\lambda$ chosen by the cross-validation (CV) criterion, the tuning parameter in the weight function (2.5), and the search window size, the main steps of the proposed optimization algorithm are outlined below.

Algorithm 1 — Coordinate descent algorithm to solve (2.6)
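The steps described in Section 3.1 can be sketched as follows. This is an illustrative Python sketch, not the authors' Algorithm 1: it assumes the loss is normalized by 2n, the kernel is evaluated at the squared distance, the coefficients start at zero, and `F[k][j]` holds reference j at grid index k. For fixed weights, each coordinate update is the closed-form minimizer of a quadratic, clipped at zero to respect non-negativity.

```python
import math

def window_weights(i, y, recon, window, sigma):
    """Normalized Gaussian-kernel weights over one search window (assumed form)."""
    squared = [((y[i] - recon[k]) ** 2) ** 2 for k in window]
    floor = min(squared)
    kern = [math.exp(-(s - floor) / (2.0 * sigma ** 2)) for s in squared]
    total = sum(kern)
    return [v / total for v in kern]

def fit_shift_weighted(y, F, lam, sigma, half_window, max_iter=1000, tol=1e-8):
    """Coordinate descent with weights refreshed before every coefficient update."""
    n, p = len(y), len(F[0])
    windows = [range(max(0, i - half_window), min(n, i + half_window + 1))
               for i in range(n)]
    beta = [0.0] * p  # assumed starting value
    for _ in range(max_iter):
        previous = beta[:]
        for j in range(p):
            recon = [sum(b * f for b, f in zip(beta, row)) for row in F]
            num = den = 0.0
            for i in range(n):
                weights = window_weights(i, y, recon, windows[i], sigma)
                for k, w_ik in zip(windows[i], weights):
                    f = F[k][j]
                    partial = y[i] - (recon[k] - beta[j] * f)  # residual without j
                    num += w_ik * f * partial
                    den += w_ik * f * f
            # Minimizer of the quadratic in beta_j, clipped at zero.
            beta[j] = max(0.0, (num / n - lam) / (den / n)) if den > 0 else 0.0
        if max(abs(b - q) for b, q in zip(beta, previous)) < tol:
            break
    return beta

# Toy example: reference 0 peaks at index 3, reference 1 at index 8 (absent);
# the mixture peak is observed one grid step to the right, at index 4.
y = [1.0 if i == 4 else 0.0 for i in range(12)]
F = [[1.0 if k == 3 else 0.0, 1.0 if k == 8 else 0.0] for k in range(12)]
beta_hat = fit_shift_weighted(y, F, lam=0.001, sigma=0.1, half_window=2)
```

In this toy problem the sketch recovers a coefficient close to 1 for the shifted compound, slightly shrunk by the penalty, and exactly 0 for the absent one.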

3.2. Cross-validation

A decreasing sequence of Inline graphic values is used to calculate the corresponding prediction errors through CV. The optimal penalty parameter associated with the smallest error is chosen. Similar to Lasso’s pathwise coordinate descent (Friedman and others, 2010), the sequence of is generated such that its maximum Inline graphic is the minimum penalty value that all the estimated coefficients become 0. Specifically, is computed as where with . Then, we define for a small positive value of . The sequence of length is constructed by linearly decreasing from to on a log scale, where and are recommended to be 0.0001 and 50, respectively according to Friedman and others (2010).
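A log-scale penalty path of this kind can be generated as in the sketch below. The function name `lambda_path` and its arguments are hypothetical, and the defaults assume that 0.0001 is the ratio between the smallest and largest penalty and 50 is the sequence length; the computation of the largest penalty itself is not reproduced here.

```python
import math

def lambda_path(lam_max, ratio=1e-4, length=50):
    """Decreasing penalty sequence from lam_max to ratio * lam_max,
    equally spaced on the log scale."""
    log_max = math.log(lam_max)
    log_min = math.log(ratio * lam_max)
    step = (log_min - log_max) / (length - 1)
    return [math.exp(log_max + t * step) for t in range(length)]

path = lambda_path(2.0)  # lam_max = 2.0 is an arbitrary illustrative value
```

Equal spacing on the log scale places more candidate values near the small penalties, where the set of selected compounds changes most quickly.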

The constructed sequence of penalty values Inline graphic is then used for 5-fold CV as outlined in Algorithm 2. For both the observed mixture spectrum (response) and the spectra in the reference library (covariates), all NMR signals after thresholding with are randomly partitioned into five sets. During the random partition, it is almost certain that some signals of each peak are used for training, which is sufficient to detect any peak shifting. Each of the five sets is used for validation while the other four sets are used for training. For a given value of Inline graphic in the sequence, the regression coefficients are estimated using the training set. The estimated loss in (2.6), which is obtained using the estimated coefficients, is evaluated on the validation set. The CV loss corresponding to each is the average loss over the five sets; the cross-validated penalty value is chosen as the minimizer of the CV loss. Similarly, we apply the standard 5-fold CV to choose the penalty values for Lasso, elastic net, and adaptive Lasso. The main difference is that the three methods utilize standard least-squares loss while the proposed method uses the weighted loss in (2.6) to account for the shifting errors.
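The 5-fold procedure can be sketched generically in Python. The callables `fit` and `loss` are hypothetical placeholders supplied by the caller (for the proposed method they would estimate the coefficients on the training signals and evaluate the weighted loss on the validation signals); the seed and helper names are assumptions for illustration.

```python
import random

def kfold_indices(n, k=5, seed=0):
    """Randomly partition the signal indices 0..n-1 into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]

def cv_select(lams, n, fit, loss, k=5, seed=0):
    """Return the penalty value with the smallest average validation loss.

    `fit(train_idx, lam)` returns coefficients estimated on the training signals;
    `loss(val_idx, coef)` evaluates them on the validation signals.
    """
    folds = kfold_indices(n, k, seed)
    best_lam, best_score = None, float("inf")
    for lam in lams:
        total = 0.0
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            total += loss(folds[f], fit(train, lam))
        score = total / k
        if score < best_score:
            best_lam, best_score = lam, score
    return best_lam

# Toy stand-ins: the "fit" just returns lam, and the loss is smallest at 0.3.
chosen = cv_select([1.0, 0.3, 0.1], n=20,
                   fit=lambda train, lam: lam,
                   loss=lambda val, coef: (coef - 0.3) ** 2)
folds = kfold_indices(20)
```

Because signals rather than whole peaks are partitioned, each peak typically contributes points to both the training and the validation folds.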

3.3. Concentration estimation

Let Inline graphic be the solution of the coordinate descent procedure in Section 3.1, and be the corresponding estimated weights. To estimate the concentration of the present metabolites, we first need to correct for the shifting errors. At each signal location , let be the position corresponding to the smallest pairwise distance Inline graphic for . For each th metabolite with , we match with its referenced peak locations . The shifting error associated with the th peak of the th reference metabolite at can be estimated as

for Inline graphic , and all , where is the active set of the metabolites identified in the target mixture. Let . The final estimation of the metabolites concentration after the adjustment to shifting errors is denoted by , where if , and can be obtained by minimizing the following objective function directly

(3.10)

Here, the non-negativity constraint is again enforced such that Inline graphic for if to ease the interpretation of non-negative metabolite concentration. Additionally, as the concentration estimation in (3.10) is conditional on the selection results, i.e., the estimation of in (3.9) as well as the correction for shifting error, the inference for is more complicated than the usual post-selection Lasso estimators. In order to study the impact of the two steps on the least square estimator Inline graphic , one could consider the stability selection procedure, as in Meinshausen and Bühlmann (2010). More discussion about this is described in Section S1.3 of the Supplementary material available at Biostatistics online.
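The shift-estimation step can be illustrated with a short sketch. It makes several simplifying assumptions that are not in the source: the mixture has a single dominant peak, the shift is reported as the observed location minus the matched reference location, and `recon` is the spectrum reconstructed from the selected compounds. The postselection refit in (3.10) is not shown.

```python
def estimate_peak_shift(x, y, recon, half_window):
    """Estimate the location shift of the dominant observed peak.

    The apex of the observed spectrum y is matched to the grid index in its
    search window whose reconstructed intensity is closest to the apex value.
    """
    i = max(range(len(y)), key=lambda t: y[t])  # apex of the observed peak
    lo = max(0, i - half_window)
    hi = min(len(y) - 1, i + half_window)
    k_star = min(range(lo, hi + 1), key=lambda k: (y[i] - recon[k]) ** 2)
    return x[i] - x[k_star]

# Toy example: reference peak at 3.54 ppm, observed peak at 3.55 ppm.
x = [round(3.50 + 0.01 * t, 2) for t in range(10)]
recon = [1.0 if t == 4 else 0.0 for t in range(10)]
y = [1.0 if t == 5 else 0.0 for t in range(10)]
shift = estimate_peak_shift(x, y, recon, half_window=2)
```

Once the shift has been estimated, the reference peak locations of the selected compounds can be moved by that amount before the concentrations are refitted.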

4. Simulation studies

The evaluation criteria used to compare the performance of different methods were accuracy, sensitivity, and specificity. Accuracy was calculated as the ratio of correctly labeled metabolites (true positives plus true negatives) to the total number of metabolites in the reference library. Similarly, sensitivity was obtained as the fraction of correctly identified metabolites (true positives) relative to the total number of true metabolites. Moreover, specificity was measured by dividing the number of metabolites correctly left unidentified (true negatives) by the number of metabolites not present in the mixture. Whether a metabolite was deemed present was determined from its postselection error-corrected least squares estimate defined in (3.10). Additionally, Figure S1 of the Supplementary material available at Biostatistics online showed a decline in the loss function as the number of iterations increased. More precisely, the objective function stabilized within the first 50 iterations across simulated and real mixtures, which supported the convergence of the proposed algorithm in practice.
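The three criteria can be computed as in the following sketch, where a metabolite counts as identified when its estimated coefficient is positive. The function name and the toy coefficient vectors are illustrative assumptions.

```python
def selection_metrics(true_beta, est_beta):
    """Accuracy, sensitivity, and specificity of metabolite identification."""
    pairs = list(zip(true_beta, est_beta))
    tp = sum(1 for t, e in pairs if t > 0 and e > 0)
    tn = sum(1 for t, e in pairs if t == 0 and e == 0)
    fp = sum(1 for t, e in pairs if t == 0 and e > 0)
    fn = sum(1 for t, e in pairs if t > 0 and e == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }

# Toy library of 10 compounds: 3 truly present, one missed, one false positive.
truth = [1.0, 1.0, 1.0, 0, 0, 0, 0, 0, 0, 0]
estimate = [0.9, 0.0, 0.8, 0.1, 0, 0, 0, 0, 0, 0]
metrics = selection_metrics(truth, estimate)
```

When the library is much larger than the set of true compounds, accuracy and specificity are dominated by true negatives, so sensitivity is the more demanding criterion.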

Two simulation studies were conducted with a reference database of 200 compounds generated directly from (2.2) for chemical shifts ranging from 0.9 to 9.2 ppm with an equal spacing of 0.001 ppm, excluding the water suppression region between 4.6 and 4.8 ppm. The peak list for each compound was randomly selected from the available chemical shifts, with the number of peaks per compound ranging from 1 to 10. The corresponding peak heights were generated from a uniform distribution and were then used to calculate the multiplier factors as described in Section 2.2. The shape parameter was fixed at 0.002 to maintain the sharp shape of an NMR peak. Each resulting spectrum was standardized such that its maximum peak intensity was set to 1. Our simulation studies only compared the proposed method with existing regularized regression models (i.e., Lasso, elastic net, and adaptive Lasso) because Bayesil, Chenomx, and ASICS only handled raw ¹H NMR data, which were not obtainable through simulation. The performance comparison between the proposed method and existing software, including Bayesil, Chenomx, and ASICS, was illustrated in Section 5. Due to limited space, we only reported one of the simulation studies in detail in the main text. The additional simulation study was discussed in Section S2 of the Supplementary material available at Biostatistics online.

A target mixture in the simulation, as shown in Figure 2(a), was created by adding three individual spectra with random noise to resemble experimental variations. More specifically, the true parameters were set so that the three contributing metabolites had nonzero coefficients and all remaining coefficients were zero. Furthermore, positional noise, i.e., peak shifting errors, was explicitly examined by purposely shifting the locations of chosen peaks. The peak of one metabolite in Figure 2(a) was shifted to the right from its referenced location while the peaks of the other two stayed unchanged. The chemical shift variation was applied in an increasing fashion, at 0.01, 0.02, 0.03, and 0.04 ppm, to assess the performance of the various methods. The whole process of adding random noise to a generated mixture spectrum and shifting peak locations was repeated 200 times. Section S3.2 of the Supplementary material available at Biostatistics online discussed in detail how the proposed method behaved as the variance of the added noise increased. Additionally, based on the sensitivity analysis results (Tables S3–S6 of the Supplementary material available at Biostatistics online), we set the search window size to be the closest integer capturing the maximum shifting variation. In other words, with an equal spacing of 0.001 ppm between chemical shifts, the window size was set to 10, 20, 30, and 40 for shifts of 0.01, 0.02, 0.03, and 0.04 ppm, respectively.

Fig. 2. — Simulated mixture spectrum (three rightmost peaks) with added random noise (a), overlaid with the reference counterpart (leftmost peak) of the shifted metabolite in the simulation. The weight plot for the shifted peak (b) relocates the shifted peak by detecting the noncentered maximum weight. As a result, the shifted peak (c) is corrected to match its referenced peak (d).

Once a mixture was created, a threshold level Inline graphic defined in Section 2.1 was obtained such that was greater than of the area under the mixture spectrum curve (AUC), i.e., (Ahmed, 2005). An extended simulation study reported in Section S3.1 of the Supplementary material available at Biostatistics online assessed how changing would affect the performance of the proposed method. We evaluated different values of Inline graphic (i.e., , , , and ) in conjunction with different values (i.e., , , , , and ). Consistent results across values served as an assurance to continue both simulation studies and real data analysis using . Furthermore, a joint analysis for various values of both and defined in (2.5) was summarized in Section S3.1 of the Supplementary material available at Biostatistics online. The results confirmed the choice of Inline graphic for the analysis.

Table 1 recorded the accuracy, sensitivity, and specificity of each method across four increasing levels of positional perturbation, averaged over 200 iterations. As the shifting variation increased from 0.01 to 0.04 ppm, accuracy, sensitivity, and specificity generally decreased across the four methods. However, the rates of decrease differed slightly across metrics and methods. Specifically, for the proposed method, sensitivity dropped fastest (from 0.980 to 0.950) compared to accuracy (from 0.999 to 0.995) and specificity (from 0.999 to 0.996). Lasso had the lowest sensitivity across all levels of shifting error because of its tendency toward identifying a large number of compounds that were not truly contributing to the mixture.
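For reference, the three metrics can be computed from a set of selected metabolites as in the sketch below. The function name, the metabolite labels, and the example counts (a library of 61 compounds, 6 truly present, 17 falsely selected) are hypothetical and serve only to illustrate the standard definitions.

```python
def classification_metrics(selected, truth, library):
    """Accuracy, sensitivity, and specificity of a metabolite selection."""
    selected, truth, library = set(selected), set(truth), set(library)
    tp = len(selected & truth)          # present and identified
    fp = len(selected - truth)          # identified but not present
    fn = len(truth - selected)          # present but missed
    tn = len(library) - tp - fp - fn    # absent and not identified
    return {
        "accuracy": (tp + tn) / len(library),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


# Hypothetical example: metabolites are labeled by integer identifiers.
library = range(61)
truth = range(6)                  # 6 metabolites truly in the mixture
selected = range(6 + 17)          # all 6 found, plus 17 false positives
metrics = classification_metrics(selected, truth, library)
```

In this example every true metabolite is recovered, so sensitivity is 1, while the false positives lower both accuracy and specificity.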

Table 1.

Average accuracy, sensitivity, and specificity over 200 iterations in the simulation as the shifting variation increased from 0.01 ppm to 0.04 ppm for the proposed method, Lasso, elastic net, and adaptive Lasso. Corresponding standard deviations are recorded in parentheses

Method	Metric	0.01 ppm	0.02 ppm	0.03 ppm	0.04 ppm
Proposed method	Accuracy	0.999 (0.002)	0.998 (0.003)	0.997 (0.005)	0.995 (0.006)
Proposed method	Sensitivity	0.980 (0.119)	0.990 (0.081)	0.978 (0.116)	0.950 (0.208)
Proposed method	Specificity	0.999 (0.002)	0.998 (0.003)	0.997 (0.004)	0.996 (0.005)
Lasso	Accuracy	0.988 (0.029)	0.973 (0.040)	0.956 (0.062)	0.949 (0.065)
Lasso	Sensitivity	0.748 (0.328)	0.758 (0.263)	0.760 (0.209)	0.697 (0.270)
Lasso	Specificity	0.991 (0.030)	0.976 (0.041)	0.960 (0.063)	0.954 (0.067)
Elastic net	Accuracy	0.941 (0.089)	0.936 (0.064)	0.923 (0.061)	0.898 (0.090)
Elastic net	Sensitivity	0.992 (0.052)	0.915 (0.146)	0.843 (0.167)	0.827 (0.167)
Elastic net	Specificity	0.940 (0.091)	0.936 (0.063)	0.924 (0.062)	0.899 (0.092)
Adaptive Lasso	Accuracy	0.984 (0.043)	0.964 (0.056)	0.956 (0.059)	0.945 (0.069)
Adaptive Lasso	Sensitivity	0.812 (0.242)	0.793 (0.210)	0.765 (0.188)	0.757 (0.188)
Adaptive Lasso	Specificity	0.986 (0.044)	0.967 (0.057)	0.959 (0.061)	0.947 (0.070)

Table 2 reported the estimated metabolite concentrations corresponding to each level of shift variation across the four methods. Note that, for the proposed method, the estimates were defined as the postselection error-corrected least squares estimates in (3.10). For each of the three regularized regression models (Lasso, elastic net, and adaptive Lasso), the estimates were defined as the minimizers of the corresponding loss function in Tibshirani (1996), Zou and Hastie (2005), and Zou (2006), respectively. As expected, the estimated coefficient for the shifted metabolite moved further away from its true concentration of 1 as the shifting errors increased. In particular, it decreased from 0.970 to 0.939 for the proposed method and from 0.017 to 0.002 for Lasso. Even at the smallest amount of shifting (0.01 ppm), Lasso quantified the abundance of the shifted metabolite with an estimate of only 0.017. Compared to Lasso, elastic net and adaptive Lasso yielded slightly better estimates of 0.111 and 0.060, respectively, yet still substantially underestimated the true parameter. On the other hand, the proposed method, with its ability to detect and locate any potential peak shifting through the weight function in (2.5), provided a better estimate of 0.970.
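The attenuation of the estimate for a shifted metabolite can be illustrated with a deliberately simplified, noise-free sketch. It uses plain least squares with a single reference spectrum, not Lasso and not the estimator in (3.10); the Lorentzian line shape, the peak width, and the function names are assumptions made for illustration.

```python
def lorentzian(x, center, hwhm=0.002):
    """Unit-height Lorentzian line shape (an assumed peak model)."""
    return 1.0 / (1.0 + ((x - center) / hwhm) ** 2)


def ls_coefficient(reference, observed):
    """Least squares slope through the origin for one reference spectrum."""
    numerator = sum(r * o for r, o in zip(reference, observed))
    denominator = sum(r * r for r in reference)
    return numerator / denominator


def shift_right(spectrum, k):
    """Move a spectrum k grid points toward higher ppm, zero-padding the start."""
    return [0.0] * k + spectrum[: len(spectrum) - k]


grid = [round(1.80 + i * 0.001, 6) for i in range(251)]  # 1.800 to 2.050 ppm
reference = [lorentzian(x, 1.924) for x in grid]
# Observed peak with true concentration 1, displaced by 0.01 ppm.
observed = [lorentzian(x, 1.934) for x in grid]

beta_misaligned = ls_coefficient(reference, observed)
beta_realigned = ls_coefficient(shift_right(reference, 10), observed)
```

Because the misplaced reference peak barely overlaps the observed one, `beta_misaligned` falls far below the true value of 1, whereas repositioning the reference by 10 grid points restores an estimate of essentially 1.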

Table 2.

Average estimated metabolite concentrations over 200 iterations in the simulation as the shifting variation increased from 0.01 ppm to 0.04 ppm for the proposed method, Lasso, elastic net, and adaptive Lasso. The first row for each method corresponds to the shifted metabolite. Corresponding standard deviations are recorded in parentheses

Method	Metabolite	Truth	0.01 ppm	0.02 ppm	0.03 ppm	0.04 ppm
Proposed method	Shifted	1	0.970 (0.106)	0.984 (0.067)	0.968 (0.152)	0.939 (0.228)
Proposed method	Unshifted	1	0.924 (0.218)	0.953 (0.167)	0.948 (0.184)	0.940 (0.213)
Proposed method	Unshifted	1	0.903 (0.274)	0.946 (0.201)	0.946 (0.199)	0.926 (0.251)
Lasso	Shifted	1	0.017 (0.055)	0.008 (0.034)	0.007 (0.040)	0.002 (0.006)
Lasso	Unshifted	1	0.725 (0.396)	0.850 (0.309)	0.925 (0.210)	0.859 (0.319)
Lasso	Unshifted	1	0.755 (0.389)	0.867 (0.298)	0.938 (0.194)	0.868 (0.314)
Elastic net	Shifted	1	0.111 (0.223)	0.064 (0.193)	0.060 (0.195)	0.032 (0.135)
Elastic net	Unshifted	1	0.970 (0.030)	0.975 (0.030)	0.980 (0.023)	0.982 (0.023)
Elastic net	Unshifted	1	0.978 (0.020)	0.981 (0.024)	0.984 (0.020)	0.986 (0.019)
Adaptive Lasso	Shifted	1	0.060 (0.210)	0.041 (0.180)	0.044 (0.188)	0.019 (0.123)
Adaptive Lasso	Unshifted	1	0.770 (0.350)	0.873 (0.280)	0.916 (0.023)	0.911 (0.243)
Adaptive Lasso	Unshifted	1	0.810 (0.320)	0.900 (0.240)	0.928 (0.217)	0.926 (0.217)

Figure 2(b) depicts the weight plot for the shifted peak with a fixed search window of size 10, i.e., 0.01 ppm. In this illustration, the observed peak at 1.933 ppm is shifted from its reference location at 1.924 ppm. In principle, if a given peak does not shift, its weight plot reaches a maximum at the center of the window, while the weights for all neighboring locations diminish quickly. In contrast, for locations that are not peaks (e.g., points along the baseline), the weights are distributed equally across all points within the search window. As a result, such weight plots help identify and relocate any shifted peaks so that they match their counterparts in the reference database as closely as possible. In the weight plot of Figure 2(b), the observed peak located at 1.933 ppm has its maximum weight at 1.924 ppm. This suggests that the observed peak might have been generated from the reference peak and deviated from its reference position by 0.009 ppm. Therefore, the reference spectrum of this metabolite needs to be repositioned accordingly (Figure 2(c) and (d)) to ensure precise estimation.
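The idea of locating a shifted peak from a noncentered maximum weight can be sketched as follows. The weight used here is a normalized local inner product between the observed and reference spectra; it is a hypothetical stand-in chosen for illustration and is not the weight function in (2.5). The Lorentzian line shape, the peak width, the symmetric window of 10 grid points on each side, and the local segment width are likewise assumptions.

```python
def lorentzian(x, center, hwhm=0.002):
    """Unit-height Lorentzian line shape (an assumed peak model)."""
    return 1.0 / (1.0 + ((x - center) / hwhm) ** 2)


def window_weights(observed, reference, peak_idx, half_window, half_width=5):
    """Normalized similarity weights for offsets in the search window.

    Hypothetical stand-in for the weight function in (2.5): each offset k is
    scored by the inner product of the observed segment around the peak with
    the reference segment around the candidate location peak_idx + k.
    """
    offsets = list(range(-half_window, half_window + 1))
    scores = []
    for k in offsets:
        score = sum(
            observed[peak_idx + j] * reference[peak_idx + k + j]
            for j in range(-half_width, half_width + 1)
        )
        scores.append(score)
    total = sum(scores)
    return offsets, [s / total for s in scores]


grid = [round(1.90 + i * 0.001, 6) for i in range(61)]  # 1.900 to 1.960 ppm
reference = [lorentzian(x, 1.924) for x in grid]  # reference peak location
observed = [lorentzian(x, 1.933) for x in grid]   # observed (shifted) peak

peak_idx = max(range(len(grid)), key=lambda i: observed[i])
offsets, weights = window_weights(observed, reference, peak_idx, half_window=10)
best_offset = offsets[weights.index(max(weights))]
matched_location = grid[peak_idx + best_offset]
estimated_shift = round(grid[peak_idx] - matched_location, 6)
```

With these inputs the maximum weight sits away from the window center, at the candidate location that coincides with the reference peak, so `estimated_shift` gives the amount by which the reference spectrum would be repositioned.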

5. Real data analysis

5.1. Experimental mixtures

Three experimental mixtures of different compositions of 20 amino acids, as outlined in Table S11 of the Supplementary material available at Biostatistics online, were used to compare the performance of the proposed method with Lasso, elastic net, adaptive Lasso, Bayesil, Chenomx, and ASICS. Performance was evaluated in terms of accuracy, sensitivity, and specificity using reference libraries of increasing size (61, 101, and 200 metabolites, respectively). Based on the sensitivity analyses in Section S4 of the Supplementary material available at Biostatistics online, we set the search window size (defined in (2.5)) and kept the threshold level (defined in Section 2.1) and the remaining parameter (defined in (2.5)) fixed for the real data analysis reported herein. Overall, the proposed method yielded the highest rates across the three metrics regardless of the library size.

As the number of candidate metabolites increased, it became easier to incorrectly claim the presence of metabolites. For the mixture of all 20 amino acids, Table 3 showed a slight drop in specificity for the proposed method (from 0.90 to 0.85) and a much larger drop for Bayesil (from 0.66 to 0.16) as the library size increased from 61 to 101. Interestingly, the automated profiler feature of Chenomx and ASICS often failed to capture some metabolites that were actually present in the mixture, as shown by the number of false negatives (FN) in Table S12 of the Supplementary material available at Biostatistics online. As a result, Chenomx and ASICS had the lowest sensitivity (0.80 and 0.45, respectively) compared to the remaining five methods (at least 0.95), as shown in Table 3 for the mixture of 20 compounds. Unfortunately, we were not able to evaluate the impact of increasing the library size from 101 to 200 on Bayesil since its maximum number of available metabolites was 93. Additionally, because the reference library accompanying ASICS was fixed at 190 metabolites and could not be adjusted, the impact of increasing library size on ASICS was not assessed.

Table 3.

Comparison of the proposed method with Lasso, elastic net, adaptive Lasso, Chenomx, Bayesil, and ASICS using three experimental mixtures containing 6, 7, and 20 metabolites, respectively, and library sizes of 61, 101, and 200 metabolites, respectively. Performance was evaluated based on average accuracy, sensitivity, and specificity

Metabolites in mixture	Metric	Proposed method	Lasso	Elastic net	Adaptive Lasso	Chenomx	Bayesil	ASICS
Library size 61
6	Accuracy	1.00	0.72	0.62	0.75	0.80	0.64	0.80
6	Sensitivity	1.00	1.00	1.00	1.00	0.67	1.00	0.57
6	Specificity	1.00	0.69	0.58	0.73	0.82	0.60	0.81
7	Accuracy	1.00	0.77	0.79	0.64	0.89	0.64	0.83
7	Sensitivity	1.00	1.00	1.00	1.00	0.71	1.00	0.33
7	Specificity	1.00	0.74	0.76	0.59	0.91	0.59	0.86
20	Accuracy	0.93	0.67	0.69	0.70	0.74	0.75	0.56
20	Sensitivity	1.00	1.00	1.00	1.00	0.80	0.95	0.45
20	Specificity	0.90	0.51	0.54	0.56	0.71	0.66	0.59
Library size 101
6	Accuracy	0.94	0.79	0.88	0.72	0.87	0.30	0.80
6	Sensitivity	1.00	1.00	1.00	1.00	0.67	1.00	0.57
6	Specificity	0.94	0.78	0.87	0.71	0.88	0.25	0.81
7	Accuracy	0.97	0.90	0.84	0.87	0.93	0.42	0.83
7	Sensitivity	1.00	1.00	1.00	1.00	0.71	1.00	0.33
7	Specificity	0.97	0.89	0.83	0.86	0.95	0.37	0.86
20	Accuracy	0.88	0.72	0.73	0.72	0.69	0.33	0.56
20	Sensitivity	1.00	1.00	1.00	1.00	0.80	0.95	0.45
20	Specificity	0.85	0.65	0.67	0.65	0.67	0.16	0.59
Library size 200
6	Accuracy	0.94	0.86	0.86	0.83	0.92	0.30	0.80
6	Sensitivity	1.00	1.00	1.00	1.00	0.67	1.00	0.57
6	Specificity	0.93	0.85	0.85	0.82	0.93	0.25	0.81
7	Accuracy	0.98	0.86	0.86	0.92	0.96	0.42	0.83
7	Sensitivity	1.00	1.00	1.00	1.00	0.71	1.00	0.33
7	Specificity	0.97	0.85	0.85	0.92	0.97	0.37	0.86
20	Accuracy	0.86	0.74	0.75	0.74	0.76	0.33	0.56
20	Sensitivity	1.00	1.00	1.00	1.00	0.80	0.95	0.45
20	Specificity	0.84	0.71	0.72	0.71	0.75	0.16	0.59

Lasso, elastic net, and adaptive Lasso performed quite similarly in that they produced relatively more false positives than the proposed method, which, in turn, led to lower specificity. For example, in the last mixture of 20 compounds, the proposed method had a specificity of 0.85 while the three regularized models yielded an average rate of 0.73. In particular, L-lysine, which was part of mixture 3 (Table S11 of the Supplementary material available at Biostatistics online), had one of its peaks located at 1.925 ppm. Even a slight shift of this peak in the observed spectrum could lead to confusion with the single peak of acetic acid located at 1.924 ppm. Because these three methods do not account for shifting errors, it was not surprising to see Lasso, elastic net, and adaptive Lasso inaccurately classify acetic acid as being present in the mixture. Figure 3 (left) mirrored the observed and fitted spectra corresponding to the proposed method, Lasso, elastic net, and adaptive Lasso, respectively. The zoomed-in version in Figure S3 of the Supplementary material available at Biostatistics online depicted the false positive effects caused by the artifacts around 7.5 ppm on Lasso, elastic net, and adaptive Lasso.

Fig. 3. — *Top panel:* Each fitted curve (left) generated from the proposed method, Lasso, elastic net, and adaptive Lasso is mirrored with the observed mixture spectrum (right). *Bottom panel:* Each fitted curve (left) generated from the proposed method, Bayesil, Chenomx, and ASICS is mirrored with the observed mixture spectrum (right).

The complexity of a mixture spectrum also affected accuracy and specificity. As the number of metabolites included in the mixture sample increased from 7 to 20 (Table 3), the accuracy decreased from 0.98 to 0.93 for the proposed method, from 0.93 to 0.81 for elastic net, and from 0.97 to 0.83 for Chenomx. Specificity showed similar trends, with a drop from 0.97 to 0.85 for the proposed method, from 0.91 to 0.73 for elastic net, and from 0.97 to 0.77 for Chenomx. Bayesil, Chenomx, and ASICS each used its own library of reference spectra, which were collected under specific conditions (e.g., pH and temperature), to profile a given mixture spectrum. If the observed spectrum was collected under experimental conditions that were quite different from those of the reference libraries, it was possible that the induced shift variation fell outside the range that these methods considered in their algorithms. This could consequently lead to missing true metabolites (false negatives) and/or identifying wrong metabolites (false positives). Specifically, Table S13 of the Supplementary material available at Biostatistics online showed that Chenomx had a relatively lower number of false positives but usually missed two to four metabolites across the three mixtures (e.g., L-alanine, L-cysteine, L-leucine, and L-glutamic acid).

On the other hand, Bayesil failed to identify at most one metabolite (i.e., L-aspartic acid in the third mixture) while producing a larger number of false positives (around 14 to 22). Finally, ASICS seemed to have both problems of missing true metabolites and identifying wrong metabolites. Figure 3 (right) mirrored the observed mixture spectrum with a fitted curve generated by the proposed method, Bayesil, Chenomx, and ASICS, respectively. More precisely, Figure S4 of the Supplementary material available at Biostatistics online zoomed in on the spectral region from 1 to 4.5 ppm, showing the discrepancy between the observed and the profiled spectra to demonstrate the false positive and false negative effects on Bayesil, Chenomx, and ASICS as compared to the proposed approach.

5.2. Biological samples

Three serum samples from breast cancer patients in Hart and others (2017), which were publicly available on the MetaboLights database (Kale and others, 2016), were used to evaluate the practical application of the proposed method in comparison with the six other methods. The evaluation criteria were again accuracy, sensitivity, and specificity, which were calculated using the metabolites manually identified by the authors of the study. According to Table 4, the proposed method yielded the best overall results with regard to identifying true metabolites in the mixtures while controlling the number of falsely identified metabolites. Similar to the experimental results in Section 5.1, all of the other methods except Chenomx identified more incorrect metabolites than the proposed approach. Specifically, the average specificity of our method was 0.71, while the remaining approaches other than Chenomx yielded average specificities of at most 0.66 (Table 4). Across replications, Chenomx detected at most two of the 22 metabolites present in the mixture samples, which resulted in a low sensitivity of 0.05. Given the relatively low number of identified compounds, Chenomx, not surprisingly, produced the highest specificity of 0.93.

Table 4.

Comparison of proposed method with Lasso, Elastic net, adaptive Lasso, Chenomx, Bayesil, and ASICS using three serum samples from breast cancer patients (Hart and others, 2017) and a library size of 104 metabolites. Performance was evaluated based on average accuracy, sensitivity, and specificity over the three empirical samples (N = 3). Corresponding standard deviations are recorded in parentheses

Metrics	Proposed method	Lasso	Elastic Net	Adaptive Lasso	Chenomx	Bayesil	ASICS
Accuracy	0.70 (0.01)	0.65 (0.03)	0.63 (0.04)	0.66 (0.03)	0.74 (0.02)	0.46 (0.03)	0.60 (0.01)
Sensitivity	0.67 (0.03)	0.67 (0.03)	0.68 (0.00)	0.67 (0.03)	0.05 (0.04)	0.55 (0.00)	0.54 (0.03)
Specificity	0.71 (0.01)	0.65 (0.04)	0.62 (0.05)	0.66 (0.05)	0.93 (0.04)	0.43 (0.04)	0.64 (0.02)

Interestingly, all methods consistently failed to detect some of the metabolites that were previously identified by the authors of the original study. A signal-to-noise ratio (SNR) was calculated using TopSpin 4.06 (Bruker, Germany) for the mixture spectra. Three of the missed metabolites (citric acid, formic acid, and phenylalanine) had an SNR of less than 2, which is below the generally accepted lower limit for peak detection.

ASICS had some built-in steps to exclude a reference metabolite from the library if at least one of its corresponding peaks did not show up in the observed complex spectrum. However, doing so could eliminate some potential metabolites, as it was quite possible for a given metabolite to lose one or more peaks due to experimental fluctuations (Zangger, 2015). This could potentially contribute to the large number of false negatives seen in Table S13 of the Supplementary material available at Biostatistics online. Bayesil, on the other hand, did not integrate any regularization to reflect the sparsity of abundant metabolites in a complex mixture. As a result, Bayesil tended to identify more metabolites that were not part of the mixtures.

6. Conclusion

Metabolite identification and quantification play an essential role in any NMR metabolomics study and are necessary to describe the underlying biological processes being investigated. However, manual assignment approaches are time-consuming, labor-intensive, and reliant on the knowledge and assessment of NMR experts. Many curve-fitting models with reference library support have been introduced to automate the metabolite identification process, but none has been unanimously demonstrated to be a gold standard approach in practice. We proposed a new approach that focused on addressing two major challenges of metabolite identification from complex mixtures: undesirable perturbations in signal locations and the sparsity of metabolites relative to reference databases. The proposed method was assessed using simulated, experimental, and biological NMR metabolomics data sets. The overall performance was based on three metrics: accuracy, sensitivity, and specificity. In addition, a comparison was made between the newly introduced approach and Lasso, elastic net, adaptive Lasso, Chenomx, Bayesil, and ASICS with an increasing size of the reference library.

As a hybrid approach, the proposed method leveraged the sparsity properties of regularized models (e.g., Lasso, elastic net, and adaptive Lasso), which allowed a selection of potential metabolites from a large reference library. Given the promising performance of our proposed method with library sizes up to 200 (as shown in Table 3), the method could potentially be scaled up to a larger library. Additionally, our method incorporated a search window at each observed signal to capture any peak shifting, which was then corrected before estimating the metabolite concentrations. Lastly, our approach, with its modified objective function to correct for peak shifting and enforce regularization, was easy to implement. Though we empirically pre-selected some hyperparameters, we demonstrated in the sensitivity analyses (Tables S4–S7 of the Supplementary material available at Biostatistics online) that these values did not substantially affect the performance of our method.

The proposed method showed the best results in capturing metabolites that were truly present in the mixtures while keeping incorrect assignments at a relatively low level regardless of increasing shifting variations, as demonstrated in the simulation studies. In addition, as the complexity of a mixture spectrum increased and the number of candidate metabolites grew larger, all methods tended to identify more incorrect metabolites. In other words, the combination of the two factors resulted in lower accuracy and specificity for all models. It was interesting to see that the automated Profiler feature of Chenomx only assigned a small number of metabolites to the mixtures. Such conservative assignments ensured fewer false identifications, which in turn led to a higher specificity overall. However, Chenomx often failed to capture some true metabolites, leading to a low sensitivity. In contrast, Bayesil’s aggressive detection yielded a large number of false positives while identifying most of the true positives. Consequently, Bayesil had good sensitivity, yet its specificity suffered considerably. ASICS seemed to encounter both problems of missing true metabolites and identifying wrong metabolites, resulting in low sensitivity and specificity. Nonetheless, the new method still managed to maintain the best results across all three criteria (accuracy, sensitivity, and specificity) as compared to the others.

Even though we did not theoretically quantify the inferences of the regression coefficients in this article, we think it is possible to obtain proper confidence intervals for the coefficients, with special attention paid to the characteristics of the NMR spectral data. Because the final concentration estimation is conditional on both the variable selection stage and the covariate shifting correction stage, we utilize a random subsampling approach to closely investigate the stability of each stage. Based on the detailed results in Section 1.3 of the Supplementary material available at Biostatistics online, we observe that the two steps are stable: the true metabolites are selected with a selection probability of at least 0.9, and the covariate shifting error is estimated reasonably well. Such a stability procedure could be utilized to quantify the variability and construct confidence intervals for the metabolite concentrations. Additionally, we could leverage the jackknife resampling method, in which one NMR peak is removed at a time, to construct jackknife confidence intervals. With regard to the nature of our proposed algorithm, the weights and the regression coefficients are updated alternately at each coordinate descent step. More specifically, when the weights are updated using a new estimate of the regression coefficients, the objective function may not decrease. As a result, the objective function does not always decrease monotonically after each step, a phenomenon visible in Figure S1 of the Supplementary material available at Biostatistics online. Although our proposed algorithm is therefore not guaranteed to converge, we find that numerical convergence is achieved in our simulation studies and real data analyses. Additionally, we initialize the regression coefficients at zero in our numerical analyses for the practical reason that the majority of metabolites are not present in a mixture. Alternatively, one could experiment with multiple initializations and select the one that yields the smallest objective value. The theoretical convergence properties of the proposed method are important but beyond the scope of this article and will be investigated in our future work.
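The jackknife idea mentioned above can be sketched in a simplified setting. This is not the authors' procedure: it applies the standard leave-one-out jackknife to a single-metabolite least squares estimate, treats each peak intensity as one observation, and uses a normal-approximation interval. The peak heights, the noise level, and the function names are hypothetical.

```python
import math
import random


def ls_slope(x, y):
    """Least squares slope through the origin."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def jackknife_interval(x, y, z=1.96):
    """Leave-one-peak-out jackknife standard error and normal-approximation CI."""
    n = len(x)
    full = ls_slope(x, y)
    # Re-estimate the concentration with each peak removed in turn.
    loo = [ls_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    se = math.sqrt((n - 1) / n * sum((t - mean_loo) ** 2 for t in loo))
    return full, se, (full - z * se, full + z * se)


rng = random.Random(1)
# Hypothetical reference peak heights of one metabolite.
ref_heights = [0.5, 1.0, 0.8, 1.2, 0.3, 0.9, 0.7, 1.1]
# Observed heights for a true concentration of 1 with small Gaussian noise.
obs_heights = [1.0 * h + rng.gauss(0.0, 0.05) for h in ref_heights]

estimate, std_err, (lower, upper) = jackknife_interval(ref_heights, obs_heights)
```

In the full method, each leave-one-out fit would presumably need to repeat both the selection and the shift-correction stages, which this sketch does not attempt.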

Acknowledgments

The authors would like to thank the Associate Editor and two anonymous referees for their helpful and constructive comments, which led to a significant improvement of this article. We also thank Fatema Bhinderwala and Hannah Noel for generously providing us the experimental NMR mixture data sets.

Conflict of Interest: None declared.

Software

Corresponding code of the method in the form of GNU Octave is incorporated in a toolkit called MVAPACK (Worley and Powers, 2014), which is publicly available for academic users at http://bionmr.unl.edu/mvapack.php. The equivalent R code is available at https://github.com/thaovu1/SCRR.

Supplementary material

Supplementary material is available online at http://biostatistics.oxfordjournals.org.

Funding

The National Science Foundation (1660921); the Redox Biology Center (P30 GM103335, NIGMS), in part; and the Nebraska Center for Integrated Biomolecular Communication (P20 GM113126, NIGMS). The research was performed in facilities renovated with support from the National Institutes of Health (RR015468-01). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

Ahmed, O. A. (2005) New denoising scheme for magnetic resonance spectroscopy signals. IEEE Transactions on Medical Imaging, 24, 809–816.
Astle, W., De Iorio, M., Richardson, S., Stephens, D. and Ebbels, T. (2012) A Bayesian model of NMR spectra for the deconvolution and quantification of metabolites in complex biological mixtures. Journal of the American Statistical Association, 107, 1259–1271.
Bickel, P. J., Ritov, Y. and Tsybakov, A. B. (2009) Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37, 1705–1732.
Bühlmann, P. and Van de Geer, S. (2011) Statistics for High-dimensional Data: Methods, Theory and Applications. Berlin/Heidelberg, Germany: Springer Science & Business Media.
Carroll, R. J., Ruppert, D., Stefanski, L. A. and Crainiceanu, C. M. (2006) Measurement Error in Nonlinear Models: A Modern Perspective. Boca Raton, Florida, USA: Chapman and Hall/CRC.
Chenomx, N. (2015) Suite. Edmonton, AB, Canada: Chenomx Inc.
Datta, A. and Zou, H. (2017) Cocolasso for high-dimensional error-in-variables regression. The Annals of Statistics, 45, 2400–2426.
Daviss, B. (2005) Growing pains for metabolomics: the newest ’omic science is producing results–and more data than researchers know what to do with. The Scientist, 19, 25–29.
Dieterle, F., Ross, A., Schlotterbeck, G. and Senn, H. (2006) Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Analytical Chemistry, 78, 4281–4290.
Dona, A. C., Kyriakides, M., Scott, F., Shephard, E. A., Varshavi, D., Veselkov, K. and Everett, J. R. (2016) A guide to the identification of metabolites in NMR-based metabonomics/metabolomics experiments. Computational and Structural Biotechnology Journal, 14, 135–153.
Fiehn, O. (2002) Metabolomics—the link between genotypes and phenotypes. Plant Molecular Biology, 48, 155–171.
Friedman, J., Hastie, T., Höfling, H. and Tibshirani, R. (2007) Pathwise coordinate optimization. The Annals of Applied Statistics, 1, 302–332.
Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1.
Gowda, G. N., Zhang, S., Gu, H., Asiago, V., Shanaiah, N. and Raftery, D. (2008) Metabolomics-based methods for early disease diagnostics. Expert Review of Molecular Diagnostics, 8, 617–633.
Hart, C. D., Vignoli, A., Tenori, L., Uy, G. L., Van To, T., Adebamowo, C., Hossain, S. M., Biganzoli, L., Risi, E., Love, R. R. and others (2017) Serum metabolomic profiles identify ER-positive early breast cancer patients at increased risk of disease recurrence in a multicenter population. Clinical Cancer Research, 23, 1422–1431.
Hollas, J. M. (2004) Modern Spectroscopy, 35–36. Hoboken, New Jersey, USA: Wiley.
James, G., Witten, D., Hastie, T. and Tibshirani, R. (2013) An Introduction to Statistical Learning, vol. 112. Berlin/Heidelberg, Germany: Springer.
Kale, N. S., Haug, K., Conesa, P., Jayseelan, K., Moreno, P., Rocca-Serra, P., Nainala, V. C., Spicer, R. A., Williams, M., Li, X. and others (2016) MetaboLights: an open-access database repository for metabolomics data. Current Protocols in Bioinformatics, 53, 14–13.
Lefort, G., Liaubet, L., Canlet, C., Tardivel, P., Père, M.-C., Quesnel, H., Paris, A., Iannuccelli, N., Vialaneix, N. and Servien, R. (2019) ASICS: an R package for a whole analysis workflow of 1D 1H NMR spectra. Bioinformatics, 35, 4356–4363.
Meinshausen, N. and Bühlmann, P. (2010) Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72, 417–473.
Psychogios, N., Hau, D. D., Peng, J., Guo, A. C., Mandal, R., Bouatra, S., Sinelnikov, I., Krishnamurthy, R., Eisner, R., Gautam, B. and Young, N. (2011) The human serum metabolome. PLoS One, 6, e16957.
Putri, S. P., Nakayama, Y., Matsuda, F., Uchikata, T., Kobayashi, S., Matsubara, A. and Fukusaki, E. (2013) Current metabolomics: practical applications. Journal of Bioscience and Bioengineering, 115, 579–589.
Ramirez, T., Daneshian, M., Kamp, H., Bois, F. Y., Clench, M. R., Coen, M., Donley, B., Fischer, S. M., Ekman, D. R., Fabian, E. and others (2013) Metabolomics in toxicology and preclinical research. Altex, 30, 209.
Ravanbakhsh, S., Liu, P., Bjordahl, T. C., Mandal, R., Grant, J. R., Wilson, M., Eisner, R., Sinelnikov, I., Hu, X., Luchinat, C. and others (2015) Accurate, fully-automated NMR spectral profiling for metabolomics. PLoS One, 10, e0124219.
Rosenbaum, M. and Tsybakov, A. B. (2010) Sparse recovery under matrix uncertainty. The Annals of Statistics, 38, 2620–2651.
Sørensen, Ø., Frigessi, A. and Thoresen, M. (2015) Measurement error in LASSO: impact and likelihood bias correction. Statistica Sinica, 25, 809–829.
Sørensen, Ø., Hellton, K. H., Frigessi, A. and Thoresen, M. (2018) Covariate selection in high-dimensional generalized linear models with measurement error. Journal of Computational and Graphical Statistics, 27, 739–749.
Tardivel, P. J., Canlet, C., Lefort, G., Tremblay-Franco, M., Debrauwer, L., Concordet, D. and Servien, R. (2017) ASICS: an automatic method for identification and quantification of metabolites in complex 1D 1H NMR spectra. Metabolomics, 13, 1–9.
Thulin, E., Thulin, M. and Andersson, D. I. (2017) Reversion of high-level mecillinam resistance to susceptibility in Escherichia coli during growth in urine. EBioMedicine, 23, 111–118.
Tibshirani, R. (1996) Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, 267–288.
Tulpan, D., Léger, S., Belliveau, L., Culf, A. and Čuperlović-Culf, M. (2011) MetaboHunter: an automatic approach for identification of metabolites from 1H-NMR spectra of complex mixtures. BMC Bioinformatics, 12, 1–22.
Vu, T., Siemek, P., Bhinderwala, F., Xu, Y. and Powers, R. (2019) Evaluation of multivariate classification models for analyzing NMR metabolomics data. Journal of Proteome Research, 18, 3282–3294.
Weljie, A. M., Newton, J., Mercier, P., Carlson, E. and Slupsky, C. M. (2006) Targeted profiling: quantitative analysis of 1H NMR metabolomics data. Analytical Chemistry, 78, 4430–4442.
Wishart, D. S. (2008a) Metabolomics: applications to food science and nutrition research. Trends in Food Science & Technology, 19, 482–493.
Wishart, D. S. (2008b) Quantitative metabolomics using NMR. TrAC Trends in Analytical Chemistry, 27, 228–237.
Wishart, D. S., Feunang, Y. D., Marcu, A., Guo, A. C., Liang, K., Vázquez-Fresno, R., Sajed, T., Johnson, D., Li, C., Karu, N. and others (2018) HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Research, 46, D608–D617.
Worley, B. and Powers, R. (2014) MVAPACK: a complete data handling package for NMR metabolomics. ACS Chemical Biology, 9, 1138–1144.
Zangger, K. (2015) Pure shift NMR. Progress in Nuclear Magnetic Resonance Spectroscopy, 86, 1–20.
Zou, H. (2006) The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, 1418–1429.
Zou, H. and Hastie, T. (2005) Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, 301–320.
