Coherent Point Drift Peak Alignment Algorithms Using Distance and Similarity Measures for Two-Dimensional Gas Chromatography Mass Spectrometry Data

Zeyu Li; Seongho Kim; Sikai Zhong; Zichun Zhong; Ikuko Kato; Xiang Zhang

doi:10.1002/cem.3236

Author manuscript; available in PMC: 2021 Jan 26.

Published in final edited form as: J Chemom. 2020 Mar 28;34(8):e3236. doi: 10.1002/cem.3236

PMCID: PMC7837599 NIHMSID: NIHMS1579081 PMID: 33505107

Abstract

Peak alignment is a vital preprocessing step before downstream analysis, such as biomarker discovery and pathway analysis, for two-dimensional gas chromatography mass spectrometry (2DGCMS)-based metabolomics data. Due to uncontrollable experimental conditions, e.g., differences in temperature or pressure, matrix effects on samples, and stationary phase degradation, a shift of retention times among samples inevitably occurs during 2DGCMS experiments, making it difficult to align peaks. Various peak alignment algorithms have been developed to correct retention time shifts for homogeneous data, heterogeneous data, or both types of mass spectrometry data. However, almost all existing algorithms focus on local alignment and suffer from low accuracy, especially when aligning dense biological data with many peaks. We have developed four global peak alignment (GPA) algorithms using coherent point drift (CPD) point matching algorithms: retention time-based CPD-GPA (RT), prior CPD-GPA (P), mixture CPD-GPA (M), and prior mixture CPD-GPA (P+M). The method RT performs peak alignment based only on the retention time distance, while the methods P, M, and P+M carry out peak alignment using both the retention time distance and the mass spectral similarity. The method P incorporates the mass spectral similarity through prior information, and the methods M and P+M use the mixture distance measure. The four developed algorithms are applied to homogeneous and heterogeneous spiked-in data as well as two real biological datasets and compared with three existing algorithms, mSPA, SWPA, and BiPACE-2D. The results show that our CPD-GPA algorithms perform better than all existing algorithms in terms of F1 score.

Keywords: GC×GC-MS, MS similarity, Peak alignment, Point matching algorithm

1. Introduction

In two-dimensional gas chromatography mass spectrometry (2DGCMS)-based metabolomics, it is crucial to analyze multiple samples in order to investigate biological and technical variation among samples and to increase statistical power in data analysis, such as biomarker discovery and pathway analysis. However, uncontrollable experimental conditions, including differences in temperature or pressure, matrix effects on samples, and stationary phase degradation, lead to shifts of retention times between samples in the gas chromatography (GC) columns, hampering downstream statistical analysis to identify and quantify metabolites. To resolve this unavoidable issue, investigators must perform peak alignment in 2DGCMS-based metabolomics. Peak alignment is an essential preprocessing step that corrects the retention time shifts so that the peaks generated from the same metabolite in different samples are recognized as such. In the 2DGCMS system, two approaches have been developed for peak alignment: retention time-based alignment (RTA) and retention time and MS/MS similarity-based alignment (RTMSA).

Six RTA methods have been introduced using either one-dimensional or two-dimensional retention times: the rank annihilation method,¹ a correlation-optimized shifting method,² a piecewise retention time alignment,³ a two-dimensional (2D) correlation optimized warping,⁴ a robust and local alignment algorithm,⁵ and low-degree polynomial transformations.⁶ However, RTA is prone to a higher false positive rate, particularly as data become more complicated, as with biological data, because many metabolites have similar retention times in a dense retention area. Aligning metabolite peaks solely on the basis of one- or two-dimensional retention times may introduce a high rate of false-positive alignment because some metabolites with similar chemical functional groups have similar retention times in the GC dimensions. For this reason, various RTMSA methods, MSort,⁷ DISCO,⁸ mSPA,⁹ SWPA,¹⁰ and GUINEU,¹¹ were developed. MSort and mSPA are only able to align homogeneous data, while the other RTMSA approaches can be applied to both homogeneous and heterogeneous data. Homogeneous data are data for which all samples were analyzed under identical experimental conditions, while heterogeneous data are data acquired under different experimental conditions. Note that, in this study, the two terms, homogeneous and heterogeneous, are used not to qualify the nature of the compound mixtures but to distinguish the actual gradient or thermal conditions used during the 2DGCMS experiments. All RTMSA methods align using both the 2D retention times and the MS/MS similarity to reduce false positive rates. Recently, another RTMSA method, BiPACE-2D,¹² was introduced. BiPACE-2D is an extension of BiPACE¹³ to 2DGCMS data. It combines the retention time distance and the similarity measure in a log-linear sum along with Gaussian RT penalty terms, rather than in a weighted linear sum as used in mSPA-PAM. It generally performs similarly to or better than existing RTMSA methods on homogeneous data, but its performance becomes worse than that of other RTMSA methods when it is applied to heterogeneous data. Nevertheless, all these existing alignment methods are based on local alignment, so a peak is likely to be incorrectly aligned in a dense chromatographic region where many peaks are present in a small area. False alignment leads to subsequent errors in the downstream statistical analysis.

To overcome the aforementioned issues, Deng et al. (2016)¹⁴ employed point-matching algorithms (PMAs). PMA is often used in the domains of computer vision and medical imaging. It first extracts feature points in an image and then searches globally for the matching points in consecutive images by applying rigid or non-rigid transformations. There are several versions of PMA, including the iterated closest-point algorithm (ICP),¹⁵ robust point matching (RPM),¹⁶ the thin-plate spline-RPM (TPS-RPM),¹⁷ and coherent point drift (CPD).¹⁸ The CPD algorithm outperforms the others, and Deng et al. (2016)¹⁴ used it to introduce a global peak alignment algorithm (PMA-PA) for both homogeneous and heterogeneous data, demonstrating performance comparable to or better than that of existing RTMSA methods. However, PMA-PA uses retention times only. We therefore developed novel CPD-based RTMSA methods that use both the retention time distance and the MS/MS similarity measure. The developed algorithms are implemented in the software package ‘coAlign’, which is written in MATLAB and available at http://mrr.wayne.edu.

2. Methods

2.1. Coherent point drift point matching algorithm (CPD-PMA)

A point matching algorithm (PMA) finds an optimal spatial mapping or transformation that maps one point set to the other, and it is often used in the areas of computer vision and medical imaging.^17,18 In general, PMA is implemented based on a rigid or nonrigid transformation. A rigid transformation, such as translation, rotation, or reflection, preserves all point-wise distances, while a nonrigid transformation allows the distance between a pair of points to change. Various algorithms have been introduced for rigid and nonrigid PMA. One of the most widely used PMAs, especially for rigid transformation, is the iterative closest point (ICP) algorithm because of its computational simplicity and efficiency.^4,15 However, ICP requires an adequate set of initial positions, which is not valid for nonrigid transformation. Furthermore, it is guaranteed to converge only to a local minimum, and its performance is very sensitive to outliers. To resolve these limitations of ICP, several probabilistic algorithms were developed, such as robust point matching (RPM)¹⁶ and thin plate spline (TPS)-RPM.¹⁷ These methods align two point sets under a probability density estimation using Gaussian mixture model (GMM) centroids. Later, the coherent point drift (CPD) algorithm was proposed for both rigid and nonrigid transformations.¹⁸ The CPD algorithm is a probabilistic algorithm based on maximum likelihood estimation of a GMM; it forces the GMM centroids to move coherently as a group in order to preserve the global topological structure of the point sets. Comparison analysis demonstrated that CPD outperforms other existing methods. CPD is briefly described as follows.

A key procedure of peak alignment is to find a transformation function f(T, θ|η) that aligns the target point set T = {t₁, … , t_N} to the reference point set R = {r₁, … , r_M}, where f(T, θ|η) is applied to T with a set of transformation parameters θ and tuning parameters η. In CPD, the tuning parameters are η = ∅ if f(·) is a rigid transformation function and η = {β, λ} if it is a nonrigid transformation function. Here β represents the width of the smoothing Gaussian filter, i.e., the smaller β is, the fewer the oscillations (high-frequency waves) and the smoother the transformation function, and λ tunes the weight of the penalty term, i.e., as λ decreases, the likelihood term dominates, while as λ increases, the objective function becomes smoother. The alignment of two point sets is then considered as a probability density estimation problem. GMMs are applied to both point sets, and the GMM centroids of the two sets are fitted by maximizing the likelihood. The following GMM probability density function (pdf) is used in CPD:

p (r ∣ ω, θ, η) = (1 - ω) \sum_{n = 1}^{N} p (t_{n}) p (r ∣ t_{n}, σ^{2}, θ, η) + ω \frac{1}{M}, r \in R .

(1)

In Equation (1), M and N are the numbers of points in the point sets R and T, respectively. To account for the possibility of outliers and missing data, a uniform distribution $\frac{1}{M}$ is added, and ω is its weight, between 0 and 1. This weight ω should be tuned. p(t_n) is the prior probability, which is fixed as $\frac{1}{N}$ in CPD. p(r|t_n, σ², θ, η) is the Gaussian component density of r given t_n and is expressed as

p (r ∣ t_{n}, σ^{2}, θ, η) = \frac{1}{{(2 π σ^{2})}^{1 ∕ 2}} \exp (- \frac{{∣ r - f (t_{n}, θ ∣ η) ∣}^{2}}{2 σ^{2}}) .

(2)

To find the best fit between two sets R and T, the negative log likelihood function

E (θ, σ^{2} ∣ ω, η) = - \sum_{m = 1}^{M} \log (p (r_{m} ∣ ω, θ, η)) = - \sum_{m = 1}^{M} \log ((1 - ω) \sum_{n = 1}^{N} p (t_{n}) p (r_{m} ∣ t_{n}, σ^{2}, θ, η) + ω \frac{1}{M})

(3)

is minimized under the independent and identically distributed assumption. CPD uses the EM algorithm¹⁹ to estimate θ and σ². In the E-step, the posterior probabilities of the GMM are updated, given ( ${\hat{σ}}^{2}$ , $\hat{θ}$ , $\hat{ω}$ , $\hat{η}$ ), as

p (r_{m} ∣ t_{n}) = \exp (\frac{- {∣ r_{m} - f (t_{n}, \hat{θ} ∣ \hat{η}) ∣}^{2}}{2 σ^{2}}) \times {(\sum_{i = 1}^{N} \exp (\frac{- {∣ r_{m} - f (t_{i}, \hat{θ} ∣ \hat{η}) ∣}^{2}}{2 σ^{2}}) + {(2 π {\hat{σ}}^{2})}^{\frac{D}{2}} (\frac{\hat{ω}}{1 - \hat{ω}}) \frac{N}{M})}^{- 1},

(4)

where D is the dimension of the point sets. The parameters θ and σ² are estimated in the M-step. For more details about the CPD method, we refer readers to the work of Myronenko and Song (2010).¹⁸
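As an illustration of the E-step, the following minimal Python sketch evaluates the posterior matrix of Equation (4) with NumPy. It is not the coAlign (MATLAB) implementation: the function name `cpd_posterior` and the argument `T_moved` (the target points after applying the current transformation) are hypothetical, and the transformation itself is assumed to have been applied beforehand.

```python
import numpy as np

def cpd_posterior(R, T_moved, sigma2, omega):
    """Posterior matrix P[m, n] = p(r_m | t_n) following Equation (4).

    R       : (M, D) array of reference points r_m
    T_moved : (N, D) array of transformed target points f(t_n, theta | eta)
              (hypothetical name; the transformation is applied by the caller)
    sigma2  : current estimate of the variance sigma^2
    omega   : outlier weight, 0 <= omega < 1
    """
    M, D = R.shape
    N = T_moved.shape[0]
    # Squared Euclidean distance between every r_m and every moved t_n.
    diff = R[:, None, :] - T_moved[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)          # shape (M, N)
    kernel = np.exp(-sq_dist / (2.0 * sigma2))
    # Constant contributed by the uniform outlier component.
    outlier = (2.0 * np.pi * sigma2) ** (D / 2.0) * omega / (1.0 - omega) * N / M
    # Normalize over the N target points for each reference point.
    return kernel / (kernel.sum(axis=1, keepdims=True) + outlier)
```

With ω = 0 each row of the returned matrix sums to one; with ω > 0 the remaining probability mass is attributed to the outlier component.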

2.2. Coherent point drift global peak alignment algorithms (CPD-GPA)

Peak alignment (PA) is the process of finding a transformation that aligns two sets of peaks. Both PA and PMA aim to align two sets of peaks or points with an optimal mapping or transformation, so PA can be viewed as a special case of PMA. In this study, four novel PA algorithms have been developed using CPD for two reasons: CPD preserves the global topological structure, so it can align peaks globally, unlike other PA algorithms that focus on local alignment; and CPD performs best among existing point matching algorithms. The developed global peak alignment (GPA) algorithms are described as follows.

Let R = {r₁, … , r_M} be the peak list of the reference mass spectrometry (MS) data and T = {t₁, … , t_N} the peak list of the target MS data, where r_i and t_j (i = 1, … , M; j = 1, … , N) are composed of the first- and second-dimension retention times (r_i1, r_i2) and (t_j1, t_j2), respectively. When MS information is available for each peak (e.g., GC×GC-MS), the vectors of m/z values and their intensities are introduced, in addition to the retention times, as ( $x_{r}^{i}$ , $y_{r}^{i}$ ) and ( $x_{t}^{j}$ , $y_{t}^{j}$ ), where $x_{r}^{i}$ and $x_{t}^{j}$ represent the vectors of m/z values and $y_{r}^{i}$ and $y_{t}^{j}$ the vectors of intensities for the i-th reference and the j-th target peaks, respectively. Note that the distance and the similarity always correspond to the retention times and the mass spectral information, respectively.

Retention time-based CPD-GPA

The first algorithm is similar to the algorithm developed by Deng et al. (2016)¹⁴ in that both algorithms use two-dimensional retention times only and apply the CPD method directly without any modification. However, Deng et al. (2016) used the fast implementation of CPD, which applies a low-rank matrix approximation to a linear system of equations, while our algorithm uses the original CPD algorithm without any approximation. In fact, we observed that CPD without approximation performs better than CPD with approximation in terms of F1 score (the maximum F1 score: 0.9618 vs. 0.9382). This algorithm uses only the two-dimensional retention times as input, so we call it ‘RT’.

Prior MS similarity and retention time-based CPD-GPA

MS/MS similarity measures are introduced into CPD in this method. Incorporating MS/MS similarity measures into CPD was not straightforward because CPD uses a distance-based measure and the MS/MS similarity measure is reciprocal to the retention time distance. Thus, we introduced the MS/MS similarity as prior information in place of $\frac{1}{N}$ in Equation (1). Namely, the GMM pdf is modified as

p (r ∣ ω, θ, η) = (1 - ω) \sum_{n = 1}^{N} S (t_{n}, r) \cdot p (r ∣ t_{n}, σ^{2}, θ, η) + ω \frac{1}{M}, r \in R,

(5)

where S(t_n,r) is the MS/MS similarity measure between two peaks, t_n ∈ T and r ∈ R. We further found that the updated posterior probabilities of GMM in Equation (4) become, given ( ${\hat{σ}}^{2}$ , $\hat{θ}$ , $\hat{ω}$ , $\hat{η}$ ),

p (r_{m} ∣ t_{n}) = \exp (\frac{- {∣ r_{m} - f (t_{n}, \hat{θ} ∣ \hat{η}) ∣}^{2}}{2 σ^{2}}) \cdot S (t_{n}, r_{m}) \times {(\sum_{i = 1}^{N} \exp (\frac{- {∣ r_{m} - f (t_{i}, \hat{θ} ∣ \hat{η}) ∣}^{2}}{2 σ^{2}}) \cdot S (t_{i}, r_{m}) + {(2 π {\hat{σ}}^{2})}^{\frac{D}{2}} (\frac{\hat{ω}}{1 - \hat{ω}}) \frac{1}{M})}^{- 1} .

(6)

In particular, D is equal to two in this study because each retention time consists of a two-dimensional coordinate. For the MS/MS similarity measure, we used the cosine similarity (also known as dot product) that is defined as

S (t_{n}, r_{m}) = \frac{y_{t}^{n} \circ y_{r}^{m}}{∣ y_{t}^{n} ∣ ∣ y_{r}^{m} ∣},

(7)

where $y_{t}^{n}$ and $y_{r}^{m}$ are the intensity vectors of the n-th target and the m-th reference peaks, respectively, $y_{t}^{n} \circ y_{r}^{m} = \sum_{k = 1}^{K} y_{tk}^{n} \cdot y_{rk}^{m}$ and $∣ y_{r}^{m} ∣ = {(\sum_{k = 1}^{K} {(y_{rk}^{m})}^{2})}^{\frac{1}{2}}$ . It is noteworthy that the cosine similarity is always greater than or equal to zero because each intensity is nonnegative, i.e., $y_{tk}^{n} \geq 0$ and $y_{rk}^{m} \geq 0$ . We call this CPD-GPA ‘P’.
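For concreteness, the cosine similarity between two spectra can be sketched in Python as follows. The sketch assumes that both intensity vectors have already been placed on a common m/z grid of length K; the function name and the convention of returning zero for an all-zero spectrum are assumptions made for illustration, not part of the original method description.

```python
import numpy as np

def cosine_similarity(target_intensity, reference_intensity):
    """Cosine (dot product) similarity between two intensity vectors.

    Both inputs are assumed to be nonnegative and aligned on the same
    m/z grid, so the result lies between 0 and 1.
    """
    a = np.asarray(target_intensity, dtype=float)
    b = np.asarray(reference_intensity, dtype=float)
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    if norm == 0.0:
        # Assumed convention: an empty spectrum is similar to nothing.
        return 0.0
    return float(np.dot(a, b) / norm)
```

Identical spectra give a similarity of one, and spectra with no shared m/z channels give zero.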

Mixture distance-based CPD-GPA

In this method, we developed a mixture distance measure using the MS/MS similarity and the retention time distance. To make the domains of the MS/MS similarity measure and the retention time distance consistent with each other, we replaced the MS/MS similarity with 1 − S(t_n, r_m), which ranges from one to zero, and constructed the mixture distance as

d (t_{n}, r_{m} ∣ δ, γ) = (1 - δ) \cdot {∣ r_{m} - t_{n} ∣}^{2} + δ \cdot γ \cdot (1 - S (t_{n}, r_{m})),

(8)

where δ is the mixture weight and 0 ≤ δ ≤ 1; γ is a scale factor for the angular distance. In Equations (2) and (4), we replaced the Euclidean distance with this mixture distance. As a result, the corresponding GMM pdf and the updated posterior probability function are as follows:

p (r ∣ t_{n}, σ^{2}, θ, η, δ, γ) = \frac{1}{{(2 π σ^{2})}^{1 ∕ 2}} \exp (- \frac{d (f (t_{n}, θ ∣ η), r ∣ δ, γ)}{2 σ^{2}});

(9)

p (r_{m} ∣ t_{n}) = \exp (\frac{- d (f (t_{n}, \hat{θ} ∣ \hat{η}), r_{m} ∣ \hat{δ}, \hat{γ})}{2 σ^{2}}) \times {(\sum_{i = 1}^{N} \exp (\frac{- d (f (t_{i}, \hat{θ} ∣ \hat{η}), r_{m} ∣ \hat{δ}, \hat{γ})}{2 σ^{2}}) + {(2 π {\hat{σ}}^{2})}^{\frac{D}{2}} (\frac{\hat{ω}}{1 - \hat{ω}}) \frac{N}{M})}^{- 1} .

(10)

Since the third algorithm introduced a mixture distance between peaks, we abbreviated the name of the algorithm as ‘M’ in this paper.
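The mixture distance of Equation (8) and the corresponding posterior of Equation (10) can be sketched in Python as below. This is an illustrative sketch rather than the coAlign code: the function names and the matrix layout (rows index reference peaks, columns index target peaks) are assumptions, and the default γ = 100 follows the value reported in Section 2.4.

```python
import numpy as np

def mixture_distance(R, T_moved, S, delta, gamma=100.0):
    """Mixture distance d[m, n] between r_m and the moved t_n, Equation (8).

    R       : (M, D) reference retention times
    T_moved : (N, D) transformed target retention times (hypothetical name)
    S       : (M, N) similarity matrix, S[m, n] = S(t_n, r_m)
    delta   : mixture weight, 0 <= delta <= 1
    gamma   : scale factor for the similarity-based term
    """
    diff = R[:, None, :] - T_moved[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    return (1.0 - delta) * sq_dist + delta * gamma * (1.0 - S)

def mixture_posterior(d, sigma2, omega, D=2):
    """Posterior P[m, n] of Equation (10), given the mixture distance d."""
    M, N = d.shape
    kernel = np.exp(-d / (2.0 * sigma2))
    outlier = (2.0 * np.pi * sigma2) ** (D / 2.0) * omega / (1.0 - omega) * N / M
    return kernel / (kernel.sum(axis=1, keepdims=True) + outlier)
```

Setting δ = 0 recovers the squared Euclidean retention time distance, while δ = 1 leaves only the scaled dissimilarity γ(1 − S).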

Prior similarity and mixture distance-based CPD-GPA

Lastly, we combined the two algorithms ‘P’ and ‘M’ together, by incorporating the MS/MS similarity as the prior information and the mixture distance into Equations (1), (2), and (4). In particular, the corresponding updated posterior probabilities of GMM become, given ( ${\hat{σ}}^{2}$ , $\hat{θ}$ , $\hat{ω}$ , $\hat{η}$ , $\hat{δ}$ , $\hat{γ}$ ),

p (r_{m} ∣ t_{n}) = \exp (\frac{- d (f (t_{n}, \hat{θ} ∣ \hat{η}), r_{m} ∣ \hat{δ}, \hat{γ})}{2 σ^{2}}) \cdot S (t_{n}, r_{m}) \times {(\sum_{i = 1}^{N} \exp (\frac{- d (f (t_{i}, \hat{θ} ∣ \hat{η}), r_{m} ∣ \hat{δ}, \hat{γ})}{2 σ^{2}}) \cdot S (t_{i}, r_{m}) + {(2 π {\hat{σ}}^{2})}^{\frac{D}{2}} (\frac{\hat{ω}}{1 - \hat{ω}}) \frac{1}{M})}^{- 1} .

(11)

We call this algorithm ‘P+M’.

2.3. GC×GC-MS datasets

Four GC×GC-MS datasets, called the ‘homogeneous’, ‘heterogeneous’, ‘mice’ and ‘chlamy’ datasets, were used to evaluate the performance of the developed algorithms. The first three datasets were used for previously developed algorithms, such as DISCO,⁸ mSPA,⁹ SWPA¹⁰ and BiPACE-2D,¹² and the ‘chlamy’ dataset was used only for the BiPACE-2D algorithm. The use of these datasets enabled us to directly compare the performance of our newly developed approaches with that of the existing algorithms. The ‘homogeneous’ dataset was from 10 repeated experiments of a mixture of 76 compound standards at a 5 °C/min temperature gradient, which corresponds to S1-S10 in Table 1. The ‘heterogeneous’ dataset includes all 10 measurements of the ‘homogeneous’ dataset, 2 repeated experiments at 7 °C/min and 4 repeated experiments at 10 °C/min of the same compound mixture, so it corresponds to data S1-S16 in Table 1. The ‘mice’ dataset was from 5 repeated experiments of a mice metabolite extract with spiked-in compounds, which corresponds to data M1-M5 in Table 1. The ‘chlamy’ dataset was from 12 experiments of metabolites from Chlamydomonas reinhardtii with 3 replicates for each of 4 factor combinations of bacteria strain (wild-type/high H₂-producing strain) and timing (before or during Tk production phase).^12,20 Details of the sample acquisition, experimental procedures and data processing can be found in the previous works.^10,12 The original datasets from the GC×GC-MS measurements contain multiple peaks that are identified as the same compound. Before conducting the performance evaluation, we removed the duplicate peaks by keeping only the one with the largest peak area, following the same routine as the mSPA⁹ and SWPA¹⁰ methods. To be consistent across the four datasets, we also removed the duplicate peaks of the ‘chlamy’ dataset using the same approach.
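The duplicate-removal step described above can be sketched in a few lines of Python. The record layout (a dictionary with `name` and `area` keys) is hypothetical and chosen only for illustration; the actual peak tables exported by the instrument software may be organized differently.

```python
def keep_largest_peak_per_compound(peaks):
    """Keep, for each compound name, only the peak with the largest area.

    peaks : list of dicts with (hypothetical) keys 'name' and 'area'.
    Returns one dict per distinct compound name, in order of first appearance.
    """
    best = {}
    for peak in peaks:
        name = peak["name"]
        # Replace the stored peak only if this one has a strictly larger area.
        if name not in best or peak["area"] > best[name]["area"]:
            best[name] = peak
    return list(best.values())
```

After this step each run contains one peak per identified compound, which corresponds to the "total number of unique peaks" reported in Table 1.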

Table 1.

The summary of GC×GC-MS data

(a) Compound standard
Run ID	S1	S2	S3	S4	S5	S6	S7	S8	S9	S10	S11	S12	S13	S14	S15	S16
Temperature gradient (°C/min)	5	5	5	5	5	5	5	5	5	5	7	7	10	10	10	10
Total number of peaks	183	188	163	152	154	147	175	164	171	175	134	171	150	139	114	119
Total number of unique peaks	78	76	76	75	74	73	74	76	77	75	75	73	76	73	76	75

(b) Mice
Run ID	M1	M2	M3	M4	M5
Total number of peaks	759	733	695	727	661
Total number of unique peaks	466	456	437	452	418

(c) Chlamy
Run ID	C1	C2	C3	C4	C5	C6	C7	C8	C9	C10	C11	C12
Total number of peaks	375	294	459	479	425	470	280	445	409	442	409	373
Total number of unique peaks	242	203	264	255	232	252	189	246	246	240	228	206

2.4. Tuning parameters

We used both the rigid and non-rigid transformations in all four developed algorithms. The rigid transformation requires one major tuning parameter, ω, which controls outliers (0 ≤ ω < 1). The non-rigid transformation has two more tuning parameters, β ∈ [1,5] and λ ∈ [1,5], in addition to ω ∈ [0,1). We tuned ω for both rigid and non-rigid transformations, and β and λ for the non-rigid transformation. The mixture weight, δ, for the mixture distance was also tuned for the developed algorithms ‘M’ and ‘P+M’. The scale tuning parameter, γ, for the mixture distance was fixed at 100 based on our preliminary analyses. For all the parameter tunings, we divided their ranges into ten equal bins and selected ten values at the corresponding cut points. As in Deng et al. (2016),¹⁴ we further investigated how z-score standardization before peak alignment influences the performance of each developed algorithm.
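A z-score standardization of retention times could look like the Python sketch below. The text does not specify how the standardization is computed, so the choices here are assumptions: each retention-time dimension of a single peak list is centered by its own mean and scaled by its own sample standard deviation, and the function name is hypothetical.

```python
import numpy as np

def zscore_retention_times(rt):
    """Standardize each retention-time column to zero mean and unit variance.

    rt : (number of peaks, 2) array of first- and second-dimension
         retention times for one peak list.
    Assumption: standardization is done per peak list and per dimension,
    using the sample standard deviation (ddof=1).
    """
    rt = np.asarray(rt, dtype=float)
    mean = rt.mean(axis=0)
    std = rt.std(axis=0, ddof=1)
    return (rt - mean) / std
```

Because the two GC dimensions typically differ by orders of magnitude in their retention time scale, a standardization of this kind puts both dimensions on a comparable footing before the Euclidean distance is computed. The sketch does not guard against a column with zero variance.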

2.5. Pairwise peak alignment implementation

To run the CPD point matching algorithm between two point sets, one must designate one as the ‘target’ and the other as the ‘reference’ dataset, and the matching results depend on this allocation. To avoid this role dependence, we carried out two runs for each pairwise peak alignment, switching the roles of the ‘target’ and ‘reference’ datasets. Only the matched peaks that appeared in both matching lists were kept as positive matches; all other matches were considered unmatched and were discarded.
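The role-switching rule amounts to keeping only mutual matches. A minimal Python sketch, assuming each run returns its matches as a dictionary from peak index in one dataset to peak index in the other (a hypothetical representation), is:

```python
def mutual_matches(forward, backward):
    """Keep only the matches found in both alignment directions.

    forward  : dict mapping peak index in dataset A -> peak index in dataset B
               (A used as 'target', B as 'reference')
    backward : dict mapping peak index in dataset B -> peak index in dataset A
               (roles switched)
    Returns a sorted list of (index in A, index in B) pairs.
    """
    return sorted(
        (a, b) for a, b in forward.items() if backward.get(b) == a
    )
```

For example, with `forward = {0: 0, 1: 2, 2: 1}` and `backward = {0: 0, 2: 1, 1: 3}`, only the pairs (0, 0) and (1, 2) are confirmed in both directions and therefore kept.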

For the ‘homogeneous’ dataset, 45 pairs were generated from the 10 datasets measured at the same temperature gradient of 5 °C/min. For the ‘heterogeneous’ dataset, the two sets in each pair came from different temperature gradients, resulting in 20 pairs between measurements at 5 °C/min and 7 °C/min, 40 pairs between measurements at 5 °C/min and 10 °C/min, and 8 pairs between measurements at 7 °C/min and 10 °C/min. For the ‘mice’ dataset, 10 pairs were generated from 5 datasets. For the ‘chlamy’ dataset, 66 pairs were analyzed from 12 datasets.
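These pair counts follow directly from the run IDs in Table 1 and can be checked with a short Python sketch: within-condition pairs are unordered combinations, and cross-condition pairs are Cartesian products of the run lists. The variable names are illustrative only.

```python
from itertools import combinations, product

# Run IDs grouped by temperature gradient, as listed in Table 1.
runs_5 = [f"S{i}" for i in range(1, 11)]    # S1-S10, 5 °C/min
runs_7 = ["S11", "S12"]                     # 7 °C/min
runs_10 = [f"S{i}" for i in range(13, 17)]  # S13-S16, 10 °C/min
mice = [f"M{i}" for i in range(1, 6)]       # M1-M5
chlamy = [f"C{i}" for i in range(1, 13)]    # C1-C12

homogeneous_pairs = list(combinations(runs_5, 2))
heterogeneous_pairs = (
    list(product(runs_5, runs_7))
    + list(product(runs_5, runs_10))
    + list(product(runs_7, runs_10))
)
mice_pairs = list(combinations(mice, 2))
chlamy_pairs = list(combinations(chlamy, 2))

print(len(homogeneous_pairs), len(heterogeneous_pairs),
      len(mice_pairs), len(chlamy_pairs))  # 45 68 10 66
```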

2.6. Performance evaluation

The performance evaluation used the same measures as in mSPA⁹ and SWPA,¹⁰ where a more detailed explanation can be found. The compound identification was carried out using the LECO ChromaTOF software. In particular, the ‘homogeneous’, ‘heterogeneous’ and ‘mice’ datasets were processed by LECO ChromaTOF software version 3.4, while the ‘chlamy’ dataset was handled by LECO ChromaTOF software version 4.22. For each peak alignment algorithm, we used these identified compound names for the performance evaluation. That is, if two matched peaks have the same name, they are considered a ‘true’ positive match. The true positive rate (TPR) and positive predictive value (PPV) are defined as

TPR = \frac{TP}{u}; PPV = \frac{TP}{v},

(12)

where u is the total number of ‘true’ peak matches, v is the total number of peak matches found by a peak alignment algorithm, and TP is the number of ‘true’ positive matches found by a peak alignment algorithm. The F1 score is defined as the harmonic mean of TPR and PPV,

F1 score = \frac{2 \cdot TPR \cdot PPV}{TPR + PPV} .

(13)
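Equations (12) and (13) translate directly into code. The following Python sketch uses a hypothetical function name and assumes that u and v are both positive; returning an F1 score of zero when there are no true positives is an added convention.

```python
def alignment_scores(tp, u, v):
    """Return (TPR, PPV, F1) for one pairwise alignment.

    tp : number of 'true' positive matches found by the algorithm
    u  : total number of 'true' peak matches (assumed > 0)
    v  : total number of matches reported by the algorithm (assumed > 0)
    """
    tpr = tp / u
    ppv = tp / v
    # Harmonic mean of TPR and PPV; defined as 0 when both are 0.
    f1 = 2.0 * tpr * ppv / (tpr + ppv) if (tpr + ppv) > 0 else 0.0
    return tpr, ppv, f1
```

As a worked example, an algorithm that reports v = 16 matches, of which TP = 8 are correct, when u = 10 true matches exist, has TPR = 0.8, PPV = 0.5 and F1 = 8/13 ≈ 0.615.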

3. Results

3.1. Homogeneous dataset

The developed four CPD-GPA algorithms were applied to the homogeneous datasets S1 to S10 that generate 45 pairs. TPR-PPV plots in both Fig. 1 and Supplementary Information Fig. S1 depict the performance results on the pairwise peak alignment. In the TPR-PPV plots, the case with the best performance should be located at the top-right corner.

Figure 1. — The effects of outlier and mixture weight parameters, ω and δ, on the developed peak alignment methods for homogeneous data.

For the outlier tuning parameter ω, when the data were not standardized with the z-score, the methods ‘RT’ and ‘M’ are more sensitive to ω regardless of the type of transformation (the left panel of Fig. 1). Interestingly, with the rigid transformation, the method ‘RT’ achieves its best performance when ω is close to zero, while the others reach their best performance when ω is near one. However, when the nonrigid transformation was applied, all methods achieve their best performance close to ω = 1. When the z-score was used, the methods ‘M’ and ‘P+M’ are the most sensitive to ω, while the methods ‘RT’ and ‘P’ are little affected by changes in ω.

The methods M and P+M have one more tuning parameter, δ, for the mixture distance described in Equation (8). In particular, as δ goes to zero (one), the retention time distance (MS/MS similarity) plays a more critical role in the peak alignment. The choice of the weight parameter has a big influence on the performance of the peak alignment (the right panel of Fig. 1). Regardless of transformation, it appears that either the retention time distance or the MS/MS similarity is more important than the other measure to achieve a better performance according to the use of z-score.

The effects of the nonrigid tuning parameters, β and λ, on the peak alignment are depicted in Supplementary Information Fig. S1. When no z-score standardization is used, β and λ have no effect on the methods, but their influence on the methods with the z-score is not negligible. In fact, z-score standardization makes the performance more sensitive to the choices of both parameters across all methods. Nevertheless, β and λ generally appear to play less important roles in achieving better performance than the outlier and mixture weight parameters.

The overall performance in the homogeneous cases is displayed in Table 2 and Supplementary Information Table S1 in terms of the maximum F1 score. All methods achieve their best performance with the nonrigid transformation, while the z-score standardization contributes to the highest performance only for the method RT. As observed in Supplementary Information Fig. S1, β and λ have no significant influence on the performance of peak alignment except for the method RT. Interestingly, the mixture weight δ is close to one for the method M (δ = 0.8888 in Table 2), but it is close to zero for the method P+M (δ = 0.1111 in Table 2). This means that the retention time distance contributes the most to the method P+M, while the MS/MS similarity plays the most critical role in achieving the best performance for the method M. Among the four methods, the method P+M performs the best in terms of PPV (mean ± SEM: 0.9896±0.0015) and F1 score (0.9780±0.0021), while the method M achieves the highest TPR (0.9678±0.0026). Note that SEM stands for the standard error of the mean.

Table 2.

The maximum F1 scores for each of pairwise peak alignment results. The numbers in parentheses are standard error of mean (SEM). The cases with the best performance are highlighted with bold and italic fonts.

Method	Rigid/Nonrigid	Z-score	Mixture weight (δ)	Outlier (ω)	β	λ	TPR	PPV	F1 score
Homogeneous

RT	Nonrigid	Yes	-	0.8888	2.2	4.2	0.9498 (0.0039)	0.9744 (0.0028)	0.9618 (0.0030)
P	Nonrigid	No	-	0.9999	1	1	0.9663 (0.0033)	0.9886 (0.0016)	0.9773 (0.0023)^*
M	Nonrigid	No	0.8888	0.1111	1	1	*0.9678 (0.0026)*	0.9881 (0.0018)	0.9778 (0.0020)^*
P+M	Nonrigid	No	0.1111	0.9999	1	1	0.9669 (0.0030)	*0.9896 (0.0015)*	*0.9780 (0.0021)*^*

Heterogeneous

RT	Nonrigid	Yes	-	0.8888	4.2	2.6	0.8747 (0.0058)	0.9429 (0.004)	0.9073 (0.0048)
P	Nonrigid	Yes	-	0.8888	2.6	1.4	*0.9377 (0.0038)*	0.9733 (0.0024)	*0.9550 (0.0029)*^*
M	Rigid	Yes	0.1111	0.9999	-	-	0.9346 (0.0059)	0.9768 (0.0021)	0.9548 (0.0039)^*
P+M	Nonrigid	Yes	0.1111	0.9999	1.4	1.4	0.9026 (0.0056)	*0.9832 (0.0023)*	0.9408 (0.0039)

Mice

RT	Nonrigid	Yes	-	0.9999	2.2	5	0.5353 (0.0087)	0.4878 (0.0093)	0.5096 (0.0059)
P	Nonrigid	No	-	0.9999	1	1	*0.6778 (0.0128)*	0.5986 (0.0105)	0.6357 (0.0113)^*
M	Rigid	No	0.9999	0.3333	-	-	0.6655 (0.0118)	*0.6261 (0.0096)*	0.6450 (0.0100)^*
P+M	Rigid	No	0.7777	0.5555	-	-	0.6755 (0.0130)	0.6253 (0.0112)	*0.6493 (0.0116)*^*

The asterisk indicates a method or a set of methods that has a significantly higher F1 score than other methods for each data set at a two-sided 5% level.

3.2. Heterogeneous dataset

All 16 compound standard datasets were used for the heterogeneous case. In particular, the heterogeneous pairwise peak alignments were carried out using 68 pairs of data generated between S1-S10 and S11-S12, between S1-S10 and S13-S16, and between S11-S12 and S13-S16.

Without the z-score standardization, the role of the outlier parameter, ω, depends on the method used (the left panel of Fig. 2). In fact, the method P+M achieves the best performance regardless of transformation. Furthermore, the method P is the most sensitive and the method RT is the least sensitive to the choice of ω. This might be because the retention time distance has little effect on the peak alignment while the MS/MS similarity plays a crucial role, owing to the nature of heterogeneous data. After the use of the z-score, the heterogeneous data were standardized across all data, so the role of the retention time distance is enhanced. Thus, the methods M and P+M perform similarly to each other, and the performance of the method RT becomes comparable to that of the other methods. In addition, the method P becomes less sensitive to the parameter ω, but the methods M and P+M become more sensitive to the choice of ω compared to the case without the z-score. Interestingly, the method RT is less sensitive to the choice of ω regardless of both z-score and transformation.

Figure 2. — The effects of outlier and mixture weight parameters, ω and δ, on the developed peak alignment methods for heterogeneous data.

Regardless of transformation and z-score, the mixture distance weight δ plays a prominent role in peak alignment for both methods M and P+M (the right panel of Fig. 2). Interestingly, when the data are not standardized by the z-score, the best performance of both methods occurs when δ is close to one, regardless of transformation, meaning that the MS/MS similarity is more important for peak alignment. However, the behavior changes abruptly when the z-score standardization is applied: the peak alignment for both methods performs better when δ is close to zero, suggesting that the retention time distance plays a more critical role in peak alignment. This might be because of the nature of the heterogeneous data. Another possible explanation might be related to a property of CPD-GPA, which aligns peaks while preserving the global topological structure of the peaks. Without z-score standardization, the peaks’ topological structure was distorted, so the MS/MS similarity becomes relatively more important for peak alignment, whereas the z-score standardization might partially recover the topological structure, so the retention time distance plays an important role.

As with the homogeneous data, all methods except the method RT are not influenced by the nonrigid tuning parameters, β and λ, when it comes to peak alignment (Supplementary Information Fig. S2). For the method RT, as for the other methods, the performance of peak alignment is not affected by the choice of β and λ when the z-score standardization is not used. However, with the z-score standardization, the effect of either β or λ is pronounced for the method RT, which is similar to the homogeneous case. This might be because the MS/MS similarity reduces the influence of β and λ, which are mainly related to the retention time distance.

Table 2 shows that z-score standardization and control of the outlier parameter (ω = 0.8888 or 0.9999) are important for achieving better peak alignment performance for all methods. As expected, without z-score the mixture distance weight δ that maximizes the F1 score is around one (δ = 0.9999 or 1) for the methods M and P+M, owing to the nature of heterogeneous data. After the use of z-score, however, the methods M and P+M achieve the best performance when δ is close to zero (δ = 0.1111 or 0.2222), which assigns a larger weight to the retention time distance (Supplementary Information Table S1). Overall, the method P achieves the highest TPR (0.9377±0.0038) and the maximum F1 score (0.9550±0.0029), while the method P+M achieves the highest PPV (0.9832±0.0023).

3.3. Mice dataset

The mice data (M1-M5) are generally similar in nature to the homogeneous data (S1-S10) because all experiments were carried out under the same configuration (Table 1). However, the number of detected unique peaks for the mice data (445.8±8.37) is almost six times larger than that for the homogeneous data (75.4±0.48). The mice data therefore have a higher peak density, so the retention time distance alone might not be enough to align peaks. Each of the four methods performed a total of 10 pairwise peak alignments.
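
The count of 10 pairwise alignments follows from taking every unordered pair of the five mice samples. A short sketch of that enumeration is given here; the sample labels follow the text, while the variable names are illustrative. The same count gives 66 pairs for 12 samples, the figure quoted for the chlamy data in Section 3.4.

```python
from itertools import combinations
from math import comb

# Every unordered pair of the five mice samples M1..M5.
mice = [f"M{i}" for i in range(1, 6)]
pairs = list(combinations(mice, 2))

n_mice_pairs = len(pairs)      # 5 choose 2
n_chlamy_pairs = comb(12, 2)   # 12 choose 2
```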

The left panel of Fig. 3 shows the peak alignment performance of each of the four methods as a function of the outlier parameter ω. Unlike for the homogeneous data, the method RT performs worse than the other methods regardless of transformation and z-score. This might be because, as anticipated above, many peaks are located in a small area, so the retention time distance loses its ability to distinguish true from false positive matches between two peaks.

Figure 3. — The effects of outlier and mixture weight parameters, ω and δ, on the developed peak alignment methods for mice data.

The mixture distance weight δ influences the peak alignment in a similar way to the other cases (the right panel of Fig. 3). Namely, without z-score standardization, better performance occurs when δ is close to one (i.e., a larger weight on the MS/MS similarity), whereas, after z-score is applied, the best weight shifts toward zero (i.e., a larger weight on the retention time distance). This phenomenon can also be observed in Supplementary Information Table S1: the weight with the highest F1 score ranges from 0.7777 to 1 without z-score and from 0.1111 to 0.4444 with z-score.

When the data were not standardized with z-score, the behavior of the nonrigid tuning parameter β is similar to that in the other two cases (the left panel of Supplementary Information Fig. S3). When z-score was used, the method RT performs worst among the four methods, unlike in the other two cases, implying that the retention time distance alone faces a significant challenge in peak alignment.

As can be seen in Table 2 and Supplementary Information Table S1, the method RT has its highest F1 score with the z-score standardization, while the methods P, M, and P+M reach their highest F1 scores without z-score. Overall, the methods P, M, and P+M achieve the highest TPR (0.6778±0.0128), the highest PPV (0.6261±0.0096), and the highest F1 score (0.6493±0.0116), respectively.

3.4. Comparisons with mSPA, SWPA, and BiPACE-2D algorithms

The four developed algorithms were compared with three existing peak alignment algorithms: mSPA⁹, SWPA¹⁰, and BiPACE-2D¹². GUINEU¹¹ was not included because, in the comparison reported for BiPACE-2D¹², its performance was in general worse than that of the other algorithms. For each algorithm, the methods with the highest F1 score are displayed in Table 3. Because the data used here are exactly the same as those used for mSPA and SWPA, the comparisons are based directly on the results available in the literature^9,10, which are added to Table 3. Although the BiPACE-2D study reported the performance of pairwise peak alignments, its method for dealing with multiple peaks differs from that used in mSPA, SWPA, and the current study. Therefore, to make the comparison fair, the peak alignments using BiPACE-2D were newly carried out on the same data used in this study, with the tuning parameters that achieved the best performance in the original BiPACE-2D study. In particular, mSPA can be applied only to the homogeneous case, so its performance results are available only for the homogeneous and mice data, and the chlamy dataset was analyzed only with our four developed methods and BiPACE-2D.

Table 3.

Comparisons among our best results, mSPA, SWPA, and BiPACE-2D results on the same datasets in terms of the maximum F1 scores for pairwise peak alignments. The numbers in parentheses are standard error of mean (SEM) for TPR and PPV and 95% confidence intervals for F1 score. The cases with the best performance are highlighted with bold and italic fonts.

CPD-GPA

	Method	Rigid/Nonrigid	z-score	Mixture Weight (δ)	Outlier (ω)	β	λ	TPR	PPV	F1 score
Homogeneous	M	Nonrigid	No	0.8888	0.1111	1	1	0.9669 (0.0030)	*0.9896 (0.0015)*	0.9780^* (0.9739-0.9821)
Heterogeneous	P	Nonrigid	Yes	-	0.8888	2.6	1.4	*0.9377 (0.0038)*	*0.9733 (0.0024)*	*0.9550^* (0.9493-0.9607)*
Mice	P+M	Rigid	No	0.7777	0.5555	-	-	0.6755 (0.0130)	0.6253 (0.0112)	0.6493^* (0.6266-0.6720)
Chlamy	P+M	Rigid	No	0.7777	0.5555	-	-	0.7671 (0.0048)	0.8474 (0.0037)	*0.8049^* (0.7973-0.8125)*

mSPA

	Method	w	Distance	MS/MS similarity				TPR	PPV	F1 score
Homogeneous	PAM	0.5	Canberra	dot product				*0.9751 (0.0024)*	0.9870 (0.0021)	*0.9810^* (0.9771-0.9849)*
Mice	PAM	0.05	Manhattan	dot product				*0.7012 (0.0110)*	0.5475 (0.0101)	0.6148 (0.5944-0.6352)
Chlamy	PAM	0.05	Manhattan	dot product				*0.8013 (0.0046)*	0.7823 (0.0047)	0.7914^* (0.7830-0.7998)

SWPA

	Method	p						TPR	PPV	F1 score
Homogeneous	SWRME	0.8						0.9214 (0.0059)	0.9725 (0.0037)	0.9461 (0.9369-0.9553)
Heterogeneous	SWRE	0.9						0.8247 (0.0050)	0.9781 (0.0025)	0.8945 (0.8876-0.9014)
Mice	SWRE	0.9						0.5023 (0.0255)	0.6207 (0.0139)	0.5526 (0.5457-0.5595)
Chlamy	SWRE	0.9						0.6297 (0.0062)	0.7754 (0.0049)	0.6945 (0.6837-0.7053)

BiPACE-2D

	D1	D2	T1	T2	MCS	MS/MS similarity		TPR	PPV	F1 score
Homogeneous	10	0.25	0	0.25	2	dot product		0.8603 (0.0073)	0.9395 (0.0059)	0.8979 (0.8854-0.9104)
Heterogeneous	10	0.25	0	0.25	2	dot product		0.8226 (0.0067)	0.9207 (0.0041)	0.8681 (0.8585-0.8777)
Mice	25	0.5	0.75	0	2	weighted cosine		0.6984 (0.0101)	*0.6623 (0.0069)*	*0.6798^* (0.6641-0.6955)*
Chlamy	100	0.5	0.99	0.99	2	cosine		0.7618 (0.0049)	*0.8521 (0.0039)*	0.8041^* (0.7963-0.8119)


The asterisk indicates a method or a set of methods that has a significantly higher F1 score than other methods for each data set at a two-sided 5% level.

For the homogeneous data (Table 3), the highest TPR, PPV, and F1 score are achieved by mSPA-PAM, CPD-GPA (the method M), and mSPA-PAM, respectively (0.9751±0.0024, 0.9896±0.0015, and 0.9810±0.0020). Interestingly, BiPACE-2D significantly underperforms the other three algorithms in terms of F1 score: its 95% confidence interval (CI) does not overlap with that of any other algorithm. Although mSPA-PAM achieves the highest F1 score, its 95% CI overlaps with that of CPD-GPA, implying that the two algorithms are comparable.

For the heterogeneous data, the developed CPD-GPA (in particular, the method P) outperforms the other algorithms in terms of TPR (0.9377±0.0038), PPV (0.9733±0.0024), and F1 score (0.9550±0.0020). Moreover, its F1 score is significantly higher than those of the other algorithms; namely, the lower bound of its 95% CI lies above the 95% CIs of all other algorithms (Table 3). As with the homogeneous data, BiPACE-2D performs significantly worse than the other algorithms, in the sense that the upper bound of its 95% CI lies below the lower bounds of the 95% CIs of the other algorithms.

For the mice data, mSPA-PAM achieves the highest TPR (0.7012±0.0110), while BiPACE-2D achieves the highest PPV (0.6613±0.0069) and F1 score (0.6798±0.0080). However, the F1 score of BiPACE-2D is not significantly different from that of CPD-GPA (in particular, the method P+M); that is, the 95% CI of BiPACE-2D overlaps with that of CPD-GPA (Table 3).

The CPD-GPA algorithm was applied to the homogeneous chlamy data using the same method and tuning parameters as those obtained from the mice data. The chlamy data comprise 12 experimental data sets, giving 66 pairs. CPD-GPA (in particular, the method P+M) achieves the highest TPR (0.7671±0.0048) and F1 score (0.8049±0.0039), while BiPACE-2D has the highest PPV (0.8521±0.0039). However, the 95% CI of CPD-GPA overlaps with that of BiPACE-2D.
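
The significance statements in this section rest on whether two 95% CIs of the F1 score overlap. The sketch below applies that criterion to intervals copied from Table 3; the function and dictionary names are illustrative. Non-overlap of CIs is used here only as the criterion described in the text, not as a formal hypothesis test.

```python
def intervals_overlap(a, b):
    """Return True when closed intervals a=(lo, hi) and b=(lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


# 95% CIs of the F1 score copied from Table 3.
f1_ci = {
    ("homogeneous", "CPD-GPA"): (0.9739, 0.9821),
    ("homogeneous", "mSPA"): (0.9771, 0.9849),
    ("homogeneous", "BiPACE-2D"): (0.8854, 0.9104),
    ("mice", "CPD-GPA"): (0.6266, 0.6720),
    ("mice", "BiPACE-2D"): (0.6641, 0.6955),
    ("chlamy", "CPD-GPA"): (0.7973, 0.8125),
    ("chlamy", "BiPACE-2D"): (0.7963, 0.8119),
}

# Comparable: the intervals overlap.
homog_cpd_vs_mspa = intervals_overlap(
    f1_ci[("homogeneous", "CPD-GPA")], f1_ci[("homogeneous", "mSPA")]
)
# Significantly different: the intervals do not overlap.
homog_cpd_vs_bipace = intervals_overlap(
    f1_ci[("homogeneous", "CPD-GPA")], f1_ci[("homogeneous", "BiPACE-2D")]
)
mice_cpd_vs_bipace = intervals_overlap(
    f1_ci[("mice", "CPD-GPA")], f1_ci[("mice", "BiPACE-2D")]
)
chlamy_cpd_vs_bipace = intervals_overlap(
    f1_ci[("chlamy", "CPD-GPA")], f1_ci[("chlamy", "BiPACE-2D")]
)
```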

In summary, mSPA-PAM appears to have an advantage over the other methods in terms of TPR, although it cannot be applied to heterogeneous data, while CPD-GPA and BiPACE-2D have relatively higher PPV among the four algorithms. Overall, CPD-GPA achieves either the highest or a comparable F1 score in all cases.

4. Discussion

The four CPD-GPA algorithms were newly developed using retention time distance and MS/MS similarity. Although these algorithms are implemented for two-dimensional GC-MS data, they can be employed for other types of MS data, such as LC-MS. For example, the method RT can be applied to MS experimental data that have retention times only, such as LC-MS or LC×LC-MS. The other three CPD-GPA algorithms, P, M, and P+M, can be applied to data that have both retention times and MS/MS, such as GC-MS, GC×GC-MS, LC-MS/MS, or LC×LC-MS/MS.

Two approaches were developed to incorporate the MS/MS similarity into the CPD algorithm. The first approach treats the MS/MS similarity as prior information and adds it to Equation (1), using the cosine correlation in place of the constant 1/N. Previous work on MS/MS similarity measures^21–24 shows that the performance of peak alignment is influenced by the choice of the similarity measure, and another similarity measure, such as partial correlation,²⁴ might perform better than the cosine correlation. However, the cosine correlation is one of the least computationally expensive similarity measures, and thus we used it in this study. The second approach uses the mixture distance measure, similar to the earlier work by Kim et al.⁹, which used the mixture similarity measure. Although the MS/MS similarity was converted into the distance domain as one minus the cosine similarity (Equation (8)), the retention time distance dominates the mixture distance because of the narrow domain of the converted cosine similarity. To resolve this, the scale factor γ was introduced to expand the domain of the converted cosine similarity. In this study, we empirically examined the effect of the scale factor and fixed it at 100. Nevertheless, we expect that the peak alignment performance can be improved by tuning the scale factor γ.
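
The role of δ and γ in the second approach can be made concrete with a small sketch. Equation (8) is not reproduced in this section, so the form used here, (1 − δ) times the Euclidean retention time distance plus δ times γ times one minus the cosine similarity, is an assumption inferred from the description above (δ near one emphasizes the MS/MS similarity, δ near zero the retention time distance). The function names and the toy peaks are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two intensity vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mixture_distance(rt_a, rt_b, ms_a, ms_b, delta, gamma=100.0):
    """Assumed mixture distance between two peaks.

    rt_a, rt_b: (first, second) dimension retention times.
    ms_a, ms_b: mass spectra as intensity vectors on a common m/z grid.
    delta:      mixture weight in [0, 1]; larger values favor the spectra.
    gamma:      scale factor that widens the range of 1 - cosine similarity.
    """
    rt_dist = math.dist(rt_a, rt_b)  # Euclidean retention time distance
    ms_dist = gamma * (1.0 - cosine_similarity(ms_a, ms_b))
    return (1.0 - delta) * rt_dist + delta * ms_dist


# Toy peaks: retention times 5 units apart; spectra identical or orthogonal.
same = mixture_distance((10.0, 1.0), (13.0, 5.0), [1, 0, 0], [1, 0, 0], delta=1.0)
rt_only = mixture_distance((10.0, 1.0), (13.0, 5.0), [1, 0, 0], [0, 1, 0], delta=0.0)
mixed = mixture_distance((10.0, 1.0), (13.0, 5.0), [1, 0, 0], [0, 1, 0], delta=0.5)
```

Without γ, the spectral term could never exceed one, so a retention time difference of a few units would dominate the sum; this is the imbalance the scale factor is meant to correct.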

Another potential means of improving the performance of the developed CPD-GPA algorithms is to examine different distance measures, such as the Maximum, Manhattan, and Canberra distances, as investigated in mSPA.⁹ In this study, the Euclidean distance and the mixture distance are used. As demonstrated in mSPA,⁹ the Canberra distance might be more appropriate for two-dimensional MS data because the first- and second-dimension retention times have different units.

Table 2 and Supplementary Information Table S1 suggest that the values of the mixture distance weight (δ) and the outlier parameter (ω) that yield better peak alignment performance are negatively associated with each other. Namely, in all cases for the methods M and P+M, when δ is close to zero, ω is close to one, and vice versa. For example, for the heterogeneous case, the method M has δ = 0.1111 and ω = 0.9999 with the F1 score of 0.9548, while it has δ = 1 and ω = 0.1111 with the F1 score of 0.8740 as shown in Supplementary Information Table S1. This implies that the effect of outliers is diminished when the MS/MS similarity plays a critical role in peak alignment (e.g., δ = 1 and ω = 0.1111), but the effect is enlarged when the distance measure has an important role (δ = 0.1111 and ω = 0.9999).
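
To show where ω enters the computation, the sketch below implements the E-step posterior of the standard CPD algorithm of Myronenko and Song,¹⁸ in which a uniform outlier component with weight ω is added to the Gaussian mixture. This is the generic CPD formula only; the CPD-GPA variants, which change the membership prior (method P) or the distance (methods M and P+M), are not reproduced. The point sets and the value of σ² are hypothetical.

```python
import math


def cpd_posterior(X, Y, sigma2, omega):
    """P[m][n]: posterior that target point X[n] was generated by centroid Y[m]."""
    N, M, D = len(X), len(Y), len(X[0])
    # Constant contributed by the uniform outlier component (requires omega < 1).
    c = (2.0 * math.pi * sigma2) ** (D / 2.0) * omega / (1.0 - omega) * M / N
    P = [[0.0] * N for _ in range(M)]
    for n, x in enumerate(X):
        k = [math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma2)) for y in Y]
        denom = sum(k) + c
        for m in range(M):
            P[m][n] = k[m] / denom
    return P


def column_mass(P):
    """Total posterior mass assigned to the centroids for each target point."""
    return [sum(P[m][n] for m in range(len(P))) for n in range(len(P[0]))]


# Hypothetical 2-D peaks: two targets close to a centroid, one far away.
X = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
Y = [(0.1, 0.0), (1.1, 0.0)]
low = column_mass(cpd_posterior(X, Y, sigma2=0.5, omega=0.1))
high = column_mass(cpd_posterior(X, Y, sigma2=0.5, omega=0.9))
```

A larger ω moves posterior mass from the centroids to the outlier component for every target point, and a point far from all centroids receives almost no mass under either setting.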

The consensus among the four CPD-GPA algorithms was analyzed for the mice data using Venn diagrams; the results are shown in Fig. 4 and Supplementary Information Fig. S4. Supplementary Information Fig. S4 shows the Venn diagrams of the four CPD-GPA algorithms in terms of the number of aligned peak pairs for each of the 10 pairwise peak alignments. Fig. 4 summarizes these individual Venn diagrams by summing all 10 of them and then calculating the associated percentages. Note that we considered all aligned peak pairs, whether or not a match is a true positive. All four algorithms commonly aligned more than 65% of the peak pairs (i.e., 65.6%), suggesting that whether the remaining roughly 35% of the peak pairs are aligned depends on the choice of method. The method RT has the most peak pairs that were not aligned by the other three methods (4.45%), but it aligned the fewest peak pairs overall (72.64% [RT] vs. 91.99% [P], 90.30% [M], and 91.65% [P+M]). Among the four algorithms, the method P aligned the most peak pairs, which is consistent with its having the highest TPR in Table 2. The consensus analysis was further restricted to the three methods P, M, and P+M, which, unlike the method RT, use both the retention time distance and the MS/MS similarity. In this case, the percentages of unique peak pairs aligned only by each of P, M, and P+M are 2.87%, 0.85%, and 0.62%, respectively. This might be because the method P differs substantially from the methods M and P+M in the way it incorporates the MS/MS similarity into CPD.
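
The percentages in this consensus analysis can be obtained with plain set operations on the aligned peak pairs. The sketch below is illustrative only: the peak pairs are made up, and taking the union of all aligned pairs as the denominator is an assumption about how the percentages were normalized.

```python
def consensus_summary(aligned):
    """Summarize agreement among methods.

    aligned: dict mapping a method name to its set of aligned peak pairs,
             each pair being (peak_id_in_sample_A, peak_id_in_sample_B).
    Percentages are relative to the union of all aligned pairs (an assumption).
    """
    union = set().union(*aligned.values())
    common = set.intersection(*aligned.values())
    summary = {"common_pct": 100.0 * len(common) / len(union)}
    for name, pairs in aligned.items():
        others = set().union(*(p for k, p in aligned.items() if k != name))
        summary[name] = {
            "unique_pct": 100.0 * len(pairs - others) / len(union),
            "coverage_pct": 100.0 * len(pairs) / len(union),
        }
    return summary


# Made-up aligned pairs for one pairwise alignment.
aligned = {
    "RT": {(1, 1), (2, 2), (3, 4)},
    "P": {(1, 1), (2, 2), (3, 3), (4, 4)},
    "M": {(1, 1), (2, 2), (3, 3)},
    "P+M": {(1, 1), (2, 2), (3, 3)},
}
summary = consensus_summary(aligned)
```

Summing the pair sets over all pairwise alignments before calling the function would correspond to the pooled summary described for Fig. 4.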

Figure 4. — Venn diagram for consensus analysis among four developed peak alignment methods for mice data. The percentages represent the number of consensus true positive peak pairs found over the total number of true positive peak pairs by all the four methods.

BiPACE-2D underperformed, especially for the homogeneous data, and overperformed for the heterogeneous data compared with its original work¹², although its overall performance is not better than that of our developed CPD-GPA algorithms. This is because the method used here to correct the multiple peaks differs from that used in the original work. Their approach, called ‘modified grouping by maximum area’ (MGMA), corrects the multiple peaks numerically based on an elliptical function centered at (0,0).

The peak alignment performance of each method was investigated mainly on the basis of pairwise peak alignment. We therefore further compared our developed methods and BiPACE-2D in terms of full peak alignment, using the tuning parameters listed for each method in Table 3. The full peak alignment was performed using the approach introduced in mSPA⁹, which is a sequential full alignment. Namely, for the homogeneous data, we aligned 10 samples with 9 pairwise alignments, (S1,S2), (S2,S3), …, (S9,S10), and, for the heterogeneous data, we aligned 16 samples with 5 additional pairwise alignments, (S10,S11), …, (S14,S15). Similarly, for the mice and chlamy data, we aligned 5 and 12 samples with 4 and 11 pairs, (M1,M2), (M2,M3), …, (M4,M5) and (C1,C2), (C2,C3), …, (C11,C12), respectively. The number of true matched peaks (u) for each of the four data sets was calculated based solely on the metabolite names that were present in all samples, resulting in 66, 63, 146, and 71 for the homogeneous, heterogeneous, mice, and chlamy data, respectively. Once the full alignment peak table was constructed for each method, the peaks with at least one missing value were removed and the number of matched peaks (v) was calculated. The number of true positives (TP) was then computed, where a TP was defined as a set of peaks aligned with the same name. Then, using Equations (12) and (13), we calculated TPR, PPV, and F1 score as shown in Table 4. As expected, the performance of the full peak alignment is consistent with that of the pairwise peak alignment in Table 3 in the sense that CPD-GPA performs better in all data sets except the mice data set.
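
The sequential full alignment and its evaluation can be sketched as follows. The sketch chains consecutive pairwise matches, drops rows with a missing peak, and scores the remaining rows. Equations (12) and (13) are not reproduced in this section, so TPR = TP/u, PPV = TP/v, and F1 as their harmonic mean are assumed here; the function names, peak ids, and metabolite names are hypothetical.

```python
def chain_alignments(pairwise):
    """Chain consecutive pairwise matches into full-alignment rows.

    pairwise[i] maps a peak id in sample i to its matched peak id in sample i+1.
    Rows that lose their match at any step (a missing peak) are dropped.
    """
    rows = [[p] for p in pairwise[0]]
    for matches in pairwise:
        rows = [row + [matches[row[-1]]] for row in rows if row[-1] in matches]
    return rows


def scores(rows, names, u):
    """Return (TPR, PPV, F1); names[i][peak] is the metabolite name in sample i."""
    v = len(rows)
    tp = sum(
        1 for row in rows
        if len({names[i][peak] for i, peak in enumerate(row)}) == 1
    )
    tpr = tp / u if u else 0.0
    ppv = tp / v if v else 0.0
    f1 = 2 * tpr * ppv / (tpr + ppv) if tp else 0.0
    return tpr, ppv, f1


# Toy data: three samples with three peaks each.
names = [
    {0: "alanine", 1: "glycine", 2: "valine"},
    {0: "alanine", 1: "glycine", 2: "serine"},
    {0: "alanine", 1: "valine", 2: "glycine"},
]
pairwise = [
    {0: 0, 1: 1, 2: 2},  # sample 1 -> sample 2
    {0: 0, 1: 2, 2: 1},  # sample 2 -> sample 3
]
rows = chain_alignments(pairwise)
# u: metabolite names present in every sample.
u = len(set.intersection(*(set(n.values()) for n in names)))
tpr, ppv, f1 = scores(rows, names, u)
```

In this toy case two of the three complete rows carry the same name in every sample, so TP = 2 with u = 2 and v = 3.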

Table 4.

Comparisons between our best results and BiPACE-2D results on the same datasets in terms of the maximum F1 scores for full peak alignments. The cases with the best performance are highlighted with bold and italic fonts.

CPD-GPA

	Method	Rigid/Nonrigid	z-score	Mixture Weight (δ)	Outlier (ω)	β	λ	TPR	PPV	F1 score
Homogeneous	M	Nonrigid	No	0.8888	0.1111	1	1	*0.9545*	*0.9844*	*0.9692*
Heterogeneous	P	Nonrigid	Yes	-	0.8888	2.6	1.4	*0.7937*	0.9091	*0.8475*
Mice	P+M	Rigid	No	0.7777	0.5555	-	-	0.5205	0.5468	0.5333
Chlamy	P+M	Rigid	No	0.7777	0.5555	-	-	*0.5634*	*0.8889*	*0.6897*

BiPACE-2D

	D1	D2	T1	T2	MCS	MS/MS similarity		TPR	PPV	F1 score
Homogeneous	10	0.25	0	0.25	2	dot product		0.7576	0.9259	0.8333
Heterogeneous	10	0.25	0	0.25	2	dot product		0.7143	*0.9574*	0.8182
Mice	25	0.5	0.75	0	2	weighted cosine		*0.6370*	*0.5813*	*0.6078*
Chlamy	100	0.5	0.99	0.99	2	cosine		0.5493	0.8864	0.6783


Our general recommendations are as follows. For experimental data without MS/MS similarity information, such as LC-MS or LC×LC-MS, the method RT is the choice for peak alignment. Notably, compared with mSPA-PAD, which uses only retention times, the method RT performs better for the mice data (0.5096 vs. 0.4729) and similarly for the homogeneous data (0.9618 vs. 0.9652) in terms of F1 score, as shown in Supplementary Information Table S1 and in Table 2 of Kim et al. (2011).⁹ When both retention times and MS/MS similarity are available, as in GC-MS, LC-MS/MS, GC×GC-MS, and LC×LC-MS/MS, the method P+M will achieve better performance. However, if the experimental data are generated under different conditions (i.e., heterogeneous) or TPR is the main concern, the method P should be selected for peak alignment.

Supplementary Material

Suppl. Info: NIHMS1579081-supplement-Suppl__Info.pdf (961.3 KB, PDF)

Acknowledgements

This work was partially supported by the grants DMS-1312603 and ACI-1657364 from the National Science Foundation (NSF) and Grants Boost and Startup Grant from Wayne State University. The Biostatistics Core is supported, in part, by NIH Center grant P30 CA022453 to the Karmanos Cancer Institute at Wayne State University.

Literature Cited

1. Fraga CG, Prazen BJ & Synovec RE. Objective data alignment and chemometric analysis of comprehensive two-dimensional separations with run-to-run peak shifting on both dimensions. Anal Chem 73, 5833–5840 (2001).
2. van Mispelaar VG, Tas AC, Smilde AK, Schoenmakers PJ & van Asten AC. Quantitative analysis of target components by comprehensive two-dimensional gas chromatography. J Chromatogr A 1019, 15–29 (2003).
3. Pierce KM, Wood LF, Wright BW & Synovec RE. A comprehensive two-dimensional retention time alignment algorithm to enhance chemometric analysis of comprehensive two-dimensional separation data. Analytical Chemistry 77 (2005).
4. Zhang D, Huang X, Regnier FE & Zhang M. Two-dimensional correlation optimized warping algorithm for aligning GC x GC-MS data. Anal Chem 80, 2664–2671 (2008).
5. Gros J, Nabi D, Dimitriou-Christidis P, Rutler R & Arey JS. Robust Algorithm for Aligning Two-Dimensional Chromatograms. Analytical Chemistry 84, 9033–9040 (2012).
6. Rempe DW, et al. Effectiveness of Global, Low-Degree Polynomial Transformations for GCxGC Data Alignment. Analytical Chemistry 88, 10028–10035 (2016).
7. Oh C, Huang X, Regnier FE, Buck C & Zhang X. Comprehensive two-dimensional gas chromatography/time-of-flight mass spectrometry peak sorting algorithm. Journal of Chromatography 1179 (2008).
8. Wang B, et al. DISCO: distance and spectrum correlation optimization alignment for two-dimensional gas chromatography time-of-flight mass spectrometry-based metabolomics. Anal Chem 82, 5069–5081 (2010).
9. Kim S, Fang A, Wang B, Jeong J & Zhang X. An optimal peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry using mixture similarity measure. Bioinformatics 27, 1660–1666 (2011).
10. Kim S, Koo I, Fang A & Zhang X. Smith-Waterman peak alignment for comprehensive two-dimensional gas chromatography-mass spectrometry. BMC Bioinformatics 12, 235 (2011).
11. Castillo S, Mattila I, Miettinen J, Oresic M & Hyotylainen T. Data Analysis Tool for Comprehensive Two-Dimensional Gas Chromatography/Time-of-Flight Mass Spectrometry. Analytical Chemistry 83, 3058–3067 (2011).
12. Hoffmann N, Wilhelm M, Doebbe A, Niehaus K & Stoye J. BiPACE 2D-graph-based multiple alignment for comprehensive 2D gas chromatography-mass spectrometry. Bioinformatics 30, 988–995 (2014).
13. Hoffmann N, et al. Combining peak- and chromatogram-based retention time alignment algorithms for multiple chromatography-mass spectrometry datasets. BMC Bioinformatics 13 (2012).
14. Deng B, Kim S, Li H, Heath E & Zhang X. Global peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry using point matching algorithms. J Bioinform Comput Biol, 1650032 (2016).
15. Besl PJ & McKay ND. A Method for Registration of 3-D Shapes. IEEE T Pattern Anal 14, 239–256 (1992).
16. Rangarajan A, Chui H & Bookstein FL. The softassign Procrustes matching algorithm. Lect Notes Comput Sc 1230, 29–42 (1997).
17. Chui HL & Rangarajan A. A new point matching algorithm for non-rigid registration. Comput Vis Image Und 89, 114–141 (2003).
18. Myronenko A & Song XB. Point Set Registration: Coherent Point Drift. IEEE T Pattern Anal 32, 2262–2275 (2010).
19. Dempster AP, Laird NM & Rubin DB. Maximum Likelihood from Incomplete Data Via the EM Algorithm. J Roy Stat Soc B Met 39, 1–38 (1977).
20. Doebbe A, et al. The interplay of proton, electron, and metabolite supply for photosynthetic H2 production in Chlamydomonas reinhardtii. J Biol Chem 285, 30247–30260 (2010).
21. Kim S, et al. Compound Identification Using Partial and Semipartial Correlations for Gas Chromatography-Mass Spectrometry Data. Analytical Chemistry 84, 6477–6487 (2012).
22. Kim S, Koo I, Wei X & Zhang X. A method of finding optimal weight factors for compound identification in gas chromatography-mass spectrometry. Bioinformatics 28, 1158–1163 (2012).
23. Koo I, Kim S & Zhang X. Comparative analysis of mass spectral matching-based compound identification in gas chromatography-mass spectrometry. J Chromatogr A 1298, 132–138 (2013).
24. Kim S & Zhang X. Comparative analysis of mass spectral similarity measures on peak alignment for comprehensive two-dimensional gas chromatography mass spectrometry. Comput Math Methods Med 2013, 509761 (2013).
