Anal. Chem. 2024 Feb 9;96(7):2849–2856. doi: 10.1021/acs.analchem.3c03686

Systematic Evaluation of Chromatographic Peak Quality for Targeted Mass Spectrometry via Variational Autoencoder

Chi Yang †,*, Yung-Chin Hsiao †,‡,§, Chi-Ching Lee ∥,⊥,#, Jau-Song Yu †,‡,§,¶,*
PMCID: PMC10882576  PMID: 38336364

Abstract


Targeted mass spectrometry is a powerful technique for quantifying specific proteins or metabolites in complex biological samples. Accurate peak picking is a critical step as it determines the absolute abundance of each analyte by integrating the area under the picked peaks. Although automated software exists for handling such complex tasks, manual intervention is often required to rectify potential errors like misclassification or mis-picking events, which can significantly affect quantification accuracy. Therefore, it is necessary to develop objective scoring functions to evaluate peak-picking results and to identify problematic cases for further inspection. In this study, we present targeted mass spectrometry quality encoder (TMSQE), a data-driven scoring function that summarizes peak quality at three levels: individual transitions, peak groups, and consistency across samples. Through unsupervised learning from large data sets containing 1,703,827 peak groups, TMSQE establishes a reliable standard for systematic and objective evaluations of chromatographic peak quality in targeted mass spectrometry. TMSQE shows a high degree of consistency with expert experience and can efficiently capture problematic cases downstream of automated software. Furthermore, we demonstrate the generalizability of TMSQE by successfully applying it to various data sets, including both peptide and metabolite data sets. Our proposed scoring approach provides a reliable solution for consistent and accurate peak quality evaluation, facilitating peak quality control for targeted mass spectrometry.


Targeted mass spectrometry (MS) techniques enable precise and accurate quantification of multiple targets in a single run, which has revolutionized biomarker studies.1 Two commonly employed approaches in this field are multiple reaction monitoring (MRM) and parallel reaction monitoring (PRM). Both MRM and PRM offer the ability to selectively analyze and quantify molecules of interest in complex biological samples.2 MRM assays rely on prior knowledge of the target molecules and their corresponding fragment ions to preselect surrogate transitions (precursor and daughter ion pairs). In contrast, PRM allows for the simultaneous detection of all fragment ions to quantify target analytes. By using stable isotope-labeled (heavy) molecules as internal standards with known quantities, targeted MS techniques can accurately measure the quantity of endogenous (light) molecules in samples based on the peak area ratios of light-to-heavy ions. The high sensitivity, reproducibility, and multiplexing capability of targeted MS techniques make them suitable for clinical applications in biomarker discovery and validation.3−6

The processing of targeted MS data involves the identification and selection of peak regions corresponding to analyte signals from chromatograms followed by the integration of peak areas for both light and heavy signals to achieve accurate quantification. The selection step requires peak picking to exclude interference signals, address retention time shifts, and eliminate uncertain signals that may resemble background noises. Once high-quality peak regions are determined, subsequent integration of peak areas relying on these regions is performed for accurate quantification of the analytes present in samples. Therefore, the quantification accuracy is heavily dependent on the robustness and reliability of the peak-picking outcomes.

Several artificial intelligence approaches have been developed to automate the peak-picking process. PB-Net7 is a deep learning model, consisting of two long short-term memory layers8 and a self-attention layer,9 that predicts the retention time points of peak boundaries. automRm10 is an automatic workflow designed for analyzing metabolite MRM data sets; it uses two models trained with the random forest algorithm to fully automate the peak-picking and quality evaluation processes. The first, peak-picking model determines the most likely peak candidates, while the second, peak-reporting model evaluates whether the picked peaks are of sufficient quality for reporting. In addition, automRm has shown higher accuracy than other tools such as MRMProbs,11 MRMkit,12 or Skyline.13

Despite these automated tools, picking high-quality chromatographic peaks for targeted MS data sets remains challenging. While such tools automate and accelerate the peak-picking process, manual re-evaluation through visual checks is still necessary to ensure the overall quality of the entire data set. This is because AI models can misclassify or mis-pick peaks, and such failure cases are usually hidden in the output results. Identifying problematic peaks is a time- and labor-consuming process, especially when dealing with data sets that contain hundreds of samples. Using additional classifiers to identify these failures may introduce further misjudgments. Furthermore, since the aforementioned machine learning and deep learning models were trained on expert labels in a supervised manner, the manual decisions of good or poor quality may vary both between and within analysts, so inconsistent standards are inevitably built into the labels used for model training. To address the twin issues of identifying problematic cases and of inconsistent standards, an objective and systematic quality measure that minimizes manual intervention is needed to ensure robust and optimal chromatographic peak quality in targeted MS data analysis.

Our strategy involved the use of comprehensive quantitative metrics and unsupervised learning techniques to achieve an objective evaluation of peak quality, while minimizing manual intervention. To accomplish this, we leveraged the well-established quality metrics proposed in TargetedMSQC,14 which encompassed various aspects such as jaggedness, symmetry, similarity, and many others. These metrics were adapted and restructured to fit our objective and systematic purposes. Subsequently, we employed an unsupervised deep learning method capable of learning peak quality from larger and more versatile data sets without manual labels of good or poor quality. As a result, the neural networks can provide generalized quality descriptions, enabling the development of reliable and objective scoring functions for assessing peak quality and facilitating the identification and quantification of target analytes in complex biological samples.

In this study, we established data-driven scoring functions for objective and systematic evaluations of the quality of picked peaks in targeted MS chromatograms using the β-total correlation variational autoencoder (β-TCVAE15). With these scoring functions, named targeted mass spectrometry quality encoder (TMSQE), we can objectively and quantitatively describe the quality of the picked peaks and subsequently classify them into categories of good, acceptable, and poor quality. Our classifications are based on score distributions from a large and versatile data cohort, which enables us to mitigate potential misjudgments. We propose the TMSQE scoring functions as a reliable standard for efficient identification of problematic cases, with quality explanations, downstream of automatic peak-picking tools. To further facilitate in-depth diagnostics of chromatographic peak quality, we have also developed a Python package, tmasque, that applies our proposed TMSQE scoring functions to generate human-readable outputs for automating the quality control of targeted MS data analysis.

Materials and Methods

Data Collection from Public Domain and In-House Experiment

A total of 1018 Skyline files were collected from 75 targeted proteomics studies (Table S1), which assayed human samples with stable isotope-labeled internal standards. These data sets are publicly available on Panorama Public.16 We then used Skyline software13 to manually export the chromatogram and peak boundary CSV files from these Skyline files. These targeted MS data sets consisted of 1,703,827 peak groups, which encompassed a total of 6,186,462 transition signals derived from 4045 unique peptides. Additionally, these data sets were generated from 22 distinct mass spectrometer models manufactured by four different brands: AB Sciex, Agilent Technologies, Thermo Fisher Scientific, and Waters Corporation. This comprehensive compilation provides versatile data sets that align with our unsupervised learning purposes.

To compare the peak quality scores before and after manual evaluations, we used an unpublished data set generated from MRM assays performed on the AB Sciex QTRAP 5500 instrument. The data set encompassed a comprehensive analysis of 50 target peptides across 451 saliva samples from three distinct clinical groups: healthy donors (n = 150), samples with oral potentially malignant disorders (OPMD, n = 147), and samples with oral squamous cell carcinoma (OSCC, n = 154). A total of 67,650 peak groups were obtained by conducting triplicate experiments quantifying the 50 targets across the 451 saliva samples. We employed Skyline software to obtain the chromatograms and their corresponding peak boundaries both before and after manual evaluations.

Generation of Response Curves by the LC-MRM Assay

Reverse response curves were generated for the 105 target peptides through quintuplicate experiments, following the procedure described in the previous study.17 Serially diluted heavy peptides were spiked into trypsin-digested samples of pooled saliva, which already contained the corresponding light peptides. The LC-MRM assay was then used to measure the concentration-dependent responses of the spiked-in heavy peptides by measuring the peak area ratio of heavy-to-light signals.

For the sample preparation, the pooled saliva sample was adjusted with 25 mM ammonium bicarbonate buffer, denatured with 10% sodium deoxycholate (DOC), reduced with 50 mM tris(2-carboxyethyl)phosphine (TCEP) at 60 °C for 30 min, alkylated with 100 mM iodoacetamide at 37 °C for 30 min, acidified with 10% formic acid (FA) and 10% trifluoroacetic acid (TFA) to precipitate DOC, and subsequently subjected to trypsin-mediated digestion. Aliquots (1 μg) of the digested saliva were separately mixed with the serially diluted heavy peptides (0.0078, 0.0156, 0.0312, 0.0625, 0.125, 0.25, 0.5, 2, 8, 32, 128, and 512 fmol), and the resulting samples were lyophilized and stored at −20 °C until further analysis by LC-MRM.

For LC-MRM analysis, the lyophilized samples were rehydrated with 4 μL of 0.1% FA and injected onto a nanoACQUITY UPLC C18 column (100 μm × 100 mm, 1.7 μm particle size; Waters). The liquid-chromatography steps included an 82.5 min linear gradient from 3 to 25% buffer B, a 7.5 min linear gradient from 25 to 35% buffer B, and a final 1 min linear gradient from 30 to 95% buffer B, followed by a postgradient equilibration with 3% buffer B for 3 min. The resolved fractions were applied to a QTRAP 5500 system (AB Sciex, Redwood, CA, US) with the scheduled MRM method. The preselected transition (precursor/fragment pair) list of the target peptides (Table S2) was used for data acquisition, and the data were processed using Skyline software.

Quality Feature Embedding

The first step in developing the quality scoring functions and guidelines for targeted MS data is to quantitatively describe the quality of chromatographic peaks with multiple features (Figure 1). We characterized peak quality using 47 quality features derived from the 32 quality metrics proposed in the TargetedMSQC study14 (Table S3). These metrics provide a comprehensive description of the quality of chromatographic peaks in targeted MS data, covering nine aspects and four levels. The nine aspects are jaggedness, symmetry, similarity, modality, shift, full width at half-maximum, area ratio, intensity, and retention time, while the four levels encompass the transition (quality of the individual signal for each transition ion), the transition pair (quality summarized from the light and heavy pair of a transition ion), the isotope (quality summarized from the light and heavy signals of all transition ions of a peak group), and the peak-group level (quality summarized from all transition ions of a peak group).

Figure 1. Flowchart of the development of TMSQE scoring and score guideline.

We further categorized these 47 quality features into three quality types: type I quality, containing 17 features that describe the quality of individual transition signals; type II quality, consisting of 22 features that evaluate the overall quality of a peak group; and type III quality, which measures the consistency of all samples of an analyte using 8 features. For detailed descriptions of the 47 quality features used in this study, please refer to Table S3. In summary, for each of the 6,186,462 transition signals, we represented peak quality with a 17-feature vector of type I quality and an 8-feature vector of type III quality. Type II quality describes the overall quality of an entire peak group, resulting in a 22-feature vector for each of the 1,703,827 peak groups.
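As an illustration of how such features are computed from raw intensity traces, the following numpy sketch approximates three of the nine quality aspects. These are simplified stand-ins, not the exact TargetedMSQC formulations (which are given in Table S3); they only show that each aspect reduces to a bounded number with a known best value.

```python
import numpy as np

def jaggedness(y):
    """Fraction of sign flips in the first difference of the intensity trace.
    0 is smoothest (best); values approaching 1 indicate a noisy, jagged peak."""
    d = np.sign(np.diff(y))
    d = d[d != 0]                     # ignore flat segments
    if len(d) < 2:
        return 0.0
    return float(np.mean(d[1:] != d[:-1]))

def symmetry(y):
    """Pearson correlation between the peak and its mirror image.
    1 = perfectly symmetric (best); -1 = worst, matching the bounds in the text."""
    return float(np.corrcoef(y, y[::-1])[0, 1])

def similarity(light, heavy):
    """Pearson correlation between co-eluting light and heavy traces (best = 1)."""
    return float(np.corrcoef(light, heavy)[0, 1])
```

A smooth Gaussian peak scores near the best values (jaggedness near 0, symmetry near 1), while adding noise raises jaggedness, which is the directionality the scoring relies on.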

Dimension Reduction with β-TCVAE

After the quality feature embedding, we applied the β-TCVAE algorithm15 to nonlinearly reduce the dimensions of the quality features, resulting in a two-dimensional latent space for each of the three quality types (Figure 1). The choice of two dimensions was motivated by its capability to effectively capture peak quality, as illustrated in Figure S1. The details of β-TCVAE can be found in Text S1.

To conduct β-TCVAE, we implemented the neural network architectures presented in Table S4. Using the β-TCVAE approach, these neural networks were trained with our collected data sets, consisting of the 1,703,827 vectors of type II features and the 6,186,462 feature vectors of types I and III. The training was performed with a minibatch size of 10,000, and the initial learning rate was set at 1 × 10−3. The learning rate was then adaptively decreased until it reached 1 × 10−6. After 40 epochs of training, the learning process converged, as indicated by the stabilization of the loss values. The resulting encoder networks function as our quality encoders, capable of encoding the quality features into a two-dimensional latent space for each of the three quality types.
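The data flow through the encoder half of such a network can be sketched in plain numpy: a feature vector (22-dimensional for type II) is mapped to a latent mean and log-variance, and a 2-D latent point is drawn via the reparameterization trick. The layer sizes and random weights below are placeholders, not the trained architectures of Table S4, and the β-TCVAE training objective itself (Text S1) is omitted.

```python
import numpy as np

rng = np.random.default_rng(42)

class QualityEncoder:
    """Minimal VAE-style encoder sketch with a two-dimensional latent space."""
    def __init__(self, n_features, n_hidden=32, n_latent=2):
        # Illustrative single hidden layer; weights are untrained placeholders.
        self.W1 = rng.normal(0, 0.1, (n_features, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W_mu = rng.normal(0, 0.1, (n_hidden, n_latent))
        self.W_logvar = rng.normal(0, 0.1, (n_hidden, n_latent))

    def encode(self, x):
        h = np.tanh(x @ self.W1 + self.b1)
        return h @ self.W_mu, h @ self.W_logvar   # latent mean, log-variance

    def sample(self, x):
        mu, logvar = self.encode(x)
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * logvar) * eps    # reparameterization trick

# One minibatch of 10,000 type II feature vectors maps to 2-D latent points.
enc = QualityEncoder(n_features=22)
z = enc.sample(rng.normal(size=(10000, 22)))
```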

Latent Space Annotation

As depicted in our method development flowchart in Figure 1, we annotated the encoded latent space after the dimension reduction of the quality features. We established reference points within our encoded latent spaces to identify regions corresponding to high or low peak quality. These reference points serve as indicators of peak quality, with the expectation that feature vectors corresponding to high-quality chromatographic peaks will be represented as latent points located near the reference point of optimal quality. As shown in Table S3, these quality features are one- or two-sided bounded, with specific quality directions. To establish bounded ranges for one-sided bounded features, we assigned a maximum value of five. Consequently, the theoretically best and worst feature values derived from the computation of quality metrics can serve as absolute benchmarks, facilitating an unambiguous and objective assessment of peak quality. According to the calculations of the quality metrics, the best values for jaggedness, symmetry, similarity, modality, and shift were 0, 1, 1, 0, and 0, respectively, while the worst values were 1, −1, −1, 1, and 1 in the opposite direction. In the case of type III quality features, which assess consistency via coefficients of variation, a value of zero was deemed optimal while the worst values were capped at five. Additionally, we established three additional reference points to capture the distributions of the encoded points in the two-dimensional latent space: the 70% best case, the light-best-heavy-worst case, and the heavy-best-light-worst case. These additional reference points give us a finer-grained understanding of the latent spaces.
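Given these bounds, the theoretical best and worst reference vectors can be assembled mechanically. The sketch below covers only the five named aspects and treats any other feature (e.g. the hypothetical "cv_across_samples" name used here) as one-sided bounded with the cap of five; the real reference points span all features of each quality type.

```python
# Best/worst benchmark values for five quality aspects, as stated in the text.
ASPECT_BOUNDS = {            # aspect: (best, worst)
    "jaggedness": (0.0, 1.0),
    "symmetry":   (1.0, -1.0),
    "similarity": (1.0, -1.0),
    "modality":   (0.0, 1.0),
    "shift":      (0.0, 1.0),
}
ONE_SIDED_CAP = 5.0          # one-sided features: best 0, worst capped at 5

def reference_vector(aspects, which="best"):
    """Assemble the theoretical best or worst feature vector for a list of
    aspect names (illustrative; actual vectors span all 47 features)."""
    idx = 0 if which == "best" else 1
    return [ASPECT_BOUNDS.get(a, (0.0, ONE_SIDED_CAP))[idx] for a in aspects]
```

Encoding these two vectors with the trained encoder then places the "best" and "worst" markers in the latent space.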

Scoring Function Development

After annotating the latent spaces, we established a scoring function for each of the three types (Figure 1). The function relies on the distances from the encoded point of each chromatographic peak to each of the five reference points. The scoring function is shown as

S_raw = Σ_{j=1}^{5} w_j (1 − d_j/d_max)

where j = {1, 2, ..., 5} indexes the five reference points (the best, 70% best, light-worst-heavy-best, light-best-heavy-worst, and worst points), dj represents the Euclidean distance from the encoded point to each reference point, and dmax is the distance spanning from the best to the worst point. Additionally, we assigned the weights wj to these reference points: specifically, 2, 1, 1, −1, and −2 in sequential order. For the elucidation of this weight set, please refer to Figure S2. To normalize the raw scores, we employed the min-max technique, transforming them to a range of −10 to 10 using the formula −10 + 20 × (Sraw − Smin)/(Smax − Smin), where Smax and Smin represent the raw scores for the best and the worst reference points, respectively. The intention behind this scoring approach is to assign higher quality scores to encoded points closer to the best reference point and vice versa. As a result, our scoring functions provide summarized quality scores that represent the quality of individual transitions (type I quality), the overall quality of each peak group (type II quality), and the consistency across samples (type III quality). Lastly, by analyzing the score distributions of the chromatographic peaks we collected, quality guidelines can be established to categorize the peaks into good, acceptable, and poor quality.
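A minimal numpy sketch of this scoring procedure follows: the raw score weights the complement of each normalized reference distance, and min-max normalization then maps the result to [−10, 10]. The combination rule here is our reading of the description, not a verified reproduction of the published formula.

```python
import numpy as np

# Weights for the five reference points, in the order given in the text:
# best, 70% best, light-worst-heavy-best, light-best-heavy-worst, worst.
WEIGHTS = np.array([2.0, 1.0, 1.0, -1.0, -2.0])

def raw_score(z, refs, d_max):
    """Distance-weighted raw score: points near the best reference score high,
    points near the worst score low."""
    d = np.linalg.norm(refs - z, axis=1)        # d_j for j = 1..5
    return float(np.sum(WEIGHTS * (1.0 - d / d_max)))

def tmsqe_score(z, refs):
    """Min-max normalize the raw score to [-10, 10] using the raw scores of
    the best (refs[0]) and worst (refs[4]) reference points themselves."""
    d_max = np.linalg.norm(refs[0] - refs[4])   # best-to-worst span
    s_max = raw_score(refs[0], refs, d_max)
    s_min = raw_score(refs[4], refs, d_max)
    s = raw_score(z, refs, d_max)
    return -10.0 + 20.0 * (s - s_min) / (s_max - s_min)
```

By construction, a point at the best reference scores exactly 10 and a point at the worst reference scores exactly −10, regardless of the latent-space geometry.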

Automatic Peak Picking with the automRm Workflow

To compare the effectiveness of our scoring guidelines with the results of automRm, we applied the automRm workflow on the same four data sets employed in the automRm study. The four data sets contain one reversed-phase (RP) and three hydrophilic interaction liquid chromatography (HILIC) data sets. We obtained both the trained peak-picking and reporting models, as well as the four data sets, from the public automRm repository.18

To evaluate the quality of automRm’s peak-picking results using TMSQE, we performed several postprocessing steps to ensure compatibility. First, we filtered out molecules that did not have corresponding isotope-labeled 13C standards, as our proposed approach primarily focuses on paired light and heavy signals of fragmented ions. Subsequently, for each remaining molecule, we assigned the signals from the 13C-labeled molecule as heavy ions, while all corresponding 12C-unlabeled signals were considered light ions based on the provided metabolite data sets. We then artificially replicated the heavy signal to create multiple pairings with unlabeled light signals for each molecule. It is important to note that this virtual pairing resulted in no correlation between the area ratios of light and heavy signals, which consequently affected the type II feature PeakGroupRatioCorr. Despite this limitation, TMSQE was still considered applicable for evaluating peak quality, as the quality scores were derived from combinations of multiple features and were expected to be robust against the influence of individual features.
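This compatibility postprocessing can be sketched as follows, assuming a simple dictionary layout (hypothetical, not the actual automRm file format): molecules without a 13C standard are dropped, and the single heavy trace is replicated to form a pair with every light transition.

```python
def prepare_pairs(molecules):
    """molecules: dict of name -> {'light': [traces...], 'heavy': trace or None}.
    Drop molecules lacking a 13C standard, then pair every 12C (light)
    transition with the replicated 13C (heavy) trace."""
    pairs = {}
    for name, m in molecules.items():
        if m.get("heavy") is None:        # no isotope-labeled standard
            continue
        pairs[name] = [(light, m["heavy"]) for light in m["light"]]
    return pairs
```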

Results and Discussion

Peak Quality Evaluations with TMSQE

In this study, we introduce TMSQE, a novel quality encoder specifically designed for the objective assessment of chromatographic peak quality in the targeted MS data. Our approach aims to learn peak quality directly from the data, eliminating the need for manual labeling of peaks as good or poor quality in the training data. To accomplish this, we trained the TMSQE quality encoders using β-TCVAE, developed scoring functions, and established the scoring guidelines as shown in Figure 1.

With TMSQE, the quality of each chromatographic peak of targeted MS can be evaluated in three steps (Figure 2): (1) feature embedding to represent the peak quality with multiple features, (2) quality encoding with TMSQE to encode the features into latent spaces and calculate the quality scores, and (3) quality summarization to apply our data-driven scoring guideline and classify peaks as good, acceptable, or poor quality, enabling efficient quality summarization under a single standard. We have developed a Python package named tmasque to streamline the quality evaluation processes outlined in Figure 2. The source code is publicly available at GitHub (https://github.com/chiyang/tmasque). Please refer to the Supporting Information on the Python package implementation (Text S2).

Figure 2. Workflow of peak quality evaluation using TMSQE. This figure illustrates the process of peak quality evaluation using our proposed TMSQE. Initially, quality features are calculated for each targeted MS chromatographic peak. These features have been restructured into three quality types to describe peak quality at different levels (Table S3). Following the feature embedding for each type, TMSQE encodes these feature vectors into two-dimensional latent spaces. Subsequently, quality scores can be obtained through our developed scoring functions, which are graphically presented in the contour plots. Finally, the scores can be summarized into poor, acceptable, and good quality. This data-driven summarization guideline further simplifies the quality decisions for each chromatographic peak. Additionally, the embedded quality feature values can be used to interpret the reasons behind the quality decisions and facilitate in-depth quality diagnostics.

In the following sections, we provide detailed results during our method development, present the validation results of our proposed TMSQE scoring, and discuss the potential applications.

Scoring Functions Derived from the Latent Spaces of β-TCVAE

In Figure 3A–C, we present the graphical representations of the latent spaces generated by the trained quality encoders for the three quality types. For types I and III, individual points in the latent space were encoded from the quality features associated with each transition ion. For type II, points were derived from each peak group. Our quality features have bounded values and directions, enabling us to establish reference points to annotate these latent spaces. These reference points serve as theoretical markers indicating the best and worst quality positions in the latent spaces. Based on the proximity to these reference points, we established scoring functions to summarize the peak quality. The resulting contour plots, depicting the scoring functions, are presented in Figure 3D–F.

Figure 3. Scoring contour plots illustrating our quality scoring functions on the latent spaces. Panels A–C sequentially display the latent spaces corresponding to the three quality types. Points in the three latent spaces were encoded from the quality features of our collected chromatographic peaks. To comprehend the quality directions, we established five reference points, denoted as i–v. Point (i) represents the highest quality, while point (ii) signifies the top 70% quality. Point (iii) denotes the worst quality for light ions and the best quality for heavy ions, while point (iv) signifies the opposite: the best for light ions and the worst for heavy ions. Finally, point (v) represents the lowest overall quality. After applying the scoring functions, the latent spaces representing the quality of types I, II, and III were annotated with their respective scoring contours as depicted in panels D–F, respectively.

In Figure 3, a clear trend can be observed along a major axis, indicating the favorable direction from the theoretical worst point to the best. Also, encoded points near the best point exhibit a cone-shaped distribution with the vertex being the best quality point. This cone-shaped pattern suggests the existence of a single optimal pattern for peak quality, toward which peaks of varying qualities converge. Exploiting this clear trend, we were able to develop the scoring functions that can give a summarized score for each of the three quality types. As a result, for any chromatographic peak of targeted MS, we can compute the corresponding quality features, encode them into the latent space, and use our scoring function to obtain a TMSQE score for each of the three quality types.

Quality Guideline Derived from TMSQE Score Distribution

To provide guidelines for summarizing peak quality from TMSQE scores, we analyzed the score distributions within our collected data cohort. As shown in Figure 4A, the type I scores exhibited a multimodal distribution with three modes, and we set two empirical score thresholds to classify peaks into three categories: good, acceptable, and poor quality. Following the same concept, two score thresholds were set to differentiate clusters of type II scores, except that the acceptable-quality category contained two modes (Figure 4B). For type III scores, we used the 40th and 60th percentiles as the thresholds for the unimodal distribution (Figure 4C). This classification guideline facilitates the efficient identification of problematic peaks that need manual reevaluation.
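The threshold logic can be sketched as follows. The type I/II cutoffs were set empirically at distribution modes (not reproduced here), whereas the type III cutoffs are plain percentiles of the score distribution, as stated.

```python
import numpy as np

def classify(scores, thresholds):
    """Map scores to poor/acceptable/good given (low, high) cutoffs."""
    lo, hi = thresholds
    return ["poor" if s < lo else "good" if s >= hi else "acceptable"
            for s in scores]

def type3_thresholds(scores):
    """Type III uses the 40th and 60th percentiles of its unimodal distribution."""
    lo, hi = np.percentile(scores, [40, 60])
    return float(lo), float(hi)
```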

Figure 4. Distribution of quality scores for the three types in our collected cohort. Panels A–C represent the score distributions of types I, II, and III, respectively, with scores below zero excluded. The score distributions of types I and II (panels A and B) exhibit multimodal patterns, suggesting the presence of distinct clusters of varying quality levels. Based on the two thresholds indicated as vertical straight lines on each panel, we classified the quality scores into three categories of poor, acceptable, and good quality, depicted in red, yellow, and green, respectively.

To apply the scoring guideline to summarize peak quality, we propose using combinations of the three quality types. For an overall quality measure of a peak group, we propose using the type II score supplemented with the median of the type I scores, because the type II score describes the overall quality of a peak group while the type I scores evaluate the quality of its individual transition ions. Additionally, we propose considering the type I score in conjunction with the type III score when selecting suitable quantifiers, which require both high peak quality and highly consistent signals among transition ions; the type III scores cover cross-sample consistency and peak-area consistency for each transition ion.

During the development of the TMSQE scores and the corresponding score thresholds, we minimized subjective human judgments in determining peak quality to establish an objective and data-driven quality standard. In this way, we expect that the standardized procedure outlined in our study will provide an objective summary and evaluation of peak quality based on the TMSQE scoring functions.

Consistency between TMSQE Scores and Manual Experience in Quality Evaluation

To investigate whether the TMSQE scores are consistent with expert experience in judging peak quality, we examined the distributions of TMSQE scores using our in-house data set containing the 50 target peptides assayed in 451 saliva samples (Table S5). The chromatographic peaks were initially selected by Skyline software and then curated manually. Among the total of 67,650 peak groups, the peak boundaries of 36,836 peak groups were retained or adjusted, while 30,814 peak groups were removed manually.

Chromatographic peaks with a wide range of TMSQE scores were initially identified using Skyline (Figure 5). After manual curation, the removed peak groups exhibited poor-quality scores, with the third quartile falling below the acceptable score threshold for each of the three quality types. For chromatograms with manually adjusted peak boundaries, we observed a clear improvement of the scores for all three quality types. Additionally, the overall score distributions of all three types also demonstrated improvement, with the modes shifting from the acceptable zone to the good-quality zone after manual adjustment. Finally, a subset of retained chromatographic peaks displayed good-quality score distributions, with medians surpassing those of the adjusted peaks. These results suggest that peaks with higher scores were retained without adjustment because their quality was already deemed sufficient. Overall, the TMSQE scoring agrees with manual experience in quality assessment.

Figure 5. Distributions of TMSQE scores before and after manual curation of peak picking. Panels A–C display violin plots illustrating the distributions of TMSQE scores for types I, II, and III, respectively. Each plot consists of five lanes, from left to right, representing the score distributions of the initial peaks picked by Skyline, the peaks before and after manual adjustment, retained peaks, and manually removed peaks. Each lane embeds a box plot denoting the interquartile range and the score median.

Validation of TMSQE Scoring Guidelines by Comparing with automRm

In addition to the validation of TMSQE scores against manual experience, we investigated the consistency of quality assessment between TMSQE scores and automRm.10 The automRm workflow comprises two models: a peak-picking model and a peak-reporting model. The peak-picking model identifies chromatographic peaks, while the peak-reporting model assigns quality scores (QS) to determine the suitability of the identified peaks for reporting.

To conduct our validation, we used the same four data sets employed in the original automRm study as our benchmark data. After applying automRm, we computed the TMSQE scores of the chromatographic peaks picked by the automRm peak-picking model. To compare the quality decisions between automRm and TMSQE, peaks were considered qualified by automRm if their QS scores were greater than or equal to 50. For the TMSQE scores of both types I and II, the minimum requirement was to meet the acceptable quality thresholds of the scoring guideline.
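This comparison can be reproduced schematically: two decisions agree when automRm's QS ≥ 50 matches TMSQE meeting its acceptable threshold, and the Spearman correlation is the Pearson correlation of the rank-transformed scores. The rank transform below does not average tied ranks, which is adequate for mostly distinct score values; the TMSQE cutoff of 0 stands in for the acceptable-quality threshold.

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation via Pearson correlation of the ranks
    (ties are not averaged in this simplified version)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

def agreement(qs, tmsqe, qs_cutoff=50.0, tmsqe_cutoff=0.0):
    """Fraction of peak groups with consistent qualified/unqualified calls."""
    return float(np.mean((np.asarray(qs) >= qs_cutoff)
                         == (np.asarray(tmsqe) >= tmsqe_cutoff)))
```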

Table 1 presents the comparison results of TMSQE scores and the automRm QS for the four metabolite data sets. The results demonstrate a remarkable level of decision consistency in assessing chromatographic peak quality: 79.53% of the peak quality decisions were in agreement overall, and Spearman correlations of around 0.81 were observed between the TMSQE scores and automRm QS. This high degree of consistency suggests that, despite considering different quality features, TMSQE and automRm exhibit similar preferences in evaluating peak quality. TMSQE summarizes peak quality based on data-driven distributions from a large data cohort, whereas the two automRm models were trained using expert-labeled data from the combined four data sets. Despite these differences, the significant consistency observed between TMSQE and automRm highlights the generalizability of TMSQE scoring across metabolite MRM data, even though TMSQE was originally trained on peptide data sets.

Table 1. Peak Quality Comparison between TMSQE Scores and the automRm Peak-Picking Model.

| data set | total peak groups | consistent decisions (qualified + unqualified) | automRm qualified only | TMSQE qualified only | Spearman correlation |
| HILIC1a | 1152 | 974 (84.45%) (441 + 533) | 139 (12.07%) | 39 (3.39%) | 0.814 |
| HILIC1b | 960 | 767 (79.90%) (321 + 446) | 165 (17.19%) | 28 (2.92%) | 0.782 |
| HILIC2 | 3024 | 2398 (79.30%) (1246 + 1152) | 581 (19.21%) | 45 (1.49%) | 0.824 |
| RP | 1536 | 1167 (75.98%) (436 + 731) | 308 (20.05%) | 61 (3.97%) | 0.812 |

Regarding the inconsistent results from the two approaches, a small fraction (less than 4%) of all peak groups were classified as acceptable or good quality by TMSQE while being considered poor quality by automRm (QS below 50). In contrast, a larger proportion (approximately 12 to 20%) of the peak groups were approved by the automRm peak-reporting model but classified as poor quality by TMSQE. This difference implies that the standards used to assess peak quality in TMSQE are more rigorous, which constitutes the primary rationale behind our proposal to use TMSQE scoring as a quality checkpoint after automated software such as automRm or Skyline. Targeted MS data analysts would thus have an additional opportunity to reassess peak groups flagged as poor quality by TMSQE scoring, uncovering peaks that might have been inaccurately picked by the automatic software (Figure S3).
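The quality-checkpoint step described above can be sketched as a simple filter over the type I and type II scores; the score layout and threshold values are illustrative placeholders, not the package's actual interface:

```python
def flag_for_review(peak_groups, type1_min=0.0, type2_min=0.0):
    """Return the peak-group IDs that fail the TMSQE quality checkpoint.

    peak_groups: dict mapping a peak-group ID to its (type I, type II)
    TMSQE scores. The default thresholds are placeholders for the
    acceptable-quality cutoffs of the scoring guideline.
    """
    return sorted(
        pg_id
        for pg_id, (t1, t2) in peak_groups.items()
        # a group qualifies only if BOTH scores meet their thresholds
        if t1 < type1_min or t2 < type2_min
    )
```

Groups returned by the filter would be routed to the analyst for manual reassessment rather than quantified directly.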

Validation of TMSQE Scores with Response Curve Experiment

Chromatographic peaks of low-abundance molecules often coincide with increased levels of interference or background noise in the peak regions, leading to lower quality at lower concentrations. To investigate whether our proposed TMSQE scoring can reflect this phenomenon, we conducted a response curve experiment using MRM assays and examined the correlations between TMSQE scores and molecule concentrations.

As TMSQE scoring evaluates peak quality independently of analyte concentration, we propose using it to facilitate determination of the limit of detection (LOD). As shown in Figure 6, the number of good-quality peaks increased with concentration and reached a plateau at higher concentrations; detailed trends for the 105 target peptides are shown in Figure S4. This pattern signifies that each molecule can be reliably detected with high-quality peaks above a certain concentration threshold, and the TMSQE scores differed significantly between concentrations below and above the LOD (Figure S5). Such a quality turning point can be selected as the LOD, as depicted in Figure 6D.
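One way to operationalize this turning point is sketched below. The acceptable-quality cutoff and the scan-from-high-concentration heuristic are illustrative assumptions, as the paper leaves the exact LOD procedure to future work:

```python
def estimate_lod(concentrations, median_scores, acceptable=0.5):
    """Illustrative LOD rule: the lowest concentration above which the
    per-concentration median TMSQE score stays at or above an
    acceptable-quality threshold.

    The `acceptable` cutoff is a placeholder, not a published value.
    Returns None if no concentration reaches acceptable quality.
    """
    points = sorted(zip(concentrations, median_scores))
    lod = None
    for conc, score in reversed(points):  # scan from high to low concentration
        if score >= acceptable:
            lod = conc  # quality still acceptable; extend the LOD downward
        else:
            break  # quality turning point reached; stop
    return lod
```

With the monotonic quality trend shown in Figure 6, this rule picks the concentration at which the score curve crosses the acceptable-quality band.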

Figure 6.

Correlations between TMSQE scoring and analyte concentrations in the response curve experiments. In panels A to C, the stacked bar charts illustrate the distribution of peaks categorized as poor, acceptable, and good quality across the 12 concentration points. TMSQE scores from the five replicates were summarized using medians for each concentration. Panel D displays the quality scores of three selected peptides across the 12 concentrations, highlighting the potential of TMSQE scoring for determining the LOD.

Additionally, as mentioned earlier, the TMSQE scores in the response curve experiment can facilitate quantifier selection (Figure S5). The scores of types I and III correspond to the assessment of peak shape quality and consistency, respectively, for each individual transition ion. Although further research is needed to develop appropriate procedures for LOD determination and quantifier selection, automating these two tasks becomes feasible by leveraging the TMSQE scoring.
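A minimal sketch of such a quantifier-selection step follows. The equal weighting of the type I and type III scores is an illustrative assumption rather than a rule from this work:

```python
def select_quantifier(transition_scores, w_shape=0.5, w_consistency=0.5):
    """Pick the transition whose weighted combination of the type I
    (peak shape) and type III (cross-sample consistency) scores is best.

    transition_scores: dict mapping a transition name to its
    (type I, type III) scores. The equal default weights are an
    illustrative assumption.
    """
    def combined(item):
        t1, t3 = item[1]
        return w_shape * t1 + w_consistency * t3

    # the highest-scoring transition becomes the quantifier
    return max(transition_scores.items(), key=combined)[0]
```

The remaining transitions would then serve as qualifiers, with the weights tunable if shape quality and consistency should not count equally.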

Applications of the TMSQE Scoring toward Fully Automated Analysis Workflow

Fine-grained peak picking and objective quality scoring are both required to automate targeted MS data analysis under a universally applicable, consistent quality standard. This is particularly critical for low-abundance molecules, where peak quality evaluation becomes difficult and ambiguous, and interferences or noise can adversely impact the specificity of the quantitation results.

To provide consistent quality evaluation for automation, we introduce the TMSQE scoring method for objectively assessing peak quality. To validate this approach, we first assessed the trained β-TCVAE models by visually examining their latent spaces (Figure 3). We then used three independent data collections, detailed in the preceding sections, to validate the derived scoring functions: (1) the in-house MRM data set with peaks manually evaluated using Skyline; (2) the four metabolite MRM data sets with peaks picked by automRm; and (3) the MRM response curve experiment. In addition, we demonstrated the reproducibility of our approach by training three additional models (Table S6 and Figure S6). Collectively, these outcomes demonstrate the reliability of TMSQE scoring and support its application for automation. Our Python package, which includes the presented models for the three quality types, facilitates direct application without model retraining. Employing identical models ensures consistent scores, enabling a standardized assessment of peak quality against the same criteria.

As the ambiguity in peak quality assessment can be reduced by using TMSQE scoring, we propose to apply this scoring as the objective quality labels to train future AI models. This will eliminate the need for manual labeling, enabling supervised learning to learn from larger data sets and enhance the generalizability of peak-picking models. We anticipate the future integration of TMSQE into automated workflows to achieve accurate quantification across a broad range of targeted MS data.

Conclusion

Objective and systematic evaluation of chromatographic peaks in targeted MS data sets can be achieved with our proposed TMSQE scoring. To our knowledge, this is the first attempt to establish data-driven scoring functions from a large collection of chromatograms. We validated TMSQE scoring and demonstrated a high level of consistency with expert manual assessments as well as with the quality decisions made by automatic software. These validations confirm the reliability and generalizability of TMSQE for peak quality control.

TMSQE scoring can be applied in the following scenarios: (1) identification of problematic peaks after running the automatic software, (2) LOD determination in response curve experiments, (3) quantifier selection, and (4) objective quality labeling for future AI training.

Additionally, we have implemented the tmasque Python package to streamline the TMSQE scoring procedure. The scoring guideline is also included to provide a reproducible standard for efficient and robust quality control. This will be advantageous for practitioners, especially those with limited experience, in targeted MS data analysis.

Acknowledgments

This study was supported by the grants of the National Science and Technology Council, Taiwan (111-2320-B-182-038 and 112-2320-B-182-032) and by Chang Gung Memorial Hospital, Taiwan, grant number CLRPD1J0015. This research was also supported by the “Molecular Medicine Research Center, Chang Gung University” from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project by the Ministry of Education in Taiwan.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.analchem.3c03686.

  • Details of the β-TCVAE algorithm; Python implementation of TMSQE scoring procedure; latent space visualization and corresponding score distributions when reducing quality features to three dimensions; weight settings in scoring functions; selected examples for the inconsistency between automRm and TMSQE; detailed correlation trends between TMSQE scores and analyte concentrations; discrimination of the peak quality below and above the LOD using TMSQE scoring; and reproducibility of the TMSQE scoring (PDF)

  • Targeted proteomics study list; 47 quality features; deep learning network architectures; peptide target lists of the response curve experiment and the OSCC data set; and variability in quality decisions among the repeated models (XLSX)

Author Contributions

The manuscript was written through contributions of all authors. C.Y.: data collection and curation, algorithm implementation, model training, and writing. Y.C.H.: MRM assay, data analysis, and writing. C.C.L.: writing review. J.S.Y.: writing review and supervision. All authors have given approval to the final version of the manuscript.

The authors declare no competing financial interest.

Supplementary Material

ac3c03686_si_001.pdf (2.2MB, pdf)
ac3c03686_si_002.xlsx (77.3KB, xlsx)

References

  1. Schiess R.; Wollscheid B.; Aebersold R. Targeted proteomic strategy for clinical biomarker discovery. Mol. Oncol. 2009, 3 (1), 33–44. 10.1016/j.molonc.2008.12.001.
  2. Arora A.; Somasundaram K. Targeted Proteomics Comes to the Benchside and the Bedside: Is It Ready for Us? Bioessays 2019, 41 (2), e1800042. 10.1002/bies.201800042.
  3. Beck K.; Camp N.; Bereman M.; Bollinger J.; Egertson J.; MacCoss M.; Wolf-Yadlin A. Development of Selected Reaction Monitoring Methods to Systematically Quantify Kinase Abundance and Phosphorylation Stoichiometry in Human Samples. Methods Mol. Biol. 2017, 1636, 353–369. 10.1007/978-1-4939-7154-1_23.
  4. Bereman M. S.; MacLean B.; Tomazela D. M.; Liebler D. C.; MacCoss M. J. The development of selected reaction monitoring methods for targeted proteomics via empirical refinement. Proteomics 2012, 12 (8), 1134–1141. 10.1002/pmic.201200042.
  5. Cohen Freue G. V.; Borchers C. H. Multiple reaction monitoring (MRM): principles and application to coronary artery disease. Circ.: Cardiovasc. Genet. 2012, 5 (3), 378. 10.1161/CIRCGENETICS.111.959528.
  6. Method of the Year 2012. Nat. Methods 2013, 10 (1), 1. 10.1038/nmeth.2329.
  7. Wu Z.; Serie D.; Xu G.; Zou J. PB-Net: Automatic peak integration by sequential deep learning for multiple reaction monitoring. J. Proteomics 2020, 223, 103820. 10.1016/j.jprot.2020.103820.
  8. Gers F. A.; Eck D.; Schmidhuber J. Applying LSTM to Time Series Predictable Through Time-Window Approaches. Springer London: London, 2002; pp 193–200.
  9. Vaswani A.; Shazeer N.; Parmar N.; Uszkoreit J.; Jones L.; Gomez A. N.; Kaiser L.; Polosukhin I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762. 10.48550/arXiv.1706.03762.
  10. Eilertz D.; Mitterer M.; Buescher J. M. automRm: An R Package for Fully Automatic LC-QQQ-MS Data Preprocessing Powered by Machine Learning. Anal. Chem. 2022, 94 (16), 6163–6171. 10.1021/acs.analchem.1c05224.
  11. Tsugawa H.; Kanazawa M.; Ogiwara A.; Arita M. MRMPROBS suite for metabolomics using large-scale MRM assays. Bioinformatics 2014, 30 (16), 2379–2380. 10.1093/bioinformatics/btu203.
  12. Teo G.; Chew W. S.; Burla B. J.; Herr D.; Tai E. S.; Wenk M. R.; Torta F.; Choi H. MRMkit: Automated Data Processing for Large-Scale Targeted Metabolomics Analysis. Anal. Chem. 2020, 92 (20), 13677–13682. 10.1021/acs.analchem.0c03060.
  13. MacLean B.; Tomazela D. M.; Shulman N.; Chambers M.; Finney G. L.; Frewen B.; Kern R.; Tabb D. L.; Liebler D. C.; MacCoss M. J. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics 2010, 26 (7), 966–968. 10.1093/bioinformatics/btq054.
  14. Toghi Eshghi S.; Auger P.; Mathews W. R. Quality assessment and interference detection in targeted mass spectrometry data using machine learning. Clin. Proteomics 2018, 15, 33. 10.1186/s12014-018-9209-x.
  15. Chen R. T. Q.; Li X.; Grosse R.; Duvenaud D. Isolating Sources of Disentanglement in Variational Autoencoders. arXiv 2018, arXiv:1802.04942. 10.48550/arXiv.1802.04942.
  16. Sharma V.; Eckels J.; Schilling B.; Ludwig C.; Jaffe J. D.; MacCoss M. J.; MacLean B. Panorama Public: A Public Repository for Quantitative Data Sets Processed in Skyline. Mol. Cell. Proteomics 2018, 17 (6), 1239–1244. 10.1074/mcp.RA117.000543.
  17. Hsiao Y. C.; Chu L. J.; Chen Y. T.; Chi L. M.; Chien K. Y.; Chiang W. F.; Chang Y. T.; Chen S. F.; Wang W. S.; Chuang Y. N.; et al. Variability Assessment of 90 Salivary Proteins in Intraday and Interday Samples from Healthy Donors by Multiple Reaction Monitoring-Mass Spectrometry. Proteomics: Clin. Appl. 2018, 12 (2), 1700039. 10.1002/prca.201700039.
  18. Buescher J. M. Example data for testing and training automRm. automRm GitLab Repository. https://gitlab.gwdg.de/joerg.buescher/demodata (accessed 2023-03-06).


Articles from Analytical Chemistry are provided here courtesy of American Chemical Society
