Abstract
Biomarker profiling using mass spectrometry plays an essential role in biological studies and is highly dependent on the data analysis used for sample classification. In this study, we introduced power normalization of the mass spectra as a method for systematically altering the weights of peaks at different intensity levels. In combination with the support vector machine (SVM) method, the impact on sample classification was characterized using data from four previously reported studies, covering the distinction of anomeric configurations of sugars, types of bacteria, stages of melanoma, and types of breast cancer. A comprehensive analysis of the data normalized at different power normalization index (PNI) values was developed, and analysis tools, including error-PNI plots, relevance profiles, and error source profiles, were used to assess the potential of the analytical methods as well as to find proper approaches for classifying the samples.
Sample classification based on mass spectrometry (MS) analysis has been widely practiced in biological studies, including proteomics,1–5 disease diagnosis,6–10 bacteria identification,11,12 and structure analysis of carbohydrates.13,14 The general approach is to extract characteristic features from the mass spectra to distinguish different types of samples. This can be done simply by observing a single peak corresponding to a unique compound, but in most cases a set of multiple peaks needs to be used in the data analysis. Those peaks represent a set of chemical or biological compounds present at different concentrations. The classification based on the peak profiles can be done using peak correlation or matching,3,4,15 which counts the identical peaks between the sample spectra and a library, or statistical methods, such as standard deviation,6 t test,8 and ANOVA,16 which are mostly used in proteomics and metabolomics2,17,18 to determine the identity of a sample.19,20 The distinction between samples often cannot be achieved based on the existence of the peaks in the spectra alone but also requires their absolute or relative concentrations; therefore, the unique profiles of a set of compounds are used for sample classification and biomarker identification. Pattern recognition or machine learning methods have been used for this purpose, such as the dot product,21 principal component analysis (PCA),12 support vector machine (SVM),22,23 decision tree,24 and neural networks.25 Analysis of peak profiles extracts information from the mass spectra in a way very different from the peak-by-peak analysis approaches,19,20 and can also comprehensively reveal different aspects of the samples.
Profiling-based sample classification and biomarker identification depend on the significance, or weights, set for the peaks included in the profile. Typically, higher weights are given to peaks of higher intensities, which thereby contribute more to the final decision19,26,27 (Supporting Information, section S1). In the analysis of complex samples, however, the major peaks of the potential biomarkers can be of low intensity because of suppression by chemical noise from other compounds in the samples. The negative impact of this suppression on data analysis is typically minimized by preselecting the mass range for the peaks of interest, based on previous knowledge. This, however, can introduce errors in the data analysis due to bias in the selection of the mass ranges.
The nature of the samples and the quality of the data can also vary significantly at different stages of a study. For example, at the early stage of a study, the major aim typically is to develop and optimize the analytical method for obtaining high-quality spectra. At this stage, the number of biological samples as well as the knowledge about the samples can be limited. It is also difficult to make an arbitrary selection of peaks or mass ranges for data analysis without a significant bias. Data analysis providing a comprehensive evaluation of the experimental method is particularly important for the rapid method development and an effective optimization, so it can become ready for analysis of biological samples of a large quantity. Machine learning methods, such as the support vector machine, have advantages for the early stage method development due to their excellent applicability with raw data and minimal requirement for pre-existing knowledge. A potential risk in use of machine learning methods is overfitting, which can be prevented by reserving testing data sets from training for an objective evaluation of the model.
Intensity scaling or intensity transformation has also been implemented for data analysis to overcome the suppression problem mentioned above, such as notating the peaks as 0 or 1 with a set threshold,28,29 or intensity rescaling based on ranking4 and ranking orders.30,31 In some studies, logarithm32 or square root normalizations33 of peak intensities have also been performed to emphasize the low-intensity peaks. Appropriate intensity transformation can improve the accuracy in sample classification; however, it should be carefully selected based on the nature of the samples.
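The transformations mentioned above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the exact forms used in the cited studies are not given in the text; in particular, `log(1 + x)` is an assumption chosen here so that empty bins stay at zero.

```python
import math

def binarize(intensities, threshold):
    """Mark each peak as 1 if its intensity reaches the threshold, else 0."""
    return [1 if x >= threshold else 0 for x in intensities]

def rank_transform(intensities):
    """Replace each intensity by its rank (1 = weakest peak); ties keep input order."""
    order = sorted(range(len(intensities)), key=lambda i: intensities[i])
    ranks = [0] * len(intensities)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def log_transform(intensities):
    """log(1 + x); this particular form is an assumption of the sketch."""
    return [math.log1p(x) for x in intensities]

def sqrt_transform(intensities):
    """Square root of each intensity, which compresses the dynamic range."""
    return [math.sqrt(x) for x in intensities]

spectrum = [100.0, 4.0, 0.0, 25.0]
print(binarize(spectrum, threshold=5.0))
print(rank_transform(spectrum))
print(sqrt_transform(spectrum))
```

Each of these is a fixed transformation; none of them offers a parameter that can be scanned continuously, which is the gap the power normalization described next is meant to fill.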
In this study, we explored a method involving the peak intensity transformation for a systematic evaluation of the mass spectrometry data for sample classification and biomarker identification. The outcome can also be used to assess and optimize the analytical approach at an early stage of a study. A distinct feature of this method was the implementation of the power normalization of the peak intensities prior to the sample classification, which was done using SVM in this study. Power normalization index (PNI, to be further described) was varied to systematically rescale the intensities of all the peaks and the subsequent impacts on the sample classification were analyzed. We also introduced the error-PNI count plot, which revealed the relationship between the power normalization and the errors in sample classification and, more importantly, served as a high level summary of the possibility in distinguishing the samples analyzed using a particular analytical procedure. Suggestion can also be made for further improvement of the analytical method. Data from four experimental studies6,8,12,34 previously reported were used for the development and validation of this method.
METHOD FOR DATA PROCESSING AND ANALYSIS
In order to process the mass spectra efficiently, each mass spectrum was converted into a vector with multiple dimensions, with each dimension corresponding to a particular mass-to-charge ratio (m/z) of a magnitude assigned as the peak intensity at the m/z value. The power normalization was applied for each spectrum first, at a PNI value, and the mass spectra for each sample category were then divided into the training and testing groups. The classification was done using a multiclass SVM method. The training group was used to generate the model with classification boundaries, while the testing group was used to evaluate the classification accuracy based on the model. The SVM method has been shown to be powerful in classifications with lower number of samples,35 which is particularly suitable for early stage studies. The errors for the testing results can then be used to construct the error-PNI count maps.
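As a rough illustration of the vector conversion described above, the following Python sketch maps a peak list onto a fixed-length vector. The unit-width m/z bins, the m/z limits, and the function name `spectrum_to_vector` are assumptions made for this example; the text does not state the bin width that was used.

```python
def spectrum_to_vector(peaks, mz_min, mz_max):
    """Map (m/z, intensity) pairs onto a vector with one bin per integer m/z.

    Unit-width bins are an assumption of this sketch, not a detail from the text.
    Intensities falling into the same bin are summed; peaks outside the range
    are dropped.
    """
    vector = [0.0] * (mz_max - mz_min + 1)
    for mz, intensity in peaks:
        bin_index = round(mz) - mz_min
        if 0 <= bin_index < len(vector):
            vector[bin_index] += intensity
    return vector

# Made-up peak list, for illustration only.
peaks = [(87.1, 40.0), (141.0, 100.0), (221.2, 18.0), (500.0, 3.0)]
vec = spectrum_to_vector(peaks, mz_min=50, mz_max=225)
print(len(vec), vec[87 - 50], vec[141 - 50], vec[221 - 50])
```

Once every spectrum shares the same dimensions, normalization, training, and testing can operate on plain numeric vectors.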
Multiclass SVM and Decision Scores.
Multiclass SVM analysis of the data was performed after the power normalization of the mass spectra, and the impact of the selected PNI on the sample classification was evaluated (see the Supporting Information for the multiclass SVM method). The decision scores, which are typically used to determine the identity of the samples,35 were calculated for all possible types. The ranks of the scores were then used to determine the similarity between the testing sample and each possible sample type.
The decision score was calculated using the sum of distances (D) between the testing sample and all the classification boundaries for different sample types. This works particularly well for evaluating the initial analysis in an early stage study, where the number of possible sample types is typically larger than the number of replicates of each sample type. For example, in the study for the bacteria analysis,12 there were 16 types of bacteria, but only five samples for each type were analyzed. The distance dij of the tth tested sample between the ith and jth group types was calculated as
$$d_{ij}^{t} = \frac{\omega_{ij}\,\phi(x_t) + b_{ij}}{\lvert \omega_{ij} \rvert} \tag{1}$$
where ω is the normal vector of the hyperplane (see the Supporting Information for details). The larger the distance, the farther the data point lies from the classification boundary, which means a higher probability of a correct assignment of the sample to the corresponding group. The numerator in eq 1 is called the decision value. Sometimes, ranking might be achieved by simply using ωijϕ(xt).36 However, this shortcut is valid only when the number of sample types is limited and the properties of the training sets are similar enough that b and |ω| can be ignored.
The sum of the distance Dtm of the tth tested sample for mth sample type was calculated as
$$D_{tm} = \sum_{j=1,\, j \neq m}^{n} d_{mj}^{t} \tag{2}$$
where n is the total number of the sample types in the classification. The calculated decision values D can be used to support a variety of data analysis, such as the similarity ranking of all the possible sample types for the testing sample, as well as the ranking of the characteristic peaks that contribute the most to the classification (see the Supporting Information).
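The distance and summed decision score can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation used in the study: a linear kernel is assumed (so ϕ(x) = x), each pairwise boundary is stored once with a positive distance favoring the first type of the pair (so d_ji = −d_ij), and all names and numbers are hypothetical.

```python
import math

def signed_distance(w, b, x):
    """(w . x + b) / |w| for one boundary; linear kernel assumed, so phi(x) = x."""
    decision_value = sum(wi * xi for wi, xi in zip(w, x)) + b
    return decision_value / math.sqrt(sum(wi * wi for wi in w))

def decision_scores(boundaries, x, n_types):
    """Sum, for each type m, the distances of x from every boundary involving m.

    boundaries maps (i, j) with i < j to (w, b). A positive distance favors
    type i, so the same boundary contributes with opposite sign to type j
    (an assumed sign convention).
    """
    scores = [0.0] * n_types
    for (i, j), (w, b) in boundaries.items():
        d = signed_distance(w, b, x)
        scores[i] += d
        scores[j] -= d
    return scores

def rank_types(scores):
    """Type indices ordered from most to least similar to the tested sample."""
    return sorted(range(len(scores)), key=lambda m: scores[m], reverse=True)

# Toy boundaries for three types in a two-dimensional feature space.
boundaries = {
    (0, 1): ([1.0, 0.0], 0.0),
    (0, 2): ([0.0, 1.0], 0.0),
    (1, 2): ([-3.0, 4.0], 0.0),
}
scores = decision_scores(boundaries, [2.0, 1.0], n_types=3)
print(scores, rank_types(scores))
```

Dividing by |ω| puts all boundaries on a common geometric scale, which matters when the scores from many pairwise classifiers are summed and compared.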
Power Normalization.
The power normalization was used to adjust the weights of the peaks at different intensity levels for the classification. It was performed by scaling the intensity of all the peaks in the mass spectra with a power normalization index (PNI) (eq 3),
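$$\mathrm{Peak}_i = \frac{\mathrm{peak}_i^{\,\mathrm{PNI}}}{\sqrt{\sum_{j} \left(\mathrm{peak}_j^{\,\mathrm{PNI}}\right)^{2}}} \tag{3}$$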
where peaki is the original peak intensity of the ith peak in the spectrum and Peaki is the scaled intensity after the power normalization. The denominator, the square root of the sum of squares of all the peak intensities, was used to achieve an energy balance in every spectrum. As shown in Figure 1, the normalization at a given PNI changed the relative difference in weight between peaks of high and low intensities, which affected their contributions to the classification. Normalization with a PNI lower than 1 reduced the differences in intensity, thereby granting higher weights to the peaks of lower intensities. For example, the fragment patterns in the MS/MS spectra recorded for two synthesized monosaccharides,13,37 ido-α-GA and glc-β-GA, were very similar (Figure 1c,d). After rescaling the spectra with a power normalization at a PNI of 0.3, the peaks previously hidden became more prominent (Figure 1a,b). A further analysis using SVM identified that some of the peaks originally of low intensity made a critical contribution to distinguishing ido-α-GA and glc-β-GA (to be further discussed).
Figure 1.

MS/MS spectra normalized at PNI 0.3 for (a) ido-α-GA and (b) glc-β-GA and the original spectra (c and d), respectively. (e) The weighting factor of different intensities as a function of PNI.
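A minimal Python sketch of the power normalization is given below. It assumes that the square-root-of-sum-of-squares denominator is taken over the power-transformed intensities, so that every normalized spectrum has unit length; this reading of the "energy balance" is an assumption of the sketch, as is the requirement that intensities are non-negative.

```python
import math

def power_normalize(intensities, pni):
    """Raise each intensity to the power PNI, then scale to unit L2 norm.

    Assumes non-negative intensities and that the L2 norm is computed on the
    power-transformed values (an interpretation, not confirmed by the text).
    """
    powered = [x ** pni for x in intensities]
    norm = math.sqrt(sum(p * p for p in powered))
    if norm == 0.0:
        return powered  # empty spectrum: nothing to scale
    return [p / norm for p in powered]

# Made-up spectrum with one dominant and two weak peaks.
spectrum = [100.0, 1.0, 4.0]
for pni in (1.0, 0.5, 0.3):
    scaled = power_normalize(spectrum, pni)
    print(pni, [round(v, 3) for v in scaled], "base/weakest =", round(scaled[0] / scaled[1], 2))
```

In this toy spectrum the ratio between the strongest and the weakest peak falls from 100 at PNI = 1 to 10 at PNI = 0.5, which is the reweighting effect illustrated in Figure 1.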
The data analysis was performed using programs written in Matlab (version R2012b, MathWorks, Natick, MA). The modules for training and testing using SVM were downloaded online.38 The functions for picking the characteristic peaks, ranking the similarity, and plotting the figures were programmed based on the intermediate variable in the SVM model.
RESULTS AND DISCUSSION
Data sets from four previously reported studies were used to test and validate the method described above. They included the spectra recorded for synthesized carbohydrates of 16 stereo configurations,13 lipid profiles of 16 types of bacteria,12 and mass spectra of peptides in blood samples from studies of melanoma6 and breast cancer.39 These data sets were all collected at an early stage of the respective studies, when new analytical methods were being developed, the numbers of samples for each type were relatively low, and the experimental conditions might vary significantly.
Error-PNI Count Plot and Biomarker Identification.
When the spectra of a sample in a testing group were power-normalized and classified using SVM, the sample was assigned to a category based on the ranking of the decision scores. The number of wrong assignments could then be counted for each PNI value and used to plot an error-PNI curve as part of the error-PNI count map. This method was first applied for classifying 16 D-aldohexose-glycolaldehydes (GA), synthesized in 8 sugar types with two anomeric configurations for each type, viz., α-D-all, β-D-all, α-D-alt, β-D-alt, α-D-gal, β-D-gal, α-D-glc, β-D-glc, α-D-gul, β-D-gul, α-D-ido, β-D-ido, α-D-man, β-D-man, α-D-tal, β-D-tal. They were synthesized as the standards for producing glycosyl-GA anions at m/z 221, which can be used as the diagnostic ions for probing the anomeric configurations of the oligosaccharides.
In the previous study,13 the glycosyl-GA anions at m/z 221 were produced through CID from disaccharide ions and then further fragmented under controlled CID conditions (see the Supporting Information).13,37 The MS3 spectra were used for classification by the spectral similarity score method.17 In that study, as well as in previous ones using the diagnostic ions for anomeric configuration identification,13,37 the MS/MS data sets were acquired with CID conditions controlled by keeping the intensity of the surviving precursor ions after CID at 18 ± 5% of the base product ion (100%). It is believed that this was important for achieving reproducible sample classification. In this study, 136 MS/MS spectra (Supporting Information, Table S1) collected under this condition during the previous study were first used for testing the data analysis method with power normalization. For each test, one spectrum of a particular sample type was randomly picked as the testing sample and the remaining 135 spectra were used for training the SVM model. Both the testing and training spectra were power-normalized at the same PNI value, and the errors in assignments were counted and used for plotting the error-PNI curves, as shown in Figure 2.
Figure 2.

Error-PNI plot of 16 types of synthesized monosaccharides-GA. Sugar all-α is not plotted because no classification error was found with the PNI range used.
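The counting procedure behind an error-PNI curve can be sketched as below. Two simplifications are made and should be read as assumptions: every spectrum is held out once in turn (rather than randomly picked), and a nearest-centroid rule stands in for the multiclass SVM so that the example stays self-contained. All names and data are hypothetical.

```python
import math
from collections import defaultdict

def power_normalize(v, pni):
    """Power transform followed by unit-L2 scaling (assumed form of the normalization)."""
    powered = [x ** pni for x in v]
    norm = math.sqrt(sum(p * p for p in powered)) or 1.0
    return [p / norm for p in powered]

def nearest_centroid(train, test_vec):
    """Stand-in classifier: label of the closest class mean (NOT an SVM)."""
    sums, counts = {}, defaultdict(int)
    for label, vec in train:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + x for s, x in zip(sums[label], vec)]
        counts[label] += 1

    def squared_distance(label):
        centroid = [s / counts[label] for s in sums[label]]
        return sum((c - x) ** 2 for c, x in zip(centroid, test_vec))

    return min(sums, key=squared_distance)

def error_pni_counts(spectra, pni_values, classify=nearest_centroid):
    """errors[pni][label] = number of held-out spectra of that label assigned wrongly."""
    errors = {}
    for pni in pni_values:
        normalized = [(label, power_normalize(vec, pni)) for label, vec in spectra]
        per_label = defaultdict(int)
        for k, (label, vec) in enumerate(normalized):
            train = normalized[:k] + normalized[k + 1:]  # leave spectrum k out
            if classify(train, vec) != label:
                per_label[label] += 1
        errors[pni] = dict(per_label)
    return errors

# Made-up, well-separated toy spectra for two sample types.
toy = [("A", [100.0, 1.0, 0.0]), ("A", [90.0, 2.0, 0.0]),
       ("B", [0.0, 1.0, 100.0]), ("B", [0.0, 2.0, 95.0])]
print(error_pni_counts(toy, [0.3, 0.5, 1.0]))
```

Plotting the per-type counts against PNI gives one curve per sample type; any classifier with the same `(train, test_vec)` interface, including an SVM wrapper, could be passed as `classify`.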
It is obvious that the selection of the PNI value has a significant impact on the accuracy in classification. For 14 out of 16 sugars (except for tal-β and ido-α), the assignment is improved as the PNI increases and can be achieved with 100% accuracy when PNI is larger than 0.5. However, for tal-β and ido-α a 100% accuracy could only be obtained when the PNI is lower than 0.5 and 0.34, respectively. This indicates that the dominant fragment peaks of the diagnostic ions from these two sugars are not related to the structural differences while the peaks of minor intensities are and therefore can be used as biomarkers for distinguishing them from other stereoisomers.
The error-PNI plot as shown in Figure 2 represents a summary of a systematic evaluation of the effectiveness of the experimental approach applied for classifying a chemical or biological system. The valley in each curve indicates the best normalization point for classifying each individual component in the chemical system; the overlap of the valleys, if existing, indicates the best normalization point for the global classification of the chemical system. For instance, using the method involving the CID of the diagnostic ions for classifying the sugars mentioned above, a normalization at PNI between 0.3 and 0.34 (① in Figure 2) should be selected to distinguish glc-β and tal-β, but 0.43 to 0.58 should be selected for gul-β and ido-α (② in Figure 2). On the basis of the error-PNI plot for the chemical system with the 16 sugars, it can be predicted that a complete classification cannot be done with a single step using the current analytical method, since there is not an overlap of all the valleys of the error-PNI curves. The best overall result for a single step classification would be obtained with a PNI between 0.46 and 0.56, where 15 of 16 isomers can be classified correctly. However, on the basis of the error-PNI plot, a multistep classification can be suggested to further improve the classification, which will be discussed later.
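Locating the valleys and their overlap can be automated once the error counts are tabulated. The Python sketch below uses invented error curves for three hypothetical sample types (not values read from Figure 2) and treats a "valley" as the set of PNI values with zero errors, which is an assumption of the sketch.

```python
def zero_error_pnis(curve, pni_values):
    """PNI values at which one sample type is classified without error."""
    return {p for p, e in zip(pni_values, curve) if e == 0}

def common_valley(curves, pni_values):
    """PNI values at which every sample type has zero errors (may be empty)."""
    windows = [zero_error_pnis(c, pni_values) for c in curves.values()]
    return sorted(set.intersection(*windows)) if windows else []

def best_single_pni(curves, pni_values):
    """First PNI at which the largest number of types is classified without error."""
    def n_clean(idx):
        return sum(1 for c in curves.values() if c[idx] == 0)
    best = max(range(len(pni_values)), key=n_clean)
    return pni_values[best], n_clean(best)

# Invented error counts per PNI, for illustration only.
pni_values = [0.3, 0.4, 0.5, 0.6]
curves = {"A": [2, 1, 0, 0], "B": [0, 0, 1, 2], "C": [1, 0, 0, 0]}
print(common_valley(curves, pni_values))
print(best_single_pni(curves, pni_values))
```

An empty common valley, as in this toy case, corresponds to the situation described above in which no single PNI classifies every type and a multistep procedure has to be considered.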
In the previous studies,13,37 a number of statistical methods were tested and the similarity score method21,40 was shown to work much better for classifying the sugars based on the MS/MS spectra of the diagnostic ion m/z 221. The reason can now be explained with the error-PNI plot obtained in this study. Derived from dot product calculation, the similarity score method calculates the ratio between the geometric mean and the arithmetic mean of corresponding intensities of two spectra (eq 4),21,40
| (4) |
where Is,i is the peak intensity at the ith m/z value in the sample spectrum, Ir,i is the corresponding peak intensity in the reference spectrum (library or training), and k is the normalization term that is related to the sum of all the peak intensities in the spectra. Note that there is a scale factor of 0.5 in the numerator, which is equivalent to a power normalization with a PNI of 0.5. This is in agreement with the analysis using the error-PNI plot in this study.
The effect of using an appropriate PNI to classify the data can also be illustrated with the impact on the PCA results. Using classification of all-α, alt-β, gal-α and man-β as an example, Figure 3a and b show the PCA results based on the original data and data normalized at PNI of 0.5, respectively. Without the normalization, all-α, alt-β, gal-α, and man-β could not be clearly distinguished (Figure 3a); however, after the normalization at PNI = 0.5, the data points for each sample can be much better grouped (Figure 3b). This is because an emphasis on the spectral peaks of lower intensities increased the difference between the spectra from different types of samples. This also indicates that the peaks of the highest intensities in the mass spectra for these samples might not be used as the signature peaks for distinguishing these samples.
Figure 3.

PCA of four types of commonly misclassified sugars with (a) original data and (b) data normalized at PNI 0.5. Similarity ranking of testing sample ido-α based on (c) original data and (d) data normalized at PNI 0.5, insets showing the boundary figures of three top-ranked types. Loading plots based on SVM for classifying the two highly similar sample groups, ido-α and glc-β, (e) with the original data and (f) data normalized at PNI of 0.5.
The similarity ranking method was used for assigning the sample group based on the data analysis. The impact on the similarity ranking and the improvement of the classification by the normalization could be illustrated using the distinction between the ido-α and glc-β samples as an example. After applying SVM classification with the original spectra and the normalized spectra at PNI 0.5, the similarity rankings for ido-α were plotted as shown in Figure 3c,d, respectively, for comparison. The ranking was based on the sum of distance Dtm (eq 2) between the testing samples and the classification boundaries (Figure 3d, inset). Without the power normalization, the ido-α sample was mis-assigned as glc-β, as shown in Figure 3c; however, it was corrected after the power normalization was applied with a proper PNI value (Figure 3d).
The normal vector ω (in eq 1) of the corresponding PNI can be used to select the characteristic peaks for each sample type. The peaks with the highest ω value contribute the most in terms of distinguishing the sample from others. Using the distinction between the ido-α and glc-β samples as an example, the top 10 candidate peaks ranked for the decision-making are m/z 87, 141, 161, 117, 159, 163, 113, 71, 86, 85, and 185 (Figure 3f). Comparing the loading plots with and without power normalization (Figure 3e,f), the data analysis with power normalization identified some peaks, such as m/z 71 and 141, which were not previously selected13 as the signature peaks but actually can contribute to the distinction between the ido-α and glc-β samples.
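Selecting characteristic peaks from the normal vector can be sketched as follows. Ranking by the absolute value of each component of ω is an assumption made here (the text refers only to the "highest ω value"), and the weights in the example are invented rather than taken from Figure 3.

```python
def top_peaks(weights, mz_values, n=10):
    """Return the n m/z values whose components of the normal vector are largest in magnitude."""
    ranked = sorted(zip(mz_values, weights), key=lambda pair: abs(pair[1]), reverse=True)
    return [mz for mz, _ in ranked[:n]]

# Invented weights for four m/z bins, for illustration only.
mz_values = [71, 85, 87, 141]
weights = [0.2, -0.05, 0.9, -0.6]
print(top_peaks(weights, mz_values, n=2))
```

Because the normal vector depends on the PNI used for training, the ranked list should be recomputed for each PNI of interest.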
Another capability enabled by the analysis with the error-PNI plots is the evaluation of critical experimental conditions. In previous studies,13,37 it was claimed that the CID conditions, viz., the precursor ion intensity kept at 18 ± 5% relative to the base product ion after CID,37 were critical for using the diagnostic ions at m/z 221 to identify the correct anomeric configurations. To test whether this condition still needs to be retained when the data analysis includes spectral power normalization, we collected an additional 132 MS/MS spectra (Supporting Information, Table S1) without carefully tuning the CID conditions as described above. The intensity ratio between the precursor ion and the base product ion varied over a range of 24–100%. The SVM model was still trained using the 136 spectra collected under the controlled CID conditions. The classification of the 132 samples yielded 100% accuracy.
On the basis of the study with the classification of the sugar samples, it is clear that the power normalization can have a significant impact on the data analysis using the mass spectra. The error-PNI plot can serve as a unique and effective tool for evaluating the data as well as for assisting the development of the experimental methods. A multistep classification procedure can also be designed based on the information provided by the error-PNI plot. For instance, in order to achieve a complete classification, a PNI of 0.5 (② in Figure 2) can be selected at the first step to classify the 15 GAs other than tal-β; at the second step, a PNI of 0.32 (① in Figure 2) can be selected to distinguish tal-β from glc-β.
After the validation with the classification of the sugars, the developed data analysis method was applied to classifications using data from three other studies. In a previous study, low temperature plasma was used to perform a direct analysis of bacteria,12 including Bacillus subtilis, Staphylococcus aureus ATCC 25923, E. coli K12, and 13 Salmonella enterica bacteria (see Table S2), all in Luria–Bertani (LB) agar. As is typical for an early stage study, the sample quantity was limited, viz., five spectra for each of the 16 bacteria types, and the spectra were subject to strong matrix effects. For testing the classification, one sample of each bacteria type was randomly selected as the testing sample and the rest were used for training. As shown in Figure 4a,b, while peaks in the m/z range above 200 are attributed to fatty acid ethyl esters from the bacterial membrane, there are abundant peaks in the lower m/z range due to other chemicals in the sample matrices. By applying the classification method with power normalization, the effect of preselecting a mass range can be systematically analyzed and a clear strategy for classification can be derived.
Figure 4.

Mass spectra of (a) SARA50 and (b) SARA51. The blue square is the location of mass range selection in the previous study. Error-PNI plots (c) without and (d) with mass range selection.
The error-PNI plot for classification without preselecting the m/z range is shown in Figure 4c, with the error-PNI curves highlighted for four types of Salmonella enterica bacteria, Paratyphi B DMS106/76 (SARA50), Typhimurium LT2 (SARA2), Paratyphi B DMS3205/83 (SARA47), and Paratyphi B DMS53/76 (SARA51). These bacteria are highly similar in terms of the lipid profiles in the mass spectra12 and could not be well distinguished (Figure 4a,b and Figure S4a). These samples are clearly difficult to classify, since the valleys in the error-PNI curves for these bacteria do not overlap (Figure 4c). With the m/z range 250–300 preselected, the possibility of correct classification is significantly improved (Figure 4d). At a PNI of 0.7 (① in Figure 4d), 15 of 16 bacteria can be correctly classified, the exception being SARA47, which can be correctly classified without power normalization (PNI = 1, at ② in Figure 4c). On the basis of these results, a two-step procedure should be used for the classification of all the bacteria using SVM, with the first step based on the original data followed by a second step applying power normalization at a PNI of 0.7.
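One possible reading of the two-step procedure is sketched below in Python. The rule that the first-step result is accepted only when it names SARA47 is an interpretation made for this example, and `classify_at` is a hypothetical callable standing in for SVM models trained at each PNI.

```python
def two_step_classify(spectrum, classify_at, first_pni=1.0,
                      first_step_types=("SARA47",), second_pni=0.7):
    """Classify on the original data first; fall through to the normalized data otherwise.

    classify_at(spectrum, pni) is a hypothetical function returning the type
    assigned by a model trained on spectra normalized at that PNI.
    """
    first = classify_at(spectrum, first_pni)
    if first in first_step_types:
        return first  # accepted at step 1 (assumed acceptance rule)
    return classify_at(spectrum, second_pni)

# Stub classifier with invented assignments, for illustration only.
assignments = {("s1", 1.0): "SARA47", ("s2", 1.0): "SARA50", ("s2", 0.7): "SARA51"}

def stub_classifier(spectrum, pni):
    return assignments[(spectrum, pni)]

print(two_step_classify("s1", stub_classifier))
print(two_step_classify("s2", stub_classifier))
```

The same pattern extends to more steps by chaining additional PNI values, each paired with the sample types it is trusted to identify.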
A similar error analysis was performed for the classification of serum samples from different stages of melanoma (Supporting Information, section S6). In the previous study,6 serum samples from the B16 mouse model were used to detect pulmonary metastatic melanoma. Those samples were collected at four different time points: before the injection of the melanoma cells (day 0) and on days 7, 14, and 21 after the injection. On-chip fractionation was performed for the peptides collected from serum, followed by a positive mode MS analysis of the peptides using a matrix-assisted laser desorption ionization-time-of-flight (MALDI-TOF) instrument. A total of 30 samples were collected for each stage. For the classification using the power normalization in this study, one sample was randomly chosen as the testing point and the remaining 29 samples were used for training in SVM. The PNI was selected in a range of 0.01 to 2, at a step size of 0.05.
The error-PNI plot for the data analysis is shown in Figure 5a. The samples collected on day 7, day 14, and day 21 can be distinguished from each other at a relatively high confidence when using SVM classification with normalization at a PNI between 0.46 and 0.51; however, the error for classifying the day 0 samples can be as high as nearly 30%. The assignments of the day 0 samples are summarized in Figure 5b, with about 15% misclassified as “day 7” and 15% as “day 14”. The “day 0” and “day 21” samples, however, can be distinguished from each other very well at PNI 0.5. The misclassification of “day 0” samples can be systematically analyzed over the entire PNI range, and the result is shown in Figure 5c. At a PNI of 0.1, very few of the “day 0” samples are misclassified as “day 7” or “day 21” and none as “day 14”. At a PNI larger than 0.3, “day 0” samples can be completely distinguished from “day 21” samples but not from “day 7” or “day 14”. The plot in Figure 5c is termed the “relevance profile” for the “day 0” samples, which reveals the relationship between one particular sample type, the “day 0” sample in this case, and each of the other sample types. This analysis is extremely useful for understanding the failures in the sample classification and for making improvements by selecting the proper spectral normalization. It is noteworthy that the relevance profile of a sample type is highly characteristic and can potentially be used for identifying the samples of this type.
Figure 5.

(a) Error-PNI plot of classification of melanoma samples. (b) Classification result of the “day 0” samples. (c) Relevance profile for the “day 0” samples, showing the misclassification of “day 0” samples to other sample types, as a function of PNI. (d) Breast cancer stage study: error source profile for the Control sample type, showing the misclassification of samples of other types into the “Control” type, as a function of PNI.
In another study, peptides circulating in blood, which were cleaved by carboxypeptidase N in the tumor microenvironment, were collected and analyzed in order to identify the developmental stages of breast cancer.8 Circulating peptides in 58 human plasma samples have been profiled using MALDI-TOF MS, including 10 samples of healthy controls (Control), 11 samples with stage I (BC-I), 12 samples with stage II (BC-II), 15 samples with stage III (BC-III), and 10 samples with stage IV (BC-IV) breast cancer. Each sample had two replicates, which is not sufficient for a sophisticated method development but is also typical for an early stage study. For testing the data analysis using SVM with power normalization, one sample of each type was randomly chosen for testing and the rest were used for training in SVM. The PNI was selected from 0.01 to 2 with a step size of 0.05. The samples were extracted from serum and high matrix effects were observed (Figure S7). The error-PNI plot (Figure S8) indicates a high possibility of error if the original data (PNI = 1) were used directly for classification.
Following the strategies discussed above, a thorough analysis at a system level can be done to understand the situation, which was quite complicated in this case. In addition to the relevance profile used above, an error source profile can also be derived to summarize the misclassification of other sample types into one particular type, as shown in Figure 5d for the Control sample type. According to the error source profile, no BC-I sample is misclassified as Control at any PNI value. At a PNI around 0.4, all the samples classified as Control are true control samples, but at PNI 0.8 nearly 10% of the samples classified as Control are actually BC-II and 2% are BC-IV. As the PNI changes, the classification results change accordingly.
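Both profiles can be derived from the same record of true and assigned types at each PNI. The Python sketch below shows one way to do so; the data layout (`results[pni]` as a list of `(true_type, assigned_type)` pairs), the function names, and the example records are all hypothetical.

```python
from collections import Counter

def relevance_profile(results, target):
    """For each PNI: share of true `target` samples assigned to each other type."""
    profile = {}
    for pni, pairs in results.items():
        assigned = [pred for true, pred in pairs if true == target]
        counts = Counter(p for p in assigned if p != target)
        profile[pni] = {t: c / len(assigned) for t, c in counts.items()} if assigned else {}
    return profile

def error_source_profile(results, target):
    """For each PNI: share of samples assigned to `target` that truly belong to each other type."""
    profile = {}
    for pni, pairs in results.items():
        sources = [true for true, pred in pairs if pred == target]
        counts = Counter(t for t in sources if t != target)
        profile[pni] = {t: c / len(sources) for t, c in counts.items()} if sources else {}
    return profile

# Invented (true, assigned) records at a single PNI, for illustration only.
results = {0.5: [("A", "A"), ("A", "B"), ("A", "A"), ("A", "C"), ("B", "B"), ("C", "A")]}
print(relevance_profile(results, "A"))
print(error_source_profile(results, "A"))
```

The relevance profile reads the confusion record along the true type, whereas the error source profile reads it along the assigned type, so the two answer complementary questions about the same classification results.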
A probability estimate for an unknown sample can be obtained by combining its classification results at different PNIs, based on the relevance and error source profiles. As a simple example, if one sample is classified as BC-IV at PNI 0.4 but as Control at PNI 0.8, its true identity can be estimated with a probability. The number of samples of each type assigned to BC-IV at PNI 0.4 and to Control at PNI 0.8 can be extracted from the error source profiles of the BC-IV and Control types (Figure S9), as listed in Table 1. The probability that the said sample is actually a Control type sample mis-assigned as BC-IV can be calculated as pctrl = (32/52)(16/52), where 52 is the total number of samples. The probability of its being a BC-II sample is pBC-II = (2/52)(25/52). Most likely, this sample would not be BC-I, BC-III, or BC-IV, based on the information listed in the table. When a sample is determined to be a Control type based on the classification using the data analysis reported here, the probability of its being a true Control type is calculated as Pctrl = pctrl/(pctrl + pBC-II) = 92%. There is an 8% probability of its being a BC-II type sample. SVM with power normalization at multiple PNI values enables a comprehensive analysis of the data that can assist the process of finding the ultimate solution in the classification. The information on the mis-assignments can be used for the sample classification as well.
Table 1.
Numbers of Samples of Each Type Assigned as BC-IV at PNI 0.4 and as Control at PNI 0.8 (Total 52 Samples)
| sample type | classified as BC-IV at PNI 0.4 | classified as Ctrl at PNI 0.8 |
|---|---|---|
| Ctrl | 32 | 16 |
| BC-I | 25 | 0 |
| BC-II | 24 | 2 |
| BC-III | 29 | 0 |
| BC-IV | 46 | 0 |
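The probability estimate can be reproduced from the counts in Table 1 with a short Python sketch. It follows the product form used in the text, which treats the assignments at the two PNI values as independent; the variable and function names are hypothetical.

```python
TOTAL = 52  # total number of samples stated for Table 1

# Counts copied from Table 1.
assigned_bc4_at_04 = {"Ctrl": 32, "BC-I": 25, "BC-II": 24, "BC-III": 29, "BC-IV": 46}
assigned_ctrl_at_08 = {"Ctrl": 16, "BC-I": 0, "BC-II": 2, "BC-III": 0, "BC-IV": 0}

def joint_weights(first, second, total):
    """Product of the two assignment frequencies for each candidate true type."""
    return {t: (first[t] / total) * (second[t] / total) for t in first}

def posterior(weights):
    """Normalize the weights so they sum to 1 over all candidate types."""
    norm = sum(weights.values())
    return {t: w / norm for t, w in weights.items()}

weights = joint_weights(assigned_bc4_at_04, assigned_ctrl_at_08, TOTAL)
probabilities = posterior(weights)
for sample_type, p in probabilities.items():
    print(f"{sample_type}: {p:.3f}")
```

With the counts as listed in Table 1, this sketch gives about 0.91 for Control and 0.09 for BC-II, close to the 92% and 8% quoted in the text, and zero for the remaining types.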
CONCLUSION
In this study, the power normalization of the mass spectra was introduced for data analysis, and SVM was applied to the normalized data for sample classification. A set of visualization methods, including the error-PNI plot, relevance profile, and error source profile, was introduced to facilitate the analysis of the data, which ultimately reflects the analytical approach adopted for the biological study. The selection of the proper normalization factor was shown to be critical for the sample classification. Applications in data analysis for studies involving spectra of sugar standards, bacteria, melanoma, and breast cancer samples have been demonstrated, and the multidimensional data analysis enabled by the power normalization at various PNI values could be used to improve the sample classification significantly. Using the data analysis for the melanoma and breast cancer studies, we also demonstrated that the “errors” in the classification can actually be used for improving the sample identification, with the multidimensional classification enabled by the power normalization at different PNIs.
ACKNOWLEDGMENTS
The authors thank Dr. Chiharu Konda and Prof. R. Graham Cooks for providing the mass spectrometry data for analysis of sugars and bacteria, respectively. This work was supported by the National Institute of General Medical Sciences (Grant 1R01GM106016) from the National Institutes of Health.
Footnotes
ASSOCIATED CONTENT
Supporting Information
The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.5b04418.
Details in original experimental data and methods and results of data analysis (PDF)
The authors declare no competing financial interest.
NOTE ADDED AFTER ASAP PUBLICATION
This paper was originally published ASAP on February 24, 2016. Due to a production error, equation 3 was incorrect. A square root was added to the denominator in the equation, and the paper was reposted on March 3, 2016.
REFERENCES
- (1) Pham TV; Piersma SR; Oudgenoeg G; Jimenez CR Expert Rev. Mol. Diagn 2012, 12, 343–359.
- (2) Aebersold R; Mann M Nature 2003, 422, 198–207.
- (3) Yang B; Wu Y-J; Zhu M; Fan S-B; Lin J; Zhang K; Li S; Chi H; Li Y-X; Chen H-F; Luo S-K; Ding Y-H; Wang L-H; Hao Z; Xiu L-Y; Chen S; Ye K; He S-M; Dong M-Q Nat. Methods 2012, 9, 904–906.
- (4) Eng JK; McCormack AL; Yates JR J. Am. Soc. Mass Spectrom 1994, 5, 976–989.
- (5) Jia C; Yu Q; Wang J; Li L Proteomics 2014, 14, 1185–1194.
- (6) Fan J; Huang Y; Finoulst I; Wu H.-j.; Deng Z; Xu R; Xia X; Ferrari M; Shen H; Hu Y Cancer Lett 2013, 334, 202–210.
- (7) Gholami B; Norton I; Eberlin LS; Agar NYR IEEE J. Biomed. Health Inform 2013, 17, 734–744.
- (8) Li Y; Li Y; Chen T; Kuklina AS; Bernard P; Esteva FJ; Shen H; Ferrari M; Hu Y Clin. Chem 2014, 60, 233–242.
- (9) Liao H; Wu J; Kuhn E; Chin W; Chang B; Jones MD; O’Neil S; Clauser KR; Karl J; Hasler F; Roubenoff R; Zolg W; Guild BC Arthritis Rheum 2004, 50, 3792–3803.
- (10) Zou W; She J; Tolstikov VV Metabolites 2013, 3, 787–819.
- (11) Sauer S; Kliem M Nat. Rev. Microbiol 2010, 8, 74–82.
- (12) Zhang JI; Costa AB; Tao WA; Cooks RG Analyst 2011, 136, 3091–3097.
- (13) Konda C; Bendiak B; Xia Y J. Am. Soc. Mass Spectrom 2012, 23, 347–358.
- (14) Both P; Green AP; Gray CJ; Sardzik R; Voglmeir J; Fontana C; Austeri M; Rejzek M; Richardson D; Field RA; Widmalm G; Flitsch SL; Eyers CE Nat. Chem 2013, 6, 65–74.
- (15) McDonnell LA; Heeren RMA Mass Spectrom. Rev 2007, 26, 606–643.
- (16) Pereira J; Porto-Figueira P; Cavaco C; Taunk K; Rapole S; Dhakne R; Nagarajaram H; Camara JS Metabolites 2015, 5, 3–55.
- (17) Katajamaa M; Oresic M J. Chromatogr. A 2007, 1158, 318–328.
- (18) Gay S; Binz PA; Hochstrasser DF; Appel RD Proteomics 2002, 2, 1374–1391.
- (19) Elias JE; Gibbons FD; King OD; Roth FP; Gygi SP Nat. Biotechnol 2004, 22, 214–219.
- (20) Yang D; Ramkissoon K; Hamlett E; Giddings MC J. Proteome Res 2008, 7, 62–69.
- (21) Wan KX; Vidavsky I; Gross ML J. Am. Soc. Mass Spectrom 2002, 13, 85–88.
- (22) Wu BL; Abbott T; Fishman D; McMurray W; Mor G; Stone K; Ward D; Williams K; Zhao HY Bioinformatics 2003, 19, 1636–1643.
- (23) Kall L; Canterbury JD; Weston J; Noble WS; MacCoss MJ Nat. Methods 2007, 4, 923–925.
- (24) Geurts P; Fillet M; de Seny D; Meuwis MA; Malaise M; Merville MP; Wehenkel L Bioinformatics 2005, 21, 3138–3145.
- (25) Ball G; Mian S; Holding F; Allibone RO; Lowe J; Ali S; Li G; McCardle S; Ellis IO; Creaser C; Rees RC Bioinformatics 2002, 18, 395–404.
- (26) Perkins DN; Pappin DJC; Creasy DM; Cottrell JS Electrophoresis 1999, 20, 3551–3567.
- (27) Zhan X; Patterson AD; Ghosh D BMC Bioinf 2015, 10.1186/s12859-015-0506-3.
- (28) Koenig T; Menze BH; Kirchner M; Monigatti F; Parker KC; Patterson T; Steen JJ; Hamprecht FA; Steen H J. Proteome Res 2008, 7, 3708–3717.
- (29) Fenyo D; Beavis RC Anal. Chem 2003, 75, 768–774.
- (30) Bern M; Kil YJ; Becker C Curr. Protoc Bioinformatics 2012, 40 (13:20), 13.20.1–13.20.14.
- (31) Lam H; Deutsch EW; Eddes JS; Eng JK; King N; Stein SE; Aebersold R Proteomics 2007, 7, 655–667.
- (32) Coombes KR; Tsavachidis S; Morris JS; Baggerly KA; Hung MC; Kuerer HM Proteomics 2005, 5, 4107–4117.
- (33) Tabb DL; MacCoss MJ; Wu CC; Anderson SD; Yates JR Anal. Chem 2003, 75, 2470–2477.
- (34) Konda C; Londry FA; Bendiak B; Xia Y J. Am. Soc. Mass Spectrom 2014, 25, 1441–1450.
- (35) Chang C-C; Lin C-J ACM Trans. Intell. Syst. Technol 2011, 2, 1–27.
- (36) Geppert H; Horváth T; Gärtner T; Wrobel S; Bajorath J J. Chem. Inf. Model 2008, 48, 742–746.
- (37) Fang TT; Bendiak B J. Am. Chem. Soc 2007, 129, 9721–9736.
- (38) Chang C-C; Lin C-J ACM Transactions on Intelligent Systems and Technology 2011, 2, 27.
- (39) Li YJ; Li YG; Chen T; Kuklina AS; Bernard P; Esteva FJ; Shen HF; Ferrari M; Hu Y Clin. Chem 2014, 60, 233–242.
- (40) Zhang ZQ Anal. Chem 2004, 76, 3908–3922.
