Author manuscript; available in PMC: 2021 Feb 19.
Published in final edited form as: J Proteome Res. 2020 Mar 27;19(5):1965–1974. doi: 10.1021/acs.jproteome.9b00793

Identifying Significant Metabolic Pathways Using Multi-Block Partial Least-Squares Analysis

Lingli Deng 1, Fanjing Guo 2, Kian-Kai Cheng 3, Jiangjiang Zhu 4, Haiwei Gu 4, Daniel Raftery 4, Jiyang Dong 5
PMCID: PMC7895463  NIHMSID: NIHMS1669008  PMID: 32174118

Abstract

In metabolomics, identification of metabolic pathways altered by disease, genetics, or environmental perturbations is crucial to uncover the underlying biological mechanisms. A number of pathway analysis methods are currently available, which are generally based on equal-probability, topological-centrality, or model-separability methods. In brief, prior identification of significant metabolites is needed for the first two types of methods, while each pathway is modeled separately in the model-separability-based methods. In these methods, interactions between metabolic pathways are not taken into consideration. The current study aims to develop a novel metabolic pathway identification method based on multi-block partial least squares (MB-PLS) analysis by including all pathways into a global model to facilitate biological interpretation. The detected metabolites are first assigned to pathway blocks based on their roles in metabolism as defined by the KEGG pathway database. The metabolite intensity or concentration data matrix is then reconstructed as data blocks according to the metabolite subsets. Then, a MB-PLS model is built on these data blocks. A new metric, named the pathway importance in projection (PIP), is proposed for evaluation of the significance of each metabolic pathway for group separation. A simulated dataset was generated by imposing artificial perturbation on four pre-defined pathways of the healthy control group of a colorectal cancer study. Performance of the proposed method was evaluated and compared with seven other commonly used methods using both an actual metabolomics dataset and the simulated dataset. For the real metabolomics dataset, most of the significant pathways identified by the proposed method were found to be consistent with the published literature. For the simulated dataset, the significant pathways identified by the proposed method are highly consistent with the pre-defined pathways. 
The experimental results demonstrate that the proposed method is effective for identification of significant metabolic pathways, which may facilitate biological interpretation of metabolomics data.

Keywords: significant metabolic pathway, pathway importance in projection (PIP), multi-block partial least-squares analysis (MB-PLS), simulated pathway analysis

INTRODUCTION

Over the past decade, metabolic pathway analysis has gained increased research interest for biological interpretation of metabolomics data.1–3 A number of methods and software tools4 have been proposed, which can generally be grouped into three categories: equal-probability-based methods,5–7 topological-centrality-based methods,8 and model-separability-based methods.8–11

For both equal-probability- and topological-centrality-based methods, identification of metabolites with significant changes across groups is required prior to analysis and is provided as input. For example, the fold enrichment method, the most commonly used method among the equal-probability-based methods, evaluates the fold change of each pathway based on the ratio of the actual number of significant metabolites to the expected number of significant metabolites in the pathway. On the other hand, topological-centrality-based methods consider both the significant changes of metabolic levels and their topological relationships to better evaluate their impact on the pathways of interest. Equal-probability- and topological-centrality-based methods are members of overrepresentation analysis (ORA).

Unlike equal-probability- and topological-centrality-based methods, a model-separability-based method does not require prior identification of significant metabolites. The global test method9 is a representative model-separability-based method. In the global test method, pathways are treated independently, and a single model is built for each pathway. All detected metabolites involved in a pathway are used to test statistically whether the pathway is significantly perturbed, using Goeman’s global test.12 In this manner, the global test method can make full use of the data.

From a systems biology point of view, many metabolites are involved in more than one pathway, and pathways work collectively, which leads to a perturbation in the overall metabolome.13 However, interactions between metabolites and pathways are not taken into consideration in the currently available pathway analysis methods.10

In the present study, we propose a new method for identification of significant pathways in metabolomics. The developed method is based on multi-block partial least-squares (MB-PLS) analysis,14 and it combines all pathway data blocks into a global model. In addition, a global metric, named pathway importance in projection (PIP), is proposed to evaluate the importance of a given pathway to discriminate the two studied groups in the global model. The method was applied to identify the altered pathways in a colorectal cancer (CRC) dataset15,16 and a simulated data set, and the results are compared with those obtained from other pathway identification methods.

MATERIALS AND METHODS

Datasets

Real Metabolomics Dataset.

In this work, a metabolomics dataset focused on colorectal cancer (CRC) was used to evaluate the proposed method. In brief, 234 blood samples were collected from 66 colorectal cancer patients, 76 colonic polyp patients, and 92 healthy volunteers. The data were acquired using a targeted LC–MS/MS approach, and a total of 113 metabolites were detected, as previously described.15,16 This work was conducted in accordance with the protocols approved by the Indiana University School of Medicine and Purdue University Institutional Review Boards. All subjects in the study provided informed consent according to the institutional guidelines. All the CRC patients in this study were newly diagnosed, and the blood samples were drawn before any surgery, chemotherapy, or radiation treatment. Each blood sample was allowed to clot for 45 min and then centrifuged at 2000 rpm for 10 min. All samples were stored in a −80 °C freezer until further analysis.

The targeted LC–MS/MS method was developed as a standard operating procedure (SOP) in the Northwest Metabolomics Research Center (NW-MRC). Targeted data acquisition was performed in multiple-reaction-monitoring (MRM) mode. The extracted MRM peaks were integrated using MultiQuant 2.1 software (AB Sciex, Toronto, ON, Canada). To correct for a small amount of instrument drift, the spectral data were normalized across the sample batches for each metabolite using the average values from the data of quality control injections (at least five for each batch). A total of 158 MRM transitions were targeted (99 in negative ion mode and 59 in positive ion mode), and 113 metabolites could be measured in the serum samples with sufficient signal-to-noise and very few or no missing data.

Metabolic Pathways in KEGG Database.

The Kyoto Encyclopedia of Genes and Genomes (KEGG; http://www.genome.jp/kegg/)17 is a publicly accessible database, which contains a collection of manually drawn metabolic pathway maps, genetic information, environmental information such as signal transduction, various other cellular processes, and human diseases. In the present study, a total of 81 human (Homo sapiens) metabolic pathways (covering 1498 metabolites) were downloaded from the KEGG database on March 16, 2019.

Of the 113 metabolites detected in the CRC study, only 89 could be found in the 81 Homo sapiens pathways. Furthermore, pathways with a low number of detected metabolites may not carry sufficient biological or statistical information for further analysis, so pathways containing fewer than three detected metabolites were excluded in this study to make the results more robust and interpretable. As a result of the exclusion criteria, only 30 metabolic pathways and 81 metabolites were included for further analysis (Tables S1 and S2).
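The exclusion rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `select_pathways`, the toy pathway memberships, and the placeholder identifiers `hsa_toy` and `C99999` are assumptions made for the example and are not taken from Tables S1 and S2.

```python
def select_pathways(pathway_members, detected, min_detected=3):
    """Keep pathways that contain at least `min_detected` detected metabolites.

    pathway_members: dict mapping a pathway ID to its list of compound IDs.
    detected: iterable of compound IDs measured in the study.
    Returns a dict mapping each retained pathway ID to its detected compounds.
    """
    detected = set(detected)
    kept = {}
    for pathway_id, members in pathway_members.items():
        hits = sorted(detected.intersection(members))
        if len(hits) >= min_detected:
            kept[pathway_id] = hits
    return kept


# Toy input (illustrative only, not the study's pathway table).
pathway_members = {
    "hsa00220": ["C00062", "C00327", "C00025", "C00064", "C00049"],
    "hsa00340": ["C00135", "C05127", "C00025"],
    "hsa_toy": ["C00022", "C99999"],  # hypothetical pathway with one hit
}
detected = ["C00062", "C00327", "C00025", "C00135", "C05127", "C00022"]

blocks = select_pathways(pathway_members, detected)
# Metabolites that remain in the analysis after the exclusion step.
covered = sorted({m for hits in blocks.values() for m in hits})
```

A metabolite that belongs only to excluded pathways drops out of the analysis, which is how the metabolite count can shrink along with the pathway count.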

Simulated Dataset.

In general, the ground truth of the altered pathways is unknown, which makes it difficult to evaluate the performance of significant pathway identification methods. It is necessary to simulate datasets with predefined altered pathways for methods evaluation. In this paper, the simulated datasets were generated by imposing artificial information on some pre-defined pathways of the healthy control group of the CRC study. The simulation algorithm can be briefly described as follows:

  • Step 1. The 92 healthy control samples from the CRC study were randomly divided into two groups (G1 and G2), with 46 samples in each group. Let G1 be the control group and G2 be the experimental group. There are 30 pathways {P(1), P(2), …, P(30)} involved in the data. Without loss of generality, let the first four pathways {P(1), P(2), P(3), P(4)} be the predefined altered pathways. In addition, let {P(1), P(2)} be up-regulated and {P(3), P(4)} be down-regulated in the experimental group G2.

  • Step 2. Let $C = (c_j)_{1 \times 81}$ be the concentration vector of a sample. We decompose the concentration vector as follows:
    $$c_j = \sum_{i=1}^{30} h_{ij}, \quad j = 1, 2, \ldots, 81 \qquad (1)$$
    where $h_{ij}$ is the impact factor of pathway P(i) on metabolite j, which can be written as
    $$h_{ij} = \begin{cases} \dfrac{c_j}{|N_i|}, & j \in N_i \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$
    where $N_i$ indicates the set of metabolites involved in pathway i, and |·| represents the number of elements in the specified set. $H = (h_{ij})_{30 \times 81}$ is called the impact matrix.
  • Step 3. With the impact matrix H, we can impose the alteration information of the pre-defined pathways, e.g., P(1), P(2), P(3), P(4), on the samples in G2 as follows:
    $$h'_{ij} = \begin{cases} h_{ij} \times (1 + f_i), & \text{if } i = 1, 2 \\ h_{ij} \times (1 - f_i), & \text{if } i = 3, 4 \\ h_{ij}, & \text{otherwise} \end{cases} \quad j = 1, 2, \ldots, 81 \qquad (3)$$
    where $f_i$ is a regulating coefficient of pathway i, which can be used to simulate the significance of the pre-defined altered pathway P(i). For the sake of simplicity, the regulating coefficients of the four pre-defined altered pathways are set to the same value in this paper. We can get the new concentration vector of a sample in G2 by summing up its altered impact matrix as follows:
    $$c'_j = \sum_{i=1}^{30} h'_{ij}, \quad j = 1, 2, \ldots, 81 \qquad (4)$$
  • Step 4. Select, randomly, four pathways to be the pre-defined altered pathways; then, a new experimental group G2 will be obtained by updating the samples in the original G2 using Step 3. Combining with the samples of G1, we can get a new simulated dataset with known altered pathways.
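The four steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation. In particular, the sketch splits each concentration equally among the pathways that contain the metabolite, an assumption chosen here so that the columns of the impact matrix sum back to the original concentrations as Step 2 requires; a different share rule can be swapped into `impact_matrix`. All function names and the toy data are hypothetical.

```python
import random


def impact_matrix(conc, members):
    """Impact matrix H with one row per pathway and one column per metabolite.

    Assumption: c_j is shared equally among the pathways containing
    metabolite j, so that the column sums of H reproduce conc.
    Every metabolite is expected to belong to at least one pathway.
    """
    n_met = len(conc)
    n_paths_with = [sum(j in N_i for N_i in members) for j in range(n_met)]
    return [[conc[j] / n_paths_with[j] if j in N_i else 0.0
             for j in range(n_met)] for N_i in members]


def perturb(H, up, down, f):
    """Step 3: scale rows of up-regulated pathways by (1 + f), down by (1 - f)."""
    out = []
    for i, row in enumerate(H):
        if i in up:
            factor = 1.0 + f
        elif i in down:
            factor = 1.0 - f
        else:
            factor = 1.0
        out.append([h * factor for h in row])
    return out


def column_sums(H):
    """Concentration vector recovered by summing each column over pathways."""
    return [sum(col) for col in zip(*H)]


def pick_altered(n_pathways, rng, n_altered=4):
    """Step 4: draw altered pathways at random; first half up, second half down."""
    chosen = rng.sample(range(n_pathways), n_altered)
    return set(chosen[: n_altered // 2]), set(chosen[n_altered // 2:])


# Toy sample: 4 metabolites, 3 pathways (metabolite 1 is shared by two pathways).
conc = [10.0, 4.0, 6.0, 8.0]
members = [{0, 1}, {1, 2}, {3}]
H = impact_matrix(conc, members)
new_conc = column_sums(perturb(H, up={0}, down={2}, f=0.5))
up, down = pick_altered(30, random.Random(0))
```

In the toy sample, the shared metabolite receives only half of the up-regulation because only one of its two pathways is altered.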

Current Methods for Identification of Significant Pathways

Equal-Probability-Based Methods.

Fold enrichment6,7 and the hypergeometric test5,6 are two commonly used methods based on the equal-probability assumption, i.e., all metabolites, especially significant metabolites, are detected with equal probability. Suppose we have K metabolites in a pathway of interest and a total of N metabolites in the metabolic network. Assume that there are k significant metabolites in the pathway of interest and n significant metabolites in the metabolic network. Based on the ratio of significant metabolites in the metabolic network, the expected value of k would be $\tilde{k} = \frac{K}{N} \times n$. If k exceeds this expected value, the pathway is said to be enriched, with a fold enrichment ratio given by $k/\tilde{k}$. In order to get an accurate result, the values of N, K, n, and k should be large. Nevertheless, in metabolomics, the number of identified metabolites is usually moderate, ranging up to several hundred in most studies. Therefore, fold enrichment may not be accurate in this situation. However, from probability theory, the hypergeometric test can be used to calculate the probability for a similar case, that of drawing n balls from a container with N total balls, including K black balls and $N - K$ white balls, with exactly k black balls being sampled in a draw. Thus, the hypergeometric test can be used as an improved alternative to fold enrichment. One of the issues for equal-probability-based methods is that metabolites are treated independently in the analysis, despite the fact that metabolites interact and may be involved in multiple metabolic pathways.
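Both quantities are easy to compute with the Python standard library. The sketch below uses the symbols N, K, n, and k as defined above; the function names and the example numbers are illustrative assumptions. It returns both the probability of exactly k hits, as in the ball-drawing description, and the upper-tail probability of k or more hits, which is the form usually reported as an over-representation p-value.

```python
from math import comb


def fold_enrichment(k, K, n, N):
    """Observed significant metabolites divided by the expected number K/N * n."""
    expected = K / N * n
    return k / expected


def hypergeom_pmf(x, K, n, N):
    """Probability of exactly x black balls when drawing n from N (K black)."""
    return comb(K, x) * comb(N - K, n - x) / comb(N, n)


def hypergeom_upper_tail(k, K, n, N):
    """Probability of drawing k or more black balls (over-representation)."""
    return sum(hypergeom_pmf(x, K, n, N) for x in range(k, min(K, n) + 1))


# Illustrative numbers only: 81 metabolites, a pathway of 10, 20 significant, 5 hits.
fe = fold_enrichment(k=5, K=10, n=20, N=81)
p_exact = hypergeom_pmf(5, K=10, n=20, N=81)
p_tail = hypergeom_upper_tail(5, K=10, n=20, N=81)
```

Because `math.comb` returns 0 when the second argument exceeds the first, impossible draws contribute nothing to the sums and need no special handling.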

Topological-Centrality-Based Methods.

The metabolic network can also be modeled by an undirected graph, in which one metabolite is represented by a node and one reaction by an edge between nodes. Node centrality, which can be described with mathematically defined terms such as degree, closeness, and betweenness, can be used to evaluate the topological characters of nodes in a network.18 For topological-centrality-based methods, the significance of an interesting pathway is in proportion to the ratio of the sum of the centrality of significant metabolites in the pathway to the sum of the centrality of total metabolites in the pathway. In contrast to the equal-probability-based methods, which assume all metabolites have an equal weight in the metabolic network, the topological centrality methods weigh metabolites by their topological-centrality. Both equal-probability- and topological-centrality-based methods require prior identification of significant metabolites. The results may be sensitive to the identification of significant metabolites by univariate or multivariate statistics.
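The centrality ratio described above can be sketched with degree as the centrality measure. The toy graph, the node names, and the function names are assumptions for illustration; closeness or betweenness could be substituted by passing a different `centrality` dictionary.

```python
def degree_centrality(edges):
    """Raw degree of each node in an undirected graph given as (u, v) pairs."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return degree


def pathway_impact(pathway_nodes, significant, centrality):
    """Centrality of significant metabolites divided by that of all metabolites."""
    total = sum(centrality.get(m, 0) for m in pathway_nodes)
    hit = sum(centrality.get(m, 0) for m in pathway_nodes if m in significant)
    return hit / total if total else 0.0


# Toy pathway: five metabolites connected by four reactions.
edges = [("A", "B"), ("B", "C"), ("B", "D"), ("D", "E")]
centrality = degree_centrality(edges)
impact = pathway_impact({"A", "B", "C", "D", "E"}, {"B", "E"}, centrality)
```

Raw degree is used rather than degree normalized by the number of nodes, since a common normalization constant cancels in the ratio.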

Model-Separability-Based Methods.

The pathway discrimination score is an example of a model-separability-based method. In this method, metabolites are first assigned based on their involvement in pathways, and the data are subjected to analysis by a multivariate regression method, such as PLS-DA.19 The discrimination score of the PLS-DA model, e.g., the AUROC (area under a receiver operating characteristic curve),20 can then be used to identify the significant pathways. Another popular method is to use Goeman’s global test12 to determine whether the behavior of the group of metabolites belonging to the pathway is significantly related to a particular outcome of interest.9 The global test is implemented by testing the null hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0$, where $\beta_i$ (i = 1, 2, …, m) are the model regression coefficients. The model-separability-based methods analyze each pathway separately. However, multiple pathways are usually affected to a greater or lesser extent during the development of a disease, and a single pathway feature may not be sufficient to distinguish two different conditions. Therefore, establishing a comprehensive model that includes all pathway data in a global model may provide additional useful insight into the biological interpretation of the results.

Proposed Method Based on the MB-PLS Model

Pathway-Level Dataset.

Assume that there are M metabolites detected in a metabolomics study, and the M metabolites are involved in B metabolic pathways. The first step of our work is to construct a pathway-level dataset. Assume that Lb(0<Lb ≤ M) of the M metabolites belong to the bth(b = 1,2, …, B) pathway. The state of the bth pathway is characterized by data block P(b), which contains concentration levels of the Lb metabolites belonging to the bth pathway. Then, the concatenated matrix P = (P(1), P(2), …, P(B)) is called the pathway-level matrix.

In preparation for analysis, the pathway-level matrix is preprocessed. First, each column of the matrix block is mean-centered and scaled to unit variance, and each matrix block of the pathway-level matrix is then scaled to the same total variance. For the sake of simplicity and without loss of generality, the scaled pathway-level matrix is still denoted as P = (P(1), P(2), …, P(B)).
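A minimal sketch of this two-stage scaling is given below. The text does not state how the blocks are brought to the same total variance; the sketch assumes the common choice of dividing each autoscaled block by the square root of its number of columns, so that every block ends up with a total variance of 1. Function names and toy data are hypothetical.

```python
from statistics import mean, stdev, variance


def autoscale(column):
    """Mean-center a column and scale it to unit (sample) variance."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]


def scale_blocks(blocks):
    """Autoscale every column, then give each block the same total variance.

    blocks: list of blocks; each block is a list of columns (one per metabolite).
    Assumption: each block is divided by sqrt(L_b), so its total variance is 1.
    """
    scaled = []
    for block in blocks:
        columns = [autoscale(col) for col in block]
        weight = len(columns) ** 0.5
        scaled.append([[x / weight for x in col] for col in columns])
    return scaled


# Toy pathway-level data: a block with three metabolites and a block with one.
blocks = [
    [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 1.0, 3.0], [5.0, 5.0, 6.0, 8.0]],
    [[1.0, 3.0, 2.0, 5.0]],
]
scaled = scale_blocks(blocks)
totals = [sum(variance(col) for col in block) for block in scaled]
```

Without the second stage, a pathway with many detected metabolites would dominate the global model simply because it contributes more columns.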

MB-PLS Analysis.

Here, MB-PLS analysis is used to investigate the pathway-level matrix P = (P(1), P(2), …, P(B)). There are essentially two ways to obtain the MB-PLS model:14,21 the first is to use the block-scores to deflate the block matrix P(b)(b = 1,2, …, B).22 This ensures that the block-scores are orthogonal; however, this approach has been proven to lead to inferior predictive results.23 The second approach is to use super-scores to deflate the block matrix P(b)(b = 1,2, …, B),14 and it is the strategy used in the current work. The super-scores vector si is the sum of weighted block-scores vectors ti (b).

$$s_i = \sum_{b=1}^{B} w_{ib}\, t_i^{(b)}, \quad \text{for } i = 1, 2, \ldots, I \qquad (5)$$

where wi = (wi1, wi2, …, wiB) is the normalized super-weight for the ith component of the model, and the element wib reflects the weight of the data block P(b) for the ith component. The detailed algorithm for deflating the block matrix and obtaining super-scores has been previously described.14,21,23 The appropriate number of model components I can be obtained with Monte Carlo cross-validation (MCCV),24 and the model prediction performance is estimated using the AUROC.
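The combination step in eq 5 alone can be sketched as below. This is not the full MB-PLS algorithm (deflation and the computation of block-scores are described in the cited references); it assumes that "normalized" means the super-weight vector has unit Euclidean length, and the function name and numbers are illustrative.

```python
def super_score(block_scores, weights):
    """Weighted sum of block-score vectors for one model component (eq 5).

    block_scores: list of B score vectors, one per pathway block.
    weights: B super-weights; rescaled here to unit Euclidean length
             (an assumption about what 'normalized' means).
    """
    norm = sum(w * w for w in weights) ** 0.5
    w = [x / norm for x in weights]
    n_samples = len(block_scores[0])
    return [sum(w[b] * block_scores[b][k] for b in range(len(w)))
            for k in range(n_samples)]


# Toy component: two blocks, three samples.
block_scores = [[1.0, 0.0, -1.0], [2.0, 2.0, -4.0]]
s = super_score(block_scores, weights=[3.0, 4.0])
```

The squared normalized weights sum to 1, which is what later allows them to be read as the relative contributions of the blocks to a component.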

Pathway Importance in Projection (PIP).

For conventional PLS-DA, variable importance in projection (VIP)25 is usually used to evaluate the contribution of variables to the model. Inspired by the idea of VIP, a new metric named PIP is introduced in this paper to estimate the contribution and importance of a pathway in the MB-PLS model.

For a MB-PLS model with I components, the importance of the bth pathway is defined as

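$$\mathrm{PIP}_b = \sqrt{\frac{B \sum_{i=1}^{I} r^2(y, s_i)\, w_{ib}^2}{\sum_{i=1}^{I} r^2(y, s_i)}}, \quad \text{for } b = 1, 2, \ldots, B \qquad (6)$$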

where y is the response variable (i.e., the class label vector of the samples), and $r^2(y, s_i)$ is the squared correlation between y and the ith super-score $s_i$, which measures the variance of the response variable y explained by $s_i$. The larger the $\mathrm{PIP}_b$ value, the more important the pathway P(b). PIP is a normalized metric: because each super-weight vector $w_i$ is normalized, the average of the squared PIP values equals 1,

$$\frac{1}{B} \sum_{b=1}^{B} \mathrm{PIP}_b^2 = 1 \qquad (7)$$
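Given the squared correlations and the super-weights of a fitted model, the PIP values follow directly. The sketch below assumes a VIP-style definition with a square root and super-weight vectors of unit length, under which the squared PIP values average to 1; the function name and the numbers are made up for illustration.

```python
def pathway_importance(r2, W):
    """PIP value of every pathway block.

    r2: list of squared correlations r^2(y, s_i), one per component.
    W:  W[i][b] is the normalized super-weight of block b on component i
        (each row is assumed to have unit Euclidean length).
    """
    n_blocks = len(W[0])
    total = sum(r2)
    return [
        (n_blocks * sum(r2[i] * W[i][b] ** 2 for i in range(len(r2))) / total) ** 0.5
        for b in range(n_blocks)
    ]


# Toy model: two components, three pathway blocks.
r2 = [0.5, 0.2]
W = [[0.6, 0.8, 0.0],
     [0.0, 0.6, 0.8]]
pip = pathway_importance(r2, W)
mean_squared_pip = sum(p * p for p in pip) / len(pip)
```

In this toy model only the second block exceeds 1, because it carries weight on both components.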

PIP for Identification of Significant Pathway.

The significance of the PIP value of each pathway is evaluated by a sample-based permutation test procedure that preserves the complex correlation structure of the data. Specifically, we permuted the class labels (response variable y) and calculated the PIP values for all pathways in the MB-PLS model of the permuted data. In this way, we obtained a null distribution of PIP values for each pathway. The empirical, nominal p-value of the observed PIP value of a pathway was then calculated relative to this null distribution. Because the average of the squared PIP values equals 1, any pathway with PIP > 1 is considered to contribute strongly to model separability. In this study, pathways satisfying both PIP > 1 and nominal p-value < 0.1 are considered significant pathways.
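The label-permutation procedure can be sketched generically as follows. The statistic function stands in for "refit the MB-PLS model and return the PIP of one pathway"; here a simple difference of group means (`mean_gap`, a hypothetical placeholder) is used so the example runs on its own. The add-one correction in the empirical p-value is an assumption, since the exact estimator is not specified in the text.

```python
import random


def permutation_null(stat_fn, data, labels, n_perm=1000, seed=0):
    """Null distribution of a statistic obtained by shuffling the class labels."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(stat_fn(data, shuffled))
    return null


def permutation_pvalue(observed, null_values):
    """Empirical p-value with an add-one correction (an assumption)."""
    exceed = sum(v >= observed for v in null_values)
    return (exceed + 1) / (len(null_values) + 1)


def is_significant(pip_value, p_value, pip_cut=1.0, p_cut=0.1):
    """Decision rule from the text: PIP > 1 and nominal p-value < 0.1."""
    return pip_value > pip_cut and p_value < p_cut


def mean_gap(data, labels):
    """Placeholder statistic: absolute difference between the two group means."""
    g1 = [x for x, y in zip(data, labels) if y == 1]
    g0 = [x for x, y in zip(data, labels) if y == 0]
    return abs(sum(g1) / len(g1) - sum(g0) / len(g0))


data = [float(x) for x in range(10)]
labels = [0] * 5 + [1] * 5
observed = mean_gap(data, labels)
null = permutation_null(mean_gap, data, labels, n_perm=200)
p_value = permutation_pvalue(observed, null)
```

Shuffling only the labels leaves the data matrix untouched, so the correlations among metabolites and among pathway blocks are the same in every permuted dataset.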

RESULTS AND DISCUSSION

Results for CRC Metabolomics Dataset

Overview of the Dataset by MB-PLS.

MB-PLS models were built for the pairwise group comparisons among the CRC, polyps, and healthy control groups. The score plots are shown in Figure 1a–c. The separability of each MB-PLS model was estimated by AUROC using an MCCV procedure with 100 repeats, in which the samples were randomly divided into two sets: 70% as the training set and 30% as the test set. The average and standard deviation of the AUROC values were calculated and are shown in Figure 1d. The results indicate that the MB-PLS models are capable of separating the CRC group from both the healthy control and polyp patient groups, with an AUROC value of 0.85 between CRC and controls and 0.88 between CRC and polyp patients. An MB-PLS model for comparison of controls and polyp patients was also established; it shows low separability, with an AUROC value of 0.52.
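The evaluation loop rests on two pieces that can be sketched independently of the model: repeated random 70/30 splits and an AUROC computed from predicted scores. The sketch below uses the rank-based (Mann-Whitney) form of the AUROC; the function names are hypothetical, and model fitting on the training indices is left out.

```python
import random


def auroc(scores, labels):
    """Rank-based AUROC: chance that a positive sample outscores a negative one.

    Ties between a positive and a negative sample count as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mccv_splits(n_samples, n_repeats=100, train_fraction=0.7, seed=0):
    """Yield (train_indices, test_indices) for Monte Carlo cross-validation."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_samples)
    for _ in range(n_repeats):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]


splits = list(mccv_splits(10, n_repeats=3))
perfect = auroc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
chance = auroc([0.5, 0.5], [1, 0])
```

The splits here are not stratified by class, so with small groups a test set could end up without one of the classes; a stratified variant would shuffle each class separately.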

Figure 1.

Score plots of MB-PLS models of (a) controls vs CRC, (b) polyps vs CRC, and (c) polyps vs controls. (d) AUROC plot of the three MB-PLS models obtained from a 100 repeat MCCV procedure.

Subsequently, PIPs were calculated to evaluate the pathways that contributed most to discriminate CRC patients from the other two groups in this study. A permutation test was used to evaluate the significance level of the PIP. When the PIP score threshold was set to 1 and the nominal p-value of the PIP threshold was set to 0.1, five pathways (arginine and proline metabolism; arginine biosynthesis; histidine metabolism; glutamine and glutamate metabolism; alanine, aspartate and glutamate metabolism) were selected for the group separation between CRC patients and healthy controls, and three pathways (arginine biosynthesis; histidine metabolism; glutamine and glutamate metabolism) were selected as significant pathways for interpretation of the difference between CRC patients and polyps patients, as shown in Figure 2 and Table S3.

Figure 2.

PIP for all pairwise comparisons among CRC, polyps, and healthy control groups. *, nominal p-value of PIP < 0.1; **, nominal p-value of PIP < 0.05.

Three pathways are significant in both the “CRC versus controls” and “CRC versus polyps” models. Metabolites of these pathways are shown in Figure 3. Notably, most of these metabolites are amino acids. The altered concentrations of amino acids suggest perturbed cancer cell activities, e.g., synthesis of proteins or catabolism to provide energy and/or other metabolite substrates. The detected pathways are also supported by a number of recent studies. For example, cancer cells use glutamine as an important energy source.26 Glutamine and glutamate metabolism plays key roles in tumor growth and invasion, as also reported in numerous CRC studies.3,27–30 In addition, two recent papers31,32 show a positive association between consumption of arginine and improvement of CRC. It has been suggested that arginine and proline metabolism may serve as an indicator of immune system impairment and abnormal apoptosis in CRC patients.27 Furthermore, bioinformatics analysis showed that the alanine, aspartate, and glutamate metabolism pathway had the greatest effect on CRC,33 and changes in alanine, aspartate, and glutamate metabolism were also detected in other published CRC studies.3,34 In addition, disordered histidine metabolism has also been reported in CRC30 and other types of cancer.35,36 The current work focuses on method development, so a detailed biological analysis of the data is beyond the scope of this paper.

Figure 3.

Sub-network of the five identified pathways and their detected metabolites. The pathways and metabolites are coded based on the KEGG database (Note: hsa00220: Arginine biosynthesis; hsa00250: Alanine, aspartate and glutamate metabolism; hsa00330: Arginine and proline metabolism; hsa00340: Histidine metabolism; hsa00471: Glutamine and glutamate metabolism. C00022: Pyruvate; C00025: Glutamic acid; C00026: Alpha-ketoglutaric acid; C00036: Oxaloacetate; C00041: Alanine; C00042: Succinate; C00049: Aspartic acid; C00062: Arginine; C00064: Glutamine; C00122: Fumaric acid; C00135: Histidine; C00148: Proline; C00152: Asparagine; C00300: Creatine; C00327: Citrulline; C00334: Gamma-aminobutyrate; C00581: Guanidinoacetate; C01157: Hydroxyproline; C03794: Adenylosuccinic acid; C05127: 1-Methylhistamine).

Comparison with Other Model-Separability-Based Methods.

Three model-separability-based methods were compared: the proposed method (the PIP of the MB-PLS model), the p-value of the global test, and the AUROC of the pathway discrimination score. For the pathway discrimination score, a PLS-DA model was built for each pathway block, and the prediction accuracy of the class labels of the samples was then evaluated using an MCCV procedure (repeated 100 times), in which the sample set was randomly divided into two subsets: 70% as the training set and 30% as the test set. The average AUROC of the test set was used to evaluate the significance of pathways.

The PIP of the MB-PLS model, the p-value of the global test, and the AUROC of the PLS-DA model were calculated for the 30 pathways, comparing CRC patients with healthy controls and with polyp patients, as shown in Figure 4. These results show a positive correlation between PIP, AUROC, and −lg(p-value of global test). Most pathway blocks with a large PIP also have a large AUROC and −lg(p-value of global test), and vice versa. Still, there are some notable differences between the results of PIP and the two other methods. The global test and the AUROC of PLS-DA treat each pathway block as a discrete entity and build a separate model for each pathway. In contrast, MB-PLS builds a single model for all the pathway blocks. Therefore, the PIP score represents a global statistical value that considers both the influence of single pathway blocks and the association between pathway blocks. Thus, owing to the interaction with other pathway blocks, some pathways that were found to be not very important by the local analysis methods (such as the global test and the AUROC of the PLS-DA model) may show a large PIP score in the MB-PLS model, such as glutamine and glutamate metabolism in the current study (shown as node 1 in Figure 4).

Figure 4.

Analysis results of three model-separability based methods. (a) Controls vs CRC. (b) Polyps vs CRC.

Comparison with Other Pathway Identification Methods.

The proposed PIP method was compared with the other seven commonly used methods. These commonly used methods include two equal-probability-based methods (i.e., hypergeometric test and fold enrichment), three topological-centrality-based methods (i.e., closeness, degree, and betweenness), and two model-separability-based methods (i.e., AUROC of PLS-DA model and global test). For the equal-probability-based methods and topological-centrality-based methods, PLS-DA was applied to identify the significant metabolites prior to significant pathway identification using the criterion VIP > 1. Note that different methods have different metrics for the significance of pathways, and those metrics are not comparable. So, we ranked the pathways according to their significances in each method, and then compared the eight methods using the pathways rankings, as shown in Figure 5.

Figure 5.

Ranking of 30 significant pathways following analysis with eight different methods. (a) CRC vs controls; (b) CRC vs polyps. Warmer colors indicate higher significance of the pathway.

To compare the results from all eight methods, hierarchical clustering analysis was performed on pathway ranks of the eight methods (Figure 5). Figure 5a shows that methods corresponding to the same category are clustered tightly in the hierarchical clustering. It follows that methods in the same category show very similar analysis results, and the results between methods of different categories are less similar. For example, pyrimidine metabolism gets better rankings in three model-separability-based methods but lower rankings in the equal-probability and topological-centrality methods. By comparing the results from all eight methods, we found that equal-probability and topological-centrality methods produced more consistent results with one another, while results from model-separability-based methods were different from the other two types of methods.

Pathway Analysis of a Breast Cancer Metabolomics Dataset.

A dataset with a larger set of measured compounds may help to show the full potential of the developed method. We downloaded a breast cancer metabolomics dataset from the public repository of MetabolomicsWorkBench (http://metabolomicsworkbench.org/; Project ID: PR000226), which has a larger set of measured compounds (228 metabolites) than the CRC dataset. A number of metabolic pathways were identified by the developed method as significantly altered in breast cancer patients. These significant pathways have previously been reported to be closely related to breast cancer. More detailed results are presented in the Supporting Information.

Although there are 228 detected metabolites in the breast cancer dataset, only 33 pathway candidates and 88 detected metabolites were included for further analysis following exclusion of pathways with fewer than three metabolites. This finding indicates that metabolite mapping (or annotation) and metabolite coverage are two challenges for pathway analysis in metabolomics.

Results for the Simulated Dataset

The eight pathway identification methods were also applied to the simulated datasets, which are generated as described in the Materials and Methods section under “Simulated Dataset”. Briefly, there are two groups of samples in the simulated datasets, i.e., the control group G1 and the experimental group G2 with four pre-defined altered pathways, 46 samples for each group. The four pre-defined pathways were randomly selected from the 30 involved pathways, and then alteration information was imposed on the four pathways with a given regulating coefficient f. Pathway analysis results of each method were rank-transformed. The rank sum of the pre-defined pathways is adopted in this analysis to estimate the performance of these methods. A small rank sum means a good performance for pathway identification.
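The rank-sum criterion can be sketched as below. The function names and toy scores are illustrative; ties are broken by pathway order, which is an assumption because the tie-handling rule is not described. For metrics where smaller values mean higher significance, such as p-values, `higher_is_better=False` reverses the ordering.

```python
def rank_pathways(scores, higher_is_better=True):
    """Rank pathways from most significant (rank 1) to least significant."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    ranks = [0] * len(scores)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def rank_sum(scores, altered, higher_is_better=True):
    """Sum of the ranks of the pre-defined altered pathways (smaller is better)."""
    ranks = rank_pathways(scores, higher_is_better)
    return sum(ranks[i] for i in altered)


# Toy example: six pathways, of which pathways 0 and 2 were altered.
pip_scores = [1.8, 0.4, 1.5, 0.9, 1.1, 0.2]
best_case = rank_sum(pip_scores, altered={0, 2})
p_value_ranks = rank_pathways([0.01, 0.5, 0.2], higher_is_better=False)
```

With four altered pathways, as in the simulations, the smallest attainable rank sum is 1 + 2 + 3 + 4 = 10, reached when a method places all four at the top of its ranking.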

To get a comprehensive assessment of the eight methods, a series of simulated datasets with different regulating coefficients f and different pre-defined altered pathways were generated and subjected to these identification methods. In each simulated dataset, four pathways were randomly selected from the 30 pathways to be the pre-defined altered pathways. The average rank sums over 100 repetitions for different regulating coefficients f are plotted in Figure 6.

Figure 6.

Sum of ranks of the four pre-defined altered pathways with respect to different regulating coefficients. Note that each point in the plot is an average over 100 simulated datasets with different pre-defined pathways.

Figure 6 indicates that it is easier to identify altered pathways with larger perturbation (i.e., larger f) for the eight identification methods, as we can see that the rank sum of the four pre-defined pathways decreases with increasing regulating coefficient f. For a given regulating coefficient f, model-separability-based methods (PIP, global test, and AUROC of PLS-DA) perform best followed by equal-probability methods (fold enrichment and hypergeometric test) and then topological-centrality methods (degree, closeness, and betweenness). As to the three model-separability-based methods, PIP and the global test produced better results than the AUROC of PLS-DA. The global test worked almost as well as PIP for the severely altered datasets (i.e., large regulating coefficient f), while PIP performed better than the global test and all other methods for the datasets with small regulating coefficients, which implies that PIP has better sensitivity for significant pathways identification than the other seven methods.

CONCLUSIONS

Currently, LC–MS-based untargeted metabolomics methods typically result in several thousands of peaks. However, peak annotation remains a challenge for researchers, such that the number of annotated peaks is typically in the hundreds. Notably, however, unconfirmed peak annotation may complicate results analysis and may even lead to misleading biological interpretations. In the current study, the developed MB-PLS method was applied on the CRC metabolomics dataset with a total of 113 detected known metabolites (obtained from a targeted metabolomics experiment that targeted 158 MRM transitions). The number of metabolites was further reduced to 81 due to exclusion criteria used in the current study. Similar to all other pathway analysis methods, a dataset with a larger set of measured compounds will provide further validation of the developed method.

Here, we have developed a MB-PLS-based method with a proposed PIP metric for identification of significant pathways in metabolomics. The new approach groups detected metabolites into pathway blocks based on their involvement in metabolism and builds a single global model that includes all pathways in a single analysis. The developed method provides a global view of altered metabolism showing the influence of each pathway and interactions between pathways. The results from experimental metabolomics and synthetic datasets showed that the MB-PLS-based method with PIP parameter provides good performance for significant pathway identification. In addition, the proposed MB-PLS analysis was found to facilitate biological interpretation of the metabolomics data. Similar to other currently available pathway analysis methods, application of the MB-PLS-based method on a metabolomics dataset with a large set of measured compounds may help to show the full potential of the developed method. Taken together, the MB-PLS-based method developed in the current study may serve as a potential alternative method for metabolic pathway analysis.

Supplementary Material

SI Pathway Multiblock Analysis

ACKNOWLEDGMENTS

This work was financially supported by the National Natural Science Foundation of China (nos. 81801788 and 81871445), the Educational Commission of Jiangxi Province (no. GJJ160591), and the Cancer Consortium Support Grant (5 P30 CA015704) at the University of Washington. K.K.C. is supported by the Research University Grant from Universiti Teknologi Malaysia (14H62 and 18H73).

Footnotes

Supporting Information

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jproteome.9b00793.

(Table S1) Information on the metabolic pathways involved in the dataset, (Table S2) compounds involved in the colorectal cancer dataset, (Table S3) PIP values of the MB-PLS model for the pathways in the colon dataset, (Figure S1) algorithm diagram of the MB-PLS model and PIP calculation, and pathway analysis of a breast cancer metabolomics dataset (PDF)

The colorectal cancer dataset and Matlab code for the MB-PLS model and PIP identifier (ZIP)

Complete contact information is available at: https://pubs.acs.org/10.1021/acs.jproteome.9b00793

The authors declare no competing financial interest.

The colorectal cancer dataset has also been deposited in the Metabolomics Workbench repository (Project ID: PR000226) at the following link: https://www.metabolomicsworkbench.org/data/DRCCMetadata.php?Mode=Project&ProjectID=PR000226. The Matlab code for the MB-PLS model and PIP identifier is also available on GitHub: https://github.com/jydong2018/metabolomics/blob/master/MBPLS_PIP.m.

Contributor Information

Lingli Deng, Department of Information Engineering, East China University of Technology, Nanchang 330013, China.

Fanjing Guo, Department of Electronic Science, Xiamen University, Xiamen 361005, China.

Kian-Kai Cheng, Innovation Centre in Agritechnology, Universiti Teknologi Malaysia, 84600 Muar, Johor, Malaysia.

Jiyang Dong, Department of Electronic Science, Xiamen University, Xiamen 361005, China.

