Bioinformatics. 2017 Jan 30;33(10):1545–1553. doi: 10.1093/bioinformatics/btx012

Sparse network modeling and Metscape-based visualization methods for the analysis of large-scale metabolomics data

Sumanta Basu 1,2,#, William Duren 3,#, Charles R Evans 4, Charles F Burant 4, George Michailidis 5, Alla Karnovsky 3
Editor: Cenk Sahinalp
PMCID: PMC5860222  PMID: 28137712

Abstract

Motivation

Recent technological advances in mass spectrometry, development of richer mass spectral libraries and data processing tools have enabled large scale metabolic profiling. Biological interpretation of metabolomics studies heavily relies on knowledge-based tools that contain information about metabolic pathways. Incomplete coverage of different areas of metabolism and lack of information about non-canonical connections between metabolites limit the scope of applications of such tools. Furthermore, the presence of a large number of unknown features, which cannot be readily identified but nonetheless can represent bona fide compounds, also considerably complicates biological interpretation of the data.

Results

Leveraging recent developments in the statistical analysis of high-dimensional data, we developed a new Debiased Sparse Partial Correlation algorithm (DSPC) for estimating partial correlation networks and implemented it as a Java-based CorrelationCalculator program. We also introduce a new version of our previously developed tool Metscape that enables building and visualization of correlation networks. We demonstrate the utility of these tools by constructing biologically relevant networks and in aiding identification of unknown compounds.

Availability and Implementation

http://metscape.med.umich.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Development and application of high-resolution analytical methods for metabolomics allow generation of increasingly large and complex data sets. This has created a need for data analysis and visualization tools to assist with interpretation of experimentally observed changes and put them in relevant biological or disease context. The most commonly used approach to achieving this task relies on mapping metabolites onto metabolic pathways. A number of carefully curated databases contain information about metabolic pathways in different organisms (Caspi et al., 2012; Duarte et al., 2007; Hao et al., 2010; Kanehisa, 2006; Ma et al., 2007; Romero et al., 2005; Sigurdsson et al., 2010; Thiele et al., 2013). Virtually all of them contain information generated via genome-based metabolic reconstructions combined with extensive literature searches and expert curation. In addition to detailed information about metabolites, metabolic reactions, enzymes, genes and metabolic pathway topology, some tools include information about sub-cellular compartments where the metabolic reactions occur (Hao et al., 2010) and describe metabolic enzyme complexes and transporters (Duarte et al., 2007).

While the pathway databases provide carefully curated, high quality data that cover the majority of primary metabolites, the coverage of lipids, secondary, and volatile metabolites is significantly lower, resulting in relatively low overall coverage of experimentally identified metabolites (Barupal et al., 2012). Additionally, the metabolites from different organisms (e.g., bacterial metabolites from the microbiome), drug metabolites, and compounds of environmental origin are not included in the majority of curated pathways. Several approaches have utilized additional information to overcome this problem and expand the range of metabolites included in secondary analysis. For example, MetaMapp and MetaMappR combine the biochemical reactions from KEGG with Tanimoto chemical and National Institute of Standards and Technology (NIST) mass spectral similarity scores to build extended metabolite networks (Barupal et al., 2012; Grapov et al., 2015). Li et al. (2013) recently developed Mummichog, a tool that combines network analysis with metabolite identity prediction. Efforts have also been made to extend metabolite annotation coverage beyond pathways using Medical Subject Headings (MeSH) to link them to publications (Duren et al., 2014; Sartor et al., 2012).

Untargeted metabolomics studies can simultaneously measure thousands of features, many of which cannot be readily identified, but nonetheless could be strongly associated with the disease or specific biological condition under study. Identification of unknown features can be labor intensive and usually requires additional experiments [reviewed in (Neumann and Bocker, 2010)]. Data driven approaches that allow the inclusion of unknown features in metabolic pathway analysis have the potential to facilitate their tentative identification and prioritization for experimental follow up.

Correlation networks have been widely used in the analysis of functional genomics (Kramer et al., 2009), proteomics (Petsalaki et al., 2015), and combined genomics and metabolomics data (Camacho et al., 2005; Steuer et al., 2003). Early work in this area used Pearson’s correlation coefficients to establish linear associations between metabolites (Camacho et al., 2005; Steuer et al., 2003) and infer the structure of underlying metabolic networks. While these are easy to compute, they capture both direct and indirect associations between metabolites. In contrast, partial correlation networks can differentiate between direct and indirect associations and provide insights into the dependence structure between metabolites. Several groups have used Gaussian graphical modeling (GGM) to reconstruct partial correlation networks among sets of genes or metabolites in order to overcome this limitation (de la Fuente et al., 2004; Krumsiek et al., 2012; Krumsiek et al., 2011; Zuo et al., 2014). However, computing a full partial correlation network requires the sample size to be at least as large as the number of features being analyzed. This condition is rarely met in practice, especially in untargeted LC-MS studies that detect thousands of metabolic features, which considerably limits their practical applications. One way to get around this problem is to compute limited order partial correlations. This has been done to build gene co-expression/correlation networks (de la Fuente et al., 2004; Wille et al., 2004). An alternative approach developed in machine learning literature is based on regularized estimation of the partial correlation matrices (Bühlmann and Van De Geer, 2011). This approach has been applied to genomics data (Schafer and Strimmer, 2005; Schafer and Strimmer, 2005). Graphical lasso (Glasso) (Friedman et al., 2008) and nodewise regression (Meinshausen and Buhlmann, 2006) are two of the earliest and most popular methods for doing regularized estimation.
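The difference between direct and indirect association can be made concrete with a first-order partial correlation. The sketch below is illustrative only: the three-metabolite chain A, B, C and the correlation values are hypothetical, and the function name is an assumption introduced here, not part of any tool described in this paper. It shows a case where the Pearson correlation between A and C is substantial even though the partial correlation given B is zero.

```python
import math

def first_order_partial_corr(r_xz, r_xy, r_yz):
    """Partial correlation of x and z given y, from pairwise Pearson correlations."""
    return (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy**2) * (1 - r_yz**2))

# Hypothetical chain A -> B -> C: A and C are related only through B.
r_ab, r_bc = 0.8, 0.8
r_ac = r_ab * r_bc  # marginal correlation induced purely by the shared neighbour B

partial_ac = first_order_partial_corr(r_ac, r_ab, r_bc)
print(f"Pearson r(A, C) = {r_ac:.2f}; partial r(A, C | B) = {partial_ac:.2f}")
```

A Pearson network would draw an A-C edge here, whereas a partial correlation network would not.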

We present here a novel Debiased Sparse Partial Correlation algorithm (DSPC) that is based on the recently proposed de-sparsified graphical lasso modeling procedure (Jankova, 2015). A key assumption underlying our modeling strategy is that the number of true connections among the metabolites is much smaller than the available sample size, i.e. the true network of partial correlations among the metabolites is sparse. This assumption is strongly supported both by empirical evidence and theoretical calculations (Gardner et al., 2003; Jeong et al., 2001; Leclerc, 2008). Under this assumption, DSPC reconstructs a graphical model and provides partial correlation coefficients and P-values for every pair of metabolic features in the dataset. Thus, DSPC allows discovering connectivity among large numbers of metabolites using fewer samples. The results can be visualized as weighted networks where nodes represent metabolites and edges represent partial correlation coefficients or the associated P-values.

In this study, we demonstrate the power of DSPC by comparing it to previous approaches using a published data set from the KORA population study containing 1020 samples and 151 metabolites (http://www.helmholtz-muenchen.de/cmb/ggm) (Krumsiek et al., 2011). We also introduce the new version of our previously developed tool Metscape (Karnovsky et al., 2012) that enables building and visualization of correlation networks. We used our tools to analyze the data from a multiplatform metabolomics study of compounds linked to type 1 diabetes in the non-obese diabetic (NOD) mouse model (Fahrmann et al., 2015). To further validate DSPC and to demonstrate the utility of the new Metscape module, we used targeted (amino acids and acylcarnitines) and untargeted LC-MS metabolic profiles from the blood samples of 120 mid-life women participating in the longitudinal, multi-ethnic Study of Women's Health Across the Nation (SWAN) (Sowers et al., 2000). The integrative analysis of targeted and untargeted datasets provides an internal control for corroborating the findings of the methodology. We also illustrate the application of our tools for building biologically relevant networks and guiding the identification of unknown features.

2 Materials and methods

2.1 The DSPC algorithm

Suppose the relative concentration levels of $p$ metabolites on $n$ samples, after log transformation and appropriate normalization, are stored in an $n \times p$ matrix $X$, where $X_{ij}$ denotes the relative concentration level of the $j$th metabolite in the $i$th sample. The columns of $X$ are centered and scaled to have zero mean and unit standard deviation.
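As a minimal sketch of this preprocessing step, the code below centers and scales the columns of a small matrix and forms the matrix X'X/n. It assumes the 1/n (population) convention for the standard deviation, so that the diagonal of X'X/n equals one; the paper does not state which convention CorrelationCalculator uses. The function names and the toy values are hypothetical.

```python
import math

def standardize_columns(X):
    """Center each column to mean 0 and scale it to unit standard deviation.

    Uses the 1/n convention (an assumption), so diag(Z'Z/n) is exactly 1.
    A constant column has zero standard deviation and must be removed first.
    """
    n, p = len(X), len(X[0])
    Z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        mean = sum(col) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for i in range(n):
            Z[i][j] = (col[i] - mean) / sd
    return Z

def sample_covariance(Z):
    """Return Z'Z / n for an already centered matrix Z."""
    n, p = len(Z), len(Z[0])
    return [[sum(Z[i][a] * Z[i][b] for i in range(n)) / n for b in range(p)]
            for a in range(p)]

# Hypothetical log-scale values: 4 samples (rows) by 2 metabolites (columns).
X = [[1.0, 10.0], [2.0, 14.0], [3.0, 12.0], [4.0, 20.0]]
Z = standardize_columns(X)
S = sample_covariance(Z)
```

After standardization, the off-diagonal entries of S are the Pearson correlations between metabolites.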

Assuming the relative concentration levels of the $p$ metabolites in the population come from a multivariate normal distribution $N(0, \Sigma)$, metabolites $i$ and $j$ have non-zero partial correlation if and only if $\Theta_{ij} \neq 0$, where $\Theta = \Sigma^{-1}$ is the inverse covariance matrix. Based on a sample of size $n$, the graphical lasso (glasso) estimator of $\Theta$ is defined as

$$\hat{\Theta} = \arg\min_{\Theta \succ 0} \; \mathrm{tr}(S\Theta) - \log\det\Theta + \lambda \|\Theta\|_{1,\mathrm{off}}$$

where $S = X^{\top}X/n$ is the sample covariance matrix, $\|\Theta\|_{1,\mathrm{off}} = \sum_{i \neq j} |\Theta_{ij}|$ is the sparsity-inducing $\ell_1$-penalty and $\lambda$ is a positive tuning parameter controlling the amount of regularization (Jankova, 2015). In the simulation and real data analyses reported in this paper, we worked with scaled data and set $\lambda = \sqrt{\log p / n}$, as proposed by the statistical theory presented in (Jankova, 2015).
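To give a feel for the size of the penalty, the short sketch below evaluates the tuning parameter for the two data set dimensions mentioned in this paper (KORA: 1020 samples, 151 metabolites; T1D: 71 samples, 192 metabolites). It reads the tuning parameter as the square root of log(p)/n, the rate used in the de-sparsified graphical lasso theory; the function name is an assumption made for illustration.

```python
import math

def glasso_lambda(n_samples, n_features):
    """Tuning parameter lambda = sqrt(log(p) / n) for standardized data."""
    return math.sqrt(math.log(n_features) / n_samples)

lam_kora = glasso_lambda(1020, 151)  # many samples, few features: light penalty
lam_t1d = glasso_lambda(71, 192)     # few samples, many features: heavier penalty
print(f"KORA lambda = {lam_kora:.3f}, T1D lambda = {lam_t1d:.3f}")
```

The penalty grows as the sample size shrinks relative to the number of features, so fewer edges survive in small studies.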

For a $p \times p$ matrix $A = ((a_{ij}))_{1 \le i,j \le p}$, we use $\mathrm{vec}(A)$ to denote the $p^2 \times 1$ vector $[a_{11}, \ldots, a_{p1}, a_{12}, \ldots, a_{p2}, \ldots, a_{1p}, \ldots, a_{pp}]^{\top}$ obtained by stacking the columns of $A$. For any two matrices $A$ and $B$, we use $A \otimes B$ to denote the Kronecker product of $A$ and $B$, defined as $A \otimes B = ((a_{ij} B))_{1 \le i,j \le p}$. The debiased graphical lasso (DSPC) estimator is obtained by applying the following bias correction procedure to the graphical lasso estimate:

$$\hat{T} = \mathrm{vec}(\hat{\Theta}) - (\hat{\Theta} \otimes \hat{\Theta})\, \mathrm{vec}(S - \hat{\Theta}^{-1})$$
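The bias correction can also be read in plain matrix form, which may be easier to interpret and to compute. The short derivation below is a sketch that uses the standard identity $\mathrm{vec}(AXB) = (B^{\top} \otimes A)\,\mathrm{vec}(X)$ and the symmetry of $\hat{\Theta}$; it is not taken from the original text.

```latex
\begin{aligned}
(\hat{\Theta} \otimes \hat{\Theta})\,\mathrm{vec}(S - \hat{\Theta}^{-1})
  &= \mathrm{vec}\big(\hat{\Theta}\,(S - \hat{\Theta}^{-1})\,\hat{\Theta}\big)
   = \mathrm{vec}\big(\hat{\Theta} S \hat{\Theta} - \hat{\Theta}\big), \\
\mathrm{vec}(\hat{\Theta}) - (\hat{\Theta} \otimes \hat{\Theta})\,\mathrm{vec}(S - \hat{\Theta}^{-1})
  &= \mathrm{vec}\big(2\hat{\Theta} - \hat{\Theta} S \hat{\Theta}\big).
\end{aligned}
```

In other words, the debiased estimate is the vectorized form of $2\hat{\Theta} - \hat{\Theta} S \hat{\Theta}$, which avoids forming the $p^2 \times p^2$ Kronecker product explicitly.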

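Under appropriate regularity conditions and a suitable sparsity assumption on $\Theta$, Jankova and van de Geer (Jankova, 2015) establish that the entries of the DSPC estimate $\hat{T}$ have the following asymptotic distribution

$$\sqrt{n}\,(\hat{T}_{ij} - \Theta_{ij}) / \hat{\sigma}_n \xrightarrow{d} N(0,1), \qquad \hat{\sigma}_n^2 := \hat{\Theta}_{ij}^2 + \hat{\Theta}_{ii}\hat{\Theta}_{jj}$$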
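The following sketch illustrates the debiasing step and the resulting test statistic on a toy 2 x 2 example. It uses the matrix form 2Θ̂ − Θ̂SΘ̂, which is algebraically equivalent to the vec/Kronecker expression for a symmetric estimate. The input values are hypothetical (a diagonal estimate, as happens when the penalty is large enough to shrink the off-diagonal entry to zero), and the helper names are assumptions; this is not the CorrelationCalculator implementation.

```python
import math

def matmul(A, B):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def debias_precision(theta_hat, S):
    """Debiased estimate T = 2 * Theta_hat - Theta_hat @ S @ Theta_hat."""
    tst = matmul(matmul(theta_hat, S), theta_hat)
    p = len(theta_hat)
    return [[2 * theta_hat[i][j] - tst[i][j] for j in range(p)] for i in range(p)]

def edge_z_statistic(T, theta_hat, i, j, n):
    """sqrt(n) * T_ij / sigma_hat, approximately N(0, 1) when Theta_ij = 0."""
    sigma = math.sqrt(theta_hat[i][j] ** 2 + theta_hat[i][i] * theta_hat[j][j])
    return math.sqrt(n) * T[i][j] / sigma

# Hypothetical inputs: standardized sample covariance and a sparse estimate
# whose off-diagonal entry was shrunk to exactly zero by the penalty.
S = [[1.0, 0.3], [0.3, 1.0]]
theta_hat = [[1.0, 0.0], [0.0, 1.0]]

T = debias_precision(theta_hat, S)
z = edge_z_statistic(T, theta_hat, 0, 1, n=100)
```

Here the debiasing step restores a non-zero off-diagonal entry that the penalty had removed, and the z statistic quantifies whether that entry is distinguishable from zero at the given sample size.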

Based on the above observation, we conduct hypothesis tests on the presence or absence of the individual edges in the partial correlation network

$$H_0: \Theta_{ij} = 0 \quad \text{vs.} \quad H_1: \Theta_{ij} \neq 0$$

and calculate two-sided p-values using the formula $2\left(1 - \Phi\left(\sqrt{n}\,|\hat{T}_{ij}| / \hat{\sigma}_n\right)\right)$, where $\Phi(t)$ is the cumulative distribution function (cdf) of a standard Normal distribution. The partial correlation network can be constructed using only the edges for which the P-value, after correcting for multiple testing, is below a pre-specified threshold. Depending on the desired level of control on false positives, different multiple testing criteria can be used, e.g. Bonferroni correction for controlling family-wise error rate (FWER) or Benjamini-Hochberg correction for controlling false discovery rate (FDR) (Benjamini and Hochberg, 1995).
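The edge selection step can be sketched as follows. The code converts z statistics to two-sided p-values and applies the Bonferroni and Benjamini-Hochberg adjustments. The z values and metabolite labels are hypothetical, and the function names are assumptions made for this illustration rather than the API of CorrelationCalculator.

```python
import math

def two_sided_p(z):
    """2 * (1 - Phi(|z|)), computed via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

def bonferroni(pvals):
    """Family-wise error control: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """FDR-adjusted p-values (step-up procedure), returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda k: pvals[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        k = order[rank - 1]
        running_min = min(running_min, pvals[k] * m / rank)
        adjusted[k] = running_min
    return adjusted

# Hypothetical z statistics for three candidate edges.
z_scores = {("A", "B"): 4.2, ("A", "C"): 0.4, ("B", "C"): -2.5}
edges = list(z_scores)
pvals = [two_sided_p(z_scores[e]) for e in edges]
fdr = benjamini_hochberg(pvals)
kept = [e for e, q in zip(edges, fdr) if q < 0.05]
```

In this toy case the A-B and B-C edges pass the FDR threshold of 0.05, while A-C is dropped.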

2.2 Implementation and availability of the DSPC algorithm

DSPC is freely available as part of a standalone Java CorrelationCalculator program (http://metscape.med.umich.edu/calculator.html). CorrelationCalculator features and the workflow are described in the Supplementary Methods section.

2.3 Metscape correlation analysis module

The most prominent feature of the recently released version of our previously published Cytoscape app Metscape (Karnovsky et al., 2012) is the ability to visualize correlation networks where nodes represent compounds and edges depict correlations between them. See Supplementary Methods section for details.

2.4 Metabolomics analysis of SWAN samples

Plasma samples of 120 mid-life women participating in the SWAN study were used to generate metabolomics data that were subsequently analyzed by DSPC. SWAN is a longitudinal, multiethnic study of women as they age through midlife (Sowers et al., 2000). In 1996, 3302 women from different ethnic backgrounds (Caucasian, African American, Chinese, Japanese and Hispanic) were enrolled at seven clinical sites. Eligible women were age 42–52 years. Blood samples were collected at baseline and at each follow-up visit and stored in the SWAN Repository. Samples from visits 5 and 12 (7 years apart) were used in metabolomics assays. The study sample included women who did not have diabetes or metabolic syndrome at visit 5 but subsequently developed diabetes by visit 12, and randomly selected controls without diabetes, matched for age and ethnicity. Eligible women and their controls were then randomly sampled in batches of 15 for these analyses. See Supplementary Methods for the detailed procedures used for the targeted (amino acids and acylcarnitines) and the untargeted metabolomics assays.

2.5 Data processing and normalization

The two targeted (amino acids, acylcarnitines) and two untargeted (positive and negative ion mode RPLC-MS data) SWAN metabolomics data sets were normalized separately using the following procedure. Metabolites from targeted assays with more than 30% missing values across all samples were excluded from further analysis (this filter had already been applied to untargeted metabolomics data as described above). The untargeted data were normalized to adjust for batch effects using the median metabolite peak area of each batch. The resulting values were log2 transformed and the missing values were imputed as the median values on a per feature basis.
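The processing steps for a single feature can be sketched as below. This is one possible reading of the procedure, under stated assumptions: missing values are represented as None, batch correction divides each value by the median of that feature within its batch, and imputation uses the median of the log2 values. The exact operations used in the study may differ, and all names and numbers here are hypothetical.

```python
import math
from statistics import median

def missing_fraction(values):
    """Fraction of samples in which the feature is missing (None)."""
    return sum(v is None for v in values) / len(values)

def batch_median_normalize(values, batches):
    """Divide each value by the median of its batch (missing values ignored).

    Assumes every batch has at least one observed value for this feature.
    """
    medians = {}
    for b in set(batches):
        observed = [v for v, vb in zip(values, batches) if vb == b and v is not None]
        medians[b] = median(observed)
    return [None if v is None else v / medians[b] for v, b in zip(values, batches)]

def log2_and_impute(values):
    """Log2 transform, then replace missing values with the feature median."""
    logged = [None if v is None else math.log2(v) for v in values]
    fill = median([v for v in logged if v is not None])
    return [fill if v is None else v for v in logged]

# Hypothetical peak areas for one feature across six samples in two batches.
peak_areas = [100.0, 200.0, None, 400.0, 800.0, 1600.0]
batches = [1, 1, 1, 2, 2, 2]

keep_feature = missing_fraction(peak_areas) <= 0.30
processed = log2_and_impute(batch_median_normalize(peak_areas, batches))
```

After these steps the feature is on a log2 scale with no missing entries and can be merged with the other data sets.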

Normalized values from the amino acid and acylcarnitine assays and the untargeted positive and negative mode data were then merged into a single data set. Only the samples common to all four data sets were retained, resulting in 234 samples. To demonstrate one of the applications of our method we created a subset of data as follows. Metabolites unambiguously identified from the targeted assays were used as seeds, providing a starting point for identification of unknown compounds. Unidentified features from the untargeted analysis that had an absolute Pearson correlation of 0.7 or higher with at least one of the metabolite seed compounds were included.
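The seed-based selection described above can be sketched as follows. The feature names and values are hypothetical, and the function names are assumptions; the only element taken from the text is the rule of keeping unknown features with an absolute Pearson correlation of at least 0.7 with one or more seed metabolites.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def select_by_seeds(seeds, unknowns, threshold=0.7):
    """Keep unknown features with |r| >= threshold to at least one seed."""
    return [name for name, values in unknowns.items()
            if any(abs(pearson(values, s)) >= threshold for s in seeds.values())]

# Hypothetical normalized intensities across five samples.
seeds = {"valine": [1.0, 2.0, 3.0, 4.0, 5.0]}
unknowns = {
    "feature_1": [2.1, 3.9, 6.2, 8.0, 9.8],    # tracks the seed closely
    "feature_2": [5.0, 4.0, 3.0, 2.0, 1.0],    # strongly anti-correlated
    "feature_3": [1.0, -1.0, 0.0, -1.0, 1.0],  # unrelated to the seed
}
selected = select_by_seeds(seeds, unknowns)
```

Using the absolute value keeps strongly anti-correlated features as well as positively correlated ones.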

3 Results

In this section we briefly describe the analysis workflow, followed by testing and evaluation of the DSPC algorithm. We then show that the CorrelationCalculator and Metscape were useful for building and visualizing biologically relevant networks and identification of unknown compounds, thereby contributing to the interpretation of metabolomics data.

3.1 Correlation analysis workflow

Calculating a partial correlation network among hundreds or thousands of features is a computationally demanding task. In addition, with the small to moderate sample size available in most studies, the process of recovering the network structure and estimating all the edges suffers from low statistical power and low efficiency. In the case of a Gaussian graphical model, the correspondence between the block diagonal structures of the population covariance and inverse covariance matrix ensures that the features in different connected components of the partial correlation network are marginally uncorrelated. This provides a computationally tractable and efficient option—first constructing a marginal correlation network by applying a threshold to the sample correlation matrix, and then estimating a partial correlation network separately for each of its connected components (Mazumder and Hastie, 2012). Given the modular nature of many metabolic networks, this strategy can help reduce the complexity of the problem, thus enabling the discovery of finer aspects of the underlying network topology. With this in mind, we designed a practical workflow for the CorrelationCalculator program that is illustrated in Figure 1.
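The screening idea can be sketched as a graph traversal: threshold the absolute sample correlations, then collect the connected components, each of which could be passed to the partial correlation estimator on its own. The correlation matrix and threshold below are hypothetical, and the function is an illustrative assumption rather than the CorrelationCalculator code.

```python
def connected_components(corr, threshold):
    """Connected components of the graph with an edge where |corr[i][j]| > threshold."""
    p = len(corr)
    seen = [False] * p
    components = []
    for start in range(p):
        if seen[start]:
            continue
        stack, comp = [start], []
        seen[start] = True
        while stack:  # depth-first traversal from the start node
            i = stack.pop()
            comp.append(i)
            for j in range(p):
                if j != i and not seen[j] and abs(corr[i][j]) > threshold:
                    seen[j] = True
                    stack.append(j)
        components.append(sorted(comp))
    return components

# Hypothetical sample correlation matrix for five features.
corr = [
    [1.0, 0.9, 0.1, 0.0, 0.0],
    [0.9, 1.0, 0.2, 0.0, 0.1],
    [0.1, 0.2, 1.0, 0.8, 0.0],
    [0.0, 0.0, 0.8, 1.0, 0.1],
    [0.0, 0.1, 0.0, 0.1, 1.0],
]
blocks = connected_components(corr, threshold=0.5)
```

With these values the five features split into two pairs and one isolated feature, so the largest estimation problem involves two features instead of five.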

Fig. 1. Correlation analysis workflow. Metabolites with experimental measurements can be uploaded into CorrelationCalculator. The program can perform basic normalization and Pearson’s correlation analysis. A subset of data selected by setting a Pearson’s correlation coefficient threshold or the entire data set can be passed to DSPC. The results can be downloaded in tab-delimited format and sent directly to Metscape.

We compared the features of CorrelationCalculator and Metscape with four other programs for the analysis of metabolomics data that offer the option of doing correlation analysis: MetaMappR, Metabolomics Workbench, MetabNet and Metaboanalyst (Grapov et al., 2015; Sud et al., 2016; Uppal et al., 2015; Xia and Wishart, 2011). While some of these tools support Pearson or Spearman correlation-based analysis, none of them provide the ability to build and visualize partial correlation networks (Table 1).

Table 1. Comparison of the CorrelationCalculator and Metscape with selected tools for the analysis of metabolomics data.

| Feature | CorrelationCalculator/Metscape | MetabNet | MetaMappR | Metaboanalyst | Metabolomics Workbench |
| --- | --- | --- | --- | --- | --- |
| Can calculate Pearson or Spearman correlations? | Yes | Yes | Yes | Yes | Yes |
| Can calculate partial correlations (PC)? | Yes | No | No | No | No |
| Supports PC calculations where n < p? | Yes | No | No | No | No |
| Allows visualizing correlation data in heatmaps? | Yes | No | No | Yes | No |
| Allows visualizing correlation data in networks? | Yes | Yes (via Cytoscape) | Yes (via Cytoscape) | No | No |
| Provides access to pathway information? | Yes | No | Yes | Yes | Yes |
| Uses structural information to build the networks? | No | No | Yes | No | No |

3.2 DSPC testing and validation

To provide statistical validation of the DSPC algorithm we used a previously published KORA dataset that contains 151 metabolites measured in 1020 blood serum samples (http://www.helmholtz-muenchen.de/cmb/ggm). Data were standardized to have zero mean and variance of 1. First, we reconstructed the network of 151 metabolites based on all 1020 samples using GGM-based partial correlation method (PCOR) (Krumsiek et al., 2011), followed by Bonferroni multiple comparison adjustment. Since the number of samples far exceeded the number of metabolites, PCOR was able to accurately estimate the partial correlation structures in the data. This prompted us to use this partial correlation network as a benchmark or “true network” to help investigate the ability of PCOR and DSPC to extract these structures from randomly selected subsets of data that approximate the sizes of real-life datasets. Figure 2 and Supplementary Table 1 show the sizes of the PCOR and DSPC networks, the proportion of all edges and the top 10 percent edges recovered by each algorithm. In each setting, we report the median recovery over 200 random draws of subsamples. As the sample size decreases, the PCOR performance deteriorates. For n < 151, the PCOR method is not applicable, while DSPC still recovers many of the significant edges in the network.
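The evaluation loop described here can be sketched as follows: draw random subsamples, estimate a network on each, and report the median proportion of benchmark edges recovered. The `estimate_network` argument is a hypothetical placeholder for any estimator (PCOR or DSPC) that returns a list of edges; the toy estimator and data in the example are stand-ins, not the KORA analysis.

```python
import random
from statistics import median

def recovery_rate(benchmark_edges, estimated_edges):
    """Proportion of benchmark edges present in the estimate (edges are undirected)."""
    benchmark = {frozenset(e) for e in benchmark_edges}
    estimated = {frozenset(e) for e in estimated_edges}
    return len(benchmark & estimated) / len(benchmark)

def median_recovery(estimate_network, samples, benchmark_edges,
                    subsample_size, n_draws=200, seed=0):
    """Median recovery rate over repeated random subsamples of the data."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_draws):
        subsample = rng.sample(samples, subsample_size)
        rates.append(recovery_rate(benchmark_edges, estimate_network(subsample)))
    return median(rates)

# Toy stand-ins: ten "samples" and an estimator that always finds one edge.
benchmark = [("a", "b"), ("b", "c")]
toy_samples = list(range(10))
toy_estimator = lambda subsample: [("b", "a")]

result = median_recovery(toy_estimator, toy_samples, benchmark, subsample_size=5)
```

Restricting `benchmark_edges` to the most significant 10% of benchmark edges gives the second recovery measure reported in the text.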

Fig. 2. Evaluation of PCOR and DSPC networks. (A) The relative sizes (number of nodes) of the PCOR and DSPC networks of decreasing sample size are shown. (B) Proportion of all edges recovered by PCOR and DSPC. (C) Proportion of the 10% most significant edges recovered by PCOR and DSPC.

Figure 3 shows the networks generated by PCOR and DSPC using 1020 and 200 samples. It demonstrates that DSPC performs as well as PCOR when the number of samples is greater than the number of metabolites and, importantly, also allows analysis of data sets with fewer samples.

Fig. 3. Validation of the DSPC algorithm. (A) Network built using basic partial correlation algorithm with 1020 samples; (B) Network built using DSPC with 1020 samples; (C) Network built using basic partial correlation algorithm with 200 samples; (D) Network built using DSPC with 200 samples. Metabolites are colored according to classes. Both methods perform well when the number of samples is large. DSPC can recover more significant edges when the number of samples is reduced.

Notably, both methods were able to group together the metabolites that belong to the same class, which previously have been shown to co-vary in other biological contexts. When we compared the ability of DSPC and Pearson’s correlation networks to identify within- and between- class relationships, we found that on average DSPC was able to estimate more edges within the same class than using Pearson’s correlations (see Supplementary Table 2 and Supplementary Figure 3 for details).

3.3 Using CorrelationCalculator and Metscape to build global metabolic networks

To demonstrate the power of our methodology, we analyzed recently published metabolic profiles of non-obese diabetic (NOD) mice (Fahrmann et al., 2015). This multi-platform metabolomics study identified a number of metabolites that were significantly altered in animals with type 1 diabetes (T1D). The data were downloaded from the Metabolomics Workbench repository (Sud et al., 2016). Age and sex adjusted data from 71 samples were log2 transformed and autoscaled, and a subset of 192 significantly altered metabolites was used to generate a partial correlation network (Fig. 4). Notably, our method was able to cluster several major classes of chemical compounds including several classes of lipids (e.g. triglycerides, sphingomyelins) and primary metabolites that share multiple metabolic pathway connections (e.g. nucleotides and sugars). Although it is challenging to undertake a direct comparison between partial correlation and biochemical networks, a number of similarities can be seen between the network shown in Figure 4 and the biochemical network presented by Fahrmann et al. (2015), which was built using the MetaMappR program, e.g. the grouping of most of the oxylipins into a single cluster. Similarities such as these provide evidence of the biochemical relevance of results derived from DSPC, but it is also worth noting that each method provides its own unique view of the data. The unbiased nature of DSPC may reveal unexpected or non-canonical connections between different species or classes of metabolites, providing insight that would be missed by a biochemical network alone.

Fig. 4. Partial correlation network of T1D differentiating metabolites. Node size indicates the direction of the change. Bold black border indicates significant metabolites. Colored edges had P-value < 0.003 and FDR adjusted P-value < 0.5. Dotted lines represent edges with P-values < 0.2. Red and blue edges show positive and negative correlations.

Due to the ability of DSPC to handle larger numbers of metabolites with fewer samples, we were able to include more compounds in our analysis and thus provide broader context for data interpretation. For instance, Fahrmann et al. (2015) reported reduced levels of methionine and arachidonic acid in the plasma of the T1D mice. We found that these two compounds had strong partial correlation in our network (Padj = 0.0043). A previous study has shown a relationship between dietary methionine levels and arachidonic acid (Sugiyama et al., 1997) and proposed that reduced methionine results in reduced methylation of phosphatidylethanolamine via reduced levels of S-adenosylmethionine, which is derived from methionine. Indeed, PE:PC ratios are increased in the liver of diabetic mice and this is related to a decrease in conversion of linoleic acid to arachidonic acid (Imaizumi et al., 1989; Venkatraman et al., 1991). The reduction in multiple PC species (Fig. 4) is consistent with this mechanism. It would be of interest to assess the effect of methionine supplementation on the deregulation of lipid species and the development of T1D in NOD mice.

3.4 Identifying matching features from targeted and untargeted LC-MS data

To further test the performance of our method and demonstrate the utility of CorrelationCalculator and Metscape, we analyzed a set of 140 metabolites that contained known compounds from the targeted and untargeted SWAN data sets (described in the Methods) and the unknown features that correlated with at least one known compound with an absolute value of Pearson’s correlation coefficient of 0.7 or greater. The dataset was uploaded into CorrelationCalculator, analyzed using the DSPC algorithm and the results were loaded into Metscape for network building and analysis. We chose the adjusted p-values as the basis for network edges with a significance threshold of <0.05. To further illustrate the advantages of DSPC, we compared the resulting network to the Pearson’s correlation network generated from the same data (Supplementary Figs 4 and 5). The results were similar to those reported previously (Krumsiek et al., 2011), i.e. the high density of the Pearson’s correlation network made it difficult to delineate the relationships between groups of metabolites, while the DSPC network clearly grouped different classes of related metabolites.

Figure 5 shows the sub-networks for the amino acids measured in the targeted assay and their counterparts detected in the positive and negative modes of the untargeted experiment. As expected, most matching compounds from targeted and untargeted assays are connected by direct edges. Further examination of highly correlated network clusters containing identified metabolites and unknown features such as proline and valine sub-networks allowed rapid identification of the adducts and in-source fragments of these compounds (see Supplementary Fig. 6 and Supplementary data for details). While many software packages used to process raw, untargeted LC-MS or GC-MS data (Alonso et al., 2011; Brown et al., 2011; Kuhl et al., 2012), including the Agilent MPP software used here, attempt to group features representing the same compound, a substantial number of redundant features (mostly due to chemical noise and contaminants) often remain undetected (Zamboni et al., 2015), requiring additional manual filtering. Building the DSPC networks and visualizing them in Metscape facilitated the identification of many of these features.

Fig. 5. DSPC amino acid network. The network was constructed using the targeted and untargeted data. Nodes representing compounds measured in targeted amino acid assay are shown in pink. Nodes that represent compounds measured using the untargeted RP platform have blue and red borders for those detected in negative and positive modes respectively. Known compounds are shown as hexagons, whereas diamond-shaped nodes represent the unknown features, most of which were identified as adducts and in-source fragments of the highly correlated known compounds as shown in Supplementary Figure 6 for valine and proline subnetworks.

3.5 Assessing the biological relevance of the DSPC correlation networks

Further examination of the SWAN DSPC network described above showed that most of the compounds found in close proximity shared metabolic pathway connections, as well as chemical class similarity. For example, the algorithm was able to group the most similar amino acids (e.g. the aromatic amino acids, the branched chain amino acids, lysine and arginine etc.) without any prior information about their properties.

Next, we asked whether the strongest edges in the DSPC network are supported by biochemical evidence. To answer this question we selected a subset of known compounds from the set described in the previous section as follows. Only known compounds that correlated with at least one other known compound with Pearson’s correlation coefficient > 0.7 were included in the DSPC analysis. Top-scoring edges (adjusted p-values < 0.05) were evaluated manually. There were 51 edges that met this criterion (Supplementary Table 3). Consistent with previous observations (Krumsiek et al., 2012; Krumsiek et al., 2011; Steuer et al., 2003), our manual examination of these edges showed that they could be linked to biochemical evidence. Some well understood examples include the aromatic amino acids tryptophan and phenylalanine and the branched chain amino acids valine, leucine and isoleucine. Both aromatic amino acids can be metabolized by the same enzymes, including deamination by L-amino-acid oxidase (1.4.3.2) and decarboxylation by aromatic-L-amino-acid decarboxylase (4.1.1.28). Likewise, degradation of all branched chain amino acids is regulated by the same enzyme, branched chain keto acid dehydrogenase (BCKDHA). Another important class of compounds in this data set is acylcarnitines, which are metabolites involved in the transfer of short, medium and long-chain acyl groups across the mitochondrial membrane for oxidation or export. Common enzymes that are involved in regulating some of these functions include Carnitine O-acetyltransferase (2.3.1.7) and Carnitine O-palmitoyltransferase (2.3.1.21).

3.6 Identification of unknown compounds

Another potential application of our tools is in aiding identification of unknown features in the data. We illustrate this with the SWAN LC-MS data set described above. Figures 6a and 6d show two correlation networks containing both known metabolites and unknown features. The unknown features in the networks were searched by their m/z values against the Human Metabolome Database (http://www.hmdb.ca) and Metlin (https://metlin.scripps.edu) with a mass tolerance of 30 ppm, allowing for the most common adducts in positive ion mode (M + H and M + Na). This resulted in multiple possible matches for many of the unknown compounds (see Supplementary Table 4), making assignment of putative metabolite identities difficult. However, when the list of hits was compared against identified metabolites in the network, more informed assignments of putative metabolite identities could be made. Specifically, in Figure 6a, the database hit “pyroglutamic acid” was the most similar match to the identified compound in the network, N-acetyl-DL-glutamic acid. In Figure 6d, even though no identified seed compound was present in one of two subnetworks, several carnitine-related metabolites appeared in the database hit lists of the unknowns in both subnetworks. This, along with the presence of the seed compound acetyl-L-carnitine in the adjacent subnetwork, increased confidence in assigning several acylcarnitines as putative metabolite identities.
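The m/z search step can be illustrated with a small sketch. It matches an observed m/z against candidate monoisotopic masses for the two adducts named above, using a 30 ppm tolerance. The candidate list is a tiny stand-in for a database query (masses computed from the molecular formulas in the comments), the observed m/z is hypothetical, and the function names are assumptions; this is not the interface of HMDB or Metlin.

```python
PROTON = 1.007276          # mass of a proton, Da
SODIUM_ADDUCT = 22.989218  # mass of Na+ (sodium atom minus one electron), Da
ADDUCTS = {"M+H": PROTON, "M+Na": SODIUM_ADDUCT}

def ppm_error(observed_mz, theoretical_mz):
    """Signed mass error in parts per million."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6

def match_feature(observed_mz, candidates, tol_ppm=30.0):
    """Return (name, adduct, ppm error) for every candidate within tolerance."""
    hits = []
    for name, mono_mass in candidates.items():
        for adduct, shift in ADDUCTS.items():
            err = ppm_error(observed_mz, mono_mass + shift)
            if abs(err) <= tol_ppm:
                hits.append((name, adduct, round(err, 1)))
    return hits

# Stand-in candidate list: neutral monoisotopic masses in Da.
candidates = {
    "pyroglutamic acid": 129.042593,   # C5H7NO3
    "acetyl-L-carnitine": 203.115758,  # C9H17NO4
}
hits = match_feature(130.0510, candidates)  # hypothetical observed m/z
```

In practice a 30 ppm window often returns several isobaric candidates, which is why the network neighbourhood is used to rank them.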

Fig. 6. Identification of unknown compounds. (A, D) Correlation networks containing both known metabolites and unknown features. (B, E) Overlaid extracted ion chromatograms showing identical retention times and peak shapes of the features in the plasma samples and in the spiked samples. (C, F) The mass spectra of the spiked and un-spiked samples closely matched the predicted isotope distribution computed using the molecular formula of the assigned metabolites.

Although the correlation network aided in assignment of putative metabolite identities, verification of these identifications still requires analysis of authentic standard compounds. To test the validity of our assignments, we purchased commercial standards for two putatively identified metabolites, neither of which was yet included in our in-house metabolite library. LC-TOF-MS analysis of pooled plasma samples was performed with and without spiked addition of these authentic standard compounds. Overlaid extracted ion chromatograms revealed that the spiked samples showed identical retention time and peak shape to the features in the plasma samples alone (Figs 6b and 6e). Additionally, the mass spectra of both the spiked and un-spiked samples closely matched the predicted isotope distribution computed using the molecular formula of the assigned metabolites (Figs 6c and 6f), further confirming their correct identification.

4 Discussion

We presented a novel approach for reconstructing correlation-based networks from large-scale metabolomics data that relies on a sparse graphical modeling strategy.

There has been a significant amount of work on regularized estimation of sparse regression and partial correlation networks in the statistical and machine learning literature (Bühlmann and Van De Geer, 2011; Schafer and Strimmer, 2005a, b). However, the key theoretical results were based on providing high probability bounds for the prediction and estimation errors of the parameters of interest. Our approach leverages recent work (Jankova and van de Geer, 2015) that allows us to provide p-values for the edge parameters of the network. Coupled with multiple comparisons correction, this provides more refined probabilistic guarantees about the quality of the network reconstruction.
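As an illustration of how edge-level p-values combine with a multiple comparisons correction, the sketch below applies the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995) to a handful of edges. The edge names and p-values are invented, and the choice of this procedure and of a 0.05 cutoff is an assumption for the example rather than a description of the CorrelationCalculator internals.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical edge p-values for a four-edge network.
edge_pvalues = {
    ("A", "B"): 0.001,
    ("A", "C"): 0.02,
    ("B", "C"): 0.04,
    ("C", "D"): 0.30,
}

edges = list(edge_pvalues)
adjusted = benjamini_hochberg([edge_pvalues[e] for e in edges])
# Keep edges whose adjusted p-value is at or below the chosen FDR level.
significant = [e for e, q in zip(edges, adjusted) if q <= 0.05]
```

Here the edge B-C has a raw p-value below 0.05 but is dropped after adjustment, which shows how the correction tightens the reconstructed network.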

When we compared the performance of our method to the previously published PCOR method (Krumsiek et al., 2011), we found that when the number of samples exceeded the number of features being analyzed, DSPC and PCOR performed equally well. As the sample size decreased, the DSPC algorithm was consistently able to recover a larger number of strong edges. Since in the majority of untargeted LC-MS based metabolomics studies the number of features will exceed the number of samples, our approach enables building biologically relevant networks, allowing researchers to clean the data, identify the unknown features and gain insight into underlying molecular mechanisms and biological phenomena. The number of significant edges that can be recovered by DSPC is still affected by the sample size, as can be seen in the T1D data example, which included 192 metabolites measured in 71 samples.

Our methodology relies heavily on the assumption of normality of the data. This assumption is largely satisfied in many data sets after an appropriate normalization and transformation has been applied. However, if the assumption is strongly violated, it may be appropriate to pursue an alternative strategy such as the one discussed by Liu et al. (2009).
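One simple way to push a strongly skewed variable toward normality is a rank-based normal-scores transform, sketched below. This is a simplified stand-in in the spirit of the nonparanormal approach, not the estimator of Liu et al. (2009), which additionally truncates the empirical distribution. The function name and the example values are hypothetical, and ties are not handled.

```python
from statistics import NormalDist


def normal_scores(values):
    """Map values to standard normal quantiles of their ranks.

    The observation with rank r (1-based) among n values is mapped to the
    standard normal quantile at r / (n + 1). Ties are broken by position.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()  # mean 0, standard deviation 1
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = std_normal.inv_cdf(rank / (n + 1))
    return scores


# Hypothetical, heavily right-skewed intensities for one metabolite.
skewed = [1.0, 2.0, 4.0, 8.0, 100.0]
z = normal_scores(skewed)
```

The transform keeps the ordering of the samples while removing the influence of the extreme value, so correlations computed afterwards depend on ranks rather than on raw magnitudes.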

To make our method broadly available and to ensure its practical application, we created a user-friendly CorrelationCalculator program that allows users to compute Pearson’s correlations, use the values to set thresholds and compute partial correlations on the selected subset of data. The results can be visualized in the new Metscape correlation module that supports interactive analysis of correlation networks. Users can take advantage of the broad range of Metscape features that have been described previously (Karnovsky et al., 2012), e.g. mapping known compounds to canonical metabolic pathways and accessing additional information about molecular weights, retention times, chemical class membership, etc., which provide multiple ways to interrogate the data.
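The Pearson-based filtering step can be pictured with the following sketch. It assumes one plausible selection rule, namely keeping every feature that has at least one pairwise |r| at or above the threshold; the actual rule in CorrelationCalculator may differ, and the feature names, values and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_features(data, threshold):
    """Keep features with at least one pairwise |r| >= threshold."""
    names = list(data)
    keep = set()
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if abs(pearson(data[a], data[b])) >= threshold:
                keep.update((a, b))
    return [name for name in names if name in keep]


# Hypothetical intensities of three features across five samples.
DATA = {
    "m1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "m2": [2.1, 3.9, 6.2, 8.0, 9.8],
    "m3": [5.0, 1.0, 4.0, 2.0, 3.0],
}

selected = select_features(DATA, threshold=0.8)
```

Only the selected subset would then be passed on to the partial correlation step, which reduces the size of the estimation problem.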

One of the well-known challenges associated with the analysis of untargeted metabolomics data, especially those generated by LC-MS, is feature identification. The most obvious challenge is the inherent complexity of biological samples, most of which contain hundreds to thousands of detectable metabolites, which may vary widely in composition and abundance in different sample types. Additionally, due to the presence of multiple isotopes, adducts and in-source fragments, one metabolite is often represented by multiple features (Zamboni et al., 2015). We demonstrate that CorrelationCalculator in combination with Metscape can aid in the identification of redundant features as well as unknown compounds. While correlation analyses have been used previously to aid grouping similar features, DSPC has the advantage of identifying the strongest direct associations between features while ignoring additional indirect edges. Visualizing partial correlation networks in Metscape allows users to perform data cleanup and filtering through interactive network exploration, either as part of biological data interpretation or as a separate preprocessing workflow.
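The difference between direct and indirect associations can be made concrete with a three-variable chain X → Y → Z. The sketch below computes ordinary partial correlations by inverting a known covariance matrix. It is not DSPC: there is no sparsity penalty and no debiasing, and plain inversion of a sample covariance only works when it is non-singular, which requires more samples than features. The covariance values and function names are chosen for illustration.

```python
from math import sqrt


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partial_correlations(cov):
    """Partial correlation matrix: -theta_ij / sqrt(theta_ii * theta_jj)."""
    theta = invert(cov)  # precision (inverse covariance) matrix
    n = len(cov)
    return [
        [1.0 if i == j else -theta[i][j] / sqrt(theta[i][i] * theta[j][j])
         for j in range(n)]
        for i in range(n)
    ]


# Covariance of X, Y = X + e1, Z = Y + e2 with unit-variance noise terms.
COV = [[1, 1, 1],
       [1, 2, 2],
       [1, 2, 3]]

pcor = partial_correlations(COV)
marginal_xz = COV[0][2] / sqrt(COV[0][0] * COV[2][2])
```

In this example X and Z are clearly correlated marginally, yet their partial correlation given Y vanishes, so a partial correlation network would retain only the edges X-Y and Y-Z.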

Consistent with previous observations (Camacho et al., 2005; Kotze et al., 2013; Krumsiek et al., 2011; Steuer et al., 2003), we found that the majority of high scoring edges in partial correlation networks could be explained by pathway information and underlying biochemical knowledge. We demonstrate that our tools can be used to rapidly identify and cluster functionally related metabolites, complementing existing biochemical knowledge. CorrelationCalculator in combination with Metscape can also be used to discover the connectivity between metabolites where pathway information is not readily available (e.g. for lipidomics data). Our method and tools can also be used to aid in identification of the unknown features that are significantly associated with known metabolites.

The high level of concordance we observed between data-driven and knowledge-based biochemical networks is encouraging, as it provides the basis for identifying new connections between metabolites that may represent yet undiscovered metabolic regulatory interactions. Discovery of such novel interactions may lead to a more comprehensive picture of cellular metabolism.

Funding

This work was supported by NIH grants DK089503 (CFB, AK, GM, WD), DK097153 (CFB, AK), DK092558 (CRE), R01-GM114029-01A1 (GM), 1R21-GM101719-01A1 (GM), R03CA211817 (AK) and NSF DMS-154527 (GM), DMS-1228164 (GM). The Study of Women's Health Across the Nation (SWAN) has grant support from the National Institutes of Health (NIH), DHHS, through the National Institute on Aging (NIA), the National Institute of Nursing Research (NINR) and the NIH Office of Research on Women’s Health (ORWH) (Grants U01NR004061; U01AG012505, U01AG012535, U01AG012531, U01AG012539, U01AG012546, U01AG012553, U01AG012554, U01AG012495). The SWAN Repository has grant support from (U01AG017719). The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of the NIA, NINR, ORWH, NSF or the NIH.

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

References

  1. Alonso A. et al. (2011) AStream: an R package for annotating LC/MS metabolomic data. Bioinformatics, 27, 1339–1340.
  2. Barupal D.K. et al. (2012) MetaMapp: mapping and visualizing metabolomic data by integrating information from biochemical pathways and chemical and mass spectral similarity. BMC Bioinformatics, 13, 99.
  3. Benjamini Y., Hochberg Y. (1995) Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J. Roy. Stat. Soc. B Met., 57, 289–300.
  4. Brown M. et al. (2011) Automated workflows for accurate mass-based putative metabolite identification in LC/MS-derived metabolomic datasets. Bioinformatics, 27, 1108–1112.
  5. Bühlmann P., Van De Geer S. (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
  6. Camacho D. et al. (2005) The origin of correlations in metabolomics data. Metabolomics, 1, 53–63.
  7. Caspi R. et al. (2012) The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Res., 40(Database issue), D742–D753.
  8. de la Fuente A. et al. (2004) Discovery of meaningful associations in genomic data using partial correlation coefficients. Bioinformatics, 20, 3565–3574.
  9. Duarte N.C. et al. (2007) Global reconstruction of the human metabolic network based on genomic and bibliomic data. Proc. Natl. Acad. Sci. USA, 104, 1777–1782.
  10. Duren W. et al. (2014) MetDisease-connecting metabolites to diseases via literature. Bioinformatics, 30, 2239–2241.
  11. Fahrmann J. et al. (2015) Systemic alterations in the metabolome of diabetic NOD mice delineate increased oxidative stress accompanied by reduced inflammation and hypertriglyceremia. Am. J. Physiol. Endocrinol. Metab., 308, E978–E989.
  12. Gardner T.S. et al. (2003) Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301, 102–105.
  13. Grapov D. et al. (2015) MetaMapR: pathway independent metabolomic network analysis incorporating unknowns. Bioinformatics, 31, 2757–2760.
  14. Hao T. et al. (2010) Compartmentalization of the Edinburgh Human Metabolic Network. BMC Bioinformatics, 11, 393.
  15. Imaizumi K. et al. (1989) Effect of phosphatidylethanolamine and its constituent base on the metabolism of linoleic acid in rat liver. Biochimica Et Biophysica Acta, 1005, 253–259.
  16. Jankova J., van de Geer S. (2015) Confidence intervals for high-dimensional inverse covariance estimation. Electron. J. Stat., 9, 1205–1229.
  17. Jeong H. et al. (2001) Lethality and centrality in protein networks. Nature, 411, 41–42.
  18. Kanehisa M. et al. (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34(Database issue), D354–D357.
  19. Karnovsky A. et al. (2012) Metscape 2 bioinformatics tool for the analysis and visualization of metabolomics and gene expression data. Bioinformatics, 28, 373–380.
  20. Kotze H.L. et al. (2013) A novel untargeted metabolomics correlation-based network analysis incorporating human metabolic reconstructions. BMC Syst. Biol., 7, 107.
  21. Kramer N. et al. (2009) Regularized estimation of large-scale gene association networks using graphical Gaussian models. BMC Bioinformatics, 10, 384.
  22. Krumsiek J. et al. (2012) Mining the unknown: a systems approach to metabolite identification combining genetic and metabolic information. PLoS Genet., 8, e1003005.
  23. Krumsiek J. et al. (2011) Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data. BMC Syst. Biol., 5, 21.
  24. Kuhl C. et al. (2012) CAMERA: an integrated strategy for compound spectra extraction and annotation of liquid chromatography/mass spectrometry data sets. Anal. Chem., 84, 283–289.
  25. Leclerc R.D. (2008) Survival of the sparsest: robust gene networks are parsimonious. Mol. Syst. Biol., 4, 213.
  26. Li S. et al. (2013) Predicting network activity from high throughput metabolomics. PLoS Comput. Biol., 9, e1003123.
  27. Liu H. et al. (2009) The nonparanormal: semiparametric estimation of high dimensional undirected graphs. J. Mach. Learn. Res., 10, 2295–2328.
  28. Ma H. et al. (2007) The Edinburgh human metabolic network reconstruction and its functional analysis. Mol. Syst. Biol., 3, 135.
  29. Mazumder R., Hastie T. (2012) Exact covariance thresholding into connected components for large-scale graphical Lasso. J. Mach. Learn. Res., 13, 781–794.
  30. Meinshausen N., Buhlmann P. (2006) High-dimensional graphs and variable selection with the Lasso. Ann. Stat., 34, 1436–1462.
  31. Neumann S., Bocker S. (2010) Computational mass spectrometry for metabolomics: identification of metabolites and small molecules. Anal. Bioanal. Chem., 398, 2779–2788.
  32. Petsalaki E. et al. (2015) SELPHI: correlation-based identification of kinase-associated networks from global phospho-proteomics data sets. Nucleic Acids Res.
  33. Romero P. et al. (2005) Computational prediction of human metabolic pathways from the complete human genome. Genome Biol., 6, R2.
  34. Sartor M.A. et al. (2012) Metab2MeSH: annotating compounds with medical subject headings. Bioinformatics, 28, 1408–1410.
  35. Schafer J., Strimmer K. (2005a) An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21, 754–764.
  36. Schafer J., Strimmer K. (2005b) A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Mol. Biol., 4, Article32.
  37. Sigurdsson M.I. et al. (2010) A detailed genome-wide reconstruction of mouse metabolism based on human Recon 1. BMC Syst. Biol., 4, 140.
  38. Sowers M.F.R. et al. (2000) SWAN: a multicenter, multiethnic, community-based cohort study of women and the menopausal transition. Women's Faculty Committee Publications and Presentations.
  39. Steuer R. et al. (2003) Observing and interpreting correlations in metabolomic networks. Bioinformatics, 19, 1019–1026.
  40. Sud M. et al. (2016) Metabolomics Workbench: An international repository for metabolomics data and metadata, metabolite standards, protocols, tutorials and training, and analysis tools. Nucleic Acids Res., 44, D463–D470.
  41. Sugiyama K. et al. (1997) Methionine content of dietary proteins affects the molecular species composition of plasma phosphatidylcholine in rats fed a cholesterol-free diet. J. Nutr., 127, 600–607.
  42. Thiele I. et al. (2013) A community-driven global reconstruction of human metabolism. Nat. Biotechnol., 31, 419–425.
  43. Uppal K. et al. (2015) MetabNet: an R package for metabolic association analysis of high-resolution metabolomics data. Front. Bioengin. Biotechnol., 3, 87.
  44. Venkatraman J.T. et al. (1991) Effect of dietary fat on diabetes-induced changes in liver microsomal fatty acid composition and glucose-6-phosphatase activity in rats. Lipids, 26, 441–444.
  45. Wille A. et al. (2004) Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biol., 5, R92.
  46. Xia J., Wishart D.S. (2011) Metabolomic data processing, analysis, and interpretation using MetaboAnalyst. Curr. Protoc. Bioinformatics, Chapter 14, Unit 14.10.
  47. Zamboni N. et al. (2015) Defining the metabolome: size, flux, and regulation. Mol. Cell, 58, 699–706.
  48. Zuo Y. et al. (2014) Biological network inference using low order partial correlation. Methods, 69, 266–273.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
