PLOS Computational Biology. 2021 Oct 22;17(10):e1008986. doi: 10.1371/journal.pcbi.1008986

PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes

Charlie M Carpenter 1,*, Weiming Zhang 2,, Lucas Gillenwater 3, Cameron Severn 1, Tusharkanti Ghosh 1, Russell Bowler 4, Katerina Kechris 1, Debashis Ghosh 1
Editor: Carl Herrmann5
PMCID: PMC8565741  PMID: 34679079

Abstract

High-throughput data such as metabolomics, genomics, transcriptomics, and proteomics have become familiar data types within the “-omics” family. For this work, we focus on subsets that interact with one another and represent these “pathways” as graphs. Observed pathways often have disjoint components, i.e., nodes or sets of nodes (metabolites, etc.) not connected to any other within the pathway, which notably lessens testing power. In this paper we propose the Pathway Integrated Regression-based Kernel Association Test (PaIRKAT), a new kernel machine regression method for incorporating known pathway information into the semi-parametric kernel regression framework. This work extends previous kernel machine approaches. This paper also contributes an application of a graph kernel regularization method for overcoming disconnected pathways. By incorporating a regularized or “smoothed” graph into a score test, PaIRKAT can provide more powerful tests for associations between biological pathways and phenotypes of interest and will be helpful in identifying novel pathways for targeted clinical research. We evaluate this method through several simulation studies and an application to real metabolomics data from the COPDGene study. Our simulation studies illustrate the robustness of this method to incorrect and incomplete pathway knowledge, and the real data analysis shows meaningful improvements of testing power in pathways. PaIRKAT was developed for application to metabolomic pathway data, but the techniques are easily generalizable to other data sources with a graph-like structure.

Author summary

PaIRKAT is a tool for improving testing power on high dimensional data by including graph topography in the kernel machine regression setting. Studies on high-dimensional data can struggle to include the complex relationships between variables. The semi-parametric kernel machine regression model is a powerful tool for capturing these types of relationships. It provides a framework for testing for relationships between outcomes of interest and high dimensional data such as metabolomic, genomic, or proteomic pathways. Our paper proposes a kernel machine method for including known biological connections between high dimensional variables by representing them as edges of ‘graphs’ or ‘networks.’ It is common for nodes (e.g., metabolites) to be disconnected from all others within the graph, which leads to meaningful decreases in testing power when graph information is ignored. We include a graph regularization or ‘smoothing’ approach for managing this issue. We demonstrate the benefits of this approach through simulation studies and an application to the metabolomic data from the COPDGene study.


This is a PLOS Computational Biology Methods paper.

Introduction

Metabolomics is the study of the metabolite composition of a cell, tissue, or biological fluid. Leading metabolomic experimental techniques such as liquid or gas chromatography coupled with mass spectrometry (LC-MS or GC-MS) and nuclear magnetic resonance (NMR) spectroscopy can capture the abundance of all metabolites within a cell (the metabolome). These technologies provide high-throughput data similar to other familiar -omics datatypes such as genomics, transcriptomics, and proteomics. An important advantage of metabolomics over other -omics data is its proximity to biological phenotypes[1]. While genomic or proteomic data are vital pieces for understanding the progression from DNA to phenotype, the metabolites are the end products of the enzymatic reactions of a cell[2]. The metabolome is comprised of exogenous (environmentally derived) and endogenous (genetically regulated) metabolites which can be used as biomarkers for the current phenotypic state of a cell or organism.

Like other -omics data, careful considerations of the metabolome’s unique characteristics are required to fully leverage it for biological insights. Specifically, metabolites are known to be related directly and indirectly by enzymatic reactions within a metabolomic pathway. Clustering methods have been developed to incorporate this connectivity into the primary analysis to avoid a two-step approach. These include Bayesian methods for metabolite clustering based on peak detection[3,4] and ad hoc methods based on singleton metabolite presence[5]. For this work, we choose to group subsets of metabolites that interact with one another and represent these pathways as graphs or networks. Throughout this paper we will use the terms graph and network interchangeably. Open source databases with metabolomic pathway documentation such as the Kyoto Encyclopedia of Genes and Genomes (KEGG), the Human Metabolome Database (HMDB), Reactome, OmniPath, and WikiPathways are growing resources[6–10], and the pathways within these databases are easily translated to graphs to be used in downstream analyses.

The semiparametric kernel machine regression method[11,12] has gained popularity in many areas of biomedical research such as genomics, microbiome analysis, and neuroimaging[13–15]. One reason for its popularity is that it provides a computationally scalable method of classification and regression through the introduction of a kernel function. Another is that it provides a setting for formal statistical estimation and testing procedures for high-dimensional data sources, often using a score statistic. Formal statistical tests are useful for metabolomic research, as a goal is often identifying specific metabolites and pathways for further inquiry. At a high level, kernel machines test for relationships between an outcome and a set of predictors by testing whether variation in the outcome corresponds with variation in the predictors.

A hurdle more unique to metabolomics is the high levels of sparsity in individual metabolites and pathway connectivity. While metabolomic databases (e.g., KEGG, HMDB) are growing, none are considered complete. Data generating techniques like LC-MS and GC-MS are also imperfect technologies that may miss metabolite abundances that are too low[16]. Thus, pathway representations of metabolomic data are often sparse and disconnected, i.e., nodes or sets of nodes are not connected to any other within the pathway.

Disjoint nodes are of concern for graph-structured data. Techniques that force graphs to be fully connected by making small, uniform changes to the structure have been suggested for handling this issue[17,18]. However, it is understood that these alterations impose new challenges by changing the subspaces spanned by the graph. Works by Schaid [19] as well as Freytag et al. [20] developed a network-based kernel where similarity is defined directly from the network structure. These methods and others like them are tailored to genome-wide association studies and are not applicable to other omics data. Freytag also imposes “as much noise as necessary” within the network to ensure positive semidefinite matrices, which is something we aim to avoid. In fact, our proposal dampens out noisy features of the graph. The PIMKL method works with pathways within the metabolome by combining them through a weighted summed kernel[21]. These weights provide insight into the importance of each sub-pathway, but this does not amount to the level of evidence gathered from a direct comparison between specific pathways and phenotype.

In this paper we propose the Pathway Integrated Regression-based Kernel Association Test (PaIRKAT), a new kernel machine regression method for incorporating known pathway information into the semi-parametric kernel regression framework. In addition, PaIRKAT contributes an application of a graph kernel regularization method for overcoming sparse connectivity and disjoint pathways. To our knowledge, this is the first method to incorporate graph regularization into a kernel regression test. PaIRKAT allows for tests of association with phenotypes and the specific pathways while integrating pathway structure, and, instead of adding small amounts of noise, this approach dampens noisy components of a pathway while preserving biologically relevant signals. This leads to improved testing power and better overall biomarker detection. We evaluate these methods through several simulation studies and an application to real metabolomics data from the COPDGene[22] study.

Results

Method overview

Here we give the main steps of PaIRKAT and an overview of the ideas behind them. The method is described in full in Methods and Models. The primary goal of PaIRKAT is to include the topographical information of graph structured data into the kernel machine regression model. We use the semiparametric kernel machine model[11,12,23] to test for relationships between the phenotype of interest, Y, and a high dimensional set, Z, while controlling for important covariates, X, in the model g(Y) = Xβ + h(Z) + ϵ. In this model h(·) is a positive semidefinite kernel function that transforms Z to an appropriate feature space.

Omics data (metabolomics, genomics, etc.) can often be represented as a graph with edges representing biological interactions between the nodes (metabolites, etc.). Freytag et al. and Schaid both define a kernel directly from the graph structure where higher proximity within the pathway gives a higher similarity score [19,20]. This has been coined a ‘guilt by association’ approach [24] and has been proven effective empirically. These methods use a map from SNPs to genes to formulate similarity matrices, making them inapplicable to other types of studies. PaIRKAT also uses the ‘guilt by association’ paradigm but relies on a graph’s regularized normalized Laplacian as the measure of proximity within the pathway. Any appropriate kernel can then be applied for testing, making it more generally applicable than other similar approaches.

We explored the utility of incorporating the Laplacian directly into the kernel machine but found it to be ineffective using simulation studies. Instead, we transform L˜ using methods designed to dampen noisy aspects of a graph while preserving its biologically relevant features[25,26]. The PaIRKAT method is to include this regularized normalized Laplacian, L˜R, in the model through the kernel function as g(Y)=Xβ+h(ZL˜R)+ϵ. Tests for relationships between Y and h(ZL˜R) are performed using an adjusted score statistic[23] and Davies’ method for estimating distributions of linear combinations of χ2 variables[27].
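The regularization step can be illustrated with a minimal numerical sketch. It assumes the regularized form stated in the Fig 3 caption, L˜R = (I + τL˜)^(-1) with τ = 1, and a hypothetical four-node pathway (a path of three metabolites plus one isolated metabolite). The helper name `regularize`, the toy graph, and the convention that an isolated node has an all-zero row in L˜ are illustrative assumptions, not part of the published software.

```python
import numpy as np

# Hypothetical 4-node pathway: a path 0-1-2 plus an isolated node 3.
# The normalized Laplacian is written out by hand. The isolated node's row
# and column are all zero (assumed convention: D^{-1/2}[i, i] = 0 when
# node i has degree 0).
s = 1.0 / np.sqrt(2.0)
L_norm = np.array([
    [1.0,  -s, 0.0, 0.0],
    [ -s, 1.0,  -s, 0.0],
    [0.0,  -s, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

def regularize(L_norm, tau=1.0):
    """Return (I + tau * L_norm)^{-1}, the form given in the Fig 3 caption."""
    p = L_norm.shape[0]
    return np.linalg.inv(np.eye(p) + tau * L_norm)

L_reg = regularize(L_norm, tau=1.0)

# Simulated abundances for 5 subjects and 4 metabolites (illustrative only).
rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 4))

Z_laplacian = Z @ L_norm  # features weighted by the normalized Laplacian
Z_smooth = Z @ L_reg      # features weighted by the regularized Laplacian
```

Under these assumptions, multiplying by L˜ sets the isolated metabolite's column to zero, whereas multiplying by L˜R leaves that column unchanged and smooths only the connected metabolites. Any kernel can then be applied to `Z_smooth` in place of `Z`.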

Simulation results

A complete description of our simulation study can be found in Methods and Models, but we give a brief synopsis of the simulation scheme. We first randomly generated a graph. Second, we randomly generated features, Z, from multivariate normal distribution with a covariance structure derived from the graph. Lastly, we randomly generated a normally distributed outcome, Y, with a mean based on a linear relationship between the columns of Z. We performed tests ignoring graph topography, including graph topography in the kernel function via the normalized Laplacian (L˜), and our proposed method PaIRKAT of including graph topography in the kernel function via the regularized Laplacian (L˜R). Our simulations aimed to assess how sensitive our method is to incomplete and/or incorrect graph information. We also compare the power of our method to two simple competing approaches: an F-test on all principal components (PCs) of Z [28] and the minimum Simes’ adjusted p-value[29] from univariate tests on Z (Univariate Simes).

Type I error rates for PaIRKAT are summarized in Tables 1, 2, 3 and 4. The type I error rates for tests using a graph’s normalized Laplacian, L˜ (see Methods and Models section for definition), are summarized in S1, S2, S3, and S4 Tables. The type I error rate of ≈0.05 is maintained throughout all simulation scenarios.

Table 1. Type 1 error rates using all pathway information, i.e., no nodes or edges were dropped for these simulations.

“Perfect” indicates calculating L˜R from the graph used to generate the data. “Mismatch” indicates the percentage of direct edges that were incorrect. Error rates were calculated from score tests on 1000 simulated data sets. All simulations used graphs with 15, 30, or 45 nodes. “Complete Mismatch” indicates 100% mismatch.

Scenario / method      Pathway size (nodes)
                       15       30       45
Perfect                0.0482   0.0529   0.0568
10% Mismatch           0.0498   0.0494   0.0474
40% Mismatch           0.0487   0.0525   0.0464
70% Mismatch           0.0502   0.0512   0.0511
Complete Mismatch      0.0487   0.0511   0.0494
No Pathway             0.0580   0.0540   0.0530
Principal Component    0.0484   0.0543   0.0558
Univariate Simes       0.0490   0.0513   0.0507

Table 2. Type 1 error rates using pathways with 5% missing edges.

Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of medium edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabasi-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. Medium edge density graphs are created by giving any 2 nodes without a direct edge between them a 5% chance of becoming directly connected. This creates graphs with an average edge density of 0.18, 0.12, and 0.09 for graphs with 15, 30, and 45 nodes, respectively. “Perfect” indicates calculating L˜R from the graph without changing remaining edges. “Mismatch” indicates the percentage of remaining direct edges that were incorrect. “Complete Mismatch” indicates 100% mismatch.

Scenario / method      Pathway size (nodes)
                       15       30       45
Perfect Network        0.0497   0.0491   0.0463
10% Mismatch           0.0478   0.0479   0.0485
40% Mismatch           0.0465   0.0510   0.0536
70% Mismatch           0.0518   0.0523   0.0486
Complete Mismatch      0.0491   0.0539   0.0463
No Network             0.0480   0.0440   0.0390
Principal Component    0.0507   0.0489   0.0494
Univariate Simes       0.0515   0.0494   0.0477
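
The edge-adding construction described in the captions of Tables 2 and 3 (every pair of nodes without a direct edge receives a 5% or 15% chance of becoming connected) can be sketched as follows. This is an illustration under stated assumptions rather than the simulation code used for the paper: the function names are hypothetical, and a simple 15-node path graph stands in for the low-density Barabasi-Albert graph.

```python
import numpy as np

def edge_density(A):
    """Fraction of possible undirected edges present in adjacency matrix A."""
    p = A.shape[0]
    return A[np.triu_indices(p, k=1)].sum() / (p * (p - 1) / 2)

def add_random_edges(A, prob, rng):
    """Give every unconnected pair of nodes a `prob` chance of gaining an edge."""
    p = A.shape[0]
    iu = np.triu_indices(p, k=1)
    upper = A[iu].copy()
    new_edges = (upper == 0) & (rng.random(upper.size) < prob)
    upper[new_edges] = 1
    B = np.zeros_like(A)
    B[iu] = upper
    return B + B.T  # symmetric, zero diagonal

# Stand-in sparse graph: a 15-node path (14 edges). The paper draws its
# low-density graphs from the Barabasi-Albert model instead.
p = 15
A_low = np.zeros((p, p), dtype=int)
idx = np.arange(p - 1)
A_low[idx, idx + 1] = 1
A_low[idx + 1, idx] = 1

rng = np.random.default_rng(1)
A_medium = add_random_edges(A_low, prob=0.05, rng=rng)  # Table 2 setting
A_high = add_random_edges(A_low, prob=0.15, rng=rng)    # Table 3 setting

print(edge_density(A_low), edge_density(A_medium), edge_density(A_high))
```

In the simulations described above, the denser graph plays the role of the true data-generating pathway and the low-density graph is the one supplied to the test, which mimics a database that is missing some true edges.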

Table 3. Type 1 error rates using pathways with 15% missing edges.

Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of high edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabasi-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. High edge density graphs are created by giving any 2 nodes without a direct edge between them a 15% chance of becoming directly connected. This creates graphs with an average edge density of 0.26, 0.21, and 0.19 for graphs with 15, 30, and 45 nodes, respectively. “Perfect” indicates calculating L˜R from the graph without changing remaining edges. “Mismatch” indicates the percentage of remaining direct edges that were incorrect. “Complete Mismatch” indicates 100% mismatch.

Scenario / method      Pathway size (nodes)
                       15       30       45
Perfect Network        0.0508   0.0538   0.0456
10% Mismatch           0.0541   0.0519   0.0521
40% Mismatch           0.0495   0.0486   0.0478
70% Mismatch           0.0514   0.0506   0.0524
Complete Mismatch      0.0504   0.0523   0.0490
No Network             0.0430   0.0530   0.0510
Principal Component    0.0525   0.0509   0.0481
Univariate Simes       0.0499   0.0491   0.0459

Table 4. Type 1 error rates using pathways with dropped nodes.

Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes initially. The graph used to simulate Z and Y contained all nodes. Nodes with degree below the 25th percentile within a graph had a 25% chance of being dropped before testing. “Perfect” indicates calculating L˜R from the graph without changing edges between remaining nodes. “Mismatch” indicates the percentage of direct edges between remaining nodes that were incorrect. “Complete Mismatch” indicates 100% mismatch.

Scenario / method      Pathway size (nodes)
                       15       30       45
Perfect Network        0.0480   0.0513   0.0494
10% Mismatch           0.0499   0.0489   0.0476
40% Mismatch           0.0492   0.0495   0.0501
70% Mismatch           0.0522   0.0511   0.0500
Complete Mismatch      0.0481   0.0488   0.0483
No Network             0.0420   0.0490   0.0530
Principal Component    0.0505   0.0476   0.0501
Univariate Simes       0.0481   0.0502   0.0502

The power curves for all pathway structures and competing methods while simulating complete knowledge, missing edges, and missing nodes are displayed in Fig 1. Having a perfect pathway structure provides the most power. Relationships between an outcome and pathway are easier to detect in larger pathways. The more incorrect direct edges in the pathway, the lower the overall power. The univariate Simes test was improved by including L˜R. Using the PCs of Z and ZL˜R gave the exact same power, which is expected from a basis transformation, and performed similarly to a completely incorrect edge structure. Clearly, any correct information from the graph improved power overall. We also see that increasing the overall signal to noise ratio improves power for all pathway structures (Fig 2). PaIRKAT (L˜R) achieves approximately 80% power at a signal to noise ratio around 0.32, whereas ignoring network information requires a signal to noise ratio over twice that, about 0.70, and including only the Laplacian (L˜) never achieves 80% power (Fig 2). The univariate Simes’ test performed as well as PaIRKAT with perfect pathway knowledge. This is unsurprising since all zi are related to the outcome in our simulations.

Fig 1. Power curves from the four pathway knowledge and 6 pathway structure simulation scenarios.

Power curves were all calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. Power curves assuming complete pathway knowledge with no dropped edges or nodes are displayed in (a). For (b) and (c), the graph used to simulate Z and Y was of medium or high density, respectively, while the graph used to test was of low density. Medium and high edge density graphs used for data generation had ~5% and ~15% more edges, respectively, than the low-density graph used for testing. The power curve generated assuming missing nodes (d) used all graph nodes to generate Z and Y. Then nodes (and corresponding columns of Z) with degree below the 25th percentile within a graph had a 25% chance of being dropped before testing.

Fig 2. Signal to Noise Ratio. Power curves from increasing the signal to noise ratio while assuming complete pathway knowledge.

The signal to noise ratio was calculated as the ratio between the overall variance in Y, $\mathrm{Var}(\beta_0 + \sum_{j=1}^{p} \beta_j Z_{ij})$, and the overall residual variance, $\mathrm{Var}[Y_i - (\beta_0 + \sum_{j=1}^{p} \beta_j Z_{ij})]$. Each power calculation comes from score tests on 1000 simulated data sets using graphs with 30 nodes.

COPDGene analysis results

A complete description of these analyses can be found in Methods and Models, but here we give a brief description of the outcome variables we analyzed. We create models for two phenotypes from the COPDGene study[22]: (1) percent emphysema and (2) the ratio of post-bronchodilator forced expiratory volume at one second divided by forced vital capacity (FEV1/FVC). To normalize FEV1/FVC, we use the following log ratio transformation, $\log\left(\frac{\mathrm{FEV}_1/\mathrm{FVC}}{1 - \mathrm{FEV}_1/\mathrm{FVC}}\right)$. This is referred to as the “log FEV1/FVC ratio” for simplicity. We test for associations between 28 pathways and each outcome under the same three conditions in the simulation study: ignoring graph topography, including graph topography via the normalized Laplacian (L˜), and our proposed method PaIRKAT of including graph topography via the regularized Laplacian (L˜R).

Including the metabolites’ regularized graphs had large impacts on the associations between the log FEV1/FVC ratio and several subsets of metabolites. For the 28 pathways tested, power was improved for 17 pathways when using PaIRKAT vs. using L˜ or ignoring pathway information. Of note, the strength of the associations between the log FEV1/FVC ratio and the ABC transporters, the arginine and proline metabolism, the cysteine and methionine metabolism, the pyrimidine metabolism, the glycine, serine, and threonine metabolism, and the neuroactive ligand-receptor interaction metabolite subsets increased dramatically. The average p-value was also lower for 12 pathways when using L˜ vs. ignoring pathway information. S1 Fig displays the p-values from the kernel regression tests for associations between the log FEV1/FVC ratio and the 28 pathways of interest for each subsample size.

Including the metabolites’ regularized graphs also had impacts on the associations between percent emphysema and several subsets of metabolites. For the 28 subsets of metabolites tested, power was improved for 17 pathways when using PaIRKAT vs. including L˜ or ignoring pathway information. Of note, the strength of the associations between percent emphysema and the ABC transporters, the β-alanine metabolism, the neuroactive ligand-receptor interaction, the glycine, serine and threonine metabolism, and the histidine metabolism metabolite subsets increased dramatically when using PaIRKAT vs. ignoring pathway information. The average p-value was also lower for the same 5 pathways when using L˜ vs. ignoring pathway information. However, there was still not a significant result from any method for 4 of these pathways, and PaIRKAT provided similar power for the fifth. S2 Fig displays the p-values from the kernel regression tests for associations between percent emphysema and the 28 pathways of interest for each subsample size.

Fig 3 displays results from 3 pathways selected to illustrate PaIRKAT’s impact on power for fully connected (left column), partially disconnected (middle column), and sparse (right column) graphs. For the steroid hormone biosynthesis pathway, an almost completely sparse pathway, we see virtually no differences between PaIRKAT and ignoring pathway connectivity. We also see relatively small differences between all three methods for the fully connected aminoacyl-tRNA biosynthesis pathway. The major impacts from PaIRKAT come when there are a few nodes or node subsets disjoint from the rest of the graph, as we see in the cysteine and methionine metabolism.

Fig 3. Selected results from COPDGene subset analysis.

Average p-values from kernel regression tests that do not include pathway information (No Laplacian, red circles), include pathway information through a normalized Laplacian (L˜, green triangles), and include pathway information through a regularized normalized Laplacian ($\tilde{L}_R = (I + \tau\tilde{L})^{-1}$, blue squares) are displayed. P-values were averaged over 100 random subsets of size 100, 200, 300, 400, and 500 from the COPDGene dataset. τ was set to 1 for all tests that used L˜R. The 3 pathways selected illustrate the expected results in fully connected (left), partially disconnected (middle), and sparse (right) graphs.

Discussion

We have developed PaIRKAT, a method for incorporating pathway information under a kernel regression framework. Other methods to incorporate pathway connectivity via graph operations have been developed[20,21,26,30–32]. PaIRKAT enables the researcher to test on specified pathways instead of aggregating all pathways through a weighted kernel as in [21,30]. It can also handle disjointed pathways without adding in artificial noise to the network as in [17,18,20]. This allows the investigator to compile information from multiple sources, e.g., KEGG and HMDB. The regression framework also expands upon a method developed for classification[26]. It should be noted that the kernel framework is testing a global null, i.e., if any node covaries with the outcome the null hypothesis is rejected. See Goeman and Bühlmann[33] for a full discussion on whether or not this approach is appropriate for pathway based hypotheses.

Pathway misspecifications from incomplete data collection or imperfect canonical pathways within databases are common hurdles in -omics studies. We explored the sensitivity of the method by simulating data assuming incorrect pathway structures and incomplete pathway knowledge. These studies show that our method is highly robust to pathway misspecifications. In smaller pathways, we see that the partially mismatched structure with ~10% of direct edges being incorrect does as well as the perfect network structure. This is likely due to the very small change from the perfect structure in these cases, as a graph with only 15 nodes could easily be unchanged with only a 10% chance to change an edge. Furthermore, even with incorrect or incomplete pathway information, our method provides significantly improved power over ignoring pathway information while maintaining an appropriate type I error rate. We believe this is because many indirect connections between nodes are preserved, and these connections still provide more accurate information than incorrectly assuming independence among nodes.

One benefit of using PaIRKAT is improved power to identify pathways that are associated with clinical phenotypes. For example, an application to the COPDGene dataset using KEGG’s database of metabolic pathways also illustrated PaIRKAT’s ability to improve testing power over simply treating metabolites as independent (Figs 3, S1, and S2). The regularization technique was also able to handle pathways with few metabolites and/or disjoint components. Several tests had a notable boost in power from including pathway connectivity for both percent emphysema and the log FEV1/FVC ratio, and most pathways have been previously associated with COPD and lung function. Huang et al. linked environmental exposures, COPD risk, and metabolomic pathways, and found associations between COPD and the histidine metabolism, cysteine and methionine metabolism, and β-alanine metabolism pathways[34]. The glycine, serine and threonine metabolism, aminoacyl-tRNA biosynthesis, pyrimidine metabolism, and pantothenate and CoA biosynthesis pathways have all previously been associated with asthma[35]. The β-alanine metabolism, ABC transporters, purine metabolism, and pantothenate and CoA biosynthesis pathways were all differentially associated with COPD subclasses for patients with lung cancer[36]. Another study of the COPDGene dataset[22] using a two-step pathway enrichment approach found that the purine metabolism, mineral absorption, arginine biosynthesis, aminoacyl-tRNA biosynthesis, ABC transporters and glycine, serine and threonine metabolism pathways were all associated with various measures of lung function and increased COPD exacerbations[37]. The three ABC transporters have also been shown to be related to COPD in several murine knockout and human studies (see Chai et al.[38]). Finally, the arginine biosynthesis pathway has also been associated with COPD in multiple studies [39,40].

Graph information

We used a non-proprietary version of KEGG available in R. The proprietary version of this database has more up to date information and could have resulted in different pathway structures for the COPDGene data set. There is also a substantial literature on data driven methods for deriving networks from omics data [41–47]. Chai et al. provide a nice review[48]. We leave the investigation of how these data-driven methods interact with ours to future research.

Impacts of regularization

In simulation studies and real data analyses we saw meaningful improvements in power by including pathway information through a graph’s regularized normalized Laplacian (PaIRKAT) when compared to ignoring the pathway information or using L˜. PaIRKAT was essential to maintaining testing power when graphs had disjoint nodes or sub-graphs. Using the normalized Laplacian, L˜, hindered testing performance compared to using PaIRKAT or ignoring the pathway information when a graph was disconnected. In connected graphs, PaIRKAT, using L˜, and ignoring the pathway information all performed similarly in the real data analyses (Fig 3).

It is well established that L˜ is a symmetric and positive semidefinite matrix with eigenvalues $0 \le \lambda_1, \lambda_2, \dots, \lambda_p \le 2$, where the number of λi = 0 is the number of disjoint components of the undirected graph G (see Methods and Models). Therefore, graphs with very low connectivity, meaning many λi = 0, will not be as impacted by regularization since all $r^{-1}(\lambda_i = 0) = a$ for some scalar a. In words, there is no extra information from a graph when most nodes are disconnected from one another (e.g., Fig 3, right column).
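
As a worked illustration of this point, assume the regularization stated in the Fig 3 caption, L˜R = (I + τL˜)^(-1), which corresponds to the choice r(λ) = 1 + τλ. The regularization function r is defined in Methods and Models, so this particular form is an assumption made here for concreteness. Writing the eigendecomposition of L˜ with orthonormal eigenvectors U:

```latex
\tilde{L} = U \Lambda U^{T}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p)
\quad\Longrightarrow\quad
\tilde{L}_R = (I + \tau \tilde{L})^{-1}
            = U \,\mathrm{diag}\!\left( \frac{1}{1 + \tau\lambda_1}, \dots, \frac{1}{1 + \tau\lambda_p} \right) U^{T}.
```

Under this assumption every eigenvalue λi = 0 is mapped to the same value, a = 1, while the largest possible eigenvalue λi = 2 is mapped to 1/(1 + 2τ), so with τ = 1 the eigenvalues of L˜R lie between 1/3 and 1. In the extreme case where all λi = 0, L˜R reduces to the identity and ZL˜R = Z, so the test is the same as one that ignores the pathway.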

One limitation of this study is our focus on the Gaussian kernel. There has been success with other kernels for high dimensional data such as ones tailored to the data type [14,20] or simple linear and weighted linear kernels [23,49–51]. We have shown that including pathway information can improve the power of the Gaussian kernel and leave the impacts on other kernels to future work.

Summary

In summary, our proposed method serves as a framework for including pathway information into a kernel machine regression test. We developed this method for application to metabolomic pathway data, but the techniques are easily generalizable to other data sources with a graph-like structure. It is important to examine the structure of a graph before applying a regularization step. Unique challenges arose from the sparsity present in many metabolomic pathways which can greatly hinder performance. We implement a graph regularization kernel to handle disconnected pathways. This regularization step is novel in the application of graph-based kernel machine regression to biological data. Our simulation studies illustrate the robustness of this method to improper and incomplete pathway knowledge. The method presented can provide powerful tests for associations between biological pathways and phenotypes of interest and will be helpful in identifying novel pathways for targeted clinical research.

Methods and models

The Kernel machine model

We assume that the data are properly filtered, imputed, and normalized for the methods described in this paper. Consider a dataset with observations from n subjects. Let Y be an n×1 vector representing a continuous or discrete phenotype of interest. Also let X be an n×q matrix of clinical covariates and Z be an n×p matrix of graph-structured data. The phenotype can then be modeled through the following semiparametric model

$$g(Y) = X\beta + h(Z) + \epsilon, \tag{1}$$

where g is either the identity or logit link function, β is a q×1 vector of regression coefficients, ϵ is an n×1 vector of normally distributed error terms, and h is a kernel function. There are no parametric assumptions placed on h except that it lies in some feature space. This more relaxed requirement from the kernel regression provides flexibility and robustness to model misspecification. Another key advantage of introducing the kernel function is its ability to capture nonlinear relationships between the phenotype (Y) and the metabolome (Z) in a computationally tractable manner.

These relationships are assumed to exist in some feature space that is generated by a positive definite kernel function K(·,·). The kernel function can be understood as a feature map that delivers the dot product between zi and zj within the feature space, i.e., K(zi, zj) = ⟨ϕ(zi),ϕ(zj)⟩, where ϕ(·) is the transformation to the feature space and ⟨·,·⟩ is the dot product. The representer theorem allows h(Z) to be represented through the kernel function K(·,·) as $h(\cdot) = \sum_{i=1}^{n} \alpha_i K(\cdot, z_i, \rho)$ for some coefficients αi∈ℝ. More detailed derivations can be found in texts by Schölkopf and Smola[52] as well as Cristianini and Shawe-Taylor[53].

The kernel function K can be thought of as a measurement of similarity between two individuals. Common choices for kernel functions are the Linear Kernel: $K(z_i, z_j) = z_i^{T} z_j$ (the dot product), the dth Polynomial Kernel: $K(z_i, z_j, \rho) = (z_i^{T} z_j + \rho)^{d}$, and the Gaussian Kernel: $K(z_i, z_j, \rho) = \exp\{-\|z_i - z_j\|^{2}/\rho\}$, where ‖·‖ is the Euclidean (L2) norm. For this work, we employ the Gaussian kernel and use the median of all pairwise Euclidean distances between all zi and zj as an empirical estimate of ρ. We choose to work with the Gaussian kernel since it is a characteristic kernel, a desirable property meaning that probability measures embedded through the kernel function are unique.
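
The Gaussian kernel with the median-distance estimate of ρ can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `gaussian_kernel` is hypothetical, and it follows the text literally by taking the median of the unsquared pairwise Euclidean distances over all pairs i < j as ρ.

```python
import numpy as np

def gaussian_kernel(Z, rho=None):
    """Gaussian kernel matrix K[i, j] = exp(-||z_i - z_j||^2 / rho).

    Rows of Z are subjects. If rho is None, it is set to the median of the
    pairwise Euclidean distances between distinct rows (an assumed reading
    of the median heuristic described in the text).
    """
    sq_norms = np.sum(Z**2, axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (Z @ Z.T)
    np.maximum(sq_dist, 0.0, out=sq_dist)  # clip tiny negatives from round-off
    np.fill_diagonal(sq_dist, 0.0)         # distance from a row to itself
    if rho is None:
        iu = np.triu_indices(Z.shape[0], k=1)
        rho = np.median(np.sqrt(sq_dist[iu]))
    return np.exp(-sq_dist / rho), rho

# Illustrative data: 20 subjects, 5 features.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 5))
K, rho = gaussian_kernel(Z)
```

For PaIRKAT the same function would be applied to the smoothed features ZL˜R rather than to Z itself, with ρ re-estimated from the smoothed features.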

Kernel-based score test

Liu et al. show a connection between kernel machine regression and linear mixed models for semiparametric modeling of high dimensional data [11,12]. The parameters β and h(Z) can be estimated by maximizing the scaled penalized likelihood

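$$L(\beta, h) = -\frac{1}{2}\sum_{i=1}^{n}\left[y_i - x_i^T\beta - h(z_i)\right]^2 - \frac{1}{2}\lambda\|h\|^2 \tag{2}$$
$$= -\frac{1}{2}\sum_{i=1}^{n}\left[y_i - x_i^T\beta - \sum_{j=1}^{n}\alpha_j K(z_i, z_j)\right]^2 - \frac{1}{2}\lambda\,\alpha^T K \alpha, \tag{3}$$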

where K is the n×n matrix with entries K(zi, zj, ρ) given by the positive semidefinite kernel function of choice. h(Z) can then be understood as subject-specific random effects with mean 0 and variance τK. Testing for an association between phenotype and pathway is then equivalent to testing the null hypothesis H0: τ = 0 vs H1: τ > 0. We adopt Chen et al.’s kernel association test, which is adjusted for the small sample sizes that are common in omics studies [23]. The standard quadratic score statistic for kernel association tests,

$$Q(\beta, \sigma, \rho) = \frac{1}{\sigma^2}(Y - X\beta)^T K (Y - X\beta), \tag{4}$$

is adjusted to account for the high variability in estimates of σ2 when n is small. The distribution of Q under the null model is then approximated as a weighted sum of χ2 variables using Davies’ method [27].
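The unadjusted statistic in (4) can be sketched for a continuous phenotype as follows. This is only an illustration under stated assumptions: β and σ2 are replaced by ordinary least squares estimates from the null model, the small-sample adjustment of Chen et al. and Davies' method are not implemented, and the function name `score_statistic` and the simulated inputs are hypothetical. The returned weights are the eigenvalues of P0 K P0, where P0 projects onto the orthogonal complement of the columns of X; when σ2 is treated as known, these are the weights of the χ2 mixture that approximates the null distribution of Q.

```python
import numpy as np

def score_statistic(Y, X, K):
    """Unadjusted kernel score statistic and chi-square mixture weights."""
    n, q = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # null-model OLS fit
    resid = Y - X @ beta_hat
    sigma2_hat = resid @ resid / (n - q)               # residual variance
    Q = float(resid @ K @ resid / sigma2_hat)
    # Projection onto the orthogonal complement of the column space of X.
    P0 = np.eye(n) - X @ np.linalg.pinv(X)
    weights = np.linalg.eigvalsh(P0 @ K @ P0)
    return Q, weights

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # intercept + covariate
Z = rng.normal(size=(n, 4))
K = Z @ Z.T                                            # linear kernel, for brevity
Y = rng.normal(size=n)                                 # phenotype under H0
Q, weights = score_statistic(Y, X, K)
```

A p-value would then require the tail probability of the weighted χ2 mixture, which is the step the paper handles with Davies' method.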

Graph Laplacian

A network or graph, G = {V, E}, is a mathematical representation of any interconnected structure through a set V of p nodes (or vertices) and a set E of edges, where the elements of E are pairs {u, v} of distinct vertices, u, v ∈ V. When applied to omic pathways, nodes represent individual metabolites, genes, microbes, etc. within the pathway and edges represent direct interactions/reactions between them.

Two important features of any graph are its adjacency matrix, A, and degree matrix, D. A is a p×p matrix whose entry A[i,j] is non-zero when an edge exists between vertices i and j. D is a p×p diagonal matrix with D[i,i] representing the number of nodes connected to node i. For this work, we represent pathways using undirected unweighted graphs, i.e., there is no ordering to the vertices defining an edge and {u, v} = {v, u}∈E. This means A will be a symmetric matrix with all entries either 1 or 0. Using these features, we can calculate a graph’s Laplacian, $L = D - A$, and its normalized Laplacian, $\tilde{L} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2}$, where I is a p×p identity matrix.

Both L and $\tilde{L}$ can be regarded as linear operators on functions f:V→ℝ that induce a semi-norm $\|f\|_L = \langle f, Lf\rangle = f^T L f$. This semi-norm can be interpreted as a measure of “smoothness” or how much f varies over its domain. Standardizing L by the number of connections per node to obtain $\tilde{L}$ is a common approach in graph theory since $\tilde{L}$ has several well-known and desirable properties. In particular, $\tilde{L}$ is symmetric and positive semidefinite, and its eigenvalues, λi, are bounded such that they satisfy 0≤λi≤2 for i∈{1, 2, …, p}. Another interesting feature of a graph’s normalized Laplacian, $\tilde{L}$, is that the number of disjoint pieces (connected components) within a graph is captured by the number of eigenvalues of $\tilde{L}$ equal to 0 [54].
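These properties can be checked numerically on a small graph. The sketch below is illustrative only: it assumes NumPy, a hypothetical six-node graph, and the convention that an isolated node contributes a zero row and column to the normalized Laplacian (its entry of D to the power -1/2 is taken as 0), which is one common convention and not necessarily the one used by the authors' software.

```python
import numpy as np

def normalized_laplacian(A):
    """D^{-1/2} (D - A) D^{-1/2}, with zero rows/columns for isolated nodes."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    return d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]

# Hypothetical pathway: a path 0-1-2, a separate edge 3-4, and isolated node 5.
edges = [(0, 1), (1, 2), (3, 4)]
A = np.zeros((6, 6))
for u, v in edges:
    A[u, v] = A[v, u] = 1

L_norm = normalized_laplacian(A)
eigvals = np.linalg.eigvalsh(L_norm)
# Number of (numerically) zero eigenvalues = number of disjoint pieces.
n_components = int(np.sum(np.abs(eigvals) < 1e-8))
```

For this graph the eigenvalues are 0, 1, 2 for the path, 0 and 2 for the single edge, and 0 for the isolated node, so three zero eigenvalues match the three disjoint pieces.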

Graph regularization

A key component of PaIRKAT is the ability to handle missing and incorrect information from the graph. Pathway databases may not be complete, and untargeted data generating techniques may not be able to capture all components within a pathway. This leaves some pathways with low connectivity and others with completely disconnected nodes, which can decrease our power to detect associations between phenotypes and metabolomic pathways. One proposed solution is to simply manipulate the adjacency matrix by adding a small constant to all entries [17,18], i.e., working with a modified adjacency matrix $\tilde{A} = A + t\,ee^T$, where t is a nonnegative tuning parameter and e is a vector of 1s. This yields a full rank matrix as desired, but we know that the subspace spanned by $\tilde{A}$ is not the correct subspace on which our graph lies.

A more elegant solution can be drawn from Smola and Kondor’s work on regularization of graphs [25], in which they draw on parallels between the standard Laplacian operator ($\Delta = \frac{\partial^2}{\partial x_1^2} + \frac{\partial^2}{\partial x_2^2} + \cdots + \frac{\partial^2}{\partial x_m^2}$) and the graph Laplacian to design regularization kernels for graphs. Rapaport et al. [26] took a similar approach to graph smoothing, though this work was done in the context of classification, not hypothesis testing. These ideas can be generalized further to represent any metric on a space. That is, for any two observations i and j, the inner product can be expressed as $\langle z_i, z_j\rangle_M = z_i^T M z_j$, where M defines the metric on the vector space based on $\|z_i - z_j\|_M$. Purdom [55] presents this argument in the context of a “generalized” principal component analysis using a general metric M. This can be seen as an application of a linear kernel on any metric space, whereas we apply the Gaussian kernel for hypothesis testing and, like Rapaport, focus on graph Laplacians for our metric.

For this work, we apply a regularization function to obtain a regularized normalized Laplacian: $r(\tilde{L}) \equiv \tilde{L}_R$. Regularizations of the Laplacian can be seen as regularizations of the eigenvalues of $\tilde{L}$, r(λ). There are many possible choices for r; the only requirement is that r−1(λ)>0 for λ∈[0, 2] to ensure $r(\tilde{L}) \succeq 0$. In classical Fourier analysis the size of λi∈[0, 2] is directly proportional to the frequency of component i within Fourier space, which translates to the degree of noise within the system. This intuition tells us to limit r−1(λ) to monotonically increasing functions in order to impose higher penalties on more uneven portions of the graph while preserving the lower frequency components, which we assume translate to the prevalent biological signals. Smola and Kondor recommend further limiting choices of r to functions expressible by power series such as a diffusion kernel, $r(\tilde{L}) = e^{-(\tau/2)\tilde{L}}$. See [56] for complete details on the derivation of different regularization functions.
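The eigenvalue view of regularization can be written out explicitly. The following restatement is a sketch under two assumptions that the text implies but does not spell out: the normalized Laplacian has orthonormal eigenvectors $u_i$ with eigenvalues $\lambda_i$, and $r^{-1}(\lambda)$ denotes the reciprocal $1/r(\lambda)$ rather than a functional inverse.

```latex
\tilde{L} = \sum_{i=1}^{p} \lambda_i\, u_i u_i^{T}, \qquad
r(\tilde{L}) = \sum_{i=1}^{p} r(\lambda_i)\, u_i u_i^{T}, \qquad
f^{T}\, r(\tilde{L})^{-1} f = \sum_{i=1}^{p} r^{-1}(\lambda_i)\,\bigl(u_i^{T} f\bigr)^{2}.
```

Under this reading, a monotonically increasing $r^{-1}$ places larger penalties on the components of f along eigenvectors with large $\lambda_i$ (the high-frequency directions), while components with $\lambda_i$ near 0 pass through almost unchanged.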

PaIRKAT implements a “linear” regularization function

$$\tilde{L}_R = (I + \tau\tilde{L})^{-1}, \tag{5}$$

where τ>0 is a bandwidth parameter and I is a p×p identity matrix. We choose this regularization for its simplicity and interpretability of τ. Increasing τ linearly increases the amount of smoothing performed in r−1(λ) = 1+τλ. We can now conduct a kernel machine test while incorporating connectivity within a pathway through L˜R into (1) as

$$g(Y) = X\beta + h(Z\tilde{L}_R) + \epsilon, \tag{6}$$

where h is a kernel function applied to $Z\tilde{L}_R$ and the other model components are as described in (1). Multiplying by $\tilde{L}_R$ changes the basis of Z to one defined by the Laplacian, with the new basis vectors representing noise dampened through the regularization function. This can be interpreted as transforming each of a subject’s measurements into a weighted sum over all elements, where the weights reflect the elements’ proximity to each other within the pathway. This falls under the ‘guilt by association’ framework, as nodes closer to each other will share more information and disconnected nodes will share none. The kernel-based score test can then be applied to obtain powerful tests for associations between connected or disconnected pathways and a phenotype of interest.
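The effect of the linear regularization on a small pathway can be sketched as follows. This is an illustration, not the PaIRKAT source code: it assumes NumPy, a hypothetical four-node graph (a path 0-1-2 plus an isolated node 3), τ = 1, simulated data, and the convention that an isolated node has a zero row and column in the normalized Laplacian.

```python
import numpy as np

# Hypothetical pathway: path 0-1-2 and an isolated node 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]], dtype=float)
deg = A.sum(axis=1)
d_inv_sqrt = np.zeros_like(deg)
d_inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
L_norm = d_inv_sqrt[:, None] * (np.diag(deg) - A) * d_inv_sqrt[None, :]

tau = 1.0
# Linear regularization: inverse of (I + tau * normalized Laplacian).
L_R = np.linalg.inv(np.eye(4) + tau * L_norm)

rng = np.random.default_rng(2)
Z = rng.normal(size=(6, 4))     # 6 hypothetical subjects, 4 metabolites
Z_smooth = Z @ L_R              # each column becomes a weighted sum of columns

lam = np.linalg.eigvalsh(L_norm)
lam_R = np.linalg.eigvalsh(L_R)
```

In this sketch the eigenvalues of the regularized matrix are 1/(1 + τλ) for each eigenvalue λ of the normalized Laplacian, and the column for the isolated node is left unchanged because it shares no information with the other nodes. A kernel matrix computed on `Z_smooth` instead of `Z` would then enter the score test.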

Simulation study

Simulation scenarios

We conducted multiple simulation studies to assess whether the proposed method is robust to imperfect pathway information. We assumed 3 different “pathway knowledge” scenarios and 4 different “pathway structure” scenarios (Fig 4). Different pathway knowledge scenarios refer to different types of missing information, whereas pathway structure scenarios refer to different configurations of the “known” nodes and edges. We simulate using both the normalized Laplacian, $\tilde{L}$, and PaIRKAT’s regularized normalized Laplacian, $\tilde{L}_R$, as well as ignoring the pathway information. For comparison, we also tested using an F-test on all principal components of Z and $Z\tilde{L}_R$ and the minimum Simes’ adjusted p-value of univariate tests [29] on all columns of Z and $Z\tilde{L}_R$.

Fig 4. Flowchart of simulation procedure.


We (1) simulate a graph G, (2) generate Z and Y from G, (3) drop nodes or edges from G to give a smaller graph $G_s$ (dropping the corresponding columns of Z when dropping nodes to create $Z_s$), (4) permute edges to create an improperly structured graph $G_s^*$, (5) calculate the regularized normalized Laplacian $\tilde{L}_R^*$ from $G_s^*$, and finally (6) test for an association between $h(Z\tilde{L}_R^*)$ (or $h(Z_s\tilde{L}_R^*)$) and Y in the model $Y = \beta_0 + h(Z_s\tilde{L}_R^*)$. For the “no network” simulations, we only use step (1), step (2), and step (6) without including $\tilde{L}_R^*$.

Pathway knowledge

We simulated three different knowledge scenarios to represent incomplete pathway database information and/or incomplete data collection.

  1. No missing: Assuming the nodes measured (metabolites, genes, etc.) and edges connecting them are a perfect representation of the biological pathway of interest.

  2. Missing edges: Assuming that some biological interactions (edges) are missing from the documented pathway. Here we generate a graph G = {V, E} according to the Barabasi-Albert model for a “low” edge density. We then give every pair {u, v}∉E a 5% or 15% chance of being added to E for a “medium” or “high” edge density graph, respectively. Z and Y are then generated from the medium or high edge density graph, but $\tilde{L}$ or $\tilde{L}_R$ is calculated from the original “low” edge density graph. Examples of these graphs are shown in S3 Fig.

  3. Missing nodes: Assuming that some of the nodes (and hence their edges) are missing from the documented pathway. Here a graph is used to generate Z and Y. Then nodes with degree below the 25th percentile have a 25% chance of being removed before calculating $\tilde{L}$ or $\tilde{L}_R$. The corresponding columns of Z and rows of β are removed as well. Examples of these graphs are shown in S4 Fig.
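The missing-edge and missing-node scenarios above can be sketched on adjacency matrices as follows. This is a rough illustration rather than the authors' R code: it assumes NumPy, uses a simple path graph as a stand-in for the low-density Barabasi-Albert graph, treats “below the 25th percentile” as a strict inequality, and uses hypothetical function names.

```python
import numpy as np

def add_random_edges(A, prob, rng):
    """Give every unconnected pair {u, v} a `prob` chance of gaining an edge."""
    rows, cols = np.triu_indices(A.shape[0], k=1)
    add = (A[rows, cols] == 0) & (rng.random(rows.size) < prob)
    A_dense = A.copy()
    A_dense[rows[add], cols[add]] = 1
    A_dense[cols[add], rows[add]] = 1
    return A_dense

def drop_low_degree_nodes(A, Z, rng, percentile=25, prob=0.25):
    """Drop nodes whose degree is below the given percentile with chance `prob`."""
    deg = A.sum(axis=1)
    eligible = deg < np.percentile(deg, percentile)   # assumption: strict "<"
    keep = ~(eligible & (rng.random(A.shape[0]) < prob))
    return A[np.ix_(keep, keep)], Z[:, keep], keep

rng = np.random.default_rng(7)
p = 15
A_low = np.zeros((p, p), dtype=int)      # stand-in "documented" graph: a path
for i in range(p - 1):
    A_low[i, i + 1] = A_low[i + 1, i] = 1

# Missing edges: data come from the denser graph, testing uses A_low.
A_true = add_random_edges(A_low, prob=0.05, rng=rng)

# Missing nodes: data come from the full graph, testing uses the reduced one.
Z = rng.normal(size=(10, p))
A_obs, Z_obs, keep = drop_low_degree_nodes(A_true, Z, rng)
```

With a strict inequality, graphs in which many nodes share the minimum degree may have no eligible nodes at all, so the handling of ties is a detail worth checking against the published code.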

Pathway structures

After we simulate a pathway knowledge scenario, we alter the pathway structure to represent incorrect edge connections within a database. Examples of structures 1, 2, and 3 are displayed in Fig 5.

Fig 5. Examples of the three different pathway structures.


Nodes 6 and 7 are highlighted in red to help display the effects of different pathway structures. (Left) The “true” pathway or graph that is used to simulate Z and Y. This is the graph used for tests under a “perfect pathway structure” scenario. (Middle) A graph with approximately 40% of the edges from the “true” graph directly connecting the wrong nodes. This is used for tests under a “partial mismatch (40%) structure” scenario. (Right) A graph with 0 shared edges with the “true” graph. This is the graph used for tests under a “complete mismatch structure” scenario.

  1. No mismatch: No alterations to graph edges. The graph used to simulate Z and Y is the same graph used to calculate $\tilde{L}$ or $\tilde{L}_R$ (Fig 5, left).

  2. Partial Mismatch: a graph, G1 = {V1, E1}, is used to simulate Z and Y. This graph’s edges are permuted such that any edge {u, v}∈E1 has a 10%, 40%, or 70% chance of being changed to some {u, w}∉E1; i.e., approximately 10%, 40%, or 70% of direct edges are incorrect before calculating $\tilde{L}$ or $\tilde{L}_R$ (Fig 5, middle).

  3. Complete Mismatch: a network G1 = {V1, E1} is used to simulate Z and Y. A new random graph, G2, is then drawn and forced to have no edges that match G1, i.e., V1 = V2 but if {u, v}∈E1 then {u, v}∉E2. We then calculate $\tilde{L}$ or $\tilde{L}_R$ from G2 (Fig 5, right).

  4. No Pathway: a graph is used to simulate Z and Y. This connectivity is ignored while testing by not including $\tilde{L}$ or $\tilde{L}_R$ in the kernel function.

All pathway structures were considered under each different pathway knowledge scenario. The different pathway structures were imposed after simulating under different pathway knowledge assumptions. Each simulated pathway structure and knowledge combination followed 6 steps: (1) simulate a graph G, (2) generate Z and Y from G, (3) drop nodes and/or edges (based on knowledge assumption) from G and Z to give a smaller graph and node set $G_s$ and $Z_s$, (4) alter $G_s$ (based on structure assumption) to create a graph $G_s^*$ with improper edge connections, (5) calculate $\tilde{L}^*$ or $\tilde{L}_R^*$ from $G_s^*$, and (6) test for an association between $h(Z_s\tilde{L}_R^*)$ and Y in the model $Y = \beta_0 + h(Z_s\tilde{L}_R^*)$. See Fig 4 for a flowchart of these simulation scenarios.
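The partial mismatch step (4) can be sketched as a simple edge-rewiring routine. This is one possible reading of the description, not the authors' code: each edge {u, v} is replaced with probability `prob` by an edge {u, w} that is absent from the true graph and not already created by an earlier rewiring, and the original edge is kept when no such w exists. The function name, the path graph, and the seed are hypothetical.

```python
import random

def rewire_edges(edges, nodes, prob, rng):
    """Return a set of frozenset edges with some true edges redirected."""
    true_edges = {frozenset(e) for e in edges}
    new_edges = set()
    for u, v in edges:
        if rng.random() < prob:
            # Candidate endpoints w such that {u, w} is a new, wrong edge.
            candidates = [w for w in nodes
                          if w != u
                          and frozenset((u, w)) not in true_edges
                          and frozenset((u, w)) not in new_edges]
            if candidates:
                new_edges.add(frozenset((u, rng.choice(candidates))))
                continue
        new_edges.add(frozenset((u, v)))   # edge left unchanged
    return new_edges

rng = random.Random(1)
nodes = list(range(8))
edges = [(i, i + 1) for i in range(7)]           # hypothetical "true" path graph
true_set = {frozenset(e) for e in edges}

unchanged = rewire_edges(edges, nodes, 0.0, rng)  # no mismatch
partial = rewire_edges(edges, nodes, 0.4, rng)    # roughly 40% mismatch
heavy = rewire_edges(edges, nodes, 1.0, rng)      # every edge is a rewiring candidate
```

The routine preserves the number of edges, so the altered graph differs from the true graph only in which nodes are directly connected.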

Simulated data

To evaluate PaIRKAT’s overall testing performance and robustness to incorrect pathway information, we simulate data and tests assuming various types of misspecified pathways. All simulations were performed using R [57]. Random graphs were generated using the igraph [58] package according to the Barabasi-Albert model [59] with p nodes representing p metabolites within a pathway. The graph’s adjacency matrix was converted into a positive definite precision matrix, Ω, using an approach developed by Danaher et al. [60] and also applied by Shaddox et al. [61]. An n by p matrix of metabolite abundances, Z, was then simulated from a multivariate normal distribution with mean 0 and covariance Ω−1. In this way, node connectivity is captured by Ω. A continuous outcome Yi was then simulated from a normal distribution with mean 0.26+0.5 X1+0.25 X2+∑jβjZij and variance σ2, where X1 is a binary variable, X2 is a uniform random variable, and $\sigma^2 = 1.3688^2$. This value for σ2 was drawn from observed metabolomics data. The regularization parameter τ is set to 1 for all simulations. All βj were set to 0 to assess Type I error rates or set to 0.1 to assess power for the different pathway information scenarios described above. Each scenario used 10,000 simulations with graphs of size p = 15, 30, or 45 and a sample size of n = 160, and a testing level of α = 0.05 was used for all simulations.
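A single simulated data set of this kind can be sketched in Python as follows. Several pieces are assumptions made for illustration and do not come from the paper: the graph is grown by a simple one-edge-per-node preferential attachment rule instead of igraph, the precision matrix is built by a diagonally dominant construction that stands in for (and is not) the approach of Danaher et al., X1 is Bernoulli(0.5), X2 is uniform on (0, 1), and σ is taken as 1.3688.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 160, 15

# Stand-in for a Barabasi-Albert graph: each new node attaches to one
# existing node chosen with probability proportional to its degree.
A = np.zeros((p, p))
A[0, 1] = A[1, 0] = 1
for new in range(2, p):
    deg = A[:new, :new].sum(axis=1)
    target = rng.choice(new, p=deg / deg.sum())
    A[new, target] = A[target, new] = 1

# Assumed construction (NOT Danaher et al.): strictly diagonally dominant,
# hence positive definite, precision matrix with the graph's sparsity.
Omega = 0.3 * A
np.fill_diagonal(Omega, 0.3 * A.sum(axis=1) + 1.0)
Sigma = np.linalg.inv(Omega)
Sigma = (Sigma + Sigma.T) / 2            # remove floating-point asymmetry

Z = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X1 = rng.integers(0, 2, size=n)          # binary covariate (assumed p = 0.5)
X2 = rng.uniform(size=n)                 # uniform covariate (assumed on (0, 1))
beta = np.full(p, 0.1)                   # use zeros to study Type I error
sigma = 1.3688
Y = 0.26 + 0.5 * X1 + 0.25 * X2 + Z @ beta + rng.normal(scale=sigma, size=n)
```

Repeating this draw many times and recording how often the test rejects at α = 0.05 gives an estimate of power (β = 0.1) or Type I error (β = 0).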

COPDGene data

We analyzed data collected from the COPDGene study [22], a multicenter observational study that collected genetic data as well as multiple measures of lung function to study chronic obstructive pulmonary disease (COPD). Between 2007 and 2011, 10,198 participants with and without COPD enrolled (Visit 1). A five-year follow up visit took place between 2013 and 2017 (Visit 2). Blood samples were also obtained for -omics analyses from participants who provided consent. In total, 1136 subjects (1040 non-Hispanic white, 96 African American) participated in a metabolomics ancillary study in which they provided fresh frozen plasma collected using an 8.5 mL P100 tube (Becton Dickinson) at Visit 2.

Metabolomics and data processing

P100 plasma was profiled using the Metabolon (Durham, NC, USA) Global Metabolomics platform. Briefly, untargeted liquid chromatography–tandem mass spectrometry (LC–MS/MS) was used to quantify 1392 metabolites, as described in [62,63]. A data normalization step was performed to correct variation resulting from instrument inter-day tuning differences: metabolite intensities were divided by the metabolite run day median, then multiplied by the overall metabolite median. It was determined that no further normalization was necessary based on the reduction in the significance of association between the top PCs and sample run day after normalization. Subjects with aggregate metabolite median z-scores greater than 3.5 standard deviations from the mean of the cohort (n = 6) were removed. Metabolites were excluded if >20% of samples were missing values [64]. For the 995 remaining metabolites, missing values were imputed across metabolites with k-nearest neighbors imputation (k = 10) using the R package impute [65]. As a final step, metabolomic data were natural log transformed and standardized. Linear regression models were fit to each metabolite controlling for white blood cell count, percent eosinophil, percent lymphocytes, percent monocytes, percent neutrophils, and hemoglobin. The partial residuals were then used as the observed metabolomics data. These data are available at Metabolomics Workbench with identifier PR000907.
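The run-day normalization described above can be sketched as follows. This is a minimal illustration assuming NumPy, a samples-by-metabolites intensity matrix, and medians that ignore missing values; the function name and the toy data are hypothetical and the actual pipeline may differ in its details.

```python
import numpy as np

def run_day_normalize(intensity, run_day):
    """Divide by each metabolite's run-day median, multiply by its overall median."""
    intensity = np.asarray(intensity, dtype=float)
    run_day = np.asarray(run_day)
    overall_median = np.nanmedian(intensity, axis=0)     # one value per metabolite
    out = np.empty_like(intensity)
    for day in np.unique(run_day):
        rows = run_day == day
        day_median = np.nanmedian(intensity[rows], axis=0)
        out[rows] = intensity[rows] / day_median * overall_median
    return out

# Toy data: two metabolites, three samples on each of two run days, with
# day 2 reading ten times higher than day 1.
raw = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0],
                [10.0, 100.0], [20.0, 200.0], [30.0, 300.0]])
days = np.array([1, 1, 1, 2, 2, 2])
normalized = run_day_normalize(raw, days)
```

After this step every run day has the same median for each metabolite, namely the overall median of the raw intensities, which removes the day-level shift in the toy data.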

Four hundred and thirty six of these metabolites had an id in the KEGG database of human pathways, which was accessed using the keggLink function from the KEGGREST package [66]. These 436 metabolites appear in 161 KEGG pathways, and 28 of these 161 KEGG pathways contained 10 or more metabolites. Edges in a pathway’s graph were defined by connections within a pathway from the KEGG reaction database. Note that our filtered dataset did not contain every metabolite within the 28 KEGG pathways selected, and therefore some of the analyzed pathways have fewer than 10 metabolites.

Clinical variables

We focus on two COPD phenotypes: (1) percent emphysema and (2) the ratio of post-bronchodilator forced expiratory volume at one second divided by forced vital capacity (FEV1/FVC). Emphysema, a measure of erosion of the distal airspaces, has been linked with the clinical severity of COPD [67]. It is an imaging-based phenotype defined as the 15th percentile lung voxel density in Hounsfield units adjusted for total lung capacity from quantitative CT imaging analyses. FEV1/FVC is a measure of airflow obstruction. To normalize FEV1/FVC, we use the following log ratio transformation, $\log\left(\frac{\mathrm{FEV_1/FVC}}{1 - \mathrm{FEV_1/FVC}}\right)$. After removing incomplete cases we were left with 1,113 complete cases for the FEV1/FVC analysis and 1,065 complete cases for the percent emphysema analysis.
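The log ratio transformation is the logit of the FEV1/FVC ratio. A small sketch, assuming the ratio is recorded as a proportion strictly between 0 and 1 (not as a percentage) and using a hypothetical function name:

```python
import math

def logit_ratio(fev1_fvc):
    """Log ratio transformation log(r / (1 - r)) for a proportion r."""
    if not 0.0 < fev1_fvc < 1.0:
        raise ValueError("FEV1/FVC must be a proportion strictly between 0 and 1")
    return math.log(fev1_fvc / (1.0 - fev1_fvc))

# Hypothetical ratios: values above 0.5 map to positive numbers and
# values below 0.5 map to negative numbers.
transformed = [logit_ratio(r) for r in (0.45, 0.50, 0.70)]
```

The transformation maps the bounded interval (0, 1) onto the whole real line, which is why it is commonly used before fitting a model with normally distributed errors.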

Analysis

We compared results from tests that included pathway connectivity via $\tilde{L}$ or $\tilde{L}_R$ with tests that ignored pathway connectivity for the 28 pathways that had measurements on at least 10 of the metabolites in the pathway. P-values were calculated from a score test as described in Section 2, with τ = 1 for PaIRKAT tests. P-values from each method were indistinguishable from one another for both data sets with over 1,000 observations. However, many data sets may not be that large. To demonstrate the differences in performance, 100 random subsets of sizes 100, 200, 300, 400, and 500 were taken from both the log FEV1/FVC ratio and the percent emphysema data sets. All three methods were used to test for associations between phenotype and metabolites within a pathway. The 100 p-values were then averaged to measure the performance of each method. All null models included subject age, sex, BMI, smoking status (current, former, never), pack-years of smoking, and the clinical center as covariates.

Supporting information

S1 Fig. Associations between metabolite subsets and log FEV1/FVC ratio.

Average p-values from kernel regression tests that do not include pathway information (No Laplacian, red circles), include pathway information through a normalized Laplacian ($\tilde{L}$, green triangles), and include pathway information through a regularized normalized Laplacian ($\tilde{L}_R = (I + \tau\tilde{L})^{-1}$, blue squares) are displayed. P-values were averaged over 100 random subsets of size 100, 200, 300, 400, and 500 from the COPDGene dataset. τ was set to 1 for all tests that used $\tilde{L}_R$.

(TIF)

S2 Fig. Associations between metabolite subsets and percent emphysema.

Average p-values from kernel regression tests that do not include pathway information (No Laplacian, red circles), include pathway information through a normalized Laplacian ($\tilde{L}$, green triangles), and include pathway information through a regularized normalized Laplacian ($\tilde{L}_R = (I + \tau\tilde{L})^{-1}$, blue squares) are displayed. P-values were averaged over 100 random subsets of size 100, 200, 300, 400, and 500 from the COPDGene dataset. τ was set to 1 for all tests that used $\tilde{L}_R$.

(TIF)

S3 Fig. Example graphs with high, medium, and low edge densities.

Low-density graphs were generated according to the Barabasi-Albert model for graph simulation. Medium- and high-density graphs were generated by giving each unconnected pair of nodes either a 5% or 15% chance of becoming connected, respectively.

(TIF)

S4 Fig. Example of a graph with missing nodes.

Graphs were generated according to the Barabasi-Albert model. Then any node with degree below the 25th percentile of degrees within the graph had a 25% chance of being dropped.

(TIF)

S1 Table. Type 1 error rates using complete pathway.

Error rates were calculated from score tests on 1000 simulated data sets. All simulations used graphs with 15, 30, or 45 nodes. No nodes or edges were dropped for these simulations. Pathway information was included in the kernel score test through the normalized Laplacian $\tilde{L}$.

(XLSX)

S2 Table. Type 1 error rates using pathways with 5% missing edges.

Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of medium edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabasi-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. Medium edge density graphs are created by giving any 2 nodes without a direct edge between them a 5% chance of becoming directly connected. This creates graphs with an average edge density of 0.18, 0.11, and 0.09 for graphs with 15, 30, and 45 nodes, respectively. Pathway information was included in the kernel score test through the normalized Laplacian $\tilde{L}$.

(XLSX)

S3 Table. Type 1 error rates using pathways with 15% missing edges.

Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of high edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabasi-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. High edge density graphs are created by giving any 2 nodes without a direct edge between them a 15% chance of becoming directly connected. This creates graphs with an average edge density of 0.26, 0.21, and 0.19 for graphs with 15, 30, and 45 nodes, respectively. Pathway information was included in the kernel score test through the normalized Laplacian $\tilde{L}$.

(XLSX)

S4 Table. Type 1 error rates using pathways with dropped nodes.

Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes initially. The graph used to simulate Z and Y contained all nodes. Nodes with degree below the 25th percentile within a graph had a 25% chance of being dropped before testing. Pathway information was included in the kernel score test through the normalized Laplacian $\tilde{L}$.

(XLSX)

Data Availability

The R code for simulations and an example workflow with source code is available at https://github.com/CharlieCarpenter/PaIRKAT. A shiny app can be found at https://csevern.shinyapps.io/pairkat/. The metabolomics data set from the COPDGene Study can be found through PMID: 20214461 or DOI: 10.3109/15412550903499522.

Funding Statement

RB was awarded U01 HL089897 and U01 HL089856 from the National Heart, Lung, and Blood Institute, https://www.nhlbi.nih.gov/. KK and DG were awarded U01 CA235488 from the National Cancer Institute, https://www.cancer.gov/. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Fiehn O. Metabolomics—the link between genotypes and phenotypes. In: Town C, editor. Functional Genomics. Dordrecht: Springer Netherlands; 2002. pp. 155–171. doi: 10.1007/978-94-010-0448-0_11
  • 2. Alonso A, Marsal S, Julià A. Analytical Methods in Untargeted Metabolomics: State of the Art in 2015. Front Bioeng Biotechnol. 2015;3. doi: 10.3389/fbioe.2015.00003
  • 3. Suvitaival T, Rogers S, Kaski S. Stronger findings from mass spectral data through multi-peak modeling. BMC Bioinformatics. 2014;15: 208. doi: 10.1186/1471-2105-15-208
  • 4. Suvitaival T, Rogers S, Kaski S. Stronger findings for metabolomics through Bayesian modeling of multiple peaks and compound correlations. Bioinformatics. 2014;30: i461–i467. doi: 10.1093/bioinformatics/btu455
  • 5. Zhan X, Patterson AD, Ghosh D. Kernel approaches for differential expression analysis of mass spectrometry-based metabolomics data. BMC Bioinformatics. 2015;16. doi: 10.1186/s12859-014-0426-7
  • 6. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28: 27–30. doi: 10.1093/nar/28.1.27
  • 7. Wishart DS, Tzur D, Knox C, Eisner R, Guo AC, Young N, et al. HMDB: the Human Metabolome Database. Nucleic Acids Res. 2007;35: D521–D526. doi: 10.1093/nar/gkl923
  • 8. Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011;39: D691–D697. doi: 10.1093/nar/gkq1018
  • 9. Türei D, Korcsmáros T, Saez-Rodriguez J. OmniPath: guidelines and gateway for literature-curated signaling pathway resources. Nature Methods. 2016;13: 966–967. doi: 10.1038/nmeth.4077
  • 10. Slenter DN, Kutmon M, Hanspers K, Riutta A, Windsor J, Nunes N, et al. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 2018;46: D661–D667. doi: 10.1093/nar/gkx1064
  • 11. Liu D, Lin X, Ghosh D. Semiparametric Regression of Multidimensional Genetic Pathway Data: Least-Squares Kernel Machines and Linear Mixed Models. Biometrics. 2007;63: 1079–1088. doi: 10.1111/j.1541-0420.2007.00799.x
  • 12. Liu D, Ghosh D, Lin X. Estimation and testing for the effect of a genetic pathway on a disease outcome using logistic kernel machine regression via logistic mixed models. BMC Bioinformatics. 2008;9: 292. doi: 10.1186/1471-2105-9-292
  • 13. Broadaway KA, Cutler DJ, Duncan R, Moore JL, Ware EB, Jhun MA, et al. A Statistical Approach for Testing Cross-Phenotype Effects of Rare Variants. The American Journal of Human Genetics. 2016;98: 525–540. doi: 10.1016/j.ajhg.2016.01.017
  • 14. Zhao N, Chen J, Carroll IM, Ringel-Kulka T, Epstein MP, Zhou H, et al. Testing in Microbiome-Profiling Studies with MiRKAT, the Microbiome Regression-Based Kernel Association Test. The American Journal of Human Genetics. 2015;96: 797–807. doi: 10.1016/j.ajhg.2015.04.003
  • 15. Jensen AM, Tregellas JR, Sutton B, Xing F, Ghosh D. Kernel machine tests of association between brain networks and phenotypes. PLoS One. 2019;14. doi: 10.1371/journal.pone.0199340
  • 16. Chaleckis R, Meister I, Zhang P, Wheelock CE. Challenges, progress and promises of metabolite annotation for LC–MS-based metabolomics. Current Opinion in Biotechnology. 2019;55: 44–50. doi: 10.1016/j.copbio.2018.07.010
  • 17. Amini AA, Chen A, Bickel PJ, Levina E. Pseudo-Likelihood Methods for Community Detection in Large Sparse Networks. Ann Stat. 2013;41. doi: 10.1214/13-Aos1138
  • 18. Le CM, Levina E, Vershynin R. Concentration and regularization of random graphs. Random Structures & Algorithms. 2017;51: 538–561. doi: 10.1002/rsa.20713
  • 19. Schaid DJ. Genomic Similarity and Kernel Methods II: Methods for Genomic Information. Hum Hered. 2010;70: 132–140. doi: 10.1159/000312643
  • 20. Freytag S, Manitz J, Schlather M, Kneib T, Amos CI, Risch A, et al. A Network-Based Kernel Machine Test for the Identification of Risk Pathways in Genome-Wide Association Studies. Hum Hered. 2013;76: 64–75. doi: 10.1159/000357567
  • 21. Manica M, Cadow J, Mathis R, Rodríguez Martínez M. PIMKL: Pathway-Induced Multiple Kernel Learning. npj Systems Biology and Applications. 2019;5: 1–8. doi: 10.1038/s41540-018-0079-7
  • 22. Regan EA, Hokanson JE, Murphy JR, Make B, Lynch DA, Beaty TH, et al. Genetic epidemiology of COPD (COPDGene) study design. COPD. 2010;7: 32–43. doi: 10.3109/15412550903499522
  • 23. Chen J, Chen W, Zhao N, Wu MC, Schaid DJ. Small Sample Kernel Association Tests for Human Genetic and Microbiome Association Studies. Genetic Epidemiology. 2016;40: 5–19. doi: 10.1002/gepi.21934
  • 24. Kolaczyk ED. Statistical Analysis of Network Data. New York: Springer-Verlag New York; 2009. doi: 10.1103/PhysRevE.79.061916
  • 25. Smola AJ, Kondor R. Kernels and Regularization on Graphs. In: Schölkopf B, Warmuth MK, editors. Learning Theory and Kernel Machines. Berlin, Heidelberg: Springer Berlin Heidelberg; 2003. pp. 144–158. doi: 10.1007/978-3-540-45167-9_12
  • 26. Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert J-P. Classification of microarray data using gene networks. BMC Bioinformatics. 2007;8: 35. doi: 10.1186/1471-2105-8-35
  • 27. Davies RB. The distribution of a linear combination of χ2 random variables. J R Stat Soc Series C (Appl Stat). 1980;29: 323–333.
  • 28. Shen Y, Zhu J. Power analysis of principal components regression in genetic association studies. J Zhejiang Univ Sci B. 2009;10: 721–730. doi: 10.1631/jzus.B0830866
  • 29. Simes RJ. An Improved Bonferroni Procedure for multiple tests of significance. Biometrika. 1986;73: 751–754.
  • 30. Ha SS, Kim I, Wang Y, Xuan J. Applications of Different Weighting Schemes to Improve Pathway-Based Analysis. Comp Funct Genomics. 2011;2011. doi: 10.1155/2011/463645
  • 31. Kim I, Pang H, Zhao H. Bayesian semiparametric regression models for evaluating pathway effects on continuous and binary clinical outcomes. Stat Med. 2012;31: 1633–1651. doi: 10.1002/sim.4493
  • 32. Kim I, Pang H, Zhao H. Statistical properties on semiparametric regression for evaluating pathway effects. J Stat Plan Inference. 2013;143: 745–763. doi: 10.1016/j.jspi.2012.09.009
  • 33. Goeman JJ, Buhlmann P. Analyzing gene expression data in terms of gene sets: methodological issues. Bioinformatics. 2007;23: 980–987. doi: 10.1093/bioinformatics/btm051
  • 34. Huang Q, Hu D, Wang X, Chen Y, Wu Y, Pan L, et al. The modification of indoor PM2.5 exposure to chronic obstructive pulmonary disease in Chinese elderly people: A meet-in-metabolite analysis. Environment International. 2018;121: 1243–1252. doi: 10.1016/j.envint.2018.10.046
  • 35. Kelly RS, Virkud Y, Giorgio R, Celedón JC, Weiss ST, Lasky-Su J. Metabolomic profiling of lung function in Costa-Rican children with asthma. Biochimica et Biophysica Acta (BBA)—Molecular Basis of Disease. 2017;1863: 1590–1595. doi: 10.1016/j.bbadis.2017.02.006
  • 36. Li X, Cheng J, Shen Y, Chen J, Wang T, Wen F, et al. Metabolomic analysis of lung cancer patients with chronic obstructive pulmonary disease using gas chromatography-mass spectrometry. Journal of Pharmaceutical and Biomedical Analysis. 2020;190: 113524. doi: 10.1016/j.jpba.2020.113524
  • 37. Cruickshank-Quinn CI, Jacobson S, Hughes G, Powell RL, Petrache I, Kechris K, et al. Metabolomics and transcriptomics pathway approach reveals outcome-specific perturbations in COPD. Sci Rep. 2018;8. doi: 10.1038/s41598-017-18329-3
  • 38. Chai AB, Ammit AJ, Gelissen IC. Examining the role of ABC lipid transporters in pulmonary lipid homeostasis and inflammation. Respir Res. 2017;18. doi: 10.1186/s12931-017-0503-3
  • 39. Ruzsics I, Nagy L, Keki S, Sarosi V, Illes B, Illes Z, et al. L-Arginine Pathway in COPD Patients with Acute Exacerbation: A New Potential Biomarker. COPD: Journal of Chronic Obstructive Pulmonary Disease. 2016;13: 139–145. doi: 10.3109/15412555.2015.1045973
  • 40. Scott JA, Duongh M, Young AW, Subbarao P, Gauvreau GM, Grasemann H. Asymmetric Dimethylarginine in Chronic Obstructive Pulmonary Disease (ADMA in COPD). Int J Mol Sci. 2014;15: 6062–6071. doi: 10.3390/ijms15046062
  • 41. Langfelder P, Horvath S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008;9: 559. doi: 10.1186/1471-2105-9-559
  • 42. Langfelder P, Cantle JP, Chatzopoulou D, Wang N, Gao F, Al-Ramahi I, et al. Integrated genomics and proteomics define huntingtin CAG length–dependent networks in mice. Nat Neurosci. 2016;19: 623–633. doi: 10.1038/nn.4256
  • 43. Shirasaki DI, Greiner ER, Al-Ramahi I, Gray M, Boontheung P, Geschwind DH, et al. Network Organization of the Huntingtin Proteomic Interactome in Mammalian Brain. Neuron. 2012;75: 41–57. doi: 10.1016/j.neuron.2012.05.024
  • 44. Zhang G, He P, Tan H, Budhu A, Gaedcke J, Ghadimi BM, et al. Integration of Metabolomics and Transcriptomics Revealed a Fatty Acid Network Exerting Growth Inhibitory Effects in Human Pancreatic Cancer. Clin Cancer Res. 2013;19: 4983–4993. doi: 10.1158/1078-0432.CCR-13-0209
  • 45. Mamdani M, Williamson V, McMichael GO, Blevins T, Aliev F, Adkins A, et al. Integrating mRNA and miRNA Weighted Gene Co-Expression Networks with eQTLs in the Nucleus Accumbens of Subjects with Alcohol Dependence. PLOS ONE. 2015;10: e0137671. doi: 10.1371/journal.pone.0137671
  • 46. Dobra A, Hans C, Jones B, Nevins JR, Yao G, West M. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis. 2004;90: 196–212. doi: 10.1016/j.jmva.2004.02.009
  • 47. Shi WJ, Zhuang Y, Russell PH, Hobbs BD, Parker MM, Castaldi PJ, et al. Unsupervised discovery of phenotype-specific multi-omics networks. Bioinformatics. 2019;35: 4336–4343. doi: 10.1093/bioinformatics/btz226
  • 48.Chai LE, Loh SK, Low ST, Mohamad MS, Deris S, Zakaria Z. A review on the computational approaches for gene regulatory network construction. Computers in Biology and Medicine. 2014;48: 55–65. doi: 10.1016/j.compbiomed.2014.02.011 [DOI] [PubMed] [Google Scholar]
  • 49.Seoane JA, Day INM, Gaunt TR, Campbell C. A pathway-based data integration framework for prediction of disease progression. Bioinformatics. 2014;30: 838–845. doi: 10.1093/bioinformatics/btt610 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Larson NB, Chen J, Schaid DJ. A review of kernel methods for genetic association studies. Genetic Epidemiology. 2019;43: 122–136. doi: 10.1002/gepi.22180 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Karoui NE. The spectrum of kernel random matrices. Ann Statist. 2010;38. doi: 10.1214/08-AOS648 [DOI] [Google Scholar]
  • 52.Bernhard Schölkopf, Alexander J. Smola. Learning with Kernels. Massachusetts Institute of Technology; 2002. [Google Scholar]
  • 53.Cristianini Nello, John Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press; 2000. Available: Http://www.cambridge.org [Google Scholar]
  • 54.Chung Fan, Graham. Spectral Graph Theory. 1997. [Google Scholar]
  • 55.Purdom E. Analysis of a data matrix and a graph: Metagenomic data and the phylogenetic tree. Ann Appl Stat. 2011;5: 2326–2358. doi: 10.1214/10-AOAS402 [DOI] [Google Scholar]
  • 56.Kondor RI, Lafferty J. Diffusion Kernels on Graphs and Other Discrete Input Spaces.: 8. [Google Scholar]
  • 57.R Core Team. R: A language and environment for statistical computing. 2019. Available: https://www.R-project.org/ [Google Scholar]
  • 58.Csardi G, Nepusz T. The igraph software package for complex network research.: 9. [Google Scholar]
  • 59.Barabási A-L, Albert R. Emergence of Scaling in Random Networks. Science. 1999;286: 509–512. doi: 10.1126/science.286.5439.509 [DOI] [PubMed] [Google Scholar]
  • 60.Danaher P, Wang P, Witten DM. The joint graphical lasso for inverse covariance estimation across multiple classes. J R Stat Soc Series B Stat Methodol. 2014;76: 373–397. doi: 10.1111/rssb.12033 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Shaddox E, Peterson CB, Stingo FC, Hanania NA, Cruickshank-Quinn C, Kechris K, et al. Bayesian inference of networks across multiple sample groups and data types. Biostatistics. 2020;21: 561–576. doi: 10.1093/biostatistics/kxy078 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Gillenwater LA, Pratte KA, Hobbs BD, Cho MH, Zhuang Y, Halper-Stromberg E, et al. Plasma Metabolomic Signatures of Chronic Obstructive Pulmonary Disease and the Impact of Genetic Variants on Phenotype-Driven Modules. Network and Systems Medicine. 2020;3: 159–181. doi: 10.1089/nsm.2020.0009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Gillenwater LA, Kechris KJ, Pratte KA, Reisdorph N, Petrache I, Labaki WW, et al. Metabolomic Profiling Reveals Sex Specific Associations with Chronic Obstructive Pulmonary Disease and Emphysema. Metabolites. 2021;11. doi: 10.3390/metabo11030161 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Bijlsma S, Bobeldijk I, Verheij ER, Ramaker R, Kochhar S, Macdonald IA, et al. Large-scale human metabolomics studies: a strategy for data (pre-) processing and validation. Anal Chem. 2006;78: 567–574. doi: 10.1021/ac051495j [DOI] [PubMed] [Google Scholar]
  • 65.Hastie Trevor, Tibshirani Robert, Narasimhan Balasubramanian, Chu Gilbert. impute: Imputation for microarray data. Available: https://www.bioconductor.org/packages/release/bioc/html/impute.html [Google Scholar]
  • 66.Tenenbaum D. KEGGREST: Client-side REST access to KEGG. Available: https://bioconductor.riken.jp/packages/3.0/bioc/html/KEGGREST.html [Google Scholar]
  • 67.Li K, Gao Y, Pan Z, Jia X, Yan Y, Min X, et al. Influence of Emphysema and Air Trapping Heterogeneity on Pulmonary Function in Patients with COPD. Int J Chron Obstruct Pulmon Dis. 2019;14: 2863–2872. doi: 10.2147/COPD.S221684 [DOI] [PMC free article] [PubMed] [Google Scholar]
PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008986.r001

Decision Letter 0

Mark Alber, Carl Herrmann

30 May 2021

Dear Carpenter,

Thank you very much for submitting your manuscript "PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by two independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

As you will see, the reviewers find the method of potential interest, but have raised several questions regarding the details of the methods, which you should clarify. One reviewer has also suggested some changes in the method that we would ask you to consider, in order to handle the Type 1 error rate. In addition, comparison with alternative methods should be strengthened.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Carl Herrmann, Ph.D.

Associate Editor

PLOS Computational Biology

Mark Alber

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this paper the authors proposed a novel approach for pathway-based kernel test with application to metabolomics data. A new idea is to use a smoothed graph Laplacian matrix based on a likely incomplete graph or network for some metabolites (or genes or …) in a pathway. Through simulated and real data examples, the new test was shown to improve over using a standard graph Laplacian and not using any graph/network information. The proposed method has potential to be useful in practice.

Main comments.

1. Would the author explain why the proposed method works? In particular, the main idea is to replace h(Z) by h(ZL_R) in eq (6); why?

2. I am really surprised by how poorly using the standard Laplacian L performed in Figs 1 and 2, much worse than not using any network information. On the other hand, presumably all the information from L_R is more or less from L. How to explain this? Did you just replace h(Z) by h(ZL) in eq (6)? If so, then there may be a problem: By eq (5), it seems that L_R is in the scale of the (generalized) inverse of L, so perhaps a generalized inverse of L should be used?

3. It would be useful to consider a situation in which there is indeed NO information from a network/graph: how do the proposed and other methods perform?

4. In eq (5), how was τ selected in your simulations and real data examples? What were its typical values?

5. Table 1 Caption is hard to understand: if “No nodes or edges were dropped for these simulations”, what are the meanings of “Perfect” vs “Partial 10%”?

6. Author summary: “It is common for nodes (e.g. metabolites) to be disconnected from all others within the graph, which leads to meaningful decreases in testing power whether or not the graph information is included.”

i) If no graph info is used, why does it matter whether a node is connected or not with others?

ii) If in truth a node is NOT connected with other nodes, compared to what will the power decrease?

Minor comments: There are a few statements I do not understand; below are some examples.

1. “Disjoint nodes are of concern for graph-structured data since many techniques for inference on graph-structured data involve spectral decomposition and require full rank matrices from a fully connected graph.” I am not sure many methods require this: e.g. most spectral graph-based methods do NOT require a Laplacian to be of full rank.

2. “This method and others like it are tailored to genomic data and not applicable to other omics data.” Why NOT, given that “Omics data (metabolomics, genomics, etc.) can often …”?

Reviewer #2: This paper concerns the development of a pathway based test for metabolomics data. The authors specifically choose to use a kernel testing framework and propose to incorporate known pathway structure into the test by prescaling the data by the graph Laplacian. The authors make particular emphasis on the situation in which there is sparsity in the data. Overall, the work seems reasonable and the authors have been quite thorough in their investigations. I particularly like the careful consideration of the impact of incorrect network structure. This has the potential to be a nice contribution to the literature. However, I do have a few comments which follow below in no particular order.

1. The type I error rate of the kernel test appears to be quite inflated in the tables, particularly for larger pathways. In many cases, the type I error rate is 0.06, which is 20% above the nominal 0.05 level and probably worse at lower alpha levels. I believe that this inflated false positive rate may be partially due to the outdated Satterthwaite method used in the authors' approach. The kernel testing literature has long left this approach behind. More recent developments include using the Davies approach, saddlepoint approximations, and other small sample approximations that work better with full rank kernels, e.g. the Gaussian. These approaches should be incorporated to correct the unacceptably high type I error.

2. It would be helpful if the authors could provide additional insight as to *why* incorporation of prior knowledge (graph structure) is useful in this case. Work by Schaid (upon which Freytag builds, I recall) emphasizes the idea of “guilt by association”. Is that a similar idea here?

3. While kernel-based approaches appear generally reasonable for this context, I think it is important to note that the hypothesis in this case is a global null: that is, if even a single metabolite is associated with the outcome then the entire pathway would be considered significant (see Goeman and Buhlmann for a technical discussion). It should be noted that some investigators find that this violates the spirit of a pathway analysis, but I do not feel strongly here other than it should be noted.

4. The discussion of comparisons with 2-step approaches ignores the fact that the inherent hypotheses are different, which is an important distinction. Notably, most 2-step enrichment approaches are statistically invalid (Goeman and Buhlmann).

5. There is a distinct lack of comparison with competing methods. The paper is entirely kernel test centric without consideration for competing approaches. Simple strategies would be using the top PC(s), minimum p-value approaches, or tail strength strategies.

6. Instead of using prior knowledge, I imagine that another strategy would be to estimate network structure directly from the data, e.g. through the graphical Lasso. This approach is taken by Zhan et al. This strategy would be different in the sense that connectivity is not due to a particular biological cascade but rather correlation in the data. I would be curious to see how this performs.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008986.r003

Decision Letter 1

Mark Alber, Carl Herrmann

28 Aug 2021

Dear Carpenter,

Thank you very much for submitting your manuscript "PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes" for consideration at PLOS Computational Biology.

As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments.

While Reviewer 1 is satisfied with the revisions to his/her comments, Reviewer 2 still feels that there is a lack of justification as to why the method is working. He/she feels that the additions made to the methods section are not sufficient to clearly illustrate this. As this is at the heart of your method, I would recommend that you add more explanations (1) in your response to the reviewer, and (2) in the results/discussion, in addition to the methods.

We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Please prepare and submit your revised manuscript within 60 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email. Please note that revised manuscripts received after the 60-day due date may require evaluation and peer review similar to newly submitted manuscripts.

Thank you again for your submission. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Carl Herrmann, Ph.D.

Associate Editor

PLOS Computational Biology

Mark Alber

Deputy Editor

PLOS Computational Biology

***********************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I appreciate the authors' responses to my previous comments.

Reviewer #2: The authors have addressed several of my concerns and I’m glad that they were able to change the p-value calculation approach.

However, I remain unclear as to the justification for the proposed approach. The “additions” made are not responsive to my question (with similar concern raised by other reviews). I still fail to understand *why* their approach works. The authors need to make a more solid conjecture with evidence to support their conjecture, by way of simulations or proof.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: None

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008986.r005

Decision Letter 2

Mark Alber, Carl Herrmann

13 Oct 2021

Dear Carpenter,

We are pleased to inform you that your manuscript 'PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology. 

Best regards,

Carl Herrmann, Ph.D.

Associate Editor

PLOS Computational Biology

Mark Alber

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #2: The authors have sufficiently addressed my concerns.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1008986.r006

Acceptance letter

Mark Alber, Carl Herrmann

15 Oct 2021

PCOMPBIOL-D-21-00744R2

PaIRKAT: A pathway integrated regression-based kernel association test with applications to metabolomics and COPD phenotypes

Dear Dr Carpenter,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Livia Horvath

PLOS Computational Biology | Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom ploscompbiol@plos.org | Phone +44 (0) 1223-442824 | ploscompbiol.org | @PLOSCompBiol

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Associations between metabolite subsets and log FEV1/FVC ratio.

    Average p-values from kernel regression tests that do not include pathway information (No Laplacian, red circles), include pathway information through a normalized Laplacian ($\tilde{L}$, green triangles), and include pathway information through a regularized normalized Laplacian ($\tilde{L}_R = (I + \tau\tilde{L})^{-1}$, blue squares) are displayed. P-values were averaged over 100 random subsets of size 100, 200, 300, 400, and 500 from the COPDGene dataset. $\tau$ was set to 1 for all tests that used $\tilde{L}_R$.

    (TIF)

    S2 Fig. Associations between metabolite subsets and percent emphysema.

    Average p-values from kernel regression tests that do not include pathway information (No Laplacian, red circles), include pathway information through a normalized Laplacian ($\tilde{L}$, green triangles), and include pathway information through a regularized normalized Laplacian ($\tilde{L}_R = (I + \tau\tilde{L})^{-1}$, blue squares) are displayed. P-values were averaged over 100 random subsets of size 100, 200, 300, 400, and 500 from the COPDGene dataset. $\tau$ was set to 1 for all tests that used $\tilde{L}_R$.

    (TIF)
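A small worked example may help make the regularized Laplacian in the S1 and S2 Fig captions concrete. It assumes the standard normalized Laplacian $\tilde{L} = I - D^{-1/2} A D^{-1/2}$ (with zero rows and columns for isolated nodes), $\tau = 1$ as in the captions, and a hypothetical three-node pathway in which nodes 1 and 2 share an edge and node 3 is disconnected.

```latex
\tilde{L} =
\begin{pmatrix} 1 & -1 & 0 \\ -1 & 1 & 0 \\ 0 & 0 & 0 \end{pmatrix},
\qquad
I + \tau\tilde{L} =
\begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
\tilde{L}_R = (I + \tau\tilde{L})^{-1} =
\begin{pmatrix} 2/3 & 1/3 & 0 \\ 1/3 & 2/3 & 0 \\ 0 & 0 & 1 \end{pmatrix}.
```

In this toy case $\tilde{L}$ has rank 1 and zeroes out the disconnected node, whereas $\tilde{L}_R$ is full rank: it leaves node 3 unchanged and replaces nodes 1 and 2 with weighted averages of the connected pair. Because the eigenvalues of $\tilde{L}$ are nonnegative, $I + \tau\tilde{L}$ is invertible for any $\tau > 0$.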

    S3 Fig. Example graphs with high, medium, and low edge densities.

    Low-density graphs were generated according to the Barabási-Albert model for graph simulation. Medium- and high-density graphs were generated by giving each pair of nodes without a direct edge either a 5% or 15% chance of becoming connected, respectively.

    (TIF)
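The graph construction described for S3 Fig (and reused in S2 and S3 Tables) can be sketched as follows. This is a minimal sketch in Python rather than the authors' R code; the function names are hypothetical, and attaching each new node with a single edge is an assumption, chosen because it yields an edge density of 2/n, which matches the low-density values reported in S2 and S3 Tables (0.13, 0.07, and 0.04 for 15, 30, and 45 nodes).

```python
import random
from itertools import combinations


def barabasi_albert_tree(n, rng):
    """Grow a graph by preferential attachment, one edge per new node (assumed)."""
    edges = {(0, 1)}
    # Each node appears in `targets` once per incident edge, so a uniform
    # draw from it selects a node with probability proportional to its degree.
    targets = [0, 1]
    for new in range(2, n):
        old = rng.choice(targets)
        edges.add((old, new))  # old < new always holds here
        targets.extend([old, new])
    return edges


def densify(edges, n, p, rng):
    """Give every node pair without a direct edge an independent chance p of one."""
    dense = set(edges)
    for pair in combinations(range(n), 2):
        if pair not in edges and rng.random() < p:
            dense.add(pair)
    return dense


def edge_density(edges, n):
    """Fraction of the n*(n-1)/2 possible undirected edges that are present."""
    return len(edges) / (n * (n - 1) / 2)


rng = random.Random(1)
for n in (15, 30, 45):
    low = barabasi_albert_tree(n, rng)
    medium = densify(low, n, 0.05, rng)
    high = densify(low, n, 0.15, rng)
    print(n,
          round(edge_density(low, n), 2),
          round(edge_density(medium, n), 2),
          round(edge_density(high, n), 2))
```

Under this assumption the low-density column is fixed at 2/n regardless of the seed; the medium- and high-density columns vary from run to run around d + p(1 - d), where d is the low density and p is 0.05 or 0.15.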

    S4 Fig. Example of a graph with missing nodes.

    Graphs were generated according to the Barabási-Albert model. Then any node with degree below the 25th percentile of degrees within the graph had a 25% chance of being dropped.

    (TIF)
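The node-dropping rule for S4 Fig could look like the sketch below. It is an illustration in Python, not the authors' implementation: the function names and the small example graph are hypothetical, and both the strict comparison (degree < 25th percentile) and the linear-interpolation percentile are assumptions, since the caption does not specify them.

```python
import random
import statistics

# Hypothetical 8-node example graph, stored as undirected (u, v) pairs.
EXAMPLE_EDGES = {
    (0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (1, 3),
    (2, 3), (4, 5), (4, 6), (5, 6), (6, 7),
}


def node_degrees(edges, n):
    """Count the edges incident to each of the n nodes."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def drop_low_degree_nodes(edges, n, rng, drop_prob=0.25, strict=True):
    """Return (kept_nodes, kept_edges) after randomly dropping low-degree nodes."""
    deg = node_degrees(edges, n)
    # 25th percentile of the degrees, using linear interpolation (assumed).
    cutoff = statistics.quantiles(deg, n=4, method="inclusive")[0]

    def is_low(d):
        return d < cutoff if strict else d <= cutoff

    # Each low-degree node is dropped independently with probability drop_prob.
    dropped = {v for v in range(n) if is_low(deg[v]) and rng.random() < drop_prob}
    kept_nodes = [v for v in range(n) if v not in dropped]
    kept_edges = {(u, v) for u, v in edges
                  if u not in dropped and v not in dropped}
    return kept_nodes, kept_edges


rng = random.Random(7)
nodes, kept = drop_low_degree_nodes(EXAMPLE_EDGES, 8, rng)
print(nodes, len(kept))
```

When many nodes share the minimum degree (as in a tree-like graph with many leaves), the 25th percentile can equal that minimum, and a strict comparison then drops nothing; `strict=False` switches to degree <= percentile for that case.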

    S1 Table. Type 1 error rates using complete pathway.

    Error rates were calculated from score tests on 1000 simulated data sets. All simulations used graphs with 15, 30, or 45 nodes. No nodes or edges were dropped for these simulations. Pathway information was included in the kernel score test through the normalized Laplacian $\tilde{L}$.

    (XLSX)

    S2 Table. Type 1 error rates using pathways with 5% missing edges.

    Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of medium edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabási-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. Medium edge density graphs are created by giving any 2 nodes without a direct edge between them a 5% chance of becoming directly connected. This creates graphs with an average edge density of 0.18, 0.11, and 0.09 for graphs with 15, 30, and 45 nodes, respectively. Pathway information was included in the kernel score test through the normalized Laplacian $\tilde{L}$.

    (XLSX)

    S3 Table. Type 1 error rates using pathways with 15% missing edges.

    Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes. The graph used to simulate Z and Y was of high edge density, while the graph used to test was of low density. The low-density graphs are drawn from the Barabási-Albert model with edge density 0.13, 0.07, and 0.04 for graphs with 15, 30, and 45 nodes, respectively. High edge density graphs are created by giving any 2 nodes without a direct edge between them a 15% chance of becoming directly connected. This creates graphs with an average edge density of 0.26, 0.21, and 0.19 for graphs with 15, 30, and 45 nodes, respectively. Pathway information was included in the kernel score test through the normalized Laplacian $\tilde{L}$.

    (XLSX)

    S4 Table. Type 1 error rates using pathways with dropped nodes.

    Error rates were calculated from score tests on 1000 simulated data sets using graphs with 15, 30, or 45 nodes initially. The graph used to simulate Z and Y contained all nodes. Nodes with degree below the 25th percentile within a graph had a 25% chance of being dropped before testing. Pathway information was included in the kernel score test through the normalized Laplacian $\tilde{L}$.

    (XLSX)

    Attachment

    Submitted filename: PaIRKAT_Rebuttal.docx

    Attachment

    Submitted filename: PaIRKAT_Re_Rebuttal.docx

    Data Availability Statement

    The R code for simulations and an example workflow with source code is available at https://github.com/CharlieCarpenter/PaIRKAT. A shiny app can be found at https://csevern.shinyapps.io/pairkat/. The metabolomics data set from the COPDGene Study can be found through PMID: 20214461 or DOI: 10.3109/15412550903499522.


    Articles from PLoS Computational Biology are provided here courtesy of PLOS
