Statistical inference of genetic pathway analysis in high dimensions

Yang Liu; Wei Sun; Alexander P Reiner; Charles Kooperberg; Qianchuan He

doi:10.1093/biomet/asz033

Biometrika. 2019 Jul 13;106(3):651. doi: 10.1093/biomet/asz033


PMCID: PMC6690174 PMID: 31427824

Summary

Genetic pathway analysis has become an important tool for investigating the association between a group of genetic variants and traits. With dense genotyping and extensive imputation, the number of genetic variants in biological pathways has increased considerably and sometimes exceeds the sample size $n$. Conducting genetic pathway analysis and statistical inference in such settings is challenging. We introduce an approach that can handle pathways whose dimension could be greater than $n$. Our method can be used to detect pathways that have nonsparse weak signals, as well as pathways that have sparse but stronger signals. We establish the asymptotic distribution for the proposed statistic and conduct theoretical analysis on its power. Simulation studies show that our test has correct Type I error control and is more powerful than existing approaches. An application to a genome-wide association study of high-density lipoproteins demonstrates the proposed approach.

Keywords: Genetic pathway analysis, Genetic variant, High-dimensional inference, Nonsparse signal, Power analysis, Sparse signal

1. Introduction

Genetic association analysis plays an important role in identifying genetic variants that are associated with traits. Genetic variants are often analysed by single-variant-based methods, using approaches such as Armitage’s trend test. Pathway-based analysis has become a popular tool for analysing genetic variant data (Chen et al., 2011b), whereby multiple genetic variants in the genes in a prespecified pathway are examined. There are several reasons to consider pathway analysis for association studies. First, pathways are generally defined using biological knowledge and thus are more likely to be functionally relevant (Zhong et al., 2010). Second, by analysing multiple variants simultaneously, pathway analysis has the potential to accumulate weak signals into stronger ones, while single-variant-based methods lack power in such a situation. Third, because the number of pathways is much smaller than the number of variants, the multiple-testing burden can be dramatically reduced.

One of the main challenges in pathway analysis is to deal with the high dimensionality. With increasingly dense genotyping and extensive imputation, the number of variants $p$ in genetic pathways has grown so rapidly that it can be larger than the sample size $n$. This is seen in our real-data example, where the sample size is around 4000 while the number of single nucleotide polymorphisms in a pathway can be as large as 25 000. In such high dimensions, statistical testing methods that were designed for moderate $p$, such as the likelihood ratio test, tend to have low power or may be inapplicable. To deal with the high dimensionality in pathway analysis, one potential approach is the burden test (Morgenthaler & Thilly, 2007), in which one simply sums the genotypes into a single predictor and then subjects this predictor to regression analysis. The burden test works well if all the variants have similar effect sizes, but this assumption rarely holds in real situations. Another common approach to dealing with high dimensions is to use principal component analysis in the regression modelling. One first derives the principal components from the genetic pathway under consideration, and then uses the leading components for association analysis (Buas et al., 2017). The disadvantages of this approach are that principal components with large variations need not be associated with the traits; it is rarely clear how many principal components to include; the interpretation of the regression coefficients can be difficult; and when $p > n$, the estimated principal components may not be consistent (Shen et al., 2016). Complementary to the aforementioned approaches, kernel machine methods such as the sequence kernel association test (Wu et al., 2011) can also be applied to genetic pathway analysis. However, the latter test has been used primarily to analyse moderate-sized variant sets, and its performance in cases where $p$ is substantially larger is unclear. Other methods that have been developed for testing a group of genetic features in high-dimensional settings (Chen & Qin, 2010; Chen et al., 2011a; Gregory et al., 2015) focus on testing the mean difference between two groups rather than conducting association analysis.
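To make the burden approach concrete, the following minimal Python sketch collapses each subject's genotypes into a single score and regresses the trait on that score. The function name and the toy data are illustrative assumptions, and no covariates other than the intercept are adjusted for; it is not the implementation of Morgenthaler & Thilly (2007).

```python
import math

def burden_t_statistic(y, G_rows):
    """t-statistic for the slope when regressing y on the burden score.

    y      : list of n trait values
    G_rows : list of n rows, each holding one subject's genotypes (0, 1 or 2)
    Only an intercept is adjusted for (a simplifying assumption).
    """
    n = len(y)
    burden = [sum(row) for row in G_rows]      # one predictor per subject
    mb = sum(burden) / n
    my = sum(y) / n
    sxx = sum((b - mb) ** 2 for b in burden)
    sxy = sum((b - mb) * (v - my) for b, v in zip(burden, y))
    syy = sum((v - my) ** 2 for v in y)
    slope = sxy / sxx
    resid_var = (syy - slope * sxy) / (n - 2)  # residual variance estimate
    return slope / math.sqrt(resid_var / sxx)

# Toy data (hypothetical): four subjects, two variants.
G_toy = [[0, 0], [1, 0], [1, 1], [2, 1]]
y_toy = [0.0, 1.0, 1.0, 3.0]
t_toy = burden_t_statistic(y_toy, G_toy)
```

Because the score gives every variant the same weight and sign, effects of opposite sign can cancel in the sum, which is one way to see why the test relies on the variants having similar effect sizes.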

In addition to the high-dimensional challenge, another difficulty in pathway analysis is power maximization under multiple plausible alternative hypotheses. For pathway analysis, the alternative hypothesis concerns both the number and the magnitudes of the nonzero genetic signals, which are generally unknown (Zhang, 2015). A situation often considered for genetic signals is that a pathway harbours potentially many variants with weak effects, called the nonsparse-signal situation. The sequence kernel association test can aggregate multiple signals and is potentially applicable to such a setting. Another possibility is that a genetic pathway contains only a few strong signals, called the sparse-signal situation. Several methods have been proposed to deal with this case, such as the test of Conneely & Boehnke (2007), which first examines each variant individually and then seeks to obtain the $p$-value for the maximum of the observed statistics. However, that test has little power in the nonsparse situation, while the sequence kernel association test loses power in the sparse situation.

In this paper, we propose a method for conducting high-dimensional genetic pathway analysis, where the dimension $p$ of the pathway can go to infinity and could exceed the sample size $n$. Our approach can be used to identify pathways that harbour a large number of weak signals, i.e., nonsparse signals, as well as genetic pathways that contain only a few strong signals, i.e., sparse signals, or a mixture of weak and strong signals. We establish the asymptotic properties of the proposed statistics in high dimensions and conduct theoretical analysis of their power.

2. Methods

2.1. Model and statistics

Suppose that the data consist of an $n \times 1$ continuous trait vector $Y$, an $n \times q$ adjusting covariates matrix $X$ and an $n \times p$ genotype matrix $G$ for a genetic pathway; that is, the pathway being considered contains $p$ genetic variants. Suppose that the true regression model is

$$Y = X\alpha + G\beta + \varepsilon,$$

where $\alpha$ is the coefficient vector for $X$, with $\alpha_1$ being the intercept, $\beta$ is the coefficient vector for $G$, and $\varepsilon$ is a vector of independent Gaussian errors with mean zero and variance $\sigma^2$. The design matrices $X$ and $G$ are considered fixed. The dimension $q$ of the adjusting covariates is assumed to be finite, while the dimension $p$ of the genotype matrix can go to infinity.

We are interested in testing the global null hypothesis Inline graphic against the alternative . Tests such as the likelihood ratio test and Wald test consider all the variants jointly and tend to perform poorly when is large; the statistics may not exist when . Marginal statistics are easy to calculate and have been widely used to evaluate the significance of each individual variant. Recall that in a marginal analysis, one first fits a regression model for a given variant, say the Inline graphic th, by () and then obtains the marginal score statistic as

with Inline graphic , where is the identity matrix. To conduct a pathway analysis, it is natural to consider the sum of all the squared marginal statistics, In fact, it can be shown that is equivalent to the sequence kernel association test statistic, if the estimator of is ignored in the latter. However, our proposed approach is not focused on Inline graphic per se, but rather uses to develop a suite of statistics for high-dimensional settings, particularly for the case of for a constant .

Under the null hypothesis Inline graphic , it can be shown that and var, where , with a diagonal matrix whose elements are (), and is the Frobenius norm. For the moment we assume that is known, but later on we will address the practical situation where needs to be estimated. We propose to standardize , which yields

(1)

where the superscript Inline graphic emphasizes that both and can go to infinity; it will be suppressed below for ease of notation. Expression (1) suggests that may converge to normality as gets large. However, the central limit theorem does not directly apply here because the are correlated. In fact, the correlation matrix for the Inline graphic , , can be shown to have the form

and it can further be shown that Inline graphic . In Lemma 1 we show that under proper conditions, is standard normal as both and go to infinity.
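The displayed formulas are not reproduced in this text, so the following Python sketch illustrates only the general construction described above: square the marginal statistics, sum them, and standardize by the null mean and variance. It assumes marginal statistics of the form $Z_j = G_j^{\mathrm T} P Y / (G_j^{\mathrm T} P G_j)^{1/2}$, a known error variance, and the intercept as the only adjusting covariate, so that the projection reduces to centring. Under these assumptions the sum has null mean $\sigma^2 p$ and null variance $2\sigma^4 \mathrm{tr}(R^2)$, where $R$ is the correlation matrix of the adjusted variants. All function names are hypothetical, and the exact scaling used in the paper may differ.

```python
import math

def center(v):
    """Project out the intercept by subtracting the mean."""
    m = sum(v) / len(v)
    return [x - m for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def standardized_sum_stat(y, G_cols, sigma2):
    """Standardized sum of squared marginal statistics (illustrative form).

    y      : list of n trait values
    G_cols : list of p genotype columns, each a non-constant list of length n
    sigma2 : error variance, treated as known
    """
    yc = center(y)
    Gc = [center(g) for g in G_cols]
    norms = [math.sqrt(dot(g, g)) for g in Gc]
    # Assumed marginal statistics z_j = G_j' P y / sqrt(G_j' P G_j).
    z = [dot(g, yc) / s for g, s in zip(Gc, norms)]
    p = len(Gc)
    # tr(R^2) is the sum of squared correlations between adjusted variants.
    tr_R2 = 0.0
    for j in range(p):
        for k in range(p):
            r = dot(Gc[j], Gc[k]) / (norms[j] * norms[k])
            tr_R2 += r * r
    total = sum(v * v for v in z)
    return (total - sigma2 * p) / (sigma2 * math.sqrt(2.0 * tr_R2))

# Toy example (hypothetical data): two uncorrelated variants, four subjects.
stat_toy = standardized_sum_stat(
    [1.0, -1.0, 2.0, -2.0],
    [[1, -1, 0, 0], [0, 0, 1, -1]],
    sigma2=1.0,
)
```

Note that the computation never inverts the correlation matrix, which is why correlated variants pose no numerical difficulty even when the number of variants exceeds the sample size.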

Before presenting Lemma 1, we define some notation. For a vector Inline graphic , let be the -norm of the vector for . For any matrix , denote the induced -norm by . When is an matrix, we denote its maximum and minimum eigenvalues by and , respectively.

Lemma 1.

Let . If

(2)

then under , the statistic in distribution.

Remark 1.

Here we have no constraint on the order of with respect to , providing they both go to infinity. Condition (2) is mild for genetic studies. By Hölder’s inequality, , where is the maximum absolute column sum of the matrix. When the correlation structure in is not overly strong, as is the case for the power-decay structure, i.e., for some , then one can show that . Here , the correlation of and , can be interpreted as the linkage disequilibrium of genetic variants after adjusting for covariates ; when there are no adjusting covariates, reduces to the linkage disequilibrium matrix of . The power decay structure indicates that two distant genetic variants have virtually no linkage disequilibrium, which is indeed what is observed in genetic studies, particularly in the human genome data (International HapMap Consortium, 2005). Similar structures have also been used in other articles on genetic studies, such as Dai et al. (2012). Our proposed statistic naturally takes linkage disequilibrium into account, because . The linkage disequilibrium can influence both the denominator and the numerator of , so the impact of the linkage disequilibrium on the power of the proposed test is influenced by the size and density of the genetic signals. However, the linkage disequilibrium will not affect the validity of the test or its asymptotic properties, because the calculation of does not involve inversion of the linkage disequilibrium matrix, and the normality of the proposed statistics requires only that distant variants tend to have linkage disequilibrium approaching zero. In practice, variants in a gene tend to be in linkage disequilibrium, while those for different genes are generally not in linkage disequilibrium; this type of structure is covered in Lemma 1.

So far we have assumed that the noise level $\sigma^2$ is known. To make our proposal practical, it is tempting to replace $\sigma^2$ with a consistent estimator $\hat\sigma^2$. It turns out that the validity of doing so depends on the order of $p$ relative to $n$. In the following, we elaborate on this and propose different statistics to accommodate different ratios $p/n$.

We first consider the situation where Inline graphic for some , i.e., is of smaller order than . The following lemma shows that if we replace with a consistent estimator , normality still holds.

Lemma 2.

Suppose that (2) holds. Let be a root--consistent estimator of such that . Then under , as such that for ,

in distribution.

Next, we consider the situation in which Inline graphic for some constant . The normality of no longer holds because becomes excessively large; see the proof of Lemma 2 for more details. In light of this, we propose a new statistic

(3)

where Inline graphic , with being the number of adjusting covariates as mentioned earlier. The in the numerator of is replaced by in . The motivation behind this is that estimates under . We discovered that this replacement of enables one to overcome the limitation of in high dimensions. The following theorem shows that Inline graphic follows a normal distribution for .

Theorem 1.

Suppose that (2) holds. For any consistent estimator , as such that , if for some constant , then under we have in distribution.

Theorem 1 allows one to conduct statistical inference for pathway analysis when Inline graphic , although should not be excessively larger than . The condition is necessary to prevent the in the denominator equalling zero, as it can be shown that . To obtain a consistent estimator for under in a high-dimensional setting, Fan et al. (2012) proposed a refitted crossvalidation method based on procedures that satisfy the sure screening property. When the sparsity of the model is completely unknown, we can also estimate Inline graphic by the moment-based estimators of Dicker (2014), which are root- consistent when .

2.2. Power loss in the presence of sparse signals

The proposed sum-of-squares statistic can handle situations where the association signals are spread out over a large number of genetic variants. However, its power will be relatively low for the sparse-signal situation, in which a few genetic variants carry strong signals while all the others have zero coefficients. Fan et al. (2015) proposed the power-enhancement principle, the fundamental idea of which is to include a screening statistic that goes to zero under $H_0$ but is nonzero under sparse alternatives. Motivated by this principle, we propose a statistic that strengthens the sum-of-squares statistic and is able to guard against potential power loss in the sparse-signal situation.

We define a screening set Inline graphic where is a threshold chosen to be slightly larger than the maximum estimation error of the marginal estimator, i.e., . Then, a power-enhancement component is

where Inline graphic denotes the sign of . Our statistic that is able to detect both nonsparse and sparse signals is .

Since Inline graphic has the same sign as , always has power at least that of . The threshold needs to ensure that the screening set is empty with probability approaching 1 under , so that the size of will be asymptotically equivalent to that of . Then, under , if an estimator is large enough that is nonempty, one can gain power. For Gaussian and sub-Gaussian errors, Inline graphic can be chosen to be , as suggested by Fan et al. (2015). The power-enhancement procedure in Fan et al. (2015) deals with a consistent estimator under , which is not available in our procedure, while our approach builds upon marginal estimators which are inconsistent under . Nevertheless, the size of our proposed statistic is asymptotically equivalent to that of Inline graphic under ; in the next subsection, we will show that under the sparse alternatives can be powerful even when is not.

Lemma 3.

Under the same conditions as in Theorem 1, if where as , then under the null hypothesis we have in distribution. Thus, the sizes of and are asymptotically equivalent.

To select the threshold in practice, we propose an adaptive procedure to accommodate different correlation structures. We first generate a vector of random errors from the standard normal distribution. Then we compute the maximum of the marginal estimators obtained from these errors. Finally, we repeat these two steps many times and set the threshold based on all the replicates. McKeague & Qian (2015) also used an adaptive approach to determine threshold parameters for high-dimensional testing.
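The adaptive procedure can be sketched in Python as follows. Several details are assumptions made for illustration: the marginal quantity is taken to be the absolute standardized statistic $|G_j^{\mathrm T} P e| / (G_j^{\mathrm T} P G_j)^{1/2}$, only the intercept is adjusted for, and the replicates are combined through a user-chosen quantile of the simulated maxima, with `quantile=1.0` giving their maximum. The text does not specify the combining rule, and the function name is hypothetical.

```python
import math
import random

def adaptive_threshold(G_cols, n_rep=200, quantile=1.0, seed=0):
    """Simulate null maxima of marginal statistics and return a quantile.

    G_cols   : list of p genotype columns, each a non-constant list of length n
    n_rep    : number of simulated error vectors
    quantile : which quantile of the simulated maxima to return (assumption)
    """
    rng = random.Random(seed)
    n = len(G_cols[0])
    # Centre each variant (intercept-only adjustment) and record its norm.
    centred = []
    for g in G_cols:
        m = sum(g) / n
        c = [x - m for x in g]
        centred.append((c, math.sqrt(sum(x * x for x in c))))
    maxima = []
    for _ in range(n_rep):
        # Step 1: a vector of standard normal errors.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Step 2: maximum absolute marginal statistic for these errors.
        # Since each column is centred, c'e equals c'(e - mean(e)).
        stats = [abs(sum(x * y for x, y in zip(c, e))) / s for c, s in centred]
        maxima.append(max(stats))
    # Step 3: combine the replicates.
    maxima.sort()
    idx = min(len(maxima) - 1, max(0, math.ceil(quantile * len(maxima)) - 1))
    return maxima[idx]
```

Because the simulated errors pass through the observed genotype matrix, the resulting threshold reflects the correlation structure of the pathway at hand rather than a worst-case bound.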

2.3. Power analysis

In this subsection we investigate the asymptotic power of the proposed tests Inline graphic and for nonsparse and sparse alternatives. Under , let be the set of nonzero coefficients, and let . Define the subvector and the submatrix . Let be the diagonal matrix with nonzero elements for . Similarly, let , and denote the corresponding quantities for .

The following theorem states that the sum-of-squares type of statistic has high power for the nonsparse-signal situation when the accumulated signals are sufficiently large.

Theorem 2.

Suppose that all the conditions in Theorem 1 hold. Consider a nonsparse alternative in which for a sufficiently large constant . If and for some constants , then as , , where is the -quantile of the standard normal distribution.

While the sum-of-squares statistic can have high power under nonsparse alternatives, it may lose power under sparse alternatives. In the following theorem we show that the power-enhanced statistic, which adds a power-enhancement term to the sum-of-squares statistic, can be powerful under both nonsparse and sparse alternatives.

Theorem 3.

Assume that the conditions in Theorem 2 hold. Consider a sparse alternative in which for a sufficiently large constant . If for some constant , then under either the nonsparse alternative or the sparse alternative , as , .

In practice, we recommend use of the power-enhanced statistic for detecting both weak and strong signals. However, if one wishes to distinguish between the sparse signals and the nonsparse signals, one can compare the values of the power-enhanced statistic and the sum-of-squares statistic. If the former is larger than the latter, then the power-enhancement component is nonzero and there exist strong signals in the pathway. If the two are equal, then there are no strong signals in the pathway and the significance is driven by weak signals.

2.4. Incorporating biological information into the proposed statistics

The proposed statistics give equal weight to all the variants. In some applications, one may wish to assign different weights based on prior information. For example, if the effect of a genetic variant is related to its minor allele frequency, one may assign this variant a weight that is a function of its minor allele frequency. In other cases, one may assign functional scores to different variants to reflect their biological functions. In light of these considerations, we propose incorporating prior biological information into our proposed statistics as follows.

Let Inline graphic () be prespecified positive weights, and let be the diagonal matrix with elements . Next, define and . Let and Similar to , we define a statistic . The following result shows the asymptotic normality of .

Corollary 1.

Suppose that (2) holds and as . Assume that as , . For any consistent estimator , if for some constant , then under in distribution.

As was done for the unweighted statistic, we can add a power-enhancement component to the weighted statistic to guard against potential power loss in the presence of strong signals. Thus, our proposed statistics can readily accommodate prior biological information and still preserve their theoretical properties.

2.5. Edgeworth expansion for extreme significance levels

Genetic studies sometimes involve a large number of pathways, so the significance level can be much lower than 0.05. For example, in our real-data analysis, the significance level is 0.0003. At such levels, the normal distribution in Lemma 3 may be a poor approximation. We therefore propose a two-term Edgeworth expansion to characterize the tail probability of Inline graphic with higher accuracy. Recall that under , . It is known that follows a mixed chi-squared distribution with weights , where are the eigenvalues of . Using the Edgeworth expansion for independent random variables with varying distributions (Feller, 1971, p. 546), we can derive the following two-term expansion for Inline graphic :

(4)

where Inline graphic for . Further, . Then, under the conditions in Theorem 1, the last remainder term in (4) can be shown to be . This expansion tends to be more accurate than the normal approximation, as the remainder term of the normal approximation is typically . Directly calculating (4) involves computing the Inline graphic , which can be onerous when and are large. Instead, we can use the identity for . We call the test that uses (4) to approximate the -value for the test. Similarly, we can apply an Edgeworth expansion to , and we call the resulting test .
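Since expression (4) is not reproduced in this text, the following Python sketch shows a generic Edgeworth correction for the upper tail of a standardized weighted sum of centred chi-squared variables, $\sum_i \lambda_i(\chi^2_{1,i} - 1) / (2\sum_i \lambda_i^2)^{1/2}$, which is the null form described above when the error variance is known. It uses the standard cumulants of such a sum, namely a standardized third cumulant $8\sum_i\lambda_i^3 / (2\sum_i\lambda_i^2)^{3/2}$ and a standardized fourth cumulant $48\sum_i\lambda_i^4 / (2\sum_i\lambda_i^2)^{2}$. The function name, and the choice to pass the power sums directly (which can be obtained from matrix traces via the identity mentioned above, without computing eigenvalues), are assumptions; the exact terms retained in (4) may differ.

```python
import math

def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def normal_upper_tail(x):
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def edgeworth_upper_tail(x, power_sums):
    """Edgeworth-corrected P(T > x) for a standardized mixed chi-squared sum.

    power_sums : dict mapping k to sum_i lambda_i**k for k = 2, 3, 4
    """
    s2, s3, s4 = power_sums[2], power_sums[3], power_sums[4]
    k3 = 8.0 * s3 / (2.0 * s2) ** 1.5      # standardized third cumulant
    k4 = 48.0 * s4 / (2.0 * s2) ** 2       # standardized fourth cumulant
    he2 = x * x - 1.0                       # Hermite polynomials
    he3 = x ** 3 - 3.0 * x
    he5 = x ** 5 - 10.0 * x ** 3 + 15.0 * x
    correction = k3 / 6.0 * he2 + k4 / 24.0 * he3 + k3 * k3 / 72.0 * he5
    return normal_upper_tail(x) + normal_pdf(x) * correction

# Hypothetical example: 1000 equal weights, threshold near the 0.0001 level.
sums_equal = {2: 1000.0, 3: 1000.0, 4: 1000.0}
tail_edgeworth = edgeworth_upper_tail(3.719, sums_equal)
tail_normal = normal_upper_tail(3.719)
```

In this example the corrected tail probability exceeds the normal one because the statistic is right-skewed, which is consistent with the normal approximation being anti-conservative at extreme significance levels.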

3. Simulation studies

Monte Carlo simulations were conducted to evaluate the performance of the two proposed tests in high-dimensional settings and to compare them with the Bonferroni test, the burden test, principal component analysis, and the sequence kernel association test.

We generated the genotype matrix Inline graphic similarly to He et al. (2016). For each person, we first generated a block-diagonal covariance matrix with each block being a matrix . We considered compound symmetric with diagonal elements and off-diagonal elements 0.5 and autoregressive with th off-diagonal element 0.6. Then we trichotomized the simulated vector into genotype values of Inline graphic according to the Hardy–Weinberg equilibrium.
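The trichotomization step can be sketched in Python as follows, assuming that each simulated coordinate is marginally standard normal and that a minor allele frequency `maf` is given for the variant; the values used in the simulations are not stated in this text, and the function names are hypothetical. Under Hardy–Weinberg equilibrium the genotype frequencies are $(1-f)^2$, $2f(1-f)$ and $f^2$ for 0, 1 and 2 copies of the minor allele, so the cut-points are the corresponding normal quantiles.

```python
from statistics import NormalDist

def hwe_cutpoints(maf):
    """Normal-scale cut-points that reproduce Hardy-Weinberg frequencies."""
    p0 = (1.0 - maf) ** 2                  # frequency of genotype 0
    p1 = 2.0 * maf * (1.0 - maf)           # frequency of genotype 1
    nd = NormalDist()
    return nd.inv_cdf(p0), nd.inv_cdf(p0 + p1)

def trichotomize(z, maf):
    """Map standard normal draws to genotypes 0, 1 or 2."""
    lo, hi = hwe_cutpoints(maf)
    return [0 if v <= lo else 1 if v <= hi else 2 for v in z]

# Hypothetical example with minor allele frequency 0.5.
genotypes_toy = trichotomize([-1.0, 0.0, 1.0], maf=0.5)
```

Thresholding a correlated normal vector in this way preserves the sign and ordering of the latent correlations, so the block structure of the covariance matrix carries over to the genotypes.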

We generated the trait by setting Inline graphic (), where is an adjusting covariate and with . For the sample size and the dimension , we considered both the and the cases by setting with and setting with .

The null model is Inline graphic (). To simulate the data under various alternatives, we assume that has nonzero signals with support set . The magnitude of the signal was set to be , and half of the had positive signs while the other half had negative signs. The magnitude of the signals varied from 0.03 to 0.05 under these set-ups, and the proportion of nonnull variants among all the variants, Inline graphic , was set to 5%, 10%, 15% or 20%.

The proposed tests were conducted as described in § 2. For variance estimation, we applied the refitted crossvalidation method (Fan et al., 2012) to obtain $\hat\sigma^2$. The threshold parameter in the power-enhancement component was estimated over 1000 replicates. We used two versions of principal component analysis. In the first, we used the five leading principal components and performed a likelihood ratio test. In the second version, we included the principal components that explain 50% of the total variance for the likelihood ratio test. The reason for considering the second version is that, in practice, a few principal components may not always capture the majority of the variance, as seen in Avery et al. (2011). We also included the sequence kernel association test without any weights.

The Type I errors of the tests were calculated over 10 000 replications, and the power was based on 1000 replications. Table 1 displays the Type I errors of the tests for the models considered. It can be seen that the Bonferroni test, principal component analysis using the five leading components, the sequence kernel association test, and our two proposed tests all have their Type I errors controlled. Principal component analysis using components that explain 50% of the variance appears to have an inflated Type I error; this is likely due to the fact that many principal components are needed to account for 50% of the variance, and hence the likelihood ratio test has many degrees of freedom. Because of its inflated Type I error, this method was excluded from the subsequent experiments.
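When reading empirical Type I errors, it helps to keep the Monte Carlo error in mind. The short Python sketch below computes the binomial standard error of an empirical rejection rate for a test whose true size equals the nominal level; the function name is hypothetical, and the calculation is a generic approximation rather than part of the reported analysis.

```python
import math

def monte_carlo_se(level, n_rep):
    """Binomial standard error of an empirical rejection rate."""
    return math.sqrt(level * (1.0 - level) / n_rep)

# Nominal level 0.05 with 10 000 replications, expressed in percent.
se_percent = 100.0 * monte_carlo_se(0.05, 10_000)
print(f"{se_percent:.3f}")  # 0.218
```

Under this approximation, empirical rates within roughly two standard errors of the nominal level, that is, about 5 ± 0.44 in percent, are compatible with correct size.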

Table 1.

Type I error (%) of the tests at level 0.05

$n$	$p$	Corr.	Bonf.	Burden	PCA	PCA50	SKAT	Proposed (sum)	Proposed (power-enhanced)
500	300	CS	4.66	4.90	5.00	5.97	2.79	4.84	5.10
500	300	AR	4.47	5.30	4.91	6.40	2.77	5.03	5.13
500	500	CS	4.80	4.92	5.06	6.59	2.04	4.89	4.97
500	500	AR	4.64	4.80	5.45	6.59	1.64	4.76	4.84
500	1000	CS	4.71	5.02	5.00	7.22	0.75	4.75	4.93
500	1000	AR	4.96	4.98	5.10	7.31	0.68	4.83	4.93
1000	800	CS	5.07	4.86	5.20	6.49	2.47	5.03	5.12
1000	800	AR	4.78	5.26	5.25	6.49	2.18	4.92	5.01
1000	1000	CS	4.92	5.08	4.71	6.13	2.02	4.85	5.04
1000	1000	AR	5.24	5.00	4.90	6.60	1.81	4.88	4.99
1000	1500	CS	4.92	5.08	5.27	7.07	1.38	4.95	5.14
1000	1500	AR	5.22	5.08	5.26	6.85	1.04	4.82	4.98


Corr., correlation structure; CS, compound symmetric; AR, autoregressive; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; PCA50, principal component analysis using components that explain 50% of the variance; SKAT, the sequence kernel association test.

For the compound symmetric dependence structure, plots of the power for different sample sizes and dimensions against the proportion of nonzero signals are shown in Fig. 1. When this proportion is small, all the methods have low power. This indicates that when signals are sparse and weak, it is highly difficult to detect the association for the pathway considered. As the proportion increases, the power improves for all methods, because more variants carry association signals in the studied pathway. However, the two proposed tests always have higher power than the other approaches in these settings. The results for the autoregressive structure show a similar pattern.

Fig. 1. — Power of the Bonferroni test, the burden test, principal component analysis, the sequence kernel association test and the two proposed tests at level 0.05 for different sample sizes and dimensions, plotted against the proportion of nonzero signals. The compound symmetric dependence structure is considered.

We then considered the situation in which a genetic pathway contains both weak and strong signals. We simulated weak signals as described earlier, and then simulated a strong signal. Table 2 shows that both proposed tests compete favourably with the other statistics, and the power-enhanced test has higher power than the sum-of-squares test.

Table 2.

Power (%) of the tests under mixed signals at level 0.05

$n$	$p$	Corr.	Bonf.	Burden	PCA	SKAT	Proposed (sum)	Proposed (power-enhanced)
500	300	CS	36.0	5.7	15.6	30.8	35.1	39.8
500	300	AR	35.7	5.4	12.0	24.0	28.6	34.1
500	500	CS	34.8	5.3	15.7	33.6	43.5	46.7
500	500	AR	35.2	5.5	12.6	25.0	35.6	39.4
500	1000	CS	32.4	5.2	18.5	39.2	61.8	63.6
500	1000	AR	31.1	5.0	14.8	26.8	51.0	53.5
1000	800	CS	38.4	5.2	16.0	54.0	60.1	62.4
1000	800	AR	36.1	5.0	12.3	42.4	50.1	53.3
1000	1000	CS	36.7	5.0	17.0	59.2	67.2	69.0
1000	1000	AR	36.0	4.8	12.7	45.8	56.2	59.2
1000	1500	CS	35.4	5.1	18.1	67.3	79.4	80.9
1000	1500	AR	34.5	4.6	13.1	52.6	69.1	71.5


Corr., correlation structure; CS, compound symmetric; AR, autoregressive; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test.

We conducted simulation studies to examine the performance of the Edgeworth-corrected test, which involves a two-term Edgeworth expansion and is expected to be more accurate than the normal approximation in controlling Type I error at extreme significance levels. We set the significance level to 0.0001 and evaluated the Type I error with 1 000 000 simulations. The threshold parameter in the power enhancement was estimated over 50 000 replicates. Table 3 shows that at level 0.0001, the proposed statistic based on the normal approximation tends to have inflated Type I error due to the less accurate characterization of the tail probability. In contrast, the Edgeworth-corrected test can control the Type I error well at 0.0001 when the sample size and dimension are sufficiently large.

Table 3.

Type I error ($\times 10^{-4}$) of the tests at level 0.0001

$n$	$p$	Corr.	Bonf.	Burden	PCA	SKAT	Proposed (normal approx.)	Proposed (Edgeworth)
500	500	CS	1.01	1.08	1.31	0.01	5.06	1.04
500	500	AR	0.96	1.16	1.16	0.00	4.08	0.97
1000	1000	CS	1.34	1.01	1.20	0.00	3.19	1.08
1000	1000	AR	1.30	1.08	1.10	0.00	2.86	1.05
1500	1500	CS	1.20	1.11	1.00	0.00	2.29	0.95
1500	1500	AR	1.11	1.04	1.34	0.00	2.23	0.95


4. Real-data analysis

We analysed the high-density lipoprotein cholesterol data from the Genomics and Randomized Trials Network in the Women’s Health Initiative (Coviello et al., 2012). The overall goal of the study is to identify novel genetic factors that contribute to the incidence of myocardial infarction, stroke and diabetes. DNA samples were genotyped on the HumanOmni-Quad platform, and genotypes were imputed with reference panels. Genetic variants that have imputations Inline graphic and minor allele frequency greater than 5% were included. We focused on the 3990 samples of Caucasian ancestry.

We first tested whether our approach can capture existing genetic pathways that are known to be involved in high-density lipoprotein metabolism. Assmann & Gotto (2004) listed a pathway involved in the generation and conversion of high-density lipoprotein. The pathway includes 11 genes: APOA1, APOE, LCAT, LIPC, CETP, PLTP, SCARB, LRP1, LDLR, ABCA1 and ABCF1. We mapped the genetic variants to these genes and obtained 629 variants for this pathway. We adjusted for the following covariates: age, hormone replacement therapy arm, smoking status, body mass index, and the first two principal components for ancestry (Asselbergs et al., 2012). The $p$-values for the pathway analysis are displayed in Table 4. Several methods yielded low $p$-values, including the Bonferroni test, the sequence kernel association test and the two proposed tests. The power-enhanced test yielded the lowest $p$-value; it is lower than that of the sum-of-squares test because a number of variants in the CETP and LIPC genes were observed to carry strong association signals that exceed the power-enhancement threshold.

Table 4.

Real-data analysis: $p$-values of the tests for the known lipid pathway

	Bonf.	Burden	PCA	SKAT
$p$-value		0.252	0.034


Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test.

Next, we investigated the associations between the KEGG pathways and high-density lipoprotein. The KEGG database contains 186 pathways, which represent a wide variety of cellular processes and molecular functions; for more details see http://www.genome.jp/kegg/pathway.html. We excluded one pathway from our analysis due to overlapping, so our real-data analysis includes 185 pathways. Figure 2 provides an overview of the number of variants in each of the 185 pathways. The median number of variants in these pathways is around 3000. A number of pathways have more than 10 000 variants, with some containing nearly 25 000.

Fig. 2. — The number of single-nucleotide polymorphisms, SNPs, in each of the 185 KEGG pathways.

To control for the familywise Type I error, the threshold of significance was set to $0.05/185 \approx 2.7 \times 10^{-4}$, i.e., a Bonferroni correction. Table 5 shows the pathways that pass the significance threshold in any of the tests. The proposed approach identified three pathways: arachidonic acid metabolism, metabolism of xenobiotics by cytochrome P450, and drug metabolism by cytochrome P450. The power-enhanced statistic yielded the same values as the sum-of-squares statistic, indicating that no signal exceeds the power-enhancement threshold in the studied pathways. The sequence kernel association test detected only the arachidonic acid metabolism pathway, while the other methods identified no significant pathway.
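The Bonferroni threshold for the 185 pathways can be checked with a one-line calculation. The Python sketch below assumes a familywise level of 0.05, which is consistent with the significance level of about 0.0003 quoted in § 2.5; the function name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    return alpha / n_tests

threshold_kegg = bonferroni_threshold(0.05, 185)
print(f"{threshold_kegg:.2e}")  # 2.70e-04
```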

Table 5.

The Inline graphic -values of the tests for the three significant KEGG pathways; -values lower than are indicated by

				-value
	#SNPs	Bonf.	Burden	PCA	SKAT
Arach. acid metab.	2590	5.11	0.57	0.07	0.02
Metab. xenobio.	2254	6.30	0.46	0.18	0.07
Drug metab.	2385	7.86	0.39	0.16	0.04


#SNPs, number of single-nucleotide polymorphisms; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test; Arach. acid metab., arachidonic acid metabolism pathway; Metab. xenobio., metabolism of xenobiotics by cytochrome P450; Drug metab., drug metabolism by cytochrome P450.

The arachidonic acid metabolism pathway contains 2590 variants in 55 genes. A recent biological study suggested that this pathway is an important regulator of cholesterol metabolism (Demetz et al., 2014). The linkage disequilibrium plot of the genetic variants of this pathway in Fig. 3(a) shows that variants in proximity to each other tend to have strong correlations, while those far apart have barely detectable correlations. To gain more insight, we plot the marginal $p$-values for all 2590 variants in Fig. 3(b). A number of variants have moderately small $p$-values, but none of them reaches genome-wide significance. However, the proposed statistic was able to aggregate these relatively mild signals into a stronger one, which leads to the detection of the arachidonic acid metabolism pathway. The variants that contribute to the significance of this pathway, the linkage disequilibrium plots and marginal $p$-values of variants in the other two pathways, metabolism of xenobiotics by cytochrome P450 and drug metabolism by cytochrome P450, are given in the Supplementary Material.

Fig. 3. Analysis of the arachidonic acid metabolism pathway: (a) linkage disequilibrium plot; (b) marginal p-values of the 2590 single-nucleotide polymorphisms (SNPs) in the pathway, where the dashed line represents the Bonferroni threshold.

5. Discussion

Our approach can be extended to deal with non-Gaussian errors as long as the errors satisfy a suitable moment condition. In such a situation, we can adjust the denominator of the test statistic in (3) so that it accounts for the kurtosis of the errors and the diagonal entries of the underlying matrix. Then, using the results in Bhansali et al. (2007), we can show the asymptotic normality of the adjusted test statistic accordingly. Our approach can also be extended to accommodate genetic interactions.
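The ingredients of this adjustment, the error kurtosis and the diagonal entries of a matrix, also appear in the textbook variance identity for a quadratic form with independent, identically distributed, mean-zero errors. The sketch below implements that identity for a generic symmetric matrix `a`; identifying `a` with the matrix in (3), and this expression with the authors' adjusted denominator, is an assumption.

```python
def quadratic_form_variance(a: list, sigma2: float, kurtosis: float) -> float:
    """Variance of e' A e for symmetric A and i.i.d. mean-zero errors e_i.

    sigma2 is Var(e_i) and kurtosis is E(e_i^4) / sigma2^2 (3 for Gaussian).
    Identity: sigma2^2 * (2 * tr(A^2) + (kurtosis - 3) * sum_i a_ii^2).
    """
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("a must be a square matrix")
    # tr(A^2) equals the sum over i, j of a_ij * a_ji.
    trace_a2 = sum(a[i][j] * a[j][i] for i in range(n) for j in range(n))
    diag_sq = sum(a[i][i] ** 2 for i in range(n))
    return sigma2 ** 2 * (2.0 * trace_a2 + (kurtosis - 3.0) * diag_sq)


identity3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Gaussian errors: the kurtosis term vanishes, leaving 2 * tr(A^2).
gaussian_var = quadratic_form_variance(identity3, sigma2=1.0, kurtosis=3.0)

# Errors equal to +1 or -1 with equal probability have kurtosis 1, and the
# sum of their squares is constant, so the variance must be zero.
sign_var = quadratic_form_variance(identity3, sigma2=1.0, kurtosis=1.0)
```

When the matrix has a zero diagonal the kurtosis term drops out entirely, which is why the diagonal entries are the part that matters for non-Gaussian errors.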

Screening techniques have been used in genetic association studies to filter out irrelevant variants; see, for example, Li et al. (2014) and Cui et al. (2015). However, these screening procedures are typically used as a variable-selection step to reduce dimensions, not for statistical testing. In contrast, our screening statistic is directly integrated into the test statistic and is designed for statistical testing. Our approach has focused on the fixed design, which is commonly considered in genetic studies. It will be interesting to develop similar methods under the random design, although it remains challenging to establish the asymptotic properties of the proposed statistics in high dimensions.

Acknowledgement

This research was supported by the U.S. National Institutes of Health. We thank the Women’s Health Initiative investigators for sharing the data. The Women’s Health Initiative programme is funded by the National Heart, Lung and Blood Institute. Correspondence should be addressed to QH. We thank the editor, associate editor and reviewers for helpful comments.

Supplementary material

Supplementary material available at Biometrika online includes technical proofs, together with additional simulation results and real-data analysis.

References

Asselbergs, F. W., Guo, Y., Van Iperen, E. P., Sivapalaratnam, S., Tragante, V., Lanktree, M. B., Lange, L. A., Almoguera, B., Appelman, Y. E., Barnard, J. et al. (2012). Large-scale gene-centric meta-analysis across 32 studies identifies multiple lipid loci. Am. J. Hum. Genet. 91, 823–38.
Assmann, G. & Gotto, A. M. (2004). HDL cholesterol and protective factors in atherosclerosis. Circulation 109, III8–14.
Avery, C. L., He, Q., North, K. E., Ambite, J. L., Boerwinkle, E., Fornage, M., Hindorff, L. A., Kooperberg, C., Meigs, J. B., Pankow, J. S. et al. (2011). A phenomics-based strategy identifies loci on APOC1, BRAP, and PLCG1 associated with metabolic syndrome phenotype domains. PLoS Genet. 7, e1002322.
Bhansali, R., Giraitis, L. & Kokoszka, P. (2007). Convergence of quadratic forms with nonvanishing diagonal. Statist. Prob. Lett. 77, 726–34.
Buas, M. F., He, Q., Johnson, L. G., Onstad, L., Levine, D. M., Thrift, A. P., Gharahkhani, P., Palles, C., Lagergren, J., Fitzgerald, R. C. et al. (2017). Germline variation in inflammation-related pathways and risk of Barrett’s oesophagus and oesophageal adenocarcinoma. Gut 66, 1739–47.
Chen, L. S., Paul, D., Prentice, R. L. & Wang, P. (2011a). A regularized Hotelling’s test for pathway analysis in proteomic studies. J. Am. Statist. Assoc. 106, 1345–60.
Chen, M., Cho, J. & Zhao, H. (2011b). Incorporating biological pathways via a Markov random field model in genome-wide association studies. PLoS Genet. 7, e1001353.
Chen, S. X. & Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38, 808–35.
Conneely, K. N. & Boehnke, M. (2007). So many correlated tests, so little time! Rapid adjustment of P values for multiple correlated tests. Am. J. Hum. Genet. 81, 1158–68.
Coviello, A. D., Haring, R., Wellons, M., Vaidya, D., Lehtimaki, T., Keildson, S., Lunetta, K. L., He, C., Fornage, M., Lagou, V. et al. (2012). A genome-wide association meta-analysis of circulating sex hormone–binding globulin reveals multiple loci implicated in sex steroid hormone regulation. PLoS Genet. 8, e1002805.
Cui, H., Li, R. & Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Am. Statist. Assoc. 110, 630–41.
Dai, J. Y., Kooperberg, C., Leblanc, M. & Prentice, R. L. (2012). Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika 99, 929–44.
Demetz, E., Schroll, A., Auer, K., Heim, C., Patsch, J. R., Eller, P., Theurl, M., Theurl, I., Theurl, M., Seifert, M. et al. (2014). The arachidonic acid metabolome serves as a conserved regulator of cholesterol metabolism. Cell Metab. 20, 787–98.
Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika 101, 269–84.
Fan, J., Guo, S. & Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Statist. Soc. B 74, 37–65.
Fan, J., Liao, Y. & Yao, J. (2015). Power enhancement in high dimensional cross-sectional tests. Econometrica 83, 1497–541.
Feller, W. (1971). Expansions in the case of varying components. In An Introduction to Probability Theory and Its Applications, vol. 2. New York: Wiley, pp. 546–8.
Gregory, K. B., Carroll, R. J., Baladandayuthapani, V. & Lahiri, S. N. (2015). A two-sample test for equality of means in high dimension. J. Am. Statist. Assoc. 110, 837–49.
He, Q., Zhang, H. H., Avery, C. L. & Lin, D. (2016). Sparse meta-analysis with high-dimensional data. Biostatistics 17, 205–20.
International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437, 1299–320.
Li, J., Zhong, W., Li, R. & Wu, R. (2014). A fast algorithm for detecting gene–gene interactions in genome-wide association studies. Ann. Appl. Statist. 8, 2292–318.
McKeague, I. W. & Qian, M. (2015). An adaptive resampling test for detecting the presence of significant predictors. J. Am. Statist. Assoc. 110, 1422–33.
Morgenthaler, S. & Thilly, W. G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test. Mutat. Res. 615, 28–56.
Shen, D., Shen, H. & Marron, J. S. (2016). A general framework for consistency of principal component analysis. J. Mach. Learn. Res. 17, 1–34.
Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M. & Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93.
Zhang, G. (2015). Genetic architecture of complex human traits: What have we learned from genome-wide association studies? Curr. Genet. Med. 3, 143–50.
Zhong, H., Yang, X., Kaplan, L. M., Molony, C. & Schadt, E. E. (2010). Integrating pathway analysis and genetics of gene expression for genome-wide association studies. Am. J. Hum. Genet. 86, 581–91.
