Biometrika. 2019 Jul 13;106(3):651. doi: 10.1093/biomet/asz033

Statistical inference of genetic pathway analysis in high dimensions

Yang Liu 1, Wei Sun 2, Alexander P Reiner 2, Charles Kooperberg 2, Qianchuan He 2
PMCID: PMC6690174  PMID: 31427824

Summary

Genetic pathway analysis has become an important tool for investigating the association between a group of genetic variants and traits. With dense genotyping and extensive imputation, the number of genetic variants in biological pathways has increased considerably and sometimes exceeds the sample size $n$. Conducting genetic pathway analysis and statistical inference in such settings is challenging. We introduce an approach that can handle pathways whose dimension $p$ could be greater than $n$. Our method can be used to detect pathways that have nonsparse weak signals, as well as pathways that have sparse but stronger signals. We establish the asymptotic distribution for the proposed statistic and conduct theoretical analysis on its power. Simulation studies show that our test has correct Type I error control and is more powerful than existing approaches. An application to a genome-wide association study of high-density lipoproteins demonstrates the proposed approach.

Keywords: Genetic pathway analysis, Genetic variant, High-dimensional inference, Nonsparse signal, Power analysis, Sparse signal

1. Introduction

Genetic association analysis plays an important role in identifying genetic variants that are associated with traits. Genetic variants are often analysed by single-variant-based methods, using approaches such as Armitage’s trend test. Pathway-based analysis has become a popular tool for analysing genetic variant data (Chen et al., 2011b), whereby multiple genetic variants in the genes in a prespecified pathway are examined. There are several reasons to consider pathway analysis for association studies. First, pathways are generally defined using biological knowledge and thus are more likely to be functionally relevant (Zhong et al., 2010). Second, by analysing multiple variants simultaneously, pathway analysis has the potential to accumulate weak signals into stronger ones, while single-variant-based methods lack power in such a situation. Third, because the number of pathways is much smaller than the number of variants, the multiple-testing burden can be dramatically reduced.

One of the main challenges in pathway analysis is to deal with the high dimensionality. With increasingly dense genotyping and extensive imputation, the number of variants $p$ in genetic pathways has grown so rapidly that it can be larger than the sample size $n$. This is seen in our real-data example, where the sample size is around 4000 while the number of single nucleotide polymorphisms in a pathway can be as large as 25 000. In such high dimensions, statistical testing methods that were designed for moderate $p$, such as the likelihood ratio test, tend to have low power or may be inapplicable. To deal with the high dimensionality in pathway analysis, one potential approach is the burden test (Morgenthaler & Thilly, 2007), in which one simply sums the genotypes into a single predictor and then subjects this predictor to regression analysis. The burden test works well if all the variants have similar effect sizes, but this assumption rarely holds in real situations. Another common approach to dealing with high dimensions is to use principal component analysis in the regression modelling. One first derives the principal components from the genetic pathway under consideration, and then uses the leading components for association analysis (Buas et al., 2017). The disadvantages of this approach are that principal components with large variations need not be associated with the traits; it is rarely clear how many principal components to include; the interpretation of the regression coefficients can be difficult; and when $p > n$, the estimated principal components may not be consistent (Shen et al., 2016). Complementary to the aforementioned approaches, kernel machine methods such as the sequence kernel association test (Wu et al., 2011) can also be applied to genetic pathway analysis. However, the latter test has been used primarily to analyse moderate-sized variant sets, and its performance in cases where $p$ is substantially larger is unclear. Other methods that have been developed for testing a group of genetic features in high-dimensional settings (Chen & Qin, 2010; Chen et al., 2011a; Gregory et al., 2015) focus on testing the mean difference between two groups rather than conducting association analysis.

In addition to the high-dimensional challenge, another difficulty in pathway analysis is power maximization under multiple plausible alternative hypotheses. For pathway analysis, the alternative hypothesis concerns both the number and the magnitudes of the nonzero genetic signals, which are generally unknown (Zhang, 2015). A situation often considered for genetic signals is that a pathway harbours potentially many variants with weak effects, called the nonsparse-signal situation. The sequence kernel association test can aggregate multiple signals and is potentially applicable to such a setting. Another possibility is that a genetic pathway contains only a few strong signals, called the sparse-signal situation. Several methods have been proposed to deal with this case, such as the Inline graphic test (Conneely & Boehnke, 2007), which first examines each variant individually and then seeks to obtain the Inline graphic-value for the maximum of the observed statistics. However, the Inline graphic test has little power in the nonsparse situation, while the sequence kernel association test loses power in the sparse situation.

In this paper, we propose a method for conducting high-dimensional genetic pathway analysis, where the dimension $p$ of the pathway can go to infinity and could exceed the sample size $n$. Our approach can be used to identify pathways that harbour a large number of weak signals, i.e., nonsparse signals, as well as genetic pathways that contain only a few strong signals, i.e., sparse signals, or a mixture of weak and strong signals. We establish the asymptotic properties of the proposed statistics in high dimensions and conduct theoretical analysis of their power.

2. Methods

2.1. Model and statistics

Suppose that the data consist of a continuous trait vector Inline graphic, an adjusting covariates matrix Inline graphic and a genotype matrix Inline graphic for a genetic pathway; that is, the pathway being considered contains Inline graphic genetic variants. Suppose that the true regression model is

graphic file with name M18.gif

where Inline graphic is the coefficient vector for Inline graphic, with Inline graphic being the intercept, Inline graphic is the coefficient vector for Inline graphic, and Inline graphic is a vector of independent Gaussian errors with mean zero and variance Inline graphic. The design matrices Inline graphic and Inline graphic are considered fixed. The dimension Inline graphic of the adjusting covariates is assumed to be finite, while the dimension Inline graphic of the genotype matrix can go to infinity.

We are interested in testing the global null hypothesis Inline graphic against the alternative Inline graphic. Tests such as the likelihood ratio test and Wald test consider all the Inline graphic variants jointly and tend to perform poorly when Inline graphic is large; the statistics may not exist when Inline graphic. Marginal statistics are easy to calculate and have been widely used to evaluate the significance of each individual variant. Recall that in a marginal analysis, one first fits a regression model for a given variant, say the Inline graphicth, by Inline graphic (Inline graphic) and then obtains the marginal score statistic as

graphic file with name M38.gif

with Inline graphic, where Inline graphic is the identity matrix. To conduct a pathway analysis, it is natural to consider the sum of all the squared marginal statistics, Inline graphic In fact, it can be shown that Inline graphic is equivalent to the sequence kernel association test statistic, if the estimator Inline graphic of Inline graphic is ignored in the latter. However, our proposed approach is not focused on Inline graphic per se, but rather uses Inline graphic to develop a suite of statistics for high-dimensional settings, particularly for the case of Inline graphic for a constant Inline graphic.

Under the null hypothesis Inline graphic, it can be shown that Inline graphic and varInline graphic, where Inline graphic, with Inline graphic a diagonal matrix whose elements are Inline graphic (Inline graphic), and Inline graphic is the Frobenius norm. For the moment we assume that Inline graphic is known, but later on we will address the practical situation where Inline graphic needs to be estimated. We propose to standardize Inline graphic, which yields

graphic file with name M60.gif (1)

where the superscript Inline graphic emphasizes that both Inline graphic and Inline graphic can go to infinity; it will be suppressed below for ease of notation. Expression (1) suggests that Inline graphic may converge to normality as Inline graphic gets large. However, the central limit theorem does not directly apply here because the Inline graphic are correlated. In fact, the correlation matrix for the Inline graphic, Inline graphic, can be shown to have the form

graphic file with name M69.gif

and it can further be shown that Inline graphic. In Lemma 1 we show that under proper conditions, Inline graphic is standard normal as both Inline graphic and Inline graphic go to infinity.
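As a concrete illustration of the standardization idea behind (1), the following Python sketch computes standardized marginal score statistics, their sum of squares, and a centred and scaled version of that sum. It rests on assumptions that are not taken from the displayed formulas, which are not reproduced here: the only adjusting covariate is an intercept (so that residualizing amounts to centring), the noise level `sigma` is known, each marginal statistic is scaled to have unit variance under the null, and the null mean and variance of the sum are taken to be $p$ and $2\,\mathrm{tr}(R^2)$, where $R$ is the sample correlation matrix of the centred genotype columns. The function name and interface are hypothetical.

```python
import math


def standardized_sum_of_squares(y, genotypes, sigma):
    """Sum of squared marginal score statistics, centred and scaled.

    y         : list of n trait values
    genotypes : list of n rows, each a list of p genotype values
    sigma     : noise standard deviation, assumed known here
    Assumes an intercept-only adjustment and non-constant genotype columns.
    """
    n, p = len(y), len(genotypes[0])
    y_mean = sum(y) / n
    y_c = [v - y_mean for v in y]

    # Centre each genotype column (residualize on the intercept) and
    # rescale it to unit Euclidean norm.
    cols = []
    for j in range(p):
        col = [row[j] for row in genotypes]
        m = sum(col) / n
        col_c = [v - m for v in col]
        norm = math.sqrt(sum(v * v for v in col_c))
        cols.append([v / norm for v in col_c])

    # Marginal statistics: each has unit variance under the null.
    u = [sum(g * v for g, v in zip(col, y_c)) / sigma for col in cols]
    t = sum(s * s for s in u)

    # tr(R^2) equals the squared Frobenius norm of the correlation matrix R.
    tr_r2 = 0.0
    for a in cols:
        for b in cols:
            r = sum(x * z for x, z in zip(a, b))
            tr_r2 += r * r

    z = (t - p) / math.sqrt(2.0 * tr_r2)
    return u, t, z
```

Under these assumptions the vector of marginal statistics is Gaussian with covariance $R$ under the null, which is why the sum of squares has mean $p$ and variance $2\,\mathrm{tr}(R^2)$; no matrix inversion is involved.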

Before presenting Lemma 1, we define some notation. For a vector Inline graphic, let Inline graphic be the Inline graphic-norm of the vector for Inline graphic. For any Inline graphic matrix Inline graphic, denote the induced Inline graphic-norm by Inline graphic. When Inline graphic is an Inline graphic matrix, we denote its maximum and minimum eigenvalues by Inline graphic and Inline graphic, respectively.

Lemma 1.

Let Inline graphic. If

graphic file (display equation of Lemma 1) (2)

then under Inline graphic, the statistic Inline graphic in distribution.

Remark 1.

Here we have no constraint on the order of Inline graphic with respect to Inline graphic, providing they both go to infinity. Condition (2) is mild for genetic studies. By Hölder’s inequality, Inline graphic, where Inline graphic is the maximum absolute column sum of the matrix. When the correlation structure in Inline graphic is not overly strong, as is the case for the power-decay structure, i.e., Inline graphic for some Inline graphic, then one can show that Inline graphic. Here Inline graphic, the correlation of Inline graphic and Inline graphic, can be interpreted as the linkage disequilibrium of genetic variants after adjusting for covariates Inline graphic; when there are no adjusting covariates, Inline graphic reduces to the linkage disequilibrium matrix of Inline graphic. The power decay structure indicates that two distant genetic variants have virtually no linkage disequilibrium, which is indeed what is observed in genetic studies, particularly in the human genome data (International HapMap Consortium, 2005). Similar structures have also been used in other articles on genetic studies, such as Dai et al. (2012). Our proposed statistic naturally takes linkage disequilibrium into account, because Inline graphic. The linkage disequilibrium can influence both the denominator and the numerator of Inline graphic, so the impact of the linkage disequilibrium on the power of the proposed test is influenced by the size and density of the genetic signals. However, the linkage disequilibrium will not affect the validity of the test or its asymptotic properties, because the calculation of Inline graphic does not involve inversion of the linkage disequilibrium matrix, and the normality of the proposed statistics requires only that distant variants tend to have linkage disequilibrium approaching zero. 
In practice, variants in a gene tend to be in linkage disequilibrium, while those for different genes are generally not in linkage disequilibrium; this type of structure is covered in Lemma 1.

So far we have assumed that the noise level Inline graphic is known. To make our proposal practical, it is tempting to replace Inline graphic with a consistent estimator Inline graphic. It turns out that the validity of doing so depends on the order of Inline graphic relative to Inline graphic. In the following, we elaborate on this and propose different statistics to accommodate different ratios Inline graphic.

We first consider the situation where Inline graphic for some Inline graphic, i.e., Inline graphic is of smaller order than Inline graphic. The following lemma shows that if we replace Inline graphic with a consistent estimator Inline graphic, normality still holds.

Lemma 2.

Suppose that (2) holds. Let Inline graphic be a root-Inline graphic-consistent estimator of Inline graphic such that Inline graphic. Then under Inline graphic, as Inline graphic such that Inline graphic for Inline graphic,

graphic file (display equation of Lemma 2)

in distribution.

Next, we consider the situation in which Inline graphic for some constant Inline graphic. The normality of Inline graphic no longer holds because Inline graphic becomes excessively large; see the proof of Lemma 2 for more details. In light of this, we propose a new statistic

graphic file with name M132.gif (3)

where Inline graphic, with Inline graphic being the number of adjusting covariates as mentioned earlier. The Inline graphic in the numerator of Inline graphic is replaced by Inline graphic in Inline graphic. The motivation behind this is that Inline graphic estimates Inline graphic under Inline graphic. We discovered that this replacement of Inline graphic enables one to overcome the limitation of Inline graphic in high dimensions. The following theorem shows that Inline graphic follows a normal distribution for Inline graphic.

Theorem 1.

Suppose that (2) holds. For any consistent estimator Inline graphic, as Inline graphic such that Inline graphic, if Inline graphic for some constant Inline graphic, then under Inline graphic we have Inline graphic in distribution.

Theorem 1 allows one to conduct statistical inference for pathway analysis when Inline graphic, although Inline graphic should not be excessively larger than Inline graphic. The condition Inline graphic is necessary to prevent the Inline graphic in the denominator equalling zero, as it can be shown that Inline graphic. To obtain a consistent estimator for Inline graphic under Inline graphic in a high-dimensional setting, Fan et al. (2012) proposed a refitted crossvalidation method based on procedures that satisfy the sure screening property. When the sparsity of the model is completely unknown, we can also estimate Inline graphic by the moment-based estimators of Dicker (2014), which are root-Inline graphic consistent when Inline graphic.

2.2. Power loss in the presence of sparse signals

The proposed statistic Inline graphic can handle situations where the association signals are spread out over a large number of genetic variants. However, the power of Inline graphic will be relatively low for the sparse-signal situation, in which a few genetic variants carry strong signals while all the others have zero coefficients. Fan et al. (2015) proposed the power-enhancement principle, the fundamental idea of which is to include a screening statistic that goes to zero under Inline graphic, but is nonzero under the sparse alternatives Inline graphic. Motivated by this principle, we propose a statistic that strengthens Inline graphic and is able to guard against potential power loss in the sparse-signal situation.

We define a screening set Inline graphic where Inline graphic is a threshold chosen to be slightly larger than the maximum estimation error of the marginal estimator, i.e., Inline graphic. Then, a power-enhancement component Inline graphic is

graphic file with name M173.gif

where Inline graphic denotes the sign of Inline graphic. Our statistic that is able to detect both nonsparse and sparse signals is Inline graphic.

Since Inline graphic has the same sign as Inline graphic, Inline graphic always has power at least that of Inline graphic. The threshold Inline graphic needs to ensure that the screening set Inline graphic is empty with probability approaching 1 under Inline graphic, so that the size of Inline graphic will be asymptotically equivalent to that of Inline graphic. Then, under Inline graphic, if an estimator is large enough that Inline graphic is nonempty, one can gain power. For Gaussian and sub-Gaussian errors, Inline graphic can be chosen to be Inline graphic, as suggested by Fan et al. (2015). The power-enhancement procedure in Fan et al. (2015) deals with a consistent estimator under Inline graphic, which is not available in our procedure, while our approach builds upon marginal estimators which are inconsistent under Inline graphic. Nevertheless, the size of our proposed statistic is asymptotically equivalent to that of Inline graphic under Inline graphic; in the next subsection, we will show that under the sparse alternatives Inline graphic can be powerful even when Inline graphic is not.

Lemma 3.

Under the same conditions as in Theorem 1, if Inline graphic where Inline graphic as Inline graphic, then under the null hypothesis Inline graphic we have Inline graphic in distribution. Thus, the sizes of Inline graphic and Inline graphic are asymptotically equivalent.

To select Inline graphic in practice, we propose an adaptive procedure to accommodate different correlation structures. We first generate a vector of Inline graphic random errors Inline graphic from the standard normal distribution. Then we compute the maximum of the marginal estimators as Inline graphic. Finally, we repeat these two steps many times and set Inline graphic based on all the replicates. McKeague & Qian (2015) also used an adaptive approach to determine threshold parameters for high-dimensional testing.
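The three steps above can be sketched in Python as follows. Several details are assumptions, since the corresponding formulas are not reproduced here: the marginal estimators are taken in standardized form (the inner product of a unit-norm residualized genotype column with the simulated errors), and the threshold is set to the largest simulated maximum over all replicates. A high quantile of the simulated maxima would be an equally plausible reading. The function name, arguments and seed handling are hypothetical.

```python
import math
import random


def adaptive_threshold(columns, n_replicates=1000, seed=0):
    """Monte Carlo threshold from null maxima of marginal estimators.

    columns : list of p genotype columns (each of length n), already
              residualized on the adjusting covariates
    Returns the largest, over replicates, of max_j |g_j' e| / ||g_j||,
    where e is a vector of n independent standard normal errors.
    """
    rng = random.Random(seed)
    n = len(columns[0])
    # Rescale every column to unit norm once, outside the loop.
    unit_cols = []
    for col in columns:
        norm = math.sqrt(sum(v * v for v in col))
        unit_cols.append([v / norm for v in col])

    delta = 0.0
    for _ in range(n_replicates):
        # Step 1: simulate a vector of pure-noise errors.
        e = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Step 2: maximum absolute marginal estimator for this replicate.
        rep_max = max(abs(sum(g * x for g, x in zip(col, e))) for col in unit_cols)
        # Step 3: keep the largest maximum seen so far.
        delta = max(delta, rep_max)
    return delta
```

Because the simulated maxima are computed from the observed genotype columns, they reflect the correlation among the variants, which is the sense in which such a procedure is adaptive.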

2.3. Power analysis

In this subsection we investigate the asymptotic power of the proposed tests Inline graphic and Inline graphic for nonsparse and sparse alternatives. Under Inline graphic, let Inline graphic be the set of nonzero coefficients, and let Inline graphic. Define the subvector Inline graphic and the submatrix Inline graphic. Let Inline graphic be the diagonal matrix with nonzero elements Inline graphic for Inline graphic. Similarly, let Inline graphic, Inline graphic and Inline graphic denote the corresponding quantities for Inline graphic.

The following theorem states that the sum-of-squares type of statistic Inline graphic has high power for the nonsparse-signal situation when the accumulated signals are sufficiently large.

Theorem 2.

Suppose that all the conditions in Theorem 1 hold. Consider a nonsparse alternative Inline graphic in which Inline graphic for a sufficiently large constant Inline graphic. If Inline graphic and Inline graphic for some constants Inline graphic, then as Inline graphic, Inline graphic, where Inline graphic is the Inline graphic-quantile of the standard normal distribution.

While Inline graphic can have high power under nonsparse alternatives, it may lose power under sparse alternatives. In the following theorem we show that Inline graphic, which adds a power-enhancement term Inline graphic to Inline graphic, can be powerful under both nonsparse and sparse alternatives.

Theorem 3.

Assume that the conditions in Theorem 2 hold. Consider a sparse alternative Inline graphic in which Inline graphic for a sufficiently large constant Inline graphic. If Inline graphic for some constant Inline graphic, then under either the nonsparse alternative Inline graphic or the sparse alternative Inline graphic, as Inline graphic, Inline graphic.

In practice, we recommend use of Inline graphic for detecting both weak and strong signals. However, if one wishes to distinguish between the sparse signals and the nonsparse signals, one can examine the values of Inline graphic and Inline graphic. If Inline graphic is larger than Inline graphic, then the power-enhancement component Inline graphic is nonzero and there exist strong signals in the pathway. If Inline graphic, then there are no strong signals in the pathway and the significance is driven by weak signals.

2.4. Incorporating biological information into Inline graphic and Inline graphic

The statistics Inline graphic and Inline graphic give equal weight to all the variants. In some applications, one may wish to assign different weights based on prior information. For example, if the effect of a genetic variant is related to its minor allele frequency, one may assign a weight Inline graphic to this variant, where Inline graphic is the minor allele frequency for the Inline graphicth variant. In other cases, one may assign functional scores to different variants to reflect their biological functions. In light of these considerations, we propose incorporating prior biological information into our proposed statistics as follows.

Let Inline graphic (Inline graphic) be prespecified positive weights, and let Inline graphic be the diagonal matrix with elements Inline graphic. Next, define Inline graphic and Inline graphic. Let Inline graphic and Inline graphic Similar to Inline graphic, we define a statistic Inline graphic. The following result shows the asymptotic normality of Inline graphic.

Corollary 1.

Suppose that (2) holds and Inline graphic as Inline graphic. Assume that as Inline graphic, Inline graphic. For any consistent estimator Inline graphic, if Inline graphic for some constant Inline graphic, then under Inline graphic in distribution.

As was done for Inline graphic, we can add Inline graphic to Inline graphic to guard against potential power loss in the presence of strong signals. Thus, our proposed statistics can readily accommodate prior biological information and still preserve their theoretical properties.

2.5. Edgeworth expansion for extreme significance levels

Genetic studies sometimes involve a large number of pathways, so the significance level can be much lower than 0.05. For example, in our real-data analysis, the significance level is 0.0003. At such levels, the normal distribution in Lemma 3 may be a poor approximation. We therefore propose a two-term Edgeworth expansion to characterize the tail probability of Inline graphic with higher accuracy. Recall that under Inline graphic, Inline graphic. It is known that Inline graphic follows a mixed chi-squared distribution with weights Inline graphic, where Inline graphic are the eigenvalues of Inline graphic. Using the Edgeworth expansion for independent random variables with varying distributions (Feller, 1971, p. 546), we can derive the following two-term expansion for Inline graphic:

graphic file with name M290.gif (4)

where Inline graphic for Inline graphic. Further, Inline graphic. Then, under the conditions in Theorem 1, the last remainder term in (4) can be shown to be Inline graphic. This expansion tends to be more accurate than the normal approximation, as the remainder term of the normal approximation is typically Inline graphic. Directly calculating (4) involves computing the Inline graphic, which can be onerous when Inline graphic and Inline graphic are large. Instead, we can use the identity Inline graphic for Inline graphic. We call the test that uses (4) to approximate the Inline graphic-value for Inline graphic the Inline graphic test. Similarly, we can apply an Edgeworth expansion to Inline graphic, and we call the resulting test Inline graphic.
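To illustrate the idea, the sketch below evaluates a textbook Edgeworth tail approximation for a standardized weighted sum of centred chi-squared variables. It uses only the power sums of the weights, which can be obtained as traces of matrix powers without computing eigenvalues. It is not claimed to reproduce (4) term by term, since that display is not reproduced here; the cumulant formulas and the Hermite-polynomial form are the standard ones, and the function name and arguments are hypothetical.

```python
import math


def edgeworth_upper_tail(x, s2, s3, s4):
    """Approximate P(W > x) for W = sum_i lam_i (Z_i^2 - 1) / sqrt(2 * s2).

    s2, s3, s4 : power sums of the weights, s_k = sum_i lam_i ** k,
                 equivalently the trace of the k-th power of the matrix
                 whose eigenvalues are the lam_i.
    """
    # Standardized third and fourth cumulants of W.
    k3 = 8.0 * s3 / (2.0 * s2) ** 1.5
    k4 = 48.0 * s4 / (2.0 * s2) ** 2

    # Probabilists' Hermite polynomials He_2, He_3 and He_5.
    he2 = x * x - 1.0
    he3 = x ** 3 - 3.0 * x
    he5 = x ** 5 - 10.0 * x ** 3 + 15.0 * x

    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    normal_tail = 0.5 * math.erfc(x / math.sqrt(2.0))

    # Skewness and kurtosis corrections added to the normal tail.
    correction = phi * (k3 * he2 / 6.0 + k4 * he3 / 24.0 + k3 * k3 * he5 / 72.0)
    return normal_tail + correction
```

For example, with 50 equal unit weights (a centred chi-squared variable with 50 degrees of freedom) and `x = 3`, the normal tail is about 0.00135, whereas the approximation above gives about 0.0047, reflecting the heavier right tail that the normal approximation ignores.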

3. Simulation studies

Monte Carlo simulations were conducted to evaluate the performance of the proposed tests, Inline graphic and Inline graphic, in high-dimensional settings and to compare them with the Bonferroni test, the burden test, principal component analysis, and the sequence kernel association test.

We generated the genotype matrix Inline graphic similarly to He et al. (2016). For each person, we first generated a block-diagonal covariance matrix with each block being a Inline graphic matrix Inline graphic. We considered compound symmetric Inline graphic with diagonal elements Inline graphic and off-diagonal elements 0.5 and autoregressive Inline graphic with Inline graphicth off-diagonal element 0.6Inline graphic. Then we trichotomized the simulated vector into genotype values of Inline graphic according to the Hardy–Weinberg equilibrium.
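A minimal Python sketch of this kind of generator, for a single block with the autoregressive structure, is given below. The minor allele frequency `maf`, the block size, and the use of a first-order autoregressive recursion to obtain correlation $0.6^{|i-j|}$ between latent normal variables are illustrative assumptions; the exact settings are not reproduced here. Cut-points are placed so that the genotype frequencies follow the Hardy–Weinberg proportions $(1-q)^2$, $2q(1-q)$ and $q^2$ for allele frequency $q$.

```python
import math
import random
from statistics import NormalDist


def simulate_genotype_block(n, block_size=5, rho=0.6, maf=0.3, seed=0):
    """Simulate an n x block_size genotype matrix coded 0/1/2.

    Latent standard normals with correlation rho ** |i - j| are cut at
    thresholds giving Hardy-Weinberg genotype frequencies for allele
    frequency maf.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    # P(genotype = 0) = (1 - q)^2 and P(genotype <= 1) = 1 - q^2.
    cut_low = std_normal.inv_cdf((1.0 - maf) ** 2)
    cut_high = std_normal.inv_cdf(1.0 - maf ** 2)
    innovation_sd = math.sqrt(1.0 - rho * rho)

    rows = []
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        row = []
        for k in range(block_size):
            if k > 0:
                # AR(1) step keeps unit variance and lag-one correlation rho.
                z = rho * z + innovation_sd * rng.gauss(0.0, 1.0)
            row.append(0 if z <= cut_low else (1 if z <= cut_high else 2))
        rows.append(row)
    return rows
```

Independent blocks generated in this way can be concatenated column-wise to mimic a block-diagonal dependence structure across the pathway.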

We generated the trait by setting Inline graphic (Inline graphic), where Inline graphic is an adjusting covariate and Inline graphic with Inline graphic. For the sample size Inline graphic and the dimension Inline graphic, we considered both the Inline graphic and the Inline graphic cases by setting Inline graphic with Inline graphic and setting Inline graphic with Inline graphic.

The null model is Inline graphic (Inline graphic). To simulate the data under various alternatives, we assume that Inline graphic has Inline graphic nonzero signals with support set Inline graphic. The magnitude of the signal Inline graphic was set to be Inline graphic, and half of the Inline graphic had positive signs while the other half had negative signs. The magnitude of the signals varied from 0.03 to 0.05 under these set-ups, and the proportion of nonnull variants among all the variants, Inline graphic, was set to 5%, 10%, 15% or 20%.

The tests Inline graphic and Inline graphic were conducted as described in § 2. For variance estimation, we applied the refitted crossvalidation method (Fan et al., 2012) to obtain Inline graphic. The threshold parameter Inline graphic in the power-enhancement component Inline graphic was estimated over 1000 replicates. We used two versions of principal component analysis. In the first, we used the five leading principal components and performed a likelihood ratio test. In the second version, we included the principal components that explain 50% of the total variance for the likelihood ratio test. The reason for considering the second version is that, in practice, a few principal components may not always capture the majority of the variance, as seen in Avery et al. (2011). We also included the sequence kernel association test without any weights.

The Type I errors of the tests were calculated over 10 000 replications, and the power was based on 1000 replications. Table 1 displays the Type I errors of the tests for the models considered. It can be seen that the Bonferroni test, principal component analysis using the five leading components, the sequence kernel association test, and our tests Inline graphic and Inline graphic all have their Type I errors controlled. Principal component analysis using components that explain 50% of the variance appears to have an inflated Type I error; this is likely due to the fact that many principal components are needed to account for the 50% of variance, and hence the likelihood ratio test has a large degree of freedom. Because of its inflated Type I error, this method was excluded from the subsequent experiments.

Table 1.

Type I error Inline graphic of the tests at level Inline graphic

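| $n$ | $p$ | Corr. | Bonf. | Burden | PCA | PCA50 | SKAT | Inline graphic | Inline graphic |
|---|---|---|---|---|---|---|---|---|---|
| 500 | 300 | CS | 4.66 | 4.90 | 5.00 | 5.97 | 2.79 | 4.84 | 5.10 |
| 500 | 300 | AR | 4.47 | 5.30 | 4.91 | 6.40 | 2.77 | 5.03 | 5.13 |
| 500 | 500 | CS | 4.80 | 4.92 | 5.06 | 6.59 | 2.04 | 4.89 | 4.97 |
| 500 | 500 | AR | 4.64 | 4.80 | 5.45 | 6.59 | 1.64 | 4.76 | 4.84 |
| 500 | 1000 | CS | 4.71 | 5.02 | 5.00 | 7.22 | 0.75 | 4.75 | 4.93 |
| 500 | 1000 | AR | 4.96 | 4.98 | 5.10 | 7.31 | 0.68 | 4.83 | 4.93 |
| 1000 | 800 | CS | 5.07 | 4.86 | 5.20 | 6.49 | 2.47 | 5.03 | 5.12 |
| 1000 | 800 | AR | 4.78 | 5.26 | 5.25 | 6.49 | 2.18 | 4.92 | 5.01 |
| 1000 | 1000 | CS | 4.92 | 5.08 | 4.71 | 6.13 | 2.02 | 4.85 | 5.04 |
| 1000 | 1000 | AR | 5.24 | 5.00 | 4.90 | 6.60 | 1.81 | 4.88 | 4.99 |
| 1000 | 1500 | CS | 4.92 | 5.08 | 5.27 | 7.07 | 1.38 | 4.95 | 5.14 |
| 1000 | 1500 | AR | 5.22 | 5.08 | 5.26 | 6.85 | 1.04 | 4.82 | 4.98 |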

Corr., correlation structure; CS, compound symmetric; AR, autoregressive; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; PCA50, principal component analysis using components that explain 50% of the variance; SKAT, the sequence kernel association test.

For the compound symmetric dependence structure, plots of the power for different sample sizes and dimensions against the proportion Inline graphic of nonzero signals are shown in Fig. 1. When the ratio Inline graphic is small, all the methods have low power. This indicates that when signals are sparse and weak, it is highly difficult to detect the association for the pathway considered. As Inline graphic increases, the power improves for all methods, because more variants carry association signals in the studied pathway. However, the tests Inline graphic and Inline graphic always have higher power than the other approaches in these settings. The results for the autoregressive structure show a similar pattern.

Fig. 1.

Power of the Bonferroni test (Inline graphic), the burden test (Inline graphic), principal component analysis (Inline graphic), the sequence kernel association test (Inline graphic), Inline graphic (Inline graphic) and Inline graphic (Inline graphic) at level 0.05 for different sample sizes and dimensions plotted against the proportion of nonzero signals, Inline graphic. The compound symmetric dependence structure is considered.

We then considered the situation in which a genetic pathway contains both weak and strong signals. We simulated weak signals as described earlier, and then simulated a strong signal with Inline graphic. Table 2 shows that both Inline graphic and Inline graphic compete favourably with the other statistics, and Inline graphic has higher power than Inline graphic.

Table 2.

Power Inline graphic of the tests under mixed signals at level Inline graphic

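| $n$ | $p$ | Corr. | Bonf. | Burden | PCA | SKAT | Inline graphic | Inline graphic |
|---|---|---|---|---|---|---|---|---|
| 500 | 300 | CS | 36.0 | 5.7 | 15.6 | 30.8 | 35.1 | 39.8 |
| 500 | 300 | AR | 35.7 | 5.4 | 12.0 | 24.0 | 28.6 | 34.1 |
| 500 | 500 | CS | 34.8 | 5.3 | 15.7 | 33.6 | 43.5 | 46.7 |
| 500 | 500 | AR | 35.2 | 5.5 | 12.6 | 25.0 | 35.6 | 39.4 |
| 500 | 1000 | CS | 32.4 | 5.2 | 18.5 | 39.2 | 61.8 | 63.6 |
| 500 | 1000 | AR | 31.1 | 5.0 | 14.8 | 26.8 | 51.0 | 53.5 |
| 1000 | 800 | CS | 38.4 | 5.2 | 16.0 | 54.0 | 60.1 | 62.4 |
| 1000 | 800 | AR | 36.1 | 5.0 | 12.3 | 42.4 | 50.1 | 53.3 |
| 1000 | 1000 | CS | 36.7 | 5.0 | 17.0 | 59.2 | 67.2 | 69.0 |
| 1000 | 1000 | AR | 36.0 | 4.8 | 12.7 | 45.8 | 56.2 | 59.2 |
| 1000 | 1500 | CS | 35.4 | 5.1 | 18.1 | 67.3 | 79.4 | 80.9 |
| 1000 | 1500 | AR | 34.5 | 4.6 | 13.1 | 52.6 | 69.1 | 71.5 |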

Corr., correlation structure; CS, compound symmetric; AR, autoregressive; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test.

We conducted simulation studies to examine the performance of the Inline graphic test, which involves a two-term Edgeworth expansion and is expected to be more accurate than Inline graphic in controlling Type I error at extreme significance levels. We set the significance level to 0.0001 and evaluated the Type I error with 1 000 000 simulations. The threshold parameter Inline graphic in the power enhancement was estimated over 50 000 replicates. Table 3 shows that at level 0.0001, the Inline graphic statistic tends to have inflated Type I error due to the less accurate characterization of the tail probability. In contrast, Inline graphic can control the Type I error well at 0.0001 when the sample size and dimension are sufficiently large.

Table 3.

Type I error Inline graphic of the tests at level Inline graphic

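| $n$ | $p$ | Corr. | Bonf. | Burden | PCA | SKAT | Inline graphic | Inline graphic |
|---|---|---|---|---|---|---|---|---|
| 500 | 500 | CS | 1.01 | 1.08 | 1.31 | 0.01 | 5.06 | 1.04 |
| 500 | 500 | AR | 0.96 | 1.16 | 1.16 | 0.00 | 4.08 | 0.97 |
| 1000 | 1000 | CS | 1.34 | 1.01 | 1.20 | 0.00 | 3.19 | 1.08 |
| 1000 | 1000 | AR | 1.30 | 1.08 | 1.10 | 0.00 | 2.86 | 1.05 |
| 1500 | 1500 | CS | 1.20 | 1.11 | 1.00 | 0.00 | 2.29 | 0.95 |
| 1500 | 1500 | AR | 1.11 | 1.04 | 1.34 | 0.00 | 2.23 | 0.95 |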

Corr., correlation structure; CS, compound symmetric; AR, autoregressive; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test.
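When reading empirical Type I error rates at such a small level, it helps to keep the Monte Carlo error in mind. The following sketch computes the binomial standard error of an estimated rejection rate for the stated design of 1 000 000 simulations at level 0.0001. The function name is hypothetical, and the calculation assumes independent replicates and an exactly calibrated test.

```python
import math


def mc_standard_error(alpha, n_sim):
    """Binomial standard error of an empirical rejection rate.

    Assumes n_sim independent replicates, each rejecting with probability alpha.
    """
    return math.sqrt(alpha * (1.0 - alpha) / n_sim)


alpha = 1e-4          # nominal level stated in the text
n_sim = 1_000_000     # number of simulations stated in the text

se = mc_standard_error(alpha, n_sim)
expected_rejections = alpha * n_sim
two_se_band = (alpha - 2.0 * se, alpha + 2.0 * se)

print(f"expected rejections: {expected_rejections:.0f}")
print(f"standard error: {se:.2e}")
print(f"approximate two-SE band: {two_se_band[0]:.2e} to {two_se_band[1]:.2e}")
```

The standard error is close to 0.00001, so empirical rates between roughly 0.00008 and 0.00012 are compatible with exact control at level 0.0001 under these assumptions.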

4. Real-data analysis

We analysed the high-density lipoprotein cholesterol data from the Genomics and Randomized Trials Network in the Women’s Health Initiative (Coviello et al., 2012). The overall goal of the study is to identify novel genetic factors that contribute to the incidence of myocardial infarction, stroke and diabetes. DNA samples were genotyped on the HumanOmni-Quad platform, and genotypes were imputed with reference panels. Genetic variants that have imputations Inline graphic and minor allele frequency greater than 5% were included. We focused on the 3990 samples of Caucasian ancestry.

We first tested whether our approach can capture existing genetic pathways that are known to be involved in high-density lipoprotein metabolism. Assmann & Gotto (2004) listed a pathway involved in the generation and conversion of high-density lipoprotein. The pathway includes 11 genes: APOA1, APOE, LCAT, LIPC, CETP, PLTP, SCARB, LRP1, LDLR, ABCA1 and ABCF1. We mapped the genetic variants to these genes and obtained 629 variants for this pathway. We adjusted for the following covariates: age, hormone replacement therapy arm, smoking status, body mass index, and the first two principal components for ancestry (Asselbergs et al., 2012). The Inline graphic-values for the pathway analysis are displayed in Table 4. Several methods yielded low Inline graphic-values, including the Bonferroni test, the sequence kernel association test and the proposed tests Inline graphic and Inline graphic. The test Inline graphic yielded the lowest Inline graphic-value. The Inline graphic-value of the test Inline graphic is lower than that of Inline graphic because a number of variants in the CETP and LIPC genes were observed to carry strong association signals that exceed the power-enhancement threshold.

Table 4.

Real-data analysis: Inline graphic-values of the tests for the known lipid pathway

  Bonf. Burden PCA SKAT Inline graphic Inline graphic
Inline graphic-value Inline graphic 0.252 0.034 Inline graphic Inline graphic Inline graphic

Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test.

Next, we investigated the associations between the KEGG pathways and high-density lipoprotein. The KEGG database contains 186 pathways, which represent a wide variety of cellular processes and molecular functions; for more details see http://www.genome.jp/kegg/pathway.html. We excluded one pathway from our analysis due to overlapping, so our real-data analysis includes 185 pathways. Figure 2 provides an overview of the number of variants in each of the 185 pathways. The median number of variants in these pathways is around 3000. A number of pathways have more than 10 000 variants, with some containing nearly 25 000.

Fig. 2.

The number of single-nucleotide polymorphisms, SNPs, in each of the 185 KEGG pathways.

To control for the familywise Type I error, the threshold of significance was set to Inline graphic, i.e., a Bonferroni correction. Table 5 shows the pathways that pass the significance threshold in any of the tests. The Inline graphic approach identified three pathways: arachidonic acid metabolism, metabolism of xenobiotics by cytochrome P450, and drug metabolism by cytochrome P450. The Inline graphic statistic yielded the same values as Inline graphic, indicating that no signal exceeds the power-enhancement threshold in the studied pathways. The sequence kernel association test detected only the arachidonic acid metabolism pathway, while the other methods identified no significant pathway.

Table 5.

The Inline graphic-values Inline graphic of the tests for the three significant KEGG pathways; Inline graphic-values lower than Inline graphic are indicated by Inline graphic

        Inline graphic-value    
  #SNPs Bonf. Burden PCA SKAT Inline graphic Inline graphic
Arach. acid metab. 2590 5.11 0.57 0.07 0.02Inline graphic Inline graphic Inline graphic
Metab. xenobio. 2254 6.30 0.46 0.18 0.07 Inline graphic Inline graphic
Drug metab. 2385 7.86 0.39 0.16 0.04 Inline graphic Inline graphic

#SNPs, number of single-nucleotide polymorphisms; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test; Arach. acid metab., arachidonic acid metabolism pathway; Metab. xenobio., metabolism of xenobiotics by cytochrome P450; Drug metab., drug metabolism by cytochrome P450.
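The Bonferroni correction used for the pathway-level tests divides the familywise level by the number of pathways tested. A minimal sketch follows, in which the familywise level of 0.05 is an assumed illustrative value rather than the level reported in the analysis, and the function names and example p-values are hypothetical.

```python
def bonferroni_threshold(familywise_level, n_tests):
    """Per-test threshold that keeps the familywise error at or below the given level."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return familywise_level / n_tests


def significant(p_values, familywise_level, n_tests):
    """Names of the tests whose p-value falls below the Bonferroni threshold."""
    cutoff = bonferroni_threshold(familywise_level, n_tests)
    return [name for name, p in p_values.items() if p < cutoff]


# 185 KEGG pathways as in the text; 0.05 is an assumed familywise level.
threshold = bonferroni_threshold(0.05, 185)
print(f"{threshold:.2e}")

# Hypothetical p-values, for illustration only.
example = {"pathway_a": 1e-5, "pathway_b": 0.01}
print(significant(example, 0.05, 185))
```

With 185 tests and the assumed level of 0.05, the per-pathway threshold is about 0.00027.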

The arachidonic acid metabolism pathway contains 2590 variants in 55 genes. A recent biological study suggested that this pathway is an important regulator of cholesterol metabolism (Demetz et al., 2014). The linkage disequilibrium plot of the genetic variants of this pathway in Fig. 3(a) shows that variants in proximity to each other tend to have strong correlations, while those far apart have barely detectable correlations. To gain more insight, we plot the marginal Inline graphic-values for all 2590 variants in Fig. 3(b). There are a number of variants with Inline graphic-values between Inline graphic and Inline graphic, but none of them reaches genome-wide significance. Nevertheless, the proposed Inline graphic statistic was able to aggregate these relatively mild signals into a stronger one, which leads to the detection of the arachidonic acid metabolism pathway. The Supplementary Material lists the variants that contribute to the significance of this pathway, and gives the linkage disequilibrium plots and marginal Inline graphic-values of the variants in the other two pathways, metabolism of xenobiotics by cytochrome P450 and drug metabolism by cytochrome P450.

Fig. 3.

Analysis of the arachidonic acid metabolism pathway: (a) linkage disequilibrium plot; (b) marginal Inline graphic-values of the 2590 single-nucleotide polymorphisms, SNPs, in the pathway, where the dashed line represents the Bonferroni threshold.

5. Discussion

Our approach can be extended to deal with non-Gaussian errors as long as the errors satisfy the moment condition Inline graphic for some constant Inline graphic and Inline graphic. In such a situation, we can adjust the denominator of Inline graphic in (3) from Inline graphic to Inline graphic, where Inline graphic is the kurtosis of the errors and Inline graphic is the Inline graphicth diagonal entry of Inline graphic. Then, using the results in Bhansali et al. (2007), we can show the asymptotic normality of the adjusted test statistic. Our approach can also be extended to accommodate genetic interactions.

Screening techniques have been used in genetic association studies to filter out irrelevant variants; see, for example, Li et al. (2014) and Cui et al. (2015). However, these screening procedures are typically used as a variable-selection step to reduce dimensions, not for statistical testing. In contrast, our screening statistic is directly integrated into the test statistic and is designed for statistical testing. Our approach has focused on the fixed design, which is commonly considered in genetic studies. It will be interesting to develop similar methods under the random design, although it remains challenging to establish the asymptotic properties of the proposed statistics in high dimensions.

Acknowledgement

This research was supported by the U.S. National Institutes of Health. We thank the Women’s Health Initiative investigators for sharing the data. The Women’s Health Initiative programme is funded by the National Heart, Lung and Blood Institute. Correspondence should be addressed to QH. We thank the editor, associate editor and reviewers for helpful comments.

Supplementary material

Supplementary material available at Biometrika online includes technical proofs, together with additional simulation results and real-data analysis.

References

  1. Asselbergs, F. W., Guo, Y., Van Iperen, E. P., Sivapalaratnam, S., Tragante, V., Lanktree, M. B., Lange, L. A., Almoguera, B., Appelman, Y. E., Barnard, J., et al. (2012). Large-scale gene-centric meta-analysis across 32 studies identifies multiple lipid loci. Am. J. Hum. Genet. 91, 823–38.
  2. Assmann, G. & Gotto, A. M. (2004). HDL cholesterol and protective factors in atherosclerosis. Circulation 109, III8–14.
  3. Avery, C. L., He, Q., North, K. E., Ambite, J. L., Boerwinkle, E., Fornage, M., Hindorff, L. A., Kooperberg, C., Meigs, J. B., Pankow, J. S., et al. (2011). A phenomics-based strategy identifies loci on APOC1, BRAP, and PLCG1 associated with metabolic syndrome phenotype domains. PLoS Genet. 7, e1002322.
  4. Bhansali, R., Giraitis, L. & Kokoszka, P. (2007). Convergence of quadratic forms with nonvanishing diagonal. Statist. Prob. Lett. 77, 726–34.
  5. Buas, M. F., He, Q., Johnson, L. G., Onstad, L., Levine, D. M., Thrift, A. P., Gharahkhani, P., Palles, C., Lagergren, J., Fitzgerald, R. C., et al. (2017). Germline variation in inflammation-related pathways and risk of Barrett’s oesophagus and oesophageal adenocarcinoma. Gut 66, 1739–47.
  6. Chen, L. S., Paul, D., Prentice, R. L. & Wang, P. (2011a). A regularized Hotelling’s $T^2$ test for pathway analysis in proteomic studies. J. Am. Statist. Assoc. 106, 1345–60.
  7. Chen, M., Cho, J. & Zhao, H. (2011b). Incorporating biological pathways via a Markov random field model in genome-wide association studies. PLoS Genet. 7, e1001353.
  8. Chen, S. X. & Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38, 808–35.
  9. Conneely, K. N. & Boehnke, M. (2007). So many correlated tests, so little time! Rapid adjustment of $P$ values for multiple correlated tests. Am. J. Hum. Genet. 81, 1158–68.
  10. Coviello, A. D., Haring, R., Wellons, M., Vaidya, D., Lehtimaki, T., Keildson, S., Lunetta, K. L., He, C., Fornage, M., Lagou, V., et al. (2012). A genome-wide association meta-analysis of circulating sex hormone–binding globulin reveals multiple Loci implicated in sex steroid hormone regulation. PLoS Genet. 8, e1002805.
  11. Cui, H., Li, R. & Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Am. Statist. Assoc. 110, 630–41.
  12. Dai, J. Y., Kooperberg, C., Leblanc, M. & Prentice, R. L. (2012). Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika 99, 929–44.
  13. Demetz, E., Schroll, A., Auer, K., Heim, C., Patsch, J. R., Eller, P., Theurl, M., Theurl, I., Theurl, M., Seifert, M., et al. (2014). The arachidonic acid metabolome serves as a conserved regulator of cholesterol metabolism. Cell Metab. 20, 787–98.
  14. Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika 101, 269–84.
  15. Fan, J., Guo, S. & Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Statist. Soc. B 74, 37–65.
  16. Fan, J., Liao, Y. & Yao, J. (2015). Power enhancement in high dimensional cross-sectional tests. Econometrica 83, 1497–541.
  17. Feller, W. (1971). Expansions in the case of varying components. In An Introduction to Probability Theory and Its Applications, vol. 2. New York: Wiley, pp. 546–8.
  18. Gregory, K. B., Carroll, R. J., Baladandayuthapani, V. & Lahiri, S. N. (2015). A two-sample test for equality of means in high dimension. J. Am. Statist. Assoc. 110, 837–49.
  19. He, Q., Zhang, H. H., Avery, C. L. & Lin, D. (2016). Sparse meta-analysis with high-dimensional data. Biostatistics 17, 205–20.
  20. International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437, 1299–320.
  21. Li, J., Zhong, W., Li, R. & Wu, R. (2014). A fast algorithm for detecting gene–gene interactions in genome-wide association studies. Ann. Appl. Statist. 8, 2292–318.
  22. McKeague, I. W. & Qian, M. (2015). An adaptive resampling test for detecting the presence of significant predictors. J. Am. Statist. Assoc. 110, 1422–33.
  23. Morgenthaler, S. & Thilly, W. G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test. Mutat. Res. 615, 28–56.
  24. Shen, D., Shen, H. & Marron, J. S. (2016). A general framework for consistency of principal component analysis. J. Mach. Learn. Res. 17, 1–34.
  25. Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M. & Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93.
  26. Zhang, G. (2015). Genetic architecture of complex human traits: What have we learned from genome-wide association studies? Curr. Genet. Med. 3, 143–50.
  27. Zhong, H., Yang, X., Kaplan, L. M., Molony, C. & Schadt, E. E. (2010). Integrating pathway analysis and genetics of gene expression for genome-wide association studies. Am. J. Hum. Genet. 86, 581–91.

Articles from Biometrika are provided here courtesy of Oxford University Press
