Summary
Genetic pathway analysis has become an important tool for investigating the association between a group of genetic variants and traits. With dense genotyping and extensive imputation, the number of genetic variants in biological pathways has increased considerably and sometimes exceeds the sample size $n$. Conducting genetic pathway analysis and statistical inference in such settings is challenging. We introduce an approach that can handle pathways whose dimension $p$ could be greater than $n$. Our method can be used to detect pathways that have nonsparse weak signals, as well as pathways that have sparse but stronger signals. We establish the asymptotic distribution for the proposed statistic and conduct theoretical analysis on its power. Simulation studies show that our test has correct Type I error control and is more powerful than existing approaches. An application to a genome-wide association study of high-density lipoproteins demonstrates the proposed approach.
Keywords: Genetic pathway analysis, Genetic variant, High-dimensional inference, Nonsparse signal, Power analysis, Sparse signal
1. Introduction
Genetic association analysis plays an important role in identifying genetic variants that are associated with traits. Genetic variants are often analysed by single-variant-based methods, using approaches such as Armitage’s trend test. Pathway-based analysis has become a popular tool for analysing genetic variant data (Chen et al., 2011b), whereby multiple genetic variants in the genes in a prespecified pathway are examined. There are several reasons to consider pathway analysis for association studies. First, pathways are generally defined using biological knowledge and thus are more likely to be functionally relevant (Zhong et al., 2010). Second, by analysing multiple variants simultaneously, pathway analysis has the potential to accumulate weak signals into stronger ones, while single-variant-based methods lack power in such a situation. Third, because the number of pathways is much smaller than the number of variants, the multiple-testing burden can be dramatically reduced.
One of the main challenges in pathway analysis is to deal with the high dimensionality. With increasingly dense genotyping and extensive imputation, the number of variants $p$ in genetic pathways has grown so rapidly that it can be larger than the sample size $n$. This is seen in our real-data example, where the sample size is around 4000 while the number of single nucleotide polymorphisms in a pathway can be as large as 25 000. In such high dimensions, statistical testing methods that were designed for moderate $p$, such as the likelihood ratio test, tend to have low power or may be inapplicable. To deal with the high dimensionality in pathway analysis, one potential approach is the burden test (Morgenthaler & Thilly, 2007), in which one simply sums the genotypes into a single predictor and then subjects this predictor to regression analysis. The burden test works well if all the variants have similar effect sizes, but this assumption rarely holds in real situations. Another common approach to dealing with high dimensions is to use principal component analysis in the regression modelling. One first derives the principal components from the genetic pathway under consideration, and then uses the leading components for association analysis (Buas et al., 2017). The disadvantages of this approach are that principal components with large variations need not be associated with the traits; it is rarely clear how many principal components to include; the interpretation of the regression coefficients can be difficult; and when $p > n$, the estimated principal components may not be consistent (Shen et al., 2016). Complementary to the aforementioned approaches, kernel machine methods such as the sequence kernel association test (Wu et al., 2011) can also be applied to genetic pathway analysis. However, the latter test has been used primarily to analyse moderate-sized variant sets, and its performance in cases where $p$ is substantially larger is unclear. Other methods that have been developed for testing a group of genetic features in high-dimensional settings (Chen & Qin, 2010; Chen et al., 2011a; Gregory et al., 2015) focus on testing the mean difference between two groups rather than conducting association analysis.
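The burden test described above can be sketched in a few lines. This is only a minimal illustration under simplifying assumptions: there are no adjusting covariates beyond an intercept, and the helper names `burden_score` and `simple_regression_t` as well as the toy data are hypothetical, not taken from the cited work.

```python
import math


def burden_score(genotype_rows):
    """Collapse each subject's genotypes (0/1/2 per variant) into one sum."""
    return [sum(row) for row in genotype_rows]


def simple_regression_t(x, y):
    """Slope and t-statistic from regressing y on x with an intercept."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    residuals = [(b - my) - slope * (a - mx) for a, b in zip(x, y)]
    s2 = sum(r * r for r in residuals) / (n - 2)  # residual variance
    return slope, slope / math.sqrt(s2 / sxx)


# Toy data (hypothetical): five subjects, three variants.
genotype_rows = [[0, 1, 0], [1, 1, 2], [0, 0, 0], [2, 1, 1], [1, 0, 1]]
trait = [0.5, 2.1, 0.1, 1.8, 1.2]
scores = burden_score(genotype_rows)
slope, t_value = simple_regression_t(scores, trait)
```

Because every variant enters the score with the same weight and sign, effects of opposite direction cancel in the sum, which is one way to see why the test relies on similar effect sizes.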
In addition to the high-dimensional challenge, another difficulty in pathway analysis is power maximization under multiple plausible alternative hypotheses. For pathway analysis, the alternative hypothesis concerns both the number and the magnitudes of the nonzero genetic signals, which are generally unknown (Zhang, 2015). A situation often considered for genetic signals is that a pathway harbours potentially many variants with weak effects, called the nonsparse-signal situation. The sequence kernel association test can aggregate multiple signals and is potentially applicable to such a setting. Another possibility is that a genetic pathway contains only a few strong signals, called the sparse-signal situation. Several methods have been proposed to deal with this case, such as the test of Conneely & Boehnke (2007), which first examines each variant individually and then seeks to obtain the $p$-value for the maximum of the observed statistics. However, this test has little power in the nonsparse situation, while the sequence kernel association test loses power in the sparse situation.
In this paper, we propose a method for conducting high-dimensional genetic pathway analysis, where the dimension $p$ of the pathway can go to infinity and could exceed the sample size $n$. Our approach can be used to identify pathways that harbour a large number of weak signals, i.e., nonsparse signals, as well as genetic pathways that contain only a few strong signals, i.e., sparse signals, or a mixture of weak and strong signals. We establish the asymptotic properties of the proposed statistics in high dimensions and conduct theoretical analysis of their power.
2. Methods
2.1. Model and statistics
Suppose that the data consist of a continuous trait vector $Y$ of length $n$, an $n \times q$ matrix of adjusting covariates $X$ and an $n \times p$ genotype matrix $G$ for a genetic pathway; that is, the pathway being considered contains $p$ genetic variants. Suppose that the true regression model is

$$Y = X\alpha + G\beta + \epsilon,$$

where $\alpha$ is the coefficient vector for $X$, which includes the intercept, $\beta$ is the coefficient vector for $G$, and $\epsilon$ is a vector of independent Gaussian errors with mean zero and variance $\sigma^2$. The design matrices $X$ and $G$ are considered fixed. The dimension $q$ of the adjusting covariates is assumed to be finite, while the dimension $p$ of the genotype matrix can go to infinity.
We are interested in testing the global null hypothesis $H_0\colon \beta = 0$, where $\beta$ is the coefficient vector of the genetic variants, against the alternative $\beta \neq 0$. Tests such as the likelihood ratio test and Wald test consider all the $p$ variants jointly and tend to perform poorly when $p$ is large; the statistics may not exist when $p > n$. Marginal statistics are easy to calculate and have been widely used to evaluate the significance of each individual variant. Recall that in a marginal analysis, one first fits a regression model for a given variant, say the $j$th, by
(
) and then obtains the marginal score statistic as
![]() |
with
, where
is the identity matrix. To conduct a pathway analysis, it is natural to consider the sum of all the squared marginal statistics,
In fact, it can be shown that
is equivalent to the sequence kernel association test statistic, if the estimator
of
is ignored in the latter. However, our proposed approach is not focused on
per se, but rather uses
to develop a suite of statistics for high-dimensional settings, particularly for the case of
for a constant
.
Under the null hypothesis
, it can be shown that
and var
, where
, with
a diagonal matrix whose elements are
(
), and
is the Frobenius norm. For the moment we assume that
is known, but later on we will address the practical situation where
needs to be estimated. We propose to standardize
, which yields
![]() |
(1) |
where the superscript
emphasizes that both
and
can go to infinity; it will be suppressed below for ease of notation. Expression (1) suggests that
may converge to normality as
gets large. However, the central limit theorem does not directly apply here because the
are correlated. In fact, the correlation matrix for the
,
, can be shown to have the form
![]() |
and it can further be shown that
. In Lemma 1 we show that under proper conditions,
is standard normal as both
and
go to infinity.
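As a rough illustration of the standardization in (1), the sketch below computes a sum of squared marginal score statistics and centres and scales it using a known error variance. It rests on assumptions that are not taken from the text: each marginal score is scaled so that it has variance $\sigma^2$ under the null hypothesis, in which case the null mean of the sum is $p\sigma^2$ and its null variance is $2\sigma^4$ times the squared Frobenius norm of the correlation matrix of the scores. The paper's exact scaling may differ, and the function names are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def residualize(v, basis):
    """Subtract from v its projection onto each orthonormal basis vector."""
    out = list(v)
    for b in basis:
        c = dot(out, b)
        out = [o - c * bi for o, bi in zip(out, b)]
    return out


def orthonormal_basis(columns):
    """Modified Gram-Schmidt on the covariate columns (intercept included)."""
    basis = []
    for col in columns:
        r = residualize(col, basis)
        norm = math.sqrt(dot(r, r))
        if norm > 1e-10:  # skip numerically redundant columns
            basis.append([x / norm for x in r])
    return basis


def standardized_sum_of_squares(y, covariates, variants, sigma2):
    """Centre and scale the sum of squared marginal scores (sigma2 known)."""
    basis = orthonormal_basis(covariates)
    scaled = []
    for g in variants:
        r = residualize(g, basis)  # variant adjusted for covariates
        norm = math.sqrt(dot(r, r))  # assumes the adjusted variant is nonzero
        scaled.append([x / norm for x in r])
    scores = [dot(c, y) for c in scaled]  # each has variance sigma2 under H0
    q = sum(u * u for u in scores)
    p = len(scaled)
    # Squared Frobenius norm of the correlation matrix of the scores.
    frob2 = sum(dot(a, b) ** 2 for a in scaled for b in scaled)
    return (q - p * sigma2) / (sigma2 * math.sqrt(2.0 * frob2))


# Toy data (hypothetical): 30 subjects, 8 variants, intercept plus one covariate.
rng = random.Random(1)
n, p = 30, 8
intercept = [1.0] * n
age = [rng.gauss(0.0, 1.0) for _ in range(n)]
variants = [[float(rng.choice([0, 1, 2])) for _ in range(n)] for _ in range(p)]
trait = [rng.gauss(0.0, 1.0) for _ in range(n)]
t_stat = standardized_sum_of_squares(trait, [intercept, age], variants, sigma2=1.0)
```

Note that the calculation needs only inner products between adjusted variants and never inverts the correlation matrix, so it remains defined when the number of variants exceeds the sample size.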
Before presenting Lemma 1, we define some notation. For a vector
, let
be the
-norm of the vector for
. For any
matrix
, denote the induced
-norm by
. When
is an
matrix, we denote its maximum and minimum eigenvalues by
and
, respectively.
Lemma 1.
Let
. If
(2) then under
, the statistic
in distribution.
Remark 1.
Here we have no constraint on the order of
with respect to
, providing they both go to infinity. Condition (2) is mild for genetic studies. By Hölder’s inequality,
, where
is the maximum absolute column sum of the matrix. When the correlation structure in
is not overly strong, as is the case for the power-decay structure, i.e.,
for some
, then one can show that
. Here
, the correlation of
and
, can be interpreted as the linkage disequilibrium of genetic variants after adjusting for covariates
; when there are no adjusting covariates,
reduces to the linkage disequilibrium matrix of
. The power decay structure indicates that two distant genetic variants have virtually no linkage disequilibrium, which is indeed what is observed in genetic studies, particularly in the human genome data (International HapMap Consortium, 2005). Similar structures have also been used in other articles on genetic studies, such as Dai et al. (2012). Our proposed statistic naturally takes linkage disequilibrium into account, because
. The linkage disequilibrium can influence both the denominator and the numerator of
, so the impact of the linkage disequilibrium on the power of the proposed test is influenced by the size and density of the genetic signals. However, the linkage disequilibrium will not affect the validity of the test or its asymptotic properties, because the calculation of
does not involve inversion of the linkage disequilibrium matrix, and the normality of the proposed statistics requires only that distant variants tend to have linkage disequilibrium approaching zero. In practice, variants in a gene tend to be in linkage disequilibrium, while those for different genes are generally not in linkage disequilibrium; this type of structure is covered in Lemma 1.
So far we have assumed that the noise level $\sigma^2$ is known. To make our proposal practical, it is tempting to replace $\sigma^2$ with a consistent estimator $\hat{\sigma}^2$. It turns out that the validity of doing so depends on the order of $p$ relative to $n$. In the following, we elaborate on this and propose different statistics to accommodate different ratios $p/n$.
We first consider the situation where
for some
, i.e.,
is of smaller order than
. The following lemma shows that if we replace
with a consistent estimator
, normality still holds.
Lemma 2.
Suppose that (2) holds. Let
be a root-
-consistent estimator of
such that
. Then under
, as
such that
for
,
in distribution.
Next, we consider the situation in which
for some constant
. The normality of
no longer holds because
becomes excessively large; see the proof of Lemma 2 for more details. In light of this, we propose a new statistic
![]() |
(3) |
where
, with
being the number of adjusting covariates as mentioned earlier. The
in the numerator of
is replaced by
in
. The motivation behind this is that
estimates
under
. We discovered that this replacement of
enables one to overcome the limitation of
in high dimensions. The following theorem shows that
follows a normal distribution for
.
Theorem 1.
Suppose that (2) holds. For any consistent estimator
, as
such that
, if
for some constant
, then under
we have
in distribution.
Theorem 1 allows one to conduct statistical inference for pathway analysis when
, although
should not be excessively larger than
. The condition
is necessary to prevent the
in the denominator equalling zero, as it can be shown that
. To obtain a consistent estimator for
under
in a high-dimensional setting, Fan et al. (2012) proposed a refitted crossvalidation method based on procedures that satisfy the sure screening property. When the sparsity of the model is completely unknown, we can also estimate
by the moment-based estimators of Dicker (2014), which are root-
consistent when
.
2.2. Power loss in the presence of sparse signals
The proposed statistic can handle situations where the association signals are spread out over a large number of genetic variants. However, its power will be relatively low for the sparse-signal situation, in which a few genetic variants carry strong signals while all the others have zero coefficients. Fan et al. (2015) proposed the power-enhancement principle, the fundamental idea of which is to include a screening statistic that goes to zero under the null hypothesis, but is nonzero under the sparse alternatives. Motivated by this principle, we propose a statistic that strengthens the sum-of-squares statistic and is able to guard against potential power loss in the sparse-signal situation.
We define a screening set
where
is a threshold chosen to be slightly larger than the maximum estimation error of the marginal estimator, i.e.,
. Then, a power-enhancement component
is
![]() |
where
denotes the sign of
. Our statistic that is able to detect both nonsparse and sparse signals is
.
Since
has the same sign as
,
always has power at least that of
. The threshold
needs to ensure that the screening set
is empty with probability approaching 1 under
, so that the size of
will be asymptotically equivalent to that of
. Then, under
, if an estimator is large enough that
is nonempty, one can gain power. For Gaussian and sub-Gaussian errors,
can be chosen to be
, as suggested by Fan et al. (2015). The power-enhancement procedure in Fan et al. (2015) deals with a consistent estimator under
, which is not available in our procedure, while our approach builds upon marginal estimators which are inconsistent under
. Nevertheless, the size of our proposed statistic is asymptotically equivalent to that of
under
; in the next subsection, we will show that under the sparse alternatives
can be powerful even when
is not.
Lemma 3.
Under the same conditions as in Theorem 1, if
where
as
, then under the null hypothesis
we have
in distribution. Thus, the sizes of
and
are asymptotically equivalent.
To select
in practice, we propose an adaptive procedure to accommodate different correlation structures. We first generate a vector of
random errors
from the standard normal distribution. Then we compute the maximum of the marginal estimators as
. Finally, we repeat these two steps many times and set
based on all the replicates. McKeague & Qian (2015) also used an adaptive approach to determine threshold parameters for high-dimensional testing.
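The adaptive procedure above can be sketched as a small Monte Carlo loop. Several details are assumptions rather than the paper's specification: the only adjusting covariate is an intercept, so each variant is simply centred; the marginal estimator is the least-squares slope of the simulated errors on the centred variant; and the threshold is taken as the largest maximum over all replicates, because the exact summary rule is not recoverable from the text. The function names are hypothetical.

```python
import random


def center(column):
    m = sum(column) / len(column)
    return [x - m for x in column]


def adaptive_threshold(variants, replicates=200, seed=0):
    """Largest null maximum of the absolute marginal estimators."""
    rng = random.Random(seed)
    centered = [center(g) for g in variants]
    norms2 = [sum(x * x for x in g) for g in centered]
    n = len(variants[0])
    maxima = []
    for _ in range(replicates):
        # Step 1: pure-noise trait from the standard normal distribution.
        noise = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Step 2: marginal estimators under the null, then their maximum.
        estimates = [
            sum(a * b for a, b in zip(g, noise)) / s
            for g, s in zip(centered, norms2)
        ]
        maxima.append(max(abs(e) for e in estimates))
    # Step 3: summarize over replicates (assumed rule: take the largest).
    return max(maxima)


# Toy data (hypothetical): 10 variants measured on 40 subjects.
data_rng = random.Random(7)
variants = [[data_rng.choice([0, 1, 2]) for _ in range(40)] for _ in range(10)]
delta = adaptive_threshold(variants, replicates=200, seed=0)
```

Because the simulated maxima are computed from the observed genotypes, the resulting threshold reflects the correlation structure of the pathway at hand rather than a worst-case bound.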
2.3. Power analysis
In this subsection we investigate the asymptotic power of the proposed tests
and
for nonsparse and sparse alternatives. Under
, let
be the set of nonzero coefficients, and let
. Define the subvector
and the submatrix
. Let
be the diagonal matrix with nonzero elements
for
. Similarly, let
,
and
denote the corresponding quantities for
.
The following theorem states that the sum-of-squares type of statistic has high power for the nonsparse-signal situation when the accumulated signals are sufficiently large.
Theorem 2.
Suppose that all the conditions in Theorem 1 hold. Consider a nonsparse alternative
in which
for a sufficiently large constant
. If
and
for some constants
, then as
,
, where
is the
-quantile of the standard normal distribution.
While the sum-of-squares statistic can have high power under nonsparse alternatives, it may lose power under sparse alternatives. In the following theorem we show that the power-enhanced statistic, which adds a power-enhancement term to the sum-of-squares statistic, can be powerful under both nonsparse and sparse alternatives.
Theorem 3.
Assume that the conditions in Theorem 2 hold. Consider a sparse alternative
in which
for a sufficiently large constant
. If
for some constant
, then under either the nonsparse alternative
or the sparse alternative
, as
,
.
In practice, we recommend use of the power-enhanced statistic for detecting both weak and strong signals. However, if one wishes to distinguish between the sparse signals and the nonsparse signals, one can compare the power-enhanced statistic with the statistic without the power-enhancement component. If the former is larger, then the power-enhancement component is nonzero and there exist strong signals in the pathway. If the two are equal, then there are no strong signals in the pathway and the significance is driven by weak signals.
2.4. Incorporating biological information into the proposed statistics
The statistics
and
give equal weight to all the variants. In some applications, one may wish to assign different weights based on prior information. For example, if the effect of a genetic variant is related to its minor allele frequency, one may assign a weight
to this variant, where
is the minor allele frequency for the
$j$th variant. In other cases, one may assign functional scores to different variants to reflect their biological functions. In light of these considerations, we propose incorporating prior biological information into our proposed statistics as follows.
Let
(
) be prespecified positive weights, and let
be the diagonal matrix with elements
. Next, define
and
. Let
and
Similar to
, we define a statistic
. The following result shows the asymptotic normality of
.
Corollary 1.
Suppose that (2) holds and
as
. Assume that as
,
. For any consistent estimator
, if
for some constant
, then under
in distribution.
As was done for
, we can add
to
to guard against potential power loss in the presence of strong signals. Thus, our proposed statistics can readily accommodate prior biological information and still preserve their theoretical properties.
2.5. Edgeworth expansion for extreme significance levels
Genetic studies sometimes involve a large number of pathways, so the significance level can be much lower than 0.05. For example, in our real-data analysis, the significance level is 0.0003. At such levels, the normal distribution in Lemma 3 may be a poor approximation. We therefore propose a two-term Edgeworth expansion to characterize the tail probability of
with higher accuracy. Recall that under
,
. It is known that
follows a mixed chi-squared distribution with weights
, where
are the eigenvalues of
. Using the Edgeworth expansion for independent random variables with varying distributions (Feller, 1971, p. 546), we can derive the following two-term expansion for
:
![]() |
(4) |
where
for
. Further,
. Then, under the conditions in Theorem 1, the last remainder term in (4) can be shown to be
. This expansion tends to be more accurate than the normal approximation, as the remainder term of the normal approximation is typically
. Directly calculating (4) involves computing the
, which can be onerous when
and
are large. Instead, we can use the identity
for
. We call the test that uses (4) to approximate the
-value for
the
test. Similarly, we can apply an Edgeworth expansion to
, and we call the resulting test
.
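To make the idea of a two-term Edgeworth correction concrete, the sketch below approximates the upper tail of a weighted sum of independent chi-squared variables with one degree of freedom. It uses the textbook form of the expansion with skewness and excess-kurtosis terms, which is an assumption about the shape of (4) rather than a transcription of it, and the function name is hypothetical. The expansion needs only the power sums of the weights up to order four, and the $k$th power sum of the eigenvalues of a symmetric matrix equals the trace of its $k$th power, so no eigendecomposition is required in practice.

```python
import math


def edgeworth_upper_tail(weights, q):
    """Two-term Edgeworth approximation to P(sum_j w_j * chi2_1 > q)."""
    s1 = sum(weights)
    s2 = sum(w ** 2 for w in weights)
    s3 = sum(w ** 3 for w in weights)
    s4 = sum(w ** 4 for w in weights)
    sd = math.sqrt(2.0 * s2)          # variance of the sum is 2 * s2
    x = (q - s1) / sd                 # standardized evaluation point
    skew = 8.0 * s3 / sd ** 3         # third cumulant is 8 * s3
    exkurt = 48.0 * s4 / sd ** 4      # fourth cumulant is 48 * s4
    phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    normal_tail = 0.5 * math.erfc(x / math.sqrt(2.0))
    correction = (
        skew / 6.0 * (x ** 2 - 1.0)
        + exkurt / 24.0 * (x ** 3 - 3.0 * x)
        + skew ** 2 / 72.0 * (x ** 5 - 10.0 * x ** 3 + 15.0 * x)
    )
    return normal_tail + phi * correction


# Example: 50 equal weights, i.e. a chi-squared variable with 50 degrees
# of freedom, evaluated three standard deviations above its mean.
approx_tail = edgeworth_upper_tail([1.0] * 50, 80.0)
```

In this equal-weight example the exact tail probability is available in closed form, which gives a simple way to check that the corrected value is closer to the truth than the plain normal tail.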
3. Simulation studies
Monte Carlo simulations were conducted to evaluate the performance of the two proposed tests in high-dimensional settings and to compare them with the Bonferroni test, the burden test, principal component analysis, and the sequence kernel association test.
We generated the genotype matrix
similarly to He et al. (2016). For each person, we first generated a block-diagonal covariance matrix with each block being a
matrix
. We considered compound symmetric
with diagonal elements
and off-diagonal elements 0.5 and autoregressive
with
th off-diagonal element 0.6
. Then we trichotomized the simulated vector into genotype values of
according to the Hardy–Weinberg equilibrium.
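A genotype generator in the spirit of this description might look as follows. It is a sketch under stated assumptions: a single autoregressive block is simulated rather than a block-diagonal structure, the latent correlation between positions $i$ and $j$ is $0.6^{|i-j|}$, and the minor allele frequency `maf` is a hypothetical parameter whose value is not taken from the text. The latent normal vector is cut at the quantiles implied by Hardy–Weinberg genotype frequencies.

```python
import random
from statistics import NormalDist


def simulate_genotypes(n_subjects, n_variants, maf, rho=0.6, seed=0):
    """Latent AR(1) normal vectors trichotomized into 0/1/2 genotypes."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Hardy-Weinberg frequencies: (1 - maf)^2, 2 * maf * (1 - maf), maf^2.
    cut_low = std_normal.inv_cdf((1.0 - maf) ** 2)   # below: 0 minor alleles
    cut_high = std_normal.inv_cdf(1.0 - maf ** 2)    # above: 2 minor alleles
    innovation_sd = (1.0 - rho ** 2) ** 0.5          # keeps unit variance
    rows = []
    for _ in range(n_subjects):
        z = rng.gauss(0.0, 1.0)
        row = []
        for k in range(n_variants):
            if k > 0:
                z = rho * z + innovation_sd * rng.gauss(0.0, 1.0)
            if z <= cut_low:
                row.append(0)
            elif z <= cut_high:
                row.append(1)
            else:
                row.append(2)
        rows.append(row)
    return rows


# Example (hypothetical settings): 2000 subjects, 20 variants, maf = 0.3.
genotypes = simulate_genotypes(2000, 20, maf=0.3, rho=0.6, seed=1)
```

Under these cut points the expected genotype value is twice the minor allele frequency, which provides a quick sanity check on simulated data.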
We generated the trait by setting
(
), where
is an adjusting covariate and
with
. For the sample size $n$ and the dimension $p$, we considered both the $p < n$ and the $p \geq n$ cases by setting $n = 500$ with $p = 300, 500, 1000$ and setting $n = 1000$ with $p = 800, 1000, 1500$.
The null model is
(
). To simulate the data under various alternatives, we assume that
has
nonzero signals with support set
. The magnitude of the signal
was set to be
, and half of the
had positive signs while the other half had negative signs. The magnitude of the signals varied from 0.03 to 0.05 under these set-ups, and the proportion of nonnull variants among all the variants,
, was set to 5%, 10%, 15% or 20%.
The two proposed tests were conducted as described in § 2. For variance estimation, we applied the refitted crossvalidation method (Fan et al., 2012) to estimate the error variance. The threshold parameter in the power-enhancement component was estimated over 1000 replicates. We used two versions of principal component analysis. In the first, we used the five leading principal components and performed a likelihood ratio test. In the second version, we included the principal components that explain 50% of the total variance for the likelihood ratio test. The reason for considering the second version is that, in practice, a few principal components may not always capture the majority of the variance, as seen in Avery et al. (2011). We also included the sequence kernel association test without any weights.

The Type I errors of the tests were calculated over 10 000 replications, and the power was based on 1000 replications. Table 1 displays the Type I errors of the tests for the models considered. It can be seen that the Bonferroni test, principal component analysis using the five leading components, the sequence kernel association test, and our two proposed tests all have their Type I errors controlled. Principal component analysis using components that explain 50% of the variance appears to have an inflated Type I error; this is likely due to the fact that many principal components are needed to account for 50% of the variance, and hence the likelihood ratio test has a large number of degrees of freedom. Because of its inflated Type I error, this method was excluded from the subsequent experiments.
Table 1.
Type I error (%) of the tests at level 0.05

| $n$ | $p$ | Corr. | Bonf. | Burden | PCA | PCA50 | SKAT | Proposed | Proposed + PE |
|---|---|---|---|---|---|---|---|---|---|
| 500 | 300 | CS | 4.66 | 4.90 | 5.00 | 5.97 | 2.79 | 4.84 | 5.10 |
| 500 | 300 | AR | 4.47 | 5.30 | 4.91 | 6.40 | 2.77 | 5.03 | 5.13 |
| 500 | 500 | CS | 4.80 | 4.92 | 5.06 | 6.59 | 2.04 | 4.89 | 4.97 |
| 500 | 500 | AR | 4.64 | 4.80 | 5.45 | 6.59 | 1.64 | 4.76 | 4.84 |
| 500 | 1000 | CS | 4.71 | 5.02 | 5.00 | 7.22 | 0.75 | 4.75 | 4.93 |
| 500 | 1000 | AR | 4.96 | 4.98 | 5.10 | 7.31 | 0.68 | 4.83 | 4.93 |
| 1000 | 800 | CS | 5.07 | 4.86 | 5.20 | 6.49 | 2.47 | 5.03 | 5.12 |
| 1000 | 800 | AR | 4.78 | 5.26 | 5.25 | 6.49 | 2.18 | 4.92 | 5.01 |
| 1000 | 1000 | CS | 4.92 | 5.08 | 4.71 | 6.13 | 2.02 | 4.85 | 5.04 |
| 1000 | 1000 | AR | 5.24 | 5.00 | 4.90 | 6.60 | 1.81 | 4.88 | 4.99 |
| 1000 | 1500 | CS | 4.92 | 5.08 | 5.27 | 7.07 | 1.38 | 4.95 | 5.14 |
| 1000 | 1500 | AR | 5.22 | 5.08 | 5.26 | 6.85 | 1.04 | 4.82 | 4.98 |

Corr., correlation structure; CS, compound symmetric; AR, autoregressive; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; PCA50, principal component analysis using components that explain 50% of the variance; SKAT, the sequence kernel association test; Proposed, the proposed sum-of-squares test; Proposed + PE, the proposed test with the power-enhancement component.
For the compound symmetric dependence structure, plots of the power for different sample sizes and dimensions against the proportion of nonzero signals are shown in Fig. 1. When this proportion is small, all the methods have low power. This indicates that when signals are sparse and weak, it is highly difficult to detect the association for the pathway considered. As the proportion increases, the power improves for all methods, because more variants carry association signals in the studied pathway. However, the two proposed tests always have higher power than the other approaches in these settings. The results for the autoregressive structure show a similar pattern.
Fig. 1.
Power of the Bonferroni test, the burden test, principal component analysis, the sequence kernel association test and the two proposed tests at level 0.05 for different sample sizes and dimensions, plotted against the proportion of nonzero signals. The compound symmetric dependence structure is considered.
We then considered the situation in which a genetic pathway contains both weak and strong signals. We simulated weak signals as described earlier, and then simulated a strong signal with
. Table 2 shows that both
and
compete favourably with the other statistics, and
has higher power than
.
Table 2.
Power (%) of the tests under mixed signals at level 0.05

| $n$ | $p$ | Corr. | Bonf. | Burden | PCA | SKAT | Proposed | Proposed + PE |
|---|---|---|---|---|---|---|---|---|
| 500 | 300 | CS | 36.0 | 5.7 | 15.6 | 30.8 | 35.1 | 39.8 |
| 500 | 300 | AR | 35.7 | 5.4 | 12.0 | 24.0 | 28.6 | 34.1 |
| 500 | 500 | CS | 34.8 | 5.3 | 15.7 | 33.6 | 43.5 | 46.7 |
| 500 | 500 | AR | 35.2 | 5.5 | 12.6 | 25.0 | 35.6 | 39.4 |
| 500 | 1000 | CS | 32.4 | 5.2 | 18.5 | 39.2 | 61.8 | 63.6 |
| 500 | 1000 | AR | 31.1 | 5.0 | 14.8 | 26.8 | 51.0 | 53.5 |
| 1000 | 800 | CS | 38.4 | 5.2 | 16.0 | 54.0 | 60.1 | 62.4 |
| 1000 | 800 | AR | 36.1 | 5.0 | 12.3 | 42.4 | 50.1 | 53.3 |
| 1000 | 1000 | CS | 36.7 | 5.0 | 17.0 | 59.2 | 67.2 | 69.0 |
| 1000 | 1000 | AR | 36.0 | 4.8 | 12.7 | 45.8 | 56.2 | 59.2 |
| 1000 | 1500 | CS | 35.4 | 5.1 | 18.1 | 67.3 | 79.4 | 80.9 |
| 1000 | 1500 | AR | 34.5 | 4.6 | 13.1 | 52.6 | 69.1 | 71.5 |

Corr., correlation structure; CS, compound symmetric; AR, autoregressive; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test; Proposed, the proposed sum-of-squares test; Proposed + PE, the proposed test with the power-enhancement component.
We conducted simulation studies to examine the performance of the Edgeworth-based test, which involves a two-term Edgeworth expansion and is expected to be more accurate than its normal-approximation counterpart in controlling Type I error at extreme significance levels. We set the significance level to 0.0001 and evaluated the Type I error with 1 000 000 simulations. The threshold parameter in the power enhancement was estimated over 50 000 replicates. Table 3 shows that at level 0.0001, the statistic based on the normal approximation tends to have inflated Type I error due to the less accurate characterization of the tail probability. In contrast, the Edgeworth-based test can control the Type I error well at 0.0001 when the sample size and dimension are sufficiently large.
Table 3.
Type I error ($\times 10^{-4}$) of the tests at level 0.0001

| $n$ | $p$ | Corr. | Bonf. | Burden | PCA | SKAT | Proposed (normal approx.) | Proposed (Edgeworth) |
|---|---|---|---|---|---|---|---|---|
| 500 | 500 | CS | 1.01 | 1.08 | 1.31 | 0.01 | 5.06 | 1.04 |
| 500 | 500 | AR | 0.96 | 1.16 | 1.16 | 0.00 | 4.08 | 0.97 |
| 1000 | 1000 | CS | 1.34 | 1.01 | 1.20 | 0.00 | 3.19 | 1.08 |
| 1000 | 1000 | AR | 1.30 | 1.08 | 1.10 | 0.00 | 2.86 | 1.05 |
| 1500 | 1500 | CS | 1.20 | 1.11 | 1.00 | 0.00 | 2.29 | 0.95 |
| 1500 | 1500 | AR | 1.11 | 1.04 | 1.34 | 0.00 | 2.23 | 0.95 |

Corr., correlation structure; CS, compound symmetric; AR, autoregressive; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test; Proposed (normal approx.), the proposed test based on the normal approximation; Proposed (Edgeworth), the proposed test based on the two-term Edgeworth expansion.
4. Real-data analysis
We analysed the high-density lipoprotein cholesterol data from the Genomics and Randomized Trials Network in the Women’s Health Initiative (Coviello et al., 2012). The overall goal of the study is to identify novel genetic factors that contribute to the incidence of myocardial infarction, stroke and diabetes. DNA samples were genotyped on the HumanOmni-Quad platform, and genotypes were imputed with reference panels. Genetic variants that have imputations
and minor allele frequency greater than 5% were included. We focused on the 3990 samples of Caucasian ancestry.
We first tested whether our approach can capture existing genetic pathways that are known to be involved in high-density lipoprotein metabolism. Assmann & Gotto (2004) listed a pathway involved in the generation and conversion of high-density lipoprotein. The pathway includes 11 genes: APOA1, APOE, LCAT, LIPC, CETP, PLTP, SCARB, LRP1, LDLR, ABCA1 and ABCF1. We mapped the genetic variants to these genes and obtained 629 variants for this pathway. We adjusted for the following covariates: age, hormone replacement therapy arm, smoking status, body mass index, and the first two principal components for ancestry (Asselbergs et al., 2012). The $p$-values for the pathway analysis are displayed in Table 4. Several methods yielded low $p$-values, including the Bonferroni test, the sequence kernel association test and the two proposed tests. The power-enhanced test yielded the lowest $p$-value; its $p$-value is lower than that of the test without the power-enhancement component because a number of variants in the CETP and LIPC genes were observed to carry strong association signals that exceed the power-enhancement threshold.
Table 4.
Real-data analysis: $p$-values of the tests for the known lipid pathway

| | Bonf. | Burden | PCA | SKAT | Proposed | Proposed + PE |
|---|---|---|---|---|---|---|
| $p$-value | | 0.252 | 0.034 | | | |

Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test; Proposed, the proposed sum-of-squares test; Proposed + PE, the proposed test with the power-enhancement component.
Next, we investigated the associations between the KEGG pathways and high-density lipoprotein. The KEGG database contains 186 pathways, which represent a wide variety of cellular processes and molecular functions; for more details see http://www.genome.jp/kegg/pathway.html. We excluded one pathway from our analysis due to overlapping, so our real-data analysis includes 185 pathways. Figure 2 provides an overview of the number of variants in each of the 185 pathways. The median number of variants in these pathways is around 3000. A number of pathways have more than 10 000 variants, with some containing nearly 25 000.
Fig. 2.
The number of single-nucleotide polymorphisms, SNPs, in each of the 185 KEGG pathways.
To control for the familywise Type I error, the threshold of significance was set to $0.05/185 \approx 2.7 \times 10^{-4}$, i.e., a Bonferroni correction. Table 5 shows the pathways that pass the significance threshold in any of the tests. The proposed sum-of-squares test identified three pathways: arachidonic acid metabolism, metabolism of xenobiotics by cytochrome P450, and drug metabolism by cytochrome P450. The power-enhanced statistic yielded the same values as the sum-of-squares statistic, indicating that no signal exceeds the power-enhancement threshold in the studied pathways. The sequence kernel association test detected only the arachidonic acid metabolism pathway, while the other methods identified no significant pathway.
Table 5.
The
-values
of the tests for the three significant KEGG pathways;
-values lower than
are indicated by 
-value |
|||||||
|---|---|---|---|---|---|---|---|
| #SNPs | Bonf. | Burden | PCA | SKAT |
|
|
|
| Arach. acid metab. | 2590 | 5.11 | 0.57 | 0.07 | 0.02
|
|
|
| Metab. xenobio. | 2254 | 6.30 | 0.46 | 0.18 | 0.07 |
|
|
| Drug metab. | 2385 | 7.86 | 0.39 | 0.16 | 0.04 |
|
|
#SNPs, number of single-nucleotide polymorphisms; Bonf., Bonferroni test; PCA, principal component analysis using the five leading components; SKAT, the sequence kernel association test; Arach. acid metab., arachidonic acid metabolism pathway; Metab. xenobio., metabolism of xenobiotics by cytochrome P450; Drug metab., drug metabolism by cytochrome P450.
The arachidonic acid metabolism pathway contains 2590 variants in 55 genes. A recent biological study suggested that this pathway is an important regulator of cholesterol metabolism (Demetz et al., 2014). The linkage disequilibrium plot of the genetic variants of this pathway in Fig. 3(a) shows that variants in proximity to each other tend to have strong correlations, while those far apart have barely detectable correlations. To gain more insight, we plot the marginal
-values for all 2590 variants in Fig. 3(b). There are a number of variants with
-values between
and
, but none of them reaches genome-wide significance. Instead, the proposed
statistic was able to aggregate these relatively mild signals into a stronger one, which leads to the detection of the arachidonic acid metabolism pathway. The variants that contribute to the significance of this pathway, the linkage disequilibrium plots and marginal
-values of variants in the other two pathways, metabolism of xenobiotics by cytochrome P450 and drug metabolism by cytochrome P450, are given in the Supplementary Material.
Fig. 3.
Analysis of the arachidonic acid metabolism pathway: (a) linkage disequilibrium plot; (b) marginal $p$-values of the 2590 single-nucleotide polymorphisms, SNPs, in the pathway, where the dashed line represents the Bonferroni threshold.
5. Discussion
Our approach can be extended to deal with non-Gaussian errors as long as the errors satisfy the moment condition
for some constant
and
. In such a situation, we can adjust the denominator of
in (3) from
to
, where
is the kurtosis of the errors and
is the
th diagonal entry of
. Then, using the results in Bhansali et al. (2007), we can show the asymptotic normality of the adjusted test statistic accordingly. Our approach can also be extended to accommodate genetic interactions.
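The role of the kurtosis and of the diagonal entries can be illustrated with the standard variance identity for a quadratic form $\varepsilon^{\mathrm T} A \varepsilon$ in independent, identically distributed, mean-zero errors with variance $\sigma^2$ and kurtosis $\kappa = E(\varepsilon^4)/\sigma^4$: for symmetric $A$, the variance equals $2\sigma^4\,\mathrm{tr}(A^2) + (\kappa - 3)\,\sigma^4 \sum_j a_{jj}^2$. The Python sketch below evaluates this identity. The function name and symbols are hypothetical, and it is an assumption that an adjustment of this type is what is intended; the exact adjusted denominator of the statistic in (3) is not reproduced here.

```python
def quadratic_form_variance(a, sigma2, kurtosis):
    """Variance of e' A e for i.i.d. mean-zero errors e.

    a        : square matrix given as a list of lists
    sigma2   : error variance
    kurtosis : E(e^4) / sigma2^2, equal to 3 for Gaussian errors
    """
    n = len(a)
    if any(len(row) != n for row in a):
        raise ValueError("a must be a square matrix")
    # Symmetrize: e' A e is unchanged when A is replaced by (A + A') / 2.
    s = [[(a[i][j] + a[j][i]) / 2.0 for j in range(n)] for i in range(n)]
    # tr(S^2) = sum over i, j of s_ij * s_ji.
    trace_sq = sum(s[i][j] * s[j][i] for i in range(n) for j in range(n))
    # Sum of squared diagonal entries, which enters only when kurtosis != 3.
    diag_sq = sum(s[i][i] ** 2 for i in range(n))
    return sigma2 ** 2 * (2.0 * trace_sq + (kurtosis - 3.0) * diag_sq)


# Gaussian errors (kurtosis 3): only the trace term remains.
gaussian_var = quadratic_form_variance([[1.0, 0.5], [0.5, 2.0]], 1.0, 3.0)

# Rademacher errors (+1 or -1, kurtosis 1) with A equal to the identity:
# e' e is constant, so the variance must be zero.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
rademacher_var = quadratic_form_variance(identity, 1.0, 1.0)

print(gaussian_var, rademacher_var)
```

The Gaussian case shows that the diagonal term vanishes when the kurtosis is 3, while the Rademacher case is a sanity check in which the quadratic form is constant and the correction term cancels the trace term exactly.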
Screening techniques have been used in genetic association studies to filter out irrelevant variants; see, for example, Li et al. (2014) and Cui et al. (2015). However, these screening procedures are typically used as a variable-selection step to reduce dimensions, not for statistical testing. In contrast, our screening statistic is directly integrated into the test statistic and is designed for statistical testing. Our approach has focused on the fixed design, which is commonly considered in genetic studies. It will be interesting to develop similar methods under the random design, although it remains challenging to establish the asymptotic properties of the proposed statistics in high dimensions.
Acknowledgement
This research was supported by the U.S. National Institutes of Health. We thank the Women’s Health Initiative investigators for sharing the data. The Women’s Health Initiative programme is funded by the National Heart, Lung and Blood Institute. Correspondence should be addressed to QH. We thank the editor, associate editor and reviewers for helpful comments.
Supplementary material
Supplementary material available at Biometrika online includes technical proofs, together with additional simulation results and real-data analysis.
References
- Asselbergs, F. W., Guo, Y., Van Iperen, E. P., Sivapalaratnam, S., Tragante, V., Lanktree, M. B., Lange, L. A., Almoguera, B., Appelman, Y. E., Barnard, J. et al. (2012). Large-scale gene-centric meta-analysis across 32 studies identifies multiple lipid loci. Am. J. Hum. Genet. 91, 823–38.
- Assmann, G. & Gotto, A. M. (2004). HDL cholesterol and protective factors in atherosclerosis. Circulation 109, III8–14.
- Avery, C. L., He, Q., North, K. E., Ambite, J. L., Boerwinkle, E., Fornage, M., Hindorff, L. A., Kooperberg, C., Meigs, J. B., Pankow, J. S. et al. (2011). A phenomics-based strategy identifies loci on APOC1, BRAP, and PLCG1 associated with metabolic syndrome phenotype domains. PLoS Genet. 7, e1002322.
- Bhansali, R., Giraitis, L. & Kokoszka, P. (2007). Convergence of quadratic forms with nonvanishing diagonal. Statist. Prob. Lett. 77, 726–34.
- Buas, M. F., He, Q., Johnson, L. G., Onstad, L., Levine, D. M., Thrift, A. P., Gharahkhani, P., Palles, C., Lagergren, J., Fitzgerald, R. C. et al. (2017). Germline variation in inflammation-related pathways and risk of Barrett’s oesophagus and oesophageal adenocarcinoma. Gut 66, 1739–47.
- Chen, L. S., Paul, D., Prentice, R. L. & Wang, P. (2011a). A regularized Hotelling’s $T^2$ test for pathway analysis in proteomic studies. J. Am. Statist. Assoc. 106, 1345–60.
- Chen, M., Cho, J. & Zhao, H. (2011b). Incorporating biological pathways via a Markov random field model in genome-wide association studies. PLoS Genet. 7, e1001353.
- Chen, S. X. & Qin, Y.-L. (2010). A two-sample test for high-dimensional data with applications to gene-set testing. Ann. Statist. 38, 808–35.
- Conneely, K. N. & Boehnke, M. (2007). So many correlated tests, so little time! Rapid adjustment of $P$ values for multiple correlated tests. Am. J. Hum. Genet. 81, 1158–68.
- Coviello, A. D., Haring, R., Wellons, M., Vaidya, D., Lehtimaki, T., Keildson, S., Lunetta, K. L., He, C., Fornage, M., Lagou, V. et al. (2012). A genome-wide association meta-analysis of circulating sex hormone–binding globulin reveals multiple loci implicated in sex steroid hormone regulation. PLoS Genet. 8, e1002805.
- Cui, H., Li, R. & Zhong, W. (2015). Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Am. Statist. Assoc. 110, 630–41.
- Dai, J. Y., Kooperberg, C., Leblanc, M. & Prentice, R. L. (2012). Two-stage testing procedures with independent filtering for genome-wide gene-environment interaction. Biometrika 99, 929–44.
- Demetz, E., Schroll, A., Auer, K., Heim, C., Patsch, J. R., Eller, P., Theurl, M., Theurl, I., Theurl, M., Seifert, M. et al. (2014). The arachidonic acid metabolome serves as a conserved regulator of cholesterol metabolism. Cell Metab. 20, 787–98.
- Dicker, L. H. (2014). Variance estimation in high-dimensional linear models. Biometrika 101, 269–84.
- Fan, J., Guo, S. & Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. J. R. Statist. Soc. B 74, 37–65.
- Fan, J., Liao, Y. & Yao, J. (2015). Power enhancement in high dimensional cross-sectional tests. Econometrica 83, 1497–541.
- Feller, W. (1971). Expansions in the case of varying components. In An Introduction to Probability Theory and Its Applications, vol. 2. New York: Wiley, pp. 546–8.
- Gregory, K. B., Carroll, R. J., Baladandayuthapani, V. & Lahiri, S. N. (2015). A two-sample test for equality of means in high dimension. J. Am. Statist. Assoc. 110, 837–49.
- He, Q., Zhang, H. H., Avery, C. L. & Lin, D. (2016). Sparse meta-analysis with high-dimensional data. Biostatistics 17, 205–20.
- International HapMap Consortium (2005). A haplotype map of the human genome. Nature 437, 1299–320.
- Li, J., Zhong, W., Li, R. & Wu, R. (2014). A fast algorithm for detecting gene–gene interactions in genome-wide association studies. Ann. Appl. Statist. 8, 2292–318.
- McKeague, I. W. & Qian, M. (2015). An adaptive resampling test for detecting the presence of significant predictors. J. Am. Statist. Assoc. 110, 1422–33.
- Morgenthaler, S. & Thilly, W. G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test. Mutat. Res. 615, 28–56.
- Shen, D., Shen, H. & Marron, J. S. (2016). A general framework for consistency of principal component analysis. J. Mach. Learn. Res. 17, 1–34.
- Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M. & Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93.
- Zhang, G. (2015). Genetic architecture of complex human traits: What have we learned from genome-wide association studies? Curr. Genet. Med. 3, 143–50.
- Zhong, H., Yang, X., Kaplan, L. M., Molony, C. & Schadt, E. E. (2010). Integrating pathway analysis and genetics of gene expression for genome-wide association studies. Am. J. Hum. Genet. 86, 581–91.