Summary
Set-based association tests are widely popular in genetic association settings for their ability to aggregate weak signals and reduce multiple testing burdens. In particular, a class of set-based tests including the Higher Criticism, Berk–Jones, and other statistics has recently been popularized for reaching a so-called detection boundary when signals are rare and weak. Such tests have been applied in two subtly different settings: (a) associating a genetic variant set with a single phenotype and (b) associating a single genetic variant with a phenotype set. A significant issue in practice is the choice of test, especially when deciding between innovated and generalized type methods for detection boundary tests. Conflicting guidance is present in the literature. This work describes how correlation structures generate marked differences in relative operating characteristics for settings (a) and (b). The implications for study design are significant. We also develop novel power bounds that facilitate the aforementioned calculations and allow for analysis of individual testing settings. In more concrete terms, our investigation is motivated by translational expression quantitative trait loci (eQTL) studies in lung cancer. These studies involve both testing for groups of variants associated with a single gene expression (multiple explanatory factors) and testing whether a single variant is associated with a group of gene expressions (multiple outcomes). Results are supported by a collection of simulation studies and illustrated through lung cancer eQTL examples.
Keywords: Detection boundary, Genetic association study, Multiple outcomes, Set-based inference, Sparse alternative
1. Introduction
In recent years, rich genetic compendiums quantifying a plethora of biological attributes have become increasingly popular in medical research. In addition to genetic information, many collections offer data on tens of thousands of phenotypic features as well. Publicly accessible examples include the Genotype-Tissue Expression Project (GTEx) (Battle and others, 2017) and the UK Biobank (Bycroft and others, 2018). Researchers have taken full advantage of these resources by performing massive numbers of association analyses under a variety of frameworks including genome-wide association studies (GWAS) (Lee and others, 2014) and phenome-wide association studies (PheWAS) (Denny and others, 2010).
It is common to carry out the aforementioned analyses using set-based inference strategies that group separate hypothesis tests under a single global null hypothesis. Such tests can aggregate rare and weak effects into a more detectable signal while naturally reducing the multiplicity burden, thus alleviating two of the main challenges in genetic association studies. The groupings are generally motivated by natural biological constructs. Two broad types of settings have commonly been considered: (a) the multiple explanatory variable setting that tests a set of genetic variants against a single phenotype (Wu and others, 2011) and (b) the multiple outcome setting that tests a single genetic variant for association with a set of phenotypes (Liu and Lin, 2019).
Previous work focused on dense alternatives (Liu and others, 2020) has shown that individual tests and dense set-based tests perform differently in the two settings. In particular, for the multiple explanatory variable setting (a), individual association tests based on low-order genotype principal components (PCs) have more power (where low order refers to the PCs paired with the largest eigenvalues), and the variance component set-based test can be thought of as upweighting low-order PC-based test statistics. For the multiple outcomes setting (b), higher-order PCs of the outcomes are more powerful in individual association tests, and the Wald and variance component tests upweight higher-order PC-based test statistics. However, individual association tests and dense alternative type tests may not be applicable in many genetics settings, where effects are often sparse (Barnett and others, 2017). Little is known about the relative performance of set-based tests for sparse alternatives settings in (a) and (b), which is the focus of this work.
One class of set-based tests for sparse alternatives has recently become popular (Hu and others, 2019; Gaynor and others, 2019; Harvey and others, 2020) in genetics studies for asymptotically reaching a so-called rare-weak detection boundary (Jager and Wellner, 2007). This class, which includes the Higher Criticism (HC), Berk–Jones (BJ), and other statistics (Moscovich-Eiger and others, 2016), can in a certain sense detect the weakest signals detectable by any statistical procedure under sparse alternatives (Donoho and Jin, 2004; Berk and Jones, 1979). Sparse alternatives are those where the number of signals is less than the square root of the set size; other settings are dense (Donoho and Jin, 2004). Extensions to account for correlated elements in a set (Barnett and others, 2017; Sun and Lin, 2020) and other related strategies continue to be proposed (Moscovich-Eiger and others, 2016; Chen and others, 2019), although the HC and BJ type ideas remain arguably the most well known.
A significant issue in practice is the choice of which specific detection boundary test to apply for settings (a) and (b). When working in genetic association studies, it is necessary to accommodate varying degrees of correlation between elements in a set, and different modifications to the HC and BJ are available for this purpose. Two options are to adapt the test statistics directly to create the generalized HC (GHC) and BJ (GBJ) tests (Barnett and others, 2017; Sun and Lin, 2020) or to decorrelate the elements of a set before applying standard HC and BJ (Hall and Jin, 2010). The latter strategy is known as the innovated HC (iHC) or BJ (iBJ). It is of great interest to investigate how the two approaches behave in settings (a) and (b). Conflicting guidance is given about these methods in the literature (Hall and Jin, 2010; Barnett and others, 2017), and precise comparisons are challenged by the computational burden of performing power calculations in realistic settings.
To ground the discussion in more concrete terms, we briefly introduce our motivating example of lung cancer expression quantitative trait loci (eQTL) studies in Figure 1. The genetic etiology of lung cancer has been deeply probed in GWAS involving tens of thousands of subjects (McKay and others, 2017). It is of pivotal importance to study the biological mechanisms linking GWAS-identified variants to disease risk (Bossé and Amos, 2018). eQTL analysis is one of the most popular methods to explore these underlying biological mechanisms (Liu and others, 2021). As sample sizes in eQTL data sets like GTEx (GTEx Consortium and others, 2020) are much smaller than in GWAS, for example, often with only hundreds of subjects, set-based tests for sparse alternatives are an ideal choice for inference.
Fig. 1.
Examples of common lung cancer eQTL analyses testing both (a) multiple explanatory factors and (b) multiple outcomes. In (a), many disease-associated Single Nucleotide Polymorphisms (SNPs) at a risk locus are tested for association with the expression of one nearby gene, RAD52. Each dot is one SNP, the y-axis is the level of association between the SNP and RAD52 expression in blood, and the color represents the SNP’s association with lung cancer. Significant expression associations, as observed here, provide evidence that variants at the locus affect risk of lung cancer by regulating the level of RAD52. More specific follow-up studies can then be performed. In (b), the single previously identified risk SNP rs7705526 located on chromosome 5 at position 1285974 is tested for association with the expression of many nearby genes surrounding the TERT risk locus. Each line is one gene, and the y-axis is level of association with the SNP. Significant associations, which are not observed here, would provide evidence that the variant possesses functions related to regulating the expression levels of genes surrounding the TERT risk locus.
A common eQTL multiple explanatory factor analysis is to group many correlated disease-associated SNPs at a risk locus and then test the association of the set against the expression values of nearby genes (McKay and others, 2017), as in Figure 1(a). Significant associations attribute a specific disease mechanism to the risk locus, enabling more targeted follow-up studies. A complementary multiple outcomes analysis is to test an individual SNP against the expression levels of many genes located around a genomic region of interest (McKay and others, 2017), as in Figure 1(b). Strong associations demonstrate the functional behavior of a SNP, which is crucial for identifying causal variants and further understanding mechanisms of disease.
The two primary goals of this article are to analytically study finite sample properties of the innovated and generalized methods in settings (a) and (b) and to provide tools for the individualized design of set-based genetic association studies. A key theme is that within-set correlation influences the choice of test differently depending on the type of multiplicity—whether in the explanatory factors or the outcome. Insights into test performance are greatly facilitated by the development of upper and lower bounds on the exact power; power bounds unlock analytical calculations and comparisons in realistic testing scenarios.
Specifically, we show that in the multiple explanatory factor setting (a), the correlation structure often leads to signal weights that moderately favor generalized tests compared to innovated tests. However, in the multiple outcomes setting (b), innovated tests can benefit from even larger advantages in signal configuration, depending on the specific within-set correlation structure. A simulation study in the setting of lung cancer eQTL analysis is used to confirm analytical calculations. Set-based tests mimicking the aforementioned translational lung cancer studies are then applied in GTEx.
The remainder of the article is organized as follows. In Section 2, we first define our notation and review the generalized and innovated test statistics. In Section 3, we develop upper and lower bounds on the finite sample power of detection boundary tests, facilitating numerical studies of performance. Section 4 demonstrates how the properties of the different tests can vary significantly depending on the type of multiplicity and the within-set correlation structure. Section 5 demonstrates how to utilize results from the previous two sections in realistic study settings, and Section 6 confirms calculations with a simulation study. In Section 7, we perform testing in the GTEx dataset to elucidate how GWAS-selected risk SNPs regulate gene expression activity. We conclude with a discussion in Section 8.
2. Methods
2.1. Model and notation for multiple explanatory factors
We first develop our notation by reviewing a standard framework for genetic association studies with a set of multiple genetic variants and a single outcome. Suppose for each of
subjects we observe a genotype vector
of
SNPs that belong to a set with biological relevance, for instance, a gene. For clarity of presentation, assume that the genotypes have been centered and standardized so that
and
; the nonstandardized case follows in a similar manner, albeit with additional notational complexity. Suppose that for each subject, we also observe a scalar outcome
and a
-dimensional vector
of additional covariates.
A standard model for this situation is
![]() |
(2.1) |
where
and
are the fixed effects of the non-SNP and SNP covariates, respectively. Commonly, a Gaussian distributional assumption
is also applied. Both the generalized and innovated approaches are designed to test the global null hypothesis of no SNP effects
. Under this global null, a marginal score statistic for associating the
th SNP with the outcome is
![]() |
where
,
is the
identity matrix,
,
,
is a consistent estimator for
, and
. Let
be the vector of test statistics for each genotype in the set. If no additional covariates are in the model, for example,
where
, then
follows
![]() |
(2.2) |
where
is the empirical correlation matrix of the genotypes. Note that poor estimates of
, such as estimates generated when incorrectly assuming the null hypothesis, often lead to only negligible differences from (2.2) and are mostly immaterial to our main conclusions. We can see that
enters into both the mean and covariance of
.
The exposition assuming no additional covariates is used for both simplicity of presentation and because it allows for translation to the summary statistic setting, where individual-level data are not provided for subjects
and only test statistics
are available. In such a setting, we can assume (2.2) and estimate
with reference genotype panels, so that the distribution of the test statistics is still known and upcoming results are still applicable. Even if summary statistics are created with non-SNP covariates, (2.2) can still be a good approximation if the additional covariates are approximately independent of the genotypes (Sun and Lin, 2020). Exact calculations for the model with individual-level data and non-SNP covariates can be performed with straightforward adjustments.
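As a rough illustration of the no-covariate setting, the sketch below computes marginal score statistics from centered and standardized genotypes, together with the empirical genotype correlation matrix that a reference panel would stand in for in the summary statistic setting. The statistic form used here (inner product of each standardized genotype with the centered outcome, divided by the estimated residual standard deviation and the square root of the sample size) is one common form and is an assumption; it is not copied from the displayed formulas. The simulated data and all function and variable names are likewise hypothetical.

```python
import numpy as np

def standardize_genotypes(G):
    """Center each SNP column and scale it so its sum of squares equals n."""
    G = np.asarray(G, dtype=float)
    n = G.shape[0]
    Gc = G - G.mean(axis=0)
    return Gc * np.sqrt(n / np.sum(Gc ** 2, axis=0))

def marginal_score_stats(G_std, y):
    """Assumed score statistics when the only covariate is an intercept."""
    n = G_std.shape[0]
    resid = y - y.mean()                      # residuals under the global null
    sigma_hat = np.sqrt(np.sum(resid ** 2) / (n - 1))
    return G_std.T @ resid / (sigma_hat * np.sqrt(n))

rng = np.random.default_rng(0)
n, d = 400, 5
G = rng.binomial(2, 0.3, size=(n, d))         # hypothetical independent SNPs, MAF 0.3
beta = np.array([0.25, 0.0, 0.0, 0.0, 0.0])   # one nonzero effect, for illustration
y = G @ beta + rng.standard_normal(n)

G_std = standardize_genotypes(G)
Z = marginal_score_stats(G_std, y)            # one statistic per SNP
Sigma_hat = G_std.T @ G_std / n               # empirical genotype correlation matrix
```

With standardized genotypes, the diagonal of `Sigma_hat` is exactly 1, which is why the same matrix can serve as the correlation matrix of the test statistics.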
2.2. Model and notation for multiple outcomes
When testing in the multiple outcomes setting, assume for each subject
we observe only the genotype at a single variant
and a set of
related outcomes
, for instance a set of gene expression values for genes surrounding a risk locus. Assume that the SNP has again been centered and scaled as previously described. We then have the
models
![]() |
(2.3) |
where
is again the vector of other covariates, and we suppose for simplicity it remains the same for all
. Generally, to account for correlation among the
outcomes, it is assumed that
follows
where
is the correlation matrix of the outcomes,
, and
is the standard deviation of outcome
.
Both the generalized and innovated methods aim to test the set-based null hypothesis
, where
, and we can construct marginal score statistics for each
as
![]() |
where
,
is a consistent estimator of
, and
. Then when there are no covariates other than
,
follows
![]() |
(2.4) |
We will again proceed with the no additional covariates approach so that results are applicable to summary statistics assumed to have the distribution in (2.4); the extensions are again readily accessible.
2.3. Generalized and innovated tests
For the sake of convenience, we briefly review the HC and BJ. Use
to denote test statistics of dimension
and assume
, where the diagonal elements of
are 1. Let
denote the survival function of a standard normal random variable. Further define
to be the number of test statistics with a magnitude greater than or equal to some fixed threshold
. For fixed
, and if
, then clearly 
Binomial
. The HC statistic is
and is the maximum of a centered and standardized version of
(Donoho and Jin, 2004). The BJ statistic is
and can be viewed as the maximum of a likelihood ratio test on the mean parameter of
(Berk and Jones, 1979).
The GHC replaces the denominator in the HC with the correct variance of
when the correlation matrix of
is
. GBJ similarly adjusts the likelihood ratio quantity to explicitly account for overdispersion in
caused by correlated elements of
(Barnett and others, 2017; Sun and Lin, 2020). In contrast, innovated tests iHC and iBJ transform the
to be independent, for example using a Cholesky decomposition of
. The standard HC or BJ can then be applied to the transformed
(Hall and Jin, 2010).
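The following is a minimal sketch of the two ingredients described above: a Higher Criticism statistic computed as the maximum of a centered and standardized exceedance count, and the Cholesky-based decorrelation used by innovated tests. It assumes the supremum is taken over thresholds equal to the observed magnitudes with no restriction on the threshold range, and that there are no ties and at least one nonzero statistic; published versions often restrict the range. The function names are hypothetical, and this is not the GHC or GBJ, which additionally adjust for correlation.

```python
import numpy as np
from scipy.stats import norm

def higher_criticism(z):
    """Standard HC for independent statistics, maximized over observed magnitudes."""
    z = np.asarray(z, dtype=float)
    d = z.size
    t = np.sort(np.abs(z))[::-1]       # thresholds: observed magnitudes, largest first
    s = np.arange(1, d + 1)            # exceedance count at the k-th largest magnitude
    p = 2.0 * norm.sf(t)               # null probability that one magnitude exceeds t
    keep = (p > 0.0) & (p < 1.0)       # drop thresholds where the variance is zero
    return float(np.max((s[keep] - d * p[keep]) / np.sqrt(d * p[keep] * (1.0 - p[keep]))))

def cholesky_decorrelate(z, sigma):
    """Transform statistics with correlation sigma to have identity covariance."""
    L = np.linalg.cholesky(sigma)      # sigma = L @ L.T with L lower triangular
    return np.linalg.solve(L, np.asarray(z, dtype=float))

sigma = np.array([[1.0, 0.5],
                  [0.5, 1.0]])
z = np.array([2.5, 0.4])
ihc = higher_criticism(cholesky_decorrelate(z, sigma))   # innovated HC, in sketch form
```

The design choice in the innovated approach is visible here: the correlation is removed from the data first, so the unmodified statistic can be reused.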
3. Power analysis of generalized and innovated tests
A direct approach to analyze the performance of generalized and innovated detection boundary tests is to perform power calculations across different settings. Unfortunately, the computational burden of existing approaches is extremely high, and it is generally impractical to perform such calculations (Sun and Lin, 2020). Thus, it has been difficult to make precise quantitative statements about performance except when comparing simulation results (Barnett and others, 2017).
The goal of this section is to introduce upper and lower bounds on the exact power that are simpler to compute, allow for relative comparisons of tests without performing complicated simulations, and can be used to design practical genetic association studies. Continuing with the setting of Section 2.3, let
be the vector that results from applying the absolute value operator to each element of
. When testing the set-based null with
, the rejection regions of tests we discuss can be written as
![]() |
(3.5) |
where
are the order statistics of
and
are the
boundary constants that depend on the test statistic
(e.g., GBJ or iHC), the level of the test
, and the correlation structure
. The different values of
used for each test lead to differences in finite sample performance; their calculation has previously been explored in detail by multiple authors (Moscovich-Eiger and others, 2016; Sun and Lin, 2020). In the following, discussion of rejection regions
and boundary constants
will suppress the dependency on test parameters to reduce the notational burden.
We begin by discussing lower bounds. From (3.5), exact calculation of
when
requires
computations of a
-dimensional multivariate normal integral (Sun and Lin, 2020). The computation can be vastly simplified by working with simpler subsets of the rejection region. For example, one potential subset is
![]() |
(3.6) |
Clearly
, and the probability of
can be calculated as
![]() |
(3.7) |
(see Appendix A of the Supplementary material available at Biostatistics online). Intuitively, the probability of
can capture a large portion of the power because if there is an associated feature, we would expect its corresponding test statistic to have the largest magnitude, and so most rejections of the null would happen when this magnitude exceeds
. The difference between the exact power and the probability of
is
.
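The subset idea behind the lower bound can be checked by simulation. The sketch below assumes the rejection region has the form "some ordered magnitude crosses its matching boundary constant", and takes the simplest subset to be the event that the largest magnitude crosses the largest constant. Both forms are inferred from the surrounding description rather than copied from (3.5) and (3.6). The boundary constants are hypothetical placeholders that are not calibrated to any test or level, and the function names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def mc_power_and_lower_bound(mu, sigma, bounds, n_draws=200_000, seed=0):
    """Monte Carlo estimates of (power, lower bound) for ascending boundary constants."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(mu, sigma, size=n_draws)
    abs_sorted = np.sort(np.abs(z), axis=1)             # ascending ordered magnitudes
    reject = np.any(abs_sorted >= bounds, axis=1)       # assumed full rejection region
    largest_crosses = abs_sorted[:, -1] >= bounds[-1]   # assumed simplest subset
    return reject.mean(), largest_crosses.mean()

def lower_bound_independent(mu, b_top):
    """Closed form for the subset probability when the statistics are independent."""
    inside = norm.cdf(b_top - mu) - norm.cdf(-b_top - mu)
    return 1.0 - np.prod(inside)

mu = np.array([3.0, 0.0, 0.0, 0.0, 0.0])        # a single signal
bounds = np.array([1.2, 1.6, 2.0, 2.5, 3.0])    # hypothetical, uncalibrated constants
power_mc, lower_mc = mc_power_and_lower_bound(mu, np.eye(5), bounds)
lower_exact = lower_bound_independent(mu, bounds[-1])
```

Because the subset event is evaluated on the same draws as the full region, the estimated lower bound can never exceed the estimated power.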
It is possible to make the lower bound sharper by extending
to fill more and more of
, at the cost of escalating the computational burden. For example, a logical next step to tighten the bound is to use
![]() |
(3.8) |
which is also clearly a subset of
. The probability of
can be calculated by
![]() |
(3.9) |
(see Appendix B of the Supplementary material available at Biostatistics online). Similar reasoning can be used to calculate an
that adds events related to
,
that adds events related to
, and so on, all the way up to the exact rejection region
. For the purposes of choosing between innovated and generalized tests, we will show that
is already quite informative.
Along with a lower bound, it is obviously also useful to know an upper bound on the power. Similar to the lower bound case, one way to calculate upper bounds on power is to use subsets of
, which can be written as
![]() |
One possible subset of
is then given by
![]() |
(3.10) |
and we can write
![]() |
(3.11) |
The difference between this upper bound and the exact power is
![]() |
Similarly to the lower bound case, a tighter bound is given by
![]() |
(3.12) |
and we can calculate
![]() |
(3.13) |
Again, the bound can be lowered until it becomes exactly
, at the cost of increasing computational burden.
Calculation of the bounds and exact power when using iBJ and GBJ at
is illustrated in Figure 2 with example sets of five elements. Data are generated under models (2.1) and (2.3) for the multiple explanatory factors and multiple outcomes settings, respectively, with no non-SNP covariates. In the high correlation setting (bottom), each off-diagonal element of the within-set correlation matrix is 0.5 greater than the corresponding value in the low correlation setting (top). Pre-standardization, the regression coefficient vector has one nonzero element equal to 0.25. There are
subjects, and the variance of each outcome is equal to 1.
Fig. 2.
Lower bound (using
), exact calculation, and upper bound (using
) for power of iBJ and GBJ in the multiple explanatory factors (left) and multiple outcomes (right) settings when there are five elements in a set and the correlation between them is low (top) or high (bottom). In the low correlation matrix, the first block of two elements has pairwise correlation 0.3, the other block of three elements also has all mutual pairwise correlations equal to 0.3, and the correlation between elements of the two blocks is 0.1. In the high correlation matrix, each of the aforementioned off-diagonal values is increased by 0.5. Only one element of the regression coefficients vectors is nonzero, and the location of the nonzero element is given on the x-axis. All SNPs have minor allele frequency of 0.3. Testing is performed at the 0.01 level.
This example is deliberately kept simple for illustrative purposes, but we do see that the bounds are quite accurate when the set size is small; larger examples will also be provided. Furthermore, in the multiple explanatory variable setting (Figures 2(a) and (c)), the GBJ has more power than iBJ. In contrast, in the multiple outcome setting (Figures 2(b) and (d)), the tests have similar power when within-set correlation is low (Figure 2(b)), while iBJ has much more power when within-set correlation is high (Figure 2(d)). In general, the proposed bounds can be used to calculate power under a variety of alternatives prior to performing analysis. Software to perform these calculations is publicly available in the DBpower package.
4. Finite sample operating characteristics
This section analytically describes how within-set correlation affects test signal strengths and sparsity in opposite directions depending on the type of multiplicity.
4.1. Correlation when testing multiple outcomes
We begin by explaining how large power discrepancies between generalized and innovated tests can occur when correlation among outcomes is high, as observed in Figure 2(d). Following the notation in Section 2.2, GHC and GBJ are applied directly to
. The innovated tests must first decorrelate
. Let
. Here,
is a diagonal matrix holding the eigenvalues of
, sorted so that the largest value is in the first row.
is an orthogonal matrix where the
th column is the eigenvector corresponding to the
th largest eigenvalue. Decorrelated test statistics
are then approximately
![]() |
Note that
are also the PCs of
, so the innovated tests can be seen as applications of HC and BJ to the PC scores of the marginal test statistics.
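A small numerical sketch of the eigendecomposition-based decorrelation follows. It assumes the transformation multiplies the test statistics by the transposed eigenvector matrix and then divides each entry by the square root of its eigenvalue, which is the usual PC-score scaling; the displayed formula is not reproduced here, so this form and the names used are assumptions.

```python
import numpy as np

def eigen_sorted(sigma):
    """Eigenvalues in decreasing order, with matching eigenvector columns."""
    evals, evecs = np.linalg.eigh(sigma)       # eigh returns ascending eigenvalues
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

def innovate(z, sigma):
    """Assumed innovated transform: rotate onto eigenvectors, then rescale."""
    lam, U = eigen_sorted(sigma)
    return (U.T @ np.asarray(z, dtype=float)) / np.sqrt(lam)

sigma = np.array([[1.0, 0.8, 0.6],
                  [0.8, 1.0, 0.6],
                  [0.6, 0.6, 1.0]])            # hypothetical outcome correlation matrix
lam, U = eigen_sorted(sigma)
A = np.diag(lam ** -0.5) @ U.T                 # the transformation as a matrix
cov_after = A @ sigma @ A.T                    # covariance of the transformed statistics
```

Here `cov_after` is the identity up to rounding, which is the property that lets the standard HC and BJ be applied afterwards.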
4.2. Signal weighting for multiple outcomes
Clearly the means of
and
differ by
. The term
can be thought of as signal weights in the innovated setting. If the smallest eigenvalues are very small, the elements of
can dominate other contributions to the signal. In fact, the elements of
are unbounded.
Consider the first signal location setting of Figure 2(d). When using iBJ, the largest weight is
, and the largest value of
is 4.18, while the largest value of
is 3.24. Suppose we are calculating an upper bound on GBJ power as in (3.10) and (3.11). The first term in (3.11) will be
, and thus an upper bound is less than 0.56. Then suppose we are calculating a lower bound on iBJ as described in (3.6) and (3.7). The last term in (3.7) is
, and thus a lower bound is at least 0.78; here, even moderate values in
affect
enough to generate large power differences.
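To make the size of the signal weights concrete, the sketch below builds the high correlation matrix described in the Figure 2 caption (blocks of two and three elements with pairwise correlation 0.8 and between-block correlation 0.6) and computes the inverse square roots of its eigenvalues. Treating those inverse square roots as the weights is an assumption based on the surrounding description, and the variable names are illustrative.

```python
import numpy as np

# High correlation matrix from the Figure 2 caption: low values 0.3 and 0.1 plus 0.5
sigma = np.full((5, 5), 0.6)       # between-block correlation
sigma[:2, :2] = 0.8                # first block of two elements
sigma[2:, 2:] = 0.8                # second block of three elements
np.fill_diagonal(sigma, 1.0)

evals = np.linalg.eigvalsh(sigma)  # eigenvalues in ascending order
weights = 1.0 / np.sqrt(evals)     # assumed signal weights for innovated tests
largest_weight = weights.max()     # attached to the smallest eigenvalue
```

The smallest eigenvalue of this matrix is 0.2, so the largest weight is the inverse square root of 0.2; a matrix with a smaller minimum eigenvalue would give a correspondingly larger weight.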
More general statements about signal weights can also be made. While it is not possible to characterize all arbitrary correlation structures, we consider some practically useful situations here (see Appendix C of the Supplementary material available at Biostatistics online).
Remark 1: One upper bound on the smallest eigenvalue is
where
is the
element of
.
Remark 2: A lower bound on the largest eigenvalue is
, so a second upper bound on the smallest eigenvalue is
.
Thus, when there is a large amount of within-set correlation or a subset of highly correlated elements, there will be larger signal weights, and innovated tests will often be preferred over generalized tests in multiple outcomes settings. For more specific circumstances, the example of Figure 2(d) succinctly illustrates the utility of analysis with the proposed power bounds.
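Two standard linear algebra facts give bounds of the kind discussed in the remarks, and they are easy to check numerically. The sketch below uses (i) eigenvalue interlacing with a two-by-two principal submatrix, which bounds the smallest eigenvalue by one minus the largest absolute off-diagonal correlation, and (ii) the Rayleigh quotient at the vector of ones together with the trace, which bounds the smallest eigenvalue through the average row sum. Whether these coincide exactly with Remarks 1 and 2 is an assumption, since the expressions in the remarks are not reproduced here; the function name and example matrix are hypothetical.

```python
import numpy as np

def smallest_eig_upper_bounds(sigma):
    """Two upper bounds on the smallest eigenvalue of a correlation matrix."""
    d = sigma.shape[0]
    off_diag = sigma - np.diag(np.diag(sigma))
    bound_pair = 1.0 - np.max(np.abs(off_diag))      # interlacing with a 2x2 submatrix
    lam_max_lower = sigma.sum() / d                  # Rayleigh quotient at the ones vector
    bound_trace = (d - lam_max_lower) / (d - 1)      # eigenvalues sum to the trace, d
    return bound_pair, bound_trace

# Hypothetical set: weak correlation overall, with one highly correlated pair
sigma = np.full((4, 4), 0.2)
sigma[0, 1] = sigma[1, 0] = 0.9
np.fill_diagonal(sigma, 1.0)

lam_min = np.linalg.eigvalsh(sigma).min()
bound_pair, bound_trace = smallest_eig_upper_bounds(sigma)
```

In this example the pairwise bound is driven entirely by the single highly correlated pair, which illustrates how a small subset of elements can force a small eigenvalue.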
4.3. Signal sparsity for multiple outcomes
The other difference between
and
is the presence of
in
. The product
can both increase and decrease individual elements of
, although since it is a rotation matrix,
, where
denotes Euclidean norm. However, one consistently power-increasing feature of
is its ability to, roughly speaking, spread the signal, so that zero means in
become nonzero in
.
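The spreading effect of the rotation can be seen in a small example. The sketch below rotates a mean vector with a single nonzero entry by the transposed eigenvector matrix of a hypothetical three-by-three correlation matrix with distinct eigenvalues, then compares the number of nonzero entries and the Euclidean norm before and after. The matrix, the mean vector, and the variable names are assumptions chosen for illustration.

```python
import numpy as np

sigma = np.array([[1.0, 0.8, 0.6],
                  [0.8, 1.0, 0.6],
                  [0.6, 0.6, 1.0]])     # hypothetical correlation, distinct eigenvalues
evals, U = np.linalg.eigh(sigma)        # columns of U are orthonormal eigenvectors

mu = np.array([2.0, 0.0, 0.0])          # one nonzero mean before rotation
rotated = U.T @ mu                      # means after rotation

n_nonzero_before = int(np.sum(np.abs(mu) > 1e-10))
n_nonzero_after = int(np.sum(np.abs(rotated) > 1e-10))
norm_before = np.linalg.norm(mu)
norm_after = np.linalg.norm(rotated)
```

The norm is unchanged while one nonzero mean becomes three. When eigenvalues are repeated, the eigenvectors within the shared eigenspace are not unique, so the count of nonzero rotated means can depend on the basis returned by the numerical routine.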
For instance, continuing with our example from Section 4.2, three nonzero means in
compared to one in
create an increase in signal density that naturally leads to increases in power. More precisely, using terms from (3.7) and (3.13), we can calculate the probability of rejecting the null even if the element of
with the largest mean results in the largest observed magnitude but does not cross the largest bound:
![]() |
For the generalized statistics
the corresponding probability is 0.001, demonstrating the increase in power that is generated by additional nonzero means.
4.4. Correlation when testing multiple explanatory factors
In the setting of Section 2.1, an innovated test for multiple explanatory factors must first decorrelate
. Write the previously defined eigendecomposition of
as
. Under the alternative, the innovated test statistics approximately possess a mean of
![]() |
In contrast,
, so the weights
are effectively squared compared to the innovated tests. The difference will be more drastic for large eigenvalues.
4.5. Signal weighting and sparsity with multiple explanatory factors
While the largest weights in the generalized setting are effectively squared compared to the innovated setting, this advantage is bounded. The sum of the eigenvalues of
equals
, the trace of the matrix, so there is an upper limit on
, whereas there is no upper limit on
.
For more general statements, there are again a variety of bounds on large eigenvalues, including the one referred to in Remark 2 and two additional ones of particular interest.
Remark 3: For a positive matrix, a lower bound on the maximal eigenvalue is the minimal row sum for any row in
,
, where
is the
element of
. An upper bound on the maximal eigenvalue is the maximal row sum for any row in
.
Remark 4: It follows from Remark 3 that there are at most
eigenvalues greater than 1 for a positive matrix
, where
is the floor operator.
We can see that the generalized test advantages in the largest signal weights will increase as the within-set correlation increases. However, as the within-set correlation increases, the number of eigenvalues larger than 1 falls, and so there will be fewer weights that are larger when squared.
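The row sum bounds of Remark 3 and the drop in the number of eigenvalues above 1 can both be checked on the two correlation matrices described in the Figure 2 caption. The sketch below is illustrative only; the helper names are hypothetical.

```python
import numpy as np

def two_block_corr(within, between, sizes=(2, 3)):
    """Two exchangeable blocks with a common within and between correlation."""
    d = sum(sizes)
    sigma = np.full((d, d), between)
    sigma[:sizes[0], :sizes[0]] = within
    sigma[sizes[0]:, sizes[0]:] = within
    np.fill_diagonal(sigma, 1.0)
    return sigma

def summarize(sigma):
    """Row sum bounds, largest eigenvalue, and count of eigenvalues above 1."""
    evals = np.linalg.eigvalsh(sigma)
    row_sums = sigma.sum(axis=1)
    return {
        "min_row_sum": row_sums.min(),
        "max_eig": evals.max(),
        "max_row_sum": row_sums.max(),
        "n_above_one": int(np.sum(evals > 1.0 + 1e-10)),
    }

low = summarize(two_block_corr(0.3, 0.1))    # low correlation setting of Figure 2
high = summarize(two_block_corr(0.8, 0.6))   # high correlation setting of Figure 2
```

In both settings the largest eigenvalue lies between the smallest and largest row sums. Two eigenvalues exceed 1 in the low correlation matrix, and only one does in the high correlation matrix, even though that one is much larger.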
Also, in the multiple explanatory factors setting there is no stark contrast in signal sparsity between innovated and noninnovated tests, as the eigenvectors
multiply
in the mean parameters of both
and
. Thus, in the multiple explanatory factors setting, generalized tests have only the weighting advantage and not the signal density advantage. The generalized tests also see their advantage decreased slightly due to changes in rejection regions, which we describe in more detail in Appendix D of the Supplementary material available at Biostatistics online.
5. Translation to practical set parameters
To translate the above discussion to realistic testing situations, we investigate how larger correlation structures lead to specific eigenvalues. We will model specific correlation structures with block matrices of the form
![]() |
(5.14) |
where
,
, and
. This structure corresponds to two correlated clusters of features within a set, for example, two linkage disequilibrium blocks at a risk locus. We let the two clusters possess exchangeable correlation structures within themselves at
and
, and the correlation between the two clusters is
.
Straightforward calculations (Liu and Lin, 2019) show there are up to four distinct eigenvalues taking the values
, and
, where
and
We can see that if
and
increase, there can be eigenvalues very close to 0, and the largest eigenvalue also increases. We can also have an eigenvalue close to 0 even when
and
are small if
grows. A simplification of the model is
, in which case there is exchangeable correlation throughout the entire set. In this setting there are only two distinct eigenvalues,
and
. When the common correlation is large, the smallest eigenvalue can be close to 0, and the largest eigenvalue can be close to the size of the set.
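The eigenvalue structure of the two-block model can be verified numerically. The sketch below builds a two-block correlation matrix and compares its eigenvalues with closed-form values obtained by reducing the problem to within-block contrasts plus a two-by-two matrix acting on the block means. The parameter names (`d1`, `d2` for block sizes, `rho1`, `rho2` for within-block correlations, `rho3` for the between-block correlation) and the closed-form expressions are assumptions derived for this illustration; they may be written differently in (5.14) and the accompanying text. Both blocks are assumed to contain at least two elements.

```python
import numpy as np

def block_corr(d1, d2, rho1, rho2, rho3):
    """Correlation matrix with two exchangeable blocks and cross correlation rho3."""
    top = np.hstack([np.full((d1, d1), rho1), np.full((d1, d2), rho3)])
    bottom = np.hstack([np.full((d2, d1), rho3), np.full((d2, d2), rho2)])
    sigma = np.vstack([top, bottom])
    np.fill_diagonal(sigma, 1.0)
    return sigma

def block_distinct_eigs(d1, d2, rho1, rho2, rho3):
    """Closed-form distinct eigenvalues (derived here, assuming d1, d2 >= 2)."""
    a = 1.0 + (d1 - 1) * rho1                 # row sum within the first block
    b = 1.0 + (d2 - 1) * rho2                 # row sum within the second block
    disc = np.sqrt((a - b) ** 2 + 4.0 * d1 * d2 * rho3 ** 2)
    return np.array([1.0 - rho1, 1.0 - rho2, (a + b - disc) / 2.0, (a + b + disc) / 2.0])

# Hypothetical parameters: 40 elements with a block of four, as in the simulations
numeric = np.linalg.eigvalsh(block_corr(4, 36, 0.3, 0.3, 0.1))
closed = block_distinct_eigs(4, 36, 0.3, 0.3, 0.1)

# Exchangeable special case: only 1 - rho and 1 + (d - 1) * rho appear
exch = np.linalg.eigvalsh(block_corr(4, 36, 0.5, 0.5, 0.5))
```

With a common correlation of 0.5 and 40 elements, the two distinct eigenvalues are 0.5 and 20.5, which shows how quickly the largest eigenvalue approaches the set size.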
Figures 3(a) and (b) show calculations of power bounds for both the multiple explanatory factors and multiple outcomes settings when
and
is varied. Figures 3(c) and (d) show the exchangeable correlation case. We see that, as expected, the generalized tests appear to outperform the innovated tests slightly in the multiple explanatory factors setting. The outperformance of the innovated tests as correlation increases in the multiple outcomes setting is more dramatic, also as expected.
Fig. 3.
Lower and upper bounds on power (using
and
) for iBJ and GBJ in the multiple explanatory factors (left) and multiple outcomes (right) settings when there are 40 elements in a set. In the top row, the correlation structure is given by (5.14) with
, and
is varied among all values that admit a positive definite matrix. In the bottom row, the correlation structure is exchangeable with
, and the level of correlation is varied. Note that the lower bounds for innovated tests can rise above upper bounds for generalized tests in the multiple outcomes setting.
6. Simulation
We next provide empirical support for the previous analytical arguments by conducting simulations designed to mimic the setting of our GTEx lung cancer data analysis. In these simulations, we consider sparse signal settings and primarily compare the GHC and GBJ against the iHC and iBJ. A variance component sequence kernel association test (SKAT) as well as the aggregated Cauchy association test (ACAT) (Liu and others, 2019) are also presented as reference points, but the main goal of this work is to illustrate the differences between innovated and generalized tests, not to repeat comparisons of detection boundary tests against other strategies (Barnett and others, 2017). SKAT is known to perform better in dense situations, while ACAT has been shown to reach the detection boundary only in parts of the sparse regime. Simulations are conducted with 400 subjects and 40 elements per set to approximate the GTEx analysis.
The data are generated without standardization from models (2.1) and (2.3), where
is composed of a standard normal random variable and a Bernoulli random variable with mean 0.5. The effect sizes on the nongenetic covariates are set at 1 always. The first four elements of
and
are always generated from a Uniform(0, 0.25) or Uniform(0, 0.4) distribution, respectively, and all other elements are 0. Thus, there are four true causal SNPs out of the 40 total SNPs in model (2.1), and the individual SNP affects four out of the 40 outcomes in model (2.3). In the multiple explanatory factors setting, we generate the genotypes
using the block correlation structure given in (5.14), where the
block corresponds to the four signal variants. In the multiple outcomes setting,
is generated with this same correlation structure, and
corresponds to the correlation among the four signal outcomes. Minor allele frequencies are held constant at 0.3, and all testing is performed at
.
Figures 4(a) and (b) present smoothed power under the same setting as Figures 3(a) and (b), with
and
varying. It was demonstrated in Section 5 that
increasing to larger values would decrease the eigenvalues and sharply increase signal weights in the multiple outcomes setting, with power bounds analysis predicting power close to 1 when
is larger. Indeed, we do generally see such behavior in Figure 4(b) (see Appendix E of Supplementary material available at Biostatistics online). In the multiple explanatory factors setting of Figure 4(a), the empirical powers mostly mirror the trends of the lower bounds, with GBJ performing slightly better. Figures 4(c) and (d) present smoothed power under the same exchangeable correlation setting as Figures 3(c) and (d). From Section 5, we again know that generalized tests should possess an advantage in the multiple explanatory factors setting, but innovated tests should outperform by even more as correlation increases in the multiple outcomes setting. Both trends are observed as expected.
Fig. 4.
Power simulations with set-based tests under the same settings as Figure 3. In the multiple explanatory factors setting (left), four causal SNPs have an exchangeable correlation structure with
, and the noncausal SNPs are similarly correlated with
. In the multiple outcomes setting (right), the SNP has an effect on only four outcomes, and the associated and unassociated outcomes form two blocks with a correlation structure that is equivalent to the multiple explanatory factors correlation structure. VC refers to the variance component SKAT test. We conduct 500 simulations at each 0.01 increment of
and smooth the empirical power curve.
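As a rough guide to how such power curves can be computed, the sketch below estimates empirical power as the rejection proportion at one grid point and then smooths the resulting curve. The centered moving average is only a stand-in, since the smoother used for Figure 4 is not specified here, and the function names are hypothetical.

```python
import math


def empirical_power(p_values, alpha):
    """Return the rejection proportion and its Monte Carlo standard error."""
    n_sims = len(p_values)
    power = sum(1 for p in p_values if p <= alpha) / n_sims
    std_error = math.sqrt(power * (1.0 - power) / n_sims)
    return power, std_error


def moving_average(values, half_window=2):
    """Smooth a curve with a centered moving average (an assumed smoother).

    Windows are truncated at the two ends of the curve.
    """
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half_window)
        hi = min(len(values), i + half_window + 1)
        window = values[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

With 500 simulations per grid point, the Monte Carlo standard error of a single power estimate is at most $\sqrt{0.25/500} \approx 0.022$, which is one reason the raw curves benefit from smoothing across neighboring grid points.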
By comparison, SKAT demonstrates excellent power in high correlation settings when testing both multiple explanatory factors and multiple outcomes, but it can also show poor performance in sparse, low correlation situations such as part of Figure 4(a). The underperformance in certain sparse settings is a well-known weakness of SKAT. The ACAT acts almost exactly like the GHC in all tested settings. Additional simulations under a variety of different models are reported in Appendices F and G of the Supplementary material available at Biostatistics online.
7. GTEx data analysis
Many studies investigating the genetic etiology of lung cancer have been conducted across diverse cohorts of subjects, identifying and validating dozens of risk loci. However, the precise biological mechanisms linking risk loci and disease remain unclear, and it is still difficult to pinpoint exact causal variants (Bossé and Amos, 2018). There are various ongoing efforts to characterize loci with possible translational impact, and identifying regions of the genome with effects on gene expression is a popular tactic.
One of the largest GWAS of lung cancer to date included over 80 000 subjects and was conducted by the International Lung Cancer Consortium (ILCCO) (McKay and others, 2017). This study identified 18 risk loci of primary interest. All SNPs at a risk locus were tested for association with the lung expression value of a likely risk gene at that locus (one risk gene per locus, 18 in all). Significant associations helped suggest eQTL disease mechanisms for genes such as RNASET2 and NRG1. The article additionally reported the results of testing the single most significant SNP at each locus against the lung expression of all genes in the vicinity of a locus, to identify possible mechanistic pathways for the most significant variants.
While lung eQTL results are very useful, the sample size was relatively small compared to the GWAS size, and expression studies in other relevant tissues may help provide a more complete understanding of risk mechanisms. Thus, we use set-based testing to complement previous results by re-performing the same eQTL study in blood data. Specifically, we (i) test sets of risk SNPs at a single locus against the blood expression of a single risk gene, aggregating signals across SNPs at the locus and (ii) test sets of blood expressions for all genes at a risk locus against individual risk SNPs, aggregating signals across gene expression values at a locus.
There are 384 subjects with complete genotyping, covariate, and blood expression data. We perform eQTL testing using the same models reported in previous GTEx analysis, inverse-normal transforming expression values and then fitting linear regression models with covariates for sex, genotyping platform, genetic PCs, and PEER factors (Battle and others, 2017). When testing sets of SNPs against one gene expression in analysis (i), we use all ILCCO-identified risk variants associated at
in a 250 kb window centered at the ILCCO-identified risk gene. When testing sets of gene expression values against one variant in (ii), we use all genes in a 2 Mb window centered around the ILCCO-identified sentinel SNP. These window sizes are used to create sets of approximately equal size in analysis. Combining all SNPs (dots) in Figure 1(a) results in one test for (i) and combining all genes (lines) in Figure 1(b) results in one test for (ii).
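The inverse-normal transformation applied to expression values before model fitting can be sketched as below. This is a generic rank-based version using the Blom offset of 3/8 and average ranks for ties; both choices are assumptions made for illustration and may not match the exact GTEx pipeline, and the function name is hypothetical.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.375):
    """Map values to normal quantiles based on their ranks (illustrative sketch).

    Tied values receive the average of the ranks they span. The offset of
    3/8 is an assumed default.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    standard_normal = NormalDist()
    return [
        standard_normal.inv_cdf((r - offset) / (n - 2.0 * offset + 1.0))
        for r in ranks
    ]
```

The transformed expression values would then serve as the outcome in a linear regression on genotype and the listed covariates.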
The five genes (out of 18) that pass a Bonferroni-corrected level of
in the multiple explanatory factor analysis (i) are shown in Table 1. Thus, there is evidence that lung cancer-associated variants near RNASET2, NRG1, RAD52, CHRNA2, and CHRNA5 regulate the expression levels of these genes in blood. All five findings further bolster previous claims of translational relevance. Although these results are only generated from application to a single data set, the generalized tests do often provide stronger evidence of association. Additional analysis details are available in Appendices G–J of the Supplementary material available at Biostatistics online.
Table 1.
Top five associations between (i) blood gene expression of one ILCCO-identified risk gene and a set of lung cancer risk SNPs near that gene and (ii) a single ILCCO-identified sentinel variant and a set of nearby blood gene expression values. Some p-values are too low for the default numerical precision in R. If the sentinel variant is unavailable in GTEx, the nearest significantly associated variant is used
| | GBJ | GHC | iBJ | iHC | ACAT | SKAT |
|---|---|---|---|---|---|---|
| (i) Multiple explanatory factors locus | | | | | | |
| RNASET2 | | | | | | |
| NRG1 | | | | | | |
| RAD52 | | | | | | |
| CHRNA2 | | | | | | |
| CHRNA5 | | | | | | |
| (ii) Multiple outcomes locus | | | | | | |
| RNASET2 | | | | | | |
| AMICA1 | | | | | | |
| HCP5 | | | | | | |
| CHRNA5 | | | | | | |
| RAD52 | | | | | | |
The five single variants (out of 18) that pass a Bonferroni correction in the multiple outcomes analysis (ii) are also shown in Table 1. There is evidence that the sentinel SNPs at RNASET2, AMICA1, HCP5, CHRNA5, and RAD52 regulate the expression levels of nearby genes in blood. By chance, the expression values at the ILCCO-chosen loci generally demonstrate very low correlation, with all but one correlation structure possessing a median correlation value of less than 0.1. In such situations, we have seen that the innovated tests are not necessarily preferred, and indeed the results are mixed. Intuitively, when there is little correlation, the method for handling this correlation is not likely to be a large factor in the strength of the observed association. However, the aforementioned analysis allows us to understand why such results are expected and offers an approach for choosing tests when performance differences are more likely to be large.
8. Discussion
With the increasing popularity of large genetic compendiums, there continue to be more opportunities for, and interest in, both multiple explanatory factors and multiple outcomes set-based testing. We have analyzed finite sample properties of detection boundary tests in such settings and demonstrated how the correlation structure of the set and the type of multiplicity should influence the choice of test statistic. When interest lies in specific situations, we have shown how to calculate lower and upper power bounds that facilitate the straightforward design of genetic association studies. Software to perform bounds calculations is publicly available in the DBpower R package.
In the multiple explanatory factors setting, generalized tests possess moderate advantages in the largest signal weights, and these advantages often increase as correlation rises, although the increase is bounded. However, in the multiple outcomes setting, the weighting advantages of innovated tests are often even more dramatic as correlation increases, and increased densities of signals provide even more power compared to generalized tests. This manuscript has focused on continuous outcomes because establishing the modeling framework and interpreting test performance is more straightforward in such a setting. However, results pertaining to the operating characteristics and power bounds of various tests rely only on the distribution of
and can apply, for example, to the case where
are calculated using binary outcomes as well.
We note that the power bounds can sometimes be very wide as in parts of Figure 3; however, the bounds generally become tighter with increasing correlation. The medium and high correlation setting is actually of most importance, while the low correlation setting is less interesting. When there is no correlation, the generalized and innovated approaches are exactly the same, and similarly, in low correlation settings the two approaches show smaller differences in performance. In the more interesting high correlation setting, the bounds may significantly increase power by helping to choose a better test. When applied genome wide, the bounds will be more informative for certain sets, but if additional precision is required in any given situation, the inequalities can be made tighter at the cost of more computational burden, as we have shown.
Simulation results confirm that the analytical bounds calculations work well and further demonstrate that tests known to reach the rare-weak detection boundary generally perform well over various sparse testing situations. In very sparse signal situations, the variance component test can underperform, while the ACAT performs almost identically to the GHC and is well suited for very sparse settings. The GHC and GBJ both perform well when generalized tests are advantaged, and the same is true of the iHC and iBJ when innovated tests are advantaged. It remains a reasonable choice to use set-based tests that can reach the detection boundary in genetic association studies.
Our analysis of the lung cancer data finds that the disease-associated SNPs around five genes demonstrate evidence of regulating the expression levels of those genes in blood. Coupled with previous lung eQTL analysis, these results suggest that there are multiple possible risk mechanisms that should be investigated at each locus. In multiple outcomes analysis, individual sentinel SNPs at RNASET2, AMICA1, HCP5, CHRNA5, and RAD52 appear to regulate the blood expression values of genes around these risk loci, demonstrating their functional behavior. Such results help verify that these SNPs should be followed up in further translational studies.
It would be interesting in future research to extend other types of detection boundary tests (Jager and Wellner, 2007) to account for within-set correlation in the presence of sparse signals. Such tests may demonstrate different performance than the GHC and GBJ. It is also of interest to investigate whether power gains can be achieved by removing some portion of variables with low weights and only testing a subset of highly upweighted variables.
Acknowledgments
The authors are grateful to two reviewers and the associate editor for their valuable comments, which greatly improved the manuscript.
Conflict of Interest: None declared.
Contributor Information
Ryan Sun, Department of Biostatistics, University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030, USA.
Andy Shi, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA 02215, USA.
Xihong Lin, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA 02215, USA.
Software
An R package implementing the methods described in this article, along with documentation and examples, is available at https://cran.r-project.org/web/packages/DBpower/index.html.
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Funding
National Institutes of Health (NIH) (R03-DE029238).
References
- Barnett, I., Mukherjee, R. and Lin, X. (2017). The generalized higher criticism for testing SNP-set effects in genetic association studies. Journal of the American Statistical Association 112, 64–76.
- Battle, A., Brown, C. D., Engelhardt, B. E. and Montgomery, S. B. (2017). Genetic effects on gene expression across human tissues. Nature 550, 204–213.
- Berk, R. H. and Jones, D. H. (1979). Goodness-of-fit test statistics that dominate the Kolmogorov statistics. Probability Theory and Related Fields 47, 47–59.
- Bossé, Y. and Amos, C. I. (2018). A decade of GWAS results in lung cancer. Cancer Epidemiology and Prevention Biomarkers 27, 363–379.
- Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., O’Connell, J. and others. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209.
- Chen, S. X., Li, J. and Zhong, P.-S. (2019). Two-sample and ANOVA tests for high dimensional means. The Annals of Statistics 47, 1443–1474.
- Denny, J. C., Ritchie, M. D., Basford, M. A., Pulley, J. M., Bastarache, L., Brown-Gentry, K., Wang, D., Masys, D. R., Roden, D. M. and Crawford, D. C. (2010). PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 26, 1205–1210.
- Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32, 962–994.
- Gaynor, S. M., Sun, R., Lin, X. and Quackenbush, J. (2019). Identification of differentially expressed gene sets using the generalized Berk–Jones statistic. Bioinformatics 35, 4568–4576.
- GTEx Consortium and others. (2020). The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330.
- Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. The Annals of Statistics 38, 1686–1732.
- Harvey, P. D., Sun, N., Bigdeli, T. B., Fanous, A. H., Aslan, M., Malhotra, A. K., Lu, Q., Hu, Y., Li, B., Chen, Q. and others. (2020). Genome-wide association study of cognitive performance in US veterans with schizophrenia or bipolar disorder. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 183, 181–194.
- Hu, Y., Li, M., Lu, Q., Weng, H., Wang, J., Zekavat, S. M., Yu, Z., Li, B., Gu, J., Muchnik, S. and others. (2019). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature Genetics 51, 568–576.
- Jager, L. and Wellner, J. A. (2007). Goodness-of-fit tests via phi-divergences. The Annals of Statistics 35, 2018–2053.
- Lee, S., Abecasis, G. R., Boehnke, M. and Lin, X. (2014). Rare-variant association analysis: study designs and statistical tests. The American Journal of Human Genetics 95, 5–23.
- Liu, Y., Chen, S., Li, Z., Morrison, A. C., Boerwinkle, E. and Lin, X. (2019). ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics 104, 410–421.
- Liu, Y., Xia, J., McKay, J., Tsavachidis, S., Xiao, X., Spitz, M. R., Cheng, C., Byun, J., Hong, W., Li, Y. and others. (2021). Rare deleterious germline variants and risk of lung cancer. NPJ Precision Oncology 5, 1–12.
- Liu, Z., Barnett, I. and Lin, X. (2020). A comparison of principal component methods between multiple phenotype regression and multiple SNP regression in genetic association studies. The Annals of Applied Statistics 14, 433–451.
- Liu, Z. and Lin, X. (2019). A geometric perspective on the power of principal component association tests in multiple phenotype studies. Journal of the American Statistical Association 114, 975–990.
- McKay, J. D., Hung, R. J., Han, Y., Zong, X., Carreras-Torres, R., Christiani, D. C., Caporaso, N. E., Johansson, M., Xiao, X., Li, Y. and others. (2017). Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes. Nature Genetics 49, 1126–1132.
- Moscovich-Eiger, A., Nadler, B. and Spiegelman, C. (2016). On the exact Berk-Jones statistics and their p-value calculation. Electronic Journal of Statistics 10, 2329–2354.
- Sun, R. and Lin, X. (2020). Genetic variant set-based tests using the generalized Berk–Jones statistic with application to a genome-wide association study of breast cancer. Journal of the American Statistical Association 115, 1079–1091.
- Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M. and Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics 89, 82–93.