Biostatistics (Oxford, England). 2022 Aug 24;25(1):171–187. doi: 10.1093/biostatistics/kxac036

Differences in set-based tests for sparse alternatives when testing sets of outcomes compared to sets of explanatory factors in genetic association studies

Ryan Sun 1, Andy Shi 2, Xihong Lin 3
PMCID: PMC10724113  PMID: 36000269

Summary

Set-based association tests are widely popular in genetic association settings for their ability to aggregate weak signals and reduce multiple testing burdens. In particular, a class of set-based tests including the Higher Criticism, Berk–Jones, and other statistics have recently been popularized for reaching a so-called detection boundary when signals are rare and weak. Such tests have been applied in two subtly different settings: (a) associating a genetic variant set with a single phenotype and (b) associating a single genetic variant with a phenotype set. A significant issue in practice is the choice of test, especially when deciding between innovated and generalized type methods for detection boundary tests. Conflicting guidance is present in the literature. This work describes how correlation structures generate marked differences in relative operating characteristics for settings (a) and (b). The implications for study design are significant. We also develop novel power bounds that facilitate the aforementioned calculations and allow for analysis of individual testing settings. In more concrete terms, our investigation is motivated by translational expression quantitative trait loci (eQTL) studies in lung cancer. These studies involve both testing for groups of variants associated with a single gene expression (multiple explanatory factors) and testing whether a single variant is associated with a group of gene expressions (multiple outcomes). Results are supported by a collection of simulation studies and illustrated through lung cancer eQTL examples.

Keywords: Detection boundary, Genetic association study, Multiple outcomes, Set-based inference, Sparse alternative

1. Introduction

In recent years, rich genetic compendiums quantifying a plethora of biological attributes have become increasingly popular in medical research. In addition to genetic information, many collections offer data on tens of thousands of phenotypic features as well. Publicly accessible examples include the Genotype-Tissue Expression Project (GTEx) (Battle and others, 2017) and the UK Biobank (Bycroft and others, 2018). Researchers have taken full advantage of these resources by performing massive numbers of association analyses under a variety of frameworks including genome-wide association studies (GWAS) (Lee and others, 2014) and phenome-wide association studies (PheWAS) (Denny and others, 2010).

It is common to carry out the aforementioned analyses using set-based inference strategies that group separate hypothesis tests under a single global null hypothesis. Such tests can aggregate rare and weak effects into a more detectable signal while naturally reducing the multiplicity burden, thus alleviating two of the main challenges in genetic association studies. The groupings are generally motivated by natural biological constructs. Two broad types of settings have commonly been considered: (a) the multiple explanatory variable setting that tests a set of genetic variants against a single phenotype (Wu and others, 2011) and (b) the multiple outcome setting that tests a single genetic variant for association with a set of phenotypes (Liu and Lin, 2019).

Previous work focused on dense alternatives (Liu and others, 2020) has shown that individual tests and dense set-based tests perform differently in the two settings. In particular, for the multiple explanatory variable setting (a), individual association tests based on low-order genotype principal components (PCs) have more power (where low order refers to the PCs paired with the largest eigenvalues), and the variance component set-based test can be thought of as upweighting low-order PC-based test statistics. For the multiple outcomes setting (b), higher-order PCs of the outcomes are more powerful in individual association tests, and the Wald and variance component tests upweight higher-order PC-based test statistics. However, individual association tests and dense alternative type tests may not be applicable in many genetics settings, where effects are often sparse (Barnett and others, 2017). Little is known about the relative performance of set-based tests for sparse alternatives in settings (a) and (b), which is the focus of this work.

One class of set-based tests for sparse alternatives has recently become popular (Hu and others, 2019; Gaynor and others, 2019; Harvey and others, 2020) in genetics studies for asymptotically reaching a so-called rare-weak detection boundary (Jager and Wellner, 2007). This class, which includes the Higher Criticism (HC), Berk–Jones (BJ), and other statistics (Moscovich-Eiger and others, 2016), can in a certain sense detect the weakest signals detectable by any statistical procedure under sparse alternatives (Donoho and Jin, 2004; Berk and Jones, 1979). Sparse alternatives are those where the number of signals is less than the square root of the set size; other settings are dense (Donoho and Jin, 2004). Extensions to account for correlated elements in a set (Barnett and others, 2017; Sun and Lin, 2020) and other related strategies continue to be proposed (Moscovich-Eiger and others, 2016; Chen and others, 2019), although the HC and BJ type ideas remain arguably the most well known.

A significant issue in practice is the choice of which specific detection boundary test to apply for settings (a) and (b). When working in genetic association studies, it is necessary to accommodate varying degrees of correlation between elements in a set, and different modifications to the HC and BJ are available for this purpose. Two options are to adapt the test statistics directly to create the generalized HC (GHC) and BJ (GBJ) tests (Barnett and others, 2017; Sun and Lin, 2020) or to decorrelate the elements of a set before applying standard HC and BJ (Hall and Jin, 2010). The latter strategy is known as the innovated HC (iHC) or BJ (iBJ). It is of great interest to investigate how the two approaches behave in settings (a) and (b). Conflicting guidance is given about these methods in the literature (Hall and Jin, 2010; Barnett and others, 2017), and precise comparisons are challenged by the computational burden of performing power calculations in realistic settings.

To ground the discussion in more concrete terms, we briefly introduce our motivating example of lung cancer expression quantitative trait loci (eQTL) studies in Figure 1. The genetic etiology of lung cancer has been deeply probed in GWAS involving tens of thousands of subjects (McKay and others, 2017). It is of pivotal importance to study the biological mechanisms linking GWAS-identified variants to disease risk (Bossé and Amos, 2018). eQTL analysis is one of the most popular methods to explore these underlying biological mechanisms (Liu and others, 2021). As sample sizes in eQTL data sets like GTEx (GTEx Consortium and others, 2020) are much smaller than in GWAS, for example, often with only hundreds of subjects, set-based tests for sparse alternatives are an ideal choice for inference.

Fig. 1.


Examples of common lung cancer eQTL analyses testing both (a) multiple explanatory factors and (b) multiple outcomes. In (a), many disease-associated Single Nucleotide Polymorphisms (SNPs) at a risk locus are tested for association with the expression of one nearby gene, RAD52. Each dot is one SNP, the y-axis is the level of association between the SNP and RAD52 expression in blood, and the color represents the SNP’s association with lung cancer. Significant expression associations, as observed here, provide evidence that variants at the locus affect risk of lung cancer by regulating the level of RAD52. More specific follow-up studies can then be performed. In (b), the single previously identified risk SNP rs7705526 located on chromosome 5 at position 1285974 is tested for association with the expression of many nearby genes surrounding the TERT risk locus. Each line is one gene, and the y-axis is level of association with the SNP. Significant associations, which are not observed here, would provide evidence that the variant possesses functions related to regulating the expression levels of genes surrounding the TERT risk locus.

A common eQTL multiple explanatory factor analysis is to group many correlated disease-associated SNPs at a risk locus and then test the association of the set against the expression values of nearby genes (McKay and others, 2017), as in Figure 1(a). Significant associations attribute a specific disease mechanism to the risk locus, enabling more targeted follow-up studies. A complementary multiple outcomes analysis is to test an individual SNP against the expression levels of many genes located around a genomic region of interest (McKay and others, 2017), as in Figure 1(b). Strong associations demonstrate the functional behavior of a SNP, which is crucial for identifying causal variants and further understanding mechanisms of disease.

The two primary goals of this article are to analytically study finite sample properties of the innovated and generalized methods in settings (a) and (b) and to provide tools for the individualized design of set-based genetic association studies. A key theme is that within-set correlation influences the choice of test differently depending on the type of multiplicity—whether in the explanatory factors or the outcome. Insights into test performance are greatly facilitated by the development of upper and lower bounds on the exact power; power bounds unlock analytical calculations and comparisons in realistic testing scenarios.

Specifically, we show that in the multiple explanatory factor setting (a), the correlation structure often leads to signal weights that moderately favor generalized tests compared to innovated tests. However, in the multiple outcomes setting (b), innovated tests can benefit from even larger advantages in signal configuration, depending on the specific within-set correlation structure. A simulation study in the setting of lung cancer eQTL analysis is used to confirm analytical calculations. Set-based tests mimicking the aforementioned translational lung cancer studies are then applied in GTEx.

The remainder of the article is organized as follows. In Section 2, we first define our notation and review the generalized and innovated test statistics. In Section 3, we develop upper and lower bounds on the finite sample power of detection boundary tests, facilitating numerical studies of performance. Section 4 demonstrates how the properties of the different tests can vary significantly depending on the type of multiplicity and the within-set correlation structure. Section 5 demonstrates how to utilize results from the previous two sections in realistic study settings, and Section 6 confirms calculations with a simulation study. In Section 7, we perform testing in the GTEx dataset to elucidate how GWAS-selected risk SNPs regulate gene expression activity. We conclude with a discussion in Section 8.

2. Methods

2.1. Model and notation for multiple explanatory factors

We first develop our notation by reviewing a standard framework for genetic association studies with a set of multiple genetic variants and a single outcome. Suppose for each of Inline graphic subjects we observe a genotype vector Inline graphic of Inline graphic SNPs that belong to a set with biological relevance, for instance, a gene. For clarity of presentation, assume that the genotypes have been centered and standardized so that Inline graphic and Inline graphic; the nonstandardized case follows in a similar manner, albeit with additional notational complexity. Suppose that for each subject, we also observe a scalar outcome Inline graphic and a Inline graphic-dimensional vector Inline graphic of additional covariates.

A standard model for this situation is

graphic file with name Equation1.gif (2.1)

where Inline graphic and Inline graphic are the fixed effects of the non-SNP and SNP covariates, respectively. Commonly, a Gaussian distributional assumption Inline graphic is also applied. Both the generalized and innovated approaches are designed to test the global null hypothesis of no SNP effects Inline graphic. Under this global null, a marginal score statistic for associating the Inline graphicth SNP with the outcome is

graphic file with name Equation2.gif

where Inline graphic, Inline graphic is the Inline graphic identity matrix, Inline graphic, Inline graphic, Inline graphic is a consistent estimator for Inline graphic, and Inline graphic. Let Inline graphic be the vector of test statistics for each genotype in the set. If no additional covariates are in the model, for example, Inline graphic where Inline graphic, then Inline graphic follows

graphic file with name Equation3.gif (2.2)

where Inline graphic is the empirical correlation matrix of the genotypes. Note that poor estimates of Inline graphic, such as estimates generated when incorrectly assuming the null hypothesis, often lead to only negligible differences from (2.2) and are mostly immaterial to our main conclusions. We can see that Inline graphic enters into both the mean and covariance of Inline graphic.

The exposition assuming no additional covariates is used for both simplicity of presentation and because it allows for translation to the summary statistic setting, where individual-level data are not provided for subjects Inline graphic and only test statistics Inline graphic are available. In such a setting, we can assume (2.2) and estimate Inline graphic with reference genotype panels, so that the distribution of the test statistics is still known and upcoming results are still applicable. Even if summary statistics are created with non-SNP covariates, (2.2) can still be a good approximation if the additional covariates are approximately independent of the genotypes (Sun and Lin, 2020). Exact calculations for the model with individual-level data and non-SNP covariates can be performed with straightforward adjustments.

2.2. Model and notation for multiple outcomes

When testing in the multiple outcomes setting, assume for each subject Inline graphic we observe only the genotype at a single variant Inline graphic and a set of Inline graphic related outcomes Inline graphic, for instance a set of gene expression values for genes surrounding a risk locus. Assume that the SNP has again been centered and scaled as previously described. We then have the Inline graphic models

graphic file with name Equation4.gif (2.3)

where Inline graphic is again the vector of other covariates, and we suppose for simplicity it remains the same for all Inline graphic. Generally, to account for correlation among the Inline graphic outcomes, it is assumed that Inline graphic follows Inline graphic where Inline graphic is the correlation matrix of the outcomes, Inline graphic, and Inline graphic is the standard deviation of outcome Inline graphic.

Both the generalized and innovated methods aim to test the set-based null hypothesis Inline graphic, where Inline graphic, and we can construct marginal score statistics for each Inline graphic as

graphic file with name Equation5.gif

where Inline graphic, Inline graphic is a consistent estimator of Inline graphic, and Inline graphic. Then when there are no covariates other than Inline graphic, Inline graphic follows

graphic file with name Equation6.gif (2.4)

We will again proceed with the no additional covariates approach so that results are applicable to summary statistics assumed to have the distribution in (2.4); the extensions are again readily accessible.

2.3. Generalized and innovated tests

For the sake of convenience, we briefly review the HC and BJ. Use Inline graphic to denote test statistics of dimension Inline graphic and assume Inline graphic, where the diagonal elements of Inline graphic are 1. Let Inline graphic denote the survival function of a standard normal random variable. Further define Inline graphic to be the number of test statistics with a magnitude greater than or equal to some fixed threshold Inline graphic. For fixed Inline graphic, and if Inline graphic, then clearly Inline graphic Inline graphic Binomial Inline graphic. The HC statistic is Inline graphic and is the maximum of a centered and standardized version of Inline graphic (Donoho and Jin, 2004). The BJ statistic is Inline graphic and can be viewed as the maximum of a likelihood ratio test on the mean parameter of Inline graphic (Berk and Jones, 1979).
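As a rough illustration of the HC construction reviewed above, the following sketch computes a Higher Criticism value from a vector of independent test statistics. It assumes the common form in which the count of magnitudes at or above a threshold t is centered and scaled by its null Binomial mean and variance, with success probability equal to twice the standard normal survival function at t, and the maximum is taken over the observed magnitudes. The function names and the symbols `d`, `k`, and `p` are illustrative choices made here, not notation from the article.

```python
import math


def normal_sf(t):
    """Survival function of a standard normal, P(Z >= t)."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def higher_criticism(z):
    """Sketch of an HC statistic for independent statistics z.

    Assumption: the supremum over thresholds is attained at one of the
    observed magnitudes, so only those thresholds are evaluated.
    """
    d = len(z)
    mags = sorted((abs(v) for v in z), reverse=True)
    best = -math.inf
    for k, t in enumerate(mags, start=1):
        # Null probability that a single |Z_j| is at least t.
        p = 2.0 * normal_sf(t)
        if p <= 0.0 or p >= 1.0:
            continue  # variance term is degenerate at this threshold
        # k magnitudes are >= t; center and scale by the Binomial(d, p) moments.
        stat = (k - d * p) / math.sqrt(d * p * (1.0 - p))
        best = max(best, stat)
    return best


if __name__ == "__main__":
    print(higher_criticism([5.0, 0.1, 0.2, 0.3, 0.4]))  # one strong signal
    print(higher_criticism([0.5, 0.1, 0.2, 0.3, 0.4]))  # no strong signal
```

A single large magnitude drives the statistic upward because very few exceedances are expected at that threshold under the null.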

The GHC replaces the denominator in the HC with the correct variance of Inline graphic when the correlation matrix of Inline graphic is Inline graphic. GBJ similarly adjusts the likelihood ratio quantity to explicitly account for overdispersion in Inline graphic caused by correlated elements of Inline graphic (Barnett and others, 2017; Sun and Lin, 2020). In contrast, innovated tests iHC and iBJ transform the Inline graphic to be independent, for example using a Cholesky decomposition of Inline graphic. The standard HC or BJ can then be applied to the transformed Inline graphic (Hall and Jin, 2010).
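The decorrelation step used by innovated tests can be sketched with a hand-written Cholesky factorization. This is a minimal example under the assumption that the correlation matrix is known and positive definite; the helper names `cholesky` and `decorrelate` are hypothetical and are not taken from any package mentioned in the article.

```python
import math


def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma must be positive definite)."""
    d = len(sigma)
    L = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def decorrelate(z, sigma):
    """Solve L y = z by forward substitution, so that Cov(y) is the identity."""
    L = cholesky(sigma)
    y = []
    for i in range(len(z)):
        s = sum(L[i][k] * y[k] for k in range(i))
        y.append((z[i] - s) / L[i][i])
    return y


if __name__ == "__main__":
    sigma = [[1.0, 0.5], [0.5, 1.0]]
    print(decorrelate([1.0, 2.0], sigma))
```

The standard HC or BJ would then be applied to the output of `decorrelate` rather than to the original statistics.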

3. Power analysis of generalized and innovated tests

A direct approach to analyze the performance of generalized and innovated detection boundary tests is to perform power calculations across different settings. Unfortunately, the computational burden of existing approaches is extremely high, and it is generally impractical to perform such calculations (Sun and Lin, 2020). Thus, it has been difficult to make precise quantitative statements about performance except when comparing simulation results (Barnett and others, 2017).

The goal of this section is to introduce upper and lower bounds on the exact power that are simpler to compute, allow for relative comparisons of tests without performing complicated simulations, and can be used to design practical genetic association studies. Continuing with the setting of Section 2.3, let Inline graphic be the vector that results from applying the absolute value operator to each element of Inline graphic. When testing the set-based null with Inline graphic, the rejection regions of tests we discuss can be written as

graphic file with name Equation7.gif (3.5)

where Inline graphic are the order statistics of Inline graphic and Inline graphic are the Inline graphic boundary constants that depend on the test statistic Inline graphic (e.g., GBJ or iHC), the level of the test Inline graphic, and the correlation structure Inline graphic. The different values of Inline graphic used for each test lead to differences in finite sample performance; their calculation has previously been explored in detail by multiple authors (Moscovich-Eiger and others, 2016; Sun and Lin, 2020). In the following, discussion of rejection regions Inline graphic and boundary constants Inline graphic will suppress the dependency on test parameters to reduce the notational burden.
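The decision rule implied by such a rejection region can be sketched as follows, assuming the region is the union of events in which the j-th smallest magnitude meets or exceeds the j-th boundary constant. That reading, and the name `rejects`, are assumptions made for illustration; the boundary constants themselves must come from the referenced calculations and are simply passed in here.

```python
def rejects(z, bounds):
    """Return True if any ordered magnitude reaches its boundary constant.

    z      : test statistics for the set
    bounds : boundary constants, ordered to match the ascending order statistics
             (hypothetical input; computing them is outside this sketch)
    """
    if len(z) != len(bounds):
        raise ValueError("need one boundary constant per test statistic")
    mags = sorted(abs(v) for v in z)  # ascending order statistics of |z|
    return any(m >= b for m, b in zip(mags, bounds))


if __name__ == "__main__":
    # Illustrative numbers only: two statistics, two made-up boundary constants.
    print(rejects([0.5, 3.5], [2.0, 3.0]))
    print(rejects([0.5, 2.5], [2.0, 3.0]))
```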

We begin by discussing lower bounds. From (3.5), the exact value of Inline graphic when Inline graphic can be obtained with Inline graphic computations of a Inline graphic-dimensional multivariate normal integral (Sun and Lin, 2020). The computation can be vastly simplified by working with simpler subsets of the rejection region. For example, one potential subset is

graphic file with name Equation8.gif (3.6)

Clearly Inline graphic, and the probability of Inline graphic can be calculated as

graphic file with name Equation9.gif (3.7)

(see Appendix A of the Supplementary material available at Biostatistics online). Intuitively, the probability of Inline graphic can capture a large portion of the power because if there is an associated feature, we would expect its corresponding test statistic to have the largest magnitude, and so most rejections of the null would happen when this magnitude exceeds Inline graphic. The difference between the exact power and the probability of Inline graphic is Inline graphic.
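To make the intuition concrete, the sketch below evaluates the probability that the largest magnitude exceeds the largest boundary constant in the special case of independent unit-variance normal statistics, as would hold after decorrelation. Independence is a simplifying assumption made here so that the multivariate normal probability factorizes; with correlated statistics a multivariate normal integral is needed instead. The names `prob_max_exceeds` and `b_top` are hypothetical.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def prob_max_exceeds(means, b_top):
    """P(max_j |Z_j| >= b_top) for independent Z_j ~ N(means[j], 1).

    Computed as one minus the probability that every magnitude stays
    below b_top, which factorizes under independence.
    """
    stay_below = 1.0
    for m in means:
        stay_below *= normal_cdf(b_top - m) - normal_cdf(-b_top - m)
    return 1.0 - stay_below


if __name__ == "__main__":
    # Illustrative values: five statistics, one with a nonzero mean.
    print(prob_max_exceeds([0.0] * 5, 3.0))
    print(prob_max_exceeds([3.0, 0.0, 0.0, 0.0, 0.0], 3.0))
```

Moving a single mean away from zero raises this probability sharply, which is why this subset can account for much of the power when signals are sparse.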

It is possible to make the lower bound sharper by extending Inline graphic to fill more and more of Inline graphic, at the cost of escalating the computational burden. For example, a logical next step to tighten the bound is to use

graphic file with name Equation10.gif (3.8)

which is also clearly a subset of Inline graphic. The probability of Inline graphic can be calculated by

graphic file with name Equation11.gif (3.9)

(see Appendix B of the Supplementary material available at Biostatistics online). Similar reasoning can be used to calculate an Inline graphic that adds events related to Inline graphic, Inline graphic that adds events related to Inline graphic, and so on, all the way up to the exact rejection region Inline graphic. For the purposes of choosing between innovated and generalized tests, we will show that Inline graphic is already quite informative.

Along with a lower bound, it is obviously also useful to know an upper bound on the power. Similar to the lower bound case, one way to calculate upper bounds on power is to use subsets of Inline graphic, which can be written as

graphic file with name Equation12.gif

One possible subset of Inline graphic is then given by

graphic file with name Equation13.gif (3.10)

and we can write

graphic file with name Equation14.gif (3.11)

The difference between this upper bound and the exact power is

graphic file with name Equation15.gif

Similarly to the lower bound case, a tighter bound is given by

graphic file with name Equation16.gif (3.12)

and we can calculate

graphic file with name Equation17.gif (3.13)

Again, the bound can be lowered until it becomes exactly Inline graphic, at the cost of increasing computational burden.

Calculation of the bounds and exact power when using iBJ and GBJ at Inline graphic is illustrated in Figure 2 with example sets of five elements. Data are generated under models (2.1) and (2.3) for the multiple explanatory factors and multiple outcomes settings, respectively, with no non-SNP covariates. In the high correlation setting (bottom), each off-diagonal element of the within-set correlation matrix is 0.5 greater than the corresponding value in the low correlation setting (top). Pre-standardization, the regression coefficient vector has one nonzero element equal to 0.25. There are Inline graphic subjects, and the variance of each outcome is equal to 1.

Fig. 2.


Lower bound (using Inline graphic), exact calculation, and upper bound (using Inline graphic) for power of iBJ and GBJ in the multiple explanatory factors (left) and multiple outcomes (right) settings when there are five elements in a set and the correlation between them is low (top) or high (bottom). In the low correlation matrix, the first block of two elements has pairwise correlation 0.3, the other block of three elements also has all mutual pairwise correlations equal to 0.3, and the correlation between elements of the two blocks is 0.1. In the high correlation matrix, each of the aforementioned off-diagonal values is increased by 0.5. Only one element of the regression coefficients vectors is nonzero, and the location of the nonzero element is given on the x-axis. All SNPs have minor allele frequency of 0.3. Testing is performed at the 0.01 level.

This example is deliberately kept simple for illustrative purposes, but we do see that the bounds are quite accurate when the set size is small; larger examples will also be provided. Furthermore, in the multiple explanatory variable setting (Figures 2(a) and (c)), the GBJ has more power than iBJ. In contrast, in the multiple outcome setting (Figures 2(b) and (d)), the tests have similar power when within-set correlation is low (Figure 2(b)), while iBJ has much more power when within-set correlation is high (Figure 2(d)). In general, the proposed bounds can be used to calculate power under a variety of alternatives prior to performing analysis. Software to perform these calculations is publicly available in the DBpower package.

4. Finite sample operating characteristics

This section analytically describes how within-set correlation affects test signal strengths and sparsity in opposite directions depending on the type of multiplicity.

4.1. Correlation when testing multiple outcomes

We begin by explaining how large power discrepancies between generalized and innovated tests can occur when correlation among outcomes is high, as observed in Figure 2(d). Following the notation in Section 2.2, GHC and GBJ are applied directly to Inline graphic. The innovated tests must first decorrelate Inline graphic. Let Inline graphic. Here, Inline graphic is a diagonal matrix holding the eigenvalues of Inline graphic, sorted so that the largest value is in the first row. Inline graphic is an orthogonal matrix where the Inline graphicth column is the eigenvector corresponding to the Inline graphicth largest eigenvalue. Decorrelated test statistics Inline graphic are then approximately

graphic file with name Equation18.gif

Note that Inline graphic are also the PCs of Inline graphic, so the innovated tests can be seen as applications of HC and BJ to the PC scores of the marginal test statistics.

4.2. Signal weighting for multiple outcomes

Clearly the means of Inline graphic and Inline graphic differ by Inline graphic. The term Inline graphic can be thought of as signal weights in the innovated setting. If the smallest eigenvalues are very small, the elements of Inline graphic can dominate other contributions to the signal. In fact, the elements of Inline graphic are unbounded.
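The unbounded growth of the innovated signal weights can be illustrated with an exchangeable correlation matrix, whose eigenvalues are available in closed form: one eigenvalue equal to 1 + (d - 1)ρ and d - 1 eigenvalues equal to 1 - ρ. The sketch assumes that the weights are the inverse square roots of the eigenvalues; the symbols `d` and `rho` and the function names are illustrative.

```python
def exchangeable_eigenvalues(d, rho):
    """Largest and smallest eigenvalue of a d x d exchangeable correlation matrix."""
    return 1.0 + (d - 1) * rho, 1.0 - rho


def innovated_outcome_weights(d, rho):
    """Assumed innovated weights (eigenvalue ** -1/2) for the two distinct eigenvalues."""
    largest, smallest = exchangeable_eigenvalues(d, rho)
    return largest ** -0.5, smallest ** -0.5


if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9, 0.99):
        w_large_eig, w_small_eig = innovated_outcome_weights(5, rho)
        print(rho, round(w_large_eig, 3), round(w_small_eig, 3))
```

As the common correlation approaches 1, the weight attached to the small eigenvalue grows without limit, while the weight attached to the large eigenvalue shrinks.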

Consider the first signal location setting of Figure 2(d). When using iBJ, the largest weight is Inline graphic, and the largest value of Inline graphic is 4.18, while the largest value of Inline graphic is 3.24. Suppose we are calculating an upper bound on GBJ power as in (3.10) and (3.11). The first term in (3.11) will be Inline graphic, and thus an upper bound is less than 0.56. Then suppose we are calculating a lower bound on iBJ as described in (3.6) and (3.7). The last term in (3.7) is Inline graphic, and thus a lower bound is at least 0.78; here, even moderate values in Inline graphic affect Inline graphic enough to generate large power differences.

More general statements about signal weights can also be made. While it is not possible to characterize all arbitrary correlation structures, we consider some practically useful situations here (see Appendix C of the Supplementary material available at Biostatistics online).

  • Remark 1: One upper bound on the smallest eigenvalue is Inline graphic where Inline graphic is the Inline graphic element of Inline graphic.

  • Remark 2: A lower bound on the largest eigenvalue is Inline graphic, so a second upper bound on the smallest eigenvalue is Inline graphic.

Thus, when there is a large amount of within-set correlation or a subset of highly correlated elements, there will be larger signal weights, and innovated tests will often be preferred over generalized tests in multiple outcomes settings. For more specific circumstances, the example of Figure 2(d) succinctly illustrates the utility of analysis with the proposed power bounds.

4.3. Signal sparsity for multiple outcomes

The other difference between Inline graphic and Inline graphic is the presence of Inline graphic in Inline graphic. The product Inline graphic can both increase and decrease individual elements of Inline graphic, although since it is a rotation matrix, Inline graphic, where Inline graphic denotes Euclidean norm. However, one consistently power-increasing feature of Inline graphic is its ability to, roughly speaking, spread the signal, so that zero means in Inline graphic become nonzero in Inline graphic.

For instance, continuing with our example from Section 4.2, three nonzero means in Inline graphic compared to one in Inline graphic create an increase in signal density that naturally leads to increases in power. More precisely, using terms from (3.7) and (3.13), we can calculate the probability of rejecting the null even if the element of Inline graphic with the largest mean results in the largest observed magnitude but does not cross the largest bound:

graphic file with name Equation19.gif

For the generalized statistics Inline graphic the corresponding probability is 0.001, demonstrating the increase in power that is generated by additional nonzero means.
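The signal-spreading effect described in this subsection can be seen in a two-outcome sketch. For any 2 x 2 correlation matrix with positive off-diagonal entry, the eigenvectors are (1, 1)/√2 and (1, -1)/√2, so the rotation can be written out by hand. The mean vector used below is an arbitrary illustrative value, and `rotate_two_outcomes` is a hypothetical helper name.

```python
import math


def rotate_two_outcomes(mu):
    """Apply the transposed eigenvector matrix of a 2 x 2 correlation matrix to mu.

    Rows of the rotation are the eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
    """
    s = 1.0 / math.sqrt(2.0)
    return [s * (mu[0] + mu[1]), s * (mu[0] - mu[1])]


def euclidean_norm(v):
    return math.sqrt(sum(x * x for x in v))


if __name__ == "__main__":
    mu = [3.0, 0.0]                      # one nonzero mean before rotation
    rotated = rotate_two_outcomes(mu)    # two nonzero means after rotation
    print(rotated, euclidean_norm(mu), euclidean_norm(rotated))
```

The rotation leaves the Euclidean norm unchanged but turns a single nonzero mean into two, which is the increase in signal density discussed above.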

4.4. Correlation when testing multiple explanatory factors

In the setting of Section 2.1, an innovated test for multiple explanatory factors must first decorrelate Inline graphic. Write the previously defined eigendecomposition of Inline graphic as Inline graphic. Under the alternative, the innovated test statistics approximately possess a mean of

graphic file with name Equation20.gif

In contrast, Inline graphic, so the weights Inline graphic are effectively squared compared to the innovated tests. The difference will be more drastic for large eigenvalues.

4.5. Signal weighting and sparsity with multiple explanatory factors

While the largest weights in the generalized setting are effectively squared compared to the innovated setting, this advantage is bounded. The sum of the eigenvalues of Inline graphic equals Inline graphic, the trace of the matrix, so there is an upper limit on Inline graphic, whereas there is no upper limit on Inline graphic.
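A small numerical sketch of this bounded advantage, again using an exchangeable correlation matrix for which the largest eigenvalue is 1 + (d - 1)ρ. It assumes, following the discussion above, that the innovated weight is the square root of an eigenvalue and the generalized weight is the eigenvalue itself; `d`, `rho`, and the function name are illustrative.

```python
def explanatory_top_weights(d, rho):
    """Weights attached to the largest eigenvalue under exchangeable correlation.

    Returns (innovated weight, generalized weight, trace cap), where the
    trace cap d is the sum of all eigenvalues of a d x d correlation matrix.
    """
    largest = 1.0 + (d - 1) * rho
    return largest ** 0.5, largest, float(d)


if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9):
        innovated, generalized, cap = explanatory_top_weights(40, rho)
        print(rho, round(innovated, 3), round(generalized, 3), cap)
```

However strong the correlation, the generalized weight cannot exceed the set size, in contrast to the inverse square root weights that arise for innovated tests with multiple outcomes.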

For more general statements, there are again a variety of bounds on large eigenvalues, including the one referred to in Remark 2 and two additional ones of particular interest.

  • Remark 3: For a positive matrix, a lower bound on the maximal eigenvalue is the minimal row sum for any row in Inline graphic, Inline graphic, where Inline graphic is the Inline graphic element of Inline graphic. An upper bound on the maximal eigenvalue is the maximal row sum for any row in Inline graphic.

  • Remark 4: It follows from Remark 3 that there are at most Inline graphic eigenvalues greater than 1 for a positive matrix Inline graphic, where Inline graphic is the floor operator.
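The row-sum bounds in Remark 3 can be checked numerically. The sketch below builds the low-correlation five-element matrix described in the caption of Figure 2 (blocks of sizes two and three, within-block correlation 0.3, between-block correlation 0.1) and compares its row sums with a power-iteration estimate of the largest eigenvalue. The helper names are hypothetical, and power iteration is used here only as a convenient dependency-free check.

```python
def block_correlation(d1, d2, r1, r2, r3):
    """Two-block correlation matrix: r1 and r2 within blocks, r3 between blocks."""
    d = d1 + d2
    mat = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            if i == j:
                mat[i][j] = 1.0
            elif i < d1 and j < d1:
                mat[i][j] = r1
            elif i >= d1 and j >= d1:
                mat[i][j] = r2
            else:
                mat[i][j] = r3
    return mat


def largest_eigenvalue(mat, iters=500):
    """Power iteration with max-norm scaling; suitable for positive matrices."""
    d = len(mat)
    v = [1.0] * d
    lam = 0.0
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = max(abs(x) for x in w)
        v = [x / lam for x in w]
    return lam


if __name__ == "__main__":
    sigma = block_correlation(2, 3, 0.3, 0.3, 0.1)
    row_sums = [sum(row) for row in sigma]
    print(min(row_sums), largest_eigenvalue(sigma), max(row_sums))
```

The estimated largest eigenvalue falls between the minimal and maximal row sums, as the remark states for a positive matrix.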

We can see that the generalized test advantages in the largest signal weights will increase as the within-set correlation increases. However, as the within-set correlation increases, the number of eigenvalues larger than 1 falls, and so there will be fewer weights that are larger when squared.

Also, in the multiple explanatory factors setting there is no stark contrast in signal sparsity between innovated and noninnovated tests, as the eigenvectors Inline graphic multiply Inline graphic in the mean parameters of both Inline graphic and Inline graphic. Thus in the multiple explanatory factors setting, generalized tests have only the weighting advantage and not the signal density advantage. The generalized tests also see their advantage decreased slightly due to changes in rejection regions, which we describe in more detail in Appendix D of the Supplementary material available at Biostatistics online.

5. Translation to practical set parameters

To translate the above discussion to realistic testing situations, we investigate how larger correlation structures lead to specific eigenvalues. We will model specific correlation structures with block matrices of the form

graphic file with name Equation21.gif (5.14)

where Inline graphic, Inline graphic, and Inline graphic. This structure corresponds to two correlated clusters of features within a set, for example, two linkage disequilibrium blocks at a risk locus. We let the two clusters possess exchangeable correlation structures within themselves at Inline graphic and Inline graphic, and the correlation between the two clusters is Inline graphic.

Straightforward calculations (Liu and Lin, 2019) show there are up to four distinct eigenvalues taking the values Inline graphic, and Inline graphic, where Inline graphic and Inline graphic. We can see that if Inline graphic and Inline graphic increase, there can be eigenvalues very close to 0, and the largest eigenvalue also increases. We can also have an eigenvalue close to 0 even when Inline graphic and Inline graphic are small if Inline graphic grows. A simplification of the model is Inline graphic, in which case there is exchangeable correlation throughout the entire set. In this setting there are only two distinct eigenvalues, Inline graphic and Inline graphic. When the common correlation is large, the smallest eigenvalue can be close to 0, and the largest eigenvalue can be close to the size of the set.
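The eigenvalue structure described above can be illustrated with a short sketch. The symbols d1, d2, rho1, rho2, and rho3 are our own assumed names for the two block sizes, the two within-block correlations, and the between-block correlation, since the original notation is not reproduced here. Under that parameterization, contrasts within a block give the eigenvalues 1 - rho1 and 1 - rho2, and vectors that are constant within each block reduce the problem to a 2 x 2 matrix, giving at most four distinct values. The code verifies each candidate pair by checking that the matrix-vector product equals the eigenvalue times the vector, and assumes rho3 is nonzero.

```python
import math


def block_corr(d1, d2, rho1, rho2, rho3):
    """Two exchangeable blocks (sizes d1, d2) with between-block correlation rho3."""
    d = d1 + d2

    def entry(i, j):
        if i == j:
            return 1.0
        if (i < d1) != (j < d1):
            return rho3
        return rho1 if i < d1 else rho2

    return [[entry(i, j) for j in range(d)] for i in range(d)]


def residual(mat, lam, v):
    """Largest absolute entry of (mat v - lam v); near zero for an eigenpair."""
    mv = [sum(a * b for a, b in zip(row, v)) for row in mat]
    return max(abs(x - lam * y) for x, y in zip(mv, v))


def block_eigenpairs(d1, d2, rho1, rho2, rho3):
    """Candidate (eigenvalue, eigenvector) pairs; assumes rho3 != 0."""
    # Contrasts inside each block.
    within1 = (1 - rho1, [1.0, -1.0] + [0.0] * (d1 + d2 - 2))
    within2 = (1 - rho2, [0.0] * d1 + [1.0, -1.0] + [0.0] * (d2 - 2))
    # Block-constant vectors reduce to [[a11, d2*rho3], [d1*rho3, a22]].
    a11 = 1 + (d1 - 1) * rho1
    a22 = 1 + (d2 - 1) * rho2
    trace = a11 + a22
    det = a11 * a22 - d1 * d2 * rho3 ** 2
    disc = math.sqrt(trace ** 2 - 4 * det)
    pairs = [within1, within2]
    for lam in ((trace + disc) / 2, (trace - disc) / 2):
        pairs.append((lam, [d2 * rho3] * d1 + [lam - a11] * d2))
    return pairs


# Hypothetical values: a block of 4 and a block of 36.
params = dict(d1=4, d2=36, rho1=0.5, rho2=0.2, rho3=0.1)
sigma = block_corr(**params)
pairs = block_eigenpairs(**params)
max_residual = max(residual(sigma, lam, v) for lam, v in pairs)
print([round(lam, 4) for lam, _ in pairs], max_residual)

# Exchangeable special case: all three correlations equal 0.3.
exch = block_eigenpairs(d1=4, d2=36, rho1=0.3, rho2=0.3, rho3=0.3)
exch_values = sorted(lam for lam, _ in exch)
print([round(lam, 4) for lam in exch_values])
```

In the exchangeable case the four candidates collapse to two distinct values, 1 - rho and 1 + (d - 1) rho for a set of size d, which matches the two-eigenvalue description in the text.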

Figures 3(a) and (b) show calculations of power bounds for both the multiple explanatory factors and multiple outcomes settings when Inline graphic and Inline graphic is varied. Figures 3(c) and (d) show the exchangeable correlation case. We see that, as expected, the generalized tests appear to outperform the innovated tests slightly in the multiple explanatory factors setting. The outperformance of the innovated tests as correlation increases in the multiple outcomes setting is more dramatic, also as expected.

Fig. 3.

Lower and upper bounds on power (using Inline graphic and Inline graphic) for iBJ and GBJ in the multiple explanatory factors (left) and multiple outcomes (right) settings when there are 40 elements in a set. In the top row, the correlation structure is given by (5.14) with Inline graphic, and Inline graphic is varied among all values that admit a positive definite matrix. In the bottom row, the correlation structure is exchangeable with Inline graphic, and the level of correlation is varied. Note that the lower bounds for innovated tests can rise above upper bounds for generalized tests in the multiple outcomes setting.

6. Simulation

We next provide empirical support for the previous analytical arguments by conducting simulations designed to mimic the setting of our GTEx lung cancer data analysis. In these simulations, we consider sparse signal settings and primarily compare the GHC and GBJ against the iHC and iBJ. A variance component sequence kernel association test (SKAT) as well as the aggregated Cauchy association test (ACAT) (Liu and others, 2019) are also presented as reference points, but the main goal of this work is to illustrate the differences between innovated and generalized tests, not to repeat comparisons of detection boundary tests against other strategies (Barnett and others, 2017). SKAT is known to perform better in dense situations, while ACAT has been shown to reach the detection boundary only in parts of the sparse regime. Simulations are conducted with 400 subjects and 40 elements per set to approximate the GTEx analysis.
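For readers less familiar with the ACAT reference method, the sketch below shows the Cauchy combination of p-values on which it is based (Liu and others, 2019). It is an illustration only: the function name is hypothetical, equal weights are assumed by default, and the special handling of extremely small p-values used in production software is omitted.

```python
import math


def acat_pvalue(pvalues, weights=None):
    """Combine p-values with the Cauchy combination statistic.

    Each p-value is mapped to a standard Cauchy variate through
    tan((0.5 - p) * pi); the weighted average is then converted back to
    a p-value with the Cauchy survival function.
    """
    if weights is None:
        weights = [1.0] * len(pvalues)  # equal weights (an assumption)
    total = sum(weights)
    stat = sum(w * math.tan((0.5 - p) * math.pi)
               for w, p in zip(weights, pvalues)) / total
    return 0.5 - math.atan(stat) / math.pi


# One small p-value among 40 otherwise unremarkable ones (sparse signal).
sparse_example = acat_pvalue([0.001] + [0.5] * 39)
print(sparse_example)
```

Because the tangent transform is dominated by the smallest p-values, the combined p-value in the sparse example is driven almost entirely by the single small entry.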

The data are generated without standardization from models (2.1) and (2.3), where Inline graphic is composed of a standard normal random variable and a Bernoulli random variable with mean 0.5. The effect sizes on the nongenetic covariates are set at 1 always. The first four elements of Inline graphic and Inline graphic are always generated from a Uniform(0, 0.25) or Uniform(0, 0.4) distribution, respectively, and all other elements are 0. Thus, there are four true causal SNPs out of the 40 total SNPs in model (2.1), and the individual SNP affects four out of the 40 outcomes in model (2.3). In the multiple explanatory factors setting, we generate the genotypes Inline graphic using the block correlation structure given in (5.14), where the Inline graphic block corresponds to the four signal variants. In the multiple outcomes setting, Inline graphic is generated with this same correlation structure, and Inline graphic corresponds to the correlation among the four signal outcomes. Minor allele frequencies are held constant at 0.3, and all testing is performed at Inline graphic.
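The data-generating scheme for the multiple outcomes setting can be sketched as follows. This is a rough illustration under several assumptions of ours rather than the simulation code used for the paper: the function and argument names are hypothetical, genotypes are drawn under Hardy-Weinberg equilibrium, no intercept is included, the Uniform(0, 0.4) range is taken to apply to the outcome effects, and the block-correlated errors are produced with a shared-factor construction that only covers between-block correlations no larger than either within-block correlation.

```python
import math
import random


def simulate_multiple_outcomes(n=400, d=40, n_signal=4, maf=0.3,
                               rho1=0.3, rho2=0.3, rho3=0.1, seed=0):
    """Simulate one SNP, two covariates, and d block-correlated outcomes."""
    if not 0.0 <= rho3 <= min(rho1, rho2) < 1.0:
        raise ValueError("factor construction needs 0 <= rho3 <= rho1, rho2 < 1")
    rng = random.Random(seed)
    # Only the first n_signal outcomes are affected by the SNP.
    beta = [rng.uniform(0.0, 0.4) for _ in range(n_signal)]
    beta += [0.0] * (d - n_signal)
    subjects = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)                      # standard normal covariate
        x2 = 1.0 if rng.random() < 0.5 else 0.0       # Bernoulli(0.5) covariate
        g = sum(rng.random() < maf for _ in range(2))  # minor allele count
        common = rng.gauss(0.0, 1.0)                  # factor shared by all outcomes
        block = (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))  # one factor per block
        y = []
        for j in range(d):
            k = 0 if j < n_signal else 1
            rho = rho1 if k == 0 else rho2
            # Unit-variance error: within-block correlation rho,
            # between-block correlation rho3.
            err = (math.sqrt(rho3) * common
                   + math.sqrt(rho - rho3) * block[k]
                   + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0))
            # Covariate effects are fixed at 1, as stated in the text.
            y.append(x1 + x2 + beta[j] * g + err)
        subjects.append((x1, x2, g, y))
    return beta, subjects


beta, subjects = simulate_multiple_outcomes(seed=1)
print(len(subjects), len(subjects[0][3]), [round(b, 3) for b in beta[:4]])
```

A general positive definite block structure would instead require a Cholesky or eigendecomposition of the full correlation matrix; the factor construction is used here only to keep the sketch free of external dependencies.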

Figures 4(a) and (b) present smoothed power under the same setting as Figures 3(a) and (b), with Inline graphic and Inline graphic varying. It was demonstrated in Section 5 that Inline graphic increasing to larger values would decrease the eigenvalues and sharply increase signal weights in the multiple outcomes setting, with power bounds analysis predicting power close to 1 when Inline graphic is larger. Indeed, we do generally see such behavior in Figure 4(b) (see Appendix E of Supplementary material available at Biostatistics online). In the multiple explanatory factors setting of Figure 4(a), the empirical powers mostly mirror the trends of the lower bounds, with GBJ performing slightly better. Figures 4(c) and (d) present smoothed power under the same exchangeable correlation setting as Figures 3(c) and (d). From Section 5, we again know that generalized tests should possess an advantage in the multiple explanatory factors setting, but innovated tests should outperform by even more as correlation increases in the multiple outcomes setting. Both trends are observed as expected.

Fig. 4.

Power simulations with set-based tests under the same settings as Figure 3. In the multiple explanatory factors setting (left), four causal SNPs have an exchangeable correlation structure with Inline graphic, and the noncausal SNPs are similarly correlated with Inline graphic. In the multiple outcomes setting (right), the SNP has an effect on only four outcomes, and the associated and unassociated outcomes form two blocks with a correlation structure that is equivalent to the multiple explanatory factors correlation structure. VC refers to the variance component SKAT test. We conduct 500 simulations at each 0.01 increment of Inline graphic and smooth the empirical power curve.

By comparison, SKAT demonstrates excellent power in high correlation settings when testing both multiple explanatory factors and multiple outcomes, but it can also show poor performance in sparse, low correlation situations such as part of Figure 4(a). The underperformance in certain sparse settings is a well-known weakness of SKAT. The ACAT acts almost exactly like the GHC in all tested settings. Additional simulations under a variety of different models are reported in Appendices F and G of the Supplementary material available at Biostatistics online.

7. GTEx data analysis

Many studies investigating the genetic etiology of lung cancer have been conducted across diverse cohorts of subjects, identifying and validating dozens of risk loci. However, the precise biological mechanisms linking risk loci and disease remain unclear, and it is still difficult to pinpoint exact causal variants (Bossé and Amos, 2018). There are various ongoing efforts to characterize loci with possible translational impact, and identifying regions of the genome with effects on gene expression is a popular tactic.

One of the largest GWAS of lung cancer to date included over 80 000 subjects and was conducted by the International Lung Cancer Consortium (ILCCO) (McKay and others, 2017). This study identified 18 risk loci of primary interest. All SNPs at a risk locus were tested for association with the lung expression value of a likely risk gene at that locus (one risk gene per locus, 18 in all). Significant associations helped suggest eQTL disease mechanisms for genes such as RNASET2 and NRG1. The article additionally reported the results of testing the single most significant SNP at each locus against the lung expression of all genes in the vicinity of a locus, to identify possible mechanistic pathways for the most significant variants.

While lung eQTL results are very useful, the sample size was relatively small compared to the GWAS size, and expression studies in other relevant tissues may help provide a more complete understanding of risk mechanisms. Thus, we use set-based testing to complement previous results by re-performing the same eQTL study in blood data. Specifically, we (i) test sets of risk SNPs at a single locus against the blood expression of a single risk gene, aggregating signals across SNPs at the locus and (ii) test sets of blood expressions for all genes at a risk locus against individual risk SNPs, aggregating signals across gene expression values at a locus.

There are 384 subjects with complete genotyping, covariate, and blood expression data. We perform eQTL testing using the same models reported in previous GTEx analysis, inverse-normal transforming expression values and then fitting linear regression models with covariates for sex, genotyping platform, genetic PCs, and PEER factors (Battle and others, 2017). When testing sets of SNPs against one gene expression in analysis (i), we use all ILCCO-identified risk variants associated at Inline graphic in a 250 kb window centered at the ILCCO-identified risk gene. When testing sets of gene expression values against one variant in (ii), we use all genes in a 2 Mb window centered around the ILCCO-identified sentinel SNP. These window sizes are used to create sets of approximately equal size in analysis. Combining all SNPs (dots) in Figure 1(a) results in one test for (i) and combining all genes (lines) in Figure 1(b) results in one test for (ii).
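The inverse-normal transformation applied to expression values before regression can be sketched as below. The exact variant used in the GTEx pipeline is not stated here, so the rank offset of 0.5 and the function name are assumptions for illustration; ties are broken by input order rather than averaged.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=0.5):
    """Rank-based inverse-normal transform of a list of numbers.

    Each value is replaced by the standard normal quantile of
    (rank - offset) / n, where rank runs from 1 (smallest) to n.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()
    transformed = [0.0] * n
    for rank, i in enumerate(order, start=1):
        transformed[i] = standard_normal.inv_cdf((rank - offset) / n)
    return transformed


# Hypothetical skewed expression values for five subjects.
expression = [12.4, 0.3, 5.1, 48.0, 2.2]
z = inverse_normal_transform(expression)
print([round(v, 3) for v in z])
```

The transformed values keep the ordering of the raw expression values but are symmetric about zero, so a linear model on the transformed scale is less sensitive to skewness and outliers in the raw measurements.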

The five genes (out of 18) that pass a Bonferroni corrected level of Inline graphic in the multiple explanatory factor analysis (i) are shown in Table 1. Thus, there is evidence that lung cancer-associated variants near RNASET2, NRG1, RAD52, CHRNA2, and CHRNA5 regulate the expression levels of these genes in blood. All five findings further bolster previous claims of translational relevance. Although these results are only generated from application to a single data set, the generalized tests do often provide stronger evidence of association. Additional analysis details are available in Appendices G–J of the Supplementary material available at Biostatistics online.

Table 1.

Top five associations between (i) blood gene expression of one ILCCO-identified risk gene and a set of lung cancer risk SNPs near that gene and (ii) a single ILCCO-identified sentinel variant and a set of nearby blood gene expression values. Some p-values are too low for the default numerical precision in R. If the sentinel variant is unavailable in GTEx, the nearest significantly associated variant is used.

  GBJ GHC iBJ iHC ACAT SKAT
(i) Multiple explanatory factors locus
   RNASET2 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
    NRG1 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
    RAD52 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
    CHRNA2 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
    CHRNA5 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
(ii) Multiple outcomes locus
   RNASET2 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
    AMICA1 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
    HCP5 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
    CHRNA5 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic
   RAD52 Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic Inline graphic

The five single variants (out of 18) that pass a Bonferroni correction in the multiple outcomes analysis (ii) are also shown in Table 1. There is evidence that the sentinel SNPs at RNASET2, AMICA1, HCP5, CHRNA5, and RAD52 regulate the expression levels of nearby genes in blood. By chance, the expression values at the ILCCO-chosen loci generally demonstrate very low correlation, with all but one correlation structure possessing a median correlation value of less than 0.1. In such situations, we have seen that the innovated tests are not necessarily preferred, and indeed the results are mixed. Intuitively, when there is little correlation, the method for handling this correlation is not likely to be a large factor in the strength of the observed association. However, the aforementioned analysis allows us to understand why such results are expected and offers an approach for choosing tests when performance differences are more likely to be large.

8. Discussion

With the increasing popularity of large genetic compendiums, there continue to be more opportunities and interest in both multiple explanatory factors and multiple outcomes set-based testing. We have analyzed finite sample properties of detection boundary tests in such settings and demonstrated how the correlation structure of the set and type of multiplicity should influence the choice of test statistics. When interest lies in specific situations, we have shown how to calculate lower and upper bounds for the straightforward design of genetic association studies. Software to perform bounds calculations is publicly available in the DBpower R package.

In the multiple explanatory factors setting, generalized tests possess moderate advantages in the largest signal weights, and these advantages often increase as correlation rises, although the increase is bounded. However, in the multiple outcomes setting, the weighting advantages of innovated tests are often even more dramatic as correlation increases, and increased densities of signals provide even more power compared to generalized tests. This manuscript has focused on continuous outcomes because establishing the modeling framework and interpreting test performance is more straightforward in such a setting. However, results pertaining to the operating characteristics and power bounds of various tests rely only on the distribution of Inline graphic and can apply, for example, to the case where Inline graphic are calculated using binary outcomes as well.

We note that the power bounds can sometimes be very wide as in parts of Figure 3; however, the bounds generally become tighter with increasing correlation. The medium and high correlation setting is actually of most importance, while the low correlation setting is less interesting. When there is no correlation, the generalized and innovated approaches are exactly the same, and similarly, in low correlation settings the two approaches show smaller differences in performance. In the more interesting high correlation setting, the bounds may significantly increase power by helping to choose a better test. When applied genome wide, the bounds will be more informative for certain sets, but if additional precision is required in any given situation, the inequalities can be made tighter at the cost of more computational burden, as we have shown.

Simulation results confirm the analytical bounds calculations work well and further demonstrate that tests known to reach the rare-weak detection boundary generally perform well over various sparse testing situations. In very sparse signal situations, the variance component test can underperform, while the ACAT performs almost identically to GHC and is well-suited for very sparse settings. The GHC and GBJ both perform well when generalized tests are advantaged, and the same is true of iHC and iBJ when innovated tests are advantaged. It remains a reasonable choice to use set-based tests that can reach the detection boundary in genetic association studies.

Our analysis of the lung cancer data finds that the disease-associated SNPs around five genes demonstrate evidence of regulating those gene expression levels in blood. Coupled with previous lung eQTL analysis, these results suggest that there are multiple possible risk mechanisms that should be investigated at each locus. In multiple outcomes analysis, individual sentinel SNPs at RNASET2, AMICA1, HCP5, CHRNA5, and RAD52 appear to regulate the blood expression values of genes around these risk loci, demonstrating their functional behavior. Such results help verify that these SNPs should be followed up in further translational studies.

It would be interesting in future research to extend other types of detection boundary tests (Jager and Wellner, 2007) to account for within-set correlation in the presence of sparse signals. Such tests may demonstrate different performance than the GHC and GBJ. It is also of interest to investigate whether power gains can be achieved by removing some portion of variables with low weights and only testing a subset of highly upweighted variables.

Supplementary Material

kxac036_Supplementary_Data

Acknowledgments

The authors are grateful to two reviewers and the associate editor for their valuable comments which greatly improved the manuscript.

Conflict of Interest: None declared.

Contributor Information

Ryan Sun, Department of Biostatistics, University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030, USA.

Andy Shi, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA 02215, USA.

Xihong Lin, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA 02215, USA.

Software

An R package implementing the methods described in this article, along with documentation and examples, is available at https://cran.r-project.org/web/packages/DBpower/index.html.

Supplementary material

Supplementary material is available at http://biostatistics.oxfordjournals.org.

Funding

National Institutes of Health (NIH) (R03-DE029238).

References

  1. Barnett, I., Mukherjee, R. and Lin, X. (2017). The generalized higher criticism for testing SNP-set effects in genetic association studies. Journal of the American Statistical Association 112, 64–76.
  2. Battle, A., Brown, C. D., Engelhardt, B. E. and Montgomery, S. B. (2017). Genetic effects on gene expression across human tissues. Nature 550, 204–213.
  3. Berk, R. H. and Jones, D. H. (1979). Goodness-of-fit test statistics that dominate the Kolmogorov statistics. Probability Theory and Related Fields 47, 47–59.
  4. Bossé, Y. and Amos, C. I. (2018). A decade of GWAS results in lung cancer. Cancer Epidemiology and Prevention Biomarkers 27, 363–379.
  5. Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., O’Connell, J. and others. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209.
  6. Chen, S. X., Li, J. and Zhong, P.-S. (2019). Two-sample and ANOVA tests for high dimensional means. The Annals of Statistics 47, 1443–1474.
  7. Denny, J. C., Ritchie, M. D., Basford, M. A., Pulley, J. M., Bastarache, L., Brown-Gentry, K., Wang, D., Masys, D. R., Roden, D. M. and Crawford, D. C. (2010). PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 26, 1205–1210.
  8. Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32, 962–994.
  9. Gaynor, S. M., Sun, R., Lin, X. and Quackenbush, J. (2019). Identification of differentially expressed gene sets using the generalized Berk–Jones statistic. Bioinformatics 35, 4568–4576.
  10. GTEx Consortium and others. (2020). The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330.
  11. Hall, P. and Jin, J. (2010). Innovated higher criticism for detecting sparse signals in correlated noise. The Annals of Statistics 38, 1686–1732.
  12. Harvey, P. D., Sun, N., Bigdeli, T. B., Fanous, A. H., Aslan, M., Malhotra, A. K., Lu, Q., Hu, Y., Li, B., Chen, Q. and others. (2020). Genome-wide association study of cognitive performance in US veterans with schizophrenia or bipolar disorder. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 183, 181–194.
  13. Hu, Y., Li, M., Lu, Q., Weng, H., Wang, J., Zekavat, S. M., Yu, Z., Li, B., Gu, J., Muchnik, S. and others. (2019). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature Genetics 51, 568–576.
  14. Jager, L. and Wellner, J. A. (2007). Goodness-of-fit tests via phi-divergences. The Annals of Statistics 35, 2018–2053.
  15. Lee, S., Abecasis, G. R., Boehnke, M. and Lin, X. (2014). Rare-variant association analysis: study designs and statistical tests. The American Journal of Human Genetics 95, 5–23.
  16. Liu, Y., Chen, S., Li, Z., Morrison, A. C., Boerwinkle, E. and Lin, X. (2019). ACAT: a fast and powerful p value combination method for rare-variant analysis in sequencing studies. The American Journal of Human Genetics 104, 410–421.
  17. Liu, Y., Xia, J., McKay, J., Tsavachidis, S., Xiao, X., Spitz, M. R., Cheng, C., Byun, J., Hong, W., Li, Y. and others. (2021). Rare deleterious germline variants and risk of lung cancer. NPJ Precision Oncology 5, 1–12.
  18. Liu, Z., Barnett, I. and Lin, X. (2020). A comparison of principal component methods between multiple phenotype regression and multiple SNP regression in genetic association studies. The Annals of Applied Statistics 14, 433–451.
  19. Liu, Z. and Lin, X. (2019). A geometric perspective on the power of principal component association tests in multiple phenotype studies. Journal of the American Statistical Association 114, 975–990.
  20. McKay, J. D., Hung, R. J., Han, Y., Zong, X., Carreras-Torres, R., Christiani, D. C., Caporaso, N. E., Johansson, M., Xiao, X., Li, Y. and others. (2017). Large-scale association analysis identifies new lung cancer susceptibility loci and heterogeneity in genetic susceptibility across histological subtypes. Nature Genetics 49, 1126–1132.
  21. Moscovich-Eiger, A., Nadler, B. and Spiegelman, C. (2016). On the exact Berk-Jones statistics and their p-value calculation. Electronic Journal of Statistics 10, 2329–2354.
  22. Sun, R. and Lin, X. (2020). Genetic variant set-based tests using the generalized Berk–Jones statistic with application to a genome-wide association study of breast cancer. Journal of the American Statistical Association 115, 1079–1091.
  23. Wu, M. C., Lee, S., Cai, T., Li, Y., Boehnke, M. and Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics 89, 82–93.

Articles from Biostatistics (Oxford, England) are provided here courtesy of Oxford University Press
