Abstract
We consider problems where many, somewhat redundant, hypotheses are tested and we are interested in reporting the most precise rejections, with false discovery rate (FDR) control. This is the case, for example, when researchers are interested both in individual hypotheses as well as group hypotheses corresponding to intersections of sets of the original hypotheses, at several resolution levels. A concrete application is in genome-wide association studies, where, depending on the signal strengths, it might be possible to resolve the influence of individual genetic variants on a phenotype with greater or lower precision. To adapt to the unknown signal strength, analyses are conducted at multiple resolutions and researchers are most interested in the more precise discoveries. Assuring FDR control on the reported findings with these adaptive searches is, however, often impossible. To design a multiple comparison procedure that allows for an adaptive choice of resolution with FDR control, we leverage e-values and linear programming. We adapt this approach to problems where knockoffs and group knockoffs have been successfully applied to test conditional independence hypotheses. We demonstrate its efficacy by analysing data from the UK Biobank.
Keywords: false discovery rate, grouped hypotheses, genome-wide association studies, knockoffs, multiple resolutions, multiple testing
1. Introduction
Finding, among many features, those that contain information on an outcome of interest is an important statistical problem which can be described in terms of testing multiple conditional independence hypotheses. For example, genome-wide association studies (GWAS) are devoted to identifying single-nucleotide polymorphisms (SNPs) whose alleles influence a medically relevant phenotype Y. A discovered SNP is truly interesting when it provides information on Y in addition to that available in the rest of the genome. For each polymorphism j in the study, a conditional independence hypothesis can be written as
$H_j : Y \perp\!\!\!\perp X_j \mid X_{-j}, \quad (1)$
with $X = (X_1, \ldots, X_p)$ denoting the SNPs and where $X_{-j}$ denotes all SNPs except SNP j. The process with which we inherit genetic material from our parents is such that, within a human population, the random variables describing alleles at SNPs located in proximity of each other along a chromosome are dependent. This local dependency, known as linkage disequilibrium, can make it difficult to reject hypotheses of the type (1), as the variation in each SNP is typically well recapitulated by its neighbours. To avoid power loss, it is convenient to test conditional hypotheses corresponding to groups of highly correlated SNPs. This is, for example, the solution adopted in Sesia et al. (2020), where hierarchical clustering is used to define multiple groupings of SNPs, corresponding to different correlation thresholds. Carrying out conditional testing with false discovery rate (FDR) control for each of these partitions of the genome results in discoveries at multiple resolutions, as illustrated in the top panel of Figure 1.
Figure 1.
Examples of the results of the analysis of height in the UK Biobank, following the procedure in Sesia et al. (2020). Genomic position is on the x-axis (this is a small portion of chromosome 4). Different heights on the y-axis correspond to different resolutions. Top panel: Each filled grey box corresponds to a group of SNPs for which the hypothesis of independence from height given the rest of the genome has been rejected, carrying out an analysis that controls FDR at each resolution (Sesia et al., 2020). Lower panel: the most specific discoveries among those in the top panel are highlighted as filled grey boxes, while the others (logically implied) are empty rectangles.
Scientists, presented with these results, are interested in focusing on the most precise discoveries, highlighted in the bottom panel of Figure 1. However, such post-hoc selection of nonredundant findings breaks FDR guarantees provided by the original algorithm (Katsevich et al., 2023). This limitation of FDR procedures is well known (Goeman & Solari, 2011) and is one reason that makes alternative approaches more attractive, despite possible loss of power (for example, in the context of genetic analysis; see the contribution by Renaux et al. (2020) and related discussion papers). Another line of work for controlling false discoveries by providing upper confidence bounds on the number of false discoveries has been developed by Genovese and Wasserman (2004); Goeman et al. (2019); Goeman and Solari (2011) and extended to e-values by Vovk and Wang (2023).
GWAS are but one example of what one might call ‘resolution adaptive’ testing: problems where scientists are aiming to simultaneously discover a signal and localize it as precisely as possible (Spector & Janson, 2024). Localization might refer to a physical position in the genome or in space (Rosenblatt et al., 2018), but it can also more broadly be construed as precision of the finding: in Cortes et al. (2017), for example, localizing a signal corresponds to identifying the nodes farthest from the root in a tree describing a set of related diseases. There is typically a tension between the precision of a discovery and the ability to detect the underlying signal: scientists are interested in finding the highest resolution at which interesting rejections can be made. When the space of hypotheses explored is very large, it is reasonable to expect that this ‘optimal’ resolution might vary across the existing signals. At the same time, the adaptive selection of resolution is complicated by the need to account for multiplicity, providing some global error guarantees.
The literature documents multiple approaches to tackling this problem. Without attempting a comprehensive review, it is worth pointing out that they vary with respect to the target error rate (e.g. FWER or FDR) and the assumptions they make on the relation between hypotheses at different levels of resolution. For example, Meijer and Goeman (2015) develop a sequential multiple testing method to control the FWER for hypotheses that can be described by a directed acyclic graph and they are able to increase power by leveraging logical relations between hypotheses. A similar setting is considered in Ramdas et al. (2019a), which targets FDR control. In this work, we consider settings where no restriction is placed on the hypotheses at different resolutions (for example, we do not require partitions to be nested). This allows us to model situations where the different resolutions capture different, possibly quite unrelated, procedures to query the data.
Spector and Janson (2024) recently studied this problem. Following Mandozzi and Bühlmann (2016), they provide a clear formalization of the tension between localization and discovery and give an elegant and efficient solution in a Bayesian framework (BLiP). While guarantees on the Bayesian FDR, available when posterior probabilities for each of the tested hypotheses can be computed, are a useful step forward, they are not satisfactory for a large portion of the scientific community, who find the frequentist approach to inference an easier platform for agreement.
Working in a frequentist framework, Katsevich et al. (2023) considered a larger class of multiple testing problems where researchers might be interested in refining the set of discoveries to eliminate redundancy, broadly defined. They introduce a multiple testing procedure (Focused BH) that generalizes the Benjamini and Hochberg (BH) procedure (Benjamini & Hochberg, 1995), and controls the FDR when the p-values for all hypotheses considered are independent or have a special type of positive dependence called PRDS (Positive Regression Dependency on each one from a Subset) (Benjamini & Yekutieli, 2001). Given the complex dependence structure of tests built on overlapping/nested sets of predictors, such as the ones in GWAS, these distributional assumptions are limiting, and the extension of Focused BH described in Katsevich et al. (2023) to arbitrary filters and dependency structures is very conservative.
An example of special interest is the multi-resolution analysis of GWAS data with conditional hypotheses (Sesia et al., 2020) enabled by the knockoff framework (Barber & Candès, 2015; Candès et al., 2018) illustrated in Figure 1. Here the p-values are dependent and they are designed to be analysed with a special multiple comparison adjustment procedure that is not captured by Focused BH.
To address this gap, we introduce a resolution-adaptive, multiple comparison procedure that leads to frequentist FDR control without making distributional assumptions on the test statistics. A key ingredient of our approach is e-values, and their ‘easy’ calculus (Vovk & Wang, 2021; Wang & Ramdas, 2022). Working with e-values, we can define a linear program related to the one in Spector and Janson (2024) that leads to frequentist FDR control.
These results are concretely interesting because we can describe powerful e-values for resolution-adaptive variable selection. Ren and Barber (2024) have recently shown how the knockoff filter can be re-interpreted as a BH-type procedure (Wang & Ramdas, 2022) on specially constructed e-values. Building on their work, we introduce KeLP (Knockoff e-value Linear Program), with which we can analyse genome-scale data in a powerful and time-efficient manner.
The rest of the paper is organized as follows. We formally introduce the problem in Section 2, describe a solution using e-values in Section 3, and operationalize it for testing conditional hypotheses with knockoffs in Section 4. We describe additional applications of KeLP in Section 5. Section 6 is devoted to simulations and Section 7 presents the results of applying our methods to the UK Biobank data.
2. Problem statement and notation
We consider problems where investigators are interested in evaluating p null hypotheses as well as ‘group’ hypotheses, which correspond to intersections of the original ones. An individual hypothesis can contribute to the definition of multiple group hypotheses, corresponding to different levels of resolution. For example, functional magnetic resonance imaging (fMRI) studies measure millions of voxels at different time intervals. While the most precise hypotheses that can be investigated describe the behaviour of one voxel at one time point, scientists are also interested in aggregating the signal to correspond to larger structures and time frames.
Formally, let $M$ denote a set of resolutions and let $\mathcal{P}_m$ denote a partition of $[p] = \{1, \ldots, p\}$ into disjoint sets at resolution $m \in M$. With $A \in \mathcal{P}_m$ we indicate the elements of the partition $\mathcal{P}_m$, which we refer to as groups.
Each group $A$ defines a hypothesis
$H_A = \bigcap_{j \in A} H_j. \quad (2)$
To avoid confusion, we use $\mathcal{P}_1$ to indicate the partition of $[p]$ into p sets of size 1, so that $A = \{j\}$ for each $A \in \mathcal{P}_1$. Referring to the index m as a resolution, we underscore how the various partitions have a different degree of coarseness; however, we do not require that the partitions are nested, nor that there is a natural ordering among them. Let $\mathcal{H}_m$ denote the set of resolution-specific hypotheses for $m \in M$ and $\mathcal{H}_m^0$ the corresponding set of true nulls. Combining several levels of resolution, let $\mathcal{H} = \bigcup_{m \in M} \mathcal{H}_m$ denote the combined set of hypotheses across resolutions and $\mathcal{H}^0 = \bigcup_{m \in M} \mathcal{H}_m^0$ the combined set of true null hypotheses. We also let $\mathcal{G} = \bigcup_{m \in M} \mathcal{P}_m$ denote the set of groups across all levels of resolution. For simplicity of notation, we will let a group in $\mathcal{G}$ be denoted as $A$, with the understanding that these groups are of course resolution-specific. To each group $A$ corresponds a hypothesis $H_A$, for which we have a valid test (as described further in Sections 3 and 4, we will work with e-values). Figure 2 gives a schematic illustration of a multi-resolution family of hypotheses $\mathcal{H}$.
Figure 2.
Schematic representation of a multi-resolution family of hypotheses and a rejection set $\mathcal{R}$. Resolution 1 corresponds to individual level hypotheses; resolutions 2 and 3 correspond to two different partitions of the 16 hypotheses. False null hypotheses are indicated in red and capture the underlying signal at different resolutions. The hypotheses belonging to the rejection set are shaded in blue; the most specific rejections are highlighted with thick blue contours. The overall false discovery proportion (FDP) is 2/10. The FDP for resolution 1 is 1/3; for resolution 2 the FDP is 1/4, and for resolution 3, the FDP is 0. The FDP for the set of non-redundant and most precise discoveries is 2/5.
When the size of these families is large, it is necessary to control some measure of global error, and in many scientific settings the false discovery rate (Benjamini & Hochberg, 1995) is a preferred choice. Given the structure of the multi-resolution families, however, there are multiple ways of defining the FDR: one can consider the total collection of rejections across resolutions, and control the expected value of the false discovery proportion among them; or one can control the FDR within each resolution separately—possibly enforcing consistency across resolutions; or one might want to focus on the most precise discoveries made (the ones indicated with a blue thick contour in Figure 2) and control the FDR among them.
All of these choices are documented in the literature. For example, as anticipated, KnockoffZoom (Sesia et al., 2020) controls the FDR separately within each resolution; the p-filter (Barber & Ramdas, 2017) and the Multilayer Knockoff Filter (MKF) (Katsevich & Sabatti, 2019) result in coordinated discoveries across resolutions, with simultaneous FDR control within each resolution. The most precise discoveries have been described as ‘outer-nodes’ when the partitions corresponding to different resolutions are nested and the tested hypotheses can be organized in a hierarchical tree structure: Yekutieli (2008) proves the first result of FDR control in this framework, under some restrictive assumptions. Mandozzi and Bühlmann (2016) tackle the problem from the point of view of Family Wise Error Rate (FWER). Both BLiP (Spector & Janson, 2024) and Focused BH (Katsevich et al., 2023) consider approaches to identify non-redundant and precise rejection sets and develop methodology that leads to FDR control in some settings. This is also our goal.
Specifically, let $\mathcal{S}$ be a multiple testing procedure applied to $\mathcal{H}$ that outputs a rejection set $\mathcal{R} \subseteq \mathcal{H}$. Let $\mathcal{A}(\mathcal{R}) = \{A \in \mathcal{G} : H_A \in \mathcal{R}\}$. We define the rejections to be non-redundant if $\mathcal{A}(\mathcal{R})$ contains only disjoint sets. As in Mandozzi and Bühlmann (2016), we are interested in ‘minimal’ non-redundant discoveries, that is, those corresponding to the groups with the smallest cardinality; at the same time, as in Spector and Janson (2024), we do not want to restrict ourselves to nested partitions. To capture the value of rejecting a particular hypothesis $H_A$, we use weights $w(A)$, as in Spector and Janson (2024); in particular, we often focus on the original suggestion in Mandozzi and Bühlmann (2016) with $w(A) = 1/|A|$, where $|A|$ is the cardinality of A. The following formalizes our goal of finding a multiple testing procedure which maximizes the resolution-adjusted rejection count, controlling the FDR and leading to non-redundant findings.
Goal Resolution-adaptive discovery with FDR control —
Given a multi-resolution family of hypotheses $\mathcal{H}$ as described above, corresponding to a collection of groups $\mathcal{G}$; a fixed weighting function $w : \mathcal{G} \to [0, \infty)$, capturing the value of discoveries at different resolutions; and a level α of desired FDR, we seek a multiple testing procedure which solves the following constrained optimization problem:
$\text{maximize} \quad \sum_{A \in \mathcal{A}(\mathcal{R})} w(A) \quad (3)$
$\text{subject to} \quad \mathrm{FDR} = \mathbb{E}\left[ \frac{|\mathcal{R} \cap \mathcal{H}^0|}{|\mathcal{R}| \vee 1} \right] \le \alpha \quad (4)$
$\text{and the sets in } \mathcal{A}(\mathcal{R}) \text{ are pairwise disjoint.} \quad (5)$
Before introducing a procedure that solves this problem, we pause to make a few remarks. First, in contrast to Spector and Janson (2024), who use posterior probabilities to maximize a notion of expected power, our objective (3) is to maximize the number of (weighted) rejections: we want the largest number of valuable discoveries, while controlling FDR.
Second, while we focus on situations where the value of discoveries is described by weights decreasing in $|A|$, in different contexts it might be useful to consider other evaluations: the procedures we will describe adapt to any choice, as long as the weights are fixed.
The constraint (5) expresses our interest in obtaining non-redundant rejections. In Section 5.1 and Appendix D of the online supplementary material we will discuss, instead, procedures that lead to discoveries at multiple resolutions in a coordinated fashion.
We also want to underscore how the notion of resolution-adaptive discovery is relevant also for families of hypotheses that are not associated with spatial localization of a signal. For example, as in Cortes et al. (2017), low-resolution hypotheses might correspond to groups of traits, and the precision of discoveries increases as more specific statements can be made about which traits are involved. We will consider related problems in Sections 5.3 and 6.3.
Finally, note that it is also possible to generalize the above definition of $\mathcal{P}_1$ to sets containing multiple ‘individual’ hypotheses, as long as these are always tested together.
3. eLP: an e-value based linear program for controlled, resolution-adaptive discoveries
We now describe a multiple testing procedure that achieves the goal in (3)–(5). The procedure evaluates the evidence in favour of the hypotheses in $\mathcal{H}$ by using e-values. The term e-value has been introduced recently and encompasses multiple related alternative approaches to testing, including betting scores and likelihood ratios (Grünwald et al., 2024; Shafer, 2021; Vovk & Wang, 2021). Like p-values, e-values are defined by their properties under the null hypothesis; however, while p-values are defined in terms of probabilities, e-values are defined in terms of expectations. Formally, a p-value P is a random variable that satisfies $\Pr(P \le t) \le t$ (often with equality) for all $t \in [0, 1]$. In other words, a p-variable is stochastically larger than a uniform random variable under the null. An e-value E is a $[0, \infty]$-valued random variable satisfying $\mathbb{E}[E] \le 1$. We can conceptualize it as the stochastic return of a bet against the null hypothesis: on average, if the null is true, it multiplies the money risked by at most 1 (no gain), and large gains are unlikely (by Markov’s inequality, a gain larger than a constant c happens with probability at most $1/c$). A large realized e-value, a substantial win in the bet, indicates that the null is unlikely to be true, so that the e-value can be interpreted as the amount of evidence collected against the null. Indeed, it has been argued that, for the general public, a summary of evidence given in terms of return on a bet might be easier to interpret than the p-value (Shafer, 2021).
Aside from this possible increased efficacy in communication, the fact that e-values are defined in terms of expectation translates into a number of advantages from the point of view of designing valid (multiple) testing strategies (Wang & Ramdas, 2022). For example, in contrast to p-values, handling dependent e-values is quite straightforward due to the linearity of expectation. Crucially for our purposes, to establish that a multiple testing procedure controls FDR when it is based on e-values with any dependence structure, it is sufficient to verify that it is self-consistent. This is a purely algorithmic property, originally introduced in Blanchard and Roquain (2008) with reference to p-values. Wang and Ramdas (2022) translate it to the e-value context. Let $E_1, \ldots, E_K$ be e-values associated with hypotheses $H_1, \ldots, H_K$. They define an e-testing procedure to be self-consistent at level α if, for every rejected hypothesis $H_j$, the corresponding e-value satisfies
$E_j \ge \frac{K}{\alpha R}, \quad (6)$
where $K$ denotes the total number of hypotheses and $R$ denotes the number of rejections. Wang and Ramdas (2022) show that self-consistency is a sufficient condition for FDR control, with arbitrary configurations of e-values. This is in contrast to what happens for p-values, where an additional dependency control condition is needed (Blanchard & Roquain, 2008). Note that the numerator of equation (6) depends on the number of hypotheses tested, which can make relying on self-consistency for frequentist FDR control conservative.
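The self-consistency condition (6) underlies the e-BH procedure of Wang and Ramdas (2022): reject the k hypotheses with the largest e-values, for the largest k at which the k-th largest e-value still clears the threshold. A minimal sketch in Python (the function name and plain-list interface are our own illustration, not from any package):

```python
def ebh(evalues, alpha):
    """e-BH: reject the k hypotheses with the largest e-values, where k is
    the largest integer such that the k-th largest e-value is >= K/(alpha*k).
    Every rejected e-value then satisfies the self-consistency bound (6)."""
    K = len(evalues)
    order = sorted(range(K), key=lambda j: evalues[j], reverse=True)
    k_star = 0
    for k in range(1, K + 1):
        if evalues[order[k - 1]] >= K / (alpha * k):
            k_star = k
    return set(order[:k_star])
```

Since every rejected e-value is at least the k*-th largest, which is at least K/(α k*), all rejections satisfy (6) with R = k*.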
Similarly to Spector and Janson (2024), we can then define a linear program that achieves the goal in (3)–(5).
Procedure 1 eLP, e-value linear programming —
Let $\mathcal{H}$ be a multi-resolution family of hypotheses, corresponding to a collection of groups $\mathcal{G}$, defined starting from p individual hypotheses. Let each hypothesis $H_A$ be associated with an e-value $E_A$ and a weight $w(A)$ for $A \in \mathcal{G}$. Let $\alpha$ be the desired level of FDR control and let $x_A \in \{0, 1\}$ be an indicator of whether the hypothesis $H_A$ is rejected. The rejection set of eLP is identified by solving the following constrained optimization:
$\text{maximize} \quad \sum_{A \in \mathcal{G}} w(A)\, x_A \quad (7a)$
$\text{subject to} \quad x_A \in \{0, 1\} \quad \text{for all } A \in \mathcal{G} \quad (7b)$
$\frac{K}{\alpha}\, x_A \le E_A \sum_{B \in \mathcal{G}} x_B \quad \text{for all } A \in \mathcal{G} \quad (7c)$
$\sum_{A \in \mathcal{G} : j \in A} x_A \le 1 \quad \text{for all } j \in [p] \quad (7d)$
The first constraint (7b) simply characterizes $x_A$ as an indicator. The objective (7a) corresponds to our goal in (3). The linear constraint (7c), in which $K = |\mathcal{G}|$, ensures frequentist FDR control at level α. The sum $R = \sum_{B \in \mathcal{G}} x_B$ equals the number of rejections by the procedure. Then, if $x_A = 1$, (7c) requires that $E_A \ge K/(\alpha R)$, which is exactly the self-consistency condition in equation (6). If $x_A = 0$, (7c) is automatically satisfied, as its left-hand side is 0 while its right-hand side is nonnegative. Constraint (7d) ensures that the rejected regions are disjoint (5): each $j \in [p]$ can be rejected at most once, and not multiple times in different groups.
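To make the self-consistency and disjointness constraints concrete, here is a toy brute-force version of the eLP search, practical only for a handful of groups (the function and data layout are our own illustration; a real implementation would use an LP solver, as the authors do with CVXR):

```python
from itertools import combinations

def elp_bruteforce(groups, evalues, weights, alpha):
    """Exhaustive search over rejection sets for a tiny eLP instance:
    maximize total weight subject to self-consistency (every rejected
    e-value >= K/(alpha*R)) and pairwise-disjoint rejected groups.
    `groups` maps a group name to the set of individual indices it covers."""
    names = list(groups)
    K = len(names)  # total number of groups across resolutions
    best, best_weight = set(), 0.0
    for r in range(1, K + 1):
        for subset in combinations(names, r):
            # disjointness: no individual index covered twice
            members = [groups[a] for a in subset]
            if len(set().union(*members)) != sum(len(m) for m in members):
                continue
            # self-consistency at level alpha with R = r rejections
            if any(evalues[a] < K / (alpha * r) for a in subset):
                continue
            w = sum(weights[a] for a in subset)
            if w > best_weight:
                best, best_weight = set(subset), w
    return best
```

On a small instance, two disjoint singletons with moderate e-values can beat one group with a very large e-value: making more rejections lowers the self-consistency threshold K/(αR), and the singletons carry larger weights.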
Remark 1
To use eLP, researchers need to have available e-values for all the hypotheses. In case the testing procedure leads to p-values, it is possible to convert these to e-values using ‘calibration,’ albeit with possible power loss (see Vovk and Wang (2021)).
Remark 2
A consequence of the linearity of expectation is that an arithmetic mean of e-values of individual hypotheses belonging to a certain group is an e-value for their group-level global null hypothesis. In fact, the arithmetic mean essentially dominates any symmetric e-merging function, i.e. functions that take e-values of individual hypotheses as input and provide an e-value for the group-level global null as output (see Vovk and Wang (2021), who also discuss cross-merging functions, mapping several p-values into an e-value). Note that if group e-values are constructed with this strategy, eLP would only reject hypotheses at the finest level of resolution (the individual level). We describe how to get nontrivial group e-values from the knockoff filter in Section 4.
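As a one-line illustration of the remark (the function name is our own):

```python
def emerge_mean(evalues):
    """Arithmetic-mean e-merging: if each E_j has expectation <= 1 under its
    null, then by linearity of expectation the mean has expectation <= 1
    under the group-level global null, so it is again a valid e-value."""
    return sum(evalues) / len(evalues)
```

Since the mean never exceeds the largest input while the group weight $w(A) = 1/|A|$ is smaller than the individual weights, eLP would always prefer the individual-level rejection over a mean-merged group rejection, consistent with the remark above.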
Remark 3
For a single level of resolution, that is $|M| = 1$, eLP reduces to the e-BH procedure by Wang and Ramdas (2022).
Self-consistency can also be used to define other multiple comparison procedures based on e-values. For example, as mentioned in Wang and Ramdas (2022), one can construct the equivalent of Focused BH (Katsevich et al., 2023) for e-values. We do so precisely in Appendix B of the online supplementary material, obtaining a procedure we call Focused e-BH, which controls FDR for any filter and under any dependence structure.
Focused e-BH applied to a multi-resolution family of hypotheses with a filter that selects the most precise nonredundant rejections produces the same rejection set as eLP with appropriately chosen weights; see Appendix B of the online supplementary material for more details. This is interesting in the context of the comparison in Spector and Janson (2024) between Focused BH and BLiP, which showed larger power for BLiP. This equivalence relationship between eLP (which has many commonalities with BLiP) and Focused e-BH (which is a translation of Focused BH to e-values) underscores how, when considering exactly the same set of hypotheses, the observed difference in power might not be due to the linear programming formulation, but rather to the implicit estimate of the FDR underlying these two methods. Spector and Janson (2024) control the Bayesian FDR, which can be evaluated using posterior probabilities. Focused BH and Focused e-BH control the frequentist FDR. A (self-consistent) procedure controlling the frequentist FDR bounds the number of wrong rejections using the total number of hypotheses tested: this can lead to conservative behaviour when nonredundancy constraints limit the number of possible rejections to a value much smaller than the total count of tested hypotheses. To remedy this, Katsevich et al. (2023) introduced a permutation-based method to estimate the number of false rejections, which, however, comes with computational costs.
4. KeLP: eLP with knockoff e-values
Given e-values for a family of hypotheses $\mathcal{H}$, eLP guarantees FDR control while leading to resolution-adaptive discoveries. However, defining powerful e-values is a nontrivial task (Vovk & Wang, 2021). We now describe how to accomplish this for specific types of multi-resolution families of hypotheses, leveraging the knockoff framework (Barber & Candès, 2015; Candès et al., 2018).
For a set of p explanatory variables $X = (X_1, \ldots, X_p)$ and a response Y, the knockoff framework tests, with FDR control, the conditional independence hypotheses
$H_j : Y \perp\!\!\!\perp X_j \mid X_{-j}, \quad (8)$
and, given a partition $\mathcal{P}_m$, their group equivalents
$H_A : Y \perp\!\!\!\perp X_A \mid X_{-A}, \quad (9)$
where $X_A = \{X_j : j \in A\}$ and $X_{-A}$ denotes all features except those with $j \in A$ (Katsevich & Sabatti, 2019; Sesia et al., 2020).
The knockoff framework relies on two steps: (a) the construction of test statistics with some distributional properties under the null and (b) a multiple comparison procedure. The first step (a) is based on comparing the signal of each feature (or group of features) with that of a ‘knockoff copy’ of it, indistinguishable from the original in distribution, but independent from Y (Candès et al., 2018; Sesia et al., 2020). This is done with a score $W_j$ designed so that swapping the original feature with its knockoff flips the sign of $W_j$. For example, a popular choice (Candès et al., 2018) for the scores is the difference of the absolute values of the coefficients from a cross-validated Lasso of Y on the entire collection of original features and knockoffs: $W_j = |\hat{\beta}_j| - |\hat{\beta}_{j+p}|$, where $\hat{\beta}_j$ is the resulting coefficient of $X_j$ and $\hat{\beta}_{j+p}$ the coefficient of the knockoff $\tilde{X}_j$. A large value of $W_j$ suggests greater evidence against the null.
The multiple comparison procedure (b), the filter, is based on an estimate of the FDP that relies on the distribution of the knockoff scores corresponding to null hypotheses. Precisely, to control FDR at level γ, the knockoff filter (Barber & Candès, 2015) selects the set of features according to the following rule:
$\hat{S} = \{j : W_j \ge \tau\}, \quad \tau = \min\left\{ t > 0 : \frac{1 + \#\{j : W_j \le -t\}}{\#\{j : W_j \ge t\} \vee 1} \le \gamma \right\}. \quad (10)$
The martingale properties that guarantee that (10) controls FDR at level γ also guarantee that:
$\mathbb{E}\left[ \frac{\#\{j : H_j \text{ null and } W_j \ge \tau\}}{1 + \#\{j : W_j \le -\tau\}} \right] \le 1. \quad (11)$
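A minimal sketch of the selection rule (10), computing the threshold by scanning the observed score magnitudes (our own illustration, not the authors' code; `W` is assumed to be a plain list of knockoff scores):

```python
def knockoff_threshold(W, gamma):
    """Knockoff filter threshold: the smallest t > 0 among the observed |W_j|
    such that (1 + #{j: W_j <= -t}) / max(#{j: W_j >= t}, 1) <= gamma.
    Features with W_j >= t are then selected; returns None if no t qualifies."""
    for t in sorted({abs(w) for w in W if w != 0}):
        neg = sum(1 for w in W if w <= -t)
        pos = sum(1 for w in W if w >= t)
        if (1 + neg) / max(pos, 1) <= gamma:
            return t
    return None
```

The ratio being scanned is exactly the knockoff estimate of the FDP at threshold t: the count of negative scores stands in for the unknown count of false positives among the selected features.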
This expectation motivated the construction in Ren and Barber (2024), who show how the rejections of the knockoff filter are equivalent to those of e-BH applied to e-values defined as
$E_j = p \cdot \frac{\mathbf{1}\{W_j \ge \tau\}}{1 + \#\{k : W_k \le -\tau\}}.$
Note that these are not strictly e-values: it is not true that $\mathbb{E}[E_j] \le 1$ for each null $j$, but only that $\sum_{j \text{ null}} \mathbb{E}[E_j] \le p$. This property follows from equation (11), as
$\sum_{j \text{ null}} \mathbb{E}[E_j] = p \cdot \mathbb{E}\left[ \frac{\#\{j : H_j \text{ null and } W_j \ge \tau\}}{1 + \#\{j : W_j \le -\tau\}} \right] \le p.$
Nevertheless, Ren and Barber (2024) show that this property of ‘relaxed’ e-values is sufficient to guarantee FDR control in the rejections defined with e-BH.
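Under the same conventions as above, the relaxed e-values of Ren and Barber (2024) can be sketched as follows; the inline threshold computation mirrors rule (10) at level γ (self-contained illustration, names our own):

```python
def knockoff_evalues(W, gamma):
    """Relaxed knockoff e-values of Ren & Barber (2024):
    E_j = p * 1{W_j >= tau} / (1 + #{k : W_k <= -tau}),
    with tau the knockoff threshold at level gamma. These e-values sum, in
    expectation over the nulls, to at most p, which suffices for e-BH/eLP."""
    p = len(W)
    tau = None
    for t in sorted({abs(w) for w in W if w != 0}):
        neg = sum(1 for w in W if w <= -t)
        pos = sum(1 for w in W if w >= t)
        if (1 + neg) / max(pos, 1) <= gamma:
            tau = t
            break
    if tau is None:
        return [0.0] * p  # no feasible threshold: all e-values are zero
    denom = 1 + sum(1 for w in W if w <= -tau)
    return [p * (1.0 if w >= tau else 0.0) / denom for w in W]
```

Note the all-or-nothing structure: every selected feature receives the same e-value, and all others receive zero, which is what makes the subsequent e-BH step reproduce the knockoff filter's rejections.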
Describing the knockoff filter in terms of e-BH applied to e-values is convenient because of the advantages of this framework in dealing with dependency. Ren and Barber (2024) show, for example, how dependent sets of e-values, derived from k runs of the knockoff procedure, can be combined to achieve a stable variable selection. This is also useful in our multi-resolution framework. To illustrate this, we start by generalizing the result by Wang and Ramdas (2022) to relaxed e-values.
Theorem 1
Suppose $E_1, \ldots, E_K$ are relaxed e-values for hypotheses $H_1, \ldots, H_K$, i.e. they satisfy $\sum_{j : H_j \text{ null}} \mathbb{E}[E_j] \le K$. Then any self-consistent e-testing procedure at level α taking $E_1, \ldots, E_K$ as input controls the FDR at level α.
The proof is almost identical to the proof in Wang and Ramdas (2022) and is included in Appendix A of the online supplementary material. Theorem 1 guarantees that we can use relaxed e-values as an input to eLP (Procedure 1). The following describes the construction of knockoff-based e-values for the hypotheses at each resolution and their use in a procedure we call KeLP. A matrix $\tilde{X}$ is said to be a (group) knockoff for partition $\mathcal{P}_m$ if for each group $A \in \mathcal{P}_m$ it holds that $(X, \tilde{X})_{\mathrm{swap}(A)} \stackrel{d}{=} (X, \tilde{X})$ and Y is independent of $\tilde{X}$ given X; see e.g. Sesia et al. (2021). Here $\mathrm{swap}(A)$ means that the jth coordinate of X is swapped with the corresponding coordinate of $\tilde{X}$ for all $j \in A$. The construction of (group) knockoff variables has been studied in prior research; see for example Sesia et al. (2021, 2019), Dai and Barber (2016), Katsevich and Sabatti (2019), Gimenez et al. (2019), Spector and Janson (2022) and Romano et al. (2020).
Procedure 2 KeLP, knockoff e-value linear programming —
Let $\mathcal{H}$ be a multi-resolution family of conditional independence hypotheses relative to variables $X_1, \ldots, X_p$. For each resolution $m \in M$, construct (group) knockoffs for X with respect to partition $\mathcal{P}_m$ and obtain for each hypothesis $H_A$, $A \in \mathcal{P}_m$, the knockoff score $W_A$. Let $c_1, \ldots, c_{|M|}$ be nonnegative numbers such that $\sum_{m \in M} c_m \le 1$. For each $A \in \mathcal{P}_m$, define knockoff-based e-values by
$E_A = K \cdot c_m \cdot \frac{\mathbf{1}\{W_A \ge \tau_m\}}{1 + \#\{B \in \mathcal{P}_m : W_B \le -\tau_m\}}, \quad K = |\mathcal{G}|, \quad (12)$
$\tau_m = \min\left\{ t > 0 : \frac{1 + \#\{B \in \mathcal{P}_m : W_B \le -t\}}{\#\{B \in \mathcal{P}_m : W_B \ge t\} \vee 1} \le \gamma_m \right\}. \quad (13)$
For a fixed weighting function $w$ and a target FDR level α, the rejection set of KeLP is given by the output of eLP (Procedure 1) taking these weights and the e-values (12) as input.
Due to the correspondence between the e-BH procedure by Wang and Ramdas (2022) and the traditional knockoff filter by Candès et al. (2018), as described by Ren and Barber (2024), if $|M| = 1$, KeLP is equivalent to the usual knockoff procedure.
There are two sets of parameters in KeLP: $\{c_m\}$ and $\{\gamma_m\}$. In (12), $c_m$ is a fixed multiplier that is chosen to guarantee the relaxed e-value property required by Theorem 1:
$\sum_{A \in \mathcal{H}^0} \mathbb{E}[E_A] \le \sum_{m \in M} K c_m \le K. \quad (14)$
The first inequality follows from the property of the importance statistics in equation (11), and the second is true by construction (note that it is possible that $K c_m \neq |\mathcal{P}_m|$ for some $m$, which relaxes the original definition of knockoff-based e-values by Ren and Barber (2024), since for our goal in (3)–(5) we are not going to apply e-BH to each resolution separately).
The parameter $\gamma_m$ used to obtain the stopping time (13) governs the trade-off between the number of non-zero e-values and their magnitude. For a larger $\gamma_m$, we might obtain a smaller $\tau_m$, which might result in a larger number of nonzero e-values. However, those nonzero e-values might be smaller in magnitude, as there might also be more groups with $W_B \le -\tau_m$.
The choice of $c_m$ and $\gamma_m$ is left to the user, and can be guided by tuning in a hold-out dataset. However, note that even in a GWAS setting a ‘large’ tuning set might not be required; see Appendix F of the online supplementary material for details. In this work, we set $c_m = 1/|M|$. This choice results in knockoff e-values whose magnitude depends mostly on signal strength, and not on the number of groups in a particular resolution; KeLP already includes the preference to reject hypotheses at finer levels of resolution. As for $\gamma_m$, in applications with moderate dimensions we recommend the choices suggested by Ren and Barber (2024), which depend on whether the setting is higher dimensional. We suggest level-specific choices of $\gamma_m$ for very high-dimensional applications with many levels of resolution and very sparse signals (e.g. genetic data): in this case, we recommend choosing the largest γ for the individual level, with decreasing values of γ as the resolutions become coarser. We leave the question of finding the optimal theoretical parameters for further research. We solve the linear program corresponding to KeLP using standard optimization software: CVXR (Fu et al., 2020).
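Putting Procedure 2 together on a toy example, the following sketch computes knockoff e-values at each resolution via (12)–(13), assuming the equal-multiplier choice $c_m = 1/|M|$ (the function name and input layout are our own illustration):

```python
def kelp_evalues(W_by_res, gammas):
    """Sketch of the KeLP e-value construction, assuming equal multipliers
    c_m = 1/M. W_by_res[m] holds the (group) knockoff scores at resolution m;
    gammas[m] is the per-resolution stopping parameter gamma_m. K is the
    total number of groups across all resolutions."""
    M = len(W_by_res)
    K = sum(len(W) for W in W_by_res)
    out = []
    for W, gamma in zip(W_by_res, gammas):
        # per-resolution knockoff threshold tau_m, as in (13)
        tau = None
        for t in sorted({abs(w) for w in W if w != 0}):
            if (1 + sum(w <= -t for w in W)) / max(sum(w >= t for w in W), 1) <= gamma:
                tau = t
                break
        if tau is None:
            out.append([0.0] * len(W))
            continue
        denom = 1 + sum(w <= -tau for w in W)
        # e-values as in (12) with K * c_m = K / M
        out.append([(K / M) * (w >= tau) / denom for w in W])
    return out
```

The resulting e-values, flattened across resolutions and paired with weights such as $w(A) = 1/|A|$, would then be passed to the eLP optimization of Procedure 1.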
5. Additional applications of knockoff e-values and KeLP
Before exploring the performance of KeLP with simulations and real-data analysis, we pause to illustrate other advantages of e-values and of re-casting knockoff analyses in the framework (12). We consider two cases in which the dependency between p-values/test statistics presents challenges for analysis, and underscore the generality of KeLP by showing how it can be leveraged to analyse multivariate structured outcomes.
5.1. Multilayer knockoff filter
As mentioned in Section 2, when working with hypotheses at multiple resolutions, different notions of global error are meaningful. The goal of this paper is to reject the most specific hypotheses possible, while controlling the FDR across the rejection set, which might span multiple levels of resolution.
Another goal, referred to as multilayer FDR control, is to coordinate rejections across resolutions and provide simultaneous FDR control in each resolution. Barber and Ramdas (2017) introduced multilayer FDR control and developed the p-filter, which attains it when the p-values for the individual hypotheses are PRDS and group hypotheses are tested using Simes’s combination rule (Simes, 1986).
The original p-filter by Barber and Ramdas (2017) has been extended to arbitrary dependencies between the p-values in Ramdas et al. (2019b) using reshaping, which makes the procedure conservative.
E-values can be used to develop an analogue of the p-filter, as mentioned by Wang and Ramdas (2022). We include a description of the e-filter in Appendix D of the online supplementary material.
Katsevich and Sabatti (2019) developed the Multilayer Knockoff Filter (MKF) to attain the same goal in the context of a multi-resolution analysis via knockoffs. The MKF controls the FDR at level , where α denotes the target FDR level. Using knockoff e-values, we describe the e-Multilayer Knockoff Filter (e-MKF) with FDR control at the target level α in Appendix D of the online supplementary material. We also show that, for the same theoretical level of FDR control, the e-MKF has higher power than the MKF.
5.2. Partial conjunction hypotheses
A multi-resolution family of hypotheses is one type of ‘structured’ collection of hypotheses. Another type that is often useful to consider is the one motivating partial conjunction testing (Benjamini & Heller, 2008). Here one can think about an array of hypotheses , where for each j, researchers are interested in discovering that at least u out of L tested hypotheses are false. One context where partial conjunctions are quite relevant is replication (Bogomolov & Heller, 2013, 2018). For example, the genetic underpinnings of a disorder might be studied in different human populations, and scientists might be interested in identifying those SNPs that carry a signal in at least u distinct groups. A related, but different, example in the context of GWAS arises when measurements on several phenotypes are available. Scientists might then be interested in testing whether a SNP (or group of SNPs) is conditionally independent of any (or at least a certain number) of the phenotypes.
Li et al. (2022) developed a knockoff filter to test partial conjunctions across distinct independent studies. This approach, however, cannot be applied to the investigation of multiple phenotypes collected on the same individuals: since the knockoff scores for the different phenotypes are based on the same genotype data, they are dependent. Using knockoff e-values we can overcome this limitation, as we describe in Appendix C of the online supplementary material. Testing these partial conjunction hypotheses can also be combined with testing across multiple levels of resolution. Indeed, in our application to the UK Biobank data in Section 7, we test a global null hypothesis for platelet-related outcomes.
5.3. Structured outcomes
In some cases, it might be possible to describe a hierarchy among the outcomes (Cortes et al., 2017). Let L denote the total number of outcomes. To fix ideas, consider the example in Figure 3. The binary tree has four leaves, two internal nodes and one root node. Each of the parent nodes is constructed as a union of its leaf descendants.
Figure 3.

Illustration of an outcome structure corresponding to a binary tree with seven nodes based on four leaves A, B, C, and D. Each parent node is constructed based on its children.
We are interested in testing conditional independence hypotheses between each of the p features and the different outcomes. Even without considering different groupings of the p features, there is a level of redundancy between the hypotheses. For example, rejecting the hypotheses of conditional independence between a feature and both the internal node A∪B and the leaf A is redundant, as it implicates the leaf-outcome A twice, and stating that the feature is not independent of A logically implies that it is not independent of A∪B.
We can use the KeLP framework to filter rejections so that they point to the most specific outcome for each feature. Specifically, we obtain one knockoff e-value for each feature at each node of the tree shown in Figure 3, i.e. seven knockoff e-values per feature in total. We require that the hypothesis of conditional independence from each feature be rejected at most once for each leaf-outcome. We weigh hypotheses with the reciprocal of the number of leaf-outcomes they implicate. With these specifications, we can then apply KeLP to these e-values, controlling the FDR across the entire rejected set (across all outcomes and features).
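For concreteness, the weighting scheme for the seven nodes of Figure 3 can be written in a few lines, encoding each node by the tuple of leaf-outcomes it implicates (this encoding is ours, purely for illustration):

```python
# Nodes of the tree in Figure 3, encoded by their leaf descendants:
# four leaves, two internal nodes, and the root.
nodes = [("A",), ("B",), ("C",), ("D",),
         ("A", "B"), ("C", "D"),
         ("A", "B", "C", "D")]

# Weight of each hypothesis: reciprocal of the number of implicated leaves.
weights = {node: 1 / len(node) for node in nodes}

print(weights[("A",)], weights[("A", "B")], weights[("A", "B", "C", "D")])
# 1.0 0.5 0.25
```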
The procedure returns, separately for each feature, the most specific outcome for which the conditional independence hypothesis was rejected. We show a simulation of this setting in Section 6.3.
6. Simulations
To illustrate KeLP’s power in localizing signals among features, we use two simulation settings: in the first, the features are generated with a block-diagonal variance–covariance matrix, and in the second they are genotypes of SNPs on chromosome 21 of the White Non-British population of the UK Biobank. We also explore KeLP’s performance with structured outcomes, as in Figure 3. The code for the simulations is available at: https://github.com/pmgblz/KeLP.
6.1. Block-diagonal correlation structure for features
We simulate n observations of the outcome from , with the identity matrix. The feature vector is generated as , where Σ is block-diagonal with blocks of size 5 and . Within each block, the correlation pattern follows an AR(1) process with . This provides a schematic representation of the dependence structure between DNA polymorphisms. Further following Spector and Janson (2024), there are randomly chosen nonzero coefficients, rounded to the nearest integer, where denotes the sparsity. The nonzero coefficients are simulated as i.i.d. with . Across the simulations, we vary the ratio , starting at .
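For concreteness, this data-generating process can be sketched as follows; the specific values of n, p, the AR(1) parameter, and the sparsity are placeholders, not the ones used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, block, rho = 500, 100, 5, 0.5      # placeholder values

# Block-diagonal covariance; within each block an AR(1) pattern rho^|i-j|.
idx = np.arange(block)
ar1 = rho ** np.abs(idx[:, None] - idx[None, :])
Sigma = np.kron(np.eye(p // block), ar1)

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Sparse coefficient vector with randomly placed nonzero entries.
k = round(0.1 * p)                        # placeholder sparsity
beta = np.zeros(p)
beta[rng.choice(p, k, replace=False)] = rng.normal(0, 1, k)

y = X @ beta + rng.normal(0, 1, n)        # Gaussian noise, identity covariance
```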
We consider two levels of resolution: the individual level (groups of size 1) and groups of size 5. The (group) knockoff variables are generated following the maximum entropy criterion (Chu, 2023; Spector & Janson, 2022); see Appendix E of the online supplementary material for details.
We compare KeLP’s rejection sets with those of the knockoff filter applied separately at each level of resolution (‘Knockoffs individual’ and ‘Knockoffs group’), as well as with two other approaches that identify rejections across resolutions. The first (‘Knockoffs outer’) consists in simply identifying the outer-nodes of the rejections that the knockoff filter makes at each resolution. Outer-nodes are those groups of variables with no rejected descendants, where a descendant is a group contained within the parent group (see Figure 1). Note that filtering the resolution-specific rejections of the knockoff filter down to the outer-nodes carries no guarantee of FDR control. To obtain FDR control at level α on the filtered outer-nodes, it is possible to run e-BH on their corresponding knockoff e-values at the amended level , where denotes the number of filtered outer-nodes. This ensures self-consistency; see Wang and Ramdas (2022). We include this procedure (‘e-BH knockoffs outer’) in our comparisons.
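The e-BH procedure invoked here is simple to state: given m e-values and target level α, reject the hypotheses with the k largest e-values, where k is the largest integer such that the k-th largest e-value is at least m/(αk) (Wang & Ramdas, 2022). A minimal sketch:

```python
def ebh(evalues, alpha):
    """e-BH (Wang & Ramdas, 2022): reject the k hypotheses with the largest
    e-values, where k = max{k : e_(k) >= m / (alpha * k)}."""
    m = len(evalues)
    order = sorted(range(m), key=lambda i: evalues[i], reverse=True)
    k = 0
    for j, i in enumerate(order, start=1):
        if evalues[i] >= m / (alpha * j):
            k = j
    return sorted(order[:k])

print(ebh([50.0, 2.0, 30.0, 0.5], alpha=0.2))  # [0, 2]
```

With m = 4 and α = 0.2, the thresholds for k = 1, 2 are 20 and 10, so the two hypotheses with e-values 50 and 30 are rejected.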
To evaluate performance, we look at FDP, power and precision of discoveries. The realized FDP is calculated separately for each level of resolution for the knockoff filter (individual and group) and across the resolutions for the other methods. We define ‘power’ as the number of correctly rejected nonzero individual features (whether they were rejected individually or in a group), divided by the total number of nonzero individual features. To evaluate the precision of discoveries, we look at the total number of features included in the rejected groups. For a given level of power, the most precise method points to the smallest number of features.
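These three metrics can be computed from a rejection set as follows; the encoding of rejected groups as tuples of feature indices is ours, purely for illustration.

```python
def evaluate(rejected_groups, nonzero_features):
    """FDP, power and size of a rejection set.

    rejected_groups: list of tuples of feature indices (size 1 = individual);
    nonzero_features: set of truly nonzero individual features.
    """
    # A rejection is false when its group contains no nonzero feature.
    false = sum(1 for g in rejected_groups
                if not any(f in nonzero_features for f in g))
    fdp = false / max(len(rejected_groups), 1)
    # Power: nonzero features covered by some rejection (individual or group).
    discovered = {f for g in rejected_groups for f in g} & nonzero_features
    power = len(discovered) / len(nonzero_features)
    # Size: total number of features implicated by the rejected groups.
    size = sum(len(g) for g in rejected_groups)
    return fdp, power, size

# Example: two correct rejections (one individual, one group) and one false.
print(evaluate([(3,), (10, 11, 12, 13, 14), (40,)], {3, 11, 20}))
```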
Figure 4 shows that the methods perform as expected in terms of FDR control. In terms of power and size of the rejection set, KeLP effectively ‘interpolates’ between the individual and the group level. For lower levels of signal strength, KeLP rejects more groups than individual features, but achieves the precision of the individual level as signal strength increases. KeLP’s power is very close to that of the outer-nodes of the knockoff filter applied separately at each resolution, but in contrast to the latter procedure, KeLP achieves FDR control. Refining the outer-nodes with e-BH leads to no power.
Figure 4.
By row, FDP, power, and size of the rejection set, averaged across 100 simulation runs. By columns, different signal sparsities (corresponding signal to noise ratios are 1, 2, and 4, from left to right). Target FDR is indicated with a dashed line. The size of the rejection set is displayed as the square root of the total number of variables included in the rejected hypotheses. Broken lines are used for methodologies not expected to control FDR.
6.2. Chromosome-wide simulation on the UK Biobank genotypes
We again simulate . Here, X are the genotypes of chromosome 21 for the unrelated White Non-British population of the UK Biobank, with and . We use the ancestries, resolutions, and knockoffs defined by Sesia et al. (2021); see also Appendix F of the online supplementary material for more details. We compare the same methods as in Section 6.1. There are randomly chosen nonzero coefficients across chromosome 21, rounded to the nearest integer. As n and p are fixed in this setting, to explore different signal strengths we vary the absolute value of the nonzero elements of β: these are set to , where a denotes the signal amplitude. The signs of the nonzero coefficients are determined by independent coin flips.
Figure 5 shows similar results to those in Figure 4: KeLP leads to a rejection set comparable to that of knockoffs outer-nodes, while controlling the FDR at the desired level.
Figure 5.
By row, FDP, power, and size of the rejection set, averaged across 25 simulation runs. By columns, different signal sparsities. Figure S.9 in Appendix E of the online supplementary material shows heritability () as a function of signal amplitude. See caption of Figure 4 for details. Resolutions are indicated by median group width in kilobases (kb); see Sesia et al. (2021) for further details.
6.3. Structured outcomes
We consider outcomes whose structure corresponds to the tree in Figure 3. In particular, consider a setting where the outcomes corresponding to the leaves refer to the most specific diseases. We denote by cases the proportion of people who have these diseases. Let with indicate the node. Each of the four leaves is simulated from a logistic model, where is an intercept chosen so that approximately 15% of observations are cases for each of the leaves. We simulate as , where Σ has entries and . There are randomly chosen nonzero coefficients, rounded to the nearest integer, where denotes the sparsity. We also impose an overlap of 50% between the nonzero features of siblings sharing the same internal parent node. The magnitudes of the nonzero coefficients are simulated as i.i.d. with and . We define the binary outcome corresponding to each parent node in the tree to be equal to 1 whenever any of its descendants is equal to 1. Across the simulations, we vary the ratio , starting at .
In this hierarchical setting, it makes sense to consider the properties of rejection sets at each level of the tree, or among outer-nodes. We compare the performance of KeLP, the combined results of the knockoff filter for the outcomes at each level of the tree (‘knockoffs leaves’, ‘knockoffs internal nodes’, ‘knockoffs root node’) and the corresponding outer-nodes.
The knockoff filter for the outcome corresponding to the root node has FDR control. However, the rejection set obtained by combining the results of the knockoff filter run separately on the two outcomes corresponding to internal nodes does not have FDR control (as we are simultaneously considering multiple outcomes), and neither does the rejection set that collects all discoveries made by the knockoff filter run separately on the leaf-outcomes. As before, to obtain FDR control, we pass these through a further e-BH procedure at the amended level (‘e-BH knockoffs leaves’, ‘e-BH knockoffs internal’), where denotes the number of rejections by the knockoff filter collected across multiple outcomes but within the same level of the tree, and the total number of hypotheses at that level of the tree.
Consider the tree in Figure 3: each leaf-outcome is associated with a set of nonzero features. We call the sum of the numbers of nonzero features across the leaf-outcomes the ‘total number of nonzero leaf-outcome associations’. Moreover, if, for example, a feature is nonzero for outcome A, it is also nonzero for the outcomes corresponding to the internal node A∪B and the root node. Therefore, if that feature is rejected at the level of the internal node or the root node, this is also counted as a correct rejection. We then define power as the total number of correct rejections divided by the total number of nonzero leaf-outcome associations. We calculate power and FDR separately at each level of the tree for the rejection set obtained by combining the results of the knockoff filter at each level, and across levels for KeLP and the knockoffs outer-nodes.
The discovery of an association between a feature and both A and B is more precise than the discovery of an association between that feature and A∪B. In agreement with the weighting scheme described in Section 5.3, we measure precision in terms of the number of ‘individually’ associated outcomes. The rejection of the hypothesis of conditional independence between a feature and one of the leaves adds 1 to this count; the rejection of an internal node adds 1/2, and the rejection of the root node adds 1/4.
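This precision count can be made concrete with a short sketch; encoding each rejection as a (feature, node) pair, with the node given by the tuple of leaf-outcomes it implicates, is our own convention for illustration.

```python
def precision_count(rejections):
    """Count of 'singularly implicated' outcomes: each rejected node
    contributes the reciprocal of the number of leaf-outcomes it implicates
    (a leaf counts 1, an internal node 1/2, the root 1/4)."""
    return sum(1 / len(node) for _, node in rejections)

# Rejecting feature 1 for leaves A and B separately is more precise ...
print(precision_count([(1, ("A",)), (1, ("B",))]))   # 2.0
# ... than rejecting it once for the internal node A-union-B.
print(precision_count([(1, ("A", "B"))]))            # 0.5
```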
Figure 6 shows that KeLP has FDR control, as expected from the theoretical results in Section 4. Moreover, in terms of power and precision, it behaves similarly to the knockoffs outer-nodes; however, the knockoffs outer-nodes violate FDR control. KeLP also has the highest power among all methods with theoretical FDR control (all solid lines in Figure 6).
Figure 6.
By row, FDP, power, and the square root of the number of singularly implicated outcomes (‘precision’), averaged across 100 simulation runs. By columns, different signal sparsities. Target FDR is indicated with a dashed line. Broken lines are used for methodologies not expected to control FDR. Note that, unlike in Figures 4 and 5, higher values are preferred in the bottom panel. The power and precision curves for ‘Knockoffs outer’ and ‘Knockoffs leaves’ almost coincide.
7. Application to the UK Biobank
We use KeLP to analyse the relation between two phenotypes (height and platelets) and genetic variation in approximately 337k unrelated British individuals from the UK Biobank. The genotype data consist of about 592k autosomal SNPs. We consider seven different partitions of the SNPs corresponding to different resolutions, as in Sesia et al. (2021), and rely on the knockoffs generated in that study to construct e-values (12) for all group hypotheses. Our target FDR level is 10%.
Table 1 shows, stratified by resolution, the number of rejections and of implicated SNPs by (1) KeLP, (2) the knockoff filter applied to each resolution separately, and (3) the outer-nodes of the knockoffs rejections. Appendix F.1.1 of the online supplementary material provides a visualization of the KeLP rejection regions.
Table 1.
UK Biobank: height
| Number of rejections | SNPs implicated | |||||
|---|---|---|---|---|---|---|
| Resolution | KeLP | Outer-nodes | Res. spec. | KeLP | Outer-nodes | Res. spec. |
| Single-SNP | 53 | 53 | 53 | 53 | 53 | 53 |
| 3 kb | 148 | 260 | 313 | |||
| 20 kb | 245 | 932 | ||||
| 41 kb | 345 | |||||
| 81 kb | 711 | 891 | ||||
| 208 kb | 555 | 882 | ||||
| 425 kb | 406 | 371 | ||||
| Total | ||||||
Notes. Unrelated British population. FDR level: 10%. The column ‘Res. spec.’ (resolution specific) reports the results of the knockoff filter applied separately at each level of resolution (these rejection sets have FDR control at the resolution level, but are redundant across resolutions). The column ‘Outer-nodes’ refers to the results of filtering these resolution-specific knockoff rejections down to their outer-nodes (which leads to a rejection set with no FDR control guarantee). Resolutions are indicated by median group width in kilobases (kb); see Sesia et al. (2021) for further details.
To explore the validity of the KeLP rejections, we compare them to the discoveries in Yengo et al. (2022), who find 12,111 independent SNPs significantly associated with height on the basis of a COJO (conditional and joint association analysis with summary statistics) analysis of GWAS data on 5.4 million individuals. For this analysis, we consider groups that do not contain any of these SNPs (matching by name or location) as true nulls, and evaluate their discoveries as false. The resulting estimate of the FDP for KeLP is around 7.5%, while that of the knockoffs outer-nodes is approximately 20%: substantially higher than the target FDR level of 10%. The rejections by KeLP include 4,121 of the 12,111 SNPs in Yengo et al. (2022), while 5,203 of these are discovered by the knockoffs outer-nodes: the increased FDP of this last approach comes with low power rewards.
We consider four continuous platelet-related phenotypes (platelet count, platelet crit, platelet width, platelet volume) and look for locations in the genome associated with any of them. To obtain global-null e-values for each group at each level of resolution, we average the knockoff e-values across the outcomes; see Appendix C of the online supplementary material for details. Table 2 summarizes the results.
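The aggregation step can be sketched in a few lines: the arithmetic mean of e-values is again an e-value, so averaging across outcomes yields a valid e-value for the global null. The numbers below are made up for illustration.

```python
def global_null_evalue(evalues_per_outcome):
    """Global-null e-value for one group: the arithmetic mean of its
    knockoff e-values across the L outcomes (a mean of e-values has
    expectation at most 1 under the intersection null)."""
    return sum(evalues_per_outcome) / len(evalues_per_outcome)

# One group, four hypothetical e-values for the platelet-related outcomes:
print(global_null_evalue([12.0, 0.0, 3.0, 1.0]))  # 4.0
```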
Table 2.
UK Biobank: platelet global null
| Number of rejections | SNPs implicated | |||||
|---|---|---|---|---|---|---|
| Resolution | KeLP | Outer-nodes | Res. spec. | KeLP | Outer-nodes | Res. spec. |
| Single-SNP | 116 | 0 | 0 | 116 | 0 | 0 |
| 3 kb | 77 | 96 | 96 | 783 | 954 | 954 |
| 20 kb | 286 | 354 | 438 | 7,162 | 8,446 | 10,531 |
| 41 kb | 143 | 146 | 539 | 5,003 | 5,123 | 19,010 |
| 81 kb | 528 | 534 | 1,071 | 28,597 | 28,959 | 57,813 |
| 208 kb | 334 | 334 | 1,268 | 30,815 | 30,815 | 123,340 |
| 425 kb | 207 | 207 | 1,275 | 28,816 | 28,816 | 204,169 |
| Total | 1,691 | 1,671 | 4,687 | 101,292 | 103,113 | 415,817 |
Note. See caption of Table 1 for detailed explanations.
It is interesting to remark that KeLP can increase power at higher levels of resolution compared to the standard ‘resolution-specific’ analysis. Indeed, the rejections by KeLP are not necessarily a subset of the outer-nodes of the latter. KeLP considers all hypotheses in the multi-resolution family jointly, which can result in a different threshold for self-consistency than a resolution-specific one; see equation (6). The fact that a number of low-resolution hypotheses are discovered can then lower the threshold for high-resolution ones to enter the rejection set.
8. Conclusion
We developed a method to localize signals at the finest possible level of resolution while controlling the frequentist FDR, leveraging the concept of e-values. In particular, we focused on relaxed multi-resolution e-values from the knockoff filter. We have shown that our method can be used to parse over groups of features or groups of outcomes. It showed desirable performance in simulations and in an application to the UK Biobank: it successfully navigated the trade-off between resolution and power, adaptively choosing a level of resolution that enables discoveries while maximizing their precision and controlling the FDR.
The rejections of KeLP are not necessarily a subset of the outer-nodes of the resolution-specific knockoff filters. Indeed, by looking simultaneously across resolutions, and allowing discoveries at different levels of precision, KeLP can lead to a larger number of high-resolution discoveries.
We observed in our experiments that KeLP achieves an empirical FDR that is usually lower than the target level. This might be connected to the self-consistency requirement, which, as stated in Section 3, implicitly estimates the number of false discoveries as a fraction of all tested hypotheses, even though our nonredundancy requirement could never result in this many rejections. The investigation of sharper bounds that would result in higher power might be the object of future work. Another possible direction for future research is to investigate whether logical constraints could be exploited for higher power, similar to Meijer and Goeman (2015) or Ramdas et al. (2019a).
Our results include methods to test conditional independence partial conjunction hypotheses using the same sample, as well as a new version of the multilayer knockoff filter. Altogether, they underscore the versatility of e-values and provide examples where the technical elegance of this methodology is accompanied by substantial power.
Contributor Information
Paula Gablenz, Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, CA 94305-4020 USA.
Chiara Sabatti, Department of Statistics, Stanford University, 390 Jane Stanford Way, Stanford, CA 94305-4020 USA; Department of Biomedical Data Science, Stanford University, Medical School Office Building 1265 Welch Road MC5464, Stanford, CA 94305-5464, USA.
Funding
We gratefully acknowledge funding from National Science Foundation (NSF) DMS-2210392 and National Institutes of Health (NIH) R56HG010812. P.G. is supported by a Ric Weiland fellowship.
Data availability
The genetic data analysed in this paper is part of the UK Biobank. The authors have access via application 27837. To obtain access, see how to apply at https://www.ukbiobank.ac.uk/.
Supplementary material
Supplementary material is available online at Journal of the Royal Statistical Society: Series B.
References
- Barber R. F., & Candès E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2055–2085. 10.1214/15-AOS1337 [DOI] [Google Scholar]
- Barber R. F., & Ramdas A. (2017). The p-filter: Multilayer false discovery rate control for grouped hypotheses. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4), 1247–1268. 10.1111/rssb.12218 [DOI] [Google Scholar]
- Benjamini Y., & Heller R. (2008). Screening for partial conjunction hypotheses. Biometrics, 64(4), 1215–1222. 10.1111/biom.2008.64.issue-4 [DOI] [PubMed] [Google Scholar]
- Benjamini Y., & Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 57(1), 289–300. 10.1111/j.2517-6161.1995.tb02031.x [DOI] [Google Scholar]
- Benjamini Y., & Yekutieli D. (2001). The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4), 1165–1188. 10.1214/aos/1013699998 [DOI] [Google Scholar]
- Blanchard G., & Roquain E. (2008). Two simple sufficient conditions for FDR control. Electronic Journal of Statistics, 2, 963–992. 10.1214/08-EJS180 [DOI] [Google Scholar]
- Bogomolov M., & Heller R. (2013). Discovering findings that replicate from a primary study of high dimension to a follow-up study. Journal of the American Statistical Association, 108(504), 1480–1492. 10.1080/01621459.2013.829002 [DOI] [Google Scholar]
- Bogomolov M., & Heller R. (2018). Assessing replicability of findings across two studies of multiple features. Biometrika, 105(3), 505–516. 10.1093/biomet/asy029 [DOI] [Google Scholar]
- Candès E. J., Fan Y., Janson L., & Lv J. (2018). Panning for gold: Model-X knockoffs for high-dimensional controlled variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80 (3), 551–577. 10.1111/rssb.12265 [DOI] [Google Scholar]
- Chu B. (2023). Variable Selection with Knockoffs. https://github.com/biona001/Knockoffs.jl
- Cortes A., Dendrou C. A., Motyer A., Jostins L., Vukcevic D., Dilthey A., Donnelly P., Leslie S., Fugger L., & McVean G. (2017). Bayesian analysis of genetic association across tree-structured routine healthcare data in the UK Biobank. Nature Genetics, 49(9), 1311–1318. 10.1038/ng.3926 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dai R., & Barber R. (2016). The knockoff filter for FDR control in group-sparse and multitask regression. In M. F. Balcan, & K. Q. Weinberger (Eds.), Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research (Vol. 48, pp. 1851–1859). PMLR. https://proceedings.mlr.press/v48/daia16.html.
- Fu A., Narasimhan B., & Boyd S. (2020). CVXR: An R package for disciplined convex optimization. Journal of Statistical Software, 94(14), 1–34. 10.18637/jss.v094.i14 [DOI] [Google Scholar]
- Genovese C., & Wasserman L. (2004). A stochastic process approach to false discovery control. Annals of Statistics, 32(3), 1035–1061. 10.1214/009053604000000283 [DOI] [Google Scholar]
- Gimenez J. R., Ghorbani A., & Zou J. (2019). Knockoffs for the mass: New feature importance statistics with false discovery guarantees. In The 22nd International Conference on Artificial Intelligence and Statistics (pp. 2125–2133). PMLR.
- Goeman J. J., Meijer R. J., Krebs T. J., & Solari A. (2019). Simultaneous control of all false discovery proportions in large-scale multiple hypothesis testing. Biometrika, 106(4), 841–856. 10.1093/biomet/asz041 [DOI] [Google Scholar]
- Goeman J. J., & Solari A. (2011). Multiple testing for exploratory research. Statistical Science, 26, 584–597. 10.1214/11-STS356 [DOI] [Google Scholar]
- Grünwald P., de Heide R., & Koolen W. (2024). Safe testing. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(5), 1091–1128. 10.1093/jrsssb/qkae011 [DOI] [Google Scholar]
- Katsevich E., & Sabatti C. (2019). Multilayer knockoff filter: Controlled variable selection at multiple resolutions. The Annals of Applied Statistics, 13(1), 1–33. 10.1214/18-AOAS1185 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Katsevich E., Sabatti C., & Bogomolov M. (2023). Filtering the rejection set while preserving false discovery rate control. Journal of the American Statistical Association, 118(541), 165–176. 10.1080/01621459.2021.1906684 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li S., Sesia M., Romano Y., Candès E., & Sabatti C. (2022). Searching for robust associations with a multi-environment knockoff filter. Biometrika, 109(3), 611–629. 10.1093/biomet/asab055 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Mandozzi J., & Bühlmann P. (2016). Hierarchical testing in the high-dimensional setting with correlated variables. Journal of the American Statistical Association, 111(513), 331–343. 10.1080/01621459.2015.1007209 [DOI] [Google Scholar]
- Meijer R. J., & Goeman J. J. (2015). A multiple testing method for hypotheses structured in a directed acyclic graph. Biometrical Journal, 57(1), 123–143. 10.1002/bimj.v57.1 [DOI] [PubMed] [Google Scholar]
- Ramdas A., Chen J., Wainwright M. J., & Jordan M. I. (2019a). A sequential algorithm for false discovery rate control on directed acyclic graphs. Biometrika, 106(1), 69–86. 10.1093/biomet/asy066 [DOI] [Google Scholar]
- Ramdas A. K., Barber R. F., Wainwright M. J., & Jordan M. I. (2019b). A unified treatment of multiple testing with prior knowledge using the p-filter. The Annals of Statistics, 47(5), 2790–2821. 10.1214/18-AOS1765 [DOI] [Google Scholar]
- Ren Z., & Barber R. F. (2024). Derandomised knockoffs: leveraging e-values for false discovery rate control. Journal of the Royal Statistical Society Series B: Statistical Methodology, 86(1), 122–154. 10.1093/jrsssb/qkad085 [DOI] [Google Scholar]
- Renaux C., Buzdugan L., Kalisch M., & Bühlmann P. (2020). Hierarchical inference for genome-wide association studies: A view on methodology with software. Computational Statistics, 35, 1–40. 10.1007/s00180-019-00939-2 [DOI] [Google Scholar]
- Romano Y., Sesia M., & Candès E. (2020). Deep knockoffs. Journal of the American Statistical Association, 115(532), 1861–1872. 10.1080/01621459.2019.1660174 [DOI] [Google Scholar]
- Rosenblatt J. D., Finos L., Weeda W. D., Solari A., & Goeman J. J. (2018). All-resolutions inference for brain imaging. NeuroImage, 181, 786–796. 10.1016/j.neuroimage.2018.07.060 [DOI] [PubMed] [Google Scholar]
- Sesia M., Bates S., Candès E., Marchini J., & Sabatti C. (2021). False discovery rate control in genome-wide association studies with population structure. Proceedings of the National Academy of Sciences, 118(40), e2105841118. 10.1073/pnas.2105841118 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sesia M., Katsevich E., Bates S., Candès E., & Sabatti C. (2020). Multi-resolution localization of causal variants across the genome. Nature Communications, 11, 1–10. 10.1038/s41467-020-14791-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sesia M., Sabatti C., & Candès E. J. (2019). Gene hunting with hidden Markov model knockoffs. Biometrika, 106(1), 1–18. 10.1093/biomet/asy033 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Shafer G. (2021). Testing by betting: A strategy for statistical and scientific communication. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(2), 407–431. 10.1111/rssa.12647 [DOI] [Google Scholar]
- Simes R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3), 751–754. 10.1093/biomet/73.3.751 [DOI] [Google Scholar]
- Spector A., & Janson L. (2022). Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1), 252–276. 10.1214/21-AOS2104 [DOI] [Google Scholar]
- Spector A., & Janson L. (2024). Controlled discovery and localization of signals via Bayesian linear programming. Journal of the American Statistical Association. 10.1080/01621459.2024.2347667 [DOI] [Google Scholar]
- Vovk V., & Wang R. (2021). E-values: Calibration, combination and applications. The Annals of Statistics, 49(3), 1736–1754. 10.1214/20-AOS2020 [DOI] [Google Scholar]
- Vovk V., & Wang R. (2023). Confidence and discoveries with e-values. Statistical Science, 38(2), 329–354. 10.1214/22-STS874 [DOI] [Google Scholar]
- Wang R., & Ramdas A. (2022). False discovery rate control with e-values. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(3), 822–852. 10.1111/rssb.12489 [DOI] [Google Scholar]
- Yekutieli D. (2008). Hierarchical false discovery rate–controlling methodology. Journal of the American Statistical Association, 103(481), 309–316. 10.1198/016214507000001373 [DOI] [Google Scholar]
- Yengo L., Vedantam S., Marouli E., Sidorenko J., Bartell E., Sakaue S., Graff M., Eliasen A. U., Jiang Y., Raghavan S., Miao J., Arias J. D., Graham S. E., Mukamel R. E., Spracklen C. N., Yin X., Chen S.-H., Ferreira T., Highland H. H., …Hirschhorn J. N. (2022). A saturated map of common genetic variants associated with human height. Nature, 610(7933), 704–712. 10.1038/s41586-022-05275-y [DOI] [PMC free article] [PubMed] [Google Scholar]