[Preprint]. 2024 Feb 20:arXiv:2402.12724v1. [Version 1]

Controlled Variable Selection from Summary Statistics Only? A Solution via GhostKnockoffs and Penalized Regression

Zhaomeng Chen 1,*, Zihuai He 2,3,*, Benjamin B Chu 4, Jiaqi Gu 2, Tim Morrison 1, Chiara Sabatti 1,4, Emmanuel Candès 1,5

Abstract

Identifying which variables do influence a response while controlling false positives pervades statistics and data science. In this paper, we consider a scenario in which we only have access to summary statistics, such as the values of marginal empirical correlations between each dependent variable of potential interest and the response. This situation may arise due to privacy concerns, e.g., to avoid the release of sensitive genetic information. We extend GhostKnockoffs [He et al., 2022] and introduce variable selection methods based on penalized regression achieving false discovery rate (FDR) control. We report empirical results in extensive simulation studies, demonstrating enhanced performance over previous work. We also apply our methods to genome-wide association studies of Alzheimer's disease, and evidence a significant improvement in power.

Keywords: Variable selection, replicability, summary statistics, false discovery rate (FDR), knockoffs, genome-wide association study (GWAS), pseudo-lasso

1. Introduction

1.1. Background and contributions

Modern large-scale studies frequently involve a multitude of explanatory variables potentially associated with an outcome we would like to better understand. Oftentimes, the goal is to select those explanatory variables that are meaningfully associated with the response variable. For instance, with recent advances in genome sequencing technologies and genotype imputation techniques, one can now gather tens of millions of variants from hundreds of thousands of samples in large-scale genetic studies, with the aim of pinpointing which genetic variants are biologically associated with specific diseases. This information could provide mechanistic insights and potentially aid the development of targeted drugs. In statistics, this challenge is typically framed as a multiple testing problem. Further, due to the sheer number of hypotheses considered and the cost of following false leads, it is generally required to control some form of error rate on the false positives.

In this paper, we focus on controlling the false discovery rate (FDR), which is the expected proportion of false selections among all selected variables. Compared to the more stringent familywise error rate (FWER) control, keeping the FDR under a nominal level allows for more discoveries while maintaining a reasonable statistical guarantee on the rate of false positives. Several methods for FDR control have been proposed in the literature, with the Benjamini-Hochberg procedure being particularly popular [Benjamini and Hochberg, 1995]. However, these approaches often assume a parametric model or the existence of valid p-values, whose construction remains difficult, and even problematic, in high-dimensional settings.

Candès et al. [2018] proposed model-X knockoffs, a broad and flexible framework which allows the statistician to select variables that retain dependence with the response conditional on all other covariates while maintaining FDR control. Model-X knockoffs differs from previous approaches in that (1) it makes no modeling assumptions on the distribution of the response Y we wish to study conditional on the family of covariates X, and (2) it does not require the construction of valid p-values. Instead, the crucial assumption is that the distribution of X is known. The main idea in Candès et al. [2018] is to generate fake variables $\tilde{X}$, the knockoffs, which we can view as negative controls and can be used to tease apart variables that do influence the response from those that do not. Model-X knockoffs has proved effective in a number of real-world applications, particularly in GWAS; see Bates et al. [2020], Sesia et al. [2021] and He et al. [2022] for examples.

To deploy model-X knockoffs, researchers must have in hand the covariates and responses from all samples. However, in certain situations, individual-level data that may reveal sensitive personal information is not readily accessible. For example, due to privacy concerns, many GWAS studies only publish summary statistics of the original data [Pasaniuc and Price, 2017]. Yet in such cases, we would still like to develop controlled variable selection methods that rely solely on summary statistics. In genetic studies, this would enable us to utilize available summary data from different data centers to conduct meta-analyses, enhancing the effective sample size and improving variable selection power. On this front, He et al. [2022] proposed the framework of GhostKnockoffs, which implements the knockoffs procedure with the marginal correlation difference feature importance statistic directly from summary statistics. As we shall review next, the main idea is to generate knockoff Z-scores directly without creating knockoff variables; all that is needed are marginal correlations between the response and the features under study. In detail, with n being the sample size and p the number of variables being assayed, the method operates with only $X^\top Y$ and $\|Y\|_2^2$, where X is the $n \times p$ matrix of covariates, and Y is the $n \times 1$ response vector.

In this paper, we extend the family of GhostKnockoffs methods to incorporate feature importance statistics obtained from penalized regression. We first consider in Section 3 the situation in which the empirical covariance of the covariate-response pair (X, Y) is available; with the above notation, this means that the summary statistics $X^\top X$, $X^\top Y$, $\|Y\|_2^2$ are available along with the sample size n. Unsurprisingly, we observe substantial power improvement over the method of He et al. [2022] because we can now employ far more effective test statistics. Next, in Section 4 we consider the case where the empirical covariance $X^\top X$ of the features is not available. There, we propose new imputation methods that consistently outperform He et al. [2022] in comprehensive synthetic and semi-synthetic simulations and rigorously control the FDR under suitable conditions. Finally, in Section 5 we apply our methods to a meta-analysis of nine large-scale array-based genome-wide association and whole-exome/-genome sequencing studies of Alzheimer's disease, in which our methods yield more discoveries than He et al. [2022]. We note that existing work in the genetics literature has implemented variable selection methods based on penalized regression with summary statistics, e.g., Mak et al. [2017] and Zou et al. [2022]. However, none of these provide any guarantee of FDR control. In fact, as we note in the main text, these methods can be leveraged in our approach to create knockoffs versions that do control the FDR.

1.2. Code availability and reproducibility

The software and example code that reproduce the results presented in this paper can be found at https://github.com/biona001/ghostknockoff-gwas-reproducibility/tree/main/chen_et_al. Simulation results in Section 3.5, Section 4.4.2 and Section 4.4.3 can be exactly reproduced. Due to data accessibility issues, we only provide code without real data for Section 4.4.1 and Section 5.

2. Model-X Knockoffs and GhostKnockoffs

To begin with, we define the controlled variable selection problem and give a brief review of model-X knockoffs and GhostKnockoffs. For a more detailed exposition, we refer readers to Candès et al. [2018], Barber and Candès [2015], and He et al. [2022]. In the following, we use boldface letters for vectors and matrices. We use $X_j \in \mathbb{R}^n$ and $x_i \in \mathbb{R}^p$ to respectively represent the jth column and ith row of the covariate matrix X.

2.1. Problem statement

Given covariates $X \in \mathbb{R}^p$ and a response $Y \in \mathbb{R}$, we are interested in understanding which variables influence Y. We formulate this selection problem as testing the conditional independence hypotheses $H_{0j}: X_j \perp\!\!\!\perp Y \mid X_{-j}$ for $1 \le j \le p$, where $X_{-j}$ is a shorthand for all the variables except the jth; that is, $X_{-j} = (X_1, \ldots, X_{j-1}, X_{j+1}, \ldots, X_p)$. In words, we should reject $H_{0j}$ if we believe that $X_j$ can help better predict the outcome than if we only had available the values of all the other variables. Put differently, $X_j$ has information about Y which cannot be subsumed by the information contained in all the other variables. By conditioning on $X_{-j}$, these hypothesis tests aim to weed out variables whose relationship to Y is driven by residual correlations with other covariates.

Let $\mathcal{H}_0 \subseteq [p]$ be the set of indices for which the null conditional independence hypothesis $H_{0j}$ is true, and let $\mathcal{S} \subseteq [p]$ be the set of indices of the hypotheses rejected by a selection procedure. The false discovery rate (FDR) is the expected fraction of false positives among the selected, defined as

$$\mathrm{FDR} \triangleq \mathbb{E}\left[\frac{|\mathcal{S} \cap \mathcal{H}_0|}{|\mathcal{S}|}\right],$$

with the convention that $0/0 = 0$. Our goal is to make as many rejections as possible while controlling the FDR below a user-specified level q.

In this paper, we consider the setting in which, instead of observing i.i.d. samples from the distribution of (X, Y), we only have some summary statistics of the i.i.d. samples. In particular, we will show how one can, quite remarkably, perform tests of conditional independence when we do not directly observe the i.i.d. samples. Throughout this paper, we assume that $X \sim \mathcal{N}(0, \Sigma)$ where $\Sigma$ is known (or, in practice, can be estimated).

2.2. Model-X knockoffs

2.2.1. The procedure

Suppose we observe n i.i.d. samples $(x_i, Y_i)$, $1 \le i \le n$, arranged in a data matrix $X \in \mathbb{R}^{n \times p}$ and response vector $Y \in \mathbb{R}^n$. In the model-X knockoffs framework [Candès et al., 2018], we assume we know the distribution $P_X$ of the covariates X while having no knowledge of the conditional distribution of $Y \mid X$. The model-X approach is well-suited to genetic applications where reference panels may be available to estimate $P_X$ or where we have good models of linkage disequilibrium.

To implement model-X knockoffs, we first generate a matrix $\tilde{X} \in \mathbb{R}^{n \times p}$ of knockoffs such that the following two conditions hold:

(Exchangeability): $(X_j, \tilde{X}_j, X_{-j}, \tilde{X}_{-j}) \overset{d}{=} (\tilde{X}_j, X_j, X_{-j}, \tilde{X}_{-j})$ for all $1 \le j \le p$; (1)
(Conditional independence): $\tilde{X} \perp\!\!\!\perp Y \mid X$. (2)

Roughly, the first says that we cannot distinguish between $[X\ \tilde{X}]$ and $[X\ \tilde{X}]_{\mathrm{swap}(j)}$, where $[X\ \tilde{X}]_{\mathrm{swap}(j)}$ is obtained from $[X\ \tilde{X}]$ by swapping the jth and (j+p)th columns. The second condition implies that $\tilde{X}$ does not provide any new information about Y conditional on X and is guaranteed if $\tilde{X}$ is constructed without looking at Y. If these properties hold, it can be shown that $X_j$ and $\tilde{X}_j$ are indistinguishable conditional on Y for each $j \in \mathcal{H}_0$.

Next, we define feature importance statistics $W = w([X, \tilde{X}], Y) \in \mathbb{R}^p$ to be any function of $X$, $\tilde{X}$ and $Y$ such that a flip-sign property holds; namely, switching a column $X_j$ with its knockoff $\tilde{X}_j$ flips the sign of the jth component of the output; formally, $w_j([X, \tilde{X}]_{\mathrm{swap}(j)}, Y) = -w_j([X, \tilde{X}], Y)$. Common choices include $W_j = |X_j^\top Y| - |\tilde{X}_j^\top Y|$ (marginal correlation difference statistic) and $W_j = |\hat{\beta}_j(\lambda_{CV})| - |\hat{\beta}_{j+p}(\lambda_{CV})|$ (Lasso coefficient difference statistic), where $\hat{\beta}(\lambda_{CV})$ is the solution to the Lasso problem

$$\arg\min_{\beta \in \mathbb{R}^{2p}} \frac{1}{2}\left\|Y - [X\ \tilde{X}]\beta\right\|_2^2 + \lambda_{CV}\|\beta\|_1,$$

and $\lambda_{CV}$ is usually chosen by cross-validation.

Finally, the knockoff filter selects the variables $\mathcal{S} = \{j : W_j \ge T\}$, where

$$T = \min\left\{t \in \mathcal{W} : \frac{1 + \#\{j : W_j \le -t\}}{\#\{j : W_j \ge t\}} \le q\right\}. \quad (3)$$

Here, $\mathcal{W} = \{|W_j| : j = 1, \ldots, p\} \setminus \{0\}$, and $T = +\infty$ if $\mathcal{W}$ is empty. Intuitively, the threshold T is chosen to be the most liberal one such that an estimate of the FDP is bounded by q. Candès et al. [2018] showed that this procedure controls the FDR of the conditional testing problem at level q.
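To make the selection rule concrete, here is a minimal NumPy sketch of the knockoff filter defined by (3); the function name and interface are ours, not part of the paper's released software.

```python
import numpy as np

def knockoff_filter(W, q, offset=1):
    """Select {j : W_j >= T}, with T the data-dependent threshold in (3)."""
    W = np.asarray(W, dtype=float)
    # candidate thresholds: the nonzero magnitudes |W_j|, scanned from smallest to largest
    candidates = np.sort(np.unique(np.abs(W[W != 0])))
    for t in candidates:
        fdp_hat = (offset + np.sum(W <= -t)) / max(np.sum(W >= t), 1)
        if fdp_hat <= q:                      # most liberal t whose estimated FDP is <= q
            return np.where(W >= t)[0]
    return np.array([], dtype=int)            # T = +infinity: no rejections
```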

2.2.2. Gaussian knockoff sampler

Under the assumption that the rows of the data matrix X are i.i.d. from the Gaussian distribution $\mathcal{N}(0, \Sigma)$, we can generate a knockoff vector $\tilde{x}_i$ for each row $x_i$ of the data matrix X by sampling $\tilde{x}_i \sim \mathcal{N}(P^\top x_i, V)$ independently across rows, where $P = I - \Sigma^{-1}D$, $V = 2D - D\Sigma^{-1}D$, $D = \mathrm{diag}\{s\}$, and $s \in \mathbb{R}^p$ is a vector of free parameters usually obtained by solving a convex optimization problem that depends on $\Sigma$ [Candès et al., 2018]. See Appendix A for details of computing s. Concatenating all the knockoff vectors then gives a valid matrix $\tilde{X} \in \mathbb{R}^{n \times p}$ of knockoffs. In matrix form, the construction above is

$$\tilde{X} = XP + EV^{1/2}, \quad (4)$$

where E is an n by p matrix with i.i.d. standard Gaussian entries, independent of X and Y. For later reference, we summarize the Gaussian knockoff sampler in Algorithm 1 and denote it as $\mathcal{G}$.

Algorithm 1 Gaussian Knockoff Sampler $\mathcal{G}$
1: Input: $X$ and $\Sigma$.
2: Compute $s$ by solving a convex optimization problem as defined in (15).
3: Compute $D = \mathrm{diag}\{s\}$, $P = I - \Sigma^{-1}D$, and $V = 2D - D\Sigma^{-1}D$.
4: Simulate $E \in \mathbb{R}^{n \times p}$ whose entries are i.i.d. standard Gaussian variables.
5: Output: $\tilde{X} = XP + EV^{1/2}$.
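For illustration, a NumPy sketch of Algorithm 1. The convex program (15) for s is not reproduced here; as a stand-in we use the standard equicorrelated choice $s_j = \min(1, 2\lambda_{\min}(\Sigma))$, which assumes $\Sigma$ is a correlation matrix. Function names are ours.

```python
import numpy as np

def equi_s(Sigma):
    """Equicorrelated stand-in for the convex program (15) (Sigma a correlation matrix)."""
    lam_min = np.linalg.eigvalsh(Sigma).min()
    return np.full(Sigma.shape[0], min(1.0, 2.0 * lam_min))

def gaussian_knockoffs(X, Sigma, s, rng=np.random.default_rng()):
    """Algorithm 1: return X_tilde = X P + E V^{1/2}."""
    n, p = X.shape
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)          # Sigma^{-1} D
    P = np.eye(p) - Sigma_inv_D                      # P = I - Sigma^{-1} D
    V = 2 * D - D @ Sigma_inv_D                      # V = 2D - D Sigma^{-1} D
    evals, evecs = np.linalg.eigh(V)                 # symmetric square root of V
    V_half = evecs @ np.diag(np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    E = rng.standard_normal((n, p))                  # i.i.d. N(0,1) entries, independent of X
    return X @ P + E @ V_half
```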

2.3. GhostKnockoffs with marginal correlation difference statistic

The original model-X knockoffs procedure relies on having access to the covariates and responses from all data points, i.e., the matrix of covariates X and the response vector Y. Henceforth, we call these individual-level data. In many application scenarios, however, individual-level data are not available due to privacy concerns. Instead, we only have access to some summary statistics of X and Y, e.g., the empirical covariance matrix of the covariates and the empirical covariance between each covariate and the response.

He et al. [2022] proposed GhostKnockoffs, which implements the knockoffs procedure with the marginal correlation difference statistic when only $X^\top Y$ and $\|Y\|_2^2$ are available. The key idea of He et al. [2022] is to sample the knockoff Z-scores $\tilde{Z}_s$ from $X^\top Y$ and $\|Y\|_2^2$ directly, in a way such that

$$\tilde{Z}_s \mid X, Y \overset{d}{=} \tilde{X}^\top Y \mid X, Y, \quad (5)$$

where $\tilde{X} = \mathcal{G}(X, \Sigma)$ is the knockoff matrix generated by the Gaussian knockoff sampler (Algorithm 1). If we use $W = |Z_s| - |\tilde{Z}_s|$ (elementwise, where $Z_s = X^\top Y$) as the feature importance statistic and run the knockoff filter, the resulting rejection set will have the same distribution as that of the knockoffs procedure with the marginal correlation difference statistic. Therefore, the two procedures are statistically identical. In particular, they both control the FDR.

Specifically, He et al. [2022] showed that for P and V computed in step 3 of Algorithm 1,

$$\tilde{Z}_s = P^\top X^\top Y + \|Y\|_2 Z, \quad \text{where } Z \sim \mathcal{N}(0, V) \text{ is independent of } X \text{ and } Y, \quad (6)$$

satisfies (5) as detailed in Appendix B. All this is summarized in Algorithm 2. In the following sections, we refer to Algorithm 2 as GhostKnockoffs with marginal correlation difference statistic (GK-marginal).

Algorithm 2 GhostKnockoffs with Marginal Correlation Difference Statistic (GK-marginal)
1: Input: $X^\top Y$, $\|Y\|_2^2$, and $\Sigma$.
2: Compute $s$, $P$, and $V$ as in Algorithm 1.
3: Compute the feature importance statistics $W = |Z_s| - |\tilde{Z}_s|$ (elementwise), where $\tilde{Z}_s$ is generated according to (6).
4: Input W into the knockoffs selection procedure.
5: Output: Knockoffs selection set.
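A sketch of Algorithm 2 built on the two helpers sketched above (`equi_s`, `knockoff_filter`); `XtY` stands for $X^\top Y$ and `Y_norm2` for $\|Y\|_2^2$. This is our illustrative pseudocode, not the authors' implementation.

```python
import numpy as np

def gk_marginal(XtY, Y_norm2, Sigma, q, rng=np.random.default_rng()):
    """GK-marginal: knockoff Z-scores from summary statistics only (equation (6))."""
    p = len(XtY)
    s = equi_s(Sigma)                                # stand-in for solving (15)
    D = np.diag(s)
    Sigma_inv_D = np.linalg.solve(Sigma, D)
    P = np.eye(p) - Sigma_inv_D
    V = 2 * D - D @ Sigma_inv_D
    Z = rng.multivariate_normal(np.zeros(p), V)      # Z ~ N(0, V), independent noise
    Z_s = XtY
    Z_s_knock = P.T @ XtY + np.sqrt(Y_norm2) * Z     # equation (6)
    W = np.abs(Z_s) - np.abs(Z_s_knock)              # marginal correlation difference statistic
    return knockoff_filter(W, q)
```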

3. GhostKnockoffs with Penalized Regression: Known Empirical Covariance

3.1. Setting

As we have just seen, GK-marginal gives a way to test conditional hypotheses while maintaining FDR control when only the summary statistics $X^\top Y$ and $\|Y\|_2^2$ are available to the analyst. Now, we consider the setting in which we have knowledge of the empirical covariance matrix $X^\top X$ and the sample size n, in addition to $X^\top Y$ and $\|Y\|_2^2$. These quantities only reveal sample averages of relevant quantities, as opposed to all the individual-level information.

In this section, we propose a variable selection method that utilizes only $X^\top X$, $X^\top Y$, $\|Y\|_2^2$, and n. Our method achieves FDR control and power comparable to the knockoffs procedure with the cross-validated Lasso coefficient difference statistic defined in Section 2. This is interesting because the latter usually outperforms GhostKnockoffs with the marginal correlation difference statistic by a significant margin. Notably, for a fixed tuning parameter λ, we show that our procedure is equivalent to the corresponding knockoffs method using the Lasso coefficient difference statistic with the same penalty level λ.

3.2. GhostKnockoffs with the Lasso

Recall that in the knockoffs procedure with the Lasso coefficient difference statistic, we solve the optimization problem

$$\hat{\beta}(\lambda) \in \arg\min_{\beta \in \mathbb{R}^{2p}} \frac{1}{2}\left\|Y - [X\ \tilde{X}]\beta\right\|_2^2 + \lambda\|\beta\|_1, \quad (7)$$

where $\tilde{X} = \mathcal{G}(X, \Sigma)$. We then define the Lasso coefficient difference feature importance statistics by $W_j = |\hat{\beta}_j(\lambda)| - |\hat{\beta}_{j+p}(\lambda)|$ for $1 \le j \le p$. If we have access to individual-level data, λ is usually chosen by cross-validation (Candès et al. [2018] and Weinstein et al. [2020]).

As a first step, we would like to run a statistically equivalent procedure using $X^\top X$, $X^\top Y$, $\|Y\|_2^2$, and n for a fixed λ. Note that, with λ fixed, (7) depends on the data only through

$$\begin{bmatrix} X^\top X & X^\top \tilde{X} \\ \tilde{X}^\top X & \tilde{X}^\top \tilde{X} \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} X^\top Y \\ \tilde{X}^\top Y \end{bmatrix}.$$

Define the Gram matrix of $[X, \tilde{X}, Y]$ as

$$\mathcal{T}(X, \tilde{X}, Y) = [X, \tilde{X}, Y]^\top [X, \tilde{X}, Y].$$

The Gram matrix can of course be equivalently reconstructed from $\|Y\|_2^2$, $X^\top Y$, $\tilde{X}^\top Y$, $X^\top X$, $\tilde{X}^\top X$, $\tilde{X}^\top \tilde{X}$. The main idea is to sample from the joint distribution of $\mathcal{T}(X, \tilde{X}, Y)$ using the Gram matrix of $[X, Y]$ only. Based on this, we can then generate the solution to the Lasso problem (7) (in distribution) for a fixed λ. This is achieved via the following Proposition 1, which says in words that if we generate 'fake' data matrices $\check{X}$ and $\check{Y}$ that lead to the same Gram matrix as that of X and Y, then the distribution of $\mathcal{T}$ remains unchanged if we replace the original data matrices by the fake data matrices.

Proposition 1. Suppose $\check{X} \in \mathbb{R}^{n \times p}$ and $\check{Y} \in \mathbb{R}^n$ are constructed such that $[\check{X}\ \check{Y}]^\top[\check{X}\ \check{Y}] = [X\ Y]^\top[X\ Y]$. Setting $\tilde{X} = \mathcal{G}(X, \Sigma)$ and $\check{\tilde{X}} = \mathcal{G}(\check{X}, \Sigma)$ as the outputs of Algorithm 1, we have

$$\mathcal{T}(X, \tilde{X}, Y) \mid X, Y \overset{d}{=} \mathcal{T}(\check{X}, \check{\tilde{X}}, \check{Y}) \mid X, Y.$$

The proof of Proposition 1 is provided in Appendix C. Specifically, Proposition 1 suggests that the summary statistics $X^\top X$, $X^\top Y$, $\|Y\|_2^2$, $\Sigma$ are sufficient for sampling the Gram matrix $\mathcal{T}(X, \tilde{X}, Y)$.

Algorithm 3 GhostKnockoffs with Penalized Regression: Known Empirical Covariance
1: Input: $X^\top X$, $X^\top Y$, $\|Y\|_2^2$, $\Sigma$, and $n$.
2: Find $\check{X}$ and $\check{Y}$ such that $[\check{X}\ \check{Y}]^\top[\check{X}\ \check{Y}] = [X\ Y]^\top[X\ Y]$ by eigen-decomposition or Cholesky decomposition.
3: Generate $\check{\tilde{X}} = \mathcal{G}(\check{X}, \Sigma)$ via Algorithm 1.
4: Run the standard knockoffs procedure (at level q) with the Lasso coefficient difference statistic on $\check{X}$ and $\check{\tilde{X}}$ for a fixed penalty level λ, or use the methods from Sections 3.3 and 3.4.
5: Output: Knockoffs selection set.

We are now able to write down a procedure, namely, Algorithm 3, which is statistically equivalent to the corresponding individual-level knockoffs procedure using the Lasso coefficient difference statistic (or any statistic defined in Sections 3.3 and 3.4). In step 2, $\check{X}$ and $\check{Y}$ can be obtained by performing the eigen-decomposition or Cholesky decomposition of $[X\ Y]^\top[X\ Y]$; see the sketch below for one concrete construction. Brief procedures to construct $\check{X}$ and $\check{Y}$ via eigen-decomposition are provided in Appendix D. All we need to do is to run the knockoffs procedure with $\check{X}$ and $\check{\tilde{X}}$ in lieu of X and $\tilde{X}$. We say that the procedure is equivalent since the rejection sets have the same distribution. In particular, this proves that Algorithm 3 controls the FDR.
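A sketch of step 2 of Algorithm 3 via a Cholesky factorization of the Gram matrix of [X Y]; the zero-padding to n rows and the small jitter (used in case the Gram matrix is only positive semi-definite) are implementation details of ours, and the construction assumes n ≥ p + 1.

```python
import numpy as np

def surrogate_data_from_gram(XtX, XtY, Y_norm2, n):
    """Return (X_check, Y_check) whose Gram matrix equals that of [X Y]."""
    p = XtX.shape[0]
    G = np.zeros((p + 1, p + 1))                     # Gram matrix of [X Y]
    G[:p, :p] = XtX
    G[:p, p] = XtY
    G[p, :p] = XtY
    G[p, p] = Y_norm2
    L = np.linalg.cholesky(G + 1e-10 * np.eye(p + 1))
    A = np.zeros((n, p + 1))                         # pad with zero rows so A has n rows
    A[: p + 1, :] = L.T                              # then A^T A = L L^T = G
    return A[:, :p], A[:, p]                         # X_check (n x p), Y_check (length n)
```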

Corollary 1. Consider a knockoffs feature importance statistic $W = f(\mathcal{T}(X, \tilde{X}, Y), U) \in \mathbb{R}^p$, which is a deterministic function of $\mathcal{T}(X, \tilde{X}, Y)$ and an independent random variable U. Define $\hat{W} = f(\mathcal{T}(\check{X}, \check{\tilde{X}}, \check{Y}), U)$. Let $\mathcal{S}_1$ (resp. $\mathcal{S}_2$) be the rejection set obtained from applying the knockoff filter on W (resp. $\hat{W}$). Then $\mathcal{S}_1 \mid X, Y \overset{d}{=} \mathcal{S}_2 \mid X, Y$. Thus, if W obeys the flip-sign property, both procedures have equal FDR, at most equal to q.

Proof. Proposition 1 gives $W \mid X, Y \overset{d}{=} \hat{W} \mid X, Y$. Since the selection set is uniquely determined by the values of W (or $\hat{W}$), it follows that $\mathcal{S}_1 \mid X, Y \overset{d}{=} \mathcal{S}_2 \mid X, Y$. Therefore, the procedures have the same FDR. □

We can easily adapt the method above to accommodate other types of regularization, such as Ridge regression and Elastic Net.

3.3. GhostKnockoffs with the square-root Lasso

In Section 3.2, we assumed that the tuning parameter λ in (7) is fixed. In practice, one may choose the penalty level using information from the Gram matrix of [X,Y], and the sample size n. Since individual-level data is not available, we are unable to use data-splitting approaches such as cross-validation.

An alternative way to define feature importance is to use the square-root Lasso [Belloni et al., 2011], for which the choice of a reasonable tuning parameter is convenient. The square-root Lasso applied to the knockoffs setting solves

$$\hat{\beta}(\lambda) \in \arg\min_{\beta \in \mathbb{R}^{2p}} \left\|Y - [X\ \tilde{X}]\beta\right\|_2 + \lambda\|\beta\|_1, \quad (8)$$

and a good choice of λ is given by

$$\lambda = \kappa \cdot \mathbb{E}\left[\left\|[X\ \tilde{X}]^\top \frac{\epsilon}{\|\epsilon\|_2}\right\|_\infty \,\Big|\, X, \tilde{X}\right], \quad (9)$$

where $\epsilon \sim \mathcal{N}(0, I_n)$ and κ is a unitless hyperparameter [Tian et al., 2018]. This value is a scalar multiple of the expected value of the minimal penalty level required such that all the coefficients are shrunk to zero under the global null model. The square-root Lasso has the benefit that the value of the hyperparameter does not depend on the details of the distribution of Y conditional on X. We also found that the performance of our procedure does not depend very sensitively on the choice of κ. In our data examples, we take κ = 0.3.

In the setting where we only know the values of the summary statistics, we simply replace $(X, \tilde{X}, Y)$ by $(\check{X}, \check{\tilde{X}}, \check{Y})$ in (8). Further, we note that for any orthogonal matrix Q,

$$\left([X\ \tilde{X}]^\top Q\epsilon,\ \epsilon^\top\epsilon\right) \Big| X, \tilde{X} \;=\; \left([X\ \tilde{X}]^\top Q\epsilon,\ \epsilon^\top Q^\top Q\epsilon\right) \Big| X, \tilde{X} \;\overset{d}{=}\; \left([X\ \tilde{X}]^\top \epsilon,\ \epsilon^\top\epsilon\right) \Big| X, \tilde{X},$$

where the second equality (in distribution) follows from $Q\epsilon \overset{d}{=} \epsilon$. Therefore, the value of the hyperparameter in (9) remains unchanged if we multiply $[X\ \tilde{X}]$ by Q on the left. This implies that (9) is a deterministic function of $[X\ \tilde{X}]^\top[X\ \tilde{X}]$. Hence, the feature importance statistic is a function of $\mathcal{T}(X, \tilde{X}, Y)$. Following Corollary 1, we can apply the knockoffs procedure with the square-root Lasso and matrices $(\check{X}, \check{\tilde{X}})$ in lieu of $(X, \tilde{X})$. Upon choosing

$$\lambda = \kappa \cdot \mathbb{E}\left[\left\|[\check{X}\ \check{\tilde{X}}]^\top \frac{\epsilon}{\|\epsilon\|_2}\right\|_\infty \,\Big|\, \check{X}, \check{\tilde{X}}\right], \quad (10)$$

we get a procedure which is statistically indistinguishable from the one we would get if we were performing all the same steps with X and $\tilde{X}$. (In practice, we compute the value in (10) via Monte Carlo simulation.) In the sequel, we call the resulting procedure summary statistics GhostKnockoffs with square-root Lasso importance statistic (GK-sqrtlasso). Note that GK-sqrtlasso controls the FDR as the flip-sign property of the feature importance statistic holds. This is because swapping a variable with its knockoff does not change the value of the hyperparameter. Therefore, by Corollary 1, applying the knockoff filter to the square-root Lasso feature importance statistics yields FDR control.
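Since the conditional expectation in (10) depends on the data only through $[\check{X}\ \check{\tilde{X}}]$, it can be estimated by averaging over simulated noise vectors; a short Monte Carlo sketch (κ = 0.3 and 200 draws follow the text; the function name is ours):

```python
import numpy as np

def sqrt_lasso_lambda(X_aug, kappa=0.3, n_mc=200, rng=np.random.default_rng()):
    """Monte Carlo estimate of (10): kappa * E[ || X_aug^T (eps / ||eps||_2) ||_inf ]."""
    n = X_aug.shape[0]
    vals = np.empty(n_mc)
    for b in range(n_mc):
        eps = rng.standard_normal(n)
        vals[b] = np.max(np.abs(X_aug.T @ eps)) / np.linalg.norm(eps)
    return kappa * vals.mean()

# usage: lam = sqrt_lasso_lambda(np.hstack([X_check, X_check_knock]))
```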

3.4. GhostKnockoffs with the Lasso-max

In the standard fixed-X knockoffs setting, cross-validation is also not feasible, since doing so would violate the sufficiency condition required for the feature importance statistics. As one possible alternative, Barber and Candès [2015] considered using as the feature importance statistic the value of λ on the Lasso path at which feature Xj first enters the model. Formally, they define the feature importance statistic

$$W_j = \sup\{\lambda : \hat{\beta}_j(\lambda) \ne 0\} - \sup\{\lambda : \hat{\beta}_{j+p}(\lambda) \ne 0\},$$

where βˆ(λ) is as in (7). We call this statistic the Lasso-max statistic. Intuitively, a larger penalty level is required to shrink an important feature to zero, so we should expect Wj to be large and positive for non-nulls.

By Corollary 1, with the Lasso-max statistic Algorithm 3 produces a rejection set that has the same distribution as the rejection set obtained from the corresponding individual-data-based knockoffs procedure. We call this summary-statistic-based procedure GhostKnockoffs with Lasso-max statistic (GK-lassomax).

We remark that other choices of tuning parameters and feature importance statistics are also possible. For instance, we may choose λ to minimize Stein's unbiased risk estimate (SURE) associated with (7). We shall however focus on the two approaches we have described.

3.5. Numerical simulations

We consider a variety of simulation settings in which we compare the performance of the proposed GhostKnockoffs with square-root Lasso and Lasso-max statistics (GK-sqrtlasso and GK-lassomax, defined in Sections 3.3 and 3.4), GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2), and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic with individual-level data (KF-lassocv). Note that the first three are statistically equivalent to the corresponding knockoffs procedures with individual-level data.

3.5.1. Independent features

In the first set of simulations (Figure 1), we generate random samples $x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, I_p)$ and $Y_i = \beta^\top x_i + \sqrt{n}\,\epsilon_i$, where $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ for $i \in \{1, 2, \ldots, n\}$. We consider three settings of varying dimensionality measured by the ratio p/n: $(n, p) \in \{(600, 200), (400, 400), (200, 600)\}$. In each of the three settings, we create a sparse vector β by selecting 30 coordinates to be non-zero uniformly at random. The signs of these non-zero coordinates are assigned to be either positive or negative with equal probability. We vary the signal amplitudes so as to explore a wide power range. For the square-root Lasso, we average over 200 Monte Carlo samples to calculate

$$\lambda = \kappa \cdot \mathbb{E}\left[\left\|[X\ \tilde{X}]^\top \frac{\epsilon}{\|\epsilon\|_2}\right\|_\infty \,\Big|\, X, \tilde{X}\right].$$

The target FDR is 20%. Each point on the curves represents the average of the results from 200 replications.

Figure 1: Power and FDR plots for independent features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

We observe that GK-sqrtlasso and GK-lassomax generally demonstrate greater power than GK-marginal. This enhanced performance is not surprising, as GK-sqrtlasso and GK-lassomax (1) have access to additional information via $X^\top X$, and (2) employ a joint modeling algorithm such as the Lasso, which generally provides a better assessment of variable importance for understanding conditional (in)dependence since such a model explicitly adjusts for the effects of all the other variables. We also note the presence of power gaps between KF-lassocv and GK-sqrtlasso/GK-lassomax, likely due to the fact that we are unable to perform cross-validation without individual-level data. All methods control the FDR at the desired level.

3.5.2. AR(1) features

In the second set of simulations (Figure 2), we generate $x_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \Sigma_\rho)$ for $i \in \{1, 2, \ldots, n\}$, where $(\Sigma_\rho)_{s,t} = \rho^{|s-t|}$ for $1 \le s, t \le p$. As before, we generate $Y_i = \beta^\top x_i + \sqrt{n}\,\epsilon_i$, where $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ for $i \in \{1, 2, \ldots, n\}$. We consider the same three (n, p) combinations. In each of the three cases, we create a sparse vector β exactly as before, except that we fix the signal amplitudes to 4, 4, and 7 respectively to explore a wide power range. We vary ρ in {0, 0.1, 0.2, …, 0.8}. The target FDR is set to 20%. Each point represents the average of the results from 200 replications.

Figure 2: Power and FDR plots for AR(1) features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

Again, we observe that GK-sqrtlasso and GK-lassomax generally have greater power than GK-marginal. All methods have (almost) decreasing power as the autocorrelation coefficient increases, since it becomes harder to separate true signals from null variables that are correlated with them. All methods control the FDR at the desired level.

4. GhostKnockoffs with Penalized Regression: Missing Empirical Covariance

4.1. Setting

Thus far, we have discussed how incorporating the additional information from $X^\top X$ and n could enhance our ability to detect significant features. However, in applications such as genetics, $X^\top X$ may not be available. In this section, we propose alternative procedures for when the scientist only knows $X^\top Y$, $\|Y\|_2$ and the sample size n. As before, we assume that $X \sim \mathcal{N}(0, \Sigma)$, where the covariance matrix $\Sigma$ is known (or can be estimated from other data sources).

4.2. GhostKnockoffs with pseudo-lasso

The idea of our method is to modify the Lasso objective function so that it can be constructed from the available summary statistics. It turns out that the solution of our modified objective function is proportional to that of the scout procedure (with known precision matrix) proposed by Witten and Tibshirani [2009]. We will see through simulation studies that our procedure improves the power of the original GhostKnockoffs method of [He et al., 2022] while maintaining FDR control.

4.2.1. The procedure

Recall that in the knockoffs procedure with the Lasso statistic, we solve the following optimization problem:

$$\hat{\beta}(\lambda) = \arg\min_{\beta \in \mathbb{R}^{2p}} \frac{1}{2n}\beta^\top \begin{bmatrix} X^\top X & X^\top \tilde{X} \\ \tilde{X}^\top X & \tilde{X}^\top \tilde{X} \end{bmatrix}\beta - \frac{1}{n}\beta^\top \begin{bmatrix} X^\top Y \\ \tilde{X}^\top Y \end{bmatrix} + \lambda\|\beta\|_1.$$

To mimic the form of the loss function when we do not observe the empirical covariance of the features, we may want to substitute the empirical quantities with their population versions: i.e., we swap $X^\top X/n$ and $\tilde{X}^\top\tilde{X}/n$ with $\Sigma$, and $X^\top\tilde{X}/n$ with $\Sigma - D$. As usual, $D = \mathrm{diag}\{s\}$ is obtained by solving the convex optimization problem (15). In the language of fixed-X knockoffs [Barber and Candès, 2015], this is equivalent to regarding $\tilde{X}$ as a fixed-X knockoff of X and replacing $X^\top X/n$ by $\Sigma$. This yields Algorithm 4.

Algorithm 4 GhostKnockoffs with Penalized Regression: Missing Empirical Covariance
1: Input: $X^\top Y$, $\|Y\|_2^2$, $\Sigma$, and $n$.
2: Simulate $Z \sim \mathcal{N}(0, V)$, where V is defined as in Algorithm 2.
3: Solve $\hat{\beta}(\lambda) = \arg\min_{\beta \in \mathbb{R}^{2p}} \frac{1}{2}\beta^\top \begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}\beta - \frac{1}{n}\beta^\top \begin{bmatrix} X^\top Y \\ P^\top X^\top Y + \|Y\|_2 Z\end{bmatrix} + \lambda\|\beta\|_1$, where D and P are defined as in Section 2.2.2 and λ is fixed or chosen as in Section 4.2.2.
4: Run the standard knockoffs procedure (at level q) with importance statistic $W_j = |\hat{\beta}_j(\lambda)| - |\hat{\beta}_{j+p}(\lambda)|$.
5: Output: Knockoffs selection set.
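Step 3 of Algorithm 4 is a Lasso-type problem whose data are the fixed 2p × 2p matrix $C = \begin{bmatrix}\Sigma & \Sigma - D\\ \Sigma - D & \Sigma\end{bmatrix}$ (possibly ridge-stabilized) and the 2p-vector $d = \frac{1}{n}\begin{bmatrix}X^\top Y\\ P^\top X^\top Y + \|Y\|_2 Z\end{bmatrix}$. Below is a minimal proximal-gradient (ISTA) sketch of a solver for this objective; it is our own generic solver, not the BASIL implementation referenced later.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def pseudo_lasso(C, d, lam, n_iter=5000, tol=1e-8):
    """Minimize 0.5 * b^T C b - d^T b + lam * ||b||_1 by proximal gradient (ISTA)."""
    step = 1.0 / np.linalg.eigvalsh(C).max()   # 1 / Lipschitz constant of the smooth part
    beta = np.zeros(C.shape[0])
    for _ in range(n_iter):
        beta_new = soft_threshold(beta - step * (C @ beta - d), step * lam)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# feature importance: W_j = |beta[j]| - |beta[j + p]| for j = 1, ..., p
```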

We call this procedure GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso). We show below that Algorithm 4 controls the FDR of selections at level q. Before doing so, we first state a general proposition that includes GK-marginal as a special case.

Proposition 2. Suppose V and P are defined as in Algorithm 2, $Z \sim \mathcal{N}(0, V)$ is independent of X and Y, and $\tilde{X} = \mathcal{G}(X, \Sigma)$. Consider a knockoffs feature importance statistic $W = g(\|Y\|_2^2, X^\top Y, \tilde{X}^\top Y, U) \in \mathbb{R}^p$, which is a deterministic function of $\|Y\|_2^2$, $X^\top Y$, $\tilde{X}^\top Y$ and an independent random variable U. Define $\hat{W} = g(\|Y\|_2^2, X^\top Y, P^\top X^\top Y + \|Y\|_2 Z, U)$. Let $\mathcal{S}_1$ (resp. $\mathcal{S}_2$) be the rejection set obtained from applying the knockoff filter on W (resp. $\hat{W}$). Then $\mathcal{S}_1 \mid X, Y \overset{d}{=} \mathcal{S}_2 \mid X, Y$. Thus, if W obeys the flip-sign property, both procedures have equal FDR, at most equal to q.

Proof. In Appendix B, we prove that

$$\tilde{X}^\top Y \mid X, Y \overset{d}{=} P^\top X^\top Y + \|Y\|_2 Z \mid X, Y.$$

As a result, $W \mid X, Y \overset{d}{=} \hat{W} \mid X, Y$. Since the selection set is uniquely determined by the values of W (or $\hat{W}$), it follows that $\mathcal{S}_1 \mid X, Y \overset{d}{=} \mathcal{S}_2 \mid X, Y$. Therefore, the procedures have the same FDR. □

Set λ to be a fixed numerical constant. Consider the feature importance statistics W defined by $W_j = |\hat{\beta}_j(\lambda)| - |\hat{\beta}_{j+p}(\lambda)|$, where $\hat{\beta}(\lambda)$ is the solution to

$$\arg\min_{\beta \in \mathbb{R}^{2p}} \frac{1}{2}\beta^\top \begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}\beta - \frac{1}{n}\beta^\top \begin{bmatrix} X^\top Y \\ \tilde{X}^\top Y\end{bmatrix} + \lambda\|\beta\|_1, \quad (11)$$

and $\tilde{X} = \mathcal{G}(X, \Sigma)$ is the Gaussian knockoff data matrix. The feature importance statistic in Algorithm 4 is thus obtained by replacing $\tilde{X}^\top Y$ by $P^\top X^\top Y + \|Y\|_2 Z$ in (11). Since W is determined by $\|Y\|_2^2$, $X^\top Y$ and $\tilde{X}^\top Y$, it follows from Proposition 2 that the rejection set of Algorithm 4 has the same distribution as that obtained from running the knockoff filter on W.

Thus to prove that Algorithm 4 controls the FDR of rejections at level q, it suffices to verify the flip-sign property of the feature importance statistic for W (see Section 2). This is a consequence of the following lemma:

Lemma 1. Consider the problem

$$\arg\min_{\beta \in \mathbb{R}^{2p}} \frac{1}{2}\beta^\top C\beta - d^\top\beta + \lambda\|\beta\|_1 + \gamma\|\beta\|_2^2. \quad (12)$$

Let $\Pi_S$ be any permutation matrix which swaps the jth and (j+p)th entries of a 2p-dimensional vector for each $j \in S \subseteq \{1, \ldots, p\}$. Assume that C is S-swap invariant in the sense that $\Pi_S^\top C\Pi_S = C$. Then $\hat{\beta}$ is a solution to (12) with data d if and only if $\Pi_S\hat{\beta}$ is a solution to the same problem with data $\Pi_S d$. In other words, swapping the entries of d has the effect of swapping the corresponding entries of the solution.

Proof. Consider the objective with problem data $\Pi_S d$:

$$\frac{1}{2}\beta^\top C\beta - (\Pi_S d)^\top\beta + \lambda\|\beta\|_1 + \gamma\|\beta\|_2^2 = \frac{1}{2}\beta^\top C\beta - d^\top\Pi_S^\top\beta + \lambda\|\beta\|_1 + \gamma\|\beta\|_2^2.$$

Set $\beta' = \Pi_S^\top\beta$ so that $\beta = \Pi_S\beta'$. Upon changing variables, the objective takes the form

$$\frac{1}{2}\beta'^\top\Pi_S^\top C\Pi_S\beta' - d^\top\beta' + \lambda\|\Pi_S\beta'\|_1 + \gamma\|\Pi_S\beta'\|_2^2 = \frac{1}{2}\beta'^\top C\beta' - d^\top\beta' + \lambda\|\beta'\|_1 + \gamma\|\beta'\|_2^2,$$

where the equality follows because $\Pi_S^\top C\Pi_S = C$ and because the 1-norm and 2-norm are invariant under permutation. Now, the objective on the right-hand side is the original objective with data d. If $\hat{\beta}$ is the solution with data d, it follows that $\Pi_S\hat{\beta}$ is the solution with data $\Pi_S d$, and vice versa. This proves the lemma. □

Corollary 2. Algorithm 4 with a fixed λ controls the FDR of rejections at level q.

Proof. It is easy to show that $\begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}$ is S-swap invariant for any $S \subseteq \{1, \ldots, p\}$. Taking

$$C = \begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix} \quad \text{and} \quad d = \frac{1}{n}\begin{bmatrix} X^\top Y \\ \tilde{X}^\top Y\end{bmatrix}$$

in Lemma 1 establishes the flip-sign property of W and, therefore, the FDR control of Algorithm 4 for a fixed λ. □

In practice, to ensure numerical stability, we add a small positive constant multiple of the identity matrix to $\begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}$ when solving for $\hat{\beta}$. This is equivalent to incorporating a small Ridge penalty into the objective function. It is easy to see that the lemma proved above guarantees that this modification does not compromise the FDR control, as

$$\begin{bmatrix}\Sigma + cI & \Sigma - D \\ \Sigma - D & \Sigma + cI\end{bmatrix}$$

is also S-swap invariant for any $c \in \mathbb{R}$ and any $S \subseteq \{1, \ldots, p\}$.

4.2.2. Choice of tuning parameter

Several methods can be used to tune the value of the hyperparameter λ. We here consider two approaches.

Method 1 (lasso-min)

Pretend a homogeneous Gaussian linear model holds, i.e., $Y = X\beta^* + \sigma\epsilon$ for some $\beta^* \in \mathbb{R}^p$, $\sigma > 0$ and $\epsilon \sim \mathcal{N}(0, I_n)$.

Focus on (11) first and imagine that we have a method for computing λ that depends on the data only through $\|Y\|_2^2$, $X^\top Y$, and $\tilde{X}^\top Y$. Note that the objective in Algorithm 4 only substitutes $\tilde{X}^\top Y$ in (11) with $P^\top X^\top Y + \|Y\|_2 Z$. Therefore, by Proposition 2, if we set λ via the same functional and work with $P^\top X^\top Y + \|Y\|_2 Z$ in lieu of $\tilde{X}^\top Y$, we shall achieve FDR control with this data-driven value of the hyperparameter λ. This holds of course with the proviso that our selection of hyperparameter is symmetric in the sense that it produces feature importance statistics obeying the flip-sign property.

To set the tuning parameter $\lambda_0$ in (11), we use the common choice of taking a constant multiple of the expected value of the minimum λ value such that $\hat{\beta}(\lambda) = 0_{2p}$ under the null model $Y = \sigma\epsilon$. By the Karush-Kuhn-Tucker (KKT) conditions [Boyd and Vandenberghe, 2004], this results in a tuning parameter of the form

$$\lambda_0 = \frac{\kappa\sigma}{n}\,\mathbb{E}\left\|[X\ \tilde{X}]^\top\epsilon\right\|_\infty,$$

where κ is a hyperparameter between 0 and 1. Since $[X\ \tilde{X}]$ is a data matrix whose rows are i.i.d. samples from

$$\mathcal{N}\left(0, \begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}\right),$$

$\mathbb{E}\left\|[X\ \tilde{X}]^\top\epsilon\right\|_\infty$ is a numerical constant, which can be estimated arbitrarily well via Monte Carlo simulation. We use the approach from Dicker [2014] to obtain an estimate of σ, which crucially requires knowing only $\|Y\|_2^2$, $X^\top Y$, and $\tilde{X}^\top Y$. Dicker [2014] showed that the estimator is consistent and asymptotically normal in the high-dimensional regime. Specifically, in our setting, we estimate $\sigma^2$ by

$$\hat{\sigma}_0^2 = \max\left(\frac{2p + n + 1}{n(n+1)}\|Y\|_2^2 - \frac{1}{n(n+1)}\,Y^\top[X\ \tilde{X}]\begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}^{-1}[X\ \tilde{X}]^\top Y,\ 0\right),$$

and set $\hat{\sigma}_0 = \sqrt{\hat{\sigma}_0^2}$.

In sum, a choice for λ in Algorithm 4 is this:

1. Approximate $\mathbb{E}\|R^\top\epsilon\|_\infty$ via Monte Carlo simulation, where $R \in \mathbb{R}^{n \times 2p}$ has i.i.d. $\mathcal{N}\left(0, \begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}\right)$ rows and $\epsilon \sim \mathcal{N}(0, I_n)$ is independent of R.

2. Compute
$$\hat{\sigma}_0^2 = \max\left(\frac{2p + n + 1}{n(n+1)}\|Y\|_2^2 - \frac{1}{n(n+1)}\begin{bmatrix} X^\top Y \\ P^\top X^\top Y + \|Y\|_2 Z\end{bmatrix}^\top\begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}^{-1}\begin{bmatrix} X^\top Y \\ P^\top X^\top Y + \|Y\|_2 Z\end{bmatrix},\ 0\right),$$
where Z is independent of everything else, and set $\hat{\sigma}_0 = \sqrt{\hat{\sigma}_0^2}$.

3. Output $\lambda \approx \frac{\kappa\hat{\sigma}_0}{n}\,\mathbb{E}\|R^\top\epsilon\|_\infty$, where the approximation sign ≈ reminds us that the expectation is only approximated.

As in the square-root Lasso case, we observe that the power of our method is not very sensitive to the choice of κ. We use κ=0.6 in our simulations below. In Appendix E, we provide details of computation of λ and prove that Algorithm 4 maintains FDR control with the computed λ.
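A sketch putting the three lasso-min steps together; following our reading of the formulas above, the max(…, 0) expression is treated as an estimate of σ² and its square root is plugged into λ, with κ = 0.6 as in the text. All names are illustrative.

```python
import numpy as np

def lasso_min_lambda(XtY, Y_norm2, Sigma, D, P, Z, n, kappa=0.6,
                     n_mc=200, rng=np.random.default_rng()):
    p = Sigma.shape[0]
    C = np.block([[Sigma, Sigma - D], [Sigma - D, Sigma]])
    L = np.linalg.cholesky(C + 1e-10 * np.eye(2 * p))
    # Step 1: Monte Carlo estimate of E || R^T eps ||_inf, R with i.i.d. N(0, C) rows
    est = np.mean([
        np.max(np.abs((rng.standard_normal((n, 2 * p)) @ L.T).T @ rng.standard_normal(n)))
        for _ in range(n_mc)
    ])
    # Step 2: Dicker-style noise estimate from the summary statistics
    d_vec = np.concatenate([XtY, P.T @ XtY + np.sqrt(Y_norm2) * Z])
    quad = d_vec @ np.linalg.solve(C + 1e-10 * np.eye(2 * p), d_vec)
    sigma2_hat = max((2 * p + n + 1) / (n * (n + 1)) * Y_norm2 - quad / (n * (n + 1)), 0.0)
    # Step 3: lambda = kappa * sigma_hat / n * E || R^T eps ||_inf
    return kappa * np.sqrt(sigma2_hat) / n * est
```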

Method 2 (pseudo-sum)

An alternative way of choosing λ is to adapt the pseudo-summary statistics approach proposed by Zhang et al. [2021]. Set $r = X^\top Y/n$ and $\tilde{r} = P^\top r + \|Y\|_2 Z/n$. The main idea of Zhang et al. [2021] is to generate training summary statistics $r_t$ and validation summary statistics $r_v$ from r and $\tilde{r}$ based on the training and validation sample sizes $n_t$ and $n_v$ respectively (in this paper we take $n_t = 0.8n$ and $n_v = 0.2n$). Following Zhang et al. [2021], we generate the training summary statistics

$$\begin{bmatrix} r \\ \tilde{r}\end{bmatrix}_t = \begin{bmatrix} r \\ \tilde{r}\end{bmatrix} + \sqrt{\frac{n_v}{n \cdot n_t}}\,R,
\quad \text{where} \quad
R \sim \mathcal{N}\left(0, \begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}\right),$$

and the validation summary statistics

$$\begin{bmatrix} r \\ \tilde{r}\end{bmatrix}_v = \frac{1}{n_v}\left(n\begin{bmatrix} r \\ \tilde{r}\end{bmatrix} - n_t\begin{bmatrix} r \\ \tilde{r}\end{bmatrix}_t\right).$$

Given a sequence of candidate λ values, we choose the one which maximizes an approximation f(λ) of the correlation between the predicted values and the true values on the pseudo-validation set. Specifically, Zhang et al. [2021] considered the approximation

$$f(\lambda) = \frac{\hat{\beta}_{t,\lambda}^\top \begin{bmatrix} r \\ \tilde{r}\end{bmatrix}_v}{\sqrt{\hat{\beta}_{t,\lambda}^\top \begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}\hat{\beta}_{t,\lambda}}}, \quad (13)$$

where

$$\hat{\beta}_{t,\lambda} = \arg\min_{\beta \in \mathbb{R}^{2p}} \frac{1}{2}\beta^\top\begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}\beta - \beta^\top\begin{bmatrix} r \\ \tilde{r}\end{bmatrix}_t + \lambda\|\beta\|_1. \quad (14)$$

Therefore, we choose the λ value that maximizes (13) among a set of candidate values. Since the objective function (14) is convex in β, we may employ the BASIL framework proposed by Qian et al. [2020], which implements a batch version of the strong rules introduced in Tibshirani et al. [2012]. BASIL can be directly applied to compute the solution path of (14) efficiently.
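A sketch of the pseudo-sum tuning, reusing the `pseudo_lasso` solver sketched after Algorithm 4. It follows (13)-(14) as reconstructed above; in particular, the $\sqrt{n_v/(n\,n_t)}$ scaling and the square root in the denominator of (13) are our reading of the garbled display and should be checked against Zhang et al. [2021].

```python
import numpy as np

def pseudo_sum_lambda(r_aug, C, lam_grid, n, rng=np.random.default_rng()):
    """Pick lambda maximizing the pseudo-validation criterion (13); r_aug = [r; r_tilde]."""
    n_t, n_v = 0.8 * n, 0.2 * n
    dim = len(r_aug)
    L = np.linalg.cholesky(C + 1e-10 * np.eye(dim))
    noise = L @ rng.standard_normal(dim)                 # one draw from N(0, C)
    r_train = r_aug + np.sqrt(n_v / (n * n_t)) * noise   # pseudo training statistics
    r_valid = (n * r_aug - n_t * r_train) / n_v          # pseudo validation statistics
    best_lam, best_score = None, -np.inf
    for lam in lam_grid:
        beta = pseudo_lasso(C, r_train, lam)             # solve (14)
        if not np.any(beta):
            continue                                     # skip the all-zero solution
        score = (beta @ r_valid) / np.sqrt(beta @ C @ beta)   # criterion (13)
        if score > best_score:
            best_lam, best_score = lam, score
    return best_lam
```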

Note that there exist other ways to choose the penalty level λ using $X^\top Y$, $\|Y\|_2$ and n (for example, Lassosum by Mak et al. [2017]). We do not attempt to claim an optimal strategy.

Connection with the scout procedure

It turns out that step 3 of Algorithm 4 is closely related to the scout procedure [Witten and Tibshirani, 2009]. The scout procedure defines a family of covariance-regularized regression methods that achieve superior prediction by shrinking the inverse covariance matrix. It includes the Lasso, Ridge and Elastic Net as special cases. In Appendix F, we show that the solution of the objective function (11) is proportional to that of the scout procedure (with known precision matrix $\Sigma^{-1}$). This connection provides a justification for why the objective function (11) is effective.

4.2.3. GhostKnockoffs with other feature importance statistics

In the previous sections, we presented a feature importance statistic based on summary statistics that leads to better power than the marginal correlation difference statistic. By Proposition 2, GhostKnockoffs techniques can be combined with any other feature importance statistics that i) are based on the summary statistics $X^\top Y$, $\|Y\|_2$ and the sample size n and ii) satisfy the flip-sign property. The procedures generated will still guarantee FDR control. In our simulation studies, we found that using the posterior inclusion probability (PIP) produced by the SuSiE-RSS model [Zou et al., 2022] as the feature importance statistic also results in consistent power improvement over GK-marginal. SuSiE-RSS is based on the Sum of Single Effects (SuSiE) model proposed by Wang et al. [2020], which assumes a Bayesian linear model with true coefficients β represented as the sum of multiple one-hot (random) individual effect vectors. Zou et al. [2022] combine SuSiE with a modified likelihood function to accommodate applications in which only summary statistics are available (see Zou et al. [2022] for details). We call the resulting procedure GhostKnockoffs with SuSiE-RSS statistic and denote it by GK-susie-rss. We include this method in the simulation section below.

4.3. Variants of GhostKnockoffs

The methods we presented so far can be adapted to work with various related procedures. We give three examples below for illustration.

4.3.1. Multi-knockoffs

The knockoffs procedure is a randomized procedure which could produce very different selection sets on different runs. This is especially true when the knockoffs rejection set is small. In fact, the offset in the numerator of (3) implies that knockoffs either rejects more than 1/q hypotheses, where q is the target FDR level, or rejects nothing. To improve the stability of the knockoffs procedure, Gimenez and Zou [2019] proposed simultaneous multi-knockoffs, which is substantially more stable and powerful than knockoffs when the rejection set is small and maintains FDR control in general.

The idea of Gimenez and Zou [2019] is to create M (instead of one) knockoff copies of every feature so that they jointly satisfy an extended exchangeability condition. If $X \sim \mathcal{N}(0, \Sigma)$, Gimenez and Zou [2019] showed that $(\tilde{X}^{(1)}, \ldots, \tilde{X}^{(M)}) \in \mathbb{R}^{pM}$ is a valid M multi-knockoff for $X \in \mathbb{R}^p$ if $(X, \tilde{X}^{(1)}, \ldots, \tilde{X}^{(M)}) \sim \mathcal{N}(0, G)$, where

$$G = \begin{bmatrix}\Sigma & \Sigma - D & \cdots & \Sigma - D \\ \Sigma - D & \Sigma & \cdots & \Sigma - D \\ \vdots & \vdots & \ddots & \vdots \\ \Sigma - D & \Sigma - D & \cdots & \Sigma\end{bmatrix} \in \mathbb{R}^{(M+1)p \times (M+1)p}.$$

Here, $D = \mathrm{diag}\{s\}$, and s is obtained by solving a more restrictive convex optimization problem than in (15), which guarantees that G is positive semi-definite (see Gimenez and Zou [2019] for details). In data matrix form, we generate valid M multi-knockoffs by

$$\tilde{X} = XP + EV^{1/2},$$

where $P = [I - \Sigma^{-1}D, \ldots, I - \Sigma^{-1}D] \in \mathbb{R}^{p \times Mp}$, $E \in \mathbb{R}^{n \times Mp}$ has i.i.d. standard normal entries, and

$$V = \begin{bmatrix}2D - D\Sigma^{-1}D & D - D\Sigma^{-1}D & \cdots & D - D\Sigma^{-1}D \\ D - D\Sigma^{-1}D & 2D - D\Sigma^{-1}D & \cdots & D - D\Sigma^{-1}D \\ \vdots & \vdots & \ddots & \vdots \\ D - D\Sigma^{-1}D & D - D\Sigma^{-1}D & \cdots & 2D - D\Sigma^{-1}D\end{bmatrix} \in \mathbb{R}^{Mp \times Mp}.$$

Gimenez and Zou [2019] generalized the knockoffs threshold (3) and the flip-sign property to produce FDR-controlling rejection sets after generating multiple knockoffs via this procedure.

In the summary statistics setting, upon redefining P, V and s as above and replacing the standard knockoff filter by the multi-knockoff filter, Algorithms 2 and 3 produce rejection sets that have the same distribution as those produced by their corresponding versions with individual-level data. For Algorithm 4, we simply need to further replace

$$\begin{bmatrix}\Sigma & \Sigma - D \\ \Sigma - D & \Sigma\end{bmatrix}$$

by G.
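For completeness, a sketch of assembling the block matrices G and V used for M multi-knockoffs from Σ and D; the vector s must come from the more restrictive convex program mentioned above, which we do not reproduce.

```python
import numpy as np

def multi_knockoff_blocks(Sigma, s, M):
    """Build G ((M+1)p x (M+1)p) and V (Mp x Mp) for M multi-knockoffs."""
    D = np.diag(s)
    off_G = Sigma - D
    G = np.block([[Sigma if i == j else off_G for j in range(M + 1)]
                  for i in range(M + 1)])
    DSiD = D @ np.linalg.solve(Sigma, D)                 # D Sigma^{-1} D
    diag_V, off_V = 2 * D - DSiD, D - DSiD
    V = np.block([[diag_V if i == j else off_V for j in range(M)]
                  for i in range(M)])
    return G, V
```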

4.3.2. Group knockoffs

When variables are highly correlated, selection procedures become conservative. For example, if a non-null variable $X_j$ is highly correlated with a null variable $X_k$, it becomes difficult to reject $H_{0j}: X_j \perp\!\!\!\perp Y \mid X_{-j}$. This is an important practical concern because highly correlated features are ubiquitous in many settings, particularly GWAS datasets. To overcome this challenge, group knockoffs [Dai and Barber, 2016] can be useful; please see Chu et al. [2023], whose algorithms we employ in the data analyses of Section 5. In group knockoffs, the object of inference is shifted from single variables to groups of highly correlated variables. Specifically, suppose we partition the p features into g groups and reorder all features such that features of the same group are in adjacent columns of X. The objective is to test the group conditional independence hypotheses:

$$H_{\gamma,0}: X_\gamma \perp\!\!\!\perp Y \mid X_{-\gamma},$$

where γ{1,,g} denotes a group and Xγ is the vector of features in group γ. When these groups have strong correlation, single-variable knockoffs may struggle to identify signals, but group knockoffs retain power to identify significant groups. As in Section 4.3.1, all methods described in this paper apply to group knockoffs after redefining D to the equivalent version in group knockoffs. In Appendix G, we detail the construction of group knockoffs and examples of importance scores at the group level for inference.

4.3.3. Conditional randomization test

The conditional randomization test (CRT) [Candès et al., 2018] is an alternative method to test the conditional independence hypotheses $H_j: X_j \perp\!\!\!\perp Y \mid X_{-j}$ for $1 \le j \le p$. By generating a valid 'CRT p-value' $p_j$ for each hypothesis $H_j$, existing multiple testing procedures, including the Benjamini-Hochberg procedure [Benjamini and Hochberg, 1995] and the selective SeqStep+ filter [Li and Candès, 2021], can be used to simultaneously test $H_1, \ldots, H_p$ with FDR control. As shown in Candès et al. [2018] and Wang and Janson [2021], doing so can improve the power of multiple testing at the price of greater computational complexity.

In Appendix H, we introduce Ghostknockoffs for CRT (GhostCRT), which adopts techniques introduced in this paper to the framework of CRT.

4.4. Numerical simulations

We conduct simulations on synthetic data as well as semi-synthetic data generated from a real-world genetic dataset. Specifically, we apply GhostKnockoffs with pseudo-lasso statistic (GK-pseudolasso, defined in Algorithm 4 with tuning parameter λ chosen by either lasso-min or pseudo-sum from Section 4.2.2) and GhostKnockoffs with SuSiE-RSS statistic (GK-susie-rss, defined in Section 4.2.3). We compare their performance with GhostKnockoffs with marginal correlation difference statistic (GK-marginal, defined in Section 2) and the knockoffs procedure with (cross-validated) Lasso coefficient difference statistic based on individual-level data (KF-lassocv). We also demonstrate empirically the robustness of our procedures by showing the FDR control when only an estimate of the true covariance matrix Σ is available and when the features are discrete.

4.4.1. Simulations based on real-world genetic data

To mimic the dependency structure among features in real-world applications, we generate synthetic data based on the whole genome sequencing (WGS) data from the Alzheimer's Disease Sequencing Project (ADSP). The data are obtained from the ADSP consortium following the SNP/Indel Variant Calling Pipeline and data management tool (VCPA) [Leung et al., 2019]. The ADSP WGS data record counts of minor alleles of genetic variants over 16,906 individuals. Using reference populations from the 1000 Genomes Consortium [The 1000 Genomes Project Consortium, 2015], we estimate ancestry rates of each individual by SNPWeights v2.1 [Chen et al., 2013] and extract 6,952 individuals with estimated European ancestry rate greater than 80%. We further restrict our simulations to 2,000 randomly selected genetic variants within 0.5Mb distance of the APOE gene (chr19:44909011–45912650; hg38), whose ε2 allele and ε4 allele are known to be respectively the strongest genetic protective factor and the strongest genetic risk factor for Alzheimer's disease [Serrano-Pozo et al., 2021, Belloy et al., 2023], and with minor allele frequency (MAF) larger than 0.01. Since our simulations focus on performance at identifying relevant clusters of tightly linked variants, we simplify the simulation design by pruning variants to eliminate pairs with absolute correlation greater than 0.75. To do so, we first compute the correlation matrix $(\mathrm{cor}(X_j, X_k))_{2000 \times 2000}$ of the 2,000 selected variants over the 6,952 extracted individuals using the shrinkage estimate in the R package corpcor [Schäfer and Strimmer, 2005] and apply hierarchical clustering (single-linkage with cutoff value 0.25) on the distance matrix $(1 - |\mathrm{cor}(X_j, X_k)|)_{2000 \times 2000}$. As a result, we obtain 512 variant clusters such that the pairwise correlation between any pair of variants from different clusters is in [-0.75, 0.75]. By randomly choosing one representative variant from each cluster, we include p = 512 tested genetic variants in the simulation study.

For each replicate, we obtain synthetic data by randomly sampling n = 3,000 individuals without replacement and collecting the sampled individuals' records on the p = 512 tested genetic variants as the n × p covariate matrix X. We further sample another n = 3,000 individuals without replacement as the reference panel on which we compute the correlation matrix $\Sigma$ using the shrinkage estimate in the R package corpcor [Schäfer and Strimmer, 2005]. Based on the covariate matrix X, we generate the response vector $Y = (Y_1, \ldots, Y_n)$ from either the linear model (continuous response),

$$Y_i = \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i^C, \quad \text{where } \epsilon_i^C \sim \mathcal{N}(0, 3^2),$$

or the mixed-effect logit model (binary response),

$$Y_i \sim \mathrm{Bernoulli}(\mu_i), \quad \text{where } g(\mu_i) = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i^B, \quad \epsilon_i^B \sim \mathcal{N}(0, 1^2), \text{ and } g(x) = \log\frac{x}{1-x}.$$

Specifically, $\beta_0$ under the mixed-effect logit model is $-\log(9)$ so that the prevalence (or the expected proportion of $Y_i = 1$) is 10%. The $\epsilon_i^C$'s and $\epsilon_i^B$'s reflect variation due to unobserved covariates. Only 10 randomly selected coefficients $\beta_j$ are nonzero, with value $\beta_j = \frac{1}{\sqrt{20\,m_j(1-m_j)}}$, where $m_j$ is the MAF of the j-th variant.

With the relevant summary statistics computed, we apply GK-pseudolasso and GK-susie-rss and compare their performances with GK-marginal and KF-lassocv.

Over 1000 replicates under both the linear model and the mixed-effect logit model, average power and FDR of different methods with respect to different target FDR levels are visualized in Figure 3. Under both models, we observe that GK-pseudolasso with both ways of selecting the tuning parameter and GK-susie-rss are uniformly more powerful than GK-marginal. The performance of the proposed methods is very close to that of KF-lassocv. Despite the covariance matrix being estimated using an independent sample and the entries of X being discrete, the FDRs of our proposed methods are controlled in both settings, suggesting the robustness of our methods.

Figure 3: Average power and FDR over 1000 replications with respect to different target FDR levels in simulations based on genetic data, where features are genotypes of existing patients, and the response is simulated from a linear model (continuous response) or a mixed-effect logit model (binary response).

GhostKnockoffs with discrete features

We note that discrete covariates do not follow a Gaussian distribution. However, the knockoffs procedure ensures FDR control whenever the feature importance statistics are of the form $W_j = w(T_j, T_{p+j})$, where w is an anti-symmetric function and $T \in \mathbb{R}^{2p}$ is distributionally invariant upon swapping $T_j$ with $T_{j+p}$ for each null j. Using Lemma 1, we know that Algorithm 4 controls the FDR if swapping the j-th entry of $Z_s = X^\top Y$ and the j-th entry of $\tilde{Z}_s = P^\top X^\top Y + \|Y\|_2 Z$ does not change their joint distribution for each null j. In Appendix J, we visually demonstrate the approximate preservation of this distributional invariance. This, along with the robustness of knockoffs [Candès et al., 2018, Barber et al., 2020], helps explain why we have not observed FDR inflation with discrete covariates.

4.4.2. Independent features

We revisit the setting from Section 3.5.1 in which $\Sigma = I_p$. For the pseudo-sum method for GK-pseudolasso, we optimize over λ using a grid of 100 candidate values interpolating between $\lambda_{\max}$ and $\lambda_{\max}/1000$ linearly on the log scale, where

$$\lambda_{\max} = \frac{1}{n}\left\|\begin{bmatrix} X^\top Y \\ P^\top X^\top Y + \|Y\|_2 Z\end{bmatrix}\right\|_\infty$$

is the minimal λ value that shrinks all the coefficients to zero. To calculate $\mathbb{E}\|R^\top\epsilon\|_\infty$ for the lasso-min parameter method, we use a Monte Carlo estimate averaged over 200 samples. The target FDR is 20%. Each point represents an average over 200 replications.

Note that when $\Sigma = I_p$, the solution to (15) is $D = I_p$. It is easy to see that (11) gives

$$\hat{\beta} = S_\lambda\!\left(\frac{1}{n}[X\ \tilde{X}]^\top Y\right),$$

where the soft-thresholding operator $S_\lambda(x) = \mathrm{sign}(x)(|x| - \lambda)_+$ is applied coordinate-wise. Therefore, the method in Section 4.2 soft-thresholds the marginal correlations of X and Y.
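A quick numerical check of this closed form against the generic `pseudo_lasso` solver sketched earlier (Σ = I_p, so the quadratic term is the identity); the vector d below merely stands in for $\frac{1}{n}[X\ \tilde{X}]^\top Y$.

```python
import numpy as np

p, lam = 5, 0.1
rng = np.random.default_rng(0)
d = rng.standard_normal(2 * p)                      # stands in for (1/n) [X X_tilde]^T Y
beta_solver = pseudo_lasso(np.eye(2 * p), d, lam)   # generic proximal-gradient solution
beta_closed = np.sign(d) * np.maximum(np.abs(d) - lam, 0.0)   # S_lambda applied coordinate-wise
assert np.allclose(beta_solver, beta_closed, atol=1e-6)
```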

As shown in Figure 4, all three new methods (GK-pseudolasso with lasso-min/pseudo-sum and GK-susie-rss) consistently outperform GK-marginal, and the FDR is always controlled at the expected level, as theoretically guaranteed.

Figure 4: Power and FDR plots for independent features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

As n/p grows, we see that the three new methods have power closer to KF-lassocv. This is further demonstrated in additional simulations in Appendix I.

4.4.3. AR(1) features

Figure 5 shows the corresponding plots when the covariate matrix is generated from an AR(1) distribution. We found similar patterns to those with independent features. The power of all methods drops when the autocorrelation coefficient increases, as it is then harder to separate true signals from other variables.

Figure 5: Power and FDR plots for AR(1) features and a Gaussian linear model with varying dimensions. Each point is an average over 200 replications.

5. Application to meta-analysis for Alzheimer’s disease

To illustrate the empirical performance of the methods in detecting genetic variants associated with Alzheimer's disease (AD), we apply them to a meta-analysis of nine large-scale array-based genome-wide association and whole-exome/whole-genome sequencing studies for AD. We include the details of the nine studies in Appendix K.

As all studies share the same focus on individuals with European ancestry, we perform a meta-analysis by aggregating their Z-scores and obtain the meta-analysis Z-score $Z_{\text{meta}}$ (see Appendix L for details). In addition, we obtain the block-diagonal covariance matrix $\Sigma$ with respect to approximately independent linkage disequilibrium blocks provided by Berisa and Pickrell [2016]. Within each block, we use the UK Biobank directly genotyped data as the reference panel and compute the covariance matrix via the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org), with details in Appendix M. To improve the power in the presence of tightly linked variants, we apply the group knockoffs construction on top of the GhostKnockoff algorithm, as detailed in Section 4.3.2. Finally, we implement GK-pseudolasso with the tuning parameter chosen by the lasso-min method on the meta-analysis Z-score $Z_{\text{meta}}$ and the covariance matrix $\Sigma$. To stabilize the GhostKnockoffs procedures, we use M = 5 multi-knockoffs as defined in Section 4.3.1.

Figure 6 presents the result of the meta-analysis of the nine studies via our proposed method with target FDR level 0.1. Here, we specify loci based on variant groups and annotate two loci as different loci if they are 1 Mb away from each other. We adopt the most proximal gene's name as the locus name. As shown by Table 1 in Appendix N, GK-pseudolasso identifies variant groups in 42 and 63 loci when the target FDR level is 0.1 and 0.2 respectively, substantially more than GK-marginal (10 and 17 when the target FDR level is 0.1 and 0.2, respectively). This is consistent with our simulation results in Section 4.4. In addition, we observe from Table 1 that GK-susie-rss identifies fewer loci (35 and 47 when the target FDR level is 0.1 and 0.2, respectively), although it exhibits similar power in simulation studies. In Appendix O we analogously visualize the results of the meta-analysis via the conventional marginal association test (with p-value cutoff 5 × 10−8), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).

Figure 6: Graphical representation of the feature importance statistics after applying GK-pseudolasso to a meta-analysis of AD. Each point represents a group of genetic variants. With a target FDR level of 0.1, identified groups are highlighted in blue or purple. For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple). Variant density is shown at the bottom of the plot (number of variants per 1Mb).

Table 2 in Appendix N shows the top variant with the largest feature importance statistic in each identified group. Most discoveries exhibit relatively strong marginal associations (marginal p-value ≤ 0.05) in individual studies and the same direction of effects across all studies. Although some loci have an opposite direction of effect in one individual study, such effects are not significant. The consistency across individual studies supports the validity of the proposed method in discovering putative causal variants. In addition, we observe that all top variants of identified groups have small meta-analysis p-values (less than 0.05), though some are not smaller than the stringent genome-wide threshold (5 × 10−8) in marginal association tests with FWER control.

Table 2:

Details of top variants of identified loci given by GK-pseudolasso (target FDR level: 0.20).

Columns: Chr. | SNP (position) | Ref. | Alt. | Top S2G gene | Closest gene | Z-scores from the nine individual studies (Study 1 through Study 9) | Meta-analysis Z-score | W (feature importance statistic) | Marginal p-value. Each row below lists these fields in this order.
1 20853688 C T EIF4G3 EIF4G3 2.72 3.93 3.29 4.29 2.89 - - 0.68 1.65 4.95 2.516×10−3 3.697×10−7
1 200984367 A G KIF21B KIF21B −2.21 −3.73 −4.33 −3.60 −2.92 - - - −0.67 −4.36 1.837×10−3 6.490×10−6
1 207611623 A G CR1 CR1 −4.84 −8.81 −7.97 −9.65 −6.37 - - - −3.96 −10.97 1.228× 10−2 2.802× 10−28
2 37270395 G A - NDUFAF7 1.50 4.02 3.73 4.03 2.05 - - - - 4.62 1.998×10−3 1.953×10−6
2 44026309 T C - LRPPRC −1.23 −3.80 −1.88 −3.55 −2.26 - - - −3.64 −4.37 2.089×10−3 6.208×10−6
2 65409567 G A - SPRED2 - −3.93 −2.41 −4.05 −0.23 - - - 0.15 −4.44 1.993×10−3 4.538×10−6
2 105805908 T C - NCK2 0.10 −3.94 −2.80 −4.67 −2.08 - - - - −4.72 2.490× 10−3 1.185×10−6
2 127136908 A T BIN1 BIN1 3.77 10.94 8.68 11.95 8.74 - - - 4.90 13.36 1.141×10−2 5.466× 10−41
2 233117202 G C NGEF INPP5D 2.03 6.15 5.16 6.42 2.29 - - - 2.40 7.21 4.762× 10−3 2.826× 10−13
3 136105288 G A SLC35G2 PPP2R3A −1.24 −3.66 −2.03 −4.64 −1.98 - - - −3.04 −4.84 2.250×10−3 6.607×10−7
4 11024404 A G - CLNK −2.47 −6.00 −4.06 −6.50 −4.32 - - - −2.52 −7.32 5.039×10−3 1.275× 10−13
4 71303158 G A - SLC4A4 −2.60 −3.77 −3.65 −3.34 −1.49 - - - −1.53 −4.28 1.817×10−3 9.466×10−6
4 112082387 A C - FAM241A 2.47 4.82 1.76 3.36 0.40 - - - −0.71 4.68 2.478× 10−3 1.418×10−6
4 143428212 C T - GAB1 - −3.68 −2.84 −3.95 −1.56 - - - −1.36 −4.37 2.017×10−3 6.081×10−6
4 158808801 G A RAPGEF2 FNIP2 −2.43 −3.95 −2.39 −3.82 −2.40 - - - −1.91 −4.66 1.894×10−3 1.553×10−6
5 4068226 C T - IRX1 - 4.53 1.20 3.62 0.34 - - - −0.91 4.50 2.144× 10−3 3.323×10−6
5 14707491 C T ANKH ANKH −3.92 −3.20 −3.95 −4.36 −3.60 - - - 0.60 −4.66 2.480× 10−3 1.602×10−6
5 86923485 A G - LINC02059 2.49 4.70 2.45 3.76 3.49 - - - 2.49 5.12 3.059×10−3 1.517×10−7
5 177559423 G A RAB24 FAM193B 1.96 3.85 3.99 4.16 2.48 - - - 1.38 4.71 2.313×10−3 1.248×10−6
5 179373099 C T - ADAMTS2 1.52 2.54 4.29 4.80 3.12 - - - 2.35 4.36 1.938×10−3 6.512×10−6
6 935171 T C - LINC01622 −2.80 −3.20 −3.33 −4.55 −3.37 - - - −2.17 −4.75 2.380× 10−3 1.040×10−6
6 32686937 T C HLA-DQA2 HLA-DQB1 −3.88 −6.46 −4.86 −7.53 −2.29 - - - −1.10 −8.13 4.461×10−3 2.090×10−16
6 41066261 G C OARD1 OARD1 2.69 3.78 6.91 7.12 4.06 - - - - 6.37 5.558×10−3 9.364×10−11
6 47484147 C T CD2AP CD2AP 2.95 5.74 5.21 6.10 5.33 - - - 2.24 7.05 3.271× 10−3 8.942×10−13
7 1543652 A G TMEM184A MAFK 2.33 4.06 2.93 3.64 2.36 - - - 0.33 4.54 1.810×10−3 2.868×10−6
7 37842715 G A - NME8 2.95 4.15 3.81 3.74 3.20 - - - 1.13 4.79 2.045× 10−3 8.230×10−7
7 100406823 C T - ZCWPW1 4.25 7.53 4.01 8.41 5.04 - - 3.59 1.29 9.35 8.987× 10−3 4.266× 10−21
7 143410495 G T EPHA1-AS1 EPHA1 1.19 6.56 4.37 6.81 2.70 - - - 1.63 7.52 3.795× 10−3 2.751× 10−14
8 27362470 C T PTK2B PTK2B 3.84 6.79 6.12 7.94 5.19 - - - 2.12 8.70 4.345× 10−3 1.668× 10−18
8 95041772 C T - NDUFAF6 4.06 3.96 4.03 4.50 2.81 - - - 0.36 5.17 2.207× 10−3 1.172×10−7
8 97359646 A G - SNORD3H 2.70 3.01 3.70 3.99 2.42 - - - 1.30 4.25 1.767×10−3 1.067×10−5
8 102564430 G A - ODF1 1.72 4.00 2.66 3.53 1.42 - - - −0.48 4.29 1.855×10−3 8.825×10−6
8 111515902 C T - LINC02237 - 4.13 0.44 3.77 −0.41 - - - - 4.40 2.051×10−3 5.387×10−6
8 144042819 T C PARP10 SPATC1 0.17 4.66 2.47 4.57 3.68 - - - - 5.16 3.389× 10−3 1.210×10−7
10 29966853 G A - JCAD - 3.72 2.05 4.56 1.31 - - - 0.48 4.68 2.501× 10−3 1.443×10−6
10 42722997 T C - LOC283028 0.39 4.79 2.57 4.34 1.14 - - - 0.25 5.02 2.128× 10−3 2.616×10−7
10 59962515 T G - LINC01553 1.43 3.63 3.30 5.14 3.48 - - - 3.17 5.18 2.031× 10−3 1.130×10−7
10 80494228 C T TSPAN14 TSPAN14 3.23 3.22 4.17 5.83 2.03 - - - - 5.35 2.041× 10−3 4.295×10−8
11 60254475 G A - MS4A4E −5.74 −7.97 −8.27 −9.09 −6.66 - - - −3.32 −10.30 8.499× 10−3 3.570× 10−25
11 65888811 G A FIBP FIBP −2.13 −4.59 −1.22 −3.57 −1.62 - - - −0.38 −4.74 2.589× 10−3 1.070×10−6
11 86156833 A G PICALM PICALM 6.78 8.67 8.07 10.55 5.11 - - - 3.08 11.50 1.074×10−2 6.418×10−31
11 121578263 T C - SORL1 −3.10 −4.40 −3.82 −5.59 −3.38 - - - −0.52 −5.90 3.920× 10−3 1.768×10−9
13 43679792 C T - ENOX1 0.19 3.79 1.22 4.28 0.01 - - - −1.03 4.30 1.865×10−3 8.441×10−6
13 93594511 A T - GPC6-AS2 - −0.04 −1.09 −0.85 −2.34 - - - −0.57 −0.62 7.282× 10−2 2.672×10−1
14 32478306 T C AKAP6 AKAP6 −1.45 −4.35 −1.77 −3.63 −0.28 - - - 0.77 −4.44 1.869×10−3 4.449×10−6
14 52924962 A G - FERMT2 4.68 4.58 4.97 6.27 2.90 - - - 1.32 6.58 4.682× 10−3 2.429×10−11
14 92470949 C T - SLC24A4 −3.83 −6.10 −5.16 −6.67 −2.90 - - - −2.58 −7.57 4.647× 10−3 1.836× 10−14
15 50735410 C T HDC SPPL2A −3.16 −4.81 −4.09 −6.02 −2.45 - - - 0.09 −6.29 5.133×10−3 1.547× 10−10
15 58753575 A G - ADAM10 −2.86 −5.90 −4.16 −5.97 −2.81 - - - −2.16 −6.94 3.385× 10−3 1.910×10−12
15 63277703 C T APH1B APH1B 1.20 5.52 3.68 5.72 2.58 2.46 1.61 0.98 2.05 6.45 3.285× 10−3 5.482×10−11
16 31120929 A G KAT8 KAT8 −2.28 −5.50 −2.72 −5.84 −2.89 - - - −1.45 −6.56 3.913×10−3 2.702×10−11
17 5233752 G A SCIMP SCIMP 3.30 6.04 3.82 5.48 1.93 - - - 2.40 6.79 3.297× 10−3 5.560×10−12
17 7581494 G A CD68 LOC100996842 −1.82 −3.60 −1.57 −3.49 −3.37 −1.95 −1.61 −2.72 −3.18 −4.42 1.933×10−3 4.941×10−6
17 49219935 T C ABI3 ABI3 - −4.94 - - −4.75 −2.68 0.20 - −2.61 −5.25 2.982× 10−3 7.430×10−8
17 58331728 G C BZRAP1 MIR142 −1.00 −4.94 −5.09 −5.12 −3.81 - - - −1.35 −5.75 3.909× 10−3 4.412×10−9
17 63482562 C T ACE ACE 2.73 5.07 3.54 5.25 3.92 1.93 2.67 2.09 2.45 6.32 5.299×10−3 1.268×10−10
19 1058177 A G - ABCA7 −0.93 −4.61 −2.73 −4.94 −3.96 −1.16 −1.48 −0.38 0.52 −5.45 4.973× 10−3 2.534×10−8
19 6876985 T C VAV1 ADGRE1 1.05 3.04 3.58 4.42 1.59 - - - 0.42 4.21 2.119× 10−3 1.254×10−5
19 44888997 C T PVRL2 NECTIN2 20.83 51.85 - - - - - - - 53.66 8.573 0.000
19 51224706 C A CD33 CD33 −3.40 −5.84 −5.09 −5.69 −3.76 - - - −3.97 −6.96 4.936× 10−3 1.696×10−12
19 54664811 A G LILRB4 LILRB4 −2.61 −3.61 −3.13 −3.89 −1.05 - - - 0.54 −4.37 1.958×10−3 6.300×10−6
20 56409712 G T CASS4 CASS4 −3.82 −5.84 −4.56 −6.07 −5.14 - - - - −7.12 6.582×10−3 5.526×10−13
21 26775872 C T ADAMTS1 ADAMTS1 −1.60 −2.90 −5.17 −5.54 −3.39 - - - −0.22 −4.87 2.469× 10−3 5.668×10−7

To further investigate whether the identified groups are functionally enriched, we apply a SNP-to-gene linking strategy proposed by Gazal et al. [2022] to link the top variants of identified groups to the genes that they potentially regulate. Out of 63 top variants, we find that 34 (54.0%) can be mapped with functional evidence (e.g., being an expression quantitative trait locus, lying in a Hi-C linked enhancer region, being near the exon of a gene, etc.), a proportion significantly higher than the average percentage over the background genome (28.6%). In summary, the proposed method can identify functional genetic variants with weaker statistical effects that are missed by conventional association tests.

6. Discussion

This paper introduced novel approaches for performing variable selection with FDR control on the basis of summary statistics. We proposed methods for testing conditional independence hypotheses from summary statistics alone. For the methods from Section 4, all we need are essentially the marginal correlations between X and Y,* which, at first sight, may appear surprising. Our arguments rely on the assumption that the covariates follow a Gaussian distribution, as well as on the linearity and rotational invariance of Gaussian distributions. Since our methods are based on the knockoffs procedure, they do not require any knowledge about the model of Y given X. Our methods extend, and generally give better power than, the work of He et al. [2022] by employing penalized regression to produce the measure of feature importance. The techniques employed in this paper provide a wrapper that can be combined with a variety of feature selection methods, yielding knockoffs versions that guarantee FDR control.

We applied our methods to genetic studies, in which summary statistics are typically available. Due to linkage disequilibrium, the application of our methods to individual genetic variants may yield conservative results. In parallel work, Chu et al. [2023] developed tools for constructing group knockoffs efficiently and effectively. When combined, our methods offer a powerful new approach to controlled variable selection in GWAS. This is further supported by our companion work He et al. [2023], where we see that the methods in this paper lead to significant scientific discoveries.

Acknowledgement

Z.C. would like to thank Kevin Guo and Amber Hu for helpful discussions. Z.C. was supported by the Simons Foundation under award 814641. Z.H. was supported by NIH/NIA awards AG066206 and AG066515. T.M. was supported by a B.C. and E.J. Eaves Stanford Graduate Fellowship. C.S. was supported by grants NIH R56HG010812 and NSF DMS2210392. E.J.C. was supported by the Office of Naval Research grant N00014-20-1-2157.

A. Computation of free parameters s

In this paper, we use the semidefinite program (SDP) construction of second-order knockoffs [Candès et al., 2018]. Without loss of generality, we assume that the columns of the data matrix X have been standardized to have mean 0 and variance 1, so that the diagonal entries of Σ are equal to 1. As a result, s is the solution of the convex optimization problem

$\text{minimize} \; \sum_{j=1}^{p} |1 - s_j| \quad \text{subject to} \; s_j \geq 0, \; 1 \leq j \leq p, \; \operatorname{diag}\{s\} \preceq 2\Sigma. \quad (15)$

Other methods to compute s include the minimum variance-based reconstructability (MVR) construction [Spector and Janson, 2022] and maximum entropy (ME) construction [Gimenez and Zou, 2019, Spector and Janson, 2022], which are all compatible with our methods in this paper.
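For concreteness, here is a minimal sketch of how one might solve (15) with an off-the-shelf convex solver. It is an illustration rather than the authors' implementation; the use of the cvxpy modeling package and the SCS solver are our choices, and we assume Σ is a symmetric correlation matrix with unit diagonal.

```python
import numpy as np
import cvxpy as cp

def sdp_knockoff_s(Sigma: np.ndarray) -> np.ndarray:
    """Sketch of (15): minimize sum_j |1 - s_j| subject to s >= 0 and diag(s) <= 2*Sigma (PSD order)."""
    p = Sigma.shape[0]
    s = cp.Variable(p)
    objective = cp.Minimize(cp.sum(cp.abs(1 - s)))
    constraints = [s >= 0, 2 * Sigma - cp.diag(s) >> 0]  # PSD constraint diag(s) <= 2*Sigma
    cp.Problem(objective, constraints).solve(solver=cp.SCS)
    return np.asarray(s.value)
```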

B. Equivalence of GhostKnockoffs and the Gaussian knockoff sampler in sampling the knockoff Z-score $\tilde Z_s$

In this section, we summarize the proof of He et al. [2022] that $\tilde Z_s$ computed by (6) satisfies (5).

Lemma 2. [He et al., 2022] For any $P$ and $V$ computed in step 3 of Algorithm 1, we have

$\tilde Z_s \mid X, Y \overset{d}{=} \tilde X^\top Y \mid X, Y,$

where $\tilde Z_s$ is computed by (6) and $\tilde X$ is the output of Algorithm 1.

Proof. By step 5 of Algorithm 1, we have $\tilde X = XP + EV^{1/2}$, where $E$ is an $n \times p$ matrix with i.i.d. standard Gaussian entries, independent of $X$. Therefore,

$\tilde X^\top Y \mid X, Y = P^\top X^\top Y + V^{1/2} E^\top Y \mid X, Y.$

Because $E^\top Y \mid X, Y \sim \mathcal{N}(0, \|Y\|_2^2 I_p)$, we have

$E^\top Y \mid X, Y \overset{d}{=} \|Y\|_2 S \mid X, Y, \quad \text{where } S \sim \mathcal{N}(0, I_p) \text{ is independent of } X \text{ and } Y.$

Thus, we have

$\tilde X^\top Y \mid X, Y \overset{d}{=} P^\top X^\top Y + \|Y\|_2 V^{1/2} S \mid X, Y \overset{d}{=} P^\top X^\top Y + \|Y\|_2 Z \mid X, Y = \tilde Z_s \mid X, Y,$

where $Z \sim \mathcal{N}(0, V)$ is independent of $X$ and $Y$. □
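The sampling step established by Lemma 2 is easy to carry out from summary statistics alone. The sketch below assumes the standard Gaussian knockoff parametrization $P = I_p - \Sigma^{-1}D$ and $V = 2D - D\Sigma^{-1}D$ [Candès et al., 2018]; whether step 3 of Algorithm 1 uses exactly this form, as well as the small jitter added before the Cholesky factorization, are assumptions of the sketch.

```python
import numpy as np

def sample_knockoff_zscore(XtY, y_norm, Sigma, D, rng=None):
    """Draw Z_tilde_s = P' X'Y + ||Y||_2 * Z with Z ~ N(0, V), as in Lemma 2 (sketch)."""
    rng = np.random.default_rng(rng)
    p = Sigma.shape[0]
    Sigma_inv_D = np.linalg.solve(Sigma, D)        # Sigma^{-1} D
    P = np.eye(p) - Sigma_inv_D                    # assumed form of P
    V = 2 * D - D @ Sigma_inv_D                    # assumed form of V
    L = np.linalg.cholesky(V + 1e-10 * np.eye(p))  # jitter for numerical stability
    Z = L @ rng.standard_normal(p)                 # Z ~ N(0, V)
    return P.T @ XtY + y_norm * Z
```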

C. Proof of Proposition 1

To prove Proposition 1, we need to first prove Lemma 3.

Lemma 3. Let $Z_1$ and $Z_2$ be two real $n \times p$ matrices. For any $n$ and $p$, if $Z_1^\top Z_1 = Z_2^\top Z_2$, there must exist an orthogonal matrix $Q \in \mathbb{R}^{n \times n}$ such that $Z_1 = Q Z_2$.

Proof. Suppose $Z_1^\top Z_1 = Z_2^\top Z_2 = U \Lambda U^\top$, where $U \in \mathbb{R}^{p \times r}$ has orthonormal columns ($U^\top U = I_r$), $\Lambda \in \mathbb{R}^{r \times r}$ is diagonal with positive entries, and $r$ is the rank of $Z_1^\top Z_1$. In other words, we perform the eigen-decomposition of $Z_1^\top Z_1 = Z_2^\top Z_2$ and remove all zero eigenvalues and their corresponding eigenvectors. Note that $U U^\top$ is a projection matrix that projects any vector onto $\operatorname{colspace}(U)$, the column space of $U$.

It is clear that

$\operatorname{colspace}(U \Lambda U^\top) \subseteq \operatorname{colspace}(U).$

Because $U = (U \Lambda U^\top) U \Lambda^{-1}$, we also have

$\operatorname{colspace}(U) \subseteq \operatorname{colspace}(U \Lambda U^\top).$

As a result, we have $\operatorname{colspace}(U \Lambda U^\top) = \operatorname{colspace}(U)$.

Thus, for $k = 1, 2$, $U U^\top$ is a projection matrix that projects any vector onto the column space of $U \Lambda U^\top = Z_k^\top Z_k$. Because $\operatorname{colspace}(Z_k^\top Z_k) = \operatorname{rowspace}(Z_k)$, we have

$Z_k = Z_k U U^\top = (Z_k U \Lambda^{-1/2}) \Lambda^{1/2} U^\top.$

Letting $Q_k = Z_k U \Lambda^{-1/2}$, we have $Z_k = Q_k \Lambda^{1/2} U^\top$ and

$Q_k^\top Q_k = \Lambda^{-1/2} U^\top Z_k^\top Z_k U \Lambda^{-1/2} = \Lambda^{-1/2} U^\top U \Lambda U^\top U \Lambda^{-1/2} = I_r, \quad k = 1, 2.$

Thus, we have

$Z_1 = Q_1 \Lambda^{1/2} U^\top = Q_1 Q_2^\top Q_2 \Lambda^{1/2} U^\top = Q_1 Q_2^\top Z_2,$
$Z_2 = Q_2 \Lambda^{1/2} U^\top = Q_2 Q_1^\top Q_1 \Lambda^{1/2} U^\top = Q_2 Q_1^\top Z_1.$

Because $Q_1^\top Q_1 = Q_2^\top Q_2 = I_r$, there exist $Q_{1\perp}, Q_{2\perp} \in \mathbb{R}^{n \times (n-r)}$ such that $V_1 = [Q_1 \; Q_{1\perp}]$ and $V_2 = [Q_2 \; Q_{2\perp}]$ are both orthogonal matrices. Thus, we have

$Z_1 = Q_1 Q_2^\top Z_2 = V_1 V_2^\top Z_2 - Q_{1\perp} Q_{2\perp}^\top Z_2,$
$Z_2 = Q_2 Q_1^\top Z_1 = V_2 V_1^\top Z_1 - Q_{2\perp} Q_{1\perp}^\top Z_1.$

Substituting $Z_1 = Q_1 Q_2^\top Z_2$ into $Z_2 = Q_2 Q_1^\top Z_1$, we have

$Z_2 = Q_2 Q_1^\top Q_1 Q_2^\top Z_2 = Q_2 Q_2^\top Z_2,$

and thus

$Q_{1\perp} Q_{2\perp}^\top Z_2 = Q_{1\perp} Q_{2\perp}^\top Q_2 Q_2^\top Z_2 = 0.$

Thus, there exists an orthogonal matrix $Q = V_1 V_2^\top$ such that $Z_1 = Q Z_2$. □

We can then prove Proposition 1 as follows. By Lemma 3, since $[\check X \; \check Y]^\top [\check X \; \check Y] = [X \; Y]^\top [X \; Y]$, we know that $[\check X \; \check Y] = Q [X \; Y]$ for some orthogonal matrix $Q$.

Let $E \in \mathbb{R}^{n \times p}$ be a matrix with i.i.d. standard Gaussian entries. Then $Q^\top E$ is also a matrix with i.i.d. standard Gaussian entries (i.e., $E \overset{d}{=} Q^\top E$), and

$(E^\top [\check X \; \check Y], \; E^\top E) \mid X, Y = (E^\top Q [X \; Y], \; E^\top Q Q^\top E) \mid X, Y \overset{d}{=} (E^\top [X \; Y], \; E^\top E) \mid X, Y.$

By the construction of $[\check X \; \check Y]$, we have $\check X^\top \check X = X^\top X$, $\check X^\top \check Y = X^\top Y$, and $\|\check Y\|_2 = \|Y\|_2$. Therefore, we focus on the third, fifth and sixth arguments of $\mathcal{T}$, where

$(\tilde X^\top Y, \; \tilde X^\top X, \; \tilde X^\top \tilde X) \mid X, Y$
$= (P^\top X^\top Y + V^{1/2} E^\top Y, \;\; P^\top X^\top X + V^{1/2} E^\top X, \;\; P^\top X^\top X P + V^{1/2} E^\top E V^{1/2} + P^\top X^\top E V^{1/2} + V^{1/2} E^\top X P) \mid X, Y$
$\overset{d}{=} (P^\top \check X^\top \check Y + V^{1/2} E^\top \check Y, \;\; P^\top \check X^\top \check X + V^{1/2} E^\top \check X, \;\; P^\top \check X^\top \check X P + V^{1/2} E^\top E V^{1/2} + P^\top \check X^\top E V^{1/2} + V^{1/2} E^\top \check X P) \mid X, Y$
$= (\tilde{\check X}^\top \check Y, \; \tilde{\check X}^\top \check X, \; \tilde{\check X}^\top \tilde{\check X}) \mid X, Y.$

Hence,

$\mathcal{T}(X, \tilde X, Y) \mid X, Y \overset{d}{=} \mathcal{T}(\check X, \tilde{\check X}, \check Y) \mid X, Y.$ □

D. Construction of $[\check X \; \check Y]$ via eigen-decomposition

In this section, we give details on how to construct $[\check X \; \check Y]$ such that $[\check X \; \check Y]^\top [\check X \; \check Y] = [X \; Y]^\top [X \; Y]$ using the eigen-decomposition

$[X \; Y]^\top [X \; Y] = U D U^\top, \quad \text{where } U = [u_1 \; \cdots \; u_{p+1}] \text{ is an orthogonal matrix and } D = \operatorname{diag}(d_1, \ldots, d_{p+1})$

with $d_1 \geq \cdots \geq d_{p+1}$. We consider two cases as follows.

Case 1 ($n < p+1$): Since $\operatorname{rank}([X \; Y]^\top [X \; Y]) \leq n$, we have $d_{n+1} = \cdots = d_{p+1} = 0$ and

$[X \; Y]^\top [X \; Y] = U_1 D_n U_1^\top,$

where $U_1 = [u_1 \; \cdots \; u_n]$ and $D_n = \operatorname{diag}(d_1, \ldots, d_n)$. In this case, we let

$[\check X \; \check Y] = D_n^{1/2} U_1^\top$

such that $[\check X \; \check Y]^\top [\check X \; \check Y] = [X \; Y]^\top [X \; Y]$ is satisfied.

Case 2 ($n \geq p+1$): In this case, we let

$[\check X \; \check Y] = \begin{bmatrix} D^{1/2} U^\top \\ 0_{(n-p-1) \times (p+1)} \end{bmatrix}$

such that

$[\check X \; \check Y]^\top [\check X \; \check Y] = U D U^\top = [X \; Y]^\top [X \; Y]$

is satisfied.
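The construction above amounts to a few lines of linear algebra. The following sketch assumes the Gram matrix $G = [X \; Y]^\top [X \; Y]$ is supplied (e.g., assembled from $X^\top X$, $X^\top Y$ and $\|Y\|_2^2$); the clipping of tiny negative eigenvalues caused by round-off is an implementation choice, not part of the construction.

```python
import numpy as np

def construct_pseudo_data(G, n):
    """Return an n x (p+1) matrix [X_check Y_check] whose Gram matrix equals G (sketch)."""
    d, U = np.linalg.eigh(G)            # eigenvalues in ascending order
    d = np.clip(d, 0.0, None)           # guard against round-off
    order = np.argsort(d)[::-1]         # sort eigenvalues in decreasing order
    d, U = d[order], U[:, order]
    q = G.shape[0]                      # q = p + 1
    if n < q:                           # Case 1: keep the top n eigenpairs
        return np.diag(np.sqrt(d[:n])) @ U[:, :n].T
    # Case 2: stack D^{1/2} U^T on top of (n - q) rows of zeros
    return np.vstack([np.diag(np.sqrt(d)) @ U.T, np.zeros((n - q, q))])
```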

E. Computation of the tuning parameter λ for the lasso-min method

Suppose we had access to individual-level data so that Gaussian knockoffs $\tilde X$ could be constructed. We could then follow the method of Dicker [2014] and estimate the noise level $\sigma$ by

$\hat\sigma_0 = \sqrt{\max\left( \frac{2p+n+1}{n(n+1)} \|Y\|_2^2 - \frac{Y^\top [X \; \tilde X]\, G^{-1} [X \; \tilde X]^\top Y}{n(n+1)}, \; 0 \right)}, \quad \text{where } G = \begin{bmatrix} \Sigma & \Sigma - D \\ \Sigma - D & \Sigma \end{bmatrix}. \quad (16)$

We could then compute $\lambda = \kappa \hat\sigma_0 \, \mathbb{E}\|R^\top \epsilon\|_\infty / n$, where $R \in \mathbb{R}^{n \times 2p}$ is a data matrix whose rows are i.i.d. samples from $\mathcal{N}(0, G)$ and $\epsilon \sim \mathcal{N}(0, I_n)$ is independent of $R$. In the summary statistics setting, we replace $\tilde X^\top Y$ in (16) by $P^\top X^\top Y + \|Y\|_2 Z$, where $P$ and $Z$ are obtained in Algorithm 4.

Figure 7:

Boxplots of 100 simulated samples of $\|LZ\|_\infty$ with $p = 200$, $\Sigma_{i,j} = \rho^{|i-j|}$, and $D$ obtained via (15), for different values of $\rho$.

The expectation $\mathbb{E}\|R^\top \epsilon\|_\infty$ can be computed using Monte Carlo integration. However, when both $n$ and $p$ are very large, sampling $R$ and $\epsilon$ becomes too time-consuming. Observing that

$R^\top \epsilon = \frac{R^\top \epsilon}{\|\epsilon\|_2} \, \|\epsilon\|_2,$

where $R^\top \epsilon / \|\epsilon\|_2 \sim \mathcal{N}(0, G)$ and is independent of $\|\epsilon\|_2$, we have

$\mathbb{E}\|R^\top \epsilon\|_\infty = \mathbb{E}\|\mathcal{N}(0, G)\|_\infty \, \mathbb{E}\|\epsilon\|_2 = \mathbb{E}\|\mathcal{N}(0, G)\|_\infty \, \frac{\sqrt{2}\,\Gamma\{(n+1)/2\}}{\Gamma(n/2)}.$

By Stirling's formula,

$\Gamma(z) = \sqrt{\frac{2\pi}{z}} \left(\frac{z}{e}\right)^z \left(1 + O\left(\frac{1}{z}\right)\right),$

we have

$\frac{\sqrt{2}\,\Gamma\{(n+1)/2\}}{\Gamma(n/2)} \sim \sqrt{n}.$

Therefore, we may approximate

$\mathbb{E}\|R^\top \epsilon\|_\infty \approx \sqrt{n} \, \mathbb{E}\|LZ\|_\infty,$

where $L$ is the Cholesky factor of $G$ and $Z \sim \mathcal{N}(0, I_{2p})$.

In practice, the simulated $\|LZ\|_\infty$ usually concentrates around its mean, as shown in Figure 7. Thus, only a few Monte Carlo samples are needed to estimate $\mathbb{E}\|LZ\|_\infty$ accurately, and we draw 10 Monte Carlo samples throughout the numerical experiments in this paper.
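A minimal sketch of this Monte Carlo approximation follows. The multiplier kappa, the number of Monte Carlo draws, and the jitter added before the Cholesky factorization are user choices made for illustration, not prescriptions from the text.

```python
import numpy as np

def lasso_min_lambda(sigma0_hat, G, n, kappa, n_mc=10, rng=None):
    """Approximate lambda = kappa * sigma0_hat * E||R'eps||_inf / n
    via E||R'eps||_inf ~ sqrt(n) * E||L Z||_inf, with L the Cholesky factor of G (sketch)."""
    rng = np.random.default_rng(rng)
    q = G.shape[0]
    L = np.linalg.cholesky(G + 1e-10 * np.eye(q))
    draws = [np.max(np.abs(L @ rng.standard_normal(q))) for _ in range(n_mc)]
    return kappa * sigma0_hat * np.sqrt(n) * np.mean(draws) / n
```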

Next, we prove that Algorithm 4 maintains FDR control when $\lambda$ is computed as described in Section 4.2.2. By Proposition 2, it suffices to show that (11) with the computed $\lambda$ produces feature importance statistics that satisfy the flip-sign property. Since $\lambda = \kappa \hat\sigma_0 \, \mathbb{E}\|R^\top \epsilon\|_\infty / n$ and $\mathbb{E}\|R^\top \epsilon\|_\infty$ does not depend on the data, it suffices to show that $\hat\sigma_0$ is invariant to swapping variables with their knockoffs [Candès et al., 2018].

Let $\Pi_j \in \mathbb{R}^{2p \times 2p}$ be the permutation matrix that swaps the $j$-th column and the $(j+p)$-th column of a matrix; then $\Pi_j^{-1} = \Pi_j^\top = \Pi_j$. This leads to

$Y^\top ([X \; \tilde X]\Pi_j)\, G^{-1} ([X \; \tilde X]\Pi_j)^\top Y = Y^\top [X \; \tilde X]\, \Pi_j G^{-1} \Pi_j\, [X \; \tilde X]^\top Y = Y^\top [X \; \tilde X]\, (\Pi_j G \Pi_j)^{-1}\, [X \; \tilde X]^\top Y = Y^\top [X \; \tilde X]\, G^{-1}\, [X \; \tilde X]^\top Y,$

where the last equality uses the fact that $\Pi_j G \Pi_j = G$ (swapping a variable with its knockoff leaves $G$ unchanged because $D$ is diagonal),

Since all variables in $\tilde X$ are null, in practice we may replace $[X \; \tilde X]$ by $X$, $G$ by $\Sigma$, and $2p+n+1$ by $p+n+1$ in (16) to reduce the dimension when estimating $\sigma$. Although this would, in theory, break the flip-sign property required for FDR control, no FDR inflation is observed in our simulations.

F. Connection with the scout procedure

In this section, we explain the connection between the feature importance statistic defined in Algorithm 4 and the scout procedure [Witten and Tibshirani, 2009].

For covariates $X \in \mathbb{R}^p$ and response $Y \in \mathbb{R}$, Witten and Tibshirani [2009] assume that $(X, Y) \sim \mathcal{N}(0, \Sigma_{X,Y})$. The population linear regression coefficient of $Y$ on $X$, which induces the linear predictor achieving the minimal mean squared prediction error, is given by $\beta = -\Theta_{XY}/\Theta_{YY}$, where $\Theta = \begin{bmatrix} \Theta_{XX} & \Theta_{XY} \\ \Theta_{YX} & \Theta_{YY} \end{bmatrix} = \Sigma_{X,Y}^{-1}$ is the precision matrix. Let $S$ be the empirical covariance matrix of $(X, Y)$; they consider the following covariance-regularized regression approach to estimate $\beta$.

  1. Compute $\hat\Theta_{XX}$ to maximize $\log\{\det(\Theta_{XX})\} - \operatorname{tr}(S_{XX}\Theta_{XX}) - J_1(\Theta_{XX})$.

  2. Compute $\hat\Theta$ to maximize $\log\{\det(\Theta)\} - \operatorname{tr}(S\Theta) - J_2(\Theta)$ subject to $\Theta_{XX} = \hat\Theta_{XX}$ obtained from Step 1.

  3. Compute $\hat\beta = -\hat\Theta_{XY}/\hat\Theta_{YY}$.

  4. Compute $\hat\beta^* = c\hat\beta$, where $c$ is the regression coefficient of $Y$ onto $X\hat\beta$.

Here, $J_1$ and $J_2$ are two penalty functions. The first two steps aim to appropriately separate true conditional correlations from those purely due to noise. As shown in Witten and Tibshirani [2009], when $J_2(\Theta) = \lambda_2 \|\Theta\|_1$ (resp. $\lambda_2\|\Theta\|_2^2$), the solution to step 3 is proportional to the solution of

$\hat\beta = \arg\min_\beta \; \beta^\top G_{XX} \beta - 2 S_{XY}^\top \beta + \lambda_2 \|\beta\|_1 \quad (\text{resp. } \lambda_2\|\beta\|_2^2),$

where $G_{XX}$ is the inverse of the solution $\hat\Theta_{XX}$ from step 1. In other words, the Lasso corresponds to the setting in which $J_1 = 0$ and $G_{XX} = S_{XX}$. Witten and Tibshirani [2009] consider various settings in which they demonstrate the superiority of the scout procedure over the Lasso, ridge regression and the elastic net. In the setting of Section 4 we have $\operatorname{cov}([X, \tilde X]) = \begin{bmatrix} \Sigma & \Sigma - D \\ \Sigma - D & \Sigma \end{bmatrix}$. Therefore, the objective function (11) corresponds to the case in which the true $\Theta_{XX}$ is used in step 1 (here we include both $X$ and $\tilde X$ as explanatory variables).
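To make the displayed quadratic program concrete, here is a small coordinate-descent sketch for $\min_\beta \beta^\top G_{XX}\beta - 2 S_{XY}^\top\beta + \lambda\|\beta\|_1$, written for a $G_{XX}$ with strictly positive diagonal. It illustrates the summary-statistics objective, not the authors' solver; the fixed number of sweeps is an assumption made to keep the sketch short.

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def summary_statistics_lasso(G, r, lam, n_iter=200):
    """Coordinate descent for min_beta beta'G beta - 2 r'beta + lam*||beta||_1,
    where G plays the role of G_XX and r the role of S_XY in the display above (sketch)."""
    p = len(r)
    beta = np.zeros(p)
    for _ in range(n_iter):                                        # fixed number of sweeps
        for j in range(p):
            partial = r[j] - G[j] @ beta + G[j, j] * beta[j]       # r_j - sum_{k != j} G_jk beta_k
            beta[j] = soft_threshold(partial, lam / 2.0) / G[j, j]
    return beta
```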

G. Construction of group knockoffs and examples of importance scores at the group level

For group knockoffs, we test the group conditional independence hypothesis

$H_0^\gamma: X_\gamma \perp\!\!\!\perp Y \mid X_{-\gamma},$

where $\gamma \in \{1, \ldots, g\}$ denotes a group and $X_\gamma$ is the vector of features in group $\gamma$.

In addition to the conditional independence (2), group knockoffs $\tilde X$ must satisfy the group exchangeability condition

$\text{(Group exchangeability):} \quad (X_\gamma, \tilde X_\gamma, X_{-\gamma}, \tilde X_{-\gamma}) \overset{d}{=} (\tilde X_\gamma, X_\gamma, X_{-\gamma}, \tilde X_{-\gamma}), \quad \forall \gamma \in \{1, \ldots, g\}.$

No exchangeability property is required for features within the same group, which allows greater flexibility in the construction of knockoffs.

When $X \sim \mathcal{N}(0, \Sigma)$, the group exchangeability condition allows the diagonal matrix $D = \operatorname{diag}(s)$ described in Sections 2 and 4 to become a block-diagonal matrix $D = \operatorname{diag}(S_1, \ldots, S_g)$, where $S_\gamma$ is a symmetric matrix whose dimension equals the number of variables in group $\gamma$ ($\gamma \in \{1, \ldots, g\}$). With the block-diagonal matrix $D$ obtained following the SDP construction of Chu et al. [2023] in step 3, Algorithm 1 can construct valid group knockoffs $\tilde X$ of $X$ with respect to the $g$ feature groups. Analogously, Algorithms 2–4 can be modified correspondingly to perform inference on the $H_0^\gamma$'s.

Although it is conceptually straightforward to modify $D$ from a diagonal to a block-diagonal matrix, doing so introduces significantly more optimization variables. To reduce the computational burden in practice, we exploit a form of conditional independence across groups, described in Section 4 of Chu et al. [2023]. The main idea is to select a few key variables in each group that capture most of the within-group variation, and to solve a reduced optimization problem only over these key variables. In the real data analysis, we defined groups via average-linkage hierarchical clustering with correlation cutoff 0.5, selected representatives within groups via Algorithm A1 of Chu et al. [2023] with c = 0.5, and replaced objective (15) with the maximum entropy (ME) objective, which has improved power over SDP constructions in simulations.

In this paper, we use $M$ multi-knockoffs. To define the importance score of group $\gamma$ and its knockoffs, we sum the estimated effects of the variants in each group. With $M$ knockoff copies, we compute $Z_\gamma = \sum_{i \in \mathcal{A}_\gamma} |\hat\beta_i|$ and $\tilde Z_\gamma^{(\ell)} = \sum_{i \in \mathcal{A}_\gamma} |\tilde\beta_i^{(\ell)}|$ for $\ell = 1, \ldots, M$, where $(\hat\beta, \tilde\beta^{(1)}, \ldots, \tilde\beta^{(M)})$ are the estimated effect sizes from step 4 of Algorithm 4 and $\mathcal{A}_\gamma$ is the set of variant indices in group $\gamma$. One may use other choices of feature importance, such as the $\ell_2$ norm. The group-wise Lasso coefficient difference is then defined as

$W_\gamma = \left( Z_\gamma - \operatorname{median}\{\tilde Z_\gamma^{(1)}, \ldots, \tilde Z_\gamma^{(M)}\} \right) \, \mathbb{1}\{Z_\gamma \geq \max(\tilde Z_\gamma^{(1)}, \ldots, \tilde Z_\gamma^{(M)})\},$

and groups with $W_\gamma > \tau$ are selected, where $\tau$ is calculated from the multiple knockoff filter [Gimenez and Zou, 2019]. Note that $W_\gamma$ is the feature importance statistic first introduced in He et al. [2021].

H. GhostKnockoffs for CRT (GhostCRT)

Let $X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^n$ be the covariate matrix and the response vector, respectively. Recall that in the conditional randomization test (CRT), to test $H_j: X_j \perp\!\!\!\perp Y \mid X_{-j}$, Candès et al. [2018] draw i.i.d. samples $\tilde X_j^{(1)}, \ldots, \tilde X_j^{(B)}$ from the conditional distribution of $X_j$ given $X_{-j}$ (here $X_j$ is the $j$-th column of the covariate matrix $X$) and compute the CRT p-value as

$p_j = \frac{1}{B+1}\left[1 + \sum_{b=1}^{B} \mathbb{1}\{T(\tilde X_j^{(b)}, X_{-j}, Y) \geq T(X_j, X_{-j}, Y)\}\right], \quad (17)$

for some feature importance function $T$.

Under the assumption that the rows of $X$ are i.i.d. samples from $\mathcal{N}(0, \Sigma)$, we can generate $\tilde X_j^{(1)}, \ldots, \tilde X_j^{(B)}$ by

$\tilde X_j^{(b)} = X_{-j}\gamma_j + v_j^{1/2} E_j^{(b)}, \quad (18)$

where $\gamma_j = \Sigma_{-j,-j}^{-1}\Sigma_{-j,j} \in \mathbb{R}^{p-1}$, $v_j = \Sigma_{j,j} - \Sigma_{j,-j}\Sigma_{-j,-j}^{-1}\Sigma_{-j,j}$, and $E_j^{(1)}, \ldots, E_j^{(B)} \overset{\text{iid}}{\sim} \mathcal{N}(0, I_n)$ are independent of everything else. Exploiting the analogy between (4) and (18), we develop the GhostCRT with counterparts of Algorithms 2–3 as follows; the counterpart of Algorithm 4 is derived in a similar way.

Algorithm 5 GhostKnockoffs with Marginal Correlation Difference Statistic for CRT
1: Input: $X^\top Y$, $\|Y\|_2^2$, and $\Sigma$.
2: for $j = 1, \ldots, p$ do
3:  Compute $\gamma_j = \Sigma_{-j,-j}^{-1}\Sigma_{-j,j} \in \mathbb{R}^{p-1}$ and $v_j = \Sigma_{j,j} - \Sigma_{j,-j}\Sigma_{-j,-j}^{-1}\Sigma_{-j,j}$.
4:  for $b = 1, \ldots, B$ do
5:   Generate $\tilde Z_j^{(b)} = \gamma_j^\top X_{-j}^\top Y + \|Y\|_2 Z_j^{(b)}$, where $Z_j^{(b)} \sim \mathcal{N}(0, v_j)$ is independent of everything else.
6:  end for
7:  Compute the CRT p-value $p_j$ via (17) with $T(X_j, X_{-j}, Y) = |X_j^\top Y|$ and $T(\tilde X_j^{(b)}, X_{-j}, Y) = |\tilde Z_j^{(b)}|$.
8: end for
9: Output: Selection set obtained by applying existing multiple testing procedures to the CRT p-values $p_1, \ldots, p_p$.
Algorithm 6 GhostKnockoffs with Penalized Regression for CRT: Known Empirical Covariance
1: Input: $X^\top X$, $X^\top Y$, $\|Y\|_2^2$, $\Sigma$, and $n$.
2: Find $\check X$ and $\check Y$ such that $[\check X \; \check Y]^\top [\check X \; \check Y] = [X \; Y]^\top [X \; Y]$ by eigen-decomposition or Cholesky decomposition.
3: for $j = 1, \ldots, p$ do
4:  Compute $\gamma_j = \Sigma_{-j,-j}^{-1}\Sigma_{-j,j} \in \mathbb{R}^{p-1}$ and $v_j = \Sigma_{j,j} - \Sigma_{j,-j}\Sigma_{-j,-j}^{-1}\Sigma_{-j,j}$.
5:  for $b = 1, \ldots, B$ do
6:   Generate $\tilde{\check X}_j^{(b)}$ via (18) using $\check X_{-j}$ as input.
7:  end for
8:  Compute the CRT p-value $p_j$ via (17), replacing $\tilde X_j^{(b)}$ by $\tilde{\check X}_j^{(b)}$, with feature importance statistic $T(X_j, X_{-j}, Y) = |\hat\beta_j|$, where $(\hat\beta_j, \hat\beta_{-j}) = \arg\min_{(\beta_j, \beta_{-j}) \in \mathbb{R}^p} \frac{1}{2}\|Y - X_j\beta_j - X_{-j}\beta_{-j}\|_2^2 + \lambda\|(\beta_j, \beta_{-j})\|_1$.
9: end for
10: Output: Selection set obtained by applying existing multiple testing procedures to the CRT p-values $p_1, \ldots, p_p$.

As (18) is a special case of (4), in which

  • $P$ is obtained from $I_p$ by replacing its $(j,j)$-entry with 0 and the remaining entries of its $j$-th column with $\gamma_j$;

  • $V$ is a matrix of zeros except for the $(j,j)$-entry, which equals $v_j$,

all theoretical results in Sections 2–4 remain true for the GhostCRT.

Remark 1. In Algorithm 6, the tuning parameter $\lambda$ is allowed to depend on $X^\top X$, $X^\top Y$, $Y^\top Y$ and $n$. We may also use the square-root Lasso or the Lasso-max importance statistic, as outlined in Sections 3.3 and 3.4.
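The sketch below illustrates Algorithm 5 with numpy; the feature loop is written naively for clarity, and vectorizing or parallelizing it is left to the reader.

```python
import numpy as np

def ghost_crt_marginal(XtY, y_norm, Sigma, B=1000, rng=None):
    """CRT p-values from summary statistics, with T(X_j, X_{-j}, Y) = |X_j' Y| as in Algorithm 5 (sketch)."""
    rng = np.random.default_rng(rng)
    p = len(XtY)
    pvals = np.empty(p)
    for j in range(p):
        idx = np.r_[0:j, j + 1:p]                                        # indices of X_{-j}
        gamma_j = np.linalg.solve(Sigma[np.ix_(idx, idx)], Sigma[idx, j])
        v_j = Sigma[j, j] - Sigma[j, idx] @ gamma_j
        z_tilde = gamma_j @ XtY[idx] + y_norm * np.sqrt(v_j) * rng.standard_normal(B)
        pvals[j] = (1 + np.sum(np.abs(z_tilde) >= np.abs(XtY[j]))) / (B + 1)
    return pvals
```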

I. Additional results for Section 4.4.2

To further demonstrate the effect of sample size on the new GhostKnockoffs methods, in comparison with the individual-level knockoffs with (cross-validated) Lasso coefficient difference, we consider additional experiments with $p = 600$ and $n = 600/1800/3000$ under the same setting as Section 4.4.2. Note that the noise level scales with the sample size so that the signal-to-noise ratio does not change dramatically. From Figure 8 we observe that, as $n$ increases, all three new methods proposed in this paper have power comparable to KF-lassocv and consistently outperform GK-marginal [He et al., 2022], with the FDR controlled in all cases.

Figure 8:

Power and FDR plots for independent features and a Gaussian linear model with varying sample sizes and a fixed feature dimension. Each point shown is an average over 200 replications.

J. Supplementary plots for Section 4.4.1

Let $Z = X^\top Y$ and let $\tilde Z = P^\top X^\top Y + \|Y\|_2 \xi$, with $\xi \sim \mathcal{N}(0, V)$, be its knockoff counterpart, as defined in Algorithm 4. Note that for Algorithm 4 to control the FDR, it suffices to require that swapping the $j$-th entry of $Z$ with the $j$-th entry of $\tilde Z$ does not change the joint distribution of $(Z, \tilde Z)$ for each null $j$.

By the Central Limit Theorem, small subsets of entries (e.g., single entries, pairs, and triples) of $(Z, \tilde Z)$ are approximately Gaussian. Additionally, in Figure 11 we show empirically that the covariance of $(Z, \tilde Z)$ approximately satisfies the required swap-invariance for null positions. These approximations, coupled with the robustness of the knockoff framework, empirically yield FDR control. This is similar to the robustness of second-order knockoffs observed empirically in Candès et al. [2018].

In the setting of Section 4.4.1, Figure 9 depicts the ordered empirical values of $Z_j$ (respectively $\tilde Z_j$) plotted against an equal-size ordered random sample from a Gaussian distribution whose mean and variance match the empirical mean and variance of $Z_j$ (respectively $\tilde Z_j$), for three randomly selected indices. This comparison is based on the 1000 sub-sampled data replications from Section 4.4.1. In Figure 10, we overlay the plots for all indices. We observe that $Z_j$ and $\tilde Z_j$ approximately follow Gaussian distributions. In Figure 11, we present scatter plots of the relevant empirical covariances. We observe that all points roughly concentrate around the $y = x$ line, which shows the approximate swap-invariance of $Z$ and $\tilde Z$ (for null indices).

K. Details of the nine studies in Section 5

Section 5 considers the following nine studies for Alzheimer’s disease:

  1. The genome-wide association study performed by Huang et al. [2017].

  2. The genome-wide meta-analysis of clinically diagnosed AD and AD-by-proxy by Jansen et al. [2019].

  3. The genome-wide meta-analysis of clinically diagnosed AD by Kunkle et al. [2019].

  4. The genome-wide meta-analysis by Schwartzentruber et al. [2021].

  5. In-house genome-wide association study of 15,209 cases and 14,452 controls aggregating 27 cohorts across 39 SNP array data sets, imputed using the TOPMed reference panels [Belloy et al., 2022a].

  6. A whole-exome sequencing analysis of data from ADSP by Bis et al. [2020].

  7. A whole-exome sequencing analysis of data from ADSP by Le Guen et al. [2021].

  8. In-house whole-exome sequencing analysis of ADSP (6155 cases, 5418 controls).

  9. In-house whole-genome sequencing analysis of the 2021 ADSP release (3584 cases, 2949 controls) [Belloy et al., 2022b].

L. Calculation of meta-analysis Z-score

Based on Z-scores $Z_1, Z_2, \ldots, Z_K$ from $K$ studies, we adopt the definition in He et al. [2022] of the meta-analysis Z-score with overlapping samples,

$Z_{\text{meta}} = H \sum_{k=1}^{K} w_k C_k Z_k.$

Specifically,

  • the optimal weights $w_1, \ldots, w_K$ are obtained by solving the optimization problem
    $\text{minimize} \; \sum_{k=1}^{K}\sum_{l=1}^{K} w_k w_l \, \text{cor.S}_{kl} \quad \text{subject to} \; \sum_{k=1}^{K} w_k n_k = 1, \; w_1, \ldots, w_K \geq 0;$

  • for $k = 1, \ldots, K$, $C_k = \operatorname{diag}(c_{k1}, \ldots, c_{kp})$ is a diagonal matrix where $c_{kj} = 1$ if the Z-score of the $j$-th variant is observed in the $k$-th study and $c_{kj} = 0$ otherwise ($j = 1, \ldots, p$);

  • $H = \operatorname{diag}(h_1, \ldots, h_p)$ is a diagonal matrix where $h_j = \left(\sum_{k}\sum_{l} w_k w_l c_{kj} c_{lj} \, \text{cor.S}_{kl}\right)^{-1/2}$ ($j = 1, \ldots, p$);

  • $\text{cor.S}_{kl}$ is the study correlation between the $k$-th study and the $l$-th study.

Figure 9:

QQ plots of Zj’s (left) and Z˜j’s (right) against Gaussian samples with matching mean and variance for three randomly sampled indices.

Figure 10:

QQ plots of Zj’s (left) and Z˜j’s (right) against Gaussian samples with matching mean and variance for all indices overlaid.

Figure 11:

Scatter plots for relevant empirical covariances. Each correlation is estimated from 1000 samples drawn as described in Section 4.4.1.

In practice, when calculating $\text{cor.S}_{kl}$, we only use variants whose Z-scores are bounded in $[-1.96, 1.96]$ in both the $k$-th and the $l$-th study, to eliminate the impact of polygenic effects. This meta-analysis approach is a generalization of the METAL method proposed by Willer et al. [2010].
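The definition above transcribes directly into code. In the sketch below, the use of np.nan to encode unobserved variants and the einsum-based computation of the $h_j$'s are implementation conveniences, not part of the definition.

```python
import numpy as np

def meta_analysis_z(Z, w, cor_S):
    """Compute Z_meta = H * sum_k w_k C_k Z_k from the definitions above (sketch).

    Z: K x p array of study Z-scores, with np.nan where a variant is unobserved.
    w: length-K vector of study weights.
    cor_S: K x K matrix of study correlations cor.S_kl.
    """
    C = (~np.isnan(Z)).astype(float)            # indicators c_kj
    Z0 = np.nan_to_num(Z, nan=0.0)              # C_k Z_k
    numerator = (w[:, None] * Z0).sum(axis=0)   # sum_k w_k C_k Z_k
    # h_j = (sum_{k,l} w_k w_l c_kj c_lj cor.S_kl)^{-1/2}
    denom = np.einsum('k,l,kj,lj,kl->j', w, w, C, C, cor_S)
    return numerator / np.sqrt(denom)
```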

M. Obtaining the covariance matrix Σ in meta-analysis for AD

To perform meta-analysis for AD, we need a suitable estimate of the covariance matrix Σ. In this paper, we adopt strategies in He et al. [2023] and Chu et al. [2023] as follows.

We first download the covariance matrix from the Pan-UKB consortium (https://pan.ukbb.broadinstitute.org), which contains about 24 million variants across the human genome derived from about 500,000 British samples. We then extract p = 650,576 variants that satisfy the following three conditions: (a) the variant is recorded on the UK Biobank genotype array, (b) its MAF exceeds 0.01, and (c) its reference/alternate allele pair matches those listed in all nine studies in the meta-analysis. Based on the covariance matrix of the extracted variants, we further partition them into 1703 quasi-independent blocks using the partition given by Berisa and Pickrell [2016]. Finally, we compute the block-diagonal covariance matrix

$\Sigma = \begin{bmatrix} \Sigma_1 & & \\ & \ddots & \\ & & \Sigma_{1703} \end{bmatrix},$

where $\Sigma_l$ is the shrinkage estimator of the covariance matrix of the variants in the $l$-th block, computed with the R package corpcor [Schäfer and Strimmer, 2005]. To ensure that all blocks $\Sigma_1, \ldots, \Sigma_{1703}$ are positive definite, we perform an eigen-decomposition and raise all eigenvalues not larger than $10^{-5}$ to $10^{-5}$.
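The eigenvalue floor used to make each block positive definite is a one-liner. The sketch below uses numpy rather than corpcor (which supplies the shrinkage estimate itself) and simply applies the $10^{-5}$ floor described above.

```python
import numpy as np

def floor_eigenvalues(Sigma_block, floor=1e-5):
    """Raise all eigenvalues of a (shrinkage-estimated) covariance block to at least `floor`."""
    d, U = np.linalg.eigh(Sigma_block)
    d = np.maximum(d, floor)
    return (U * d) @ U.T
```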

N. Supplementary tables of meta-analysis for AD

Tables 1 and 2 provide more details of the meta-analysis for AD in Section 5. Specifically, Table 1 presents the number of loci, the average number of signals per locus, the standard deviation of the number of signals per locus, the average number of groups per locus, and the standard deviation of the number of groups per locus identified by the conventional marginal association test, GK-marginal, GK-pseudolasso, and GK-susie-rss. Here, the p-value threshold of the conventional marginal association test is $5 \times 10^{-8}$, and GK-pseudolasso uses the tuning parameter chosen by the lasso-min method of Section 4.2.2. For GK-marginal, GK-pseudolasso, and GK-susie-rss, we display results for target FDR levels 0.05, 0.1, and 0.2. Table 2 provides details of the top variants of the loci identified by GK-pseudolasso (target FDR level: 0.20), including their positions (columns "Chr." and "SNP"), their reference alleles (column "Ref.") and alternative alleles (column "Alt."), the genes that they potentially regulate (column "Top S2G gene"), their closest genes, their Z-scores in the individual studies, their meta-analysis Z-scores, their feature importance scores (W), and the marginal p-values obtained from the meta-analysis Z-scores.

O. Supplementary figures of meta-analysis for AD

Analogous to Figure 6 in Section 5, Figures 12, 13 and 14 respectively present Manhattan plots of the meta-analysis of the nine studies via the conventional marginal association test (with p-value cutoff $5 \times 10^{-8}$), GK-marginal (with target FDR level 0.10), and GK-susie-rss (with target FDR level 0.10).

The conventional marginal association test selects many feature groups because it focuses on marginal correlations between feature groups and the response, while ignoring spurious correlations induced by linkage disequilibrium. This is apparent in Figure 12, where the conventional marginal association test tends to select many nearby loci. The issue is alleviated by the GhostKnockoffs approaches, which test conditional independence, as seen in Figures 6, 13, and 14.

Figure 12:

Graphical illustration of the result of applying the conventional marginal association test to the meta-analysis for AD. The dotted line represents the conventional genome-wide p-value threshold of 5 × 10−8. P-values are truncated at 10−50 for better visualization. The results are obtained from the meta-analysis p-values calculated as in Section L. Variant density is shown at the bottom of the plot (number of variants per 1Mb).

Figure 13:

Graphical illustration of the result of applying GK-marginal to the meta-analysis for AD. Each point represents a group of genetic variants. With respect to the target FDR level 0.1, points of identified groups are highlighted in blue or purple. For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple). Variant density is shown at the bottom of the plot (number of variants per 1Mb).

Table 1:

Summary of results from applying different methods to the meta-analysis for AD

Method | Target FDR level | Number of identified loci | Average signals per locus | SD of signals per locus | Average groups per locus | SD of groups per locus
Marginal association test | - | 29 | 15.517 | 32.231 | 1.000 | 0.000
GK-marginal | 0.05 | 3 | 107.667 | 97.027 | 3.667 | 3.055
GK-marginal | 0.10 | 10 | 95.600 | 214.568 | 2.700 | 3.622
GK-marginal | 0.20 | 17 | 76.176 | 174.714 | 2.412 | 3.318
GK-pseudolasso | 0.05 | 30 | 21.500 | 44.323 | 2.333 | 4.722
GK-pseudolasso | 0.10 | 42 | 17.214 | 38.019 | 2.024 | 4.015
GK-pseudolasso | 0.20 | 63 | 15.889 | 47.074 | 1.794 | 3.561
GK-susie-rss | 0.05 | 21 | 14.619 | 27.902 | 1.286 | 0.644
GK-susie-rss | 0.10 | 35 | 12.000 | 23.506 | 1.257 | 0.657
GK-susie-rss | 0.20 | 47 | 10.191 | 20.591 | 1.319 | 0.695

Figure 14:

Graphical illustration of the result of applying GK-susie-rss to the meta-analysis for AD. Each point represents a group of genetic variants. With respect to the target FDR level 0.1, points of identified groups are highlighted in blue or purple. For each locus with at least one identified group, the name of the locus is presented at the variant group with the largest importance statistic (highlighted in purple). Variant density is shown at the bottom of the plot (number of variants per 1Mb).

P. Running Lasso on binary responses

In genetic datasets, the response Y is often binary. Performing the Lasso or Lasso-type regressions on a binary response may sound unreasonable, since it violates the usual linear model assumption; one might expect penalized logistic regression to generate much more effective feature importance statistics. However, somewhat surprisingly, the following two simulations show that this intuition may not be correct.

For the first column of Figure 15, we generate $X_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \frac{1}{n} I_p)$ and, conditional on $X_i$, $P(Y_i = 1) = \frac{1}{1 + e^{-\beta^\top X_i}}$ and $P(Y_i = 0) = 1 - P(Y_i = 1)$, where $n = 1000$ and $p = 300$. We create $\beta$ by selecting 30 coordinates uniformly at random to be non-zero; the signs of these non-zero coordinates are positive or negative with equal probability. The dark curve represents the knockoffs procedure with the Lasso coefficient difference statistic (with tuning parameter chosen by cross-validation), i.e., KF-lassocv. The red curve represents the knockoffs procedure with the coefficient difference statistic generated by $\ell_1$-penalized logistic regression. We vary the signal amplitudes so that we observe relatively complete power profiles. The target FDR level is 0.1. Each point on the curves represents an average over 200 replications. For the second column of Figure 15, we show the result for AR(1) features. Here, $n = 600$, $p = 200$, and the signal amplitude (i.e., the magnitude of the non-zero entries of $\beta$) is fixed at 0.5; otherwise, the simulation setting is exactly the same as in the independent case. We observe that the two methods have almost the same power and FDR, so the use of penalized logistic regression does not meaningfully change the results.
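For reference, here is a sketch of the data-generating process for the first column of Figure 15. The function name, seed handling, and the default amplitude (the text varies the amplitude in the independent-feature case and fixes 0.5 only for the AR(1) case) are our assumptions.

```python
import numpy as np

def simulate_logistic_design(n=1000, p=300, k=30, amplitude=0.5, rng=None):
    """Generate X_i ~ N(0, I_p / n) and binary Y_i from the logistic model described above (sketch)."""
    rng = np.random.default_rng(rng)
    X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, p))
    beta = np.zeros(p)
    support = rng.choice(p, size=k, replace=False)           # 30 non-zero coordinates
    beta[support] = amplitude * rng.choice([-1.0, 1.0], size=k)
    prob = 1.0 / (1.0 + np.exp(-X @ beta))
    Y = rng.binomial(1, prob)
    return X, Y, beta
```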

Figure 15:

Power and FDR plots when the response is generated by a logistic regression model. Each point is an average over 200 replications.

Footnotes

*

As an exception, we use $X$, $\tilde X$, and $Y$ to represent generic covariates, their knockoffs, and the response.

*

In the case that Y is binary, one may think that utilizing (penalized) logistic regression would give much better power than Lasso. In Appendix P, we show that this intuition may not be correct through simulations, even when Y is generated according to a logistic regression model.

*

Careful readers may realize that the solution of the Lasso problem does not depend on $\|Y\|_2^2$. Here we include $\|Y\|_2^2$ as an input to be able to make a more general statement later that goes beyond the Lasso.

*

Note that $\check X$ may not be a data matrix with i.i.d. rows and covariance matrix $\Sigma$, so we should call $\tilde{\check X}$ the pseudo-Gaussian knockoff data matrix.

*

The simulation setting is designed in a way that the signal-to-noise ratio has the same scale as n varies.

*

We remark that similar objective functions have been used in, for example, Mak et al. [2017] and Zou et al. [2022].

*

Unlike the previous approach, this tuning parameter choice will not induce the exact flip-sign property. However, we observe empirically that our method is robust to this issue, and no FDR inflation occurred. In theory, one could randomly swap all the variables with their corresponding knockoffs and compute the average of all the λ values obtained. In the limit, the average will give a data-driven value of λ that is invariant to swapping variables with their knockoffs due to symmetry.

*

We used the susie_rss function inside the R package susieR in our simulations.

*

Specifically, the extended exchangeability condition says that if we permute variables with their corresponding (multiple) knockoffs arbitrarily, the joint distribution remains unchanged.

*

In general, CRT p-values may not be independent of each other or satisfy the PRDS property [Benjamini and Yekutieli, 2001]. Therefore, applying the Benjamini-Hochberg procedure on CRT p–values does not guarantee FDR control theoretically. However, as noted in Candès et al. [2018], the FDR is usually under control empirically.

*

Specifically, we consider the variant group with the largest group knockoff feature importance statistic within a locus, and then map the locus to the most proximal gene of the variant within the group that has the highest knockoff importance score.

*

Along with $\|Y\|_2$ and $n$.

References

  1. Barber R. F. and Candès E. J. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015. doi:10.1214/15-AOS1337.
  2. Barber R. F., Candès E. J., and Samworth R. J. Robust inference with knockoffs. 2020.
  3. Bates S., Sesia M., Sabatti C., and Candès E. Causal inference in genetic trio studies. Proceedings of the National Academy of Sciences, 117(39):24117–24126, 2020.
  4. Belloni A., Chernozhukov V., and Wang L. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.
  5. Belloy M. E., Eger S. J., Le Guen Y., Damotte V., Ahmad S., Ikram M. A., Ramirez A., Tsolaki A. C., Rossi G., Jansen I. E., et al. Challenges at the APOE locus: a robust quality control approach for accurate APOE genotyping. Alzheimer's Research & Therapy, 14:22, 2022a.
  6. Belloy M. E., Le Guen Y., Eger S. J., Napolioni V., Greicius M. D., and He Z. A Fast and Robust Strategy to Remove Variant-Level Artifacts in Alzheimer Disease Sequencing Project Data. Neurology Genetics, 8(5):e200012, 2022b.
  7. Belloy M. E., Andrews S. J., Le Guen Y., Cuccaro M., Farrer L. A., Napolioni V., and Greicius M. D. APOE Genotype and Alzheimer Disease Risk Across Age, Sex, and Population Ancestry. JAMA Neurology, 80(12):1284–1294, 2023.
  8. Benjamini Y. and Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1):289–300, 1995.
  9. Benjamini Y. and Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, pages 1165–1188, 2001.
  10. Berisa T. and Pickrell J. K. Approximately independent linkage disequilibrium blocks in human populations. Bioinformatics, 32(2):283–285, 2016.
  11. Bis J. C., Jian X., Kunkle B. W., Chen Y., Hamilton-Nelson K. L., Bush W. S., Salerno W. J., Lancour D., Ma Y., Renton A. E., et al. Whole exome sequencing study identifies novel rare and common Alzheimer's-Associated variants involved in immune response and transcriptional regulation. Molecular Psychiatry, 25:1859–1875, 2020.
  12. Boyd S. and Vandenberghe L. Convex Optimization. Cambridge University Press, 2004.
  13. Candès E., Fan Y., Janson L., and Lv J. Panning for Gold: 'Model-X' Knockoffs for High Dimensional Controlled Variable Selection. Journal of the Royal Statistical Society Series B: Statistical Methodology, 80(3):551–577, 2018.
  14. Chen C.-Y., Pollack S., Hunter D. J., Hirschhorn J. N., Kraft P., and Price A. L. Improved ancestry inference using weights from external reference panels. Bioinformatics, 29(11):1399–1406, 2013.
  15. Chu B. B., Gu J., Chen Z., Morrison T., Candès E., He Z., and Sabatti C. Second-order group knockoffs with applications to GWAS. arXiv preprint arXiv:2310.15069, 2023.
  16. Dai R. and Barber R. The knockoff filter for FDR control in group-sparse and multitask regression. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 1851–1859. PMLR, 2016.
  17. Dicker L. H. Variance estimation in high-dimensional linear models. Biometrika, 101(2):269–284, 2014.
  18. Gazal S., Weissbrod O., Hormozdiari F., Dey K. K., Nasser J., Jagadeesh K. A., Weiner D. J., Shi H., Fulco C. P., O'Connor L. J., et al. Combining SNP-to-gene linking strategies to identify disease genes and assess disease omnigenicity. Nature Genetics, 54:827–836, 2022.
  19. Gimenez J. R. and Zou J. Improving the Stability of the Knockoff Procedure: Multiple Simultaneous Knockoffs and Entropy Maximization. In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89, pages 2184–2192. PMLR, 2019.
  20. He Z., Liu L., Wang C., Le Guen Y., Lee J., Gogarten S., Lu F., Montgomery S., Tang H., Silverman E. K., et al. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nature Communications, 12:3152, 2021.
  21. He Z., Liu L., Belloy M. E., Le Guen Y., Sossin A., Liu X., Qi X., Ma S., Gyawali P. K., Wyss-Coray T., et al. Ghostknockoff inference empowers identification of putative causal variants in genome-wide association studies. Nature Communications, 13:7209, 2022.
  22. He Z. et al. In silico identification of putative causal genetic variants. 2023.
  23. Huang K.-l., Marcora E., Pimenova A. A., Di Narzo A. F., Kapoor M., Jin S. C., Harari O., Bertelsen S., Fairfax B. P., Czajkowski J., et al. A common haplotype lowers PU.1 expression in myeloid cells and delays onset of Alzheimer's disease. Nature Neuroscience, 20:1052–1061, 2017.
  24. Jansen I. E., Savage J. E., Watanabe K., Bryois J., Williams D. M., Steinberg S., Sealock J., Karlsson I. K., Hägg S., Athanasiu L., et al. Genome-wide meta-analysis identifies new loci and functional pathways influencing Alzheimer's disease risk. Nature Genetics, 51:404–413, 2019.
  25. Kunkle B. W., Grenier-Boley B., Sims R., Bis J. C., Damotte V., Naj A. C., Boland A., Vronskaya M., Van Der Lee S. J., Amlie-Wolf A., et al. Genetic meta-analysis of diagnosed Alzheimer's disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nature Genetics, 51:414–430, 2019.
  26. Le Guen Y., Belloy M. E., Napolioni V., Eger S. J., Kennedy G., Tao R., He Z., and Greicius M. D. A novel age-informed approach for genetic association analysis in Alzheimer's disease. Alzheimer's Research & Therapy, 13:72, 2021.
  27. Leung Y. Y., Valladares O., Chou Y.-F., Lin H.-J., Kuzma A. B., Cantwell L., Qu L., Gangadharan P., Salerno W. J., Schellenberg G. D., et al. VCPA: genomic variant calling pipeline and data management tool for Alzheimer's Disease Sequencing Project. Bioinformatics, 35(10):1768–1770, 2019.
  28. Li S. and Candès E. J. Deploying the Conditional Randomization Test in High Multiplicity Problems. arXiv preprint arXiv:2110.02422, 2021.
  29. Mak T. S. H., Porsch R. M., Choi S. W., Zhou X., and Sham P. C. Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology, 41:469–480, 2017.
  30. Pasaniuc B. and Price A. L. Dissecting the genetics of complex traits using summary association statistics. Nature Reviews Genetics, 18:117–127, 2017.
  31. Qian J., Tanigawa Y., Du W., Aguirre M., Chang C., Tibshirani R., Rivas M. A., and Hastie T. A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank. PLoS Genetics, 16(10):e1009141, 2020.
  32. Schäfer J. and Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4:32, 2005.
  33. Schwartzentruber J., Cooper S., Liu J. Z., Barrio-Hernandez I., Bello E., Kumasaka N., Young A. M., Franklin R. J., Johnson T., Estrada K., et al. Genome-wide meta-analysis, fine-mapping and integrative prioritization implicate new Alzheimer's disease risk genes. Nature Genetics, 53:392–402, 2021.
  34. Serrano-Pozo A., Das S., and Hyman B. T. APOE and Alzheimer's disease: advances in genetics, pathophysiology, and therapeutic approaches. The Lancet Neurology, 20(1):68–80, 2021.
  35. Sesia M., Bates S., Candès E., Marchini J., and Sabatti C. False discovery rate control in genome-wide association studies with population structure. Proceedings of the National Academy of Sciences, 118(40):e2105841118, 2021.
  36. Spector A. and Janson L. Powerful knockoffs via minimizing reconstructability. The Annals of Statistics, 50(1):252–276, 2022.
  37. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 526:68–74, 2015.
  38. Tian X., Loftus J. R., and Taylor J. E. Selective inference with unknown variance via the square-root lasso. Biometrika, 105(4):755–768, 2018.
  39. Tibshirani R., Bien J., Friedman J., Hastie T., Simon N., Taylor J., and Tibshirani R. J. Strong Rules for Discarding Predictors in Lasso-Type Problems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 74(2):245–266, 2012.
  40. Wang G., Sarkar A., Carbonetto P., and Stephens M. A Simple New Approach to Variable Selection in Regression, with Application to Genetic Fine Mapping. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(5):1273–1300, 2020.
  41. Wang W. and Janson L. A high-dimensional power analysis of the conditional randomization test and knockoffs. Biometrika, 109(3):631–645, 2021.
  42. Weinstein A., Su W. J., Bogdan M., Barber R. F., and Candès E. J. A Power Analysis for Model-X Knockoffs with ℓp-Regularized Statistics. arXiv preprint arXiv:2007.15346, 2020.
  43. Willer C. J., Li Y., and Abecasis G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics, 26(17):2190–2191, 2010.
  44. Witten D. M. and Tibshirani R. Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society Series B: Statistical Methodology, 71(3):615–636, 2009.
  45. Zhang Q., Privé F., Vilhjálmsson B., and Speed D. Improved genetic prediction of complex traits from individual-level data or summary statistics. Nature Communications, 12:4192, 2021.
  46. Zou Y., Carbonetto P., Wang G., and Stephens M. Fine-mapping from summary data with the "Sum of Single Effects" model. PLoS Genetics, 18(7):e1010299, 2022.
