Skip to main content
Wiley Open Access Collection logoLink to Wiley Open Access Collection
. 2025 Nov 7;44(25-27):e70308. doi: 10.1002/sim.70308

A Modification to Two‐Stage Least Squares With Genetic Applications

Lei Fang 1, Wei Pan 1,
PMCID: PMC12593333  PMID: 41201239

ABSTRACT

Two‐stage least squares (2SLS) is by default applied to infer a putative causal association between an exposure, such as a gene or a protein, with an outcome such as a complex disease or trait, in transcriptome‐ or proteome‐wide association studies (TWAS/PWAS). In a typical two‐sample setting for TWAS/PWAS, the stage 1 sample size is much smaller than that of stage 2. To reduce the resulting attenuation bias and estimation uncertainty in stage 1 and boost the statistical power of the conventional TWAS, we propose a new method, called reverse two‐stage least squares (r2SLS): Instead of imputing a gene's expression (using genetic variants as instrumental variables, IVs) in stage 1 and then testing the association between the imputed expression and the observed outcome in stage 2 in the conventional 2SLS approach, we propose predicting the outcome (using IVs) and testing the association between the predicted outcome and the observed gene expression. Theoretically, we establish that the r2SLS estimator is asymptotically unbiased with a normal distribution. We also show theoretically when 2SLS and r2SLS are asymptotically equivalent and when r2SLS is asymptotically more efficient than 2SLS. We also consider the practical issue of how to select invalid IVs. We use simulations and three real data examples based on the GTEx gene expression data, UKB‐PPP proteomic data, and several GWAS summary datasets to demonstrate some advantages of r2SLS over 2SLS, including possibly better type I error control, higher statistical power and robustness to weak IVs.

Keywords: 2sls, causal inference, instrumental variable regression, r2sls, twas

1. Introduction

Genetic studies aim to discover genetic factors associated with disease, facilitating the understanding of the underlying mechanism of disease and thus the development of personalized medicine [1]. In particular, in the last twenty years, genome‐wide association studies (GWAS) have successfully uncovered tens of thousands of genetic variants and loci associated with various complex diseases and traits [2, 3]. However, these GWAS hits usually do not directly offer mechanistic insights [4]. Accordingly, transcriptome‐, proteome‐ and metabolome‐wide association studies (TWAS/PWAS/MWAS) have been proposed to identify (causally) related genes, proteins, and metabolites to shed biological insights [5, 6, 7, 8]; more backgrounds and related references can be found in a recent review [9]. We will use TWAS as a motivating example. In stage 1 of TWAS, one trains a model to predict a gene's expression level using genetic variants like SNPs. In stage 2, one imputes the gene's expression using the genetic variants and tests its association with the outcome. Hence, an interpretation of TWAS is to detect the association between the genetically regulated expression (GREX) of a gene with an outcome [5]. It turns out that the above two‐stage procedure in TWAS is an instance of two‐stage least squares (2SLS) in instrumental variable (IV) regression (with genetic variants as IVs); under suitable conditions (to be elaborated next), TWAS can be interpreted as detecting causal genes for the outcome [6].

In practice, two‐stage least squares (2SLS) is performed in a two‐sample setting for TWAS. Typically, stage 1 is conducted on a relatively small sample, for example, only in hundreds, as in the high‐impact GTEx data we used in our motivating example [10], due to the high cost of collecting gene expression data. On the other hand, stage 2 involves a much larger sample of GWAS on an outcome/trait of interest. In such a practical setting, inference based on 2SLS may be problematic. When the sample size in stage 1 is small, especially when the SNPs/IVs are only weakly associated with the exposure, the exposure would be poorly predicted with a large estimation uncertainty. Consequently, the causal estimate in stage 2 analysis would be biased towards the null, called attenuation bias [11]. In practice, the F‐statistic is often adopted as a criterion (e.g., >10) to check the presence of weak IVs in 2SLS. However, with a small sample size in stage 1, for example, in hundreds, the threshold is difficult to achieve. The consequence is that many exposures (e.g., genes) would be excluded with such a stringent criterion, thereby limiting the discovery of new exposure‐outcome associations. In addition, simply relying on F‐statistics does not guarantee avoidance of weak IV bias [12]. The bias from 2SLS could also further inflate the Type I error [13, 14].

In this report, we develop a novel approach, called reverse two‐stage least squares (r2SLS), by modifying 2SLS to address its issues. Rather than using IVs to predict the exposure based on the stage 1 sample, we leverage the larger sample size in stage 2 to predict the outcome by IVs, then test the association between the predicted outcome and the observed exposure. Intuitively, our prediction step will be more accurate, and so will be the final inference. In particular, our r2SLS also avoids the problem of weak IVs because we do not use the IV‐predicted exposure as a regressor.

The rest of the article is organized as follows: In Section 2, we introduce our model and proposed r2SLS along with its theoretical property and comparison with 2SLS. We also suggest how to handle invalid IVs by incorporating an existing method to identify them. In Section 3, through simulations, we demonstrate the superior performance of r2SLS over 2SLS in point estimation, Type I error control, and statistical power in some scenarios. Finally, in Sections 3.4 and 3.5, we confirm the advantage of r2SLS in two real data applications of TWAS with the GTEx gene expression data, UKB‐PPP proteomic data, and an Alzheimer's disease GWAS summary dataset. In addition, we also show that r2SLS does not introduce false positives in some negative control experiments. We end with a short summary and discussion.

2. Methods

Let X be an exposure of interest (e.g., a gene's expression in TWAS), Y be an outcome of interest, and Zp be some genetic variants (i.e., SNPs) to be used as IVs. We consider the most representative two‐sample setting: For simplicity of presentation, we assume that we have individual‐level data in two independent samples, 𝔻1=(x1i,Z1i)|i=1,,n1 and 𝔻2=(y2i,Z2i)|i=1,,n2; we will extend the method to GWAS summary data later. We use the vector or matrix form x1n1, Z1n1×p, y2n2, Z2n2×p to represent the corresponding samples. We also assume that n2 is usually much larger than n1 with a fixed p.

We consider the following two‐stage structural model:

x1=Z1β+ε (1)
y2=x2θ+ξ, (2)

where the random errors ε and ξ both follow a normal distribution with mean 0, and are independent of both Z1 and Z2. These random errors may contain environmental and other confounding effects, and thus are correlated if observed in the same sample. The exposure x2 is not observed but assumed to have the same distribution as (1). Similarly, y1 is not observed in the stage 1 sample but assumed to follow the same model as (2). Without loss of generality, we assume each IV in Z1 and Z2, x1 and y2 are standardized to have mean 0 and standard deviation 1. Our goal is to infer θ, often representing the causal effect of the exposure on the outcome.

2.1. New Method: r2SLS

Our strategy is to use a predicted outcome to draw an inference. Since we can rewrite (2) as y2=Z2βθ+εθ+ξ, our first step is to estimate βθ by regressing y2 on Z2 via ordinary least squares (OLS) with sample 𝔻2, obtaining an estimate βθ˜, then predict y1 as y^1=Z1βθ˜. Using the working model:

y^1=x1θ+εy1, (3)

with εy1=Z1(βθ˜βθ)εθ, we study the OLS estimator θ^ and derive its distribution. Specifically,

2.1. (4)

where θ^ can be decomposed into four parts: The true θ, a bias component that depends on the true θ, and two random terms that are from stage 2 and stage 1 estimation, respectively.

Under the null hypothesis θ=0, θ^ is an unbiased estimator. Thus, we have the following lemma:

Lemma 1

Under models ( 1 ) and ( 2 ), when Z2Z2/n2 is invertible, and θ^ is defined in Equation ( 4 ), then under the null hypothesis θ=0, we have the distribution of θ^ as follows:

n2θ^N(0,σt2ΦΨΦ), (5)

where Φ=x1Z1/x1x1,Ψ=(Z2Z2/n2)1, and σt2 represents the total variance of εθ+ξ. If n1,n2, then Φ=limn1x1Z1/x1x1=βZ and Ψ=limn2(Z2Z2/n2)1=Z1. We have

n2θ^dN(0,σt2βZβ). (6)

Remark 1

A naive method for drawing inference on θ with θ^ is to estimate the standard error from the OLS using var(y^1)/(n1var(x1)). However, as shown in Section 3, it is incorrect, leading to inflated Type I errors. The reason is that εy1i's are no longer identically distributed. Specifically, for each observation i, we have Z1i(βθ˜βθ)N(0,σt2Z1i[Z2Z2]1Z1i). Since Z1i's are different across i's, the variances of εy1i's are not equal. We also tried weighted least squares (WLS) to handle this heteroskedasticity problem in the simulation study. Unfortunately, it did not work well as expected.

When θ0, there are two additional components in θ^: εεθ/x1x1 and (Z1β)εθ/x1x1. Since E(εεθ/x1x1)0 and E((Z1β)εθ/x1x1)=0, θ^ is a biased estimate and the bias exclusively results from εεθ/x1x1. Denotes the bias factor κ=εε/x1x1=1R2, where R2 is the (empirical/sample) multiple coefficient of determination for the stage 1 model (1). Based on (4), to obtain an unbiased estimate of a non‐zero θ, we define a new estimator θ˜=θ^/(1κ). Then by incorporating the additional variance component (Z1β)εθ/x1x1, the distribution of θ˜ is given in Theorem 1.

Theorem 1

Under models ( 1 ) and ( 2 ), when Z2Z2/n2 is invertible, we have the distribution of θ˜ as follows:

n2(θ˜θ)N0,1(1κ)2(σt2ΦΨΦ+n2θ2σε2Z1β22/(x1x1)2), (7)

where σε2 is the variance of ε and Inline graphic is the L2 norm. If n1,n2 and limn1,n2n2/n1=ϕ(0,), then we have

n2(θ˜θ)dN0,σt2/(βZβ)+ϕθ2σε2/(βZβ). (8)

Remark 2

In practice, we recommend using (7) to conduct hypothesis testing if the stage 1 sample size n1 is small. The variance component n2θ2σε2Z1β22/(x1x1)2 can be estimated as (n2/n1)θ˜κ(1κ).

As n1, 1κ converges to the (population) coefficient of determination R2. To estimate 1κ, we propose applying either the Olkin‐Pratt R2 estimator [15] or the adjusted R2. Both methods provide similar performance in most cases. In general, we also do not recommend using unadjusted R2 since it is biased, as shown in Table S1. When R2 is low, for example the expression levels of some genes are not highly heritable, either the Olkin‐Pratt or the adjusted R2 estimate could be 0 or even negative, in which case the bias correction in r2SLS is not applicable. Therefore, in practice, if R2 in stage 1 is too small, for example, <0.01, we recommend using (5) directly to test null hypothesis H0:θ=0. See Table S2 for some simulation results in such a case.

When individual‐level data are not available, our method still works with GWAS summary data. Specifically, we use the adjusted R2 (to estimate R2 as the Olkin‐Pratt estimator is not applicable to summary statistics). The estimate θ^ is calculated as cor(x1,Z1)βθ˜, where cor(x1,Z1) is the vector of correlations of each SNP in Z with exposure X calculated from the GWAS summary statistics. Specifically, for any SNP Zj, given the marginal estimate of its effect size γ^jg and variance var(γ^jg), the marginal correlation rjg is calculated [16] as

rjg=γ^jgγ^jg2+(Nj2)·var(γ^jg),

where Nj is the corresponding sample size. For the standard error, Ψ is the inverse of the correlation matrix of Z that can be estimated from a reference panel. And the other components can also be estimated using their sample versions.

2.2. Comparison With 2SLS

The conventional 2SLS method uses the OLS in the stage 1 model (1) to predict the exposure in stage 2 (2) for inference. From [17], we first have

n1(β^β)dN(0,σε2/Z)

as n1. Then the predicted x^2=Z2β^, and the estimator for θ is Inline graphic. The asymptotic distribution for θ^ as n2 is

n2(θ^θ)dN(0,σt2/βZβ+ϕθ2σε2/βZβ). (9)

From Theorem 1, the asymptotic variance of θ˜ is the same as θ^, thus r2SLS reaches the same asymptotic efficiency as 2SLS. However, with finite samples, especially when the stage 1 sample size is small and the signals of IVs are weak, the accuracy of β is not attainable, leading to poor prediction of x^2. The resulting 2SLS estimator θ^ would bias the θ toward zero, as a well‐known phenomenon of attenuation in the measurement error literature. This scenario was discovered before [18] and further discussed in the context of genetic studies [14, 19]. As demonstrated in the simulation in Section 3, this bias could inflate Type I errors. On the contrary, r2SLS leverages the large sample size in stage 2 to yield a good prediction of y^1, largely alleviating the corresponding attenuation bias.

It is notable that the above asymptotic equivalence between 2SLS and r2SLS is achieved when the same set of SNPs/IVs is used to impute the exposure in 2SLS and to impute the outcome in r2SLS. On the other hand, in TWAS applications of 2SLS, the standard way to select IVs is to use some cis‐SNPs that are located around a gene's coding region. However, most gene expression variation is regulated by trans‐SNPs far away from the gene and dispersed across the whole genome [20]. Due to the small sample size in stage 1, trans‐SNPs are usually not used in 2SLS/TWAS. However, for r2SLS, the large stage 2 sample size allows trans‐SNPs to be considered for predicting y^1 and potentially increase power. With additional trans‐SNPs for prediction, we would have a smaller asymptotic variance of θ˜ as demonstrated in Corollary 1 below. Thus, in these situations, r2SLS would be (asymptotically) moreefficient than 2SLS. The corresponding advantage of r2SLS with higher power than 2SLS will be confirmed in simulations.

Corollary 1

In the true model ( 1 ), we decompose Z1=(Z1,cis,Z1,trans) and thus have

x1=Z1,cisβcis+Z1,transβtrans+ε. (10)

In addition, we assume Z1,cis is independent of Z1,trans. If we only use Z1,cis to predict x^1 as in a typical application of 2SLS to TWAS, obtaining the corresponding estimator θ^cis of θ, we have

n2(θ^cisθ)dN(0,σt2/βcisZcisβcis+ϕθ2σε2/βcisZcisβcis). (11)

Remark 3

It is clear that we have βcisZcisβcisβZβ, thus the variance in Equation (11) is no smaller than that in Equation (8).

2.3. Invalid IVs

In models (1) and (2), we assume that all SNPs are valid IVs, satisfying the following IV assumptions [11]: A valid IV is

  1. associated with the exposure;

  2. independent of the hidden confounders of the exposure‐outcome association;

  3. independent of the outcome, conditional on the exposure and the hidden confounders.

The second IV assumption also implies that the instrument‐outcome relationship should not be confounded. However, in genetic studies, there is a widespread presence of the pleiotropic effect of SNPs [21], meaning that SNPs may affect the outcome directly or through other hidden confounders. Here, we consider the situation when there are invalid IVs. In this situation, model (2) becomes the following model

y2=x2θ+Z2α+ξ (12)

to adjust such a pleiotropic effect. The support of α represents those invalid IVs. To obtain a correct inference of θ in the presence of invalid IVs, we adopted a two‐step procedure. The first step is to use TScML [21] to select those invalid IVs, which is defined as a set B^ while the true set is B. The key advantage of TScML is its ability to consistently identify the invalid IVs with a theoretical guarantee. Then in the second step, we adjust the effect of those identified invalid IVs in step 1, to infer θ by considering the following working model:

y^1=x1θ+Z1,BαB+εy1. (13)

Define θ^=x1(IPZ1,B^)y^1x1(IPZ1,B^)x1 with P represents the projection matrix, then our proposed r2SLS with invalid selection (r2SLS‐S) estimator θ˜=θ^/(1κ), where κ=εε(1m/n1)x1(IPZ1,B^)x1, where m=|B^|0. Let limn1n1Z1,BˆcβBˆc(IPZ1,Bˆ)2x1(IPZ1,Bˆ)x1=τ, limn1x1(IPZ1,Bˆ)Z1x1(IPZ1,Bˆ)x1=Φ, we then have the following theorem.

Theorem 2

Under models ( 1 ) and ( 12 ), when Z2Z2/n2 is invertible and other mild conditions, in addition, n1,n2 and limn1,n2n2/n1=ϕ(0,), TScML selects the true invalid set B with probability converges to 1. We then have the following asymptotic distribution of θ˜:

n2(θ˜θ)dN0,1(1κ)2(σt2ΦΨΦ+ϕθ2σε2τ2). (14)

Remark 4

The additional mild conditions include the plurality condition and eigenvalue conditions of the covariance matrix of Z2 for TScML to have a variable selection consistency, see the supplement for the details.

Remark 5

Ideally, we would want to develop an invalid IV selection method (similar to TScML) under our r2SLS framework. However, due to the fact that conditional on Z1, x1 is independent of y^1, because, by construction, Z1 is directly causal to y^1, not mediated through x1. Thus, it is challenging to select those invalid IVs under the r2SLS framework; we leave this as a future topic.

Remark 6

In practice, when selecting IVs for r2SLS, we recommend including both IVs (e.g., cis‐SNPs) related to the (gene/protein) exposure (based on the exposure GWAS data) and IVs (e.g., trans‐SNPs) associated with the outcome (based on the outcome GWAS data). There are two reasons to include IVs that are related to the exposure. First, selecting IVs based solely on their association with the outcome can be problematic: If none of these IVs are related to the exposure, then all IVs are invalid. Moreover, if most IVs are unrelated to the exposure, the predicted outcome will contain substantial variability unrelated to the exposure, which increases the estimation variance and thus reduces power. At the same time, including IVs related to the outcome, which may be trans‐SNPs for the exposure, may improve statistical power, especially with a small dataset of exposure. On the other hand, with a large dataset of exposure, using IVs only related to the exposure may have sufficient power.

2.4. Extension to Multiple Exposures

We extend r2SLS to multivariate analysis with multiple exposures. Suppose we have d exposures of interest, a typical multivariate two‐stage model in the two‐sample setting is the following:

x1,j=Z1βj+εj,y2=j=1dx2,jθj+ξ, (15)

where x1,j represents exposure j in the stage 1 sample and y2 is from the stage 2 sample; all the error terms εj and ξ are assumed to be from a multivariate normal distribution.

Denote X=(x1,1,,x1,d), B=(β1,,βd), V=(ε1,,εd), and θ=(θ1,,θd). As in the univariate case, we use Z1 to predict y^1 as Z1γ^. We still use the OLS estimator to derive the corresponding estimator and its asymptotic normality. Specifically, we have θ^=(XX)1Xy^1. Let the correction matrix Λ=IdE(XX)1VV1, then the unbiased estimator for θ is θ˜=Λθ^. Let σt2 denote the total variance of j=1dεjθj+ξ, and σtX2 denote the total variance of j=1dεjθj. We have the following asymptotic distribution for θ˜.

Theorem 3

Under the model setting ( 15 ), when n1,n2, we have

n1(θ˜θ)dN(0,ϕσt2ΛWE(Z1Z1)1WΛ+σtX2Λ(X)1BZB(X)1) (16)

with W=E(XX)1E(XZ1).

Λ can be estimated as Inline graphic, where each entry in ^X is the sample covariance of x1,i and x1,j, and each entry in ^ε is the sample covariance of εi and εj.

We can also consider the presence of invalid IVs in the multivariate case using the same strategy as before, that is, selecting invalid IVs via a constrained maximum likelihood approach [22].

3. Results

3.1. Simulations: More Accurate Estimation and Better Controlled Type I Errors by r2SLS

We conducted simulations to investigate the operating characteristics of r2SLS as compared with the following methods: r2SLS, r2SLS‐naive, 2SLS, observed‐1, observed‐2, and weighted least squares (WLS). In r2SLS we used the adjusted R2 to calculate κ. The method r2SLS‐naive used the predicted y^1 to conduct a simple linear regression on the observed exposure with the stage 1 sample. Both observed‐1 and observed‐2 performed a simple linear regression on the observed exposure and outcome in samples 1 and 2, respectively (while assuming the availability of both the exposure and outcome in both samples). We added WLS as a comparison; the weights were estimated from the squared residuals plus 106 to avoid a possible (near) singularity. Considering a typical TWAS application, we set the stage 1 sample size as n1=500. We generated p=30 cis‐SNPs independently from a standard normal distribution with their effects βjN(0.1,0.12) for j=1,,30. The random errors ε and ξ in the same sample were drawn from a bivariate normal distribution with mean 0 and a covariance matrix with all non‐diagonals as 0.25 and diagonals as 1. We set the stage 1 R2=0.235. We also considered a weak IV scenario with p=120 cis‐SNPs of weaker effects βj0.5N(0.1,0.12) but with the same R2. The stage 2 sample size was n2=10000 with θ{0,0.1,0.2}. The results for estimating θ are summarized in Table 1 based on 500 repetitions in each scenario.

TABLE 1.

Simulation results for estimating θ and testing H0:θ=0 versus H0:θ0 for each method with n1=500 and n2=10000. SD, SE, and MSE represent the standard deviation of the estimates, their mean standard error, and mean squared error, respectively.

θ=0
θ=0.1
θ=0.2
Methods Mean SD SE MSE Type‐I Mean SD SE MSE Power Mean SD SE MSE Power
p=30
r2SLS 1.32e−4 0.020 0.021 4.03e−4 0.042 0.102 0.022 0.023 4.96e−4 0.996 0.204 0.028 0.028 8.08e−4 1.000
βjN(0.1,0.12)
2SLS ‐7.86e−5 0.016 0.017 2.58e−4 0.040 0.083 0.017 0.018 5.60e−4 0.998 0.167 0.020 0.022 1.47e−3 1.000
r2SLS‐naive 7.79e−5 0.005 2.11e−3 2.12e−5 0.372 0.023 0.005 2.80e−3 5.89e−3 0.996 0.047 0.006 4.31e−3 0.024 1.000
observed‐1 0.188 0.038 0.037 0.037 1.000 0.288 0.038 0.038 0.037 1.000 0.388 0.038 0.038 0.037 1.000
observed‐2 0.190 0.009 0.009 0.036 1.000 0.291 0.009 0.009 0.036 1.000 0.391 0.009 0.009 0.036 1.000
WLS 5.71e−5 0.005 2.69e−4 2.05e−5 0.892 0.022 0.005 3.19e−4 6.07e−3 1.000 0.044 0.006 4.06e−4 0.024 1.000
p=120
r2SLS 1.21e−3 0.027 0.027 7.35e−4 0.042 0.104 0.031 0.029 9.55e−4 0.976 0.207 0.041 0.033 1.72e−3 1.000
βj0.5N(0.1,0.12)
2SLS 8.08e−4 0.013 0.013 1.66e−4 0.042 0.051 0.014 0.014 2.62e−3 0.974 0.100 0.016 0.015 0.010 1.000
r2SLS‐naive 3.87e−4 0.006 4.29e−3 3.59e−5 0.162 0.024 0.007 4.80e−3 5.81e−3 1.000 0.048 0.008 5.94e−3 0.023 1.000
observed‐1 0.189 0.038 0.038 0.037 1.000 0.289 0.038 0.038 0.037 1.000 0.389 0.038 0.038 0.037 1.000
observed‐2 0.191 0.009 0.009 0.036 1.000 0.291 0.009 0.009 0.036 1.000 0.391 0.009 0.009 0.036 1.000
WLS 3.63e−4 0.006 4.14e−3 3.62e−5 0.894 0.024 0.007 4.78e−4 5.87e−3 1.000 0.047 0.008 4.91e−3 0.023 1.000

When θ=0, r2SLS performed similarly to 2SLS with well‐controlled Type I errors and high power in both situations, while the other methods, r2SLS‐naive, observed‐1, and observed‐2, were unable to control the Type I error at all. However, if θ0, 2SLS suffered from biased estimation, and it became worse with weaker IVs in stage 1 (though the R2 in stage 1 remained the same). As θ increased, r2SLS demonstrated a greater advantage by achieving a smaller mean squared error (MSE). In addition, WLS performed badly in this situation. We suspect that because of the difficulty of accurately estimating the weights in the small sample size, the standard error from WLS was not well estimated to obtain a valid inference of θ.

When we used correlated IVs with an autocorrelation of 0.5 while keeping every other aspect of the settings the same, as shown in Table 2, we still reached a similar conclusion. Therefore, r2SLS worked well with correlated IVs.

TABLE 2.

Simulation results for estimating θ and testing H0:θ=0 versus H0:θ0 for each method with n1=500 and n2=10000 with IVs of autocorrelation of 0.5. SD, SE, and MSE represent the standard deviation of the estimates, their mean standard error, and mean squared error, respectively.

θ=0
θ=0.1
θ=0.2
Methods Mean SD SE MSE Type‐I Mean SD SE MSE Power Mean SD SE MSE Power
p=30
r2SLS 1.01e−4 0.011 0.011 1.32e−4 0.046 0.101 0.013 0.013 1.65e−4 1.000 0.201 0.016 0.022 2.53e−4 1.000
βjN(0.1,0.12)
2SLS ‐1.57e−4 0.011 0.010 1.13e−4 0.046 0.094 0.012 0.012 1.77e−4 1.000 0.187 0.014 0.014 3.59e−4 1.000
p=120
r2SLS 5.80e−5 0.013 0.013 1.73e−4 0.048 0.101 0.015 0.014 2.12e−4 1.000 0.202 0.018 0.014 3.32e−4 1.000
βj0.5N(0.1,0.12)
2SLS 5.34e−5 0.006 0.004 3.78e−5 0.054 0.075 0.010 0.010 7.49e−4 1.000 0.149 0.012 0.012 2.7e−3 1.000

Furthermore, the bias from 2SLS also inflated its Type I error when we were testing the null hypothesis H0:θ=θ0 versus HA:θθ0 for a non‐zero θ0. As Figure 1A shows, the Type I error was not well controlled by 2SLS and jumped up quickly as θ increased, while r2SLS yielded much more stable and better controlled Type I errors. When θ became large, r2SLS had a slightly inflated type I error due to the bias from estimating 1/(1κ) or 1/R2. The nonlinear nature of this bias correction factor with respect to R2 poses a difficulty in obtaining an unbiased estimate, which we leave as future work.

FIGURE 1.

FIGURE 1

Comparison of r2SLS and 2SLS in simulations: (A) Type I errors for r2SLS and 2SLS for testing H0:θ=θ0 versus HA:θθ0 with 30 cis‐SNPs; (B) Type I errors (for θ0=0) and power (for θ00) for r2SLS and 2SLS for testing H0:θ=0 versus HA:θ0 with 20 trans‐SNPs; (C) Similar to (B) but with 20 trans‐SNPs of stronger effects. The horizontal dashed line represents the nominal significance level of 0.05.

In addition, as shown in Supporting Information: Table S2, if R2 was equal or close to 0 in the (true) stage 1 model, unbiased estimation was not feasible by either r2SLS (because the correction factor 1/R2 would be exactly or nearly ) or 2SLS; however, based on (5), one may still use the un‐corrected version of r2SLS to conduct hypothesis testing.

3.2. Simulations: r2SLS May Have Higher Power Than 2SLS

To mimic real TWAS applications, we considered both cis‐SNPs and trans‐SNPs. We simulated cis‐SNPs and trans‐SNPs independently for a single exposure with a normal distribution. We had p1=20 cis‐SNPs and p2=20 trans‐SNPs. We generated the exposure and outcome from the following model:

X=j=1p1Zjβj+j=p1+1p1+p2Zjβj+ϵx,Y=Xθ+ϵy. (17)

The errors ϵx and ϵy were drawn from a bivariate normal distribution with mean 0 and a covariance matrix of non‐diagonals as 0.25 and diagonals as 1. We generated βj's from N(0.05,0.052) for cis‐SNPs with j=1,,20, and from N(0.05,0.12) or N(0.1,0.12) for trans‐SNPs with j=21,,40. We set the stage 1 and stage 2 sample sizes as n1=500 and n2=10000, respectively. For 2SLS, we only used cis‐SNPs to predict X with the stage 1 sample, whereas for r2SLS, we used both cis‐SNPs and trans‐SNPs to predict Y with the stage 2 sample. With their empirical Type I errors and power in Figure 1B,C based on 500 repetitions, it is clear that r2SLS with trans‐SNPs achieved higher power than 2SLS, and the advantage appeared larger with more variations of the exposure explained by trans‐SNPs.

Finally, when the outcome did not follow a normal distribution, but a skewed or a heavy‐tailed distribution such as an exponential distribution or a t‐distribution with degrees of freedom 5, respectively, r2SLS still outperformed 2SLS with better controlled type I errors and higher power as demonstrated in Supporting Information: Figures S1 and S2, respectively.

3.3. Simulations: r2SLS‐S Performed Well in the Presence of Invalid IVs

We evaluated the performance of r2SLS‐S in the presence of invalid IVs. We set the IVs as correlated with an autocorrelation of 0.5, and invalid IVs with βj=0 or βjN(0.1,0.12) and αj following Unif(0.1,0.2) for j=1,,5 when p=30, and for j=1,,20 when p=120. The rest of the parameters were set the same as in the valid IV case in Section 3.1. In Tables 3 and 4, both 2SLS and r2SLS gave biased estimates, causing inflated Type I errors. In contrast, the proposed r2SLS‐S controlled the Type I error well as TScML, while achieving similar power. This suggests that if we adopt an existing method that can effectively select out invalid IVs, r2SLS‐S would work well in the presence of invalid IVs.

TABLE 3.

Simulation results with invalid IVs for estimating θ and testing H0:θ=0 versus H0:θ0 for each method with n1=500 and n2=10000. The invalid IVs βj=0 and αjUnif(0.1,0.2) for j=1,,5 when p=30 and for j=1,,20 when p=120. For the valid IVs, βjN(0.1,0.12) when p=30, and βj0.5N(0.1,0.12) when p=120. SD, SE, and MSE represent the standard deviation of the estimates, their mean standard error, and mean squared error, respectively.

θ=0
θ=0.025
θ=0.05
Methods Mean SD SE MSE Type‐I Mean SD SE MSE Power Mean SD SE MSE Power
p=30
r2SLS‐S −1.60e−4 0.014 0.013 1.85e−4 0.058 0.025 0.014 0.013 1.90e−4 0.480 0.050 0.014 0.014 2.05e−4 0.960
r2SLS 0.045 0.043 0.014 3.91e−3 0.698 0.071 0.043 0.014 3.94e−3 0.866 0.096 0.043 0.014 3.98e−3 0.950
TScML −1.45e−4 0.013 0.012 1.58e−4 0.054 0.023 0.013 0.012 1.65e−4 0.500 0.046 0.013 0.012 1.87e−4 0.962
2SLS 0.040 0.033 0.012 2.65e−3 0.716 0.063 0.033 0.012 2.51e−3 0.888 0.086 0.033 0.013 2.37e−3 0.974
p=120
r2SLS‐S 4.91e−4 0.015 0.014 2.23e−4 0.054 0.025 0.015 0.015 2.27e−4 0.414 0.050 0.015 0.015 2.37e−4 0.920
r2SLS 0.023 0.095 0.016 9.57e−3 0.744 0.048 0.095 0.016 9.61e−3 0.800 0.074 0.095 0.016 9.66e−3 0.838
TScML 4.67e−4 0.011 0.010 1.15e−4 0.056 0.019 0.011 0.010 1.55e−4 0.466 0.037 0.011 0.010 2.85e−4 0.946
2SLS 0.016 0.059 0.010 3.75e−3 0.754 0.034 0.059 0.010 3.58e−3 0.796 0.051 0.059 0.011 3.52e−3 0.842

TABLE 4.

Simulation results with invalid IVs for estimating θ and testing H0:θ=0 versus H0:θ0 for each method with n1=500 and n2=10000. The invalid IVs βjN(0.1,0.12) and αjUnif(0.1,0.2) for j=1,,5 when p=30 and for j=1,,20 when p=120. For the valid IVs, βjN(0.1,0.12) when p=30, and βj0.5N(0.1,0.12) when p=120. SD, SE, and MSE represent the standard deviation of the estimates, their mean standard error, and mean squared error, respectively.

θ=0
θ=0.025
θ=0.050
Method Mean SD SE MSE Type‐I Mean SD SE MSE Power Mean SD SE MSE Power
p=30
r2SLS‐S −1.60e−4 0.014 0.013 1.85e−4 0.058 0.025 0.014 0.013 1.90e−4 0.478 0.050 0.014 0.014 2.05e−4 0.960
r2SLS 0.229 0.037 0.016 0.054 1.000 0.254 0.037 0.017 0.054 1.000 0.279 0.038 0.018 0.054 1.000
TScML −1.44e−4 0.013 0.012 1.59e−4 0.054 0.023 0.013 0.012 1.65e−4 0.500 0.046 0.013 0.012 1.87e−4 0.962
2SLS 0.211 0.028 0.015 0.045 1.000 0.234 0.028 0.016 0.045 1.000 0.258 0.029 0.017 0.044 1.000
p=120
r2SLS‐S 4.91e−4 0.015 0.014 2.23e−4 0.054 0.025 0.015 0.015 2.23e−4 0.414 0.050 0.015 0.015 2.35e−4 0.920
r2SLS 0.488 0.081 0.027 0.245 1.000 0.514 0.082 0.028 0.245 1.000 0.539 0.082 0.029 0.246 1.000
TScML 4.69e−4 0.011 0.010 1.15e−4 0.056 0.019 0.011 0.010 1.55e−4 0.468 0.037 0.011 0.010 2.85e−4 0.946
2SLS 0.360 0.051 0.020 0.132 1.000 0.378 0.051 0.021 0.127 1.000 0.397 0.052 0.022 0.123 1.000

3.4. Application 1: r2SLS is More Powerful Than 2SLS in Testing Gene‐Protein Associations

We illustrate the performance of r2SLS‐S against 2SLS via a real data example. Biologically, the genetic information from a gene flows into its encoded protein. Here, we treat the expression of a protein as the outcome and the corresponding gene's expression as the exposure; it is biologically reasonable to assume that they (or at least most of them) are causally associated. We tested H0:θ=0 versus HA:θ0 for the effect of gene expression on protein expression for each gene‐protein pair. We chose the candidate gene‐protein pairs from two large‐scale Alzheimer's disease (AD) GWASs, IGAP [23] and EADB [24]. Combining their results, we obtained in total of 110 loci significantly associated with AD. For stage 2, we used the latest large‐scale proteomic data from the UK Biobank Pharma Proteomics Project (UKB‐PPP), containing the plasma proteomic profiles of 54 219 participants with 2923 proteins [25]. Among those AD‐associated genes, 24 had their protein expression levels measured in the UKB‐PPP. In addition, 21 of the 24 genes were identified as putative causal AD genes by the Alzheimer's Disease Sequencing Project (ADSP) gene verification committee [26, 27, 28, 29]. We used these 21 genes/proteins for the remaining analysis. The gene expression data for stage 1 were from the GTEx v8 [10]. We used the whole blood gene expression data from 558 participants of European ancestry. We first conducted GWAS for each gene based on the individual‐level gene expression data. For quality control, we only keep the SNPs with a maximum missing rate of 0.05, a minor allele frequency greater than 0.01, and a Hardy‐Weinberg equilibrium test p value above 106. For each gene, we selected its cis‐SNPs from an expanded region of ±500kb surrounding the gene. For one gene, there were no cis‐SNPs from the GTEx v8 data, and it was excluded for further analysis. We further performed clumping with a threshold of r2=0.5 to remove highly correlated SNPs. For trans‐SNP selection for each protein, we first removed a small portion of participants with multiple visits and only used the baseline protein measurements in UKB‐PPP. We further removed related individuals to avoid bias from genetic correlations [30], and those from other population/ethnic groups, leaving a final cohort of 38 083 unrelated participants of self‐reported European ancestry for further analysis. We then conducted GWAS for each protein (on its inverse‐normal‐transformed expression levels) with the individual‐level UKB‐PPP data [25], then selected the SNPs with p values <5108 after clumping with r2=0.5. Among those cis‐SNPs that were available in both GTEx v8 and UKB‐PPP data, we selected the top 50 SNPs with the highest (marginal) absolute correlation to the expression level of the corresponding protein. We used the individual‐level GTEx and UKB‐PPP data in stages 1 and 2, respectively. Since the stage 1 adjusted R2 for gene IL34 using both cis‐ and trans‐SNPs was negative, its result was excluded. We also excluded gene SORT1 as its SNP covariance matrix in stage 1 was (nearly) singular.

We compare the point estimates for each gene and the corresponding p values obtained from r2SLS and 2SLS in Figure 2. The detailed results are provided in Table S3. Figure 2A shows that for the majority, the estimates of the causal effects θ from r2SLS were larger than those from 2SLS in magnitude, confirming the attenuation bias of 2SLS as shown in the simulation study. Figure 2B shows that there were six significant genes (TREM2, MME, CTSB, ICA1, GRN, CTSB) identified by r2SLS, but missed by 2SLS, after the Bonferroni adjustment, indicating that r2SLS is more powerful than 2SLS.

FIGURE 2.

FIGURE 2

Comparison of r2SLS and 2SLS for inferring associations of 19 AD risk genes' expression levels with their corresponding proteins': (A) The point estimates of θ, where the red dots represent the estimates from r2SLS are larger than those from 2SLS in magnitude and with the same sign, the (blue) line is the identity line; (B) the p values for H0:θ=0, where the red dots represent the eight significant genes identified by r2SLS but not by 2SLS and the black dot represents otherwise, and the (red) lines represent the Bonferroni‐adjusted significance threshold.

Finally, we also followed the same procedure for trans‐SNP selection for proteins to select trans‐SNPs for each gene with GTEx data. However, only genes CTSH and EPHA1 had any trans‐SNPs selected. By including such trans‐SNPs in 2SLS, we obtained similar results, including their statistical significance levels, though their effect size estimates were larger, as expected. The detailed results are included in Table S4. These results confirmed our main point: Due to a much smaller stage 1 sample size and importantly, expected weak effects of trans‐SNPs, the statistical power would be too low to select trans‐SNPs as IVs in stage 1 for 2SLS or standard TWAS; thus, such a selection is almost always skipped in practice [5, 6]. In contrast, as demonstrated here, with a much larger sample size in stage 2, it is much more feasible with higher power to successfully select some trans‐SNPs; by taking advantage of this feature of genetic data in TWAS, our proposed r2SLS is less biased and more powerful.

3.5. Application 2: r2SLS Detects APOE‐AD Association

Due to often small sample sizes of gene expression data, as for GTEx v8, many genes will be excluded from 2SLS‐based TWAS analysis to avoid weak IV bias with their insignificant cis‐heritability (or R2) or low F‐statistics (e.g., less than the usual threshold of 10). In this data example, we aimed to show the unnecessary conservativeness of this practice based on 2SLS in the conventional TWAS; in contrast, adopting r2SLS sidesteps this issue. More specifically, we considered a well‐known causal gene APOE for AD. In the GTEx v8 data from the brain hippocampus, which is likely related to AD, there were n1=150 samples of European ancestry. With such a small sample size, APOE failed to achieve a significant cis‐heritability (as shown in the Fusion website [6]) or the usual F‐statistic threshold. Hence, in any conventional TWAS analysis, APOE would be excluded. Here, for stage 2, we used the aforementioned IGAP AD GWAS data with 21 982 cases and 41 944 controls. We used the same procedure as in the previous data example to select cis‐SNPs and trans‐SNPs. The reference panel is from those 38 083 UKB‐PPP participants in the previous Section 3.4. But given an even smaller sample size, we chose only 10 IVs as the top 10 cis‐SNPs with the highest absolute correlations to the APOE expression levels for 2SLS, while we obtained 57 additional SNPs significantly associated with AD (p<5108) after clumping with r2=0.5 (as trans‐SNPs) for r2SLS. For 2SLS, the estimated (causal) effect size of APOE on AD was 0.020, with a standard error of 0.010 and a p value of 0.042. By r2SLS, we obtained a significant p value of 1.68e4 with a larger (in magnitude) effect estimate of 0.126 and a standard error of 0.035. This example demonstrated again the usefulness of r2SLS in TWAS for picking up a gene missed by 2SLS due to a small sample size in stage 1, yielding higher power and coverage for scientific discoveries.

3.6. Application 3: r2SLS Yielded No False Positives in a Negative Control Experiment

Here, we added a real data example to demonstrate the performance of r2SLS for negative controls. We chose a few negative controls for outcome hair color (blonde) with the exposures as high‐density lipoprotein cholesterol (HDL), coronary heart disease (CAD), and asthma. The hair color GWAS is from the Neale Lab (GWAS round 2) with a sample size of 360 270 individuals of European ancestry. The GWASs of the exposures have a sample size of 187 167, 184 305, and 26 475, respectively, from different consortia [31, 32, 33].

With the large sample sizes of the GWAS data, we selected candidate IVs based on the p value threshold of 5108 in each exposure GWAS for 2SLS with clumping r2=0.001. For r2SLS, we selected IVs once based on the hair color GWAS with the same p value and clumping criteria as for 2SLS. The reference panel consisted of those 38 083 UKB‐PPP participants in Section 3.4. We applied both r2SLS and 2SLS and summarized the results in Table 5. The results for both methods remained negative, showing no statistically significant exposure‐outcome associations, thus supporting the use of r2SLS (and 2SLS) in this application.

TABLE 5.

Inference results of the negative controls of HDL, CAD, asthma for hair color (blonde) for r2SLS and 2SLS.

Exposures Hair color (outcome)
r2SLS 2SLS
(# of SNPs r2SLS/2SLS) Effect est SE
p
Effect est SE
p
HDL (105/106) −0.070 0.042 0.09 −0.014 0.008 0.07
CAD (188/35) −2.48 2.67 0.35 0.006 0.016 0.72
Asthma (42/7) −0.359 0.213 0.09 0.008 0.015 0.62

4. Discussion

We have proposed r2SLS to address some potential limitations of 2SLS, which has been exclusively adopted in many applications for genetic studies, such as current TWAS/PWAS/MWAS. In r2SLS, we draw inference based on an IV‐predicted outcome and an observed exposure, instead of on an IV‐predicted exposure and an observed outcome in the conventional 2SLS. By leveraging often much larger stage 2 samples in TWAS, r2SLS can provide more accurate estimation, whereas 2SLS may suffer from severe attenuation biases, leading to inflated type I errors and loss of power. At the end, the proposed r2SLS can be more powerful than 2SLS, as shown in simulations and two real data applications. In addition, theoretically there is a non‐identifiability issue with 2SLS, but not with r2SLS, if R2=0 in the stage 1 model, though, as found in our simulations, numerically it does not seem to be a real problem for hypothesis testing on, but not estimation of, θ with finite samples. This suggests that the current practice of excluding an exposure (e.g., a gene) with a too small R2 (or cis‐heritability) from exposure‐outcome association testing in 2SLS/TWAS is overly conservative and unnecessary. In particular, in such a situation (and those with very weakly associated IVs), applying our uncorrected version of r2SLS in Equation (5) performs well for hypothesis testing (but not for parameter estimation), and should be applied. This is highly relevant in practice because with often a small sample size for an exposure, its R2 (or cis‐heritability) may be under‐estimated, and thus unnecessarily excluded from TWAS analysis; our proposed method overcomes this severe limitation, as demonstrated in our second data example for APOE‐AD association. Finally, the proposed r2SLS also has a causal interpretation just like 2SLS because of their estimating the same causal parameter [34]; on the other hand, for the same reason, if one would prefer to interpret TWAS as a gene‐based association test for the association between the genetically regulated component of a gene and an outcome, r2SLS can be equally used for this purpose.

One future direction for r2SLS is to further develop and study strategies to deal with invalid IVs, for example, under its own framework, such as under the working model (13). Another direction is to extend the outcome prediction from OLS to other methods, such as penalized or Bayesian regression with high‐dimensional data. Finally, the normality assumption on both the exposure and outcome may be relaxed for large‐sample approximations. These are interesting topics warranting future investigation.

Conflicts of Interest

The authors declare no conflicts of interest.

Supporting information

Data S1. Supporting Information.

SIM-44-0-s001.pdf (204.4KB, pdf)

Acknowledgments

We thank the reviewers and editors for many helpful and insightful comments. We thank Mykhaylo Malakhov for helpful suggestions. This research was supported by NIH Grant Nos. RF1 AG067924, R01 AG065636, U01 AG073079, and R01 AG074858, and the Minnesota Supercomputing Institute (MSI) at the University of Minnesota.

Fang L. and Pan W., “A Modification to Two‐Stage Least Squares With Genetic Applications,” Statistics in Medicine 44, no. 25‐27 (2025): e70308, 10.1002/sim.70308.

Funding: This work was supported by the National Institutes of Health (Grant Nos. RF1 AG067924, R01 AG065636, U01 AG073079, and R01 AG074858).

Data Availability Statement

The data that support the findings of this study are openly available to approved users in the UK Biobank at https://www.ukbiobank.ac.uk/.

The R package for r2SLS is available on https://github.com/lei‐fang‐stat/r2SLS. The access to the UKB data and GTEx data was approved through UKB Application #35107 and dbGaP Project #26511, respectively. The hair color (blond), HDL, CAD, and asthma GWAS results can be obtained from TwoSampleMR R package with request ID of “ukb‐d‐1747_1”, “ieu‐a‐299”, “ieu‐a‐7”, “ieu‐a‐44”, respectively.

References

  • 1. Collins F. S., Green E. D., Guttmacher A. E., and Guyer M. S., “Institute US National Human Genome Research. A Vision for the Future of Genomics Research,” Nature 422 (2003): 835–847. [DOI] [PubMed] [Google Scholar]
  • 2. Abdellaoui A., Yengo L., Verweij Karin J. H., and Visscher P. M., “15 Years of GWAS Discovery: Realizing the Promise,” American Journal of Human Genetics 110 (2023): 179–194. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Uffelmann E., Huang Q. Q., Munung N. S., et al., “Genome‐Wide Association Studies,” Nature Reviews Methods Primers 1 (2021): 59. [Google Scholar]
  • 4. Visscher P. M., Wray N. R., Zhang Q., et al., “10 Years of GWAS Discovery: Biology, Function, and Translation,” American Journal of Human Genetics 101 (2017): 5–22. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Gamazon E. R., Wheeler H. E., Shah K. P., et al., “A Gene‐Based Association Method for Mapping Traits Using Reference Transcriptome Data,” Nature Genetics 47 (2015): 1091–1098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Gusev A., Ko A., Shi H., et al., “Integrative Approaches for Large‐Scale Transcriptome‐Wide Association Studies,” Nature Genetics 48 (2016): 245–252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Panyard D. J., Kim K. M., Darst B. F., et al., “Cerebrospinal Fluid Metabolomics Identifies 19 Brain‐Related Phenotype Associations,” Communications Biology 4 (2021): 63. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Liu S., Zhu J., Zhong H., et al., “Identification of Proteins Associated With Type 2 Diabetes Risk in Diverse Racial and Ethnic Populations,” Diabetologia 67 (2024): 1–17. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Evans P., Nagai T., Konkashbaev A., Zhou D., Knapik E. W., and Gamazon E. R., “Transcriptome‐Wide Association Studies (TWAS): Methodologies, Applications, and Challenges,” Current Protocols 4 (2024): e981. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Lonsdale J., Thomas J., Salvatore M., et al., “The Genotype‐Tissue Expression (GTEx) Project,” Nature Genetics 45 (2013): 580–585. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Burgess S., Davies N. M., and Thompson S. G., “Bias due to Participant Overlap in Two‐Sample Mendelian Randomization,” Genetic Epidemiology 40 (2016): 597–608. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Burgess S. and Thompson S. G., “Collaboration Crp Chd Genetics. Avoiding Bias From Weak Instruments in Mendelian Randomization Studies,” International Journal of Epidemiology 40 (2011): 755–764. [DOI] [PubMed] [Google Scholar]
  • 13. Choi J., Gu J., and Shen S., “Weak‐Instrument Robust Inference for Two‐Sample Instrumental Variables Regression,” Journal of Applied Econometrics 33 (2018): 109–125. [Google Scholar]
  • 14. Wang S. and Kang H., “Weak‐Instrument Robust Tests in Two‐Sample Summary‐Data Mendelian Randomization,” Biometrics 78 (2022): 1699–1713. [DOI] [PubMed] [Google Scholar]
  • 15. Olkin I. and Pratt J. W., “Unbiased Estimation of Certain Correlation Coefficients,” Annals of Mathematical Statistics 29 (1958): 201–211. [Google Scholar]
  • 16. Xue H. and Pan W., “Inferring Causal Direction Between Two Traits in the Presence of Horizontal Pleiotropy With GWAS Summary Data,” PLoS Genetics 16 (2020): e1009105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Inoue A. and Solon G., “Two‐Sample Instrumental Variables Estimators,” Review of Economics and Statistics 92 (2010): 557–561. [Google Scholar]
  • 18. Angrist J. D. and Krueger A. B., “Split‐Sample Instrumental Variables Estimates of the Return to Schooling,” Journal of Business & Economic Statistics 13 (1995): 225–235. [Google Scholar]
  • 19. Burgess S. and Thompson S. G., “Bias in Causal Estimates From Mendelian Randomization Studies With Weak Instruments,” Statistics in Medicine 30 (2011): 1312–1323. [DOI] [PubMed] [Google Scholar]
  • 20. Albert F. W., Bloom J. S., Siegel J., Day L., and Kruglyak L., “Genetics of Trans‐Regulatory Variation in Gene Expression,” eLife 7 (2018): e35471. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Xue H., Shen X., and Pan W., “Causal Inference in Transcriptome‐Wide Association Studies With Invalid Instruments and GWAS Summary Data,” Journal of the American Statistical Association 118 (2023): 1525–1537. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Fang L., Xue H., Lin Z., and Pan W., “Multivariate Proteome‐Wide Association Study to Identify Causal Proteins for Alzheimer Disease,” American Journal of Human Genetics 112 (2025): 291–300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Kunkle B. W., Grenier‐Boley B., Sims R., et al., “Genetic Meta‐Analysis of Diagnosed Alzheimer s Disease Identifies New Risk Loci and Implicates Aβ, Tau, Immunity and Lipid Processing,” Nature Genetics 51 (2019): 414–430. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Bellenguez C., Küçükali F., Jansen I. E., et al., “New Insights Into the Genetic Etiology of Alzheimer s Disease and Related Dementias,” Nature Genetics 54 (2022): 412–436. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Sun B. B., Chiou J., Traylor M., et al., “Plasma Proteomic Associations With Genetics and Health in the UK Biobank,” Nature 622 (2023): 329–338. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. King E. A., Davis J. W., and Degner J. F., “Are Drug Targets With Genetic Support Twice as Likely to Be Approved? Revised Estimates of the Impact of Genetic Support for Drug Mechanisms on the Probability of Drug Approval,” PLoS Genetics 15 (2019): e1008489. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Novikova G., Kapoor M., Tcw J., et al., “Integration of Alzheimer s Disease Genetics and Myeloid Genomics Identifies Disease Risk Regulatory Elements and Genes,” Nature Communications 12 (2021): 1610. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Mountjoy E., Schmidt E. M., Carmona M., et al., “An Open Approach to Systematically Prioritize Causal Variants and Genes at all Published Human GWAS Trait‐Associated Loci,” Nature Genetics 53 (2021): 1527–1533. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Amlie‐Wolf A., Tang M., Mlynarski E. E., et al., “INFERNO: Inferring the Molecular Mechanisms of Noncoding Genetic Variants,” Nucleic Acids Research 46 (2018): 8740–8753. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Bycroft C., Freeman C., Petkova D., et al., “The UK Biobank Resource With Deep Phenotyping and Genomic Data,” Nature 562 (2018): 203–209. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Consortium Global Lipids Genetics , “Discovery and Refinement of Loci Associated With Lipid Levels,” Nature Genetics 45 (2013): 1274–1283. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. CARDIoGRAMplusC4D Consortium , “A Comprehensive 1000 Genomes–Based Genome‐Wide Association Meta‐Analysis of Coronary Artery Disease,” Nature Genetics 47 (2015): 1121–1130. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Sanderson E., Richardson T. G., Hemani G., and Davey S. G., “The Use of Negative Control Outcomes in Mendelian Randomization to Detect Potential Population Stratification,” International Journal of Epidemiology 50 (2021): 1350–1361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34. Angrist Joshua D., Imbens Guido W., and Rubin D. B., “Identification of Causal Effects Using Instrumental Variables,” Journal of the American Statistical Association 91 (1996): 444–455. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Data S1. Supporting Information.

SIM-44-0-s001.pdf (204.4KB, pdf)

Data Availability Statement

The data that support the findings of this study are openly available to approved users in the UK Biobank at https://www.ukbiobank.ac.uk/.

The R package for r2SLS is available on https://github.com/lei‐fang‐stat/r2SLS. The access to the UKB data and GTEx data was approved through UKB Application #35107 and dbGaP Project #26511, respectively. The hair color (blond), HDL, CAD, and asthma GWAS results can be obtained from TwoSampleMR R package with request ID of “ukb‐d‐1747_1”, “ieu‐a‐299”, “ieu‐a‐7”, “ieu‐a‐44”, respectively.


Articles from Statistics in Medicine are provided here courtesy of Wiley

RESOURCES