Author manuscript; available in PMC: 2023 Oct 6.
Published in final edited form as: J Am Stat Assoc. 2023 Mar 17;118(543):1525–1537. doi: 10.1080/01621459.2023.2183127

Causal Inference in Transcriptome-Wide Association Studies with Invalid Instruments and GWAS Summary Data

Haoran Xue a,b,*, Xiaotong Shen a, Wei Pan b
PMCID: PMC10557939  NIHMSID: NIHMS1877198  PMID: 37808547

Abstract

Transcriptome-wide association studies (TWAS) have recently emerged as a popular tool to discover (putative) causal genes by integrating an outcome GWAS dataset with another gene expression/transcriptome GWAS (called eQTL) dataset. In our motivating and target application, we’d like to identify causal genes for low-density lipoprotein cholesterol (LDL), which is crucial for developing new treatments for hyperlipidemia and cardiovascular diseases. The statistical principle underlying TWAS is (two-sample) two-stage least squares (2SLS) using multiple correlated SNPs as instrumental variables (IVs); it is closely related to typical (two-sample) Mendelian randomization (MR) using independent SNPs as IVs, which is expected to be impractical and lower-powered for TWAS (and some other) applications. However, often some of the SNPs used may not be valid IVs, e.g. due to the widespread pleiotropy of their direct effects on the outcome not mediated through the gene of interest, leading to false conclusions by TWAS (or MR). Building on recent advances in sparse regression, we propose a robust and efficient inferential method to account for both hidden confounding and some invalid IVs via two-stage constrained maximum likelihood (2ScML), an extension of 2SLS. We first develop the proposed method with individual-level data, then extend it both theoretically and computationally to GWAS summary data for the most popular two-sample TWAS design, to which almost all existing robust IV regression methods are however not applicable. We show that the proposed method achieves asymptotically valid statistical inference on causal effects, demonstrating its wider applicability and superior finite-sample performance over the standard 2SLS/TWAS (and MR). We apply the methods to identify putative causal genes for LDL by integrating large-scale lipid GWAS summary data with eQTL data.

Keywords: 2SLS, Causal inference, Genome-wide association studies, Mendelian randomization (MR), SNP, Truncated L1-constraint (TLC), Reference panel

1. Introduction

Transcriptome-wide association studies (TWAS), as implemented in PrediXcan [14] and TWAS [17], were recently proposed to boost statistical power and enhance interpretation. They were motivated by the key hypothesis that many genetic variants influence complex traits through transcriptional regulation, and they have quickly become popular, with applications to common diseases like type 2 diabetes, schizophrenia, and cancer, convincingly showing the power of integrating genome-wide association studies (GWAS) and expression quantitative trait locus (eQTL) data to gain biological insights. Specifically, TWAS implicate (putative) causal genes of a GWAS trait, overcoming a severe limitation of GWAS: the lack of biological insights from GWAS discoveries of trait-associated genetic variants. Statistically, TWAS apply the standard (two-sample) two-stage least squares (2SLS) in the framework of instrumental variable (IV) regression for causal inference. IV regression is a general and powerful tool for estimating and drawing inference about the causal effect of an exposure on an outcome in the presence of unmeasured confounding. A valid IV must satisfy three assumptions:

  (A) Relevance: it is associated with the exposure;

  (B) Exchangeability: it is not associated with unmeasured confounders;

  (C) Exclusion restriction: it is not associated with the outcome conditional on the exposure.

Given valid IVs, 2SLS makes a correct inference about the causal effect; yet it may break down and give erroneous results in the presence of invalid IVs. Assumption (A) ensures the inclusion of relevant IVs, which is more straightforward and typically handled by using a stringent significance cut-off. In contrast, testing assumptions (B) or (C) is more challenging; between (B) and (C), the former is even more difficult (due to the hidden confounding), while the existing literature (especially concerning MR) is more focused on (C). As to be discussed, the proposed method can deal with the violation of all three assumptions. Kang et al. [21] proposed a Lasso-type method called sisVIVE for estimating the causal effect with some invalid IVs but did not address the problem of inference, which, instead of only point estimation, is essential for TWAS and is the focus here. Lin et al. [25] proposed a two-stage regularization method to select optimal instruments and jointly estimate the effects of multiple exposures on the outcome, but did not permit invalid IVs in stage 2 and did not consider the problem of inference either. Windmeijer et al. [45] proposed a two-step method, including the use of adaptive Lasso in the second step. Because of the median estimator used in the first step, their method requires the "Majority Condition", that is, more than 50% of the instruments are valid. When the "Majority Condition" fails but a weaker "Plurality Condition" holds, Two-Stage Hard Thresholding (TSHT) by Guo et al. [16] works by selecting and using valid IVs. Windmeijer et al. [46] proposed a method of combining confidence intervals (CIs) of each SNP/IV-based causal parameter estimate as a competitor to TSHT. However, all three aforementioned methods can only deal with the one-sample case, in which the data used for the two-stage models are collected from the same sample of individuals.
In contrast, the two-sample case, where the exposure and the outcome data for the two-stage model come from two independent samples, is far more flexible and thus more popular in genetics, and has dominated recent genetic applications in TWAS and MR as to be discussed; due to its importance in genetics, the two-sample case is the focus of this paper.
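As a concrete illustration (our own sketch, not the paper's code), the standard two-sample 2SLS described above can be written in a few lines: stage 1 is fit in the eQTL sample, the exposure is imputed in the GWAS sample, and the imputed exposure is regressed on the outcome. All names and the toy data-generating process below are our own assumptions; here all IVs are valid.

```python
import numpy as np

def two_sample_2sls(Z1, D1, Z2, Y2):
    """Standard two-sample 2SLS: fit stage 1 on (Z1, D1), impute the
    exposure in the second sample, then regress Y2 on the imputed exposure."""
    gamma_hat, *_ = np.linalg.lstsq(Z1, D1, rcond=None)  # stage 1: D ~ Z
    D2_hat = Z2 @ gamma_hat                              # imputed exposure
    return float((D2_hat @ Y2) / (D2_hat @ D2_hat))      # stage 2 slope

# Toy data: all IVs valid, confounder U affects both D and Y in sample 2.
rng = np.random.default_rng(0)
p, n1, n2, beta0 = 10, 2000, 50000, 0.5
gamma0 = rng.normal(size=p)
Z1, Z2 = rng.normal(size=(n1, p)), rng.normal(size=(n2, p))
U1, U2 = rng.normal(size=n1), rng.normal(size=n2)
D1 = Z1 @ gamma0 + U1 + rng.normal(size=n1)
D2 = Z2 @ gamma0 + U2 + rng.normal(size=n2)
Y2 = beta0 * D2 + U2 + rng.normal(size=n2)
beta_hat = two_sample_2sls(Z1, D1, Z2, Y2)
print(beta_hat)  # close to beta0 = 0.5, since all IVs are valid here
```

With even one invalid IV (a nonzero direct effect of some Z on Y), this naive estimator would generally be biased, which is the failure mode the paper addresses.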

We propose a Two-Stage Constrained Maximum Likelihood (2ScML) method to infer causal effects in the same instrumental-variable regression framework as 2SLS. First, we tackle the problem in a more general setting than many other methods: in particular, we allow the presence of invalid IVs violating any of the three IV assumptions. Compared to some existing methods with two different initial and final estimators, we propose a unified constrained regression approach to identify valid IVs and draw inference simultaneously. Towards this end, we propose a non-convex truncated Lasso constraint (TLC) to account for invalid IVs. Second, in contrast to TSHT, which uses only valid IVs satisfying all three assumptions, our method is more efficient by including all IVs satisfying assumption (A) but possibly violating assumptions (B) and/or (C) in stages 1 and 2 respectively. Third, and most importantly, our method applies to the two-sample case with only GWAS summary data (and a reference panel of genotypic data) in stage 2, where individual-level data from large-scale GWAS are often unavailable, as is most often the case in TWAS, whereas the aforementioned methods are not applicable; such two-sample data have dominated recent genetic studies, and we develop our method for them as in our real data examples. We propose using BIC for consistent model selection, applicable to either GWAS individual-level data or summary data. In almost all current genetic applications with GWAS summary data, including TWAS, a naive estimate of the covariance matrix, ignoring the difference between the GWAS genotypic data and the reference panel, is simply used for inference. We point out that it would under-estimate the variance and thus lead to inflated Type-I errors, especially with a small reference panel size of a few hundred commonly used in practice.
A motivating example in Section 2.3.3, in the simple context of ordinary least squares (OLS) regression with GWAS summary data and a reference panel, illustrates the severity of the problem and thus the necessity of a correction to achieve valid inference. Hence, we propose a corrected variance estimator for GWAS summary data in stage 2, which is shown to perform much better than the naive estimator. We are not aware of any other existing method with all the above features of our proposed method, which are necessary for robust applications to TWAS with the anticipated presence of some invalid IVs.

The proposed method was motivated by and is particularly suitable for applications to TWAS to identify causal genes or other molecular/imaging/clinical endophenotypes by integrating GWAS with other eQTL/xQTL data [14, 17, 54, 56, 48, 49, 38, 9, 19]. In these applications, multiple correlated SNPs (so-called cis-SNPs) near a gene are used as IVs to impute or predict the gene's expression level (or another endophenotype) to infer whether the gene's expression (or another risk factor) has a causal effect on a trait, say low-density lipoprotein cholesterol (LDL). However, due to strong modeling assumptions on valid IVs that may be violated frequently in practice, caution must be taken about the conclusions from standard TWAS. For example, it is known that TWAS tends to identify multiple genes per locus, most of which are likely false positives due to confounding caused by linkage disequilibrium (LD) among nearby SNPs [26, 43, 47]. In particular, due to confounding through LD between an eQTL (i.e., an SNP causal to a gene's expression) and a true causal SNP for a GWAS trait, a target gene identified by TWAS (or MR) may be only marginally associated with, but not causal to, the GWAS trait, just as a significant tagging SNP in GWAS may not be causal. Furthermore, due to widespread (horizontal) pleiotropy [42], some SNPs used in TWAS may not be valid IVs, again leading to violations of a critical assumption in TWAS/2SLS [2, 7]. As an alternative to TWAS, another class of popular IV analysis using (often independent) SNPs as IVs is (two-sample) Mendelian randomization (MR) [10, 11, 12]. In these applications, because eQTL studies (i.e., stage 1 in 2SLS) often have small sample sizes, applying a single-SNP/IV-based method as in MR, as implemented in SMR and GSMR for the same purpose [54, 56], would be low-powered; instead, it is more powerful and thus more desirable to apply a method using multiple SNPs to predict the gene's expression level (or another exposure/trait in stage 1). In our and many other TWAS applications with typically much smaller sample sizes for eQTL data, if MR is applied with the usual genome-wide significance threshold to select SNPs as IVs, none or few SNPs are expected to be selected for most genes; even if this significance threshold is greatly relaxed, due to strong correlations (i.e., LD) among the candidate SNPs in the cis-region of any gene, often no more than one independent SNP would be selected, rendering all robust MR methods inapplicable because they all require multiple independent IVs. For these reasons, in this paper we will focus on TWAS, not MR, though we will briefly compare in a simulation with many new and popular MR methods as reviewed in [36, 50], showing much higher statistical power of our new method over many robust MR methods. Note that we do not consider other IV regression methods inapplicable to GWAS summary data (e.g., [40]).

2. Methods

2.1. Model

We denote an exposure as D, an outcome of interest as Y, p IVs (such as SNPs) as Z ∈ ℝ^p, and the true covariance matrix of Z as Σ ∈ ℝ^{p×p}. In the following, for a subset G ⊆ S = {1, 2, ⋯, p} and a vector V ∈ ℝ^p, V_G is the corresponding sub-vector of V. Corresponding to the true causal model in Figure 1, the stage 1 and stage 2 models for the exposure and the outcome are

D = Z^T γ^0 + ξ,   Y = β^0 D + Z^T α^0 + ϵ. (1)

Fig. 1. The true causal model for (1). Directed edges represent direct effects; elements of both γ_A^0 and α_B^0 are non-zero; depending on whether β^0 ≠ 0 or not, D has or does not have a causal effect on Y.

Here ξ and ϵ are error terms independent of the instruments Z, with E(ξ) = E(ϵ) = 0, Var(ξ) = σ_1^2, Var(ϵ) = σ_2^2, and Cov(ξ, ϵ) = σ_12. In general ξ and ϵ are correlated with σ_12 ≠ 0, which accounts for unobserved confounders. β^0 is the parameter of interest, representing the causal effect of D on Y; γ^0 ∈ ℝ^p contains the true effects of the IVs on the exposure, and for some A ⊆ S, γ_j^0 ≠ 0 if and only if j ∈ A; α^0 ∈ ℝ^p contains the direct effects of the IVs on Y, and for some B ⊆ S, α_j^0 ≠ 0 if and only if j ∈ B. Note that, if B is not empty, the model explicitly accounts for the violation of IV assumptions (B) and/or (C), the main problem to be addressed here.

Subsequently, we assume that the above two-stage linear models in (1) always hold. We also note that, after centering all variables at the sample mean 0, no intercepts are needed in the two models in (1). Our primary aim is to infer the causal effect β^0. Note that in general D and ϵ are not independent because σ_12 ≠ 0; for this reason OLS gives a biased estimate of β^0 (both in finite samples and asymptotically), and 2SLS, in the general framework of instrumental-variable regression, has been proposed for (asymptotically) unbiased inference.
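The OLS bias under hidden confounding is easy to see numerically. Below is our own toy simulation (not from the paper) of model (1) under the null β^0 = 0 with SNP-like instruments: regressing Y directly on D yields a clearly nonzero slope, because D and the error share the confounder U (i.e., σ_12 ≠ 0).

```python
import numpy as np

# Toy check that OLS of Y on D is biased under hidden confounding,
# even when the true causal effect beta0 is 0.
rng = np.random.default_rng(1)
n, p, beta0 = 100000, 5, 0.0
Z = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # SNP-like IVs (0/1/2)
gamma0 = np.full(p, 0.3)
U = rng.normal(size=n)                               # unmeasured confounder
D = Z @ gamma0 + U + rng.normal(size=n)
Y = beta0 * D + U + rng.normal(size=n)
Dc, Yc = D - D.mean(), Y - Y.mean()                  # center, no intercept
beta_ols = float((Dc @ Yc) / (Dc @ Dc))
print(beta_ols)  # clearly above 0: Cov(D, Y) = Var(U) > 0
```

In this design Cov(D, Y) = Var(U) = 1 while β^0 = 0, so the OLS slope converges to Var(U)/Var(D) > 0 rather than to β^0.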

The following Plurality Condition, as stated in [16], is necessary and sufficient for parameter identifiability in model (1).

Assumption 1. (Plurality Condition) Assume that |A ∩ B^c| > max_{c ≠ 0} |{j ∈ A : α_j^0 / γ_j^0 = c}|.

Here A ∩ B^c is the set of valid IVs satisfying all three IV assumptions. When Σ is invertible, model (1) is identifiable if and only if Assumption 1 holds; see Theorem 1 in [16] for a proof.

In most TWAS applications, we have a two-sample design with an eQTL dataset for gene expression as the exposure and a GWAS dataset for the outcome, coming from two independent samples. Typically tens of SNPs are used as the IVs; the size of the first sample ranges from a few hundred to a few thousand, while that of the second sample is in the tens to hundreds of thousands. Based on these facts, we focus on the two-sample case with a fixed p.

2.2. Estimation and Inference with Individual-Level Data

We first assume the availability of individual-level data for both samples, then generalize to the setting with only summary data for the second sample. Suppose we have two independent samples of sizes n_1 and n_2, each with iid observations, 𝒟_1 = {(D_{1,i}, Z_{1,i}) : i = 1, …, n_1} and 𝒟_2 = {(Y_{2,i}, Z_{2,i}) : i = 1, …, n_2} for the two stages respectively. We use their vector and matrix forms D_1 ∈ ℝ^{n_1} (with ith element D_{1,i}) and Z_1 ∈ ℝ^{n_1×p} (with ith row Z_{1,i}) for the first sample, and Y_2 ∈ ℝ^{n_2} (with ith element Y_{2,i}) and Z_2 ∈ ℝ^{n_2×p} (with ith row Z_{2,i}) for the second. For any set G ⊆ S, we use Z_G to denote the corresponding columns of the matrix Z.

2.2.1. The Oracle Estimator

Assuming that an oracle tells us the set A of relevant IVs in stage 1 and the set B of IVs having direct effects on Y in stage 2, we define the (ideal but impractical) two-stage oracle-2SLS estimator as

Stage 1: γ̂_A^or = argmin_{γ_A} ||D_1 − Z_{1,A} γ_A||^2,  D̂_2 = Z_{2,A} γ̂_A^or;
Stage 2: (β̂^or, α̂_B^or) = argmin_{β, α_B} ||Y_2 − β D̂_2 − Z_{2,B} α_B||^2. (2)

Here we explicitly show the stage 1 oracle-2SLS. In some applications D_1 is available for the stage 1 analysis, while in others it is not, but some estimate γ̂_A of γ_A^0 is provided by a third party, e.g., the TWAS Fusion website [17]. In the following Proposition 1, we assume a consistent estimator γ̂_A of γ_A^0 such that √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞ for some constant matrix Θ, obtained either from the stage 1 analysis with D_1 or from the third party. In fact, since stage 1 is an OLS, rewriting D_1 = Z_{1,A} γ_A^0 + ξ_1 with ξ_1 the vector of random errors, we have

γ̂_A^or = (Z_{1,A}^T Z_{1,A})^{-1} Z_{1,A}^T D_1 = γ_A^0 + (Z_{1,A}^T Z_{1,A})^{-1} Z_{1,A}^T ξ_1,
√n_1 (γ̂_A^or − γ_A^0) = √n_1 (Z_{1,A}^T Z_{1,A})^{-1} Z_{1,A}^T ξ_1 →_d N(0, σ_1^2 E(Z_A Z_A^T)^{-1}), (3)

and Θ = σ_1^2 E(Z_A Z_A^T)^{-1}. In the following, we use γ̂_A and γ̂_A^or interchangeably, and expand γ̂_A^or to γ̂^or and α̂_B^or to α̂^or by setting γ̂_{A^c}^or = 0 and α̂_{B^c}^or = 0, respectively. Denote σ_t^2 = Var(β^0 ξ + ϵ) = σ_2^2 + 2σ_12 β^0 + (β^0)^2 σ_1^2, and for subsets I, J ⊆ S, let Σ_{IJ} be the sub-matrix of Σ corresponding to rows in I and columns in J. We define the following matrices:

Ψ = ( (γ_A^0)^T Σ_{AA} γ_A^0    (γ_A^0)^T Σ_{AB}
      Σ_{BA} γ_A^0              Σ_{BB} ),
Φ = ( (γ_A^0)^T Σ_{AA} Θ Σ_{AA} γ_A^0    (γ_A^0)^T Σ_{AA} Θ Σ_{AB}
      Σ_{BA} Θ Σ_{AA} γ_A^0              Σ_{BA} Θ Σ_{AB} ).

Although the two-sample 2SLS estimator has been studied previously [30, 20, 23], invalid IVs having direct effects on the outcome were often not considered. For completeness, we give Proposition 1 to establish some properties of the oracle estimator in the presence of invalid IVs.

Proposition 1. When Σ is invertible and |A ∩ B^c| > 0, assuming √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞ and n_2/n_1 → w for some positive and finite constant w, the probability of the oracle estimator β̂^or defined in (2) being unique converges to 1 as n_1, n_2 → ∞, and β̂^or is a consistent estimator of the true causal effect β^0 with β̂^or →_p β^0 as n_1, n_2 → ∞. Furthermore, we have √n_2 (β̂^or − β^0) →_d N(0, v) with v = [σ_t^2 Ψ^{-1} + w (β^0)^2 Ψ^{-1} Φ Ψ^{-1}]_{11}.

In practice, we plug the parameter estimates into v to obtain a variance estimate v̂. With β̂^or and v̂, we can make inference on β^0; this method is denoted as "oracle-2SLS-Ind".
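The stage-2 regression in (2), given the oracle sets A and B, is just an OLS of Y_2 on the imputed exposure and the invalid IVs. A minimal sketch in our own notation (the data-generating process and the sets A, B below are illustrative assumptions):

```python
import numpy as np

def oracle_2sls(Z1, D1, Z2, Y2, A, B):
    """Oracle two-sample 2SLS (cf. eq. (2)): stage 1 uses only the
    relevant IVs in A; stage 2 adjusts for the direct effects of the
    (known) invalid IVs in B."""
    gA, *_ = np.linalg.lstsq(Z1[:, A], D1, rcond=None)  # stage 1 OLS
    D2_hat = Z2[:, A] @ gA                              # imputed exposure
    X = np.column_stack([D2_hat, Z2[:, B]])             # (D_hat, Z_B) design
    coef, *_ = np.linalg.lstsq(X, Y2, rcond=None)
    return float(coef[0])                               # beta_hat^or

rng = np.random.default_rng(2)
p, n1, n2, beta0 = 10, 2000, 50000, 0.3
A, B = [0, 1, 2, 3], [3, 4]            # IV 3 is relevant AND invalid
gamma0 = np.zeros(p); gamma0[A] = 1.0
alpha0 = np.zeros(p); alpha0[B] = 0.5  # direct effects on Y
Z1, Z2 = rng.normal(size=(n1, p)), rng.normal(size=(n2, p))
D1 = Z1 @ gamma0 + rng.normal(size=n1)
U2 = rng.normal(size=n2)
D2 = Z2 @ gamma0 + U2 + rng.normal(size=n2)
Y2 = beta0 * D2 + Z2 @ alpha0 + U2 + rng.normal(size=n2)
beta_or = oracle_2sls(Z1, D1, Z2, Y2, A, B)
print(beta_or)  # close to beta0 = 0.3
```

Dropping Z2[:, B] from the stage-2 design here would bias the estimate, illustrating why invalid IVs must be modeled.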

2.2.2. New Method: Two-stage Constrained Maximum Likelihood

The proposed method consists of two stages as an extension of 2SLS. In stage 1, with 𝒟_1 we solve a constrained maximum likelihood problem to select relevant IVs satisfying Assumption (A), similar to [35] for general linear regression:

γ̂_{K_1} = argmin_γ ||D_1 − Z_1 γ||^2  subject to  (1/τ_1) Σ_{j=1}^p min(|γ_j|, τ_1) ≤ K_1, (4)

where min(|γ_j|, τ_1)/τ_1 [34] is the truncated L1-function for γ_j, a continuous surrogate of the L0 function I(γ_j ≠ 0), with I(·) the indicator function. Denote Â_{K_1} = {j ∈ S : γ̂_{j,K_1} ≠ 0} as the estimate of the set A. The tuning parameter K_1 is an integer and can be interpreted as the number of non-zero components of γ^0; the constrained problem (4) performs a best-subset-like (but computationally much more efficient) search to select K_1 relevant IVs. In practice, as required by some technical conditions shown in the Supplementary, we set τ_1 to a small fixed value like 1×10^{-5} to ensure an adequate TLC approximation to the L0-constraint, and use BIC to estimate the optimal K_1. If we assume ξ follows a normal distribution, after ignoring constant terms the log-likelihood for stage 1 is l_1(γ̂_{K_1}) = −[n_1 log(σ_1^2) + ||D_1 − Z_1 γ̂_{K_1}||^2 / σ_1^2]/2. As σ_1^2 is unknown, we plug in its estimate σ̂_1^2 = ||D_1 − Z_1 γ̂_{K_1}||^2 / n_1 to derive the BIC for stage 1:

BIC_1(K_1) = n_1 log(||D_1 − Z_1 γ̂_{K_1}||^2 / n_1) + log(n_1) ||γ̂_{K_1}||_0. (5)

With a candidate set 𝒦_1 for K_1, the optimal K_1 is obtained as K̂_1 = argmin_{K_1 ∈ 𝒦_1} BIC_1(K_1), and the estimate of γ^0 is γ̂ := γ̂_{K̂_1}. Note that the normality assumption on ξ is used only to derive BIC_1 in (5) and is not required for the results shown next. As shown by Proposition 2 in the Supplementary, when some mild assumptions are satisfied and |A| ∈ 𝒦_1, BIC consistently selects the true tuning parameter with P(K̂_1 = |A|) → 1, and Â_{K̂_1} is a consistent estimator of the true set A; thus γ̂ has the oracle property with P(γ̂ = γ̂^or) → 1. As for the oracle-2SLS in Proposition 1, we assume √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ).
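Since for small p the TLC constraint in (4) behaves like a best-subset search of size K_1, the BIC selection in (5) can be sketched with an exhaustive best-subset stand-in for the TLC solver (our own simplification, feasible only for small p; the true method uses the DC algorithm of Section 2.2.3):

```python
import numpy as np
from itertools import combinations

def stage1_bic(Z1, D1, K_grid):
    """Select K1 (number of relevant IVs) by BIC as in eq. (5); the
    TLC-constrained fit is replaced here by exhaustive best-subset
    search, which TLC is designed to mimic."""
    n1, p = Z1.shape
    scores = {}
    for K in K_grid:
        rss_best, supp_best = np.inf, ()
        for Ssub in combinations(range(p), K):
            g, *_ = np.linalg.lstsq(Z1[:, Ssub], D1, rcond=None)
            rss = float(np.sum((D1 - Z1[:, Ssub] @ g) ** 2))
            if rss < rss_best:
                rss_best, supp_best = rss, Ssub
        scores[K] = (n1 * np.log(rss_best / n1) + np.log(n1) * K, supp_best)
    K_hat = min(scores, key=lambda K: scores[K][0])
    return K_hat, set(scores[K_hat][1])

rng = np.random.default_rng(3)
n1, p = 1000, 8
gamma0 = np.zeros(p); gamma0[:3] = 1.0   # true A = {0, 1, 2}
Z1 = rng.normal(size=(n1, p))
D1 = Z1 @ gamma0 + rng.normal(size=n1)
K_hat, A_hat = stage1_bic(Z1, D1, range(1, 6))
print(K_hat, sorted(A_hat))  # BIC should recover the 3 relevant IVs
```

The log(n_1)·K penalty is what makes the selection consistent; an AIC-type penalty (2K) would tend to over-select.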

Given γ̂, with 𝒟_2 we obtain the predicted exposure D̂_2 = Z_2 γ̂. Then, in stage 2, we solve the constrained minimization:

(β̂_{K_2}, α̂_{K_2}) = argmin_{β,α} ||Y_2 − β D̂_2 − Z_2 α||^2  subject to  (1/τ_2) Σ_{j=1}^p min(|α_j|, τ_2) ≤ K_2. (6)

Denote B̂_{K_2} = {j ∈ S : α̂_{j,K_2} ≠ 0} as the estimate of the set B. Again, in practice we set τ_2 to a small fixed value like 1×10^{-5} and use BIC to estimate the optimal integer K_2, the number of invalid IVs. Here we model the direct effects of the IVs explicitly and use the non-convex constraint to select, and thus account for, invalid IVs that violate IV Assumptions (B) and (C). Similar to (5), the BIC for stage 2 is

BIC_2(K_2) = n_2 log(||Y_2 − β̂_{K_2} D̂_2 − Z_2 α̂_{K_2}||^2 / n_2) + log(n_2) ||α̂_{K_2}||_0. (7)

With a candidate set 𝒦_2 for K_2, the optimal K_2 is obtained as K̂_2 = argmin_{K_2 ∈ 𝒦_2} BIC_2(K_2), and the final estimate of (β^0, α^0) is (β̂, α̂) := (β̂_{K̂_2}, α̂_{K̂_2}). Note that if the error terms (ξ, ϵ) in model (1) follow a bivariate normal distribution, then in each stage the objective function is both the squared-error loss and the negative log-likelihood as used in 2SLS, though a truncated L1 constraint (TLC) is imposed to select relevant IVs and invalid IVs in the two stages respectively. We refer to our method as constrained maximum likelihood in anticipation of its extensions to other parametric models.

Next, we establish that our proposed 2ScML estimator has the oracle property, then use the asymptotic distribution of the oracle estimator to draw inference. The following Assumption 2 states that τ_2 should be sufficiently small for TLC to approximate the L0-constraint well.

Assumption 2. Assume 0 < τ_2 ≤ 1/(n_2 p c_max(Z_2^T Z_2)), where c_max(·) denotes the largest eigenvalue of a matrix.

Theorem 1 shows that the 2ScML estimator possesses the oracle property when Assumptions 1 and 2 are satisfied. Note that the error terms ξ and ϵ are not required to be normal.

Theorem 1. Assume that Σ is invertible, √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞, and Assumptions 1 and 2 hold. Then as n_1, n_2 → ∞, BIC consistently selects the tuning parameter with P(K̂_2 = |B|) → 1, B̂_{K̂_2} is a consistent estimator of the true set B with P(B̂_{K̂_2} = B) → 1, and we have P((β̂, α̂) = (β̂^or, α̂^or)) → 1.

Plugging the parameter estimates (including the estimates of sets A and B) into the oracle variance v in Proposition 1, we obtain an estimated variance for β̂ and thus make inference about β^0; this method is denoted as "2ScML-Ind" (indicating its dependence on individual-level data).
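To make the stage-2 selection in (6)-(7) concrete, here is our own small-scale sketch that, for small p, stands in for the TLC solver with an exhaustive best-subset search over the invalid-IV support B, scored by the BIC of (7); the stage 1 coefficients are treated as known for simplicity (in practice they come from the stage 1 fit):

```python
import numpy as np
from itertools import combinations

def stage2_bic(D2_hat, Z2, Y2, K_grid):
    """Select the invalid-IV set B by BIC (cf. eq. (7)); exhaustive
    best-subset search stands in for the TLC solver (small p only)."""
    n2, p = Z2.shape
    best_bic, best = np.inf, None
    for K in K_grid:
        for B in combinations(range(p), K):
            X = np.column_stack([D2_hat, Z2[:, list(B)]])
            coef, *_ = np.linalg.lstsq(X, Y2, rcond=None)
            rss = float(np.sum((Y2 - X @ coef) ** 2))
            bic = n2 * np.log(rss / n2) + np.log(n2) * K
            if bic < best_bic:
                best_bic, best = bic, (float(coef[0]), set(B))
    return best  # (beta_hat, estimated B)

# IVs 0-3 relevant; IV 3 (relevant) and IV 4 (irrelevant) are invalid,
# so B = {3, 4}; plurality holds via the valid IVs {0, 1, 2}.
rng = np.random.default_rng(2)
p, n2, beta0 = 6, 20000, 0.2
gamma0 = np.array([1., 1., 1., 1., 0., 0.])
alpha0 = np.array([0., 0., 0., 0.5, 0.3, 0.])
Z2 = rng.normal(size=(n2, p))
U = rng.normal(size=n2)
D2 = Z2 @ gamma0 + U + rng.normal(size=n2)
Y2 = beta0 * D2 + Z2 @ alpha0 + U + rng.normal(size=n2)
D2_hat = Z2 @ gamma0          # stage 1 treated as known here
beta_hat2, B_hat = stage2_bic(D2_hat, Z2, Y2, range(0, 4))
print(beta_hat2, B_hat)       # B_hat should contain {3, 4}
```

Note that including the empty set (K_2 = 0) in the grid lets the BIC fall back to naive 2SLS when no IV appears invalid.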

2.2.3. Computation

To solve the nonconvex constrained minimization (4), we use a difference-of-convex (DC) method that iteratively approximates the nonconvex constraint with a sequence of convex constraints. First, we decompose the constraint function into a difference of two convex functions: Σ_{j=1}^p min(|γ_j|, τ_1)/τ_1 = (Σ_{j=1}^p |γ_j| − Σ_{j=1}^p max(|γ_j| − τ_1, 0))/τ_1. Given an estimate γ̂_{j,K_1}^{(m)} at iteration m, we note max(|γ_j| − τ_1, 0) ≥ max(|γ̂_{j,K_1}^{(m)}| − τ_1, 0) + (|γ_j| − |γ̂_{j,K_1}^{(m)}|) I(|γ̂_{j,K_1}^{(m)}| > τ_1). Thus we have

(1/τ_1) Σ_{j=1}^p min(|γ_j|, τ_1) ≤ (1/τ_1) Σ_{j=1}^p [ |γ_j| − max(|γ̂_{j,K_1}^{(m)}| − τ_1, 0) − (|γ_j| − |γ̂_{j,K_1}^{(m)}|) I(|γ̂_{j,K_1}^{(m)}| > τ_1) ] = (1/τ_1) Σ_{j=1}^p [ |γ_j| I(|γ̂_{j,K_1}^{(m)}| ≤ τ_1) + τ_1 I(|γ̂_{j,K_1}^{(m)}| > τ_1) ]. (8)

We then relax (4) as a convex constrained minimization problem:

γ̂_{K_1}^{(m+1)} = argmin_γ ||D_1 − Z_1 γ||^2  subject to  (1/τ_1) Σ_{j=1}^p |γ_j| I(|γ̂_{j,K_1}^{(m)}| ≤ τ_1) ≤ K_1 − Σ_{j=1}^p I(|γ̂_{j,K_1}^{(m)}| > τ_1). (9)

Problem (9) is equivalent to a constrained Lasso problem, which can be solved by the algorithm in [28]. Similarly, we iteratively relax nonconvex minimization (6) as

(β̂_{K_2}^{(m+1)}, α̂_{K_2}^{(m+1)}) = argmin_{β,α} ||Y_2 − β D̂_2 − Z_2 α||^2  subject to  (1/τ_2) Σ_{j=1}^p |α_j| I(|α̂_{j,K_2}^{(m)}| ≤ τ_2) ≤ K_2 − Σ_{j=1}^p I(|α̂_{j,K_2}^{(m)}| > τ_2). (10)

Again, the algorithm of [28] is applied to solve the constrained Lasso problem (10). This iterative process continues until a termination criterion is met.

We initialize the DC algorithm with the constrained Lasso estimates: γ̂_{K_1}^{(0)} = argmin_γ ||D_1 − Z_1 γ||^2 subject to Σ_{j=1}^p |γ_j| / τ_1 ≤ K_1, and (β̂_{K_2}^{(0)}, α̂_{K_2}^{(0)}) = argmin_{β,α} ||Y_2 − β D̂_2 − Z_2 α||^2 subject to Σ_{j=1}^p |α_j| / τ_2 ≤ K_2. The DC algorithm is not guaranteed to reach a global minimum of the non-convex problems (4) and (6) (and it is difficult to check whether a solution is global), though it performs well in practice (as shown in our simulations).
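The bookkeeping behind one DC step in (8)-(10) is simple: coefficients whose current magnitude exceeds τ are "released" (they get L1 weight 0), and each of them consumes one unit of the budget K on the right-hand side. A minimal sketch of this update (our own helper, not the constrained Lasso solver itself):

```python
import numpy as np

def dc_relaxation(alpha_m, tau, K):
    """One DC update of the TLC constraint (cf. (8)-(10)): coefficients
    with |alpha_j| > tau get L1 weight 0 but each uses one unit of the
    budget K; the remaining coefficients keep weight 1/tau."""
    big = np.abs(alpha_m) > tau
    weights = np.where(big, 0.0, 1.0 / tau)   # weights on |alpha_j| in (10)
    budget = K - int(big.sum())               # right-hand side in (10)
    return weights, budget

# With K = 2 and two coefficients already above tau, the remaining
# coefficient faces a zero budget, i.e., it is forced to 0 in the next
# constrained-Lasso step.
w, b = dc_relaxation(np.array([0.8, 0.0, 3e-5]), tau=1e-5, K=2)
print(w, b)
```

This is why the iteration converges quickly in practice: once K coefficients escape above τ, the constraint pins all other coefficients to exactly zero.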

2.3. Extension to GWAS Summary Data

For most TWAS applications we have either individual-level data 𝒟_1 or a consistent estimate of γ_A^0 from a third party for the stage 1 analysis; thus we assume that an estimate γ̂_A is available with √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ). For stage 2, based on a GWAS of trait Y, we have an estimated marginal effect size of each Z on Y, β̂_{YZ}, along with its standard error se(β̂_{YZ}). Due to logistical and privacy issues, individual-level genotypes (i.e., Z's) and phenotypes (i.e., Y) are typically not publicly available; only summary data in the form of the β̂_{YZ}'s and se(β̂_{YZ})'s are available for all SNPs/Z's, from which we can calculate the sample correlations between Y and the Z's, the r_{YZ}'s. From a reference panel consisting of a group of n_0 individuals, such as from the 1000 Genomes Project [1] or UK Biobank [39], we obtain individual-level genotype data for the p SNPs as Z_0 ∈ ℝ^{n_0×p}, with rows corresponding to individuals. We next extend the oracle-2SLS and the proposed 2ScML in stage 2 to the situation with only GWAS summary statistics and a reference panel, assuming that the two original samples and the reference panel are independent and from the same population. This extension allows our method to be applied to published large-scale GWAS summary data covering a wide range of traits.

Without loss of generality, we assume D_1, Y_2, and the columns of Z_1, Z_2, Z_0 are all standardized to have sample mean 0 and sample variance 1, so that, for example, for the jth IV Z_j we have Z_{2,j}^T Y_2 / n_2 = r_{YZ_j}, j = 1, …, p. For a positive integer k, we use I_k to denote the k × k identity matrix.

2.3.1. The Oracle Estimator

If we had individual-level data Y_2 and Z_2 in the second sample, we could obtain the oracle estimator (β̂^or, α̂_B^or) from stage 2 in equation (2), which is an OLS and has the closed-form solution

(β̂^or, α̂_B^or)^T = [(D̂_2, Z_{2,B})^T (D̂_2, Z_{2,B})]^{-1} (D̂_2, Z_{2,B})^T Y_2
= ( γ̂_A^T (Z_{2,A}^T Z_{2,A}/n_2) γ̂_A    γ̂_A^T (Z_{2,A}^T Z_{2,B}/n_2)
    (Z_{2,B}^T Z_{2,A}/n_2) γ̂_A          Z_{2,B}^T Z_{2,B}/n_2 )^{-1} ( γ̂_A^T (Z_{2,A}^T Y_2/n_2)
                                                                        Z_{2,B}^T Y_2/n_2 ). (11)

From the GWAS summary statistics we can obtain Z_2^T Y_2 / n_2, but not Z_2^T Z_2 / n_2. As usual, replacing Z_2^T Z_2 / n_2 with Z_0^T Z_0 / n_0 in (11), we obtain an estimate of (β^0, α_B^0) as

(β̃^or, α̃_B^or)^T = ( γ̂_A^T (Z_{0,A}^T Z_{0,A}/n_0) γ̂_A    γ̂_A^T (Z_{0,A}^T Z_{0,B}/n_0)
    (Z_{0,B}^T Z_{0,A}/n_0) γ̂_A          Z_{0,B}^T Z_{0,B}/n_0 )^{-1} ( γ̂_A^T (Z_{2,A}^T Y_2/n_2)
                                                                        Z_{2,B}^T Y_2/n_2 ). (12)

We expand α̃_B^or to α̃^or by adding the component α̃_{B^c}^or = 0. For finite n_0 and n_2, we expect Z_2^T Z_2 / n_2 ≠ Z_0^T Z_0 / n_0, leading to (β̃^or, α̃_B^or) ≠ (β̂^or, α̂_B^or). Intuitively, the difference between Z_2^T Z_2 / n_2 and Z_0^T Z_0 / n_0 introduces extra variation into the estimate β̃^or as compared to β̂^or, and this additional variation is not captured by the variance v in Proposition 1 based on individual-level data (without approximation errors). This would result in inflated Type-I errors when β̃^or and the variance v in Proposition 1 are used for inference, as supported by our later simulation studies. To account for and quantify the effects of using a reference panel, as stated in Assumption 3, we impose the additional assumption that these two matrices follow Wishart distributions. Although this assumption does not hold exactly for SNP data, a Wishart distribution is widely adopted for a covariance matrix (e.g., as a prior in Bayesian statistics); here we use it as a finite-sample approximation to an asymptotically normal sample covariance matrix [24, 29], and, as shown in our simulations, it works well for SNP data. Denote by W(Σ, n) the Wishart distribution with scale matrix Σ and n degrees of freedom.
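Computationally, (12) only needs the GWAS summary vector Z_2^T Y_2 / n_2 and the reference-panel LD matrix Z_0^T Z_0 / n_0. A minimal sketch in the spirit of (12) (our own code; for simplicity the stage 1 coefficients are treated as known and the standardization step is skipped):

```python
import numpy as np

def oracle_beta_summary(gamma_full, R0, rYZ, B):
    """Summary-data oracle estimator in the spirit of eq. (12): the
    unknown stage-2 LD matrix Z2'Z2/n2 is replaced by the reference-panel
    version R0 = Z0'Z0/n0; rYZ = Z2'Y2/n2 comes from GWAS summary data."""
    p = len(gamma_full)
    W = np.column_stack([gamma_full, np.eye(p)[:, B]])  # (gamma, I_B) design
    coef = np.linalg.solve(W.T @ R0 @ W, W.T @ rYZ)
    return float(coef[0])                               # beta_tilde^or

rng = np.random.default_rng(4)
p, n2, n0, beta0 = 8, 100000, 5000, 0.4
gamma0 = np.array([1., 1., 1., 0., 0., 0., 0., 0.])
alpha0 = np.array([0., 0., 0., 0.5, 0., 0., 0., 0.])   # IV 3 is invalid
Z2, Z0 = rng.normal(size=(n2, p)), rng.normal(size=(n0, p))
U = rng.normal(size=n2)
D2 = Z2 @ gamma0 + U + rng.normal(size=n2)
Y2 = beta0 * D2 + Z2 @ alpha0 + U + rng.normal(size=n2)
rYZ = Z2.T @ Y2 / n2          # "GWAS summary statistics"
R0 = Z0.T @ Z0 / n0           # reference-panel LD matrix
beta_tilde = oracle_beta_summary(gamma0, R0, rYZ, B=[3])
print(beta_tilde)  # close to beta0 = 0.4
```

The point estimate is fine; the subtlety addressed by Theorem 2 is that its variance must account for R0 being only an estimate of the stage-2 LD matrix.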

Assumption 3. Assume that Z_0^T Z_0 ~ W(Σ, n_0) and Z_2^T Z_2 ~ W(Σ, n_2).

Now we introduce Theorem 2, which gives the asymptotic distribution of β̃^or.

Theorem 2. Assume that Σ is invertible and |A ∩ B^c| > 0; √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞; n_2/n_1 → w and n_2/n_0 → u for some positive and finite constants w and u; and Assumption 3 holds. Then the probability of the oracle estimator β̃^or defined in (12) being unique converges to 1 as n_1, n_2, n_0 → ∞, and β̃^or is a consistent estimator of the true causal effect β^0 with β̃^or →_p β^0 as n_1, n_2, n_0 → ∞. Furthermore, we have √n_2 (β̃^or − β^0) →_d N(0, v_c) with v_c = v + (u + 1)·Tr(ΨB)·[Ψ^{-1}]_{11}, where v and Ψ are given in Proposition 1, and B = (β^0, (α_B^0)^T)^T (β^0, (α_B^0)^T).

Here Tr(·) denotes the trace of a matrix. The variance v_c for GWAS summary data corrects the original individual-level variance v by accounting for the approximation error from the reference sample. Plugging the parameter estimates into v_c, we obtain a corrected variance estimate ṽ_c. With β̃^or and ṽ_c we make inference on β^0; this method is denoted as "oracle-2SLS-Sum-C" (with "C" for the corrected variance).

2.3.2. New Method: Two-Stage Constrained Maximum Likelihood

In stage 2 of 2ScML in equation (6), the objective function is ||Y_2 − β D̂_2 − Z_2 α||^2 = ||Y_2 − Z_2 (γ̂, I_p)(β, α^T)^T||^2. Denote Λ := (γ̂, I_p)^T (Z_0^T Z_0 / n_0)(γ̂, I_p) ∈ ℝ^{(p+1)×(p+1)}, which is singular even if Z_0 has full rank p. For computational simplicity, we add a small constant δ, such as 1×10^{-5}, to the diagonal elements of Λ to obtain Λ* = Λ + δ I_{p+1}. Recall Y_2^T Y_2 / n_2 = 1; after some simplification, we have

(β̃_{K_2}, α̃_{K_2}) = argmin_{β,α} ||(Λ*)^{-1/2} (γ̂, I_p)^T (Z_2^T Y_2 / n_2) − (Λ*)^{1/2} (β, α^T)^T||^2  subject to  (1/τ_2) Σ_{j=1}^p min(|α_j|, τ_2) ≤ K_2, (13)

which can be solved iteratively through a sequence of constrained Lasso problems as in Section 2.2.3. Denote B̃_{K_2} = {j ∈ S : α̃_{j,K_2} ≠ 0} as the estimate of the set B. Note that Y_2^T Z_2 / n_2 is the vector of sample correlations between Y and the Z's, available from the GWAS summary data. The BIC is

BIC_2(K_2) = n_2 log{1 − 2(Y_2^T Z_2 / n_2)(γ̂, I_p)(β̃_{K_2}, α̃_{K_2}^T)^T + (β̃_{K_2}, α̃_{K_2}^T) Λ* (β̃_{K_2}, α̃_{K_2}^T)^T} + log(n_2) ||α̃_{K_2}||_0. (14)

With a candidate set 𝒦_2 for K_2, the optimal K_2 is obtained as K̃_2 = argmin_{K_2 ∈ 𝒦_2} BIC_2(K_2), and the estimate of (β^0, α^0) is (β̃, α̃) := (β̃_{K̃_2}, α̃_{K̃_2}). Theorem 3 states the oracle property of (β̃, α̃).

Theorem 3. Assume that Σ is invertible, √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞, and Assumptions 1 and 2 hold. Then as n_1, n_2, n_0 → ∞, BIC consistently selects the tuning parameter with P(K̃_2 = |B|) → 1, B̃_{K̃_2} is a consistent estimator of the true set B with P(B̃_{K̃_2} = B) → 1, and we have P((β̃, α̃) = (β̃^or, α̃^or)) → 1.

Plugging the parameter estimates (including those for the sets A and B) into the corrected oracle variance v_c in Theorem 2, we obtain a corrected variance estimate for β̃ and thus make inference about β^0; this method is denoted as "2ScML-Sum-C".

2.3.3. A Motivating Example: OLS with Summary Data

This example of OLS regression illustrates some key differences between using individual-level data and using summary data with a reference panel. We have p predictors X ∈ ℝ^p and a response variable Y from the true model Y = X^T β + ϵ, where ϵ ~ N(0, σ^2) is the random error independent of X. We have a sample of size n, denoted by X ∈ ℝ^{n×p} and Y ∈ ℝ^n, from the true model, and an independent reference panel of size n_0, denoted by X_0 ∈ ℝ^{n_0×p}. We scale the columns of X and X_0, and the vector Y, to have sample mean 0 and sample variance 1. With individual-level data we obtain the OLS estimate β̂ = (X^T X)^{-1} X^T Y and its estimated covariance matrix Ĉov(β̂) = σ̂^2 (X^T X)^{-1} with σ̂^2 = ||Y − X β̂||^2 / n. With summary data X^T Y / n (the marginal association estimates), by replacing X^T X / n with X_0^T X_0 / n_0 in β̂, we obtain

β̃ = (X_0^T X_0 / n_0)^{-1} (X^T Y / n),  Cov(β̃ | X, X_0) = σ^2 (X_0^T X_0 / n_0)^{-1} (X^T X / n^2) (X_0^T X_0 / n_0)^{-1}.

Since X^T X / n is unknown, we again approximate it by X_0^T X_0 / n_0, obtaining the usual uncorrected covariance matrix estimate C̃ov(β̃) = σ̃^2 (X_0^T X_0 / n_0)^{-1} / n with σ̃^2 = 1 − 2(Y^T X / n) β̃ + β̃^T (X_0^T X_0 / n_0) β̃ (since Y^T Y / n = 1 after scaling). Assuming X^T X ~ W(Σ, n) and X_0^T X_0 ~ W(Σ, n_0), with B = β β^T ∈ ℝ^{p×p}, as shown in the Supplementary we derive the (marginal) covariance matrix as

Cov(β̃) = [(n + n_0)/(n n_0)] Tr(ΣB) Σ^{-1} + (σ^2 / n) Σ^{-1}. (15)

Plugging in the estimates σ̃^2, Σ̂ = X_0^T X_0 / n_0, and B̂ = β̃ β̃^T, we obtain a corrected estimate Ĉov_c(β̃). Note that the second term of the corrected estimate Ĉov_c(β̃) is exactly the uncorrected estimate C̃ov(β̃).

We can make inference about β with β̂ and Ĉov(β̂), denoted by "OLS-Ind"; with β̃ and the uncorrected C̃ov(β̃), denoted by "OLS-Sum"; or with β̃ and the corrected Ĉov_c(β̃), denoted by "OLS-Sum-C".

We conducted simulations to compare the methods. We set p = 5, Σ = I_5, β = (1, 1, 1, 1, 1)^T, and σ^2 = 5, and simulated X ~ N(0, Σ), i.e., the five predictors X_1 to X_5 were iid standard normal. We used n = 500 and tried n_0 = 100, 500, 1000, or 10000; for each setup we ran 1000 replications. As the five predictors were exchangeable, we show the simulation results for β_1 in Table 1; all estimates and standard errors in the table were scaled back to the original scale. All methods were nearly unbiased, but OLS-Sum had inflated Type-I errors, with the inflation decreasing as n_0 increased. In contrast, OLS-Ind and OLS-Sum-C controlled the Type-I error well, while the OLS-Sum-C estimate had a much larger variance than that of OLS-Ind, clearly indicating the cost of using summary data and a reference panel: an inflated variance of the estimate.
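The variance inflation of OLS-Sum is easy to reproduce. Below is our own Monte Carlo sketch (simplified relative to the paper's setup: Σ = I, no standardization step, fewer replications) comparing the sampling SD of the summary-data estimator β̃ against the individual-level SE √(σ²/n):

```python
import numpy as np

# Monte Carlo sketch: with summary data plus a reference panel, the
# sampling SD of the OLS-Sum estimator clearly exceeds the
# individual-level SE, so the naive variance understates uncertainty.
rng = np.random.default_rng(5)
p, n, n0, sigma2 = 5, 500, 500, 5.0
beta = np.ones(p)
ests = []
for _ in range(300):
    X = rng.normal(size=(n, p))
    Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    X0 = rng.normal(size=(n0, p))                        # reference panel
    beta_t = np.linalg.solve(X0.T @ X0 / n0, X.T @ Y / n)  # OLS-Sum
    ests.append(beta_t[0])
sd_sum = float(np.std(ests))
se_ind = float(np.sqrt(sigma2 / n))    # individual-level SE (Sigma = I)
print(sd_sum, se_ind)                  # the first is markedly larger
```

Under these simplified assumptions, (15) predicts a sampling variance of (n+n_0)/(n n_0)·Tr(ΣB) + σ²/n = 0.02 + 0.01 = 0.03 at n = n_0 = 500, i.e., an SD of about 0.17 versus the naive 0.10, consistent with the Table 1 pattern.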

Table 1.

Estimating β_1 and testing H_0: β_1 = 1 versus H_1: β_1 ≠ 1 at the significance level 0.05 based on 1000 simulations for the motivating example. In each column, from top to bottom we show the mean of the estimates, the mean of the standard errors, the standard deviation of the estimates, and the empirical Type-I error.

Method        OLS-Ind   OLS-Sum                              OLS-Sum-C
n0                      100     500     1000    10000        100     500     1000    10000
Mean(Est)     1.0003    1.0474  1.0021  0.9981  1.0031       1.0474  1.0021  0.9981  1.0031
Mean(SE)      0.0998    0.0988  0.0993  0.0996  0.0992       0.2749  0.1744  0.1586  0.1432
SD(Est)       0.0993    0.2540  0.1596  0.1442  0.1292       0.2540  0.1596  0.1442  0.1292
Type-I Error  0.048     0.460   0.229   0.171   0.142        0.035   0.032   0.030   0.033
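The corrected covariance formula above can be checked numerically. Below is a minimal sketch (not the paper's code) that reproduces the motivating simulation in simplified form: it assumes Σ = I and plugs the true σ² into Eq. (15) instead of the estimate σ̃².

```python
import numpy as np

# Sketch of the motivating simulation: OLS from summary data X'Y/n with a
# reference-panel covariance estimate, checked against Eq. (15).
# Simplifications vs. the paper: Sigma = I, true sigma^2 used, Y not rescaled.
rng = np.random.default_rng(0)
p, n, n0, sigma2 = 5, 500, 1000, 5.0
beta = np.ones(p)

def beta_tilde_once():
    X = rng.standard_normal((n, p))                    # full-data design
    Y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
    X0 = rng.standard_normal((n0, p))                  # independent reference panel
    Sigma0 = X0.T @ X0 / n0                            # reference estimate of Sigma
    return np.linalg.solve(Sigma0, X.T @ Y / n)        # summary-data OLS estimate

reps = np.array([beta_tilde_once() for _ in range(2000)])
emp_var = reps[:, 0].var()                             # Monte Carlo variance of beta_tilde_1

Sigma, B = np.eye(p), np.outer(beta, beta)
# Eq. (15): Cov(beta_tilde) = (n+n0)/(n*n0) * Tr(Sigma B) * Sigma^{-1} + sigma2/n * Sigma^{-1}
corrected = (n + n0) / (n * n0) * np.trace(Sigma @ B) + sigma2 / n   # its (1,1) entry
uncorrected = sigma2 / n                                             # second term only
```

With these settings the corrected variance tracks the Monte Carlo variance of the summary-data estimate far more closely than the uncorrected term σ²/n alone, matching the Table 1 pattern.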

3. Simulations

3.1. Simulation 1: TWAS

We compared 2ScML with naive-2SLS/TWAS and oracle-2SLS through simulations; to be realistic, the setup mimicked real TWAS applications with real SNP data and two independent samples of different sizes as in model (1). We extracted the genotypic data of 408339 individuals in UK Biobank [39] for p = 56 correlated SNPs from gene MAFB on chromosome 20 as the population data for our simulation. The minor allele frequencies (MAFs) of the 56 SNPs ranged from 0.05 to 0.45, and their correlation matrix is shown in the Supplementary. The genotype data in the two samples and the reference panel were independently drawn with replacement from the population. The error terms (ϵ, ξ) were generated from a bivariate normal distribution with means 0, variances $\sigma_1^2 = \sigma_2^2 = \sigma^2 = 1$ or 2, and correlation 0.5. We set $\gamma_{i0} = 1$ for 2 ≤ i ≤ 8 and $\gamma_{i0} = 0$ otherwise; i.e., the 2nd to 8th IVs were relevant with an equal effect size of 1. We set $\alpha_{i0} = 1$ for i = 1, 7, 8, 9 and $\alpha_{i0} = 0$ otherwise; i.e., the relevant 7th and 8th IVs were invalid, and the irrelevant 1st and 9th IVs were also invalid. When σ² was 1 or 2, the true R² in stage 1 was 0.303 or 0.179 respectively. When β0 = 0, there was no causal effect of the exposure on the outcome, i.e., it was a null case.

In each simulation we generated two independent samples from model (1) of sizes n1 = 500, 1000, or 2000 for stage 1 and n2 = 50000 or 100000 for stage 2, and generated a reference panel of size n0 = 500, 10000, 50000, or 100000 when n2 = 50000, and n0 = 500, 10000, 100000, or 200000 when n2 = 100000. We then applied the different methods to the simulated data to test H0: β0 = 0 versus H1: β0 ≠ 0. For all methods, in stage 1 we used individual-level data and the 2nd to 8th IVs to obtain $\hat\gamma$. In stage 2, for 2ScML, we chose the best K2 from 0 to 10 and set τ2 = 1 × 10−5; for naive-2SLS, we fitted a linear regression of Y on $\hat D$; for oracle-2SLS, we fitted a linear regression of Y on $\hat D$ and the 4 truly invalid IVs with direct effects. In stage 2 we could use either individual-level data, denoted by "-Ind", or summary data (and a reference panel) with the uncorrected or corrected variance estimator, denoted by "-Sum" or "-Sum-C" respectively.
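The contrast between naive-2SLS and oracle-2SLS can be sketched in a few lines. The code below is a simplified illustration, not the paper's setup: it uses independent SNPs rather than the correlated MAFB SNPs, omits the 2ScML selection step, and keeps the same pattern of relevant and invalid IVs.

```python
import numpy as np

# Minimal two-sample 2SLS sketch with invalid IVs (simplified: independent SNPs).
rng = np.random.default_rng(1)
p, n1, n2, beta0 = 9, 2000, 50000, 0.0
gamma = np.zeros(p); gamma[1:8] = 1.0            # IVs 2-8 relevant
alpha = np.zeros(p); alpha[[0, 6, 7, 8]] = 1.0   # IVs 1, 7, 8, 9 invalid

def draw(n):
    Z = rng.binomial(2, 0.3, size=(n, p)).astype(float)              # genotypes
    eps = rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], size=n)  # correlated errors (confounding)
    D = Z @ gamma + eps[:, 0]                    # exposure (stage 1 model)
    Y = beta0 * D + Z @ alpha + eps[:, 1]        # outcome with direct IV effects
    return Z, D, Y

Z1, D1, _ = draw(n1)                             # stage 1 sample
Z2, _, Y2 = draw(n2)                             # stage 2 sample
gamma_hat = np.linalg.lstsq(Z1, D1, rcond=None)[0]
D_hat = Z2 @ gamma_hat

ones = np.ones(n2)
naive = np.linalg.lstsq(np.column_stack([ones, D_hat]), Y2, rcond=None)[0][1]
X_or = np.column_stack([ones, D_hat, Z2[:, [0, 6, 7, 8]]])  # adjust for truly invalid IVs
oracle = np.linalg.lstsq(X_or, Y2, rcond=None)[0][1]
```

Under this null (β0 = 0), the naive estimate is biased away from zero by the direct effects of the invalid IVs, while the oracle estimate, which adjusts for the truly invalid IVs, remains approximately unbiased.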

We varied β0 from −0.1 to 0.1 with a step size of 0.02 and applied all methods over 1000 independent replicates to calculate their empirical Type-I error and power at significance level 0.05. Figure 2 compares the results of oracle-2SLS and 2ScML for σ² = 2, n1 = 500, and n2 = 50000 with different n0's; the results for n0 = 100000 were similar to those for n0 = 50000, and thus are not shown here. The complete simulation results for other setups and for naive-2SLS are in the Supplementary. From Figure 2 we can see that, with individual-level data, 2ScML-Ind performed almost identically to oracle-2SLS-Ind: both controlled the Type-I error well and had high power, confirming the oracle property of 2ScML in Theorem 1. With summary data and a reference panel, both 2ScML-Sum and oracle-2SLS-Sum had inflated Type-I errors: the former had slightly larger inflation; the inflation was substantial with a small n0 = 500 and decreased as n0 increased. With the corrected variance estimator, both 2ScML-Sum-C and oracle-2SLS-Sum-C controlled the Type-I error for all n0, and their power increased as n0 increased. Though both 2ScML-Ind and 2ScML-Sum-C controlled the Type-I error, the former had much higher power than the latter, demonstrating a significant loss of information with summary data that has largely been neglected in the literature; the same held for oracle-2SLS.

Fig. 2.

Fig. 2

Empirical Type-I error rates (for β0 = 0 in the x-axis) and power (for β0 ≠ 0) in Simulation 1 when σ2 = 2, n1 = 500, and n2 = 50000.

Table 2 shows more detailed estimation and inference results for true β0 = 0. For estimation, both oracle-2SLS and 2ScML were almost unbiased regardless of using individual-level or summary data, but the estimates with summary data had larger variation than those with individual-level data. For example, when n0 = 500, the estimates of oracle-2SLS and 2ScML with summary data had SD($\tilde\beta$) of 0.0504 and 0.0695 respectively, almost five and seven times their SD($\tilde\beta$) of 0.0101 and 0.0102 with individual-level data. This again confirmed the information loss from using summary data. As n0 increased, SD($\tilde\beta$) for both oracle-2SLS and 2ScML with summary data decreased. For testing, both oracle-2SLS-Sum-C and 2ScML-Sum-C controlled the Type-I error well, though the former was a little conservative. As naive-2SLS failed to account for invalid IVs, it always gave largely biased estimates and thus highly inflated Type-I errors.

Table 2.

Detailed results for different methods in Simulation 1 when β0 = 0, σ² = 2, n1 = 500, and n2 = 50000. In each cell from top to bottom, we show the means of the causal estimates and their standard errors, the standard deviation of the causal estimates, and the Type-I error.

Method                     Ind       Sum                           Sum-C
n0                                   500      10000    50000      500      10000    50000
oracle-2SLS  Mean(Est)     3e-04     4e-04    -1e-04   3e-04      4e-04    -1e-04   3e-04
             Mean(SE)      0.0101    0.0102   0.0102   0.0101     0.0588   0.0171   0.0129
             SD(Est)       0.0101    0.0504   0.0153   0.0122     0.0504   0.0153   0.0122
             Type-I Error  0.044     0.729    0.187    0.101      0.024    0.025    0.041
2ScML        Mean(Est)     3e-04     0.0040   3e-04    3e-04      0.0040   3e-04    3e-04
             Mean(SE)      0.0101    0.0116   0.0103   0.0102     0.0650   0.0174   0.0129
             SD(Est)       0.0102    0.0695   0.0171   0.0126     0.0695   0.0171   0.0126
             Type-I Error  0.047     0.794    0.239    0.109      0.056    0.039    0.045
naive-2SLS   Mean(Est)     0.1960    0.1959   0.1961   0.1961     0.1959   0.1961   0.1961
             Mean(SE)      0.0204    0.0193   0.0200   0.0200     0.0233   0.0202   0.0201
             SD(Est)       0.0963    0.0971   0.0965   0.0964     0.0971   0.0965   0.0964
             Type-I Error  0.988     0.991    0.988    0.988      0.991    0.988    0.988

3.2. Simulation 2: MR

Although not our main purpose, to show the versatility of our proposed method we applied it to typical two-sample MR settings and compared it with many new and popular two-sample MR methods, including MR-ContMix [8], MR-Mix [33], MR-Lasso [6], MR-cML [50], MR-PRESSO [42], MR-IVW (random-effect (RE) meta-analysis) [5], MR-Egger regression [3], the weighted median method (MR-W-Median) [4], the weighted mode method (MR-W-Mode) [18], and MR-RAPS with over-dispersion and the Tukey loss (MR-RAPS1) or without over-dispersion and with the squared error loss (MR-RAPS2) [51]. In each simulation, we generated two independent samples for the two-stage analysis, calculated the summary data with marginal linear regressions, and used the same summary data for all methods (including the proposed ones).

In summary, 2ScML-Sum-C, MR-ContMix, MR-Lasso, and MR-cML appeared to be the winners with well-controlled Type-I error and high power. More specifically, when IV Assumption (B) and thus the InSIDE assumption was satisfied, MR-PRESSO, MR-RAPS2 and 2ScML-Sum had inflated Type-I errors, MR-IVW and MR-Egger also had slightly inflated Type-I errors, while all other methods could control the Type-I error well at the nominal level of 0.05; 2ScML-Sum-C, MR-ContMix, MR-Lasso and MR-cML performed similarly with Type-I error satisfactorily controlled and higher power than all the other methods. When IV Assumption (B) and thus the InSIDE assumption was violated, MR-IVW, MR-Egger, and MR-RAPS1 had more highly inflated Type-I errors, while all other methods had similar performance to their counterparts when Assumption (B) was satisfied. See Supplementary for detailed results.
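For concreteness, here is a minimal sketch of one of the baseline methods, fixed-effect MR-IVW, computed from two-sample summary statistics. The setting is a toy one with all IVs valid; the effect sizes and standard errors are illustrative, not taken from the simulations above.

```python
import numpy as np

# Fixed-effect MR-IVW from two-sample summary statistics:
# per-SNP Wald ratios Gamma_hat_j / gamma_hat_j, inverse-variance weighted.
rng = np.random.default_rng(2)
J, beta0 = 50, 0.3
gamma = rng.uniform(0.1, 0.3, J)                    # true SNP-exposure effects
se_x, se_y = 0.01, 0.01                             # summary-statistic SEs (assumed known)
gamma_hat = gamma + rng.normal(0, se_x, J)          # exposure GWAS estimates
Gamma_hat = beta0 * gamma + rng.normal(0, se_y, J)  # outcome GWAS estimates (all IVs valid)

wald = Gamma_hat / gamma_hat                        # per-SNP causal estimates
w = gamma_hat**2 / se_y**2                          # approximate inverse variances
beta_ivw = np.sum(w * wald) / np.sum(w)             # IVW point estimate
se_ivw = 1.0 / np.sqrt(np.sum(w))                   # fixed-effect standard error
```

With valid IVs the IVW estimate recovers β0; the robust methods compared above are designed for the case where some Wald ratios are contaminated by direct (pleiotropic) effects.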

3.3. Sensitivity Analysis of IV Strengths

We further studied how our proposed methods perform depending on the strength of the IVs. The concentration parameter μ² defined in [37] quantifies IV strength; in the Supplementary we discuss the effects of μ² on the asymptotic distribution and efficiency of our proposed estimators. We varied the IV effects γ0, and thus μ², in simulations for TWAS and MR. Our results show that when the IVs were moderately strong, the distributions of the proposed estimators were well approximated by normal distributions. The proposed methods always controlled the Type-I error, and their power increased with the IV strength. Detailed results are in the Supplementary.
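As a rough illustration of the quantities involved (with made-up effect sizes, not the paper's sensitivity-analysis settings), the concentration parameter and the first-stage joint F-statistic can be computed as follows; for strong instruments, roughly E[F] ≈ μ²/p + 1.

```python
import numpy as np

# Illustration of IV strength: concentration parameter mu^2 and first-stage F.
rng = np.random.default_rng(3)
n, p, sigma1 = 500, 7, 1.0
Z = rng.standard_normal((n, p))                    # standardized instruments
gamma = np.full(p, 0.3)                            # illustrative first-stage effects
mu2 = float(gamma @ Z.T @ Z @ gamma) / sigma1**2   # concentration parameter

D = Z @ gamma + rng.normal(0, sigma1, n)           # exposure from the stage 1 model
Zc = np.column_stack([np.ones(n), Z])              # first-stage design with intercept
coef, *_ = np.linalg.lstsq(Zc, D, rcond=None)
rss = np.sum((D - Zc @ coef) ** 2)
tss = np.sum((D - D.mean()) ** 2)
F = ((tss - rss) / p) / (rss / (n - p - 1))        # joint F-test of the p IVs
```

The conventional rule of thumb F > 10 corresponds to instruments strong enough that weak-IV bias is typically negligible.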

4. Application to TWAS for LDL

With only GWAS summary data available in stage 2, we consider only the methods applicable to GWAS summary data and drop "-Sum" from each method's name for simplicity; for example, 2ScML-C denotes 2ScML-Sum-C.

4.1. Main Analysis

We applied 2ScML and the naive-2SLS to identify (putative) causal genes for LDL with GWAS summary statistics. For each gene, we used the TWAS Fusion pre-calculated coefficients γ^‘s for our stage 1 analysis [17]; the coefficients were estimated based on microarray expression data of blood from the Young Finns Study (YFS) with sample size n1 = 1264 [27]. The GWAS summary data of LDL were drawn from [41] with sample sizes up to n2 = 95454; we removed the SNPs with sample sizes less than 80000. We used software ImpG [31] to impute the LDL GWAS summary statistics with 489 unrelated individuals of European ancestry from the 1000 Genomes Project [1] as the reference panel. As stated in [31], we used the imputation accuracy measure r2 to quantify the imputation quality for each SNP and removed imputed SNPs with r2 < 0.3. With the availability of the genotype data of 408339 individuals of white British ancestry from UK Biobank, we could take a random sample of n0 = 500, 10000, 95454, or all 408339 individuals as our reference panel for stage 2 analysis in TWAS. We removed the SNPs with MAFs less than 0.05 or failing the Hardy-Weinberg equilibrium test with p-values less than 0.001.

There were 4700 genes with pre-calculated γ^ in the TWAS Fusion database. We first removed the genes with stage 1 regression p-value greater than 0.05/4700. Then for each of the genes left, we identified the set of its eSNPs with non-zero regression coefficients and also available in both the reference panel and the GWAS summary data, and removed the genes with no more than 1 eSNP. We removed 880 genes in total and analyzed the remaining 3820 genes. We calculated the first stage joint F-statistics for the 3820 genes; the mean and the range of their F-statistics were 34.73 and [3.18, 1144.51] respectively. There were 826 genes with their F-statistics less than 10, while those of the other 2994 genes were greater than 10. We show the distributions of the F-statistics as histograms in the Supplementary. For each gene, we extracted all SNPs near its eSNPs, then pruned out SNPs with pairwise absolute correlations greater than 0.6. If more than 100 SNPs were left, we only kept the top 100 SNPs with the highest absolute correlations with LDL; otherwise, we kept all of them. These SNPs were used in the stage 2 analysis. We set the candidate set of K2 for 2ScML as $\mathcal{K}_2 = \{0, 1, \ldots, p/2\}$, and set τ2 = 1 × 10−5.
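The pruning step above can be sketched as a greedy scan. This is a hypothetical implementation: the paper's exact SNP ordering and the subsequent top-100 selection by correlation with LDL are not reproduced here.

```python
import numpy as np

def prune_snps(G, thresh=0.6):
    """Greedy LD pruning: scan SNPs in order and keep a SNP only if its
    absolute correlation with every previously kept SNP is <= thresh.
    G: n x p genotype matrix; returns the indices of retained SNPs."""
    R = np.corrcoef(G, rowvar=False)
    keep = []
    for j in range(G.shape[1]):
        if all(abs(R[j, k]) <= thresh for k in keep):
            keep.append(j)
    return keep
```

Greedy pruning is order-dependent, so this is one of several reasonable variants; tools such as PLINK implement windowed versions of the same idea.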

For each of the 3820 genes, we obtained its estimated effect sizes and p-values from naive-2SLS and 2ScML for different n0’s. Since the results for n0 = 10000, 95454, and 408339 were similar, we present those for n0 = 500 and 95454 while relegating the others to the Supplementary. Figure 3 shows the quantile-quantile (Q-Q) plots of the p-values for different methods. From panels (A) and (D) we see that for both n0 = 500 and 95454, the Q-Q plot for 2ScML-C had the left tail in good agreement with the identity line, and its genomic inflation factor λ [13] was close to 1, indicating that the Type-I error was controlled satisfactorily; its heavier right tail could be due to the polygenicity of a complex trait like LDL with many genes having small effects. On the other hand, panels (B) and (C) show that, when n0 = 500 was small, compared to 2ScML-C, the other three methods could have inflated Type-I errors possibly due to the effects of invalid IVs and/or failing to account for the effects of using a small reference panel. As n0 increased to 95454, as shown by panels (E) and (F), 2ScML and naive-2SLS had similar performance to 2ScML-C and naive-2SLS-C respectively, while naive-2SLS-C seemed to still have an inflated left tail possibly due to its failure to account for invalid IVs.

Fig. 3.

Fig. 3

The Q-Q plots of p-values in the −log10 scale for different methods when n0 = 500 (top row) and 95454 (bottom row). The left column shows a Q-Q plot of p-values of 2ScML-C versus the expected p-values under the null; the genomic inflation factor λ is shown too. The middle column shows a Q-Q plot of p-values of 2ScML-C versus other methods, and the right column zooms in. The grey solid line in each panel is the identity line.

With different n0’s, at the Bonferroni corrected significance cutoff 0.05/3820, naive-2SLS, naive-2SLS-C and 2ScML, 2ScML-C identified 27 significant genes in total. We did a literature search on each of the 27 significant genes. We excluded the study generating the LDL GWAS data we used [41]. Based on the literature support from other studies, we assigned a score to each gene: if there were other studies (1) supporting this gene being associated with LDL, we assigned the highest score of 5; (2) supporting this gene associated with a trait related to LDL, we assigned a score of 4; (3) identifying one or more SNPs mapped to or nearby this gene, which were significantly associated with LDL, we assign a score of 3; (4) identifying some SNPs mapped to or nearby this gene, which were significantly associated with other traits related to LDL, we assigned it a score of 2; (5) identifying some SNPs mapped to or nearby this gene, which were significantly associated with any traits, we assigned a score of 1; (6) otherwise, we assigned the lowest score of 0. See Supplementary for a list of all the 27 genes with their supporting references.

When n0 = 500 and 95454, naive-2SLS and 2ScML-C identified 22 genes in total; Figure 4 shows the numbers of genes identified by each method and their overlaps. Table 3 lists the 22 genes with their p-values given by naive-2SLS and 2ScML-C; the non-significant ones with p > 0.05/3820 are marked with an asterisk. Of the 22 genes, 15 were also analyzed by the joint-tissue imputation (JTI) approach in [52]; their q-values are also shown, with the non-significant ones at false discovery rate (FDR) > 0.05 marked. From Figure 4 and Table 3 we can see that, when n0 increased from 500 to 95454, the number of significant genes identified by naive-2SLS decreased from 20 to 15, while that by 2ScML-C decreased from 12 to 10, again suggesting possibly more liberal results with false positives when a smaller reference panel was used without suitably accounting for its effects. Two genes, HSPA6 and DDAH2, were not significant by JTI; depending on the reference panel size, 2ScML-C also identified one or both as non-significant, while naive-2SLS identified both as significant. This supports better Type-I error control by 2ScML-C than by naive-2SLS. It is also notable that 2ScML-C did not always give less significant results than naive-2SLS: for both n0 = 500 and 95454, gene HLA-DQB1 was non-significant by naive-2SLS, while 2ScML-C agreed with JTI and identified it as significant. Another gene, CDKN2D, was not significant by naive-2SLS, but 2ScML-C (when n0 = 95454) and JTI claimed otherwise.

Fig. 4.

Fig. 4

The numbers of significant genes associated with LDL identified by the (naive-)2SLS or 2ScML using either n0 = 500 or 95454.

Table 3.

The 22 significant genes identified to be associated with LDL by 2ScML-C and/or naive-2SLS with a reference sample size of n0 = 500 or 95454. Insignificant p-values (or q-values for JTI) are each marked with an asterisk.

                     -------- n0 = 500 --------   ------- n0 = 95454 -------
Name      Chr   p    K̂2  naive-2SLS  2ScML-C     K̂2  naive-2SLS  2ScML-C     JTI        Score
DOCK7       1   39    0   6.36e-09   6.09e-08     0   6.69e-08   6.80e-08    NA         3
PSRC1       1   29    7   5.31e-57   2.33e-21     2   9.25e-57   2.34e-47    0.00e+00   5
GNAI3       1   34   11   2.42e-06   5.49e-01*   11   3.08e-06   1.05e-02*   NA         4
GSTM4       1   44    9   7.56e-07   8.60e-01*    7   1.75e-06   2.69e-01*   4.44e-02   1
HSPA6       1   45    1   2.05e-11   4.70e-09     1   1.16e-05   1.70e-05*   8.08e-01*  1
MKRN2       3   12    0   2.11e-07   9.62e-07     0   5.59e-07   5.67e-07    5.99e-07   3
MARCH6      5   33    0   1.65e-17   9.68e-14     0   2.18e-02*  2.18e-02*   NA         2
HCG27       6   42    4   1.37e-11   1.04e-04*    3   2.51e-03*  4.61e-03*   7.59e-03   2
MICA        6   34    7   8.67e-07   2.00e-04*    7   7.32e-02*  6.39e-03*   1.12e-07   3
DDAH2       6   12    1   8.66e-06   2.56e-03*    1   9.05e-06   8.38e-04*   9.49e-01*  4
HLA-DQB1    6   55    6   3.40e-04*  1.92e-09     3   4.54e-04*  1.31e-12    3.57e-03   3
TMED4       7   15    5   2.39e-08   1.94e-06     1   3.40e-08   2.84e-05*   9.55e-13   2
CLDN15      7   30    0   4.26e-14   1.30e-11    11   5.86e-02*  4.00e-02*   6.09e-03   1
NSMAF       8   28    1   1.25e-05   2.95e-01*    1   2.14e-05*  4.54e-01*   NA         3
PARP10      8    3    0   4.25e-08   2.68e-07     0   4.12e-08   4.20e-08    1.08e-06   4
GRINA       8   12    0   9.47e-11   2.65e-09     0   7.53e-11   7.83e-11    NA         3
FADS1      11   15    1   1.21e-09   2.14e-07     1   1.14e-09   2.84e-08    4.19e-31   5
OASL       12   38    1   6.63e-15   8.87e-13     1   6.20e-08   8.47e-10    NA         3
TBKBP1     17    5    0   9.96e-06   2.36e-05*    0   9.59e-06   9.67e-06    7.74e-17   3
CDKN2D     19    9    4   6.31e-02*  1.98e-05*    4   6.19e-02*  8.35e-07    1.24e-09   3
SLC44A2    19   11    5   9.27e-06   1.85e-02*    5   8.04e-06   2.44e-04*   2.90e-02   3
PVRL2      19   10    5   2.30e-11   7.23e-05*    5   1.87e-11   1.15e-04*   NA         3

Two genes, PSRC1 and FADS1, had a score of 5 based on the literature search; both were identified by all three methods. Gene PSRC1 modulates cholesterol metabolism and inflammation; over-expression of PSRC1 in mice decreased the LDL level [15]. Mice with gene FADS1 knocked out had decreased cholesterol levels [32], and the gene is in the fatty acid metabolism pathway from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [22]. Gene FADS1 is also in the silver-standard set of LDL-related genes compiled in [52], leading to validation rates of 1/15 = 6.7% and 1/10 = 10% for naive-2SLS and 2ScML-C respectively with the large reference sample of n0 = 95454, compared to the lower rates of 17/680 = 2.5% for JTI and 9/411 = 2.2% for PrediXcan (even with a larger LDL GWAS dataset) [52].

Finally, we note that it is largely infeasible to apply any robust MR method here because such methods require multiple independent SNPs as IVs for each gene: many genes have at most one or a few independent SNPs nearby, even before a stringent significance cutoff is imposed (to ensure that IV Assumption (A) holds). Specifically, for each of the 3820 genes, we pruned its eSNPs so that all pairwise absolute correlations were at most 0.01, obtaining a set of (nearly) independent SNPs. For 1328 genes, only one independent eSNP remained; for 2125 genes, there were only 2 independent eSNPs; and for the remaining 357 (or 10) genes, there were only 3 (or 4).

4.2. Secondary Analysis: Comparison with MR-JTI

As mentioned in the previous section, a recent study applied JTI to identify putative causal genes for LDL [52]. A distinct feature of JTI is to build a stage 1 regression model by borrowing information from eQTL data of multiple tissues, which was shown to improve the performance. When applied to the GTEx multi-tissue eQTL data (with the liver data of n1 = 208 as the primary/target data) and UK Biobank (quantile-transformed) LDL GWAS summary data of n2 = 343621, JTI identified 680 LDL-associated genes at FDR < 0.05. While the JTI method does not account for invalid IVs, its MR version called MR-JTI does. When applied to the 680 genes with the same data, at the Bonferroni adjusted significance cutoff 0.05/680, MR-JTI identified 138 significant genes, and 6 of them, genes SORT1, TNKS, LPA, FADS3, PLTP, LPIN3, were in the silver-standard set of the LDL-related genes based on the KEGG cholesterol metabolism pathway and literature search.

For comparison with MR-JTI, we applied 2ScML and naive-2SLS to these 680 genes with the same data and the same significance cutoff, directly using the fitted models (based on the GTEx multi-tissue data) for stage 1. Since we had access to the UK Biobank genotypic data of 408339 individuals of white British ancestry, after excluding the 333462 individuals identified to be included in the UKB LDL GWAS summary data, we could use a subset of n0 = 500, 10000, or all of the remaining 74877 individuals as the reference panel. As shown in the Supplementary, with the large reference panel of n0 = 74877, 2ScML and naive-2SLS (regardless of whether the variance was corrected) identified 55 and 73 putative causal genes respectively, 5 and 6 of which were in the silver-standard set: genes SORT1, FADS1, LIPC, TNKS, and LPA for both, and gene FADS3 for the latter only. The validation rates by the silver-standard set for 2ScML and naive-2SLS were 5/55 = 9.1% and 6/73 = 8.2% respectively, much higher than 6/138 = 4.3% for MR-JTI. We also note the small sample size of the GTEx data used as the reference panel in the stage 2 analysis by JTI and MR-JTI. In addition, we applied MR-JTI to the previous simulations and found inflated Type-I errors, as shown in the Supplementary. Finally, as n0 increased, for 2ScML-C, both the number of significant genes and the number of validated ones increased, suggesting higher power with a larger reference sample (as shown in the simulations).

5. Conclusions and Discussion

We have proposed a Two-Stage Constrained Maximum Likelihood (2ScML) method as an extension to 2SLS to draw inference on causal effects in the presence of invalid instruments. Our modeling assumptions are less stringent than many existing methods, allowing correlated IVs, among which some may not be valid IVs with any or all of the three IV assumptions being violated. This is in contrast to the naive/standard 2SLS/TWAS, and many robust MR methods such as the popular MR-Egger regression. Theoretical and numerical results confirm that 2ScML has superior performance over the standard 2SLS/TWAS, and many new and robust MR methods, including MR-Egger regression. Perhaps most importantly, our method overcomes some practical limitations of many existing robust IV methods, including some recent and strong competitors such as TSHT [16], an adaptive Lasso-based method [45] and a confidence interval-combining method [46], which do not apply to two-sample GWAS summary data that are most widely available as for our motivating TWAS for LDL, though these methods may be extended in the future. Like some other methods based on model selection, including TSHT [16] and the adaptive Lasso-based one [45], 2ScML shares the same limitation of making inference after model selection and its valid inference depends on the selection consistency.

We have pointed out that using individual-level data and using summary data with a reference panel give different estimates of a parameter (e.g. the causal effect of an exposure on an outcome), and that there is a loss of information in the latter, especially with a small reference panel as often used in practice; this point has largely been unknown or neglected in the literature. Importantly, failing to account for such differences, as in almost all current genetic studies including TWAS, would often lead to inflated Type-I errors. As shown in Theorems 2 and 3, we have developed a corrected variance estimator for valid inference with GWAS summary data and a reference panel in stage 2. In Section 2.3.3 we have also offered a corrected variance estimator for GWAS itself, providing a better alternative to the dominant current practice, which fails to account for the finite-sample error of approximating GWAS summary data with a reference panel. Recently Wang et al. [44] proposed several methods using two-sample summary data that are robust to weak IVs. As indicated by Theorems 1 and 2 therein, their methods require either the exact sample correlation matrix or the true population correlation matrix of the IVs, neither of which is typically available in practice. In contrast, our proposed 2ScML uses an estimated correlation matrix from a reference panel and thus is more widely applicable. We have showcased the application of 2ScML to discover putative causal genes for LDL with large-scale GWAS summary data, leading to some encouraging results. More applications to other data, with comparisons with other methods, warrant future investigation.

An R package implementing 2ScML with some example data, code, and tutorial is publicly available at https://github.com/xue-hr/TScML.

Supplementary Material

Supplement

Acknowledgments

We thank the reviewers and the editors for many helpful and insightful comments and suggestions. The research was supported by NIH grants R01 AG065636, R01 AG069895, RF1 AG067924, U01 AG073079, R01 AG074858, R01 HL116720 and R01 GM126002, and by the Minnesota Supercomputing Institute at the University of Minnesota.

Footnotes

Conflict of Interest

The authors report there are no competing interests to declare.

Supplementary Materials

In a Supplementary File, we provide the proofs of the theorems and more numerical results.

References

  • [1].1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Barfield R, Feng H, Gusev A, Wu L, Zheng W, Pasaniuc B, Kraft P. (2018). Transcriptome-wide association studies accounting for colocalization using Egger regression. Genetic Epidemiology, 42(5), 418–433. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Bowden J, Davey Smith G, Burgess S. (2015). Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol, 44(2), 512–525. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Bowden J, Davey Smith G, Haycock PC, and Burgess S (2016). Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genetic Epidemiology, 40, 304–314. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Burgess S, Butterworth AS, and Thompson SG (2013). Mendelian randomization analysis with multiple genetic variants using summarized data. Genetic Epidemiology, 37, 658–665. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Burgess S, Bowden J, Dudbridge F, and Thompson SG (2016). Robust instrumental variable methods using multiple candidate instruments with application to Mendelian randomization. arXiv 1606.03279. [Google Scholar]
  • [7].Burgess S, et al. (2017). Sensitivity analysis for robust causal inference from Mendelian randomization analysis with multiple genetic variants. Epidemiology, 28, 30–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Burgess S, Foley CN, Allara E, Staley JR, and Howson JM (2020). A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nature Communications, 11(1), 376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Cai M, Chen L, Liu J, Yang C (2019). Quantifying the impact of genetically regulated expression on complex traits and diseases. bioRxiv. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Davey Smith G, Ebrahim S. (2003). ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology, 32, 1–22. [DOI] [PubMed] [Google Scholar]
  • [11].Davey Smith G, Ebrahim S. (2004). Mendelian randomization: prospects, potentials, and limitations. International Journal of Epidemiology, 33, 30–42. [DOI] [PubMed] [Google Scholar]
  • [12].Davey Smith G, Hemani G. (2014). Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Human Molecular Genetics, 23, R89–R98. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Devlin B, & Roeder K (1999). Genomic control for association studies. Biometrics, 55(4), 997–1004. [DOI] [PubMed] [Google Scholar]
  • [14].Gamazon ER et al. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics, 47, 1091–1098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Guo K, Hu L, Xi D, Zhao J, Liu J, Luo T, … & Guo Z (2018). PSRC1 overexpression attenuates atherosclerosis progression in apoE−/− mice by modulating cholesterol transportation and inflammation. Journal of Molecular and Cellular Cardiology, 116, 69–80. [DOI] [PubMed] [Google Scholar]
  • [16].Guo Z, Kang H, Tony Cai T, Small DS (2018). Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal Statistical Society: Series B, 80(4), 793–815. [Google Scholar]
  • [17].Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BW, …, Pasaniuc B (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature Genetics, 48(3), 245–252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Hartwig FP, Davey Smith G, and Bowden J (2017). Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. International Journal of Epidemiology, 46, 1985–1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [19].Hu Y, Li M, Lu Q, Weng H, Wang J, Zekavat SM, Yu Z, Li B, Gu J, Muchnik S et al. (2019). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature Genetics, 51, 568–576. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Inoue A, & Solon G (2010). Two-sample instrumental variables estimators. The Review of Economics and Statistics, 92(3), 557–561. [Google Scholar]
  • [21].Kang H, Zhang A, Cai TT, Small DS (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American Statistical Association, 111(513), 132–144. [Google Scholar]
  • [22].Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, & Tanabe M (2021). KEGG: integrating viruses and cellular organisms. Nucleic Acids Research, 49(D1), D545–D551. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Klevmarken A (1982). Missing variables and two-stage least-squares estimation from more than one data set (No. 62). IUI Working Paper. [Google Scholar]
  • [24].Kollo T, & von Rosen D (1995). Approximating by the Wishart distribution. Annals of the Institute of Statistical Mathematics, 47(4), 767–783. [Google Scholar]
  • [25].Lin W, Feng R, Li H (2015). Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics. Journal of the American Statistical Association, 110(509), 270–288. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Mancuso N, Freund MK, Johnson R, Shi H, Kichaev G, Gusev A, and Pasaniuc B (2019). Probabilistic fine-mapping of transcriptome-wide association studies. Nat Genet, 51, 675–682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Nuotio J, Oikonen M, Magnussen CG, Jokinen E, Laitinen T, Hutri-Kahonen N, …, Jula A (2014). Cardiovascular risk factors in 2011 and secular trends since 2007: the Cardiovascular Risk in Young Finns Study. Scandinavian Journal of Public Health, 42(7), 563–571. [DOI] [PubMed] [Google Scholar]
  • [28].Osborne MR, Presnell B, Turlach BA (2000). On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2), 319–337. [Google Scholar]
  • [29].Ouimet F (2022). A symmetric matrix-variate normal local approximation for the Wishart distribution and some applications. Journal of Multivariate Analysis, 189, 104923. [Google Scholar]
  • [30].Pacini D, & Windmeijer F (2016). Robust inference for the Two-Sample 2SLS estimator. Economics Letters, 146, 50–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [31].Pasaniuc B, Zaitlen N, Shi H, Bhatia G, Gusev A, Pickrell J, …, Price AL (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics, 30(20), 2906–2914. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Powell DR, Gay JP, Smith M, Wilganowski N, Harris A, Holland A, … & Desai U (2016). Fatty acid desaturase 1 knockout mice are lean with improved glycemic control and decreased development of atheromatous plaque. Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy, 9, 185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Qi G, and Chatterjee N (2019). Mendelian randomization analysis using mixture models for robust and efficient estimation of causal effects. Nature Communications, 10, 1941. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Shen X, Pan W, and Zhu Y (2012). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association. 107, 223–232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [35].Shen X, Pan W, Zhu Y, Zhou H (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807–832. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [36].Slob EA, Burgess S (2020). A comparison of robust Mendelian randomization methods using summary data. Genetic Epidemiology, 44(4), 313–329. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Stock JH, Wright JH, & Yogo M (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics, 20(4), 518–529. [Google Scholar]
  • [38].Su YR, Di C, Bien S, Huang L, Dong X, Abecasis G, et al. (2018). A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomics. American Journal of Human Genetics, 102(5), 904–919. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [39].Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, … & Collins R (2015). UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3), e1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [40].Tchetgen Tchetgen EJ, Sun B, and Walter S (2017). The GENIUS approach to robust Mendelian randomization inference. arXiv:1709.07779. [Google Scholar]
  • [41].Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, Koseki M, … & Johansen CT (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307), 707–713. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Verbanck M, Chen C-Y, Neale B, Do R (2018). Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nature Genetics, 50, 693–698. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [43].Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, …Kundaje A (2019). Opportunities and challenges for transcriptome-wide association studies. Nature Genetics, 51(4), 592–599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [44].Wang S, & Kang H (2021). Weak-instrument robust tests in two-sample summary-data Mendelian randomization. Biometrics. [DOI] [PubMed] [Google Scholar]
  • [45].Windmeijer F, Farbmacher H, Davies N, Davey Smith G (2019). On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association, 114(527), 1339–1350. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].Windmeijer F, Liang X, Hartwig FP, Bowden J (2019). The Confidence Interval Method for Selecting Valid Instrumental Variables. Discussion Paper 19/715, Department of Economics, University of Bristol. [Google Scholar]
  • [47].Wu C, Pan W (2020). A powerful fine-mapping method for transcriptome-wide association studies. Human Genetics, 139, 199–213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Xu Z, Wu C, Wei P, Pan W. (2017a). A Powerful Framework for Integrating eQTL and GWAS Summary Data. Genetics, 207, 893–902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [49].Xu Z, Wu C, Pan W; Alzheimer’s Disease Neuroimaging Initiative. (2017b). Imaging-wide association study: Integrating imaging endophenotypes in GWAS. Neuroimage, 159, 159–169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [50].Xue H, Shen X, & Pan W (2021). Constrained maximum likelihood-based Mendelian randomization robust to both correlated and uncorrelated pleiotropic effects. American Journal of Human Genetics, 108(7), 1251–1269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Zhao Q, Wang J, Hemani G, Bowden J, Small DS. (2020). Statistical inference in two-sample summary-data Mendelian randomization using a robust adjusted profile score. Annals of Statistics, 48, 1742–1769. [Google Scholar]
  • [52].Zhou D, Jiang Y, Zhong X, Cox NJ, Liu C, & Gamazon ER (2020). A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis. Nature Genetics, 52(11), 1239–1246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [53].Zhu X, & Stephens M (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Annals of Applied Statistics, 11(3), 1561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [54].Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. (2016). Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nature Genetics, 48(5), 481–7. [DOI] [PubMed] [Google Scholar]
  • [55].Zhu Z, Zheng Z, Zhang F, Wu Y, Trzaskowski M, Maier R, … & Yang J (2018). Causal associations between risk factors and common diseases inferred from GWAS summary data. Nature Communications, 9, 224. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplement