Author manuscript; available in PMC: 2023 Oct 6.
Published in final edited form as: J Am Stat Assoc. 2023 Mar 17;118(543):1525–1537. doi: 10.1080/01621459.2023.2183127

Causal Inference in Transcriptome-Wide Association Studies with Invalid Instruments and GWAS Summary Data

Haoran Xue a,b,*, Xiaotong Shen a, Wei Pan b
PMCID: PMC10557939  NIHMSID: NIHMS1877198  PMID: 37808547

Abstract

Transcriptome-wide association studies (TWAS) have recently emerged as a popular tool to discover (putative) causal genes by integrating an outcome GWAS dataset with another gene expression/transcriptome GWAS (called eQTL) dataset. In our motivating and target application, we’d like to identify causal genes for low-density lipoprotein cholesterol (LDL), which is crucial for developing new treatments for hyperlipidemia and cardiovascular diseases. The statistical principle underlying TWAS is (two-sample) two-stage least squares (2SLS) using multiple correlated SNPs as instrumental variables (IVs); it is closely related to typical (two-sample) Mendelian randomization (MR) using independent SNPs as IVs, which is expected to be impractical and lower-powered for TWAS (and some other) applications. However, often some of the SNPs used may not be valid IVs, e.g. due to the widespread pleiotropy of their direct effects on the outcome not mediated through the gene of interest, leading to false conclusions by TWAS (or MR). Building on recent advances in sparse regression, we propose a robust and efficient inferential method to account for both hidden confounding and some invalid IVs via two-stage constrained maximum likelihood (2ScML), an extension of 2SLS. We first develop the proposed method with individual-level data, then extend it both theoretically and computationally to GWAS summary data for the most popular two-sample TWAS design, to which almost all existing robust IV regression methods are however not applicable. We show that the proposed method achieves asymptotically valid statistical inference on causal effects, demonstrating its wider applicability and superior finite-sample performance over the standard 2SLS/TWAS (and MR). We apply the methods to identify putative causal genes for LDL by integrating large-scale lipid GWAS summary data with eQTL data.

Keywords: 2SLS, Causal inference, Genome-wide association studies, Mendelian randomization (MR), SNP, Truncated L1-constraint (TLC), Reference panel

1. Introduction

Transcriptome-wide association studies (TWAS), as implemented in PrediXcan [14] and TWAS [17], were recently proposed to boost statistical power and enhance interpretation. They were motivated by the key hypothesis that many genetic variants influence complex traits through transcriptional regulation, and they have quickly become popular, with applications to common diseases like type 2 diabetes, schizophrenia, and cancer, convincingly showing the power of integrating genome-wide association studies (GWAS) and expression quantitative trait locus (eQTL) data to gain biological insights. Specifically, TWAS implicate (putative) causal genes of a GWAS trait, overcoming a severe limitation of GWAS: the lack of biological insights from GWAS discoveries of trait-associated genetic variants. Statistically, TWAS apply the standard (two-sample) two-stage least squares (2SLS) in the framework of instrumental variable (IV) regression for causal inference. IV regression is a general and powerful tool for estimating and drawing inference about the causal effect of an exposure on an outcome in the presence of unmeasured confounding. A valid IV must satisfy three assumptions:

  (A) Relevance: it is associated with the exposure;

  (B) Exchangeability: it is not associated with unmeasured confounders;

  (C) Exclusion restriction: it is not associated with the outcome conditional on the exposure.

Given valid IVs, 2SLS makes a correct inference about the causal effect; yet it may break down and give erroneous results in the presence of invalid IVs. Assumption (A) ensures the inclusion of relevant IVs, which is more straightforward and typically handled by using a stringent significance cut-off. In contrast, testing assumptions (B) or (C) is more challenging; between (B) and (C), the former is even more difficult (due to the hidden confounding), while the existing literature (especially concerning MR) is more focused on (C). As to be discussed, the proposed method can deal with the violation of all three assumptions. Kang et al. [21] proposed a Lasso-type method called sisVIVE for estimating the causal effect with some invalid IVs but did not address the problem of inference, which, instead of only point estimation, is essential for TWAS and is the focus here. Lin et al. [25] proposed a two-stage regularization method to select optimal instruments and jointly estimate the effects of multiple exposures on the outcome, but did not permit invalid IVs in stage 2 and did not consider the problem of inference either. Windmeijer et al. [45] proposed a two-step method, including the use of adaptive Lasso in the second step. Because of the median estimator used in the first step, their method requires the "Majority Condition", that is, more than 50% of the instruments are valid. When the "Majority Condition" fails but a weaker "Plurality Condition" holds, Two-Stage Hard Thresholding (TSHT) by Guo et al. [16] works by selecting and using valid IVs. Windmeijer et al. [46] proposed a method of combining confidence intervals (CIs) of each SNP/IV-based causal parameter estimate as a competitor to TSHT. However, all three aforementioned methods can only deal with the one-sample case, in which the data used for the two-stage models are collected from the same sample of individuals.
In contrast, the two-sample case, where the exposure and the outcome data for the two-stage model come from two independent samples, is far more flexible and thus more popular in genetics, and has dominated recent genetic applications in TWAS and MR as to be discussed; due to its importance in genetics, the two-sample case is the focus of this paper.
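As a concrete illustration (our own sketch, not the paper's code), the standard two-sample 2SLS described above can be written in a few lines: stage 1 is fit in the eQTL sample, the exposure is imputed in the GWAS sample, and the imputed exposure is regressed on the outcome. All names and the toy data-generating process below are our own assumptions; here all IVs are valid.

```python
import numpy as np

def two_sample_2sls(Z1, D1, Z2, Y2):
    """Standard two-sample 2SLS: fit stage 1 on (Z1, D1), impute the
    exposure in the second sample, then regress Y2 on the imputed exposure."""
    gamma_hat, *_ = np.linalg.lstsq(Z1, D1, rcond=None)  # stage 1: D ~ Z
    D2_hat = Z2 @ gamma_hat                              # imputed exposure
    return float((D2_hat @ Y2) / (D2_hat @ D2_hat))      # stage 2 slope

# Toy data: all IVs valid, confounder U affects both D and Y in sample 2.
rng = np.random.default_rng(0)
p, n1, n2, beta0 = 10, 2000, 50000, 0.5
gamma0 = rng.normal(size=p)
Z1, Z2 = rng.normal(size=(n1, p)), rng.normal(size=(n2, p))
U1, U2 = rng.normal(size=n1), rng.normal(size=n2)
D1 = Z1 @ gamma0 + U1 + rng.normal(size=n1)
D2 = Z2 @ gamma0 + U2 + rng.normal(size=n2)
Y2 = beta0 * D2 + U2 + rng.normal(size=n2)
beta_hat = two_sample_2sls(Z1, D1, Z2, Y2)
print(beta_hat)  # close to beta0 = 0.5, since all IVs are valid here
```

With even one invalid IV (a nonzero direct effect of some Z on Y), this naive estimator would generally be biased, which is the failure mode the paper addresses.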

We propose a Two-Stage Constrained Maximum Likelihood (2ScML) method to infer causal effects in the same instrumental-variable regression framework as 2SLS. First, we tackle the problem in a more general setting than many other methods: in particular, we allow the presence of invalid IVs violating any of the three IV assumptions. Compared to some existing methods with two different initial and final estimators, we propose a unified constrained regression approach to identify valid IVs and draw inference simultaneously. Towards this end, we propose a non-convex truncated Lasso constraint (TLC) to account for invalid IVs. Second, in contrast to TSHT, which uses only valid IVs satisfying all three assumptions, our method is more efficient by including all IVs satisfying assumption (A) but possibly violating assumptions (B) and/or (C) in stages 1 and 2 respectively. Third, and most importantly, our method applies to the two-sample case with only GWAS summary data (and a reference panel of genotypic data) in stage 2, where individual-level data from large-scale GWAS are often unavailable, as is most often the case in TWAS, whereas the aforementioned methods are not applicable; such two-sample data have dominated recent genetic studies, and we develop our method for them as in our real data examples. We propose using BIC for consistent model selection, applicable to either GWAS individual-level data or summary data. In almost all current genetic applications with GWAS summary data, including TWAS, a naive estimate of the covariance matrix, ignoring the difference between the GWAS genotypic data and the reference panel, is simply used for inference. We point out that it would under-estimate the variance and thus lead to inflated Type-I errors, especially with a small reference panel size of a few hundred commonly used in practice.
A motivating example in Section 2.3.3, in the simple context of ordinary least squares (OLS) regression with GWAS summary data and a reference panel, illustrates the severity of the problem and thus the necessity of a correction to achieve valid inference. Hence, we propose a corrected variance estimator for GWAS summary data in stage 2, which is shown to perform much better than the naive estimator. We are not aware of any other existing method with all the above features of our proposed method, which are necessary for robust applications to TWAS with the anticipated presence of some invalid IVs.

The proposed method was motivated by and is particularly suitable for applications to TWAS to identify causal genes or other molecular/imaging/clinical endophenotypes by integrating GWAS with other eQTL/xQTL data [14, 17, 54, 56, 48, 49, 38, 9, 19]. In these applications, multiple correlated SNPs (so-called cis-SNPs) near a gene are used as IVs to impute or predict the gene's expression level (or another endophenotype) to infer whether the gene's expression (or another risk factor) has a causal effect on a trait, say low-density lipoprotein cholesterol (LDL). However, due to strong modeling assumptions on valid IVs that may be violated frequently in practice, caution must be taken about the conclusions from standard TWAS. For example, it is known that TWAS tends to identify multiple genes per locus, most of which are likely false positives due to confounding caused by linkage disequilibrium (LD) among nearby SNPs [26, 43, 47]. In particular, due to confounding through LD between an eQTL (i.e., an SNP causal to a gene's expression) and a true causal SNP for a GWAS trait, a target gene identified by TWAS (or MR) may be only marginally associated with, but not causal to, the GWAS trait, just as a significant tagging SNP in GWAS may not be causal. Furthermore, due to widespread (horizontal) pleiotropy [42], some SNPs used in TWAS may not be valid IVs, again leading to violations of a critical assumption in TWAS/2SLS [2, 7]. As an alternative to TWAS, another class of popular IV analysis using (often independent) SNPs as IVs is (two-sample) Mendelian randomization (MR) [10, 11, 12]. In these applications, because eQTL studies (i.e., stage 1 in 2SLS) often have small sample sizes, applying a single-SNP/IV-based method as in MR, as implemented in SMR and GSMR for the same purpose [54, 56], would be low-powered; instead, it is more powerful and thus more desirable to apply a method using multiple SNPs to predict the gene's expression level (or another exposure/trait in stage 1). In our and many other TWAS applications with typically much smaller sample sizes for eQTL data, if MR is applied with the usual genome-wide significance threshold to select SNPs as IVs, none or few SNPs are expected to be selected for most genes; even if this significance threshold is greatly relaxed, due to strong correlations (i.e., LD) among the candidate SNPs in the cis-region of any gene, often no more than one independent SNP would be selected, rendering all robust MR methods inapplicable because they all require multiple independent IVs. For these reasons, in this paper we will focus on TWAS, not MR, though we will briefly compare in a simulation with many new and popular MR methods as reviewed in [36, 50], showing much higher statistical power of our new method over many robust MR methods. Note that we do not consider other IV regression methods inapplicable to GWAS summary data (e.g., [40]).

2. Methods

2.1. Model

We denote an exposure as D, an outcome of interest as Y, p IVs (such as SNPs) as Z ∈ ℝ^p, and the true covariance matrix of Z as Σ ∈ ℝ^{p×p}. In the following, for a subset G ⊆ S = {1, 2, ⋯, p} and a vector V ∈ ℝ^p, V_G is the corresponding sub-vector of V. Corresponding to the true causal model in Figure 1, the stage 1 and stage 2 models for the exposure and the outcome are

D = Z^T γ^0 + ξ,   Y = β^0 D + Z^T α^0 + ϵ. (1)

Fig. 1. The true causal model for (1). Directed edges represent direct effects; elements of both γ_A^0 and α_B^0 are non-zero; depending on whether β^0 ≠ 0 or not, D has or does not have a causal effect on Y.

Here ξ and ϵ are error terms independent of the instruments Z, with E(ξ) = E(ϵ) = 0, Var(ξ) = σ_1^2, Var(ϵ) = σ_2^2, and Cov(ξ, ϵ) = σ_12. In general ξ and ϵ are correlated with σ_12 ≠ 0, which accounts for unobserved confounders. β^0 is the parameter of interest, representing the causal effect of D on Y; γ^0 ∈ ℝ^p contains the true effects of the IVs on the exposure, and for some A ⊆ S, γ_j^0 ≠ 0 if and only if j ∈ A; α^0 ∈ ℝ^p contains the direct effects of the IVs on Y, and for some B ⊆ S, α_j^0 ≠ 0 if and only if j ∈ B. Note that, if B is not empty, the model explicitly accounts for the violation of IV assumptions (B) and/or (C), the main problem to be addressed here.

Subsequently, we assume that the above two-stage linear models in (1) always hold. We also note that, after centering all variables at the sample mean 0, no intercepts are needed in the two models in (1). Our primary aim is to infer the causal effect β^0. Note that in general D and ϵ are not independent because σ_12 ≠ 0; for this reason OLS gives a biased estimate of β^0 (both in finite samples and asymptotically), and 2SLS, in the general framework of instrumental-variable regression, has been proposed for (asymptotically) unbiased inference.
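The OLS bias under hidden confounding is easy to see numerically. Below is our own toy simulation (not from the paper) of model (1) under the null β^0 = 0 with SNP-like instruments: regressing Y directly on D yields a clearly nonzero slope, because D and the error share the confounder U (i.e., σ_12 ≠ 0).

```python
import numpy as np

# Toy check that OLS of Y on D is biased under hidden confounding,
# even when the true causal effect beta0 is 0.
rng = np.random.default_rng(1)
n, p, beta0 = 100000, 5, 0.0
Z = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # SNP-like IVs (0/1/2)
gamma0 = np.full(p, 0.3)
U = rng.normal(size=n)                               # unmeasured confounder
D = Z @ gamma0 + U + rng.normal(size=n)
Y = beta0 * D + U + rng.normal(size=n)
Dc, Yc = D - D.mean(), Y - Y.mean()                  # center, no intercept
beta_ols = float((Dc @ Yc) / (Dc @ Dc))
print(beta_ols)  # clearly above 0: Cov(D, Y) = Var(U) > 0
```

In this design Cov(D, Y) = Var(U) = 1 while β^0 = 0, so the OLS slope converges to Var(U)/Var(D) > 0 rather than to β^0.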

The following Plurality Condition, as stated in [16], is necessary and sufficient for parameter identifiability in model (1).

Assumption 1. (Plurality Condition) Assume that |A ∩ B^c| > max_{c ≠ 0} |{j ∈ A : α_j^0 / γ_j^0 = c}|.

Here A ∩ B^c is the set of valid IVs satisfying all three IV assumptions. When Σ is invertible, model (1) is identifiable if and only if Assumption 1 holds; see Theorem 1 in [16] for a proof.

In most TWAS applications, we have a two-sample design with an eQTL dataset for gene expression as the exposure and a GWAS dataset for the outcome, coming from two independent samples. Typically tens of SNPs are used as the IVs; the size of the first sample ranges from a few hundred to a few thousand, while that of the second sample is in the tens to hundreds of thousands. Based on these facts, we focus on the two-sample case with a fixed p.

2.2. Estimation and Inference with Individual-Level Data

We first assume the availability of individual-level data for both samples, then generalize to the setting with only summary data for the second sample. Suppose we have two independent samples of sizes n_1 and n_2, each with iid observations, 𝒟_1 = {(D_{1,i}, Z_{1,i}) : i = 1, …, n_1} and 𝒟_2 = {(Y_{2,i}, Z_{2,i}) : i = 1, …, n_2} for the two stages respectively. We use their vector and matrix forms D_1 ∈ ℝ^{n_1} (with ith element D_{1,i}) and Z_1 ∈ ℝ^{n_1×p} (with ith row Z_{1,i}) for the first sample, and Y_2 ∈ ℝ^{n_2} (with ith element Y_{2,i}) and Z_2 ∈ ℝ^{n_2×p} (with ith row Z_{2,i}) for the second. For any set G ⊆ S, we use Z_G to denote the corresponding columns of the matrix Z.

2.2.1. The Oracle Estimator

Assuming that an oracle tells us the set A of relevant IVs in stage 1 and the set B of IVs having direct effects on Y in stage 2, we define the (ideal but impractical) two-stage oracle-2SLS estimator as

Stage 1: γ̂_A^or = argmin_{γ_A} ||D_1 − Z_{1,A} γ_A||^2,  D̂_2 = Z_{2,A} γ̂_A^or;
Stage 2: (β̂^or, α̂_B^or) = argmin_{β, α_B} ||Y_2 − β D̂_2 − Z_{2,B} α_B||^2. (2)

Here we explicitly show the stage 1 oracle-2SLS. In some applications D_1 is available for the stage 1 analysis, while in others it is not, but some estimate γ̂_A of γ_A^0 is provided by a third party, e.g., the TWAS Fusion website [17]. In the following Proposition 1, we assume a consistent estimator γ̂_A of γ_A^0 such that √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞ for some constant matrix Θ, obtained either from the stage 1 analysis with D_1 or from the third party. In fact, since stage 1 is an OLS, rewriting D_1 = Z_{1,A} γ_A^0 + ξ_1 with ξ_1 the vector of random errors, we have

γ̂_A^or = (Z_{1,A}^T Z_{1,A})^{-1} Z_{1,A}^T D_1 = γ_A^0 + (Z_{1,A}^T Z_{1,A})^{-1} Z_{1,A}^T ξ_1,
√n_1 (γ̂_A^or − γ_A^0) = √n_1 (Z_{1,A}^T Z_{1,A})^{-1} Z_{1,A}^T ξ_1 →_d N(0, σ_1^2 E(Z_A Z_A^T)^{-1}), (3)

and Θ = σ_1^2 E(Z_A Z_A^T)^{-1}. In the following, we use γ̂_A and γ̂_A^or interchangeably, and expand γ̂_A^or to γ̂^or and α̂_B^or to α̂^or by setting γ̂_{A^c}^or = 0 and α̂_{B^c}^or = 0, respectively. Denote σ_t^2 = Var(β^0 ξ + ϵ) = σ_2^2 + 2σ_12 β^0 + (β^0)^2 σ_1^2, and for subsets I, J ⊆ S, let Σ_{IJ} be the sub-matrix of Σ corresponding to rows in I and columns in J. We define the following matrices:

Ψ = ( (γ_A^0)^T Σ_{AA} γ_A^0    (γ_A^0)^T Σ_{AB}
      Σ_{BA} γ_A^0              Σ_{BB} ),
Φ = ( (γ_A^0)^T Σ_{AA} Θ Σ_{AA} γ_A^0    (γ_A^0)^T Σ_{AA} Θ Σ_{AB}
      Σ_{BA} Θ Σ_{AA} γ_A^0              Σ_{BA} Θ Σ_{AB} ).

Although the two-sample 2SLS estimator has been studied previously [30, 20, 23], invalid IVs having direct effects on the outcome were often not considered. For completeness, we give Proposition 1 to establish some properties of the oracle estimator in the presence of invalid IVs.

Proposition 1. When Σ is invertible and |A ∩ B^c| > 0, assuming √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞ and n_2/n_1 → w for some positive and finite constant w, the probability of the oracle estimator β̂^or defined in (2) being unique converges to 1 as n_1, n_2 → ∞, and β̂^or is a consistent estimator of the true causal effect β^0 with β̂^or →_p β^0 as n_1, n_2 → ∞. Furthermore, we have √n_2 (β̂^or − β^0) →_d N(0, v) with v = [σ_t^2 Ψ^{-1} + w (β^0)^2 Ψ^{-1} Φ Ψ^{-1}]_{11}.

In practice, we plug the parameter estimates into v to obtain a variance estimate v̂. With β̂^or and v̂, we can make inference on β^0; this method is denoted as "oracle-2SLS-Ind".
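The stage-2 regression in (2), given the oracle sets A and B, is just an OLS of Y_2 on the imputed exposure and the invalid IVs. A minimal sketch in our own notation (the data-generating process and the sets A, B below are illustrative assumptions):

```python
import numpy as np

def oracle_2sls(Z1, D1, Z2, Y2, A, B):
    """Oracle two-sample 2SLS (cf. eq. (2)): stage 1 uses only the
    relevant IVs in A; stage 2 adjusts for the direct effects of the
    (known) invalid IVs in B."""
    gA, *_ = np.linalg.lstsq(Z1[:, A], D1, rcond=None)  # stage 1 OLS
    D2_hat = Z2[:, A] @ gA                              # imputed exposure
    X = np.column_stack([D2_hat, Z2[:, B]])             # (D_hat, Z_B) design
    coef, *_ = np.linalg.lstsq(X, Y2, rcond=None)
    return float(coef[0])                               # beta_hat^or

rng = np.random.default_rng(2)
p, n1, n2, beta0 = 10, 2000, 50000, 0.3
A, B = [0, 1, 2, 3], [3, 4]            # IV 3 is relevant AND invalid
gamma0 = np.zeros(p); gamma0[A] = 1.0
alpha0 = np.zeros(p); alpha0[B] = 0.5  # direct effects on Y
Z1, Z2 = rng.normal(size=(n1, p)), rng.normal(size=(n2, p))
D1 = Z1 @ gamma0 + rng.normal(size=n1)
U2 = rng.normal(size=n2)
D2 = Z2 @ gamma0 + U2 + rng.normal(size=n2)
Y2 = beta0 * D2 + Z2 @ alpha0 + U2 + rng.normal(size=n2)
beta_or = oracle_2sls(Z1, D1, Z2, Y2, A, B)
print(beta_or)  # close to beta0 = 0.3
```

Dropping Z2[:, B] from the stage-2 design here would bias the estimate, illustrating why invalid IVs must be modeled.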

2.2.2. New Method: Two-stage Constrained Maximum Likelihood

The proposed method consists of two stages as an extension of 2SLS. In stage 1, with 𝒟_1 we solve a constrained maximum likelihood problem to select relevant IVs satisfying Assumption (A), similar to [35] for general linear regression:

γ̂_{K_1} = argmin_γ ||D_1 − Z_1 γ||^2  subject to  (1/τ_1) Σ_{j=1}^p min(|γ_j|, τ_1) ≤ K_1, (4)

where min(|γ_j|, τ_1)/τ_1 [34] is the truncated L1-function for γ_j, a continuous surrogate of the L0 function I(γ_j ≠ 0), with I(·) the indicator function. Denote Â_{K_1} = {j ∈ S : γ̂_{j,K_1} ≠ 0} as the estimate of the set A. The tuning parameter K_1 is an integer and can be interpreted as the number of non-zero components of γ^0; the constrained problem (4) performs a best-subset-like (but computationally much more efficient) search to select K_1 relevant IVs. In practice, as required by some technical conditions shown in the Supplementary, we set τ_1 to a small fixed value like 1×10^{-5} to ensure an adequate TLC approximation to the L0-constraint, and use BIC to estimate the optimal K_1. If we assume ξ follows a normal distribution, after ignoring constant terms the log-likelihood for stage 1 is l_1(γ̂_{K_1}) = −[n_1 log(σ_1^2) + ||D_1 − Z_1 γ̂_{K_1}||^2 / σ_1^2]/2. As σ_1^2 is unknown, we plug in its estimate σ̂_1^2 = ||D_1 − Z_1 γ̂_{K_1}||^2 / n_1 to derive the BIC for stage 1:

BIC_1(K_1) = n_1 log(||D_1 − Z_1 γ̂_{K_1}||^2 / n_1) + log(n_1) ||γ̂_{K_1}||_0. (5)

With a candidate set 𝒦_1 for K_1, the optimal K_1 is obtained as K̂_1 = argmin_{K_1 ∈ 𝒦_1} BIC_1(K_1), and the estimate of γ^0 is γ̂ := γ̂_{K̂_1}. Note that the normality assumption on ξ is used only to derive BIC_1 in (5) and is not required for the results shown next. As shown by Proposition 2 in the Supplementary, when some mild assumptions are satisfied and |A| ∈ 𝒦_1, BIC consistently selects the true tuning parameter with P(K̂_1 = |A|) → 1, and Â_{K̂_1} is a consistent estimator of the true set A; thus γ̂ has the oracle property with P(γ̂ = γ̂^or) → 1. As for the oracle-2SLS in Proposition 1, we assume √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ).
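Since for small p the TLC constraint in (4) behaves like a best-subset search of size K_1, the BIC selection in (5) can be sketched with an exhaustive best-subset stand-in for the TLC solver (our own simplification, feasible only for small p; the true method uses the DC algorithm of Section 2.2.3):

```python
import numpy as np
from itertools import combinations

def stage1_bic(Z1, D1, K_grid):
    """Select K1 (number of relevant IVs) by BIC as in eq. (5); the
    TLC-constrained fit is replaced here by exhaustive best-subset
    search, which TLC is designed to mimic."""
    n1, p = Z1.shape
    scores = {}
    for K in K_grid:
        rss_best, supp_best = np.inf, ()
        for Ssub in combinations(range(p), K):
            g, *_ = np.linalg.lstsq(Z1[:, Ssub], D1, rcond=None)
            rss = float(np.sum((D1 - Z1[:, Ssub] @ g) ** 2))
            if rss < rss_best:
                rss_best, supp_best = rss, Ssub
        scores[K] = (n1 * np.log(rss_best / n1) + np.log(n1) * K, supp_best)
    K_hat = min(scores, key=lambda K: scores[K][0])
    return K_hat, set(scores[K_hat][1])

rng = np.random.default_rng(3)
n1, p = 1000, 8
gamma0 = np.zeros(p); gamma0[:3] = 1.0   # true A = {0, 1, 2}
Z1 = rng.normal(size=(n1, p))
D1 = Z1 @ gamma0 + rng.normal(size=n1)
K_hat, A_hat = stage1_bic(Z1, D1, range(1, 6))
print(K_hat, sorted(A_hat))  # BIC should recover the 3 relevant IVs
```

The log(n_1)·K penalty is what makes the selection consistent; an AIC-type penalty (2K) would tend to over-select.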

Given γ̂, with 𝒟_2 we obtain the predicted exposure D̂_2 = Z_2 γ̂. Then, in stage 2, we solve the constrained minimization:

(β̂_{K_2}, α̂_{K_2}) = argmin_{β,α} ||Y_2 − β D̂_2 − Z_2 α||^2  subject to  (1/τ_2) Σ_{j=1}^p min(|α_j|, τ_2) ≤ K_2. (6)

Denote B̂_{K_2} = {j ∈ S : α̂_{j,K_2} ≠ 0} as the estimate of the set B. Again, in practice we set τ_2 to a small fixed value like 1×10^{-5} and use BIC to estimate the optimal integer K_2, the number of invalid IVs. Here we model the direct effects of the IVs explicitly and use the non-convex constraint to select, and thus account for, invalid IVs that violate IV Assumptions (B) and (C). Similar to (5), the BIC for stage 2 is

BIC_2(K_2) = n_2 log(||Y_2 − β̂_{K_2} D̂_2 − Z_2 α̂_{K_2}||^2 / n_2) + log(n_2) ||α̂_{K_2}||_0. (7)

With a candidate set 𝒦_2 for K_2, the optimal K_2 is obtained as K̂_2 = argmin_{K_2 ∈ 𝒦_2} BIC_2(K_2), and the final estimate of (β^0, α^0) is (β̂, α̂) := (β̂_{K̂_2}, α̂_{K̂_2}). Note that if the error terms (ξ, ϵ) in model (1) follow a bivariate normal distribution, then in each stage the objective function is both the squared-error loss and the negative log-likelihood as used in 2SLS, though a truncated L1 constraint (TLC) is imposed to select relevant IVs and invalid IVs in the two stages respectively. We refer to our method as constrained maximum likelihood in anticipation of its extensions to other parametric models.

Next, we establish that our proposed 2ScML estimator has the oracle property, then use the asymptotic distribution of the oracle estimator to draw inference. The following Assumption 2 states that τ_2 should be sufficiently small for TLC to approximate the L0-constraint well.

Assumption 2. Assume 0 < τ_2 ≤ 1/(n_2 p c_max(Z_2^T Z_2)), where c_max(·) denotes the largest eigenvalue of a matrix.

Theorem 1 shows that the 2ScML estimator possesses the oracle property when Assumptions 1 and 2 are satisfied. Note that the error terms ξ and ϵ are not required to be normal.

Theorem 1. Assume that Σ is invertible, √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞, and Assumptions 1 and 2 hold. Then as n_1, n_2 → ∞, BIC consistently selects the tuning parameter with P(K̂_2 = |B|) → 1, B̂_{K̂_2} is a consistent estimator of the true set B with P(B̂_{K̂_2} = B) → 1, and we have P((β̂, α̂) = (β̂^or, α̂^or)) → 1.

Plugging the parameter estimates (including the estimates of sets A and B) into the oracle variance v in Proposition 1, we obtain an estimated variance for β̂ and thus make inference about β^0; this method is denoted as "2ScML-Ind" (indicating its dependence on individual-level data).
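To make the stage-2 selection in (6)-(7) concrete, here is our own small-scale sketch that, for small p, stands in for the TLC solver with an exhaustive best-subset search over the invalid-IV support B, scored by the BIC of (7); the stage 1 coefficients are treated as known for simplicity (in practice they come from the stage 1 fit):

```python
import numpy as np
from itertools import combinations

def stage2_bic(D2_hat, Z2, Y2, K_grid):
    """Select the invalid-IV set B by BIC (cf. eq. (7)); exhaustive
    best-subset search stands in for the TLC solver (small p only)."""
    n2, p = Z2.shape
    best_bic, best = np.inf, None
    for K in K_grid:
        for B in combinations(range(p), K):
            X = np.column_stack([D2_hat, Z2[:, list(B)]])
            coef, *_ = np.linalg.lstsq(X, Y2, rcond=None)
            rss = float(np.sum((Y2 - X @ coef) ** 2))
            bic = n2 * np.log(rss / n2) + np.log(n2) * K
            if bic < best_bic:
                best_bic, best = bic, (float(coef[0]), set(B))
    return best  # (beta_hat, estimated B)

# IVs 0-3 relevant; IV 3 (relevant) and IV 4 (irrelevant) are invalid,
# so B = {3, 4}; plurality holds via the valid IVs {0, 1, 2}.
rng = np.random.default_rng(2)
p, n2, beta0 = 6, 20000, 0.2
gamma0 = np.array([1., 1., 1., 1., 0., 0.])
alpha0 = np.array([0., 0., 0., 0.5, 0.3, 0.])
Z2 = rng.normal(size=(n2, p))
U = rng.normal(size=n2)
D2 = Z2 @ gamma0 + U + rng.normal(size=n2)
Y2 = beta0 * D2 + Z2 @ alpha0 + U + rng.normal(size=n2)
D2_hat = Z2 @ gamma0          # stage 1 treated as known here
beta_hat2, B_hat = stage2_bic(D2_hat, Z2, Y2, range(0, 4))
print(beta_hat2, B_hat)       # B_hat should contain {3, 4}
```

Note that including the empty set (K_2 = 0) in the grid lets the BIC fall back to naive 2SLS when no IV appears invalid.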

2.2.3. Computation

To solve the nonconvex constrained minimization (4), we use a difference-of-convex (DC) method that iteratively approximates the nonconvex constraint with a sequence of convex constraints. First, we decompose the constraint function into a difference of two convex functions: Σ_{j=1}^p min(|γ_j|, τ_1)/τ_1 = (Σ_{j=1}^p |γ_j| − Σ_{j=1}^p max(|γ_j| − τ_1, 0))/τ_1. Given an estimate γ̂_{j,K_1}^{(m)} at iteration m, we note max(|γ_j| − τ_1, 0) ≥ max(|γ̂_{j,K_1}^{(m)}| − τ_1, 0) + (|γ_j| − |γ̂_{j,K_1}^{(m)}|) I(|γ̂_{j,K_1}^{(m)}| > τ_1). Thus we have

(1/τ_1) Σ_{j=1}^p min(|γ_j|, τ_1) ≤ (1/τ_1) Σ_{j=1}^p [ |γ_j| − max(|γ̂_{j,K_1}^{(m)}| − τ_1, 0) − (|γ_j| − |γ̂_{j,K_1}^{(m)}|) I(|γ̂_{j,K_1}^{(m)}| > τ_1) ] = (1/τ_1) Σ_{j=1}^p [ |γ_j| I(|γ̂_{j,K_1}^{(m)}| ≤ τ_1) + τ_1 I(|γ̂_{j,K_1}^{(m)}| > τ_1) ]. (8)

We then relax (4) as a convex constrained minimization problem:

γ̂_{K_1}^{(m+1)} = argmin_γ ||D_1 − Z_1 γ||^2  subject to  (1/τ_1) Σ_{j=1}^p |γ_j| I(|γ̂_{j,K_1}^{(m)}| ≤ τ_1) ≤ K_1 − Σ_{j=1}^p I(|γ̂_{j,K_1}^{(m)}| > τ_1). (9)

Problem (9) is equivalent to a constrained Lasso problem, which can be solved by the algorithm in [28]. Similarly, we iteratively relax nonconvex minimization (6) as

(β̂_{K_2}^{(m+1)}, α̂_{K_2}^{(m+1)}) = argmin_{β,α} ||Y_2 − β D̂_2 − Z_2 α||^2  subject to  (1/τ_2) Σ_{j=1}^p |α_j| I(|α̂_{j,K_2}^{(m)}| ≤ τ_2) ≤ K_2 − Σ_{j=1}^p I(|α̂_{j,K_2}^{(m)}| > τ_2). (10)

Again, the algorithm of [28] is applied to solve the constrained Lasso problem (10). This iterative process continues until a termination criterion is met.

We initialize the DC algorithm with the constrained Lasso estimates: γ̂_{K_1}^{(0)} = argmin_γ ||D_1 − Z_1 γ||^2 subject to Σ_{j=1}^p |γ_j| / τ_1 ≤ K_1, and (β̂_{K_2}^{(0)}, α̂_{K_2}^{(0)}) = argmin_{β,α} ||Y_2 − β D̂_2 − Z_2 α||^2 subject to Σ_{j=1}^p |α_j| / τ_2 ≤ K_2. The DC algorithm is not guaranteed to reach a global minimum of the non-convex problems (4) and (6) (and it is difficult to check whether a solution is global), though it performs well in practice (as shown in our simulations).
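The bookkeeping behind one DC step in (8)-(10) is simple: coefficients whose current magnitude exceeds τ are "released" (they get L1 weight 0), and each of them consumes one unit of the budget K on the right-hand side. A minimal sketch of this update (our own helper, not the constrained Lasso solver itself):

```python
import numpy as np

def dc_relaxation(alpha_m, tau, K):
    """One DC update of the TLC constraint (cf. (8)-(10)): coefficients
    with |alpha_j| > tau get L1 weight 0 but each uses one unit of the
    budget K; the remaining coefficients keep weight 1/tau."""
    big = np.abs(alpha_m) > tau
    weights = np.where(big, 0.0, 1.0 / tau)   # weights on |alpha_j| in (10)
    budget = K - int(big.sum())               # right-hand side in (10)
    return weights, budget

# With K = 2 and two coefficients already above tau, the remaining
# coefficient faces a zero budget, i.e., it is forced to 0 in the next
# constrained-Lasso step.
w, b = dc_relaxation(np.array([0.8, 0.0, 3e-5]), tau=1e-5, K=2)
print(w, b)
```

This is why the iteration converges quickly in practice: once K coefficients escape above τ, the constraint pins all other coefficients to exactly zero.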

2.3. Extension to GWAS Summary Data

For most TWAS applications we have either individual-level data 𝒟_1 or a consistent estimate of γ_A^0 from a third party for the stage 1 analysis; thus we assume that an estimate γ̂_A is available with √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ). For stage 2, based on a GWAS of trait Y, we have an estimated marginal effect size of each Z on Y, β̂_{YZ}, along with its standard error se(β̂_{YZ}). Due to logistical and privacy issues, individual-level genotypes (i.e., Z's) and phenotypes (i.e., Y) are typically not publicly available; only summary data in the form of the β̂_{YZ}'s and se(β̂_{YZ})'s are available for all SNPs/Z's, from which we can calculate the sample correlations between Y and the Z's, the r_{YZ}'s. From a reference panel consisting of a group of n_0 individuals, such as from the 1000 Genomes Project [1] or UK Biobank [39], we obtain individual-level genotype data for the p SNPs as Z_0 ∈ ℝ^{n_0×p}, with rows corresponding to individuals. We next extend the oracle-2SLS and the proposed 2ScML in stage 2 to the situation with only GWAS summary statistics and a reference panel, assuming that the two original samples and the reference panel are independent and from the same population. This extension allows our method to be applied to published large-scale GWAS summary data covering a wide range of traits.

Without loss of generality, we assume D_1, Y_2, and the columns of Z_1, Z_2, Z_0 are all standardized to have sample mean 0 and sample variance 1, so that, for example, for the jth IV Z_j we have Z_{2,j}^T Y_2 / n_2 = r_{YZ_j}, j = 1, …, p. For a positive integer k, we use I_k to denote the k × k identity matrix.

2.3.1. The Oracle Estimator

If we had individual-level data Y_2 and Z_2 in the second sample, we could obtain the oracle estimator (β̂^or, α̂_B^or) from stage 2 in equation (2), which is an OLS and has the closed-form solution

(β̂^or, α̂_B^or)^T = [(D̂_2, Z_{2,B})^T (D̂_2, Z_{2,B})]^{-1} (D̂_2, Z_{2,B})^T Y_2
= ( γ̂_A^T (Z_{2,A}^T Z_{2,A}/n_2) γ̂_A    γ̂_A^T (Z_{2,A}^T Z_{2,B}/n_2)
    (Z_{2,B}^T Z_{2,A}/n_2) γ̂_A          Z_{2,B}^T Z_{2,B}/n_2 )^{-1} ( γ̂_A^T (Z_{2,A}^T Y_2/n_2)
                                                                        Z_{2,B}^T Y_2/n_2 ). (11)

From the GWAS summary statistics we can obtain Z_2^T Y_2 / n_2, but not Z_2^T Z_2 / n_2. As usual, replacing Z_2^T Z_2 / n_2 with Z_0^T Z_0 / n_0 in (11), we obtain an estimate of (β^0, α_B^0) as

(β̃^or, α̃_B^or)^T = ( γ̂_A^T (Z_{0,A}^T Z_{0,A}/n_0) γ̂_A    γ̂_A^T (Z_{0,A}^T Z_{0,B}/n_0)
    (Z_{0,B}^T Z_{0,A}/n_0) γ̂_A          Z_{0,B}^T Z_{0,B}/n_0 )^{-1} ( γ̂_A^T (Z_{2,A}^T Y_2/n_2)
                                                                        Z_{2,B}^T Y_2/n_2 ). (12)

We expand α̃_B^or to α̃^or by adding the component α̃_{B^c}^or = 0. For finite n_0 and n_2, we expect Z_2^T Z_2 / n_2 ≠ Z_0^T Z_0 / n_0, leading to (β̃^or, α̃_B^or) ≠ (β̂^or, α̂_B^or). Intuitively, the difference between Z_2^T Z_2 / n_2 and Z_0^T Z_0 / n_0 introduces extra variation into the estimate β̃^or as compared to β̂^or, and this additional variation is not captured by the variance v in Proposition 1 based on individual-level data (without approximation errors). This would result in inflated Type-I errors when β̃^or and the variance v in Proposition 1 are used for inference, as supported by our later simulation studies. To account for and quantify the effects of using a reference panel, as stated in Assumption 3, we impose the additional assumption that these two matrices follow Wishart distributions. Although this assumption does not hold exactly for SNP data, a Wishart distribution is widely adopted for a covariance matrix (e.g., as a prior in Bayesian statistics); here we use it as a finite-sample approximation to an asymptotically normal sample covariance matrix [24, 29], and, as shown in our simulations, it works well for SNP data. Denote by W(Σ, n) the Wishart distribution with scale matrix Σ and n degrees of freedom.
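Computationally, (12) only needs the GWAS summary vector Z_2^T Y_2 / n_2 and the reference-panel LD matrix Z_0^T Z_0 / n_0. A minimal sketch in the spirit of (12) (our own code; for simplicity the stage 1 coefficients are treated as known and the standardization step is skipped):

```python
import numpy as np

def oracle_beta_summary(gamma_full, R0, rYZ, B):
    """Summary-data oracle estimator in the spirit of eq. (12): the
    unknown stage-2 LD matrix Z2'Z2/n2 is replaced by the reference-panel
    version R0 = Z0'Z0/n0; rYZ = Z2'Y2/n2 comes from GWAS summary data."""
    p = len(gamma_full)
    W = np.column_stack([gamma_full, np.eye(p)[:, B]])  # (gamma, I_B) design
    coef = np.linalg.solve(W.T @ R0 @ W, W.T @ rYZ)
    return float(coef[0])                               # beta_tilde^or

rng = np.random.default_rng(4)
p, n2, n0, beta0 = 8, 100000, 5000, 0.4
gamma0 = np.array([1., 1., 1., 0., 0., 0., 0., 0.])
alpha0 = np.array([0., 0., 0., 0.5, 0., 0., 0., 0.])   # IV 3 is invalid
Z2, Z0 = rng.normal(size=(n2, p)), rng.normal(size=(n0, p))
U = rng.normal(size=n2)
D2 = Z2 @ gamma0 + U + rng.normal(size=n2)
Y2 = beta0 * D2 + Z2 @ alpha0 + U + rng.normal(size=n2)
rYZ = Z2.T @ Y2 / n2          # "GWAS summary statistics"
R0 = Z0.T @ Z0 / n0           # reference-panel LD matrix
beta_tilde = oracle_beta_summary(gamma0, R0, rYZ, B=[3])
print(beta_tilde)  # close to beta0 = 0.4
```

The point estimate is fine; the subtlety addressed by Theorem 2 is that its variance must account for R0 being only an estimate of the stage-2 LD matrix.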

Assumption 3. Assume that Z_0^T Z_0 ~ W(Σ, n_0) and Z_2^T Z_2 ~ W(Σ, n_2).

Now we introduce Theorem 2, which gives the asymptotic distribution of β̃^or.

Theorem 2. Assume that Σ is invertible and |A ∩ B^c| > 0; √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞; n_2/n_1 → w and n_2/n_0 → u for some positive and finite constants w and u; and Assumption 3 holds. Then the probability of the oracle estimator β̃^or defined in (12) being unique converges to 1 as n_1, n_2, n_0 → ∞, and β̃^or is a consistent estimator of the true causal effect β^0 with β̃^or →_p β^0 as n_1, n_2, n_0 → ∞. Furthermore, we have √n_2 (β̃^or − β^0) →_d N(0, v_c) with v_c = v + (u + 1)·Tr(ΨB)·[Ψ^{-1}]_{11}, where v and Ψ are given in Proposition 1, and B = (β^0, (α_B^0)^T)^T (β^0, (α_B^0)^T).

Here Tr(·) denotes the trace of a matrix. The variance v_c for GWAS summary data corrects the original individual-level variance v by accounting for the approximation error from the reference sample. Plugging the parameter estimates into v_c, we obtain a corrected variance estimate ṽ_c. With β̃^or and ṽ_c we make inference on β^0; this method is denoted as "oracle-2SLS-Sum-C" (with "C" for the corrected variance).

2.3.2. New Method: Two-Stage Constrained Maximum Likelihood

In stage 2 of 2ScML in equation (6), the objective function is ||Y_2 − β D̂_2 − Z_2 α||^2 = ||Y_2 − Z_2 (γ̂, I_p)(β, α^T)^T||^2. Denote Λ := (γ̂, I_p)^T (Z_0^T Z_0 / n_0)(γ̂, I_p) ∈ ℝ^{(p+1)×(p+1)}, which is singular even if Z_0 has full rank p. For computational simplicity, we add a small constant δ, such as 1×10^{-5}, to the diagonal elements of Λ to obtain Λ* = Λ + δ I_{p+1}. Recall Y_2^T Y_2 / n_2 = 1; after some simplification, we have

(β̃_{K_2}, α̃_{K_2}) = argmin_{β,α} ||(Λ*)^{-1/2} (γ̂, I_p)^T (Z_2^T Y_2 / n_2) − (Λ*)^{1/2} (β, α^T)^T||^2  subject to  (1/τ_2) Σ_{j=1}^p min(|α_j|, τ_2) ≤ K_2, (13)

which can be solved iteratively through a sequence of constrained Lasso problems as in Section 2.2.3. Denote B̃_{K_2} = {j ∈ S : α̃_{j,K_2} ≠ 0} as the estimate of the set B. Note that Y_2^T Z_2 / n_2 is the vector of sample correlations between Y and the Z's, available from the GWAS summary data. The BIC is

BIC_2(K_2) = n_2 log{1 − 2(Y_2^T Z_2 / n_2)(γ̂, I_p)(β̃_{K_2}, α̃_{K_2}^T)^T + (β̃_{K_2}, α̃_{K_2}^T) Λ* (β̃_{K_2}, α̃_{K_2}^T)^T} + log(n_2) ||α̃_{K_2}||_0. (14)

With a candidate set 𝒦_2 for K_2, the optimal K_2 is obtained as K̃_2 = argmin_{K_2 ∈ 𝒦_2} BIC_2(K_2), and the estimate of (β^0, α^0) is (β̃, α̃) := (β̃_{K̃_2}, α̃_{K̃_2}). Theorem 3 states the oracle property of (β̃, α̃).

Theorem 3. Assume that Σ is invertible, √n_1 (γ̂_A − γ_A^0) →_d N(0, Θ) as n_1 → ∞, and Assumptions 1 and 2 hold. Then as n_1, n_2, n_0 → ∞, BIC consistently selects the tuning parameter with P(K̃_2 = |B|) → 1, B̃_{K̃_2} is a consistent estimator of the true set B with P(B̃_{K̃_2} = B) → 1, and we have P((β̃, α̃) = (β̃^or, α̃^or)) → 1.

Plugging the parameter estimates (including those for the sets A and B) into the corrected oracle variance v_c in Theorem 2, we obtain a corrected variance estimate for β̃ and thus make inference about β^0; this method is denoted as "2ScML-Sum-C".

2.3.3. A Motivating Example: OLS with Summary Data

This example of OLS regression illustrates some key differences between using individual-level data and using summary data with a reference panel. We have p predictors X ∈ ℝ^p and a response variable Y from the true model Y = X^T β + ϵ, where ϵ ~ N(0, σ^2) is the random error independent of X. We have a sample of size n, denoted by X ∈ ℝ^{n×p} and Y ∈ ℝ^n, from the true model, and an independent reference panel of size n_0, denoted by X_0 ∈ ℝ^{n_0×p}. We scale the columns of X and X_0, and the vector Y, to have sample mean 0 and sample variance 1. With individual-level data we obtain the OLS estimate β̂ = (X^T X)^{-1} X^T Y and its estimated covariance matrix Ĉov(β̂) = σ̂^2 (X^T X)^{-1} with σ̂^2 = ||Y − X β̂||^2 / n. With summary data X^T Y / n (the marginal association estimates), by replacing X^T X / n with X_0^T X_0 / n_0 in β̂, we obtain

β̃ = (X_0^T X_0 / n_0)^{-1} (X^T Y / n),  Cov(β̃ | X, X_0) = σ^2 (X_0^T X_0 / n_0)^{-1} (X^T X / n^2) (X_0^T X_0 / n_0)^{-1}.

Since X^T X / n is unknown, we again approximate it by X_0^T X_0 / n_0, obtaining the usual uncorrected covariance matrix estimate C̃ov(β̃) = σ̃^2 (X_0^T X_0 / n_0)^{-1} / n with σ̃^2 = 1 − 2(Y^T X / n) β̃ + β̃^T (X_0^T X_0 / n_0) β̃ (since Y^T Y / n = 1 after scaling). Assuming X^T X ~ W(Σ, n) and X_0^T X_0 ~ W(Σ, n_0), with B = β β^T ∈ ℝ^{p×p}, as shown in the Supplementary we derive the (marginal) covariance matrix as

Cov(β̃) = [(n + n_0)/(n n_0)] Tr(ΣB) Σ^{-1} + (σ^2 / n) Σ^{-1}. (15)

Plugging in the estimates σ̃^2, Σ̂ = X_0^T X_0 / n_0, and B̂ = β̃ β̃^T, we obtain a corrected estimate Ĉov_c(β̃). Note that the second term of the corrected estimate Ĉov_c(β̃) is exactly the uncorrected estimate C̃ov(β̃).

We can make inference about β with β̂ and Ĉov(β̂), denoted by "OLS-Ind"; with β̃ and the uncorrected C̃ov(β̃), denoted by "OLS-Sum"; or with β̃ and the corrected Ĉov_c(β̃), denoted by "OLS-Sum-C".

We conducted simulations to compare the methods. We set p = 5, Σ = I_5, β = (1, 1, 1, 1, 1)^T, and σ^2 = 5, and simulated X ~ N(0, Σ), i.e., the five predictors X_1 to X_5 were iid standard normal. We used n = 500 and tried n_0 = 100, 500, 1000, or 10000; for each setup we ran 1000 replications. As the five predictors were exchangeable, we show the simulation results for β_1 in Table 1; all estimates and standard errors in the table were scaled back to the original scale. All methods were nearly unbiased, but OLS-Sum had inflated Type-I errors, with the inflation decreasing as n_0 increased. In contrast, OLS-Ind and OLS-Sum-C controlled the Type-I error well, while the OLS-Sum-C estimate had a much larger variance than that of OLS-Ind, clearly indicating the cost of using summary data and a reference panel: an inflated variance of the estimate.
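The variance inflation of OLS-Sum is easy to reproduce. Below is our own Monte Carlo sketch (simplified relative to the paper's setup: Σ = I, no standardization step, fewer replications) comparing the sampling SD of the summary-data estimator β̃ against the individual-level SE √(σ²/n):

```python
import numpy as np

# Monte Carlo sketch: with summary data plus a reference panel, the
# sampling SD of the OLS-Sum estimator clearly exceeds the
# individual-level SE, so the naive variance understates uncertainty.
rng = np.random.default_rng(5)
p, n, n0, sigma2 = 5, 500, 500, 5.0
beta = np.ones(p)
ests = []
for _ in range(300):
    X = rng.normal(size=(n, p))
    Y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    X0 = rng.normal(size=(n0, p))                        # reference panel
    beta_t = np.linalg.solve(X0.T @ X0 / n0, X.T @ Y / n)  # OLS-Sum
    ests.append(beta_t[0])
sd_sum = float(np.std(ests))
se_ind = float(np.sqrt(sigma2 / n))    # individual-level SE (Sigma = I)
print(sd_sum, se_ind)                  # the first is markedly larger
```

Under these simplified assumptions, (15) predicts a sampling variance of (n+n_0)/(n n_0)·Tr(ΣB) + σ²/n = 0.02 + 0.01 = 0.03 at n = n_0 = 500, i.e., an SD of about 0.17 versus the naive 0.10, consistent with the Table 1 pattern.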

Table 1.

Estimating β_1 and testing H_0: β_1 = 1 versus H_1: β_1 ≠ 1 at the significance level 0.05 based on 1000 simulations for the motivating example. In each column, from top to bottom we show the mean of the estimates, the mean of the standard errors, the standard deviation of the estimates, and the empirical Type-I error.

Method        OLS-Ind   OLS-Sum                              OLS-Sum-C
n0                      100     500     1000    10000        100     500     1000    10000
Mean(Est)     1.0003    1.0474  1.0021  0.9981  1.0031       1.0474  1.0021  0.9981  1.0031
Mean(SE)      0.0998    0.0988  0.0993  0.0996  0.0992       0.2749  0.1744  0.1586  0.1432
SD(Est)       0.0993    0.2540  0.1596  0.1442  0.1292       0.2540  0.1596  0.1442  0.1292
Type-I Error  0.048     0.460   0.229   0.171   0.142        0.035   0.032   0.030   0.033
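The corrected covariance formula above can be checked numerically. Below is a minimal sketch (not the paper's code) that reproduces the motivating simulation in simplified form: it assumes Σ = I and plugs the true σ² into Eq. (15) instead of the estimate σ̃².

```python
import numpy as np

# Sketch of the motivating simulation: OLS from summary data X'Y/n with a
# reference-panel covariance estimate, checked against Eq. (15).
# Simplifications vs. the paper: Sigma = I, true sigma^2 used, Y not rescaled.
rng = np.random.default_rng(0)
p, n, n0, sigma2 = 5, 500, 1000, 5.0
beta = np.ones(p)

def beta_tilde_once():
    X = rng.standard_normal((n, p))                    # full-data design
    Y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), n)
    X0 = rng.standard_normal((n0, p))                  # independent reference panel
    Sigma0 = X0.T @ X0 / n0                            # reference estimate of Sigma
    return np.linalg.solve(Sigma0, X.T @ Y / n)        # summary-data OLS estimate

reps = np.array([beta_tilde_once() for _ in range(2000)])
emp_var = reps[:, 0].var()                             # Monte Carlo variance of beta_tilde_1

Sigma, B = np.eye(p), np.outer(beta, beta)
# Eq. (15): Cov(beta_tilde) = (n+n0)/(n*n0) * Tr(Sigma B) * Sigma^{-1} + sigma2/n * Sigma^{-1}
corrected = (n + n0) / (n * n0) * np.trace(Sigma @ B) + sigma2 / n   # its (1,1) entry
uncorrected = sigma2 / n                                             # second term only
```

With these settings the corrected variance tracks the Monte Carlo variance of the summary-data estimate far more closely than the uncorrected term σ²/n alone, matching the Table 1 pattern.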

3. Simulations

3.1. Simulation 1: TWAS

We compared 2ScML with naive-2SLS/TWAS and oracle-2SLS through simulations; to be realistic, the setup mimicked real TWAS applications with real SNP data and two independent samples of different sizes as in model (1). We extracted the genotypic data of 408339 individuals in UK Biobank [39] for p = 56 correlated SNPs from gene MAFB on chromosome 20 as the population data for our simulation. The minor allele frequencies (MAFs) of the 56 SNPs ranged from 0.05 to 0.45, and their correlation matrix is shown in the Supplementary. The genotype data in the two samples and the reference panel were independently drawn with replacement from the population. The error terms (ϵ, ξ) were generated from a bivariate normal distribution with means 0, variances $\sigma_1^2 = \sigma_2^2 = \sigma^2 = 1$ or 2, and correlation 0.5. We set $\gamma_{i0} = 1$ for 2 ≤ i ≤ 8 and $\gamma_{i0} = 0$ otherwise; i.e., the 2nd to 8th IVs were relevant with an equal effect size of 1. We set $\alpha_{i0} = 1$ for i = 1, 7, 8, 9 and $\alpha_{i0} = 0$ otherwise; i.e., the relevant 7th and 8th IVs were invalid, and the irrelevant 1st and 9th IVs were also invalid. When σ² was 1 or 2, the true R² in stage 1 was 0.303 or 0.179 respectively. When β0 = 0, there was no causal effect of the exposure on the outcome, i.e., it was a null case.

In each simulation we generated two independent samples from model (1) of sizes n1 = 500, 1000, or 2000 for stage 1 and n2 = 50000 or 100000 for stage 2, and generated a reference panel of size n0 = 500, 10000, 50000, or 100000 when n2 = 50000, and n0 = 500, 10000, 100000, or 200000 when n2 = 100000. We then applied the different methods to the simulated data to test H0: β0 = 0 versus H1: β0 ≠ 0. For all methods, in stage 1 we used individual-level data and the 2nd to 8th IVs to obtain $\hat\gamma$. In stage 2, for 2ScML, we chose the best K2 from 0 to 10 and set τ2 = 1 × 10−5; for naive-2SLS, we fitted a linear regression of Y on $\hat D$; for oracle-2SLS, we fitted a linear regression of Y on $\hat D$ and the 4 truly invalid IVs with direct effects. In stage 2 we could use either individual-level data, denoted by "-Ind", or summary data (and a reference panel) with the uncorrected or corrected variance estimator, denoted by "-Sum" or "-Sum-C" respectively.
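The contrast between naive-2SLS and oracle-2SLS can be sketched in a few lines. The code below is a simplified illustration, not the paper's setup: it uses independent SNPs rather than the correlated MAFB SNPs, omits the 2ScML selection step, and keeps the same pattern of relevant and invalid IVs.

```python
import numpy as np

# Minimal two-sample 2SLS sketch with invalid IVs (simplified: independent SNPs).
rng = np.random.default_rng(1)
p, n1, n2, beta0 = 9, 2000, 50000, 0.0
gamma = np.zeros(p); gamma[1:8] = 1.0            # IVs 2-8 relevant
alpha = np.zeros(p); alpha[[0, 6, 7, 8]] = 1.0   # IVs 1, 7, 8, 9 invalid

def draw(n):
    Z = rng.binomial(2, 0.3, size=(n, p)).astype(float)              # genotypes
    eps = rng.multivariate_normal([0, 0], [[2, 1], [1, 2]], size=n)  # correlated errors (confounding)
    D = Z @ gamma + eps[:, 0]                    # exposure (stage 1 model)
    Y = beta0 * D + Z @ alpha + eps[:, 1]        # outcome with direct IV effects
    return Z, D, Y

Z1, D1, _ = draw(n1)                             # stage 1 sample
Z2, _, Y2 = draw(n2)                             # stage 2 sample
gamma_hat = np.linalg.lstsq(Z1, D1, rcond=None)[0]
D_hat = Z2 @ gamma_hat

ones = np.ones(n2)
naive = np.linalg.lstsq(np.column_stack([ones, D_hat]), Y2, rcond=None)[0][1]
X_or = np.column_stack([ones, D_hat, Z2[:, [0, 6, 7, 8]]])  # adjust for truly invalid IVs
oracle = np.linalg.lstsq(X_or, Y2, rcond=None)[0][1]
```

Under this null (β0 = 0), the naive estimate is biased away from zero by the direct effects of the invalid IVs, while the oracle estimate, which adjusts for the truly invalid IVs, remains approximately unbiased.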

We varied β0 from −0.1 to 0.1 with a step size of 0.02 and applied all methods over 1000 independent replicates to calculate their empirical Type-I error and power at significance level 0.05. Figure 2 compares the results of oracle-2SLS and 2ScML for σ² = 2, n1 = 500, and n2 = 50000 with different n0's; the results for n0 = 100000 were similar to those for n0 = 50000, and thus are not shown here. The complete simulation results for other setups and for naive-2SLS are in the Supplementary. From Figure 2 we can see that, with individual-level data, 2ScML-Ind performed almost identically to oracle-2SLS-Ind: both controlled the Type-I error well and had high power, confirming the oracle property of 2ScML in Theorem 1. With summary data and a reference panel, both 2ScML-Sum and oracle-2SLS-Sum had inflated Type-I errors: the former had slightly larger inflation; the inflation was substantial with a small n0 = 500 and decreased as n0 increased. With the corrected variance estimator, both 2ScML-Sum-C and oracle-2SLS-Sum-C controlled the Type-I error for all n0, and their power increased as n0 increased. Though both 2ScML-Ind and 2ScML-Sum-C controlled the Type-I error, the former had much higher power than the latter, demonstrating a significant loss of information with summary data that has largely been neglected in the literature; the same held for oracle-2SLS.

Fig. 2.

Fig. 2

Empirical Type-I error rates (for β0 = 0 in the x-axis) and power (for β0 ≠ 0) in Simulation 1 when σ2 = 2, n1 = 500, and n2 = 50000.

Table 2 shows more detailed estimation and inference results for true β0 = 0. For estimation, both oracle-2SLS and 2ScML were almost unbiased regardless of using individual-level or summary data, but the estimates with summary data had larger variation than those with individual-level data. For example, when n0 = 500, the estimates of oracle-2SLS and 2ScML with summary data had SD($\tilde\beta$) of 0.0504 and 0.0695 respectively, almost five and seven times their SD($\tilde\beta$) of 0.0101 and 0.0102 with individual-level data. This again confirmed the information loss from using summary data. As n0 increased, SD($\tilde\beta$) for both oracle-2SLS and 2ScML with summary data decreased. For testing, both oracle-2SLS-Sum-C and 2ScML-Sum-C controlled the Type-I error well, though the former was a little conservative. As naive-2SLS failed to account for invalid IVs, it always gave largely biased estimates and thus highly inflated Type-I errors.

Table 2.

Detailed results for different methods in Simulation 1 when β0 = 0, σ² = 2, n1 = 500, and n2 = 50000. In each cell from top to bottom, we show the means of the causal estimates and their standard errors, the standard deviation of the causal estimates, and the Type-I error.

Method                     Ind       Sum                           Sum-C
n0                                   500      10000    50000      500      10000    50000
oracle-2SLS  Mean(Est)     3e-04     4e-04    -1e-04   3e-04      4e-04    -1e-04   3e-04
             Mean(SE)      0.0101    0.0102   0.0102   0.0101     0.0588   0.0171   0.0129
             SD(Est)       0.0101    0.0504   0.0153   0.0122     0.0504   0.0153   0.0122
             Type-I Error  0.044     0.729    0.187    0.101      0.024    0.025    0.041
2ScML        Mean(Est)     3e-04     0.0040   3e-04    3e-04      0.0040   3e-04    3e-04
             Mean(SE)      0.0101    0.0116   0.0103   0.0102     0.0650   0.0174   0.0129
             SD(Est)       0.0102    0.0695   0.0171   0.0126     0.0695   0.0171   0.0126
             Type-I Error  0.047     0.794    0.239    0.109      0.056    0.039    0.045
naive-2SLS   Mean(Est)     0.1960    0.1959   0.1961   0.1961     0.1959   0.1961   0.1961
             Mean(SE)      0.0204    0.0193   0.0200   0.0200     0.0233   0.0202   0.0201
             SD(Est)       0.0963    0.0971   0.0965   0.0964     0.0971   0.0965   0.0964
             Type-I Error  0.988     0.991    0.988    0.988      0.991    0.988    0.988

3.2. Simulation 2: MR

Although not our main purpose, to show the versatility of our proposed method we applied it to typical two-sample MR settings and compared it with many new and popular two-sample MR methods, including MR-ContMix [8], MR-Mix [33], MR-Lasso [6], MR-cML [50], MR-PRESSO [42], MR-IVW (random-effect (RE) meta-analysis) [5], MR-Egger regression [3], the weighted median method (MR-W-Median) [4], the weighted mode method (MR-W-Mode) [18], and MR-RAPS with over-dispersion and the Tukey loss (MR-RAPS1) or without over-dispersion and with the squared error loss (MR-RAPS2) [51]. In each simulation, we generated two independent samples for the two-stage analysis, calculated the summary data with marginal linear regressions, and used the same summary data for all methods (including the proposed ones).

In summary, 2ScML-Sum-C, MR-ContMix, MR-Lasso, and MR-cML appeared to be the winners with well-controlled Type-I error and high power. More specifically, when IV Assumption (B) and thus the InSIDE assumption was satisfied, MR-PRESSO, MR-RAPS2 and 2ScML-Sum had inflated Type-I errors, MR-IVW and MR-Egger also had slightly inflated Type-I errors, while all other methods could control the Type-I error well at the nominal level of 0.05; 2ScML-Sum-C, MR-ContMix, MR-Lasso and MR-cML performed similarly with Type-I error satisfactorily controlled and higher power than all the other methods. When IV Assumption (B) and thus the InSIDE assumption was violated, MR-IVW, MR-Egger, and MR-RAPS1 had more highly inflated Type-I errors, while all other methods had similar performance to their counterparts when Assumption (B) was satisfied. See Supplementary for detailed results.
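For concreteness, here is a minimal sketch of one of the baseline methods, fixed-effect MR-IVW, computed from two-sample summary statistics. The setting is a toy one with all IVs valid; the effect sizes and standard errors are illustrative, not taken from the simulations above.

```python
import numpy as np

# Fixed-effect MR-IVW from two-sample summary statistics:
# per-SNP Wald ratios Gamma_hat_j / gamma_hat_j, inverse-variance weighted.
rng = np.random.default_rng(2)
J, beta0 = 50, 0.3
gamma = rng.uniform(0.1, 0.3, J)                    # true SNP-exposure effects
se_x, se_y = 0.01, 0.01                             # summary-statistic SEs (assumed known)
gamma_hat = gamma + rng.normal(0, se_x, J)          # exposure GWAS estimates
Gamma_hat = beta0 * gamma + rng.normal(0, se_y, J)  # outcome GWAS estimates (all IVs valid)

wald = Gamma_hat / gamma_hat                        # per-SNP causal estimates
w = gamma_hat**2 / se_y**2                          # approximate inverse variances
beta_ivw = np.sum(w * wald) / np.sum(w)             # IVW point estimate
se_ivw = 1.0 / np.sqrt(np.sum(w))                   # fixed-effect standard error
```

With valid IVs the IVW estimate recovers β0; the robust methods compared above are designed for the case where some Wald ratios are contaminated by direct (pleiotropic) effects.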

3.3. Sensitivity Analysis of IV Strengths

We further studied how our proposed methods perform depending on the strength of the IVs. The concentration parameter μ² defined in [37] quantifies IV strength; in the Supplementary we discuss the effects of μ² on the asymptotic distribution and efficiency of our proposed estimators. We varied the IV effects γ0, and thus μ², in simulations for TWAS and MR. Our results show that when the IVs were moderately strong, the distributions of the proposed estimators were well approximated by normal distributions. The proposed methods always controlled the Type-I error, and their power increased with the IV strength. Detailed results are in the Supplementary.
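As a rough illustration of the quantities involved (with made-up effect sizes, not the paper's sensitivity-analysis settings), the concentration parameter and the first-stage joint F-statistic can be computed as follows; for strong instruments, roughly E[F] ≈ μ²/p + 1.

```python
import numpy as np

# Illustration of IV strength: concentration parameter mu^2 and first-stage F.
rng = np.random.default_rng(3)
n, p, sigma1 = 500, 7, 1.0
Z = rng.standard_normal((n, p))                    # standardized instruments
gamma = np.full(p, 0.3)                            # illustrative first-stage effects
mu2 = float(gamma @ Z.T @ Z @ gamma) / sigma1**2   # concentration parameter

D = Z @ gamma + rng.normal(0, sigma1, n)           # exposure from the stage 1 model
Zc = np.column_stack([np.ones(n), Z])              # first-stage design with intercept
coef, *_ = np.linalg.lstsq(Zc, D, rcond=None)
rss = np.sum((D - Zc @ coef) ** 2)
tss = np.sum((D - D.mean()) ** 2)
F = ((tss - rss) / p) / (rss / (n - p - 1))        # joint F-test of the p IVs
```

The conventional rule of thumb F > 10 corresponds to instruments strong enough that weak-IV bias is typically negligible.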

4. Application to TWAS for LDL

With only GWAS summary data available in stage 2, we consider only the methods applicable to GWAS summary data and drop "-Sum" from each method's name for simplicity; for example, 2ScML-C denotes 2ScML-Sum-C.

4.1. Main Analysis

We applied 2ScML and the naive-2SLS to identify (putative) causal genes for LDL with GWAS summary statistics. For each gene, we used the TWAS Fusion pre-calculated coefficients γ^‘s for our stage 1 analysis [17]; the coefficients were estimated based on microarray expression data of blood from the Young Finns Study (YFS) with sample size n1 = 1264 [27]. The GWAS summary data of LDL were drawn from [41] with sample sizes up to n2 = 95454; we removed the SNPs with sample sizes less than 80000. We used software ImpG [31] to impute the LDL GWAS summary statistics with 489 unrelated individuals of European ancestry from the 1000 Genomes Project [1] as the reference panel. As stated in [31], we used the imputation accuracy measure r2 to quantify the imputation quality for each SNP and removed imputed SNPs with r2 < 0.3. With the availability of the genotype data of 408339 individuals of white British ancestry from UK Biobank, we could take a random sample of n0 = 500, 10000, 95454, or all 408339 individuals as our reference panel for stage 2 analysis in TWAS. We removed the SNPs with MAFs less than 0.05 or failing the Hardy-Weinberg equilibrium test with p-values less than 0.001.

There were 4700 genes with pre-calculated γ^ in the TWAS Fusion database. We first removed the genes with stage 1 regression p-value greater than 0.05/4700. Then for each of the genes left, we identified the set of its eSNPs with non-zero regression coefficients and also available in both the reference panel and the GWAS summary data, and removed the genes with no more than 1 eSNP. We removed 880 genes in total and analyzed the remaining 3820 genes. We calculated the first stage joint F-statistics for the 3820 genes; the mean and the range of their F-statistics were 34.73 and [3.18, 1144.51] respectively. There were 826 genes with their F-statistics less than 10, while those of the other 2994 genes were greater than 10. We show the distributions of the F-statistics as histograms in the Supplementary. For each gene, we extracted all SNPs near its eSNPs, then pruned out SNPs with pairwise absolute correlations greater than 0.6. If more than 100 SNPs were left, we only kept the top 100 SNPs with the highest absolute correlations with LDL; otherwise, we kept all of them. These SNPs were used in the stage 2 analysis. We set the candidate set of K2 for 2ScML as $\mathcal{K}_2 = \{0, 1, \ldots, p/2\}$, and set τ2 = 1 × 10−5.
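The pruning step above can be sketched as a greedy scan. This is a hypothetical implementation: the paper's exact SNP ordering and the subsequent top-100 selection by correlation with LDL are not reproduced here.

```python
import numpy as np

def prune_snps(G, thresh=0.6):
    """Greedy LD pruning: scan SNPs in order and keep a SNP only if its
    absolute correlation with every previously kept SNP is <= thresh.
    G: n x p genotype matrix; returns the indices of retained SNPs."""
    R = np.corrcoef(G, rowvar=False)
    keep = []
    for j in range(G.shape[1]):
        if all(abs(R[j, k]) <= thresh for k in keep):
            keep.append(j)
    return keep
```

Greedy pruning is order-dependent, so this is one of several reasonable variants; tools such as PLINK implement windowed versions of the same idea.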

For each of the 3820 genes, we obtained its estimated effect sizes and p-values from naive-2SLS and 2ScML for different n0’s. Since the results for n0 = 10000, 95454, and 408339 were similar, we present those for n0 = 500 and 95454 while relegating the others to the Supplementary. Figure 3 shows the quantile-quantile (Q-Q) plots of the p-values for different methods. From panels (A) and (D) we see that for both n0 = 500 and 95454, the Q-Q plot for 2ScML-C had the left tail in good agreement with the identity line, and its genomic inflation factor λ [13] was close to 1, indicating that the Type-I error was controlled satisfactorily; its heavier right tail could be due to the polygenicity of a complex trait like LDL with many genes having small effects. On the other hand, panels (B) and (C) show that, when n0 = 500 was small, compared to 2ScML-C, the other three methods could have inflated Type-I errors possibly due to the effects of invalid IVs and/or failing to account for the effects of using a small reference panel. As n0 increased to 95454, as shown by panels (E) and (F), 2ScML and naive-2SLS had similar performance to 2ScML-C and naive-2SLS-C respectively, while naive-2SLS-C seemed to still have an inflated left tail possibly due to its failure to account for invalid IVs.

Fig. 3.

Fig. 3

The Q-Q plots of p-values in the −log10 scale for different methods when n0 = 500 (top row) and 95454 (bottom row). The left column shows a Q-Q plot of p-values of 2ScML-C versus the expected p-values under the null; the genomic inflation factor λ is shown too. The middle column shows a Q-Q plot of p-values of 2ScML-C versus other methods, and the right column zooms in. The grey solid line in each panel is the identity line.

With different n0’s, at the Bonferroni corrected significance cutoff 0.05/3820, naive-2SLS, naive-2SLS-C and 2ScML, 2ScML-C identified 27 significant genes in total. We did a literature search on each of the 27 significant genes. We excluded the study generating the LDL GWAS data we used [41]. Based on the literature support from other studies, we assigned a score to each gene: if there were other studies (1) supporting this gene being associated with LDL, we assigned the highest score of 5; (2) supporting this gene associated with a trait related to LDL, we assigned a score of 4; (3) identifying one or more SNPs mapped to or nearby this gene, which were significantly associated with LDL, we assign a score of 3; (4) identifying some SNPs mapped to or nearby this gene, which were significantly associated with other traits related to LDL, we assigned it a score of 2; (5) identifying some SNPs mapped to or nearby this gene, which were significantly associated with any traits, we assigned a score of 1; (6) otherwise, we assigned the lowest score of 0. See Supplementary for a list of all the 27 genes with their supporting references.

When n0 = 500 and 95454, naive-2SLS and 2ScML-C identified 22 genes in total; Figure 4 shows the numbers of genes identified by each method and their overlaps. Table 3 lists the 22 genes with their p-values given by naive-2SLS and 2ScML-C; the non-significant ones with p > 0.05/3820 are marked with an asterisk. Of the 22 genes, 15 were also analyzed by the joint-tissue imputation (JTI) approach in [52]; their q-values are also shown, with the non-significant ones at false discovery rate (FDR) > 0.05 marked. From Figure 4 and Table 3 we can see that, when n0 increased from 500 to 95454, the number of significant genes identified by naive-2SLS decreased from 20 to 15, while that by 2ScML-C decreased from 12 to 10, again suggesting possibly more liberal results with false positives when a smaller reference panel was used without suitably accounting for its effects. Two genes, HSPA6 and DDAH2, were not significant by JTI; depending on the reference panel size, 2ScML-C also identified one or both as non-significant, while naive-2SLS identified both as significant. This supports better Type-I error control by 2ScML-C than by naive-2SLS. It is also notable that 2ScML-C did not always give less significant results than naive-2SLS: for both n0 = 500 and 95454, gene HLA-DQB1 was non-significant by naive-2SLS, while 2ScML-C agreed with JTI and identified it as significant. Another gene, CDKN2D, was not significant by naive-2SLS, but 2ScML-C (when n0 = 95454) and JTI claimed otherwise.

Fig. 4.

Fig. 4

The numbers of significant genes associated with LDL identified by the (naive-)2SLS or 2ScML using either n0 = 500 or 95454.

Table 3.

The 22 significant genes identified to be associated with LDL by 2ScML-C and/or naive-2SLS with a reference sample size of n0 = 500 or 95454. Insignificant p-values (or q-values for JTI) are each marked with an asterisk.

                     -------- n0 = 500 --------   ------- n0 = 95454 -------
Name      Chr   p    K̂2  naive-2SLS  2ScML-C     K̂2  naive-2SLS  2ScML-C     JTI        Score
DOCK7       1   39    0   6.36e-09   6.09e-08     0   6.69e-08   6.80e-08    NA         3
PSRC1       1   29    7   5.31e-57   2.33e-21     2   9.25e-57   2.34e-47    0.00e+00   5
GNAI3       1   34   11   2.42e-06   5.49e-01*   11   3.08e-06   1.05e-02*   NA         4
GSTM4       1   44    9   7.56e-07   8.60e-01*    7   1.75e-06   2.69e-01*   4.44e-02   1
HSPA6       1   45    1   2.05e-11   4.70e-09     1   1.16e-05   1.70e-05*   8.08e-01*  1
MKRN2       3   12    0   2.11e-07   9.62e-07     0   5.59e-07   5.67e-07    5.99e-07   3
MARCH6      5   33    0   1.65e-17   9.68e-14     0   2.18e-02*  2.18e-02*   NA         2
HCG27       6   42    4   1.37e-11   1.04e-04*    3   2.51e-03*  4.61e-03*   7.59e-03   2
MICA        6   34    7   8.67e-07   2.00e-04*    7   7.32e-02*  6.39e-03*   1.12e-07   3
DDAH2       6   12    1   8.66e-06   2.56e-03*    1   9.05e-06   8.38e-04*   9.49e-01*  4
HLA-DQB1    6   55    6   3.40e-04*  1.92e-09     3   4.54e-04*  1.31e-12    3.57e-03   3
TMED4       7   15    5   2.39e-08   1.94e-06     1   3.40e-08   2.84e-05*   9.55e-13   2
CLDN15      7   30    0   4.26e-14   1.30e-11    11   5.86e-02*  4.00e-02*   6.09e-03   1
NSMAF       8   28    1   1.25e-05   2.95e-01*    1   2.14e-05*  4.54e-01*   NA         3
PARP10      8    3    0   4.25e-08   2.68e-07     0   4.12e-08   4.20e-08    1.08e-06   4
GRINA       8   12    0   9.47e-11   2.65e-09     0   7.53e-11   7.83e-11    NA         3
FADS1      11   15    1   1.21e-09   2.14e-07     1   1.14e-09   2.84e-08    4.19e-31   5
OASL       12   38    1   6.63e-15   8.87e-13     1   6.20e-08   8.47e-10    NA         3
TBKBP1     17    5    0   9.96e-06   2.36e-05*    0   9.59e-06   9.67e-06    7.74e-17   3
CDKN2D     19    9    4   6.31e-02*  1.98e-05*    4   6.19e-02*  8.35e-07    1.24e-09   3
SLC44A2    19   11    5   9.27e-06   1.85e-02*    5   8.04e-06   2.44e-04*   2.90e-02   3
PVRL2      19   10    5   2.30e-11   7.23e-05*    5   1.87e-11   1.15e-04*   NA         3

Two genes, PSRC1 and FADS1, had a score of 5 based on the literature search; both were identified by all three methods. Gene PSRC1 modulates cholesterol metabolism and inflammation; over-expression of PSRC1 in mice decreased the LDL level [15]. Mice with gene FADS1 knocked out had decreased cholesterol levels [32], and the gene is in the fatty acid metabolism pathway from the Kyoto Encyclopedia of Genes and Genomes (KEGG) [22]. Gene FADS1 is also in the silver-standard set of LDL-related genes compiled in [52], leading to validation rates of 1/15 = 6.7% and 1/10 = 10% for naive-2SLS and 2ScML-C respectively with the large reference sample of n0 = 95454, compared to the lower rates of 17/680 = 2.5% for JTI and 9/411 = 2.2% for PrediXcan (even with a larger LDL GWAS dataset) [52].

Finally, we note that it is largely infeasible to apply any robust MR method here because such methods require multiple independent SNPs as IVs for each gene: many genes have at most one or a few independent SNPs nearby, even before a stringent significance cutoff is imposed (to ensure that IV Assumption (A) holds). Specifically, for each of the 3820 genes, we pruned its eSNPs so that all pairwise absolute correlations were at most 0.01, obtaining a set of (nearly) independent SNPs. For 1328 genes, only one independent eSNP remained; for 2125 genes, there were only 2 independent eSNPs; and for the remaining 357 (or 10) genes, there were only 3 (or 4).

4.2. Secondary Analysis: Comparison with MR-JTI

As mentioned in the previous section, a recent study applied JTI to identify putative causal genes for LDL [52]. A distinct feature of JTI is to build a stage 1 regression model by borrowing information from eQTL data of multiple tissues, which was shown to improve the performance. When applied to the GTEx multi-tissue eQTL data (with the liver data of n1 = 208 as the primary/target data) and UK Biobank (quantile-transformed) LDL GWAS summary data of n2 = 343621, JTI identified 680 LDL-associated genes at FDR < 0.05. While the JTI method does not account for invalid IVs, its MR version called MR-JTI does. When applied to the 680 genes with the same data, at the Bonferroni adjusted significance cutoff 0.05/680, MR-JTI identified 138 significant genes, and 6 of them, genes SORT1, TNKS, LPA, FADS3, PLTP, LPIN3, were in the silver-standard set of the LDL-related genes based on the KEGG cholesterol metabolism pathway and literature search.

For comparison with MR-JTI, we applied 2ScML and naive-2SLS to these 680 genes with the same data and the same significance cutoff, directly using the fitted models (based on the GTEx multi-tissue data) for stage 1. Since we had access to the UK Biobank genotypic data of 408339 individuals of white British ancestry, after excluding the 333462 individuals identified to be included in the UKB LDL GWAS summary data, we could use a subset of n0 = 500, 10000, or all of the remaining 74877 individuals as the reference panel. As shown in the Supplementary, with the large reference panel of n0 = 74877, 2ScML and naive-2SLS (regardless of whether the variance was corrected) identified 55 and 73 putative causal genes respectively, 5 and 6 of which were in the silver-standard set: genes SORT1, FADS1, LIPC, TNKS, and LPA for both, and gene FADS3 for the latter only. The validation rates by the silver-standard set for 2ScML and naive-2SLS were 5/55 = 9.1% and 6/73 = 8.2% respectively, much higher than 6/138 = 4.3% for MR-JTI. We also note the small sample size of the GTEx data used as the reference panel in the stage 2 analysis by JTI and MR-JTI. In addition, we applied MR-JTI to the previous simulations and found inflated Type-I errors, as shown in the Supplementary. Finally, as n0 increased, for 2ScML-C, both the number of significant genes and the number of validated ones increased, suggesting higher power with a larger reference sample (as shown in the simulations).

5. Conclusions and Discussion

We have proposed a Two-Stage Constrained Maximum Likelihood (2ScML) method as an extension to 2SLS to draw inference on causal effects in the presence of invalid instruments. Our modeling assumptions are less stringent than many existing methods, allowing correlated IVs, among which some may not be valid IVs with any or all of the three IV assumptions being violated. This is in contrast to the naive/standard 2SLS/TWAS, and many robust MR methods such as the popular MR-Egger regression. Theoretical and numerical results confirm that 2ScML has superior performance over the standard 2SLS/TWAS, and many new and robust MR methods, including MR-Egger regression. Perhaps most importantly, our method overcomes some practical limitations of many existing robust IV methods, including some recent and strong competitors such as TSHT [16], an adaptive Lasso-based method [45] and a confidence interval-combining method [46], which do not apply to two-sample GWAS summary data that are most widely available as for our motivating TWAS for LDL, though these methods may be extended in the future. Like some other methods based on model selection, including TSHT [16] and the adaptive Lasso-based one [45], 2ScML shares the same limitation of making inference after model selection and its valid inference depends on the selection consistency.

We have pointed out that using individual-level data and using summary data with a reference panel give different estimates of a parameter (e.g. the causal effect of an exposure on an outcome), and that there is a loss of information in the latter, especially with a small reference panel as often used in practice; this point has largely been unknown or neglected in the literature. Importantly, failing to account for such differences, as in almost all current genetic studies including TWAS, would often lead to inflated Type-I errors. As shown in Theorems 2 and 3, we have developed a corrected variance estimator for valid inference with GWAS summary data and a reference panel in stage 2. In Section 2.3.3 we have also offered a corrected variance estimator for GWAS itself, providing a better alternative to the dominant current practice, which fails to account for the finite-sample error of approximating GWAS summary data with a reference panel. Recently Wang et al. [44] proposed several methods using two-sample summary data that are robust to weak IVs. As indicated by Theorems 1 and 2 therein, their methods require either the exact sample correlation matrix or the true population correlation matrix of the IVs, neither of which is typically available in practice. In contrast, our proposed 2ScML uses an estimated correlation matrix from a reference panel and thus is more widely applicable. We have showcased the application of 2ScML to discover putative causal genes for LDL with large-scale GWAS summary data, leading to some encouraging results. More applications to other data, with comparisons with other methods, warrant future investigation.

An R package implementing 2ScML with some example data, code, and tutorial is publicly available at https://github.com/xue-hr/TScML.

Supplementary Material

Supplement

Acknowledgments

We thank the reviewers and the editors for many helpful and insightful comments and suggestions. The research was supported by NIH grants R01 AG065636, R01 AG069895, RF1 AG067924, U01 AG073079, R01 AG074858, R01 HL116720 and R01 GM126002, and by the Minnesota Supercomputing Institute at the University of Minnesota.

Footnotes

Conflict of Interest

The authors report there are no competing interests to declare.

Supplementary Materials

In a Supplementary File, we provide the proofs of the theorems and more numerical results.

References

  • [1].1000 Genomes Project Consortium. (2015). A global reference for human genetic variation. Nature, 526(7571), 68–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Barfield R, Feng H, Gusev A, Wu L, Zheng W, Pasaniuc B, Kraft P. (2018). Transcriptome-wide association studies accounting for colocalization using Egger regression. Genetic Epidemiology, 42(5), 418–433. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Bowden J, Davey Smith G, Burgess S. (2015). Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. Int J Epidemiol, 44(2), 512–525. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [4].Bowden J, Davey Smith G, Haycock PC, and Burgess S (2016). Consistent estimation in Mendelian randomization with some invalid instruments using a weighted median estimator. Genetic Epidemiology, 40, 304–314. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Burgess S, Butterworth AS, and Thompson SG (2013). Mendelian randomization analysis with multiple genetic variants using summarized data. Genetic Epidemiology, 37, 658–665. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Burgess S, Bowden J, Dudbridge F, and Thompson SG (2016). Robust instrumental variable methods using multiple candidate instruments with application to Mendelian randomization. arXiv 1606.03279. [Google Scholar]
  • [7].Burgess S, et al. (2017). Sensitivity analysis for robust causal inference from Mendelian randomization analysis with multiple genetic variants. Epidemiology, 28, 30–42. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Burgess S, Foley CN, Allara E, Staley JR, and Howson JM (2020). A robust and efficient method for Mendelian randomization with hundreds of genetic variants. Nature Communications, 11(1), 376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Cai M, Chen L, Liu J, Yang C (2019). Quantifying the impact of genetically regulated expression on complex traits and diseases. bioRxiv. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Davey Smith G, Ebrahim S. (2003). ‘Mendelian randomization’: can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology, 32, 1–22. [DOI] [PubMed] [Google Scholar]
  • [11].Davey Smith G, Ebrahim S. (2004). Mendelian randomization: prospects, potentials, and limitations. International Journal of Epidemiology, 33, 30–42. [DOI] [PubMed] [Google Scholar]
  • [12].Davey Smith G, Hemani G. (2014). Mendelian randomization: genetic anchors for causal inference in epidemiological studies. Human Molecular Genetics, 23, R89–R98. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Devlin B, & Roeder K (1999). Genomic control for association studies. Biometrics, 55(4), 997–1004. [DOI] [PubMed] [Google Scholar]
  • [14].Gamazon ER et al. (2015) A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics, 47, 1091–1098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Guo K, Hu L, Xi D, Zhao J, Liu J, Luo T, … & Guo Z (2018). PSRC1 overexpression attenuates atherosclerosis progression in apoE−/− mice by modulating cholesterol transportation and inflammation. Journal of Molecular and Cellular Cardiology, 116, 69–80. [DOI] [PubMed] [Google Scholar]
  • [16].Guo Z, Kang H, Tony Cai T, Small DS (2018). Confidence intervals for causal effects with invalid instruments by using two-stage hard thresholding with voting. Journal of the Royal Statistical Society: Series B, 80(4), 793–815. [Google Scholar]
  • [17].Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BW, …, Pasaniuc B (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature Genetics, 48(3), 245–252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Hartwig FP, Davey Smith G, and Bowden J (2017). Robust inference in summary data Mendelian randomization via the zero modal pleiotropy assumption. International Journal of Epidemiology, 46, 1985–1998. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [19].Hu Y, Li M, Lu Q, Weng H, Wang J, Zekavat SM, Yu Z, Li B, Gu J, Muchnik S et al. (2019). A statistical framework for cross-tissue transcriptome-wide association analysis. Nature Genetics, 51, 568–576. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Inoue A, & Solon G (2010). Two-sample instrumental variables estimators. The Review of Economics and Statistics, 92(3), 557–561. [Google Scholar]
  • [21].Kang H, Zhang A, Cai TT, Small DS (2016). Instrumental variables estimation with some invalid instruments and its application to Mendelian randomization. Journal of the American Statistical Association, 111(513), 132–144. [Google Scholar]
  • [22].Kanehisa M, Furumichi M, Sato Y, Ishiguro-Watanabe M, & Tanabe M (2021). KEGG: integrating viruses and cellular organisms. Nucleic Acids Research, 49(D1), D545–D551. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Klevmarken A (1982). Missing variables and two-stage least-squares estimation from more than one data set (No. 62). IUI Working Paper. [Google Scholar]
  • [24].Kollo T, & von Rosen D (1995). Approximating by the Wishart distribution. Annals of the Institute of Statistical Mathematics, 47(4), 767–783. [Google Scholar]
  • [25].Lin W, Feng R, Li H (2015). Regularization methods for high-dimensional instrumental variables regression with an application to genetical genomics. Journal of the American Statistical Association, 110(509), 270–288. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Mancuso N, Freund MK, Johnson R, Shi H, Kichaev G, Gusev A, and Pasaniuc B (2019). Probabilistic fine-mapping of transcriptome-wide association studies. Nat Genet, 51, 675–682. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Nuotio J, Oikonen M, Magnussen CG, Jokinen E, Laitinen T, Hutri-Kahonen N, …, Jula A (2014). Cardiovascular risk factors in 2011 and secular trends since 2007: the Cardiovascular Risk in Young Finns Study. Scandinavian Journal of Public Health, 42(7), 563–571. [DOI] [PubMed] [Google Scholar]
  • [28].Osborne MR, Presnell B, Turlach BA (2000). On the lasso and its dual. Journal of Computational and Graphical Statistics, 9(2), 319–337. [Google Scholar]
  • [29].Ouimet F (2022). A symmetric matrix-variate normal local approximation for the Wishart distribution and some applications. Journal of Multivariate Analysis, 189, 104923. [Google Scholar]
  • [30].Pacini D, & Windmeijer F (2016). Robust inference for the Two-Sample 2SLS estimator. Economics Letters, 146, 50–54. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [31].Pasaniuc B, Zaitlen N, Shi H, Bhatia G, Gusev A, Pickrell J, …, Price AL (2014). Fast and accurate imputation of summary statistics enhances evidence of functional enrichment. Bioinformatics, 30(20), 2906–2914. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Powell DR, Gay JP, Smith M, Wilganowski N, Harris A, Holland A, … & Desai U (2016). Fatty acid desaturase 1 knockout mice are lean with improved glycemic control and decreased development of atheromatous plaque. Diabetes, Metabolic Syndrome and Obesity: Targets and Therapy, 9, 185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Qi G, and Chatterjee N (2019). Mendelian randomization analysis using mixture models for robust and efficient estimation of causal effects. Nature Communications, 10, 1941. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Shen X, Pan W, and Zhu Y (2012). Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association. 107, 223–232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [35].Shen X, Pan W, Zhu Y, Zhou H (2013). On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5), 807–832. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [36].Slob EA, Burgess S (2020). A comparison of robust Mendelian randomization methods using summary data. Genetic Epidemiology, 44(4), 313–329. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Stock JH, Wright JH, & Yogo M (2002). A survey of weak instruments and weak identification in generalized method of moments. Journal of Business & Economic Statistics, 20(4), 518–529. [Google Scholar]
  • [38].Su YR, Di C, Bien S, Huang L, Dong X, Abecasis G, et al. (2018). A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomics. American Journal of Human Genetics, 102(5), 904–919. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [39].Sudlow C, Gallacher J, Allen N, Beral V, Burton P, Danesh J, … & Collins R (2015). UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3), e1001779. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [40].Tchetgen Tchetgen EJ, Sun B, and Walter S (2017). The GENIUS approach to robust Mendelian randomization inference. arXiv:1709.07779. [Google Scholar]
  • [41].Teslovich TM, Musunuru K, Smith AV, Edmondson AC, Stylianou IM, Koseki M, … & Johansen CT (2010). Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 466(7307), 707–713. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Verbanck M, Chen C-Y, Neale B, Do R (2018). Detection of widespread horizontal pleiotropy in causal relationships inferred from Mendelian randomization between complex traits and diseases. Nature Genetics, 50, 693–698. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [43].Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, …Kundaje A (2019). Opportunities and challenges for transcriptome-wide association studies. Nature Genetics, 51(4), 592–599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [44].Wang S, & Kang H (2021). Weak-instrument robust tests in two-sample summary-data Mendelian randomization. Biometrics. [DOI] [PubMed] [Google Scholar]
  • [45].Windmeijer F, Farbmacher H, Davies N, Davey Smith G (2019). On the use of the lasso for instrumental variables estimation with some invalid instruments. Journal of the American Statistical Association, 114(527), 1339–1350. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].Windmeijer F, Liang X, Hartwig FP, Bowden J (2019). The Confidence Interval Method for Selecting Valid Instrumental Variables. Discussion Paper 19/715, Department of Economics, University of Bristol. [Google Scholar]
  • [47].Wu C, Pan W (2020). A powerful fine-mapping method for transcriptome-wide association studies. Human Genetics, 139, 199–213. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [48].Xu Z, Wu C, Wei P, Pan W. (2017a). A Powerful Framework for Integrating eQTL and GWAS Summary Data. Genetics, 207, 893–902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [49].Xu Z, Wu C, Pan W; Alzheimer’s Disease Neuroimaging Initiative. (2017b). Imaging-wide association study: Integrating imaging endophenotypes in GWAS. Neuroimage, 159, 159–169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [50].Xue H, Shen X, & Pan W (2021). Constrained maximum likelihood-based Mendelian randomization robust to both correlated and uncorrelated pleiotropic effects. American Journal of Human Genetics, 108(7), 1251–1269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Zhao Q, Wang J, Hemani G, Bowden J, Small DS. (2020). Statistical inference in two-sample summary-data Mendelian randomization using a robust adjusted profile score. Annals of Statistics, 48, 1742–1769. [Google Scholar]
  • [52].Zhou D, Jiang Y, Zhong X, Cox NJ, Liu C, & Gamazon ER (2020). A unified framework for joint-tissue transcriptome-wide association and Mendelian randomization analysis. Nature Genetics, 52(11), 1239–1246. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [53].Zhu X, & Stephens M (2017). Bayesian large-scale multiple regression with summary statistics from genome-wide association studies. Annals of Applied Statistics, 11(3), 1561. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [54].Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. (2016). Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nature Genetics, 48(5), 481–7. [DOI] [PubMed] [Google Scholar]
  • [55].Zhu Z, Zheng Z, Zhang F, Wu Y, Trzaskowski M, Maier R, … & Yang J (2018). Causal associations between risk factors and common diseases inferred from GWAS summary data. Nature Communications, 9, 224. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplement