Genet Epidemiol. 2022 Jun 29;46(8):572–588. doi: 10.1002/gepi.22491

Statistical power of transcriptome‐wide association studies

Ruoyu He 1,2, Haoran Xue 2, Wei Pan 2; for the Alzheimer's Disease Neuroimaging Initiative
PMCID: PMC9669108  NIHMSID: NIHMS1817305  PMID: 35766062

Abstract

Transcriptome‐Wide Association Studies (TWASs) have become increasingly popular in identifying genes (or other endophenotypes or exposures) associated with complex traits. In TWAS, one first builds a predictive model for gene expressions using an expression quantitative trait loci (eQTL) data set in stage 1, then tests the association between the predicted gene expression and a trait based on a large, independent genome‐wide association study (GWAS) data set in stage 2. However, since the sample size of the eQTL data set is usually small and the coefficient of multiple determination (i.e., R2) of the model for many genes is also small, a question of interest is to what extent these factors affect the statistical power of TWAS. In addition, in contrast to a standard (univariate) TWAS (UV‐TWAS) considering only a single gene at a time, multivariate TWAS (MV‐TWAS) methods have recently emerged to account for the effects of multiple genes, or a gene's nonlinear effects, simultaneously. With the absence of the power analysis for these MV‐TWAS methods, it would be of interest to investigate whether one can gain or lose power by using the newly proposed MV‐TWAS instead of UV‐TWAS. In this paper, we first outline a general method for sample size/power calculations for two‐sample TWAS, then use real data—the Alzheimer's Disease Neuroimaging Initiative (ADNI) expression quantitative trait loci (eQTL) data and the Genotype‐Tissue Expression (GTEx) eQTL data for stage 1, the International Genomics of Alzheimer's Project Alzheimer's disease (AD) GWAS summary data and UK Biobank (UKB) individual‐level data for stage 2—to empirically address these questions. Our most important conclusions are the following. First, a sample size of a few thousands (~8000) would suffice in stage 1, where the power of TWAS would be more determined by cis‐heritability of gene expression. 
Second, as in the general case of simple regression versus multiple regression, the power of MV‐TWAS may be higher or lower than that of UV‐TWAS, depending on the specific relationships among the GWAS trait and multiple genes (or linear and nonlinear terms of the same gene's expression levels), such as their correlations and effect sizes. Interestingly, several top genes with large power gains in MV‐TWAS (over that in UV‐TWAS) were known to be (and in our data more significantly) associated with AD. We also reached similar conclusions in an application to the GTEx whole blood gene expression data and UKB GWAS data of high‐density lipoprotein cholesterol. The proposed method and the conclusions are expected to be useful in planning and designing future TWAS and other related studies (e.g., Proteome‐ or Metabolome‐Wide Association Studies) when determining the sample sizes for the two stages.

Keywords: 2SLS, Alzheimer's disease, causal inference, sample size, TWAS

1. INTRODUCTION

Transcriptome‐Wide Association Studies (TWASs) have been increasingly applied to identify genes associated with complex traits (Gusev et al., 2016). The statistical principle underlying TWAS is (two‐sample) two‐stage least squares (2SLS; Gamazon et al., 2015; Gusev et al., 2016). In stage 1, for a candidate gene, a regression model is trained, often on a small eQTL data set, to predict its expression level from its cis single‐nucleotide polymorphisms (cis‐SNPs). In stage 2, the pretrained model is used to predict the expression of the candidate gene from the SNPs in a much larger genome‐wide association study (GWAS) data set that is independent of the eQTL data set in stage 1; the predicted gene expression is then tested for association with the GWAS trait to determine whether the gene and the trait are associated. If an association is established, then under the framework of instrumental variable (IV) regression and its valid IV assumptions, the gene can be interpreted as (putatively) causal, similar to Mendelian randomization (MR; Angrist et al., 1996; Xue & Pan, 2020). For stage 1, various methods, including stepwise variable selection coupled with ordinary least squares (OLS), lasso and elastic net penalized regression, and some Bayesian methods, have been applied to select cis‐SNPs as IVs while building a predictive model for gene expression (Gamazon et al., 2015; Gusev et al., 2016; Xue & Pan, 2020). In contrast, stage 2 typically uses a simple regression or a univariate association test. Notably, GWAS summary data, rather than individual‐level GWAS data, suffice in stage 2, which greatly widens the applicability of TWAS (Gusev et al., 2016).

Despite the success of TWAS in discovering important gene‐trait associations (and putative causal genes for traits; Gamazon et al., 2015; Gusev et al., 2016), some important questions remain open. First, although the GWAS data set used in stage 2 usually contains tens to hundreds of thousands of individuals, the eQTL data set is often quite small, with a sample size of only a few hundred to a few thousand. To what extent does the difference in sample sizes affect the power of TWAS? We therefore study the effects of the stage 1 and stage 2 sample sizes on the power of TWAS. Second, the predictive power of a pretrained model in stage 1, as measured by the coefficient of multiple determination, R2, is usually very low, with R2 ≤ 0.05. Part of the reason is that the predictive power is upper bounded by the gene's expression heritability (Gamazon et al., 2015). Assuming that a lower R2 is partly due to the small sample size (and thus large estimation errors), if we were able to increase R2, how much would that boost the power of TWAS? Lastly, in contrast to a standard (univariate) TWAS (UV‐TWAS) considering only a single gene at a time, multivariate TWAS (MV‐TWAS) has recently emerged to account for the effects of multiple genes, or of a gene's linear and nonlinear effects (Knutson et al., 2020; Lin et al., 2022). We are not aware of any studies on the power of MV‐TWAS. In particular, it would be of interest to investigate whether one can gain or lose power by using MV‐TWAS as compared with UV‐TWAS: while accounting for multiple related genes' (predicted) expression levels may boost power, the expected correlations among them may reduce the precision of the estimates and thus the statistical power.

We will use real data, mainly the Alzheimer's Disease Neuroimaging Initiative (ADNI) gene expression data (Shen et al., 2013) and the International Genomics of Alzheimer's Project (IGAP) Alzheimer's disease (AD) GWAS summary data (Li et al., 2002), to empirically answer these questions. More specifically, we will use the fitted gene expression imputation models with the ADNI eQTL data and the fitted TWAS models with the IGAP GWAS data as the true models (with the true parameter values) in power/sample size calculations, which are more realistic than some arbitrarily chosen simulation models or parameter values to mimic unknown truths, leading to more meaningful conclusions to guide the study design for future TWAS. In addition, we will also apply the methods to the GTEx gene expression data and UK Biobank (UKB) GWAS data (GTEx Consortium, 2020; Sudlow et al., 2015). We will propose a general approach to sample size/power calculations for both univariate and multivariate TWAS, which can be applied to plan future TWAS and other related studies (such as Proteome‐ or Metabolome‐Wide Association Studies). In particular, since both TWAS and MR are special cases of IV regression (Burgess et al., 2015; Xue & Pan, 2020), one may wonder whether we could simply follow an approach used for MR to calculate the power for TWAS. The short answer is no. Most power calculation methods for MR either account for only a single IV or focus on one‐sample problems (Brion et al., 2012; Burgess, 2014; Freeman et al., 2013; Pierce et al., 2010). L. Deng et al. (2020) proposed a general procedure allowing for both two‐sample setups and multiple (independent) IVs, but they assumed a single exposure. In contrast, we aim to inspect multivariate TWAS models with multiple (possibly correlated) exposures and multiple correlated IVs/SNPs, comparing their power to that of the standard/univariate TWAS. 
We note that our proposed sample size/power calculation method is general for two‐sample 2SLS, applicable not only to (two‐sample) TWAS (and related Proteome‐ or Metabolome‐Wide Association Studies), but also to (two‐sample) multivariable MR with possibly correlated SNPs/IVs (Burgess et al., 2015; Burgess & Thompson, 2015; Porcu et al., 2019).

The paper is organized as follows: in Section 2, we first introduce the general TWAS model along with necessary notations, then three specific TWAS models, whose power will be studied. Next, we propose a general sample size/power calculation procedure that can be applied to TWAS analyses, followed by the real data sets to be used. In Section 3, we apply the proposed method to the real data sets, addressing the three questions mentioned above. In Section 4, we summarize the main results, discuss the importance of the findings, and point out some potential limitations of the current study.

2. MATERIALS AND METHODS

2.1. A general TWAS model

Let n1 and n2 be the sample sizes of the eQTL and GWAS data in stages 1 and 2, respectively. We assume that there are no overlaps between the two data sets; that is, the two samples are independent as required by the two‐sample 2SLS. For a given gene of interest, let Z1 be an n1×p matrix coding for its cis‐SNPs as IVs in the eQTL data set, Z2 be an n2×p matrix for the IVs in the GWAS data set, x0 be an n1×1 vector of the observed expression levels of the gene in the eQTL data set, and Y be an n2×1 vector of the GWAS trait; p is the number of IVs/SNPs used to predict the gene's expression. Following these notations and the standard valid IV assumptions, a general TWAS model is

$$X=Z_1\beta+\nu_1, \tag{1}$$
$$X_2=Z_2\beta+\nu_2, \tag{2}$$
$$Y=X_2\theta+u=Z_2\beta\theta+\nu_2\theta+u=Z_2\beta\theta+\omega, \tag{3}$$

where $X=(x_0,x_1,\ldots,x_{q-1})$ with $x_1,\ldots,x_{q-1}$ being either higher‐order terms of $x_0$ or expression levels of other genes related to the gene of interest in the eQTL data; $X_2$ would be the gene's expression levels in the GWAS data, though not actually observed, and $\hat{X}_2=Z_2\hat{\beta}$ is the imputed gene expression levels; $\theta=(\theta_0,\ldots,\theta_{q-1})^T$ is a $q\times 1$ vector of the (unknown) causal effects of interest, and $\beta=(\beta_0,\ldots,\beta_{q-1})$ is a $p\times q$ matrix for the (unknown) regression coefficients; $\nu_1$ and $\nu_2$ are independent error terms as $n_1\times q$ and $n_2\times q$ matrices, each row of which is independently and normally distributed with mean 0 and covariance matrix $\Sigma_\nu$. Independent of $\nu_1$, the error term $u$ is an $n_2\times 1$ vector following a normal distribution with mean 0 and covariance matrix $\sigma_u^2 I$, and the error term $\omega$ is an $n_2\times 1$ vector following a normal distribution with mean 0 and covariance matrix $\sigma_\omega^2 I$.

For simplicity of notation, we assume that X and Y are already suitably adjusted for covariates, and variables Z1, Z2, X, and Y are standardized to have their sample means all equal to 0 and sample variances to 1. Furthermore, here we assume that the GWAS individual‐level data are available for simplicity and clarity of presentation, but in practice only GWAS summary data are required in stage 2 as to be shown in our numerical examples.

This general TWAS model accommodates more general cases than the standard TWAS, where genes are imputed and tested only one by one (i.e., with $q=1$). For example, in contrast to the standard TWAS of imputing and testing only a linear term $x_0$, one can impute $x_0$ and $x_0^2$ in stage 1 and test both terms in stage 2 to account for possibly nonlinear effects of gene expression; doing so can gain power and identify additional genes (Lin et al., 2022). The general TWAS model reduces to UV‐TWAS when $q=1$ and represents MV‐TWAS when $q>1$.

Applying (two‐sample) 2SLS (i.e., OLS in each stage), we obtain the estimators for the unknown parameters in the general TWAS model as

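$$\hat{\beta}=(Z_1^TZ_1)^{-1}Z_1^TX,\quad \hat{X}_2=Z_2\hat{\beta},\quad \hat{\theta}=(\hat{X}_2^T\hat{X}_2)^{-1}\hat{X}_2^TY,\quad \hat{\Sigma}_\nu=\frac{(X-Z_1\hat{\beta})^T(X-Z_1\hat{\beta})}{n_1-1},\quad \hat{\sigma}_\omega^2=\frac{\|Y-\hat{X}_2\hat{\theta}\|_2^2}{n_2}.$$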

Note that our proposed methods require the use of OLS estimator (OLSE), or more generally, maximum likelihood estimator (MLE), in each stage as used above, because the (corrected) covariance matrix formula for θˆ to be shown later is based on the OLSE (or MLE).
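The estimators above can be sketched for the simplest case of one SNP and one gene ($p=q=1$), where every matrix product reduces to a sum. This is only an illustration under that assumption; the function name `two_sample_2sls` and the toy inputs are hypothetical and not part of the original analysis, and the inputs are assumed to be centred already.

```python
from typing import Dict, Sequence


def two_sample_2sls(z1: Sequence[float], x: Sequence[float],
                    z2: Sequence[float], y: Sequence[float]) -> Dict[str, float]:
    """Two-sample 2SLS with OLS in each stage, for p = 1 SNP and q = 1 gene."""
    n1, n2 = len(z1), len(z2)
    # Stage 1: OLS slope of expression on the SNP in the eQTL sample.
    beta_hat = sum(z * xi for z, xi in zip(z1, x)) / sum(z * z for z in z1)
    # Impute expression in the GWAS sample with the stage 1 model.
    x2_hat = [z * beta_hat for z in z2]
    # Stage 2: OLS slope of the trait on the imputed expression.
    theta_hat = (sum(xh * yi for xh, yi in zip(x2_hat, y))
                 / sum(xh * xh for xh in x2_hat))
    # Residual variances that enter the corrected variance formula.
    sigma_nu_hat = sum((xi - z * beta_hat) ** 2 for z, xi in zip(z1, x)) / (n1 - 1)
    sigma_omega2_hat = sum((yi - xh * theta_hat) ** 2
                           for xh, yi in zip(x2_hat, y)) / n2
    return {"beta": beta_hat, "theta": theta_hat,
            "sigma_nu": sigma_nu_hat, "sigma_omega2": sigma_omega2_hat}


# Toy, hand-checkable inputs (hypothetical values).
fit = two_sample_2sls(z1=[-1.0, 0.0, 1.0], x=[-1.0, 0.0, 3.0],
                      z2=[-1.0, 1.0], y=[-3.0, 5.0])
```

With several SNPs or several exposures the same steps apply, with the sums replaced by the matrix expressions given in the text.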

2.2. Some specific TWAS methods

2.2.1. UV‐TWAS

We first consider the standard (univariate) TWAS, testing the linear effects of the genes one by one with X=x0, denoted as UV‐L. For this model, we focused on studying the impact of the sample size and the stage 1 pretrained model's predictive capability as measured by the coefficient of multiple determination, R2, on the power of TWAS. We increased the sample size of stage 1 by a factor of 1, 2, 5, and 10, the sample size of stage 2 by a factor of 1, 1.2, 1.5, and 2, and the R2 of the fitted imputation model by 0.1, 0.2, and 0.3 (but capped the final increased R2 at 0.95) to examine their influences on the power. Note that we increase the R2 by a fixed amount, not relative to the original R2, to see the direct effects of R2. The calculation of R2 can be found in Section 2.2.4.

As noted in Lin et al. (2022), the quadratic effect of gene expression can be regarded as the influence of a gene's expression variability on the trait. We therefore also consider another UV‐TWAS model, denoted as UV‐Q, in which we test the quadratic term of gene expression on a trait with $X=x_0^2$ (i.e., with only a quadratic term and no linear term). We will compare the power of this model to that of MV‐TWAS to be introduced later.

2.2.2. TWAS‐LQ

As shown in Lin et al. (2022), testing for quadratic effects of gene expression may unveil additional genes whose expression levels are nonlinearly associated with a trait. In this extended MV‐TWAS model, denoted TWAS‐LQ, we consider both linear and quadratic terms of gene expression. In other words, we have $X=(x_0,x_0^2)$ and test the corresponding $\theta$ in the model as described in (1)–(7). For power calculations, we set $\tilde{\theta}=\hat{\theta}$ as estimated from the IGAP AD GWAS summary data, and used the corrected covariance matrix formula. Depending on which components of $\theta$ are tested, we have three approaches, called TWAS‐L, TWAS‐Q, and TWAS‐LQ, which test only the linear component, only the quadratic component, and both components, respectively.

2.2.3. Other MV‐TWAS

One possible downside of MV‐TWAS with multiple genes is that their correlated gene expression levels may lead to the loss of power as compared with UV‐TWAS. Here we consider a possibly more extreme situation where we expect a greater loss of power: because many physically neighboring genes have correlated expression levels (and their expression levels are often more highly correlated if they are closer to each other), we include every two neighboring genes in the MV‐TWAS model. That is, if x0 and x1 are the expression levels of two neighboring genes, we have X=(x0,x1) in the MV‐TWAS model. This setup is both more challenging and useful for TWAS because we often would like to identify which gene (among several with correlated expression) in a GWAS trait‐associated locus is indeed causal. Similar to TWAS‐LQ, in MV‐TWAS we consider the power for three tests, called MV‐Target, MV‐Alternative, and MV‐Joint, to test the (linear) effects of the first gene, of its neighbor, and of both the genes, respectively.

2.2.4. Coefficient of multiple determination

For each UV‐TWAS model, the coefficient of multiple determination R2 was calculated as

$$R^2=1-\frac{\mathrm{RSS}}{\mathrm{TSS}}=1-\frac{\mathrm{RSS}}{n_1-1}=1-\hat{\Sigma}_\nu,$$

where RSS and TSS were the residual sum of squares and the total sum of squares for the model; the second equality in the above equation follows from the fact that each $X$ was standardized to have a sample variance of 1; that is, $\mathrm{Var}(X)=\mathrm{TSS}/(n_1-1)=1$. To assess the effects of different $R^2$ on the power of TWAS, we simply changed the value of $\hat{\Sigma}_\nu$: for example, to increase $R^2$ by 0.1, we set the new $\hat{\Sigma}_\nu$ to $\max(0.05,\hat{\Sigma}_\nu-0.1)$ and then used the new $\hat{\Sigma}_\nu$ in the power calculation. Note that this capped $R^2$ at 0.95.
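The adjustment just described amounts to a one-line rule. The sketch below illustrates it for a univariate model; the helper name `increase_r2` is hypothetical, and the inputs are made-up values rather than estimates from the data.

```python
from typing import Tuple


def increase_r2(sigma_nu_hat: float, delta: float,
                floor: float = 0.05) -> Tuple[float, float]:
    """Raise R^2 by `delta` by lowering the residual variance, capping R^2 at 1 - floor.

    Assumes X is standardized, so that R^2 = 1 - sigma_nu_hat.
    Returns the new residual variance and the implied new R^2.
    """
    new_sigma_nu = max(floor, sigma_nu_hat - delta)
    return new_sigma_nu, 1.0 - new_sigma_nu


# A gene with baseline R^2 = 0.03 and a fixed increase of 0.1.
new_sigma, new_r2 = increase_r2(sigma_nu_hat=0.97, delta=0.1)
# A gene whose R^2 is already high hits the cap instead.
capped_sigma, capped_r2 = increase_r2(sigma_nu_hat=0.10, delta=0.3)
```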

The R2 can be interpreted as the (estimated) cis‐heritability of the gene's expression levels based on the gene's cis‐SNPs, which is expected to be lower than the gene's heritability based on the genome‐wide SNPs (i.e., both cis‐ and trans‐SNPs).

2.3. Power calculations

The standard TWAS estimate of the variance of θˆ ignores the estimation errors with Xˆ, or equivalently, with βˆ, in stage 1, thereby underestimating the true variance. We adopt the formula in Inoue and Solon (2010) for a corrected covariance matrix estimate of θˆ as

$$\widehat{\mathrm{Var}}_c(\hat{\theta})=\widehat{\mathrm{Var}}_S(\hat{\theta})\left(1+\frac{n_2}{n_1}\,\frac{\hat{\theta}^T\hat{\Sigma}_\nu\hat{\theta}}{\hat{\sigma}_\omega^2}\right), \tag{4}$$

where $\widehat{\mathrm{Var}}_S(\hat{\theta})=(\hat{X}_2^T\hat{X}_2)^{-1}\hat{\sigma}_\omega^2$ is the standard/naive variance estimate, which ignores the estimation variability in stage 1 and thus would be obtained under the assumption of $n_1=\infty$. The formula clearly illustrates that if the sample size in stage 1 is much smaller than that in stage 2, then we will have an inflated variance and thus reduced power in stage 2. With the corrected estimate of the covariance matrix, we now outline the hypothesis testing procedure as follows. Let $q$ be the number of exposures (columns of $X$), $r$ be the number of exposures we want to test, and $A=\{A_1,\ldots,A_r\}$ be a given subset of $\{1,2,\ldots,q\}$ with each $A_i\in\{1,\ldots,q\}$ and $1\le r\le q$. For each gene, we aim to perform the following test in stage 2 with

$$H_0:\theta_A=0 \quad\text{versus}\quad H_1:\theta_A=\tilde{\theta}_A\neq 0, \tag{5}$$

where $\theta_A=(\theta_{A_1},\ldots,\theta_{A_r})^T$ is a subvector of $\theta$ to be tested, and here we let $\tilde{\theta}_A=\hat{\theta}_A$, that is, we use the estimated effects as the real effects under the alternative. Then $\widehat{\mathrm{Var}}_c(\hat{\theta}_A)$, the estimated covariance matrix of $\hat{\theta}_A$, is just the corresponding submatrix of $\widehat{\mathrm{Var}}_c(\hat{\theta})$. The TWAS test statistic is

$$T=\hat{\theta}_A^T\left(\widehat{\mathrm{Var}}_c(\hat{\theta}_A)\right)^{-1}\hat{\theta}_A, \tag{6}$$

which has a central $\chi_r^2$ distribution under the null $H_0$; however, under the alternative $H_1$, $T$ follows a noncentral $\chi^2$ distribution with $r$ degrees of freedom and noncentrality parameter $\tilde{\theta}_A^T\left(\widehat{\mathrm{Var}}_c(\hat{\theta}_A)\right)^{-1}\tilde{\theta}_A$. The power is given by

$$P\left(T\ge\chi^2_{r,1-\alpha/m}\mid H_1\right), \tag{7}$$

where $\alpha$ is the nominal significance level, $m$ is the number of genes to be tested, and $\chi^2_{r,1-\alpha/m}$ is the $(1-\alpha/m)$th quantile of a central $\chi^2$ distribution with $r$ degrees of freedom. Here we follow the popular practice of using the Bonferroni correction to adjust for multiple testing in TWAS.
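For readers who want to reproduce the power formula numerically, the sketch below evaluates it with the Python standard library only, for the two cases used in this paper: $r=1$ (single-component tests) and $r=2$ (joint tests such as TWAS‐LQ or MV‐Joint). It is not the authors' code. The function names are hypothetical. For $r=1$ it uses the fact that a noncentral $\chi^2_1$ variable is the square of a shifted standard normal; for $r=2$ it uses the standard Poisson-mixture representation of the noncentral $\chi^2$ distribution, together with the closed-form tail of a central $\chi^2$ with even degrees of freedom.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def power_one_df(ncp: float, alpha: float = 0.05, m: int = 1) -> float:
    """Power for r = 1: T is the square of a N(sqrt(ncp), 1) variable."""
    level = alpha / m                                # Bonferroni-adjusted level
    crit_z = _STD_NORMAL.inv_cdf(1.0 - level / 2.0)  # sqrt of the chi-square quantile
    shift = math.sqrt(ncp)
    return _STD_NORMAL.cdf(shift - crit_z) + _STD_NORMAL.cdf(-shift - crit_z)


def power_two_df(ncp: float, alpha: float = 0.05, m: int = 1) -> float:
    """Power for r = 2 via the Poisson mixture of central chi-squares."""
    level = alpha / m
    if ncp <= 0.0:
        return level                 # under the null the power equals the level
    half_crit = -math.log(level)     # with 2 df the (1 - level) quantile is -2*ln(level)
    half_ncp = ncp / 2.0
    # Enough terms to cover the Poisson(ncp / 2) mass (a generous heuristic bound).
    max_terms = int(half_ncp + 12.0 * math.sqrt(half_ncp) + 60.0)
    term = math.exp(-half_crit)      # exp(-c/2) * (c/2)^j / j! at j = 0
    central_sf = term                # tail of a central chi-square with 2(i + 1) df at i = 0
    power = 0.0
    for i in range(max_terms):
        # Poisson weight on the log scale to avoid underflow for large ncp.
        weight = math.exp(-half_ncp + i * math.log(half_ncp) - math.lgamma(i + 1))
        power += weight * central_sf
        term *= half_crit / (i + 1)
        central_sf += term
    return min(power, 1.0)
```

The noncentrality parameter `ncp` is the quadratic form in the text evaluated at the assumed true effects; `m` is the number of genes tested, so `alpha / m` is the Bonferroni-adjusted level. For the same noncentrality parameter, the 2-df test has lower power than the 1-df test, which is one reason a joint test helps only when the added component carries enough signal.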

2.4. Data

2.4.1. The ADNI eQTL data

Data used in the preparation of this article were obtained from the ADNI database (http://adni.loni.usc.edu). The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration, private pharmaceutical companies, and nonprofit organizations, as a $60‐million, 5‐year public–private partnership. The primary goal of ADNI has been to test whether serial magnetic resonance imaging, positron emission tomography, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD. Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials. The Principal Investigator of this initiative is Michael W. Weiner, MD, VA Medical Center and University of California‐San Francisco. ADNI is the result of efforts of many coinvestigators from a broad range of academic institutions and private corporations, and subjects have been recruited from over 50 sites across the United States and Canada. The initial goal of ADNI was to recruit 800 subjects but ADNI has been followed by ADNI‐GO and ADNI‐2. To date these three protocols have recruited over 1500 adults, ages 55–90, to participate in the research, consisting of cognitively normal older individuals, people with early or late MCI, and people with early AD. The follow‐up duration of each group is specified in the protocols for ADNI‐1, ADNI‐2, and ADNI‐GO. Subjects originally recruited for ADNI‐1 and ADNI‐GO had the option to be followed in ADNI‐2. For up‐to‐date information, see www.adni-info.org.

We used the ADNI whole‐genome sequencing (WGS) and gene expression data (Shen et al., 2013) as the eQTL data in stage 1. After removing 37 individuals with any missing data, we had n1=711 individuals and 17,256 genes. Following previous ADNI studies (Lin et al., 2022), we adjusted for five covariates, namely, age, gender, years of education, handedness (left or right handed), and intracranial volume. We regressed the gene expression (X) on these covariates and then used the standardized residuals as the new X to remove any potential effects they may have on AD, as shown by previous studies (Doody et al., 1999; van Loenhoud et al., 2022). For each gene, we extracted the SNPs from its cis‐region by expanding 100 kb upstream and downstream of its coding region. SNPs with minor allele frequency ≤ 0.05 or any missing values were also excluded. We further pruned the SNPs so that the absolute value of the pairwise Pearson correlation between any two SNPs was less than 0.8. If the number of SNPs was still larger than 50 after pruning, we would keep the top 50 SNPs with the highest correlations (in their absolute values) with the gene's expression levels X. If q>1 (as in the MV‐TWAS models), for each column i of X, we would take the top 50 SNPs with the highest correlations (in absolute values) with the column as $I_i$ and use the union $I=\bigcup_i I_i$ as the SNPs for stage 1. Next, the data were normalized to have mean 0 and variance 1, and we used backward selection with AIC as the criterion to select SNPs. The final model would be a linear regression model given by the backward selection procedure. For each gene, the F test was performed to assess the fit of the final model; if the p value was ≥ 0.001, the gene was discarded (i.e., not used in the subsequent analysis). 
The choice of the tuning parameters (such as 0.8, 50, and 0.001 above) was somewhat arbitrary and largely followed a previous study (Xue & Pan, 2020) with some adjustments and the following rationale: highly correlated SNPs are not expected to help much to predict gene expression while causing the multicollinearity problem, thus we pruned out highly correlated SNPs; given the relatively small sample size, to avoid large estimation errors (due to large variability), we decided to use a relatively simple linear model for effective prediction; and to satisfy the IV relevance assumption in 2SLS, we only chose cis‐SNPs that were likely to be associated with a gene's expression level as IVs.
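The pruning and top‐50 steps described above can be sketched as follows. This is only an illustration: the text does not state the order in which SNPs are visited during pruning, so the greedy order used here (strongest correlation with expression first) is an assumption, and the function names `pearson` and `select_snps` are hypothetical. The minor allele frequency filter, backward selection, and the F test are not shown.

```python
import math
from typing import List, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def select_snps(snps: Sequence[Sequence[float]], expr: Sequence[float],
                max_abs_r: float = 0.8, top_k: int = 50) -> List[int]:
    """Return indices of SNPs kept after correlation pruning and the top-k cut.

    `snps[j]` holds the genotype vector of SNP j across individuals.
    """
    # Assumed visiting order: decreasing |correlation| with expression.
    order = sorted(range(len(snps)), key=lambda j: -abs(pearson(snps[j], expr)))
    kept: List[int] = []
    for j in order:
        # Keep SNP j only if it is not too correlated with any SNP already kept.
        if all(abs(pearson(snps[j], snps[k])) < max_abs_r for k in kept):
            kept.append(j)
    # `kept` is already sorted by |correlation| with expression.
    return kept[:top_k]


# Toy data (hypothetical): SNP 1 duplicates SNP 0, SNP 2 is uncorrelated with SNP 0.
toy_snps = [[0, 1, 2, 1, 0, 2], [0, 1, 2, 1, 0, 2], [1, 0, 0, 2, 1, 2]]
toy_expr = [0.1, 1.0, 2.1, 0.9, 0.0, 1.9]
selected = select_snps(toy_snps, toy_expr)
```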

2.4.2. The IGAP AD GWAS summary data

For the GWAS data in stage 2, we used the summary data set released by the IGAP in 2013, which contained 54,162 individuals (Li et al., 2002). The IGAP AD GWAS summary data set was used to estimate the parameters in stage 2, namely, $\hat{\theta}$ and $\hat{\sigma}_\omega^2$. To compute $\hat{\theta}$, notice that

$$\hat{\theta}=(\hat{X}_2^T\hat{X}_2)^{-1}\hat{X}_2^TY=(\hat{\beta}^TZ_2^TZ_2\hat{\beta})^{-1}\hat{\beta}^TZ_2^TY.$$

Since we did not have individual‐level data for stage 2, we used $Z_1$ to estimate $Z_2^TZ_2$ and the summary statistics to estimate $Z_2^TY$. For $\hat{\sigma}_\omega^2$, we have

$$\hat{\sigma}_\omega^2=\frac{\|Y-\hat{X}_2\hat{\theta}\|_2^2}{n_2}=1-\frac{2Y^TZ_2\hat{\beta}\hat{\theta}}{n_2}+\frac{\|Z_2\hat{\beta}\hat{\theta}\|_2^2}{n_2}.$$

We substituted $Z_1$ for $Z_2$ in the last term, and used the summary statistics to estimate $Y^TZ_2$ in the second term. To calculate the power, we assume that under the alternative hypothesis, the true $\theta$ is equal to the estimated one, that is, $\theta=\hat{\theta}$.
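For completeness, the expansion behind the residual variance formula can be written out as below. The leading 1 relies on $Y$ being standardized (Section 2.1), so that $Y^TY/n_2\approx 1$. The final approximation, which replaces $Z_2^TZ_2/n_2$ by $Z_1^TZ_1/n_1$, is one natural reading of "substituting $Z_1$ for $Z_2$" and is an assumption here rather than a detail stated in the text.

```latex
\begin{aligned}
\frac{\|Y-Z_2\hat{\beta}\hat{\theta}\|_2^2}{n_2}
 &= \frac{Y^TY}{n_2}
    -\frac{2\,Y^TZ_2\hat{\beta}\hat{\theta}}{n_2}
    +\frac{\hat{\theta}^T\hat{\beta}^TZ_2^TZ_2\hat{\beta}\hat{\theta}}{n_2} \\
 &\approx 1
    -\frac{2\,Y^TZ_2\hat{\beta}\hat{\theta}}{n_2}
    +\hat{\theta}^T\hat{\beta}^T\,\frac{Z_1^TZ_1}{n_1}\,\hat{\beta}\hat{\theta}.
\end{aligned}
```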

2.4.3. The GTEx and UKB data

To check our conclusions further, we used the Genotype‐Tissue Expression (GTEx) data set and the UKB data (GTEx Consortium, 2020; Sudlow et al., 2015). For our purposes, we used the GTEx v8 whole blood data with 19,696 genes and n1=670 in stage 1. We regressed the gene expression on some covariates (including the first five genotype principal components, WGS sequencing platform, WGS library construction protocol, donor sex, and PEER factors) provided in the data set, and used the standardized residuals as the gene expression (i.e., x0) as in a previous TWAS (Lin et al., 2022). The rest of the data preprocessing and quality control procedures are the same as for the ADNI data.

For stage 2, we used the individual‐level UKB data with the high‐density lipoprotein cholesterol as the trait of interest. Only the data from the individuals of white British ancestry were used. Any individual who might be a close relative of another (kinship >0) or had any missing values for the SNPs selected in stage 1 was also removed. The sample sizes for most of the genes were about 200,000—the exact sample size for a gene depended on the number of individuals removed due to missing data—and the total number of SNPs was around 800,000.

For the GTEx data, we obtained the estimated expression heritabilities and their 95% confidence intervals (CIs) from Wheeler et al. (2016). We then used the estimated expression heritabilities as the upper bounds of the stage 1 R2 to demonstrate the effects of increasing R2 on the power of TWAS under more realistic conditions. Genes with estimated heritability smaller than the R2 of the stage 1 fitted model were removed. In addition, if the lower bound of the 95% CI of the estimated heritability was 0, then the R2 of the fitted model was used as the lower bound because a gene with heritability 0 cannot be genetically associated with any trait.

3. RESULTS

3.1. Effect of the sample size on TWAS power

In Figure 1, we show how power varies with increasing sample size in stage 1. Except for a few genes for which the power does not increase at all, increasing the sample size generally leads to higher power. When the stage 1 sample size n1 is about 8000, that is, a little over eleven times the original size, the power for most genes nearly reaches its maximum, as if n1 = ∞. Overall, for most genes, the power gains from increasing n1 appear to diminish; that is, the power increases only slightly. This result suggests that the current practice of using eQTL data with sample sizes ranging from a few hundred to tens of thousands (Gamazon et al., 2015; Gusev et al., 2016) is not expected to be a severe limiting factor of the power of TWAS.
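The diminishing returns in n1 can be read directly off the inflation factor in formula (4). The sketch below evaluates that factor for a univariate model ($q=1$), where the quadratic form reduces to a product of scalars. The function names and the effect size and variances passed in are hypothetical; only n1 = 711 and n2 = 54,162 come from the data description.

```python
import math


def inflation_factor(n1: float, n2: float, theta: float,
                     sigma_nu: float, sigma_omega2: float) -> float:
    """Inflation factor of formula (4) for q = 1: 1 + (n2/n1) * theta^2 * sigma_nu / sigma_omega2."""
    return 1.0 + (n2 / n1) * theta * theta * sigma_nu / sigma_omega2


def corrected_variance(naive_var: float, n1: float, n2: float, theta: float,
                       sigma_nu: float, sigma_omega2: float) -> float:
    """Corrected variance: the naive variance times the inflation factor."""
    return naive_var * inflation_factor(n1, n2, theta, sigma_nu, sigma_omega2)


# Hypothetical gene: theta = 0.1, residual variance 0.9 in stage 1, 1.0 in stage 2.
factors = {
    scale: inflation_factor(n1=711 * scale, n2=54162, theta=0.1,
                            sigma_nu=0.9, sigma_omega2=1.0)
    for scale in (1, 2, 5, 10, math.inf)
}
```

Because the excess over 1 is proportional to 1/n1, each further multiplication of n1 removes a smaller absolute amount of variance, and the factor equals 1 in the limit of an infinitely large stage 1 sample.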

Figure 1. Power of UV‐L with different sample sizes n1 for the ADNI/IGAP data: (a) for all genes with stage 2 p ≤ 0.01 (in stage 2 for their association with AD); (b) for 15 top genes (marked by the vertical line in panel (a)) with the most significant p ≤ 5.6e−6, which is the significance level after the Bonferroni correction. The genes are ordered by the power for “Inf” (i.e., n1 = ∞); the lines correspond to the factors by which we increase n1. AD, Alzheimer's disease; ADNI, Alzheimer's Disease Neuroimaging Initiative; IGAP, International Genomics of Alzheimer's Project; UV‐L, univariate linear.

For comparison, in Figure 2, we show how power varies with the increasing sample size in stage 2. Note that the naive estimator, $\widehat{\mathrm{Var}}_S(\hat{\theta})=(\hat{X}_2^T\hat{X}_2)^{-1}\hat{\sigma}_\omega^2$, is inversely proportional to $n_2$ since as $n_2\to\infty$, $(\hat{X}_2^T\hat{X}_2)^{-1}\to 0$, while $\hat{\sigma}_\omega^2\to\sigma_\omega^2$. As a result, increasing the stage 2 sample size will lower the naive variance estimate; at the same time, although the inflation factor, $1+\frac{n_2}{n_1}\frac{\hat{\theta}^T\hat{\Sigma}_\nu\hat{\theta}}{\hat{\sigma}_\omega^2}$, will increase with $n_2$, the final variance $\widehat{\mathrm{Var}}_c(\hat{\theta})$ of the estimated causal effect $\hat{\theta}$ will decrease since $\hat{\theta}^T\hat{\Sigma}_\nu\hat{\theta}/\hat{\sigma}_\omega^2$ is usually very small ($\le 0.03$). It is clear that unless the original power is already close to 1, there will be notable power gains for almost every gene by increasing $n_2$, unlike the situation for increasing $n_1$.
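The trade-off between the shrinking naive variance and the growing inflation factor can be made explicit for a univariate model ($q=1$). The short derivation below is a sketch under simplifying assumptions that are not in the original text: $\hat{X}_2^T\hat{X}_2$ is written as $n_2 s^2$, with $s^2$ denoting the mean squared imputed expression, and $s^2$, $\hat{\sigma}_\omega^2$, and $\kappa=\hat{\theta}^2\hat{\Sigma}_\nu/\hat{\sigma}_\omega^2$ are treated as roughly constant as $n_2$ changes.

```latex
\begin{aligned}
\widehat{\mathrm{Var}}_c(\hat{\theta})
 &= \frac{\hat{\sigma}_\omega^2}{n_2 s^2}\left(1+\frac{n_2}{n_1}\,\kappa\right)
  = \frac{\hat{\sigma}_\omega^2}{s^2}\left(\frac{1}{n_2}+\frac{\kappa}{n_1}\right), \\
\lim_{n_2\to\infty}\widehat{\mathrm{Var}}_c(\hat{\theta})
 &= \frac{\hat{\sigma}_\omega^2\,\kappa}{n_1 s^2}
  = \frac{\hat{\theta}^2\hat{\Sigma}_\nu}{n_1 s^2}.
\end{aligned}
```

Under these assumptions the corrected variance decreases in $n_2$ toward a floor set by stage 1. When $\kappa$ is small, the $1/n_2$ term dominates over the range of $n_2$ considered, which is consistent with the notable power gains from increasing $n_2$ described above.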

Figure 2. Power of UV‐L with different sample sizes n2 for the ADNI/IGAP data: (a) for all genes with stage 2 p ≤ 0.01 (in stage 2 for their association with AD); (b) for 15 top genes with the most significant p ≤ 5.6e−6, the threshold given by the Bonferroni correction. For comparison, the order of the genes remains the same as in Figure 1 (ordered by the power for “Inf” in Figure 1); the different lines correspond to the factors by which we increase n2. AD, Alzheimer's disease; ADNI, Alzheimer's Disease Neuroimaging Initiative; IGAP, International Genomics of Alzheimer's Project; UV‐L, univariate linear.

3.2. Effect of R2 on the power of TWAS

A more important factor determining TWAS power is the predictive power of gene expression by SNPs in stage 1, which can be measured by the coefficient of multiple determination, R2. In Figure 3, we show how TWAS power varies with R2; the genes are ordered by the power after increasing the observed R2 by 0.3. We can see that even a 0.1 increase in R2 boosts the power by a large margin; a 0.3 increase in R2 leads to high power for most of the genes. Note that for some genes, there is little power gain because their original R2 values are already large (close to 0.95), leaving barely any room for an increase (since R2 cannot be larger than 1 and we capped any increased R2 at 0.95); in addition, the power of TWAS depends not only on the R2 in stage 1, but also on other parameters in stage 2, such as the sample size, the prediction accuracy of the stage 2 model, and the size and variance of the estimated causal effect, θˆ. On the other hand, it is worth pointing out that, by definition, the true R2 for any gene is biologically upper bounded by its heritability. In other words, if a gene's true R2 is small, then regardless of the sample size n1, any analysis method will always have relatively low power to detect its association with a trait in stage 2.

Figure 3. (a) Power of UV‐L with different R2 for the genes with stage 2 p ≤ 0.01 for the ADNI/IGAP data; (b) the corresponding baseline R2 for the genes. The genes are ordered by their power for “+0.3.” AD, Alzheimer's disease; ADNI, Alzheimer's Disease Neuroimaging Initiative; IGAP, International Genomics of Alzheimer's Project; UV‐L, univariate linear.

3.3. Power analysis for TWAS‐LQ and MV‐TWAS

Figure 4 compares the power for different tests in MV‐TWAS and TWAS‐LQ with that in UV‐TWAS. It can be seen that, in general, MV‐TWAS/TWAS‐LQ may substantially increase or decrease the power as compared with the standard (linear) UV‐TWAS denoted as UV‐L. For example, when the underlying relationship between the gene expression and the trait can be approximated by a quadratic function (when TWAS‐Q or UV‐Q gives a small p value), including a quadratic term of the gene expression in TWAS‐LQ will greatly increase the power, as indicated by panel C in Figure 4. However, in the case where the quadratic term is not relevant/significant, for example, the underlying relationship is approximately linear, UV‐L outperforms TWAS‐LQ. We observe a similar result for the MV‐TWAS model. When the expression of the neighbor gene nearest to the target gene has a great contribution to the trait, MV‐TWAS outperforms UV‐TWAS; otherwise, we lose power by including it in the model.

Figure 4. Power comparison for genes with stage 2 p ≤ 0.01 for all models with the ADNI/IGAP data: (a) UV‐L versus TWAS‐L versus MV‐Target; (b) UV‐Q versus TWAS‐Q versus MV‐Alternative; (c) UV‐L versus TWAS‐LQ versus MV‐Joint; (d) UV‐Q versus TWAS‐LQ versus MV‐Joint. The genes are ordered by the power for the UV‐TWAS models. AD, Alzheimer's disease; ADNI, Alzheimer's Disease Neuroimaging Initiative; IGAP, International Genomics of Alzheimer's Project; LQ, linear and quadratic; MV, multivariate; TWAS‐L, Transcriptome‐Wide Association Study linear; UV‐L, univariate linear; UV‐Q, univariate quadratic.

As an example, consider gene HLA‐DRB5 and its nearest neighbor gene HLA‐DRB1 in Figure 4(c). When considered separately, that is, applying the UV‐L model to each of the two genes, HLA‐DRB5 is significantly associated with AD with a p value of 4.4e−6 that can be detected with power 0.52, while HLA‐DRB1 has a p value of 3e−5 and power 0.28. Pearson's correlation between the predicted expressions of the two genes is 0.21. Given the small p value of the neighbor gene and the low correlation between their predicted expressions, incorporating the neighbor gene into the MV‐TWAS model boosts the power from 0.52 to 0.83. On the other hand, gene GLT8D2 has a p value of 2.02e−10 and power 0.7 under the UV‐L model. However, since its nearest gene, TDG, has a much larger p value of 0.64, including it in the MV‐TWAS model decreases the power to 0.03. Interestingly, a high correlation between a target gene and its neighbor gene alone rarely causes a notable power decrease. In fact, if both genes have p values less than or equal to 0.01 under the UV‐L model, the correlation between their predicted expressions is less than 0.3 in absolute value except for gene RNASE3, for which the correlation is 0.87. Under the UV‐L model, RNASE3 and its neighboring gene RNASE2 have p values of 5.6e−3 and 8.5e−4, respectively, and corresponding power of 0.04 and 0.11; in contrast, the power for their MV‐TWAS model is 0.09, an increase for RNASE3 but a decrease for RNASE2. The results suggest that for MV‐TWAS, the relevance of the genes to the trait has a larger impact on the power than their correlation.

Tables 1 and 2 list the genes for which there are relatively larger improvements of power by TWAS‐LQ and MV‐TWAS over that of UV‐TWAS. The p values, the power, and the power gained by using TWAS‐LQ/MV‐TWAS are also listed. As detailed in Section 4, the relevance of these genes to AD has been discussed in the literature.

Table 1.

Some AD‐associated genes with relatively large power gains in TWAS‐LQ over UV‐TWAS (UV‐L or UV‐Q) for the ADNI/IGAP data

Gene       Chromosome   p value    Power   Power gain
FAM117B    2            5.2e−5     0.21    0.06
HLA‐DQA2   6            3.3e−26    0.99    0.19
HLA‐DQB1   6            2.8e−5     0.39    0.28
HLA‐DRB1   6            0          1       0.72
HLA‐DRB5   6            8.1e−8     0.78    0.26
BCL3       19           2.7e−8     0.41    0.26
CNN2       19           9.2e−4     0.19    0.04

Note: The p value/power refers to the stage 2 p value/power of TWAS‐L, TWAS‐Q, or TWAS‐LQ, whichever resulted in a large power gain, which was calculated by subtracting the power of UV‐TWAS from that of TWAS‐LQ.

Abbreviations: AD, Alzheimer's disease; ADNI, Alzheimer's Disease Neuroimaging Initiative; IGAP, International Genomics of Alzheimer's Project; LQ, linear and quadratic; TWAS‐L, Transcriptome‐Wide Association Study linear; TWAS‐Q, Transcriptome‐Wide Association Study quadratic; UV‐L, univariate linear; UV‐Q, univariate quadratic.

Table 2.

Some AD‐associated genes with relatively large power gains in MV‐TWAS over UV‐TWAS (UV‐L or UV‐Q) for the ADNI/IGAP data

Gene       Chromosome   p value     Power   Power gain
BIN1       2            3.8e−15     0.75    0.07
HLA‐DQA2   6            7.1e−30     0.99    0.18
HLA‐DQB1   6            5.3e−22     0.99    0.89
HLA‐DRB1   6            1.6e−16     0.99    0.71
HLA‐DRB5   6            1.1e−8      0.83    0.31
APOC1P1    19           1.01e−61    1       0.08

Note: The p value/power refers to the stage 2 p value/power of MV‐Target, MV‐Alternative, or MV‐Joint, whichever resulted in a large power gain, which was calculated by subtracting the power of UV‐TWAS from that of MV‐TWAS.

Abbreviations: AD, Alzheimer's disease; ADNI, Alzheimer's Disease Neuroimaging Initiative; IGAP, International Genomics of Alzheimer's Project; LQ, linear and quadratic; MV, multivariate; TWAS‐L, Transcriptome‐Wide Association Study linear; TWAS‐Q, Transcriptome‐Wide Association Study quadratic; UV‐L, univariate linear; UV‐Q, univariate quadratic.

3.4. Results for the GTEx and UKB data

We focused on analyzing the effects of sample sizes and R2 with the GTEx and UKB data, and reached the same conclusions as before. We observed power increases as the stage 1 sample size was multiplied by a factor of 2, 5, 10, and 50 (Figure 5). Note that the power increased much less from “5×” to “10×” than from “1×” to “2×” or from “2×” to “5×,” the same diminishing power gains discussed in Section 3.1. Recall that the stage 1 sample size n1 was 670, and the stage 2 sample size n2 was around 200,000. Despite the large difference in the sample sizes, we needed to increase n1 by a factor of no more than 50 for the power to be almost equal to its maximum, attained as if n1 = ∞. The ratio n1/n2 at which the power nearly attained the maximum was similar for both data sets: roughly 1/6 or 1/7.
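The diminishing returns from a larger n1 can be illustrated with a small numerical sketch. The Python code below is not the variance formula used in this paper: it assumes, purely for illustration, that the variance of the stage 2 estimate splits into a stage 2 part A/n2 and a stage 1 part B/n1, with hypothetical constants THETA, A, and B and a hypothetical significance level ALPHA. Only n1 = 670 and n2 = 200,000 are taken from the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def wald_power(theta, se, alpha):
    """Power of a two-sided Wald z-test of H0: theta = 0 at level alpha."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    ncp = abs(theta) / se
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)

N1, N2 = 670, 200_000             # sample sizes quoted in the text
ALPHA = 5e-6                      # hypothetical significance level
THETA, A, B = 1.0, 5000.0, 20.0   # hypothetical effect and variance constants

def toy_power(factor):
    """Power when n1 is multiplied by `factor` (float('inf') gives the bound)."""
    variance = A / N2 + B / (N1 * factor)  # B / inf evaluates to 0.0
    return wald_power(THETA, variance ** 0.5, ALPHA)

factors = [1, 2, 5, 10, 50, float("inf")]
powers = [toy_power(f) for f in factors]
for f, p in zip(factors, powers):
    print(f"n1 x {f}: power = {p:.3f}")

# Ratio n1/n2 once n1 has been multiplied by 50.
ratio_at_50x = 50 * N1 / N2
print(f"n1/n2 at 50x = {ratio_at_50x:.4f}")
```

In this toy model the stage 1 term shrinks with every increase of n1 while the stage 2 term A/n2 stays fixed, so each further multiplication of n1 buys less power. With n1 multiplied by 50, n1/n2 = 33,500/200,000, or about 1/6, in line with the ratio quoted above.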

Figure 5.

Power of UV‐L with different sample sizes n1 for the GTEx/UKB data. (a) Top 142 genes with significant p values (p ≤ 4.36e−6). (b) A selected set of 24 genes with significant p values, leading to a relatively smooth baseline power curve for better visualization. 4.3e−6 is the significance level after the Bonferroni correction. The genes are ordered by the power for “Inf” (i.e., n1 = ∞); the lines correspond to the factors by which we increase n1. GTEx, Genotype‐Tissue Expression; UKB, UK Biobank; UV‐L, univariate linear.

The case for n2 was the same as for the ADNI/IGAP data; see Figure 6. Regardless of the ratio n1/n2, increasing n2 always resulted in a notable power gain unless the power was close to 1, in contrast to the diminishing power gains from increasing n1.

Figure 6.

Power of UV‐L with different sample sizes n2 for the GTEx/UKB data. (a) Top 142 genes with significant p values (p ≤ 4.36e−6). (b) A selected set of 24 genes with significant p values, leading to a relatively smooth baseline power curve for better visualization. 4.3e−6 is the significance level after the Bonferroni correction. For comparison with Figure 5, the order of the genes remains the same (ordered by the power for “Inf” in Figure 5); the different lines correspond to the factors by which we increase n2. GTEx, Genotype‐Tissue Expression; UKB, UK Biobank; UV‐L, univariate linear.

The availability of the estimated expression heritability allowed for a more realistic expectation of the upper bound of the R2 in stage 1 (Wheeler et al., 2016), as compared with arbitrarily increasing R2 as done for the ADNI data. In Figure 7(a), we plotted the power calculated with the estimated heritability and the 95% CI bounds against the power calculated with the actual R2 of the fitted model. Figure 7(b) compares the estimated heritability to the R2 of the fitted model. For most genes, the fitted R2 values were quite close to the heritability estimates, leading to similar power curves. As observed for the ADNI data, a larger R2 would always result in a gain in power; how much a larger R2 would improve the power also depended on other factors, as discussed in Section 3.2.

Figure 7.

(a) Power of UV‐L with different R2 for the GTEx/UKB data. For “baseline,” the power was calculated with the R2 of the fitted stage 1 model; for “heritability,” the power was calculated using the estimated heritability and the upper/lower bounds of its 95% CI as the R2. (b) R2 of the fitted model (“baseline”), the estimated heritability and its 95% CI (“heritability”). Only the significant genes (p ≤ 5.63e−5) with their estimated heritability larger than the fitted R2 are shown. The genes are ordered by the power for “baseline.” CI, confidence interval; GTEx, Genotype‐Tissue Expression; UKB, UK Biobank; UV‐L, univariate linear.

4. DISCUSSION

In summary, we have studied the effects of the sample size and R2 (or more generally, gene expression heritability) in stage 1 on the power of the standard/univariate TWAS, and the potential power gains/losses of multivariate TWAS. The small sample size of the eQTL data in TWAS may raise the concern that it diminishes the power of TWAS by inflating the variance of the causal estimate θ̂ in stage 2. We have shown that, as expected, increasing the stage 1 sample size would increase the power, but, perhaps surprisingly, only to a limited extent: a sample size n1 of about eight thousand, only about 1/7 of the stage 2 sample size, seemed to nearly reach the power upper bound (when n1 = ∞) for almost all the genes in our ADNI/IGAP real data example. This result is surprising because this sample size n1 in stage 1 is still much smaller than n2 in stage 2. On the other hand, the stage 2 sample size n2 appears to have a much larger impact. The same conclusions on the differing effects of the sample sizes in the two stages in TWAS were reached in the application to the GTEx and UKB data. These results are in agreement with a recent empirical study using other omic and GWAS data (Baranger et al., 2022): using smaller GWAS data in stage 2 may dramatically hinder any new discovery by TWAS.

The case for R2, however, is more complicated than that for the sample size. Whether and how much the power would increase as a result of a larger R2 depended on (1) the original (or baseline) R2, (2) the estimated variance of the causal effect, and (3) the original power. Generally, if the original power and original R2 were small, and the estimated variance was large, then one would benefit more from a larger R2 than otherwise. However, it is noted that the R2 for any gene is upper bounded by the heritability of its expression, so it cannot really be manipulated for real data to increase power.
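The dependence of the gain on the baseline R2 and the baseline power can be sketched numerically. The Python code below rests on a deliberately simplified assumption that is not the formula used in this paper: with standardized variables, known stage 1 weights (as if n1 = ∞), and a small effect, the noncentrality of the stage 2 test is taken to be |theta| × sqrt(n2 × R2). The effect sizes, the baseline R2 values, the heritability H2, and the level ALPHA are all hypothetical; n2 = 200,000 follows the GTEx/UKB example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def wald_power(ncp, alpha):
    """Power of a two-sided z-test with noncentrality ncp at level alpha."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    return STD_NORMAL.cdf(ncp - z_crit) + STD_NORMAL.cdf(-ncp - z_crit)

def toy_twas_power(theta, r2, n2, alpha):
    """Assumed approximation: ncp = |theta| * sqrt(n2 * r2)."""
    return wald_power(abs(theta) * (n2 * r2) ** 0.5, alpha)

N2 = 200_000   # stage 2 sample size, as in the GTEx/UKB example
ALPHA = 5e-6   # hypothetical significance level
H2 = 0.05      # hypothetical heritability, the upper bound for R2

# Hypothetical genes: (standardized effect, baseline R2)
genes = {
    "low R2, low power": (0.06, 0.020),
    "R2 near heritability": (0.06, 0.045),
    "low R2, high power": (0.10, 0.020),
}

gains = {}
for name, (theta, r2) in genes.items():
    baseline = toy_twas_power(theta, r2, N2, ALPHA)
    at_h2 = toy_twas_power(theta, H2, N2, ALPHA)
    gains[name] = at_h2 - baseline
    print(f"{name}: baseline {baseline:.3f}, at R2 = h2 {at_h2:.3f}")
```

In this sketch the gene with both a low baseline R2 and a low baseline power benefits most from raising R2 to the heritability, while a gene whose R2 is already near its heritability, or whose power is already high, gains little.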

The multivariate TWAS analysis suggested that, compared with the standard/univariate TWAS model, including higher‐order terms or information from other genes could greatly increase (or decrease) the power, at least under some realistic configurations (since the unknown true parameters are estimated from the real data), as we demonstrated with the real data. Interestingly, as shown in Tables 1 and 2, many of the genes with substantial power gains by TWAS‐LQ and MV‐TWAS over the standard UV‐TWAS are related to AD as discussed in the literature. For example, the genes in the human leukocyte antigen (HLA) complex have been considered risk factors of late‐onset AD (Mansouri et al., 2015; Steele et al., 2017; Wang et al., 2020). APOC1 is another widely known gene, related to APOE, that affects the risk of AD (Q. Zhou & Zhao, 2014; X. Zhou et al., 2019). In addition, it has been proposed that BIN1 mediates the risk of AD (Chapuis et al., 2013). The results justified the use of MV‐TWAS models to identify additional genes that could be missed by UV‐TWAS, so researchers may consider incorporating them into their studies.

Although power analysis for TWAS methods has gained some attention in the literature, to the best of our knowledge, none of the studies addressed the issues discussed in this paper. For example, a recent study inspected the effects of prediction accuracy of the stage 1 model on the TWAS power (Cao et al., 2021). However, they did not directly examine the effect of R2 on the power, but instead focused on the influence of replacing predicted gene expression with observed gene expression. Other studies either compared TWAS methods with GWAS, or focused on closely related MR analyses (L. Deng et al., 2020; Veturi & Ritchie, 2018). In particular, these studies assumed a univariate, linear stage 2 model and conducted simulations, instead of using real data to generate more realistic parameters for analysis. Importantly, we expect that the current study will be useful in offering a general approach to sample size/power calculations for both univariate and multivariate TWAS.

There are a few limitations of this study. First, we did not account for potential horizontal pleiotropy, where genetic variants contribute to both gene expression levels and traits directly, leading to a biased causal estimate if not suitably accounted for (Y. Deng & Pan, 2021; Lin et al., 2022). Second, one could consider nonlinear or nonparametric models like random forest (Breiman, 2001), gradient boosting machines (Friedman, 2001), or other machine learning methods (Okoro et al., 2021) to boost the prediction accuracy in stage 1; the analysis presented here no longer applies to such models because we adopted linear models in stage 1. Third, perhaps most importantly, some specific results (e.g., the required n1 or the ratio n1/n2) may change if different eQTL and GWAS data are considered, though we do not expect that our general conclusions (as discussed in the first three paragraphs in this section) will change dramatically.

ACKNOWLEDGMENTS

Data collection and sharing for this project were funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI; National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH‐12‐2‐0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica Inc.; Biogen; Bristol‐Myers Squibb Company; CereSpir Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann‐La Roche Ltd., and its affiliated company Genentech Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC; Johnson & Johnson Pharmaceutical Research & Development LLC; Lumosity; Lundbeck; Merck & Co. Inc.; Meso Scale Diagnostics, LLC; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California. The Genotype‐Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. The data used for the analyses described in this manuscript were obtained from dbGaP Project #26511. 
The access to the UKB data was approved through UKB Application #35107. This study was supported by NIH grants R01AG065636, RF1AG067924, U01AG073079, and R01HL116720, and by the MSI. We thank the reviewers for many helpful and constructive comments.

He, R., Xue, H., & Pan, W. (2022). Statistical power of transcriptome‐wide association studies. Genetic Epidemiology, 46, 572–588. 10.1002/gepi.22491

Data used in preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at: http://adni.loni.usc.edu/wp-content/uploads/howtoapply/ADNIAcknowledgementListSep23.pdf.

DATA AVAILABILITY STATEMENT

The ADNI data are available to the approved user at the ADNI website http://adni.loni.usc.edu; see the ADNI website for information about the application procedure. The IGAP AD GWAS summary data can be downloaded at https://www.ebi.ac.uk/gwas/studies/GCST002245. The GTEx data are available to the approved user at https://www.ncbi.nlm.nih.gov/gap/, and the UKB data are available to the approved user at https://www.ukbiobank.ac.uk/. The R code can be found at https://github.com/RuoyuHe/TWAS_Power_Analysis. All the analyses were performed on the MSI (https://www.msi.umn.edu/) server with a single AMD ROME compute node. The run time was roughly 4 h for every 2000 genes.

REFERENCES

  1. Angrist, J. D., Imbens, G. W., & Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434), 444–455. 10.1080/01621459.1996.10476902
  2. Baranger, D. A., Hatoum, A. S., Polimanti, R., Gelernter, J., Edenberg, H. J., Bogdan, R., & Agrawal, A. (2022). Multi‐omics analyses cannot identify true‐positive novel associations from underpowered genome‐wide association studies of four brain‐related traits. bioRxiv. 10.1101/2022.04.13.487655
  3. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. 10.1023/A:1010950718922
  4. Brion, M.‐J. A., Shakhbazov, K., & Visscher, P. M. (2012). Calculating statistical power in Mendelian randomization studies. International Journal of Epidemiology, 42(5), 1497–1501. 10.1093/ije/dyt179
  5. Burgess, S. (2014). Sample size and power calculations in Mendelian randomization with a single instrumental variable and a binary outcome. International Journal of Epidemiology, 43(3), 922–929. 10.1093/ije/dyu005
  6. Burgess, S., Small, D. S., & Thompson, S. G. (2015). A review of instrumental variable estimators for Mendelian randomization. Statistical Methods in Medical Research, 26(5), 2333–2355. 10.1177/0962280215597579
  7. Burgess, S., & Thompson, S. G. (2015). Multivariable Mendelian randomization: The use of pleiotropic genetic variants to estimate causal effects. American Journal of Epidemiology, 181(4), 251–260. 10.1093/aje/kwu283
  8. Cao, C., Ding, B., Li, Q., Kwok, D., Wu, J., & Long, Q. (2021). Power analysis of transcriptome‐wide association study: Implications for practical protocol choice. PLOS Genetics, 17(2), e1009405. 10.1371/journal.pgen.1009405
  9. Chapuis, J., Hansmannel, F., Gistelinck, M., Mounier, A., Cauwenberghe, C. V., Kolen, K. V., Geller, F., Sottejeau, Y., Harold, D., Dourlen, P., Grenier‐Boley, B., Kamatani, Y., Delepine, B., Demiautte, F., Zelenika, D., Zommer, N., Hamdane, M., Bellenguez, C., Dartigues, J.‐F., … GERAD Consortium (2013). Increased expression of BIN1 mediates Alzheimer genetic risk by modulating tau pathology. Molecular Psychiatry, 18(11), 1225–1234. 10.1038/mp.2013.1
  10. Deng, L., Zhang, H., & Yu, K. (2020). Power calculation for the general two‐sample Mendelian randomization analysis. Genetic Epidemiology, 44(3), 290–299. 10.1002/gepi.22284
  11. Deng, Y., & Pan, W. (2021). Model checking via testing for direct effects in Mendelian randomization and transcriptome‐wide association studies. PLOS Computational Biology, 17(8), e1009266. 10.1371/journal.pcbi.1009266
  12. Doody, R. S., Vacca, J. L., Massman, P. J., & Liao, T.‐y. (1999). The influence of handedness on the clinical presentation and neuropsychology of Alzheimer disease. Archives of Neurology, 56(9), 1133. 10.1001/archneur.56.9.1133
  13. Freeman, G., Cowling, B. J., & Schooling, C. M. (2013). Power and sample size calculations for Mendelian randomization studies using one genetic instrument. International Journal of Epidemiology, 42(4), 1157–1163. 10.1093/ije/dyt110
  14. Friedman, J. H. (2001). Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5), 1189–1232. 10.1214/aos/1013203451
  15. Gamazon, E. R., Wheeler, H. E., Shah, K. P., Mozaffari, S. V., Aquino‐Michaels, K., Carroll, R. J., Eyler, A. E., Denny, J. C., Nicolae, D. L., Cox, N. J., & Im, H. K. (2015). A gene‐based association method for mapping traits using reference transcriptome data. Nature Genetics, 47(9), 1091–1098. 10.1038/ng.3367
  16. GTEx Consortium. (2020). The GTEx consortium atlas of genetic regulatory effects across human tissues. Science, 369(6509), 1318–1330. 10.1126/science.aaz1776
  17. Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W. J. H., Jansen, R., de Geus, E. J. C., Boomsma, D. I., Wright, F. A., Sullivan, P. F., Nikkola, E., Alvarez, M., Civelek, M., Lusis, A. J., Lehtimäki, T., Raitoharju, E., Kähönen, M., Seppälä, I., … Pasaniuc, B. (2016). Integrative approaches for large‐scale transcriptome‐wide association studies. Nature Genetics, 48(3), 245–252. 10.1038/ng.3506
  18. Inoue, A., & Solon, G. (2010). Two‐sample instrumental variables estimators. Review of Economics and Statistics, 92(3), 557–561. 10.1162/rest_a_00011
  19. Knutson, K. A., Deng, Y., & Pan, W. (2020). Implicating causal brain imaging endophenotypes in Alzheimer's disease using multivariable IWAS and GWAS summary data. NeuroImage, 223, 117347. 10.1016/j.neuroimage.2020.117347
  20. Li, K. K. W., Ng, I. O. L., Fan, S. T., Albrecht, J. H., Yamashita, K., & Poon, R. Y. C. (2002). Activation of cyclin‐dependent kinases CDC2 and CDK2 in hepatocellular carcinoma. Liver, 22(3), 259–268. 10.1046/j.0106-9543.2002.01629.x
  21. Lin, Z., Xue, H., Malakhov, M. M., Knutson, K. A., & Pan, W. (2022). Accounting for nonlinear effects of gene expression identifies additional associated genes in transcriptome‐wide association studies. Human Molecular Genetics. Advance online publication. 10.1093/hmg/ddac015
  22. Mansouri, L., Klai, S., Gritli, N., Fekih‐Mrissa, N., Messalmani, M., Bedoui, I., Derbali, H., & Mrissa, R. (2015). Association of HLA‐DR/DQ polymorphism with Alzheimer's disease. The American Journal of the Medical Sciences, 349(4), 334–337. 10.1097/maj.0000000000000416
  23. Okoro, P. C., Schubert, R., Guo, X., Johnson, W. C., Rotter, J. I., Hoeschele, I., Liu, Y., Im, H. K., Luke, A., Dugas, L. R., & Wheeler, H. E. (2021). Transcriptome prediction performance across machine learning models and diverse ancestries. Human Genetics and Genomics Advances, 2(2), 100019. 10.1016/j.xhgg.2020.100019
  24. Pierce, B. L., Ahsan, H., & VanderWeele, T. J. (2010). Power and instrument strength requirements for Mendelian randomization studies using multiple genetic variants. International Journal of Epidemiology, 40(3), 740–752. 10.1093/ije/dyq151
  25. Porcu, E., Rüeger, S., Lepik, K., Santoni, F. A., Reymond, A., & Kutalik, Z. (2019). Mendelian randomization integrating GWAS and eQTL data reveals genetic determinants of complex and clinical traits. Nature Communications, 10(1), 3300. 10.1038/s41467-019-10936-0
  26. Shen, L., Thompson, P. M., Potkin, S. G., Bertram, L., Farrer, L. A., Foroud, T. M., Green, R. C., Hu, X., Huentelman, M. J., Kim, S., Kauwe, J. S. K., Li, Q., Liu, E., Macciardi, F., Moore, J. H., Munsie, L., Nho, K., Ramanan, V. K., Risacher, S. L., … Saykin, A. (2013). Genetic analysis of quantitative phenotypes in AD and MCI: Imaging, cognition and biomarkers. Brain Imaging and Behavior, 8(2), 183–207. 10.1007/s11682-013-9262-z
  27. Steele, N. Z. R., Carr, J. S., Bonham, L. W., Geier, E. G., Damotte, V., Miller, Z. A., Desikan, R. S., Boehme, K. L., Mukherjee, S., Crane, P. K., Kauwe, J. S. K., Kramer, J. H., Miller, B. L., Coppola, G., Hollenbach, J. A., Huang, Y., & Yokoyama, J. S. (2017). Fine‐mapping of the human leukocyte antigen locus as a risk factor for Alzheimer disease: A case–control study. PLOS Medicine, 14(3), e1002272. 10.1371/journal.pmed.1002272
  28. Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., Liu, B., Matthews, P., Ong, G., Pell, J., Silman, A., Young, A., Sprosen, T., Peakman, T., & Collins, R. (2015). UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLOS Medicine, 12(3), e1001779. 10.1371/journal.pmed.1001779
  29. van Loenhoud, A. C., Groot, C., Bocancea, D. I., Barkhof, F., Teunissen, C., Scheltens, P., van de Flier, W. M., & Ossenkoppele, R. (2022). Association of education and intracranial volume with cognitive trajectories and mortality rates across the Alzheimer disease continuum. Neurology, 98(16), e1679–e1691. 10.1212/wnl.0000000000200116
  30. Veturi, Y., & Ritchie, M. D. (2018). How powerful are summary‐based methods for identifying expression‐trait associations under different genetic architectures? Pacific Symposium on Biocomputing, 23, 228–239. World Scientific. 10.1142/9789813235533_0021
  31. Wang, Z.‐X., Wan, Q., & Xing, A. (2020). HLA in Alzheimer's disease: Genetic association and possible pathogenic roles. NeuroMolecular Medicine, 22(4), 464–473. 10.1007/s12017-020-08612-4
  32. Wheeler, H. E., Shah, K. P., Brenner, J., Garcia, T., Aquino‐Michaels, K., GTEx Consortium, Cox, N. J., Nicolae, D. L., & Im, H. K. (2016). Survey of the heritability and sparse architecture of gene expression traits across human tissues. PLOS Genetics, 12(11), e1006423. 10.1371/journal.pgen.1006423
  33. Xue, H., Pan, W., & Alzheimer's Disease Neuroimaging Initiative. (2020). Some statistical consideration in transcriptome‐wide association studies. Genetic Epidemiology, 44(3), 221–232. 10.1002/gepi.22274
  34. Zhou, Q., Zhao, F., Lv, Z.‐p., Zheng, C.‐g., Zheng, W.‐d., Sun, L., Wang, N.‐n., Pang, S., de Andrade, F. M., Fu, M., He, X.‐h., Hui, J., Jiang, W.‐y., Yang, C.‐y., Shi, X.‐h., Zhu, X.‐q., Pang, G.‐f., Yang, Y.‐g., Xie, H.‐q., … Yang, Z. (2014). Association between APOC1 polymorphism and Alzheimer's disease: A case–control study and meta‐analysis. PLOS One, 9(1), e87017. 10.1371/journal.pone.0087017
  35. Zhou, X., Chen, Y., Mok, K. Y., Kwok, T. C. Y., Mok, V. C. T., Guo, Q., Ip, F. C., Chen, Y., Mullapudi, N., Alzheimer's Disease Neuroimaging Initiative, Giusti‐Rodríguez, P., Sullivan, P. F., Hardy, J., Fu, A. K. Y., Li, Y., & Ip, N. Y. (2019). Noncoding variability at the APOE locus contributes to the Alzheimer's risk. Nature Communications, 10(1), 3310. 10.1038/s41467-019-10945-z


Articles from Genetic Epidemiology are provided here courtesy of Wiley
