PLoS Genet. 2021 Feb 26;17(2):e1009405. doi: 10.1371/journal.pgen.1009405

Power analysis of transcriptome-wide association study: Implications for practical protocol choice

Chen Cao 1, Bowei Ding 2, Qing Li 1, Devin Kwok 2, Jingjing Wu 2,*, Quan Long 1,2,3,4,*
Editor: Xiaofeng Zhu 5
PMCID: PMC7946362  PMID: 33635859

Abstract

The transcriptome-wide association study (TWAS) has emerged as one of several promising techniques for integrating multi-scale ‘omics’ data into traditional genome-wide association studies (GWAS). Unlike GWAS, which associates phenotypic variance directly with genetic variants, TWAS uses a reference dataset to train a predictive model for gene expressions, which allows it to associate phenotype with variants through the mediating effect of expressions. Although effective, this core innovation of TWAS is poorly understood, since the predictive accuracy of the genotype-expression model is generally low and further bounded by expression heritability. This raises the question: to what degree does the accuracy of the expression model affect the power of TWAS? Furthermore, would replacing predictions with actual, experimentally determined expressions improve power? To answer these questions, we compared the power of GWAS, TWAS, and a hypothetical protocol utilizing real expression data. We derived non-centrality parameters (NCPs) for linear mixed models (LMMs) to enable closed-form calculations of statistical power that do not rely on specific protocol implementations. We examined two representative scenarios: causality (genotype contributes to phenotype through expression) and pleiotropy (genotype contributes directly to both phenotype and expression), and also tested the effects of various properties including expression heritability. Our analysis reveals two main outcomes: (1) Under pleiotropy, the use of predicted expressions in TWAS is superior to actual expressions. This explains why TWAS can function with weak expression models, and shows that TWAS remains relevant even when real expressions are available. (2) GWAS outperforms TWAS when expression heritability is below a threshold of 0.04 under causality, or 0.06 under pleiotropy. 
Analysis of existing publications suggests that TWAS has been misapplied in place of GWAS, in situations where expression heritability is low.

Author summary

We compared the effectiveness of three methods for finding genetic effects on disease in order to quantify their strengths and help researchers choose the best protocol for their data. The genome-wide association study (GWAS) is the standard method for identifying how the genetic differences between individuals relate to disease. Recently, the transcriptome-wide association study (TWAS) has improved GWAS by also estimating the effect of each genetic variant on the activity level (or expression) of genes related to disease. The effectiveness of TWAS is surprising because its estimates of gene expressions are very inaccurate, so we ask if a method using real expression data instead of estimates would perform better. Unlike past studies, which only use simulation to compare these methods, we incorporate novel statistical calculations to make our comparisons more accurate and universally applicable. We discover that depending on the type of relationship between genetics, gene expression, and disease, the estimates used by TWAS could be actually more relevant than real gene expressions. We also find that TWAS is not always better than GWAS when the relationship between genetics and expression is weak and identify specific turning points where past studies have incorrectly used TWAS instead of GWAS.


This is a PLOS Computational Biology Methods paper.

Introduction

High-throughput sequencing instruments have enabled the rapid profiling of transcriptomes (RNA expression of genes) [1–4], proteomes (proteins) [5–7] and other ‘omics’ data [8–10]. These ‘omics’ provide insight into the intermediary effects of genotypes on endophenotypes, and can improve the ability of genome-wide association studies (GWAS) to find associations between genetic variants and disease phenotypes [11–13]. The integration of diverse ‘omics’ data sources remains a challenging and active field of research [14–17].

One approach to integrating ‘omics’ and GWAS is the transcriptome-wide association study (TWAS), which quantitatively aggregates multiple genetic variants into a single test using transcriptome data. Pioneered by Gamazon et al. [18], the TWAS protocol typically has two steps. First, a model is trained to predict gene expressions from local genetic variants near the focal genes, using a reference dataset containing both genotype and expression data. Second, the pretrained model is used to predict expressions from genotypes in the association mapping dataset under study, which contains genotypes and phenotypes (but not expression). The predicted expressions are then associated to the phenotype of interest. TWAS can also be conducted with summary statistics from GWAS datasets (i.e. meta-analysis) as first demonstrated by Gusev et al. [19,20]. TWAS has since achieved significant popularity and success in identifying the genetic basis of complex traits [21–27], inspiring similar protocols for other endophenotypes such as IWAS for images [28] and PWAS for proteins [29].

Despite its demonstrated effectiveness, important questions remain regarding the theoretical conditions under which TWAS is superior to GWAS. First: TWAS mapping relies entirely on predicted expressions, but as shown by many methodological papers, the mean $R^2$ between predicted and actual expressions is very low (around 0.02 to 0.05). This is in part due to low expression heritability [18], which bounds the maximum predictive accuracy attainable by the genotype-expression model. Naturally, one can ask: given sufficiently low expression heritability, is there a point at which TWAS performs worse than GWAS? Indeed, in real data, genes discovered with significant TWAS p-values tend to have a higher $R^2$, and thus higher expression heritability, than average [18,19,30–32]. We therefore investigate the effect of expression heritability on the power of TWAS, as well as its interactions with trait heritability, phenotypic variance from expressions, number of causal genes, and genetic architecture. Second: as described by Gamazon et al. [18], the key insight of TWAS is that it aggregates sensible genetic variants to estimate “genetically regulated gene expression”, or GReX, for use in downstream GWAS. Given this hypothesis, one may ask if actual expression data would further improve the power of downstream GWAS over predicted expressions. This is not a trivial question: although actual expressions do not suffer from prediction errors, they also include experimental or environmental noise which masks the genetic component of expression. To address this question, we define a hypothetical protocol associating real expressions to phenotype, which we call “expression mediated GWAS” or emGWAS. While emGWAS is not in practical use due to the difficulties of accessing relevant tissues (e.g., in the studies of brain diseases), it can potentially be applied to future analyses of diseases where tissues are routinely available (e.g., blood or cancerous tissues). 
More importantly, emGWAS serves as a useful benchmark for evaluating the theoretical properties of TWAS-predicted expressions against ground truth expression data. By analyzing the power of TWAS, GWAS, and emGWAS, we develop practical guidelines for choosing each protocol given different expression heritability and genetic architectures.

An existing study has compared the power of GWAS, TWAS, and a protocol which integrates eQTLs with GWAS [33], but that study is purely simulation-based, whereas we determine power directly using traditional closed-form analysis. We derive non-centrality parameters (NCPs) for the relevant statistical tests and the linear mixed model (LMM) in particular (Methods). Our derivation uses a novel method to convert an LMM into a linear regression by decorrelating the covariance structure of the LMM response variable (Methods). To the best of our knowledge, this is the first closed-form derivation of the NCP for LMMs in current literature, with potential for broad applications as LMMs are the dominant models used in GWAS and portions of the TWAS pipeline.

Unlike pure simulations, which stochastically resample the alternative hypothesis to estimate statistical power, our closed-form derivation directly calculates power from a particular configuration of association mapping data. As a result, our method saves computational resources, yields more accurate power estimations, and adapts easily to similar protocols such as IWAS [28] and PWAS [29,34]. Moreover, as the closed-form derivation avoids conducting the actual regression, our power calculations do not depend on specific implementations of GWAS and TWAS, which could otherwise cause our results to vary due to differences in filtering inputs or parameter optimizations. Our work therefore characterizes the theoretical power of the protocols across all LMM-based implementations and datasets, although we are unable to account for power losses due to practical implementation issues.

In the following section we describe our novel derivation of NCPs for LMMs and our power analyses of GWAS, TWAS, and emGWAS. We present guidelines on the applicability of each protocol under different input conditions and discuss potential limitations of our approach as well as areas for future research.

Methods

Mathematical definitions of GWAS, TWAS, and emGWAS protocols

While there are many variations of GWAS and TWAS [18,19,35–39], in this work we assume that multiple genes contribute to phenotypic variation, and for each causal gene, multiple single nucleotide polymorphisms (SNPs) contribute to both gene expression and phenotype. This setting is motivated by the fact that most complex traits are known to have multiple contributing loci, and TWAS fundamentally assumes that genes have multiple local causal variants. To ensure consistency, we apply the same assumptions in the design of the hypothetical protocol emGWAS. Specifically, we define the following models:

GWAS

For GWAS, we adopted a standard LMM similar to EMMAX [35]:

$$Y = \beta_{j0}\mathbf{1} + \beta_{j1}X_j + u + \varepsilon, \quad j = 1, 2, \ldots, n_x, \tag{1}$$

where $n$ is the number of individuals, $n_x$ is the total number of genetic variants, $Y$ is an $n \times 1$ vector of phenotypes, $\mathbf{1}$ is an $n \times 1$ vector of ones, $X_j$ is an $n \times 1$ genotype vector with $X_{ij} \in \{0, 1, 2\}$ representing the number of minor allele copies for the $i$th individual and $j$th genetic variant, $\beta_{j0}$ and $\beta_{j1}$ are the intercept and effect size of the genetic variant, $u$ is an $n \times 1$ vector of random effects following the multivariate normal distribution, i.e. $u \sim N(0, \sigma_g^2 K_x)$, and $\varepsilon$ is an $n \times 1$ vector of errors with $\varepsilon \sim N(0, \sigma_e^2 I)$. In the distributions of $u$ and $\varepsilon$, $\sigma_g^2$ and $\sigma_e^2$ are their respective variance components, $I$ is an $n \times n$ identity matrix, and $K_x$ is the genomic relationship matrix (GRM), which is a known $n \times n$ real symmetric matrix. Following Patterson et al. [40], $K_x$ is calculated by

$$K_x = \frac{1}{n_x}\tilde{X}\tilde{X}^T, \tag{2}$$

where $n_x$ is the total number of genetic variants and $\tilde{X}$ is a standardized $n \times n_x$ matrix. For example, an element $\tilde{X}_{ij}$ in the $j$th genetic variant column is calculated as

$$\tilde{X}_{ij} = \frac{X_{ij} - \bar{X}_{\cdot j}}{S_{X_j}}, \tag{3}$$

where $\bar{X}_{\cdot j} = \frac{1}{n}\sum_{i=1}^{n} X_{ij}$ and $S_{X_j}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_{ij} - \bar{X}_{\cdot j})^2$ are the sample mean and sample variance of the $j$th variant, respectively.
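As an illustration of Eqs (2) and (3), the following is a minimal Python sketch that standardizes a toy genotype matrix and forms the GRM. The function names and the toy data are hypothetical and are not part of the original analysis, and the sketch assumes every variant is polymorphic (a variant with zero sample variance would need to be filtered out first).

```python
import math

def standardize_genotypes(X):
    """Column-standardize an n x n_x genotype matrix (list of rows), as in Eq (3)."""
    n, n_x = len(X), len(X[0])
    X_tilde = [[0.0] * n_x for _ in range(n)]
    for j in range(n_x):
        column = [X[i][j] for i in range(n)]
        mean = sum(column) / n
        # Unbiased sample variance with the 1/(n-1) factor used in the text.
        variance = sum((v - mean) ** 2 for v in column) / (n - 1)
        sd = math.sqrt(variance)
        for i in range(n):
            X_tilde[i][j] = (column[i] - mean) / sd
    return X_tilde

def genomic_relationship_matrix(X):
    """K_x = (1 / n_x) * X_tilde * X_tilde^T, as in Eq (2)."""
    X_tilde = standardize_genotypes(X)
    n, n_x = len(X_tilde), len(X_tilde[0])
    return [
        [sum(X_tilde[a][j] * X_tilde[b][j] for j in range(n_x)) / n_x
         for b in range(n)]
        for a in range(n)
    ]

# Hypothetical toy data: 4 individuals, 3 variants coded as minor allele counts.
toy_genotypes = [
    [0, 1, 2],
    [1, 0, 2],
    [2, 1, 0],
    [1, 2, 1],
]
K = genomic_relationship_matrix(toy_genotypes)
```

Because each column is centered, every row of `K` sums to zero, and because each column is scaled by its sample standard deviation, the trace of `K` equals n − 1.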

emGWAS

For emGWAS, we first regress the phenotype on the actual (not predicted) expressions, and then regress the expressions on individual local genetic variants in a similar manner as a cis-eQTL analysis. We chose the LMM to associate phenotype with expression, since under the assumption that multiple genes contribute to phenotype, we expect that the random term of the LMM can capture the effects of non-focal genes. We calculate the GRM from genotypes instead of expressions because genotypes provide better estimates of pairwise relationships between study participants than correlations based on predicted expression data. We chose to use linear regression (LM) to model the association between expression and local genetic variants (which correspond to cis-eQTLs), as it is the most common model used in cis-eQTL analyses.

Specifically, the phenotype-expression model is

$$Y = \beta_{l0}\mathbf{1} + \beta_{l1}Z_l + u + \varepsilon, \quad l = 1, 2, \ldots, n_z, \tag{4}$$

where $n$, $Y$, $\mathbf{1}$, $u$ and $\varepsilon$ have identical interpretations as in the GWAS model from (1), $n_z$ is the total number of genes, $Z_l$ is an $n \times 1$ gene expression vector for the $l$th gene, and $\beta_{l0}$ and $\beta_{l1}$ are the intercept and effect size of the gene.

The linear regression associating gene expression with local genetic variants is

$$Z_l = \beta_{lk0}\mathbf{1} + \beta_{lk1}X_{lk} + \varepsilon_{el}, \quad l = 1, 2, \ldots, n_z, \quad k = 1, 2, \ldots, n_{el}, \tag{5}$$

where $X_{lk}$ is an $n \times 1$ genotype vector of the $k$th local genetic variant for the $l$th gene, $\varepsilon_{el} \sim N(0, \sigma_{el}^2 I)$ is an $n \times 1$ vector of errors with variance component $\sigma_{el}^2$, $n_{el}$ is the total number of local genetic variants in the $l$th gene, and $\beta_{lk0}$ and $\beta_{lk1}$ are the intercept and effect size of the variant.

TWAS

For TWAS, we apply an analysis similar to emGWAS, except that gene expressions are predicted using a pretrained elastic-net model. Specifically,

$$Y = \beta_{Pl0}\mathbf{1} + \beta_{Pl1}\hat{Z}_l + u + \varepsilon, \quad l = 1, 2, \ldots, n_z, \tag{6}$$

where $\hat{Z}_l$ is the altered notation representing an $n \times 1$ vector of predicted gene expressions for the $l$th gene, and $\beta_{Pl0}$ and $\beta_{Pl1}$ are the intercept and effect size of the predicted gene expression.

There are several methods to estimate gene expression, including the least absolute shrinkage and selection operator (LASSO) and elastic-net. Gamazon et al. have shown that elastic-net has good performance and is more robust to minor changes in the input variants [18]. We therefore use the “glmnet” package in R to train a predictive model using elastic-net. The objective function in “glmnet” is

$$L_{enet}(\beta) = \frac{1}{2n}\lVert Z - X\beta \rVert_2^2 + \lambda\left(\frac{1-\alpha}{2}\lVert \beta \rVert_2^2 + \alpha\lVert \beta \rVert_1\right), \tag{7}$$

where λ and α are tuning parameters. The penalty term is a convex (linear) combination of LASSO and ridge penalties, where α = 1 is equivalent to the LASSO objective function, and α = 0 is equivalent to ridge regression. Optimal values of λ and α were chosen by minimizing the cross-validated squared-error. Readers are referred to Appendix A in S1 Text for details.
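To make the roles of λ and α in (7) concrete, here is a small Python sketch that only evaluates the objective for a fixed coefficient vector. It is not a re-implementation of “glmnet” (which also handles intercepts, standardization and the optimization itself); the function name and toy numbers are hypothetical.

```python
def elastic_net_objective(Z, X, beta, lam, alpha):
    """Evaluate Eq (7): squared-error loss plus the mixed ridge/LASSO penalty."""
    n = len(Z)
    residual_ss = 0.0
    for i in range(n):
        fitted = sum(X[i][j] * beta[j] for j in range(len(beta)))
        residual_ss += (Z[i] - fitted) ** 2
    ridge_term = sum(b * b for b in beta)   # squared L2 norm of beta
    lasso_term = sum(abs(b) for b in beta)  # L1 norm of beta
    penalty = lam * ((1.0 - alpha) / 2.0 * ridge_term + alpha * lasso_term)
    return residual_ss / (2.0 * n) + penalty

# Hypothetical toy data chosen so that beta fits Z exactly (zero loss),
# which leaves only the penalty term visible.
Z_toy = [1.0, 2.0, 3.0]
X_toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
beta_toy = [1.0, 2.0]

pure_lasso = elastic_net_objective(Z_toy, X_toy, beta_toy, lam=0.1, alpha=1.0)
pure_ridge = elastic_net_objective(Z_toy, X_toy, beta_toy, lam=0.1, alpha=0.0)
mixed = elastic_net_objective(Z_toy, X_toy, beta_toy, lam=0.1, alpha=0.5)
```

With α = 1 only the L1 term contributes and with α = 0 only the squared L2 term contributes, matching the LASSO and ridge limits described in the text.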

In practice, the specific regression model varies depending on the tool in use. For example, the leading TWAS tool PrediXcan [18] does not include the random effects of a mixed model, and many TWAS tools can also analyze summary statistics instead of subject-level genotypes [19]. The motivation of this work is to reveal the key issues of using gene expression as a mediator, so we must adopt a comparable framework across protocols. In other words, we do not intend to compare LMMs against linear regression, which would confound the comparison between GWAS and TWAS. Since LMMs are dominant in GWAS, we chose LMMs as the underlying model for all of the protocols we analyze, which allows us to compare them under an equivalent statistical framework. We believe that LMMs are a sensible approach for TWAS, since the random term can capture the genetic contributions of non-focal genes.

Closed-form derivation of NCP and power calculation

The non-centrality parameter (NCP) measures the distance between a non-central distribution and a central distribution under a specific alternative hypothesis. The NCP enables calculation of the probability of rejecting the null hypothesis, assuming the central distribution, when the alternative hypothesis is correct. As such, the NCP naturally allows the power of a statistical test to be determined in a closed form. We have developed the following method to derive the NCP for LMMs, which we believe is new to the literature.

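For a standard simple linear regression, the NCP of a t-test of the coefficient of the predictor variable can be derived similarly to a one-sample t-test statistic as follows: if $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ is a simple random sample, then the one-sample t-test statistic for evaluating the null hypothesis $H_0: \mu = \mu_0$ is

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} = \frac{\sqrt{n}(\bar{X} - \mu_0)/\sigma}{\sqrt{\dfrac{(n-1)S^2/\sigma^2}{n-1}}} \sim t_{n-1}, \tag{8}$$

where $\bar{X}$ and $S$ are the sample mean and (unbiased) sample standard deviation respectively. Under $H_0$, $\sqrt{n}(\bar{X} - \mu_0)/\sigma \sim N(0, 1)$ and $(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$, and thus $T \sim t_{n-1}$. Under the alternative hypothesis $H_a: \mu = \mu_a$, the test statistic $T = \dfrac{\sqrt{n}\left[(\bar{X} - \mu_a) + (\mu_a - \mu_0)\right]/\sigma}{\sqrt{\dfrac{(n-1)S^2/\sigma^2}{n-1}}}$ follows a non-central t distribution with NCP given by

$$\upsilon = \frac{\mu_a - \mu_0}{\sigma/\sqrt{n}}. \tag{9}$$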
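As a quick numerical illustration of (9), the sketch below computes the NCP for a one-sample t-test. The function name and the input values are hypothetical; the point is only that the NCP grows with the square root of the sample size for a fixed standardized effect.

```python
import math

def one_sample_t_ncp(mu_alt, mu_null, sigma, n):
    """NCP of the one-sample t-test, Eq (9): (mu_a - mu_0) / (sigma / sqrt(n))."""
    return (mu_alt - mu_null) / (sigma / math.sqrt(n))

# Hypothetical inputs: the same effect and noise level at two sample sizes.
ncp_small = one_sample_t_ncp(0.5, 0.0, 2.0, 16)  # 0.5 / (2 / 4)
ncp_large = one_sample_t_ncp(0.5, 0.0, 2.0, 64)  # 0.5 / (2 / 8)
```

Quadrupling the sample size doubles the NCP, which is why power rises with n even when the effect size is unchanged.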

To derive a closed-form NCP for LMMs, we convert the LMM to a linear regression without intercept by decorrelating the response variable and the predictors, a technique that has previously been applied to mixed models [41,42]. The procedure is as follows: we first fit the null model Yc = u+ε with no genetic variants, following an existing innovation for reducing the computational cost of repeatedly factorizing the GRM when analyzing many variants [35,42]. We then estimate $\sigma_g^2$ using the Newton-Raphson method detailed in Appendix B in S1 Text. Denoting the eigen decomposition of the GRM as $K_x = U_x \Lambda_x U_x^{-1}$, we construct the de-correlation matrix as

$$D_x = (\sigma_g^2 \Lambda_x + \sigma_e^2 I)^{-\frac{1}{2}} U_x^T. \tag{10}$$

By left multiplying both $X$ and $Y$ by $D_x$, and denoting $X^* = D_x X = (X_1^*, X_2^*, \ldots, X_n^*)^T$ and $Y^* = D_x Y = (Y_1^*, Y_2^*, \ldots, Y_n^*)^T$, the covariance structure in $Y^*$ is removed and a linear regression of $Y^*$ on $X^*$ is equivalent to the original LMM. A proof of the validity of this decorrelation structure is presented in Appendix C in S1 Text.
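The key step can be outlined in a few lines. This is only a sketch, assuming that $U_x$ is orthogonal (which holds for the eigenvectors of a real symmetric GRM, so that $U_x^{-1} = U_x^T$) and that the variance components are known; the full argument is the one in Appendix C in S1 Text.

$$
\begin{aligned}
\operatorname{Var}(Y) &= \sigma_g^2 K_x + \sigma_e^2 I = U_x\left(\sigma_g^2 \Lambda_x + \sigma_e^2 I\right)U_x^T, \\
\operatorname{Var}(Y^*) &= D_x \operatorname{Var}(Y)\, D_x^T \\
&= \left(\sigma_g^2 \Lambda_x + \sigma_e^2 I\right)^{-\frac{1}{2}} U_x^T U_x \left(\sigma_g^2 \Lambda_x + \sigma_e^2 I\right) U_x^T U_x \left(\sigma_g^2 \Lambda_x + \sigma_e^2 I\right)^{-\frac{1}{2}} = I.
\end{aligned}
$$

Under the same assumptions, the transformed model reads $Y^* = \beta_{j0} D_x \mathbf{1} + \beta_{j1} X_j^* + \varepsilon^*$ with $\varepsilon^* \sim N(0, I)$. The original intercept column becomes $D_x \mathbf{1}$, whose $i$th entry is the $i$th row sum of $D_x$, which is why the transformed regression has no conventional intercept.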

Based on the closed-form NCP for linear regression, we derive the estimated NCP of the LMM from (1), which is given by

$$\hat{\upsilon}_{Gj} = \frac{\sum_{i=1}^{n}\hat{X}_{ij}^{*}\hat{Y}_{i}^{*}\,\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2} - \sum_{i=1}^{n}\hat{Y}_{i}^{*}\hat{D}_{xi\cdot}\,\sum_{i=1}^{n}\hat{X}_{ij}^{*}\hat{D}_{xi\cdot}}{\sqrt{\sum_{i=1}^{n}(\hat{X}_{ij}^{*})^{2}\left(\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2}\right)^{2} - \left(\sum_{i=1}^{n}\hat{D}_{xi\cdot}\hat{X}_{ij}^{*}\right)^{2}\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2}}}, \tag{11}$$

where $\hat{X}_j^* = \hat{D}_x X_j = (\hat{X}_{1j}^*, \hat{X}_{2j}^*, \ldots, \hat{X}_{nj}^*)^T$, $\hat{Y}^* = \hat{D}_x Y = (\hat{Y}_1^*, \hat{Y}_2^*, \ldots, \hat{Y}_n^*)^T$, and $\hat{D}_{xi\cdot} = \sum_{j=1}^{n}\hat{D}_{xij}$. A proof of this expression of the NCP for LMMs is in Appendix D in S1 Text.

The above result allows us to derive the statistical power of the GWAS, emGWAS, and TWAS protocols. For GWAS, we use the Bonferroni-corrected significance level $\alpha_x = 0.05/n_x$ to account for multiple testing [43], where $n_x$ is the total number of SNPs. Throughout this paper, we use $f(t; \upsilon)$ to denote the probability density function of the non-central t distribution with $n-2$ degrees of freedom and NCP $\upsilon$. The statistical power of the $j$th SNP can then be estimated by $P_{Gj} = \int_{F_0^{-1}(1-\alpha_x)}^{+\infty} f(t; \hat{\upsilon}_{Gj})\,dt$ using the estimated NCP $\hat{\upsilon}_{Gj}$, where $F_0(t)$ is the cumulative distribution function of the central t distribution with $n-2$ degrees of freedom, and $F_0^{-1}(1-\alpha_x)$ gives the critical value for the central distribution. We directly implement this power computation in R via the function “pt”, which takes the critical value, NCP, and degrees of freedom as parameters.
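The two steps above (compute the NCP from decorrelated data, then convert it to power at a Bonferroni-corrected level) can be sketched in Python as follows. This is not the authors' R code. The NCP function follows the structure of (11), while the power function replaces the exact non-central t computation with a normal approximation, which is an assumption that is only reasonable when the degrees of freedom n − 2 are large (as with N = 2504). All function names and toy inputs are hypothetical.

```python
import math
from statistics import NormalDist

def lmm_ncp(x_star, y_star, d_row_sums):
    """NCP with the structure of Eq (11), given decorrelated x*, y* and row sums of D."""
    s_xy = sum(x * y for x, y in zip(x_star, y_star))
    s_dd = sum(d * d for d in d_row_sums)
    s_yd = sum(y * d for y, d in zip(y_star, d_row_sums))
    s_xd = sum(x * d for x, d in zip(x_star, d_row_sums))
    s_xx = sum(x * x for x in x_star)
    numerator = s_xy * s_dd - s_yd * s_xd
    denominator = math.sqrt(s_xx * s_dd ** 2 - s_xd ** 2 * s_dd)
    return numerator / denominator

def approx_power(ncp, n_tests, family_alpha=0.05):
    """One-sided power at the Bonferroni level, using a normal approximation.

    Assumption: the central and non-central t distributions are replaced by
    N(0, 1) and N(ncp, 1). The text instead uses the exact non-central t.
    """
    alpha = family_alpha / n_tests
    standard_normal = NormalDist()
    critical_value = standard_normal.inv_cdf(1.0 - alpha)
    return 1.0 - standard_normal.cdf(critical_value - ncp)

# Toy check: with D equal to the identity (no random effect), the row sums are
# all one and the NCP reduces to S_xy / sqrt(S_xx) for a unit-variance regression.
ncp_toy = lmm_ncp([0.0, 1.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 1.0])
power_null = approx_power(0.0, n_tests=1000)
power_weak = approx_power(2.0, n_tests=1000)
power_strong = approx_power(6.0, n_tests=1000)
```

When the NCP is zero the approximate power equals the corrected significance level, and it increases monotonically with the NCP, which mirrors the behavior of the exact computation.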

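For emGWAS, we assume that the powers of the expression-phenotype and genotype-expression regression models (4) and (5) are independent of each other. For the model $Y = \beta_{l0}\mathbf{1} + \beta_{l1}Z_l + u + \varepsilon$ from (4), we left multiply both sides of the equation by the estimated $\hat{D}_x$, so that the estimated NCP for the $l$th gene expression is given by

$$\hat{\upsilon}_{eZl} = \frac{\sum_{i=1}^{n}\hat{Z}_{il}^{*}\hat{Y}_{i}^{*}\,\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2} - \sum_{i=1}^{n}\hat{Y}_{i}^{*}\hat{D}_{xi\cdot}\,\sum_{i=1}^{n}\hat{Z}_{il}^{*}\hat{D}_{xi\cdot}}{\sqrt{\sum_{i=1}^{n}(\hat{Z}_{il}^{*})^{2}\left(\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2}\right)^{2} - \left(\sum_{i=1}^{n}\hat{D}_{xi\cdot}\hat{Z}_{il}^{*}\right)^{2}\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2}}}, \tag{12}$$

where $\hat{Z}_l^* = \hat{D}_x Z_l = (\hat{Z}_{1l}^*, \hat{Z}_{2l}^*, \ldots, \hat{Z}_{nl}^*)^T$. We use the significance level $\alpha_z = 0.05/n_z$ for each individual test, where $n_z$ is the total number of genes. The statistical power of detecting the $l$th gene expression is then estimated by $P_{eZl} = \int_{F_0^{-1}(1-\alpha_z)}^{+\infty} f(t; \hat{\upsilon}_{eZl})\,dt$. For the model from (5), we simply calculate the estimated NCP of the standard linear regression, which is

$$\hat{\upsilon}_{eXlk} = \frac{\sum_{i=1}^{n}(X_{ilk} - \bar{X}_{\cdot lk})Z_{il}}{\sqrt{\sum_{i=1}^{n}(X_{ilk} - \bar{X}_{\cdot lk})^{2}}\;\hat{\sigma}_{el}}, \tag{13}$$

where

$$\hat{\sigma}_{el} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}\left(Z_{il} - \bar{Z}_{l} - \hat{\beta}_{lk1}(X_{ilk} - \bar{X}_{\cdot lk})\right)^{2}}. \tag{14}$$

Again, we use the significance level $\alpha_{el} = 0.05/n_{el}$, where $n_{el}$ is the total number of local genetic variants in the $l$th gene, so that the power of detecting $X_{lk}$ is estimated by $P_{eXlk} = \int_{F_0^{-1}(1-\alpha_{el})}^{+\infty} f(t; \hat{\upsilon}_{eXlk})\,dt$. Since we assume the powers of (4) and (5) are independent, the power of detecting the $l$th gene and the $k$th variant in the $l$th gene simultaneously is given by $P_{eZl}P_{eXlk}$. If the independence assumption is violated, i.e., the powers of these two steps are positively correlated, then the estimated power for emGWAS will be conservative.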

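For TWAS, the NCP is estimated in a similar manner as the first step of emGWAS, i.e.

$$\hat{\upsilon}_{Tl} = \frac{\sum_{i=1}^{n}\hat{\hat{Z}}_{il}^{*}\hat{Y}_{i}^{*}\,\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2} - \sum_{i=1}^{n}\hat{Y}_{i}^{*}\hat{D}_{xi\cdot}\,\sum_{i=1}^{n}\hat{\hat{Z}}_{il}^{*}\hat{D}_{xi\cdot}}{\sqrt{\sum_{i=1}^{n}(\hat{\hat{Z}}_{il}^{*})^{2}\left(\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2}\right)^{2} - \left(\sum_{i=1}^{n}\hat{D}_{xi\cdot}\hat{\hat{Z}}_{il}^{*}\right)^{2}\sum_{i=1}^{n}\hat{D}_{xi\cdot}^{2}}}, \tag{15}$$

where the only difference between (12) and (15) is that $\hat{Z}_{l}^{*} = \hat{D}_x Z_{l}$ in (12) is replaced by $\hat{\hat{Z}}_{l}^{*} = \hat{D}_x \hat{Z}_{l}$ in (15). The significance level is again $\alpha_z = 0.05/n_z$ and the power is estimated by $P_{Tl} = \int_{F_0^{-1}(1-\alpha_z)}^{+\infty} f(t; \hat{\upsilon}_{Tl})\,dt$.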

Simulation of phenotype and expression

As the statistical power of each protocol depends on the magnitude of the genetic effect, we simulated input data at various effect sizes. While effect size depends on a combination of many factors, we chose to focus on the following three aspects. 1) We considered two genetic architectures: causality and pleiotropy (Fig 1). In the causality scenario, the contribution of genotype to phenotype is mediated through expression (Fig 1A), whereas in the pleiotropy scenario, genotype contributes to both expression and phenotype directly (Fig 1B). We did not consider the scenario where phenotype is causal to expression. 2) We considered the strength of three different variance components: trait heritability (the variance component of phenotype explained by genotype, denoted $h^2_{x \Rightarrow y}$), expression heritability (the variance component of expression explained by genotype, denoted $h^2_{x \Rightarrow z}$), and the phenotypic variance component explained by expression, denoted $h^2_{z \Rightarrow y}$ and abbreviated as PVX. 3) We also considered the number of genes contributing to phenotype and the number of local genetic variants contributing to expression.

Fig 1. Causality (A) and pleiotropy (B) scenarios for genotype (X), expression (Z) and phenotype (Y).

In all our simulations, we use real genotypes from the 1000 Genomes Project (N = 2504). Although there are multiple existing datasets containing both expressions and genotype, we chose to use simulated expressions instead as it is difficult to match real data exactly to desired properties such as expression heritability or the number of contributing genetic variants. By simulating expressions, we can perform a consistent power analysis across a comprehensive range of prespecified input conditions.

In the causality scenario, phenotypes were simulated with the following procedure. First, several genes ($n_{z\text{-sig}}$ = 4, 9, or 13) were selected as causal genes. For each gene (indexed by $l = 1, 2, \ldots, n_{z\text{-sig}}$), several common and independent genetic variants were selected as causal variants ($n_{z(l)\text{-sig}}$ between 4 and 9, MAF > 0.05, and $R^2 < 0.01$). A linear combination of local variants in the $l$th gene is generated to produce the expression values $Z_{(l)}$, and a linear combination of these gene expressions $Z$ is generated as the genomic contribution to phenotype. Note that at each step, we ensure the simulated linear combinations of variants and expressions match our desired values for expression heritability $h^2_{x \Rightarrow z}$ and PVX $h^2_{z \Rightarrow y}$ (Appendix E in S1 Text).
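One common way to hit a prespecified heritability in such a simulation is to rescale the noise term so that the genetic component explains exactly the target fraction of variance. The Python sketch below illustrates this idea for a single gene. It is an assumption about the general approach, not a reproduction of the procedure in Appendix E, and the function name, effect sizes, allele frequency and sample size are all hypothetical.

```python
import random
import statistics

def simulate_expression(genotypes, effects, h2, rng):
    """Return (genetic component, expression) with var(genetic)/var(total) tuned to h2.

    The noise is rescaled so that its variance equals var(genetic) * (1 - h2) / h2.
    The sample covariance between the genetic component and the noise is ignored.
    """
    genetic = [sum(g * b for g, b in zip(row, effects)) for row in genotypes]
    var_genetic = statistics.pvariance(genetic)
    noise = [rng.gauss(0.0, 1.0) for _ in genetic]
    var_noise = statistics.pvariance(noise)
    target_var_noise = var_genetic * (1.0 - h2) / h2
    scale = (target_var_noise / var_noise) ** 0.5
    expression = [g + scale * e for g, e in zip(genetic, noise)]
    return genetic, expression

# Hypothetical toy setup: 200 individuals, 5 causal variants with allele frequency 0.3.
rng = random.Random(0)
n_individuals, n_causal = 200, 5
genotypes = [
    [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_causal)]
    for _ in range(n_individuals)
]
effects = [1.0] * n_causal
target_h2 = 0.04
genetic, expression = simulate_expression(genotypes, effects, target_h2, rng)

# Realized ratio of genetic variance to genetic-plus-noise variance.
residual = [z - g for z, g in zip(expression, genetic)]
realized_h2 = statistics.pvariance(genetic) / (
    statistics.pvariance(genetic) + statistics.pvariance(residual)
)
```

The same rescaling can be applied a second time, with expressions in place of genotypes, to target a given PVX for the phenotype.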

In the pleiotropy scenario, we followed a similar procedure except that the phenotype Y was directly generated from a linear combination of genotypes, instead of expressions (Appendix F in S1 Text). Note that although the expressions Z and phenotype Y are unrelated by genuine biological causality, they are generated from the same genetic variants and are therefore statistically correlated. Therefore, if the trait heritability and expression heritability are sufficiently large, TWAS can still identify causal genes using the statistical correlation between genetic variants and expression.

We simulated both scenarios with expression heritability $h^2_{x \Rightarrow z}$ from the values (2.5%, 3%, 4%, 6%, 8%, 10%, 30%), and with trait heritability $h^2_{x \Rightarrow y}$ in the pleiotropy scenario or PVX $h^2_{z \Rightarrow y}$ in the causality scenario from the values (0.5%, 1%, 2.5%, 5%, 10%). Although we initially tested more extreme values, our Results show that the turning points where TWAS outperforms GWAS are well within the range of values presented here, and the relative performance of the protocols remains consistent under more extreme conditions. We therefore chose to restrict our discussion to the most relevant values for protocol selection, noting that the expression heritability values we examine are at the high end of real observed values [18], while the trait heritability values are lower than typically found in GWAS.

Finally, as each simulation involves multiple variants and genes, the overall power of each protocol is defined as follows: the power of GWAS is the probability of detecting at least one causal variant in any causal gene, the power of emGWAS is the probability of detecting at least one gene and one local SNP of that gene simultaneously, and the power of TWAS is the probability that at least one predicted gene expression is significant. Specifically,

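$$P_{GWAS} = 1 - \prod_{j=1}^{n_{x\text{-sig}}}\left(1 - P_{G(j)}\right), \tag{16}$$

$$P_{emGWAS} = 1 - \prod_{l=1}^{n_{z\text{-sig}}}\left(1 - P_{eZ(l)}P_{eX(l)}\right), \quad \text{where } P_{eX(l)} = 1 - \prod_{k=1}^{n_{z(l)\text{-sig}}}\left(1 - P_{eX(l)(k)}\right), \tag{17}$$

$$P_{TWAS} = 1 - \prod_{l=1}^{n_{z\text{-sig}}}\left(1 - P_{T(l)}\right), \tag{18}$$

where $n_{x\text{-sig}}$, $n_{z\text{-sig}}$ and $n_{z(l)\text{-sig}}$ denote the numbers of significant SNPs, genes, and SNPs in the $l$th significant gene respectively, $G(j)$ denotes the $j$th significant SNP identified by GWAS, $Z(l)$ and $X(l)(k)$ denote the $l$th significant gene and the $k$th significant SNP of the $l$th significant gene identified by emGWAS, and $T(l)$ denotes the $l$th significant gene identified by TWAS.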
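The aggregation in (16) to (18) is a "probability of at least one detection" calculation under independence. A minimal Python sketch, with hypothetical function names and toy per-test powers, is:

```python
from math import prod

def power_at_least_one(powers):
    """1 - prod(1 - P): probability that at least one independent test succeeds.

    This is the form shared by Eqs (16) and (18).
    """
    return 1.0 - prod(1.0 - p for p in powers)

def power_emgwas(gene_powers, snp_powers_per_gene):
    """Eq (17): each gene must be detected together with at least one of its SNPs."""
    joint_powers = []
    for p_gene, snp_powers in zip(gene_powers, snp_powers_per_gene):
        p_any_snp = power_at_least_one(snp_powers)
        joint_powers.append(p_gene * p_any_snp)
    return power_at_least_one(joint_powers)

# Hypothetical per-test powers for two genes with two causal SNPs each.
p_gwas = power_at_least_one([0.5, 0.5])
p_twas = power_at_least_one([0.2, 0.3])
p_emgwas = power_emgwas([1.0, 0.5], [[0.5, 0.5], [1.0, 0.0]])
```

In the emGWAS example the first gene contributes 1.0 × 0.75 and the second contributes 0.5 × 1.0, so the overall power is 1 − 0.25 × 0.5 = 0.875.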

Results

As a quality control measure, we compared the actual expression heritability and the mean $R^2$ of the predicted expressions (Table 1). As expected, the mean $R^2$ grows closer to the actual heritability value as expression heritability increases.

Table 1. Comparisons of $R^2$ of imputed gene expression under different levels of expression heritability and number of genetic variants.

Expression heritability ($h_1^2$) | Mean of $R^2$ | Sample standard deviation of $R^2$
0.025 | 0.007847616 | 0.007415877
0.03 | 0.01259302 | 0.008410582
0.04 | 0.02319834 | 0.009481371
0.06 | 0.04415579 | 0.01083593
0.08 | 0.06465895 | 0.01175991
0.1 | 0.08518152 | 0.01264175
0.3 | 0.2886779 | 0.01514781

Causality scenario

We first analyzed cases where expression heritability is high ($h^2_{x \Rightarrow z}$ = 0.1 or 0.3) but the PVX is low (Fig 2). Overall, emGWAS clearly outperforms both GWAS and TWAS by a large margin, and TWAS also generally outperforms GWAS. Note that although the PVX is low and favors GWAS, TWAS is still more powerful due to the high expression heritability, which shows that expression heritability affects the performance of TWAS more than the PVX. Consistent with intuition, we observed that GWAS and TWAS have higher power as expression heritability increases, whereas this increase is much smaller for emGWAS. The power of GWAS and emGWAS decreases as the number of causal genes grows, whereas TWAS is largely unaffected by the number of causal genes. This is also consistent with intuition since TWAS uses GReX ($\hat{Z}$) to aggregate genetic effects, avoiding the burden of multiple-testing correction.

Fig 2. Causality scenario when expression heritability is high and PVX is low.

The PVX is 0.005, 0.01, 0.025, and 0.05 in the four columns as indicated by the X-axis labels. The number of genes contributing to phenotype for (A), (B) and (C) are 4, 9, and 13 respectively. The expression heritability for the top and bottom rows of (A), (B) and (C) are 0.1 and 0.3 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

We then analyzed cases where the PVX is high, but expression heritability is relatively low ($h^2_{x \Rightarrow z}$ = 0.025, 0.03, 0.04 or 0.08). Evidently, emGWAS performs best with powers consistently at 1.0. The comparison between TWAS and GWAS is more nuanced: TWAS underperforms GWAS when the expression heritability is 0.025 or 0.03 (Fig 3A and 3B), begins to outperform GWAS when the expression heritability is 0.04 (Fig 3C), and clearly outperforms GWAS when the expression heritability is 0.08 (Fig 3D). This quantifies an important turning point in that GWAS is superior to TWAS when expression heritability is less than 0.04, even if PVX is high (favoring TWAS).

Fig 3. Causality scenario when expression heritability is low and PVX is high.

The PVX is 0.05 and 0.1 in the two columns as indicated by the X-axis labels. The numbers of genes contributing to phenotype in the left, middle and right panels are 4, 9, and 13 respectively. The expression heritability levels in (A), (B), (C) and (D) are 0.025, 0.03, 0.04, and 0.08 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

Pleiotropy scenario

Again, we first analyze cases where expression heritability is high and trait heritability is low (Fig 4). Unlike in the causality scenario, the power of emGWAS is very low compared to TWAS and GWAS. A potential explanation is that when the effect of genetic variants on phenotype is not mediated through expressions, the non-genetic effects within the actual expressions add noise to emGWAS predictions. In contrast, the elastic-net model in TWAS captures only the genetic component of expressions, meaning the predicted expressions are a more accurate model of the direct genetic effect on phenotype. While errors are unavoidable in the elastic-net training process (as revealed in Table 1), our results show that the loss of power due to non-genetic effects is overwhelmingly greater than the loss due to training errors. As in the causality scenario, TWAS generally outperforms GWAS except in the case where trait heritability is extremely low and the number of contributing genes is large, which is rare in practice. We therefore conclude that in both scenarios, TWAS has better power than GWAS when expression heritability is high.

Fig 4. Pleiotropy scenario when expression heritability is high and trait heritability is low.

The trait heritability is 0.005, 0.01, 0.025, and 0.05 in the four columns as indicated by the X-axis labels. The numbers of genes contributing to phenotype for (A), (B) and (C) are 4, 9, and 13 respectively. The expression heritability for the top and bottom rows of (A), (B) and (C) are 0.1 and 0.3 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

We finally analyze cases where expression heritability is low but trait heritability is high. Here, emGWAS continues to be the least powerful of the three protocols. As in the causality scenario, we again observe a turning point where TWAS outperforms GWAS: TWAS has lower power than GWAS when the expression heritability is 0.025 or 0.04 (Fig 5A and 5B), TWAS has comparable power when the expression heritability is 0.06 (Fig 5C), and TWAS outperforms GWAS when the expression heritability is 0.08 (Fig 5D).

Fig 5. Pleiotropy scenario when expression heritability is low and trait heritability is high.

The trait heritability is 0.05 and 0.1 in the two columns as indicated by the X-axis labels. The numbers of genes contributing to phenotype for the left, middle and right panels are 4, 9, and 13 respectively. The expression heritability levels in (A), (B), (C) and (D) are 0.025, 0.04, 0.06, and 0.08 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

Our results can be summarized in two observations. First, emGWAS outperforms TWAS and GWAS in the causality scenario, but is less powerful in the pleiotropy scenario regardless of the accuracy of the predicted expressions (Table 1). This demonstrates that when non-genetic components in expression do not contribute to phenotype (i.e. pleiotropy scenario), predicted expressions capture genetic contributions better than actual expressions (which include non-genetic components). Second, the turning point at which traditional GWAS outperforms TWAS is an expression heritability of less than 0.04 in the causality scenario, or 0.06 in the pleiotropy scenario.

These turning points are immediately relevant to the practical conduct of association mapping studies, as shown by the following analysis of expression heritability in existing TWAS publications. As few publications disclose their estimated expression heritability, we use published R2 values of the correlation between predicted and actual expressions to approximate the underlying expression heritability. We use the difference between expression heritability and R2 as calculated from our simulations (Table 1) to map these R2 values to an estimated expression heritability (i.e. R2 of 0.023 and 0.044 give expression heritability values 0.04 and 0.06, respectively), although in practice the true difference may vary depending on the predictive model used in each study. Table 1 of the PrediXcan publication lists significant results from their paper, in which 14 out of 41 discovered genes have R2 values less than 0.044, with 2 values less than 0.023. Additionally, our review of recent TWAS publications shows that most of the genes presented have mean R2 values less than 0.044 or 0.023 (Table 2). As our power analysis indicated, GWAS may have better power than TWAS given these low expression heritability conditions. Although we are unable to determine if the genes discovered by these publications follow the causality or pleiotropy scenario, other advanced statistical models [44] may be used to determine appropriate thresholds to distinguish between pleiotropy and causality.

Table 2. Mean R² in published TWAS projects.

Title of the publication | Description of prediction accuracy
Large-scale transcriptome-wide association study identifies new prostate cancer risk regions [22] | The mean R² = 0.07 for measured and predicted gene expression for TCGA normal prostate samples using models fitted in GTEx normal prostate.
A framework for transcriptome-wide association studies in breast cancer in diverse study populations [45] | The median CV R² for the 153 genes is 0.011 in both African American and white women.
Evaluation of PrediXcan for prioritizing GWAS associations and predicting gene expression [46] | The average prediction accuracy (R²) is 0.023 for the DGN model and 0.02 for the GTEx model, both using whole-blood models.
A gene-based association method for mapping traits using reference transcriptome data [18] | The average prediction R² value is 0.0197 for GEUVADIS LCLs. For GTEx tissues, the prediction R² values are 0.0367 (adipose), 0.0358 (tibial artery), 0.0356 (left-ventricular heart), 0.0359 (lung), 0.0269 (muscle), 0.0422 (tibial nerve), 0.0374 (sun-exposed skin), 0.0398 (thyroid) and 0.0458 (whole blood).
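
The comparison described above can be sketched in a few lines of Python. This is only an illustration: it takes a selection of the R² values quoted in Table 2 and flags those falling strictly below the two R² cut-offs (0.023 and 0.044) that our simulations map to expression heritabilities of 0.04 and 0.06. The dictionary labels and the helper name `flag_low_r2` are hypothetical and do not come from any of the cited studies.

```python
# Illustrative sketch only: R2 values are those quoted in Table 2;
# the labels and the function name are hypothetical.
R2_CAUSALITY = 0.023   # R2 mapped to expression heritability 0.04
R2_PLEIOTROPY = 0.044  # R2 mapped to expression heritability 0.06

table2_r2 = {
    "prostate, TCGA/GTEx [22]": 0.07,
    "breast cancer, median CV [45]": 0.011,
    "DGN whole blood [46]": 0.023,
    "GTEx whole blood [46]": 0.02,
    "GEUVADIS LCLs [18]": 0.0197,
    "GTEx tibial nerve [18]": 0.0422,
    "GTEx whole blood [18]": 0.0458,
}

def flag_low_r2(r2):
    """Report which cut-offs a published R2 falls strictly below."""
    return {
        "below_causality_cutoff": r2 < R2_CAUSALITY,
        "below_pleiotropy_cutoff": r2 < R2_PLEIOTROPY,
    }

flags = {name: flag_low_r2(r2) for name, r2 in table2_r2.items()}
for name, result in flags.items():
    print(f"{name}: {result}")
```

The comparison is strict, so a value exactly equal to a cut-off (such as 0.023 for the DGN model) is not flagged; as noted above, the true mapping between R² and heritability may differ between predictive models.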

In summary, we suggest the following modifications to the TWAS protocol. First, estimate expression heritability in the reference panel and filter out genes with expression heritability less than 0.04. Second, after conducting TWAS association mapping, determine the underlying scenario (causality or pleiotropy) in order to choose an appropriate expression heritability threshold (0.04 or 0.06, respectively). Finally, conduct GWAS for each gene with an expression heritability below the chosen threshold.
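
A minimal sketch of this decision rule is given below. The thresholds are the ones reported in this work; the function name `choose_protocol`, its arguments, and the example heritability values are hypothetical. When the scenario has not yet been determined, the sketch applies only the lower (0.04) pre-filter, which is one possible reading of the first step.

```python
# Thresholds reported in this work; all names below are hypothetical.
H2_THRESHOLD = {"causality": 0.04, "pleiotropy": 0.06}

def choose_protocol(expr_h2, scenario=None):
    """Suggest GWAS or TWAS for one gene from its expression heritability.

    scenario is "causality", "pleiotropy", or None if not yet determined;
    in the last case only the lower (0.04) pre-filter is applied.
    """
    if scenario is None:
        threshold = min(H2_THRESHOLD.values())
    else:
        threshold = H2_THRESHOLD[scenario]
    return "GWAS" if expr_h2 < threshold else "TWAS"

# Made-up heritability estimates for three genes
genes = {"geneA": 0.03, "geneB": 0.05, "geneC": 0.20}
for gene, h2 in genes.items():
    print(gene, choose_protocol(h2), choose_protocol(h2, "pleiotropy"))
```

In this example a gene with expression heritability 0.05 passes the pre-filter but would still be routed to GWAS if it is later judged to follow the pleiotropy scenario.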

Application to the power estimation of EpiXcan

Our NCP-based framework can be applied to estimate the power of other protocols. To demonstrate this point, we estimated the power of EpiXcan [27], a novel TWAS-like protocol that integrates epigenetic functional annotations to improve the accuracy of predicted expressions and therefore overall TWAS power. The original EpiXcan paper demonstrated that (1) the predictive accuracy of expressions is significantly increased, and (2) EpiXcan enabled the discovery of novel genes [27]. We present here the first rigorous power analysis of EpiXcan. We first conduct simulations in which a subset of SNPs are assigned increased effects, reflecting the main insight of the EpiXcan paper that epigenetically relevant functional SNPs have a higher impact on variation in gene expression. In particular, we assume the real effect size follows a standard normal distribution N(0,1), and sample effect sizes from this distribution. Assuming these functional SNPs are known (based on various techniques for annotating SNP functions), we relieve their penalty in training the predictive model. Using the predicted expressions, we calculate power using our derived NCP, and compare the resulting analysis with the standard TWAS protocol. S1–S4 Figs depict this quantitative evaluation of the improvement in power due to the contribution of epigenetically relevant functional SNPs. Evidently, under the causality model EpiXcan indeed increases power by improving expression predictions, and the improvement is more pronounced when expression heritability is low (S1 and S2 Figs). However, under the pleiotropy model, EpiXcan shows almost no increase in power over TWAS (S3 and S4 Figs). This observation suggests that when DNA mutations contribute to phenotype directly, the benefit of more accurate predictions for expressions may not be substantial. The source data for Figs 2–5 and S1–S4 Figs are included in S1 Data.
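
The step from an NCP to a power value can be illustrated as follows. This is a sketch under a stated assumption, namely that the test statistic follows a 1-degree-of-freedom chi-square distribution that becomes non-central with parameter `ncp` under the alternative; the NCP expressions themselves are those derived in the Methods and are not reproduced here. The function name and the two example NCP values are hypothetical.

```python
from statistics import NormalDist

def power_from_ncp(ncp, alpha=0.05):
    """Power of a 1-df chi-square test with non-centrality parameter ncp.

    Uses the identity that a 1-df non-central chi-square variable equals
    (Z + sqrt(ncp))**2, where Z is standard normal.
    """
    if ncp < 0:
        raise ValueError("ncp must be non-negative")
    z = NormalDist()
    crit = z.inv_cdf(1 - alpha / 2)  # square root of the chi-square critical value
    shift = ncp ** 0.5
    # P((Z + shift)^2 > crit^2) = P(Z > crit - shift) + P(Z < -crit - shift)
    return z.cdf(shift - crit) + z.cdf(-shift - crit)

# Hypothetical NCPs for a standard model and an annotation-informed model
for label, ncp in [("standard TWAS", 8.0), ("annotation-informed", 12.0)]:
    print(label, round(power_from_ncp(ncp), 4))
```

Because power is monotone in the NCP for a fixed significance level, any change in the predictive model that raises the NCP raises power, which is how the two protocols are compared above.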

Discussion

In this work, we produced a novel derivation of the NCP for LMMs based on the decorrelation procedure, allowing us to calculate closed-form estimates of statistical power for three protocols: GWAS, emGWAS, and TWAS. Our power analysis revealed two practical insights. First, in the pleiotropy scenario, the use of predicted expressions in TWAS is overwhelmingly more powerful than the use of actual expressions in emGWAS, regardless of the accuracy of the predicted expressions per se (Table 1). This suggests that even if real expressions can be experimentally determined, TWAS is still superior for the analysis of some genes. While this appears counterintuitive, in statistical terms it is a direct result of the lack of a causal relationship between expression and phenotype under pleiotropy. This result reinforces the key insight, as presented by some publications [18], that TWAS uses expression as an objective function to select a linear combination of genetic variants, rather than attempting to accurately predict expressions. We note that this is equivalent to denoising in the field of machine learning [47]. Second, expression heritability determines the relative power of TWAS and GWAS. When the expression heritability is lower than 0.04 (in the causality scenario) or 0.06 (in the pleiotropy scenario), GWAS outperforms TWAS despite not utilizing gene expression information. This suggests that in practice, TWAS may often be suboptimal when expression heritability is low (Table 2 and Table 1 in [18]), which can be mitigated by choosing the optimal association mapping protocol according to this work’s quantitative guidelines.
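
The denoising intuition can be reproduced with a toy simulation. The sketch below is not the simulation design used in this work: it assumes a single genetic component that drives both expression and trait, treats that component as a perfectly predicted expression, and uses made-up parameter values. All function and variable names are hypothetical.

```python
import random

def pearson_r2(x, y):
    """Squared Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def simulate_pleiotropy(n=20000, expr_h2=0.1, beta=1.0, seed=0):
    """Toy pleiotropy model: one genetic component affects expression and trait."""
    rng = random.Random(seed)
    # Genetic component of expression, with variance expr_h2
    g = [rng.gauss(0.0, expr_h2 ** 0.5) for _ in range(n)]
    # Actual expression = genetic part + non-genetic noise (total variance 1)
    expr = [gi + rng.gauss(0.0, (1 - expr_h2) ** 0.5) for gi in g]
    # Trait depends on the genetic part only, not on the expression noise
    trait = [beta * gi + rng.gauss(0.0, 1.0) for gi in g]
    return pearson_r2(trait, g), pearson_r2(trait, expr)

r2_predicted, r2_actual = simulate_pleiotropy()
print(f"trait vs predicted expression R2: {r2_predicted:.4f}")
print(f"trait vs actual expression R2:    {r2_actual:.4f}")
```

In this toy model the expected ratio of the two squared correlations equals the expression heritability, so the non-genetic part of actual expression acts purely as noise with respect to the trait.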

A recent publication has also compared the statistical power of GWAS and TWAS using pure simulations [33]. However, since we calculate power from a closed-form NCP derivation, our work establishes theoretical benchmarks for the performance of each protocol, independent of their implementations. Our work also has a different focus: rather than comparing techniques for training the genotype-expression predictive model and the impact of the actual number of causal genetic variants, we rank the effectiveness of GWAS, TWAS and emGWAS to better guide the practical application of TWAS. We analyze the theoretical effectiveness of real expressions as utilized by emGWAS, but exclude the protocol eGWAS as analyzed in [33], which uses eQTLs to assist association mapping. Our conclusions also differ slightly: while the previous publication highlighted the importance of expression heritability, it concluded that expression heritability affects power only under the causality scenario, and not under pleiotropy. In contrast, we conclude that expression heritability affects power in both scenarios.

Finally, our closed-form derivation is readily adaptable to other methods utilizing middle ‘omics’ (endophenotypes) such as IWAS [28] and PWAS [29,34]. In fact, the variable Z in formula (15) can already represent such data as images or proteins, and thus no further modifications of the NCPs are necessary to adapt this work.

The present NCP framework focuses only on statistical power for detecting associations, and is not able to determine causality in the framework of Mendelian randomization (MR), such as in SMR and its extensions [48,49]. In future work, we may attempt to derive closed-form power analyses for the MR framework.

There are several limitations in the present study. Although our closed-form derivation is easily adaptable and works independently of specific implementations, it is unable to capture power loss due to implementation limitations or bias in specific datasets. Additionally, closed-form derivations are more sensitive to model assumptions than simulation-based methods. Our calculation of the NCP also requires the variance component σ_g² to be estimated from data, in order to form the decorrelation matrix Dx. Although this approximation introduces extra variability and may therefore cause a decrease in power, we have omitted this variability from our analyses as the estimation of σ_g² is generally well-established, and has high accuracy in practice when given thousands of samples. Finally, we only compared linear models for GWAS and TWAS. In future work, we may explore kernel-based nonparametric and semiparametric methods for conducting both GWAS [50,51] and TWAS [52].
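
For readers unfamiliar with the decorrelation step, a generic form is sketched below. This is the standard whitening construction for an LMM and is given under assumptions: K denotes a genetic relationship matrix, σ_e² a residual variance, V the implied covariance of the phenotype vector y, and D a generic decorrelation matrix. These symbols are introduced here for illustration only; the exact definition of Dx used in our derivation is the one given in the Methods.

```latex
V = \sigma_g^2 K + \sigma_e^2 I, \qquad
D = V^{-1/2}, \qquad
\operatorname{Cov}(D y) = D V D^{\top} = I .
```

Because D is built from the estimate of σ_g², any error in that estimate propagates into the decorrelated data, which is the source of the extra variability mentioned above.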

Supporting information

S1 Fig. Causality scenario of EpiXcan and TWAS when expression heritability is high and PVX is low.

The PVX (phenotypic variance explained by expression) is 0.005, 0.01, 0.025, and 0.05 in the four columns as indicated by the X-axis labels. In each of (a), (b), and (c), the expression heritability for the top and bottom rows are 0.1 and 0.3 respectively. The number of genes contributing to phenotype for (a), (b) and (c) are 4, 9, and 13 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

(TIFF)

S2 Fig. Causality scenario of EpiXcan and TWAS when expression heritability is low and PVX is high.

In each panel, the PVX is 0.05 and 0.1 in the left and right columns as indicated by the X-axis labels. In each of (a), (b), (c), and (d), the numbers of genes contributing to phenotype for the left, center, and right panels are 4, 9, and 13 respectively. The expression heritability levels in (a), (b), (c), and (d) are 0.025, 0.03, 0.04, and 0.08 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

(TIFF)

S3 Fig. Pleiotropy scenario of EpiXcan and TWAS when expression heritability is high and trait heritability is low.

The trait heritability is 0.005, 0.01, 0.025, and 0.05 in the four columns as indicated by the X-axis labels. In each of (a), (b), and (c), the expression heritability for the top and bottom panels are 0.1 and 0.3 respectively. The numbers of genes contributing to phenotype for (a), (b), and (c) are 4, 9, and 13 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

(TIFF)

S4 Fig. Pleiotropy scenario of EpiXcan and TWAS when expression heritability is low and trait heritability is high.

The PVX is 0.05 and 0.1 in the two columns as indicated by the X-axis labels. In each of (a), (b), (c), and (d), the number of genes contributing to phenotype for the left, center, and right panels are 4, 9, and 13 respectively. The expression heritability levels in (a), (b), (c), and (d) are 0.025, 0.04, 0.06, and 0.08 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

(TIFF)

S1 Data. Source data for Figs 2–5 and S1–S4 Figs.

The six columns are scenarios, protocols, number of genes, expression heritability, trait heritability and power respectively.

(XLSX)

S1 Text. Supplementary information and detailed mathematical derivations.

(DOCX)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

JW is supported by an NSERC Discovery Grant (RGPIN-2018-04328). QL is supported by an NSERC Discovery Grant (RGPIN-2017-04860), a Canada Foundation for Innovation JELF grant (36605), a New Frontiers in Research Fund (NFRFE-2018-00748), a Clinical Research Fund (10027289) and a Startup grant (10013532) supported by Alberta Children’s Hospital Research Institute (ACHRI). CC is supported by ACHRI postdoctoral scholarship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Trapnell C, Williams BA, Pertea G, Mortazavi A, Kwan G, van Baren MJ, et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat Biotechnol. 2010;28(5):511–5. Epub 2010/05/04. doi:10.1038/nbt.1621
2. Wang Z, Gerstein M, Snyder M. RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57–63. Epub 2008/11/19. doi:10.1038/nrg2484
3. McGettigan PA. Transcriptomics in the RNA-seq era. Curr Opin Chem Biol. 2013;17(1):4–11. Epub 2013/01/08. doi:10.1016/j.cbpa.2012.12.008
4. Ozsolak F, Milos PM. RNA sequencing: advances, challenges and opportunities. Nat Rev Genet. 2011;12(2):87–98. Epub 2010/12/31. doi:10.1038/nrg2934
5. Selevsek N, Chang CY, Gillet LC, Navarro P, Bernhardt OM, Reiter L, et al. Reproducible and consistent quantification of the Saccharomyces cerevisiae proteome by SWATH-mass spectrometry. Mol Cell Proteomics. 2015;14(3):739–49. Epub 2015/01/07. doi:10.1074/mcp.M113.035550
6. Pible O, Armengaud J. Improving the quality of genome, protein sequence, and taxonomy databases: a prerequisite for microbiome meta-omics 2.0. Proteomics. 2015;15(20):3418–23. Epub 2015/06/04. doi:10.1002/pmic.201500104
7. Bell AW, Deutsch EW, Au CE, Kearney RE, Beavis R, Sechi S, et al. A HUPO test sample study reveals common problems in mass spectrometry-based proteomics. Nat Methods. 2009;6(6):423–30. Epub 2009/05/19. doi:10.1038/nmeth.1333
8. Zhang A, Sun H, Wang P, Han Y, Wang X. Modern analytical techniques in metabolomics analysis. Analyst. 2012;137(2):293–300. Epub 2011/11/22. doi:10.1039/c1an15605e
9. Coats VC, Rumpho ME. The rhizosphere microbiota of plant invaders: an overview of recent advances in the microbiomics of invasive plants. Front Microbiol. 2014;5:368. Epub 2014/08/08. doi:10.3389/fmicb.2014.00368
10. Teperino R, Lempradl A, Pospisilik JA. Bridging epigenomics and complex disease: the basics. Cell Mol Life Sci. 2013;70(9):1609–21. Epub 2013/03/07. doi:10.1007/s00018-013-1299-z
11. Klein RJ, Zeiss C, Chew EY, Tsai JY, Sackler RS, Haynes C, et al. Complement factor H polymorphism in age-related macular degeneration. Science. 2005;308(5720):385–9. doi:10.1126/science.1109557
12. Ozaki K, Ohnishi Y, Iida A, Sekine A, Yamada R, Tsunoda T, et al. Functional SNPs in the lymphotoxin-alpha gene that are associated with susceptibility to myocardial infarction. Nat Genet. 2002;32(4):650–4. Epub 2002/11/12. doi:10.1038/ng1047
13. Mills MC, Rahal C. A scientometric review of genome-wide association studies. Commun Biol. 2019;2:9. Epub 2019/01/10. doi:10.1038/s42003-018-0261-x
14. Eddy S, Mariani LH, Kretzler M. Integrated multi-omics approaches to improve classification of chronic kidney disease. Nat Rev Nephrol. 2020. Epub 2020/05/20. doi:10.1038/s41581-020-0286-5
15. Hasin Y, Seldin M, Lusis A. Multi-omics approaches to disease. Genome Biol. 2017;18(1):83. Epub 2017/05/10. doi:10.1186/s13059-017-1215-1
16. Yan J, Risacher SL, Shen L, Saykin AJ. Network approaches to systems biology analysis of complex disease: integrative methods for multi-omics data. Brief Bioinform. 2018;19(6):1370–81. Epub 2017/07/07. doi:10.1093/bib/bbx066
17. Fukushima A, Kusano M, Redestig H, Arita M, Saito K. Integrated omics approaches in plant systems biology. Curr Opin Chem Biol. 2009;13(5–6):532–8. Epub 2009/10/20. doi:10.1016/j.cbpa.2009.09.022
18. Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet. 2015;47(9):1091–8. doi:10.1038/ng.3367
19. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BW, et al. Integrative approaches for large-scale transcriptome-wide association studies. Nat Genet. 2016;48(3):245–52. doi:10.1038/ng.3506
20. Barbeira AN, Dickinson SP, Bonazzola R, Zheng J, Wheeler HE, Torres JM, et al. Exploring the phenotypic consequences of tissue specific gene expression variation inferred from GWAS summary statistics. Nat Commun. 2018;9(1):1825. Epub 2018/05/10. doi:10.1038/s41467-018-03621-1
21. Gusev A, Mancuso N, Won H, Kousi M, Finucane HK, Reshef Y, et al. Transcriptome-wide association study of schizophrenia and chromatin activity yields mechanistic disease insights. Nat Genet. 2018;50(4):538–48. Epub 2018/04/11. doi:10.1038/s41588-018-0092-1
22. Mancuso N, Gayther S, Gusev A, Zheng W, Penney KL, Kote-Jarai Z, et al. Large-scale transcriptome-wide association study identifies new prostate cancer risk regions. Nat Commun. 2018;9(1):4079. Epub 2018/10/06. doi:10.1038/s41467-018-06302-1
23. Theriault S, Gaudreault N, Lamontagne M, Rosa M, Boulanger MC, Messika-Zeitoun D, et al. A transcriptome-wide association study identifies PALMD as a susceptibility gene for calcific aortic valve stenosis. Nat Commun. 2018;9(1):988. Epub 2018/03/08. doi:10.1038/s41467-018-03260-6
24. Gong L, Zhang D, Lei Y, Qian Y, Tan X, Han S. Transcriptome-wide association study identifies multiple genes and pathways associated with pancreatic cancer. Cancer Med. 2018;7(11):5727–32. Epub 2018/10/20. doi:10.1002/cam4.1836
25. Ratnapriya R, Sosina OA, Starostik MR, Kwicklis M, Kapphahn RJ, Fritsche LG, et al. Retinal transcriptome and eQTL analyses identify genes associated with age-related macular degeneration. Nat Genet. 2019;51(4):606–10. Epub 2019/02/12. doi:10.1038/s41588-019-0351-9
26. Atkins I, Kinnersley B, Ostrom QT, Labreche K, Il’yasova D, Armstrong GN, et al. Transcriptome-Wide Association Study Identifies New Candidate Susceptibility Genes for Glioma. Cancer Res. 2019;79(8):2065–71. Epub 2019/02/03. doi:10.1158/0008-5472.CAN-18-2888
27. Zhang W, Voloudakis G, Rajagopal VM, Readhead B, Dudley JT, Schadt EE, et al. Integrative transcriptome imputation reveals tissue-specific and shared biological mechanisms mediating susceptibility to complex traits. Nat Commun. 2019;10(1):3834. Epub 2019/08/25. doi:10.1038/s41467-019-11874-7
28. Xu Z, Wu C, Pan W, Alzheimer’s Disease Neuroimaging Initiative. Imaging-wide association study: Integrating imaging endophenotypes in GWAS. Neuroimage. 2017;159:159–69. Epub 2017/07/25. doi:10.1016/j.neuroimage.2017.07.036
29. Brandes N, Linial N, Linial M, editors. PWAS: Proteome-Wide Association Study 2020; Cham: Springer International Publishing.
30. Mancuso N, Shi H, Goddard P, Kichaev G, Gusev A, Pasaniuc B. Integrating Gene Expression with Summary Association Statistics to Identify Genes Associated with 30 Complex Traits. Am J Hum Genet. 2017;100(3):473–87. Epub 2017/02/28. doi:10.1016/j.ajhg.2017.01.031
31. Wainberg M, Sinnott-Armstrong N, Mancuso N, Barbeira AN, Knowles DA, Golan D, et al. Opportunities and challenges for transcriptome-wide association studies. Nat Genet. 2019;51(4):592–9. Epub 2019/03/31. doi:10.1038/s41588-019-0385-z
32. Mogil LS, Andaleon A, Badalamenti A, Dickinson SP, Guo X, Rotter JI, et al. Genetic architecture of gene expression traits across diverse populations. PLoS Genet. 2018;14(8):e1007586. Epub 2018/08/11. doi:10.1371/journal.pgen.1007586
33. Veturi Y, Ritchie MD. How powerful are summary-based methods for identifying expression-trait associations under different genetic architectures? Pac Symp Biocomput. 2018;23:228–39. Epub 2017/12/09.
34. Okada H, Ebhardt HA, Vonesch SC, Aebersold R, Hafen E. Proteome-wide association studies identify biochemical modules associated with a wing-size phenotype in Drosophila melanogaster. Nat Commun. 2016;7:12649. Epub 2016/09/02. doi:10.1038/ncomms12649
35. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, Freimer NB, et al. Variance component model to account for sample structure in genome-wide association studies. Nat Genet. 2010;42(4):348–54. Epub 2010/03/09. doi:10.1038/ng.548
36. Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MA, Bender D, et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am J Hum Genet. 2007;81(3):559–75. doi:10.1086/519795
37. Aulchenko YS, Ripke S, Isaacs A, van Duijn CM. GenABEL: an R library for genome-wide association analysis. Bioinformatics. 2007;23(10):1294–6. Epub 2007/03/27. doi:10.1093/bioinformatics/btm108
38. Gogarten SM, Bhangale T, Conomos MP, Laurie CA, McHugh CP, Painter I, et al. GWASTools: an R/Bioconductor package for quality control and analysis of genome-wide association studies. Bioinformatics. 2012;28(24):3329–31. Epub 2012/10/12. doi:10.1093/bioinformatics/bts610
39. Zhou X, Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat Genet. 2012;44(7):821–4. Epub 2012/06/19. doi:10.1038/ng.2310
40. Patterson N, Price AL, Reich D. Population structure and eigenanalysis. PLoS Genet. 2006;2(12):e190. Epub 2006/12/30. doi:10.1371/journal.pgen.0020190
41. Rakitsch B, Lippert C, Stegle O, Borgwardt K. A Lasso multi-marker mixed model for association mapping with population structure correction. Bioinformatics. 2013;29(2):206–14. Epub 2012/11/24. doi:10.1093/bioinformatics/bts669
42. Long Q, Zhang Q, Vilhjalmsson BJ, Forai P, Seren U, Nordborg M. JAWAMix5: an out-of-core HDF5-based java implementation of whole-genome association studies using mixed models. Bioinformatics. 2013;29(9):1220–2. Epub 2013/03/13. doi:10.1093/bioinformatics/btt122
43. Shaffer JP. Multiple hypothesis testing. Annu Rev Psychol. 1995;46(1):561–84.
44. Schadt EE, Lamb J, Yang X, Zhu J, Edwards S, Guhathakurta D, et al. An integrative genomics approach to infer causal associations between gene expression and disease. Nat Genet. 2005;37(7):710–7. doi:10.1038/ng1589
45. Bhattacharya A, Garcia-Closas M, Olshan AF, Perou CM, Troester MA, Love MI. A framework for transcriptome-wide association studies in breast cancer in diverse study populations. Genome Biol. 2020;21(1):42. Epub 2020/02/23. doi:10.1186/s13059-020-1942-6
46. Li B, Verma SS, Veturi YC, Verma A, Bradford Y, Haas DW, et al. Evaluation of PrediXcan for prioritizing GWAS associations and predicting gene expression. Pac Symp Biocomput. 2018;23:448–59. Epub 2017/12/09.
47. Tian C, Fei L, Zheng W, Xu Y, Zuo W, Lin C-W. Deep learning on image denoising: An overview. arXiv preprint arXiv:1912.13171. 2019.
48. Zhu Z, Zhang F, Hu H, Bakshi A, Robinson MR, Powell JE, et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat Genet. 2016;48(5):481–7. Epub 2016/03/29. doi:10.1038/ng.3538
49. Hauberg ME, Zhang W, Giambartolomei C, Franzen O, Morris DL, Vyse TJ, et al. Large-Scale Identification of Common Trait and Disease Variants Affecting Gene Expression. Am J Hum Genet. 2017;101(1):157. Epub 2017/07/08. doi:10.1016/j.ajhg.2017.06.003
50. Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. Epub 2011/07/09. doi:10.1016/j.ajhg.2011.05.029
51. Wu MC, Kraft P, Epstein MP, Taylor DM, Chanock SJ, Hunter DJ, et al. Powerful SNP-set analysis for case-control genome-wide association studies. Am J Hum Genet. 2010;86(6):929–42. doi:10.1016/j.ajhg.2010.05.002
52. Cao C, Kwok D, Edie S, Li Q, Ding B, Kossinna P, et al. kTWAS: integrating kernel-machine with transcriptome-wide association studies improves statistical power and reveals novel genes. bioRxiv. 2020. doi:10.1093/bib/bbaa270

Decision Letter 0

David Balding, Xiaofeng Zhu

5 Dec 2020

* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. *

Dear Dr Long,

Thank you very much for submitting your Research Article entitled 'Power analysis of transcriptome-wide association study: implications for practical protocol choice' to PLOS Genetics.

The manuscript was fully evaluated at the editorial level and by independent peer reviewers. The reviewers appreciated the attention to an important topic but identified some concerns that we ask you to address in a revised manuscript.

We therefore ask you to modify the manuscript according to the review recommendations. Your revisions should address the specific points made by each reviewer.

We would like to see some detailed comparisons with other methods (Epixcan and SMR, reviewer 1) and address the independence of the outcome model and expression model (reviewer 2).

In addition we ask that you:

1) Provide a detailed list of your responses to the review comments and a description of the changes you have made in the manuscript.

2) Upload a Striking Image with a corresponding caption to accompany your manuscript if one is available (either a new image or an existing one from within your manuscript). If this image is judged to be suitable, it may be featured on our website. Images should ideally be high resolution, eye-catching, single panel square images. For examples, please browse our archive. If your image is from someone other than yourself, please ensure that the artist has read and agreed to the terms and conditions of the Creative Commons Attribution License. Note: we cannot publish copyrighted images.

We hope to receive your revised manuscript within the next 30 days. If you anticipate any delay in its return, we would ask you to let us know the expected resubmission date by email to plosgenetics@plos.org.

Please let us know if you have any questions while making these revisions.

Yours sincerely,

Xiaofeng Zhu

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In this article, authors provided power analysis of TWAS to indicate practical protocol choice. It is a very interesting topic and the results are pretty interesting. PrediXcan and EpiXcan are cited and the methods performed TWAS to identify trait-associated transcriptomes. Just one minor suggestion:

Can the authors compare EpiXcan and the other related methods? Maybe another interesting paper using SMR could provide more insights: DOI: 10.1016/j.ajhg.2017.04.016. PMID: 28552197

Reviewer #2: In this manuscript, Ding and colleagues study the power of transcriptome-wide association studies (TWAS), providing practical guidelines based on the results. Specifically, the authors compared the power of TWAS with two alternative strategies: genome-wide association studies (GWAS) and association between measured gene expression levels and the phenotype of interest (termed emGWAS by the authors). The comparisons are meaningful and the consideration of emGWAS is thoughtful, as such data may become increasingly available in the near future. The authors considered two main scenarios: a causal or mediation scenario, where genetic variants exert their effect on phenotype via gene expression, and a pleiotropic scenario, where genetic variants simultaneously influence both phenotype and gene expression. There are many merits of the study, including analytical derivations to lay the theoretical foundation of power calculations, and carefully designed simulation studies to evaluate the power of TWAS and related approaches, which have clear implications in practice for real studies with or without measured expression data.

Generally speaking, the manuscript is well written with clearly presented methods and results.

I have the following specific questions and comments (all relatively minor) that can help further improve the manuscript.

(1) (Probably the only comment that is between major and minor): the authors make an independence assumption between the outcome model (4) and expression model (5), as explicitly stated in lines 243-244, and again in line 256. It is not clear to me what independence of two models mean exactly. It is also not clear what the consequences would be if this assumption is violated. The authors should at least make the former clear and discuss the latter.

(2) Lines 158-160, the authors stated “We calculate the GRM from DNA instead of expressions using the assumption that the ultimate goal is to identify genetic variants underlying expressions.” The logic here is awkward: the decision of using GRM based on genotypes is well warranted but not because the goal is to identify genetic variants underlying expression. To me, using genotypes to derive GRMs is justified because they provide reasonable and probably more accurate estimates of pairwise relationships among study participants than correlations based on predicted expression data.

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Yun Li

Decision Letter 1

David Balding, Xiaofeng Zhu

6 Feb 2021

Dear Dr Long,

We are pleased to inform you that your manuscript entitled "Power analysis of transcriptome-wide association study: implications for practical protocol choice" has been editorially accepted for publication in PLOS Genetics. Congratulations!

Before your submission can be formally accepted and sent to production you will need to complete our formatting changes, which you will receive in a follow up email. Please be aware that it may take several days for you to receive this email; during this time no action is required by you. Please note: the accept date on your published article will reflect the date of this provisional acceptance, but your manuscript will not be scheduled for publication until the required changes have been made.

Once your paper is formally accepted, an uncorrected proof of your manuscript will be published online ahead of the final version, unless you’ve already opted out via the online submission form. If, for any reason, you do not want an earlier version of your manuscript published online or are unsure if you have already indicated as such, please let the journal staff know immediately at plosgenetics@plos.org.

In the meantime, please log into Editorial Manager at https://www.editorialmanager.com/pgenetics/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production and billing process. Note that PLOS requires an ORCID iD for all corresponding authors. Therefore, please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field.  This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager.

If you have a press-related query, or would like to know about making your underlying data available (as you will be aware, this is required for publication), please see the end of this email. If your institution or institutions have a press office, please notify them about your upcoming article at this point, to enable them to help maximise its impact. Inform journal staff as soon as possible if you are preparing a press release for your article and need a publication date.

Thank you again for supporting open-access publishing; we are looking forward to publishing your work in PLOS Genetics!

Yours sincerely,

Xiaofeng Zhu

Associate Editor

PLOS Genetics

David Balding

Section Editor: Methods

PLOS Genetics

www.plosgenetics.org

Twitter: @PLOSGenetics

----------------------------------------------------

Comments from the reviewers (if applicable):

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: In the revision, the authors provided a good number of analyses to verify the power of the proposed method, for instance an analysis comparing its performance with that of other methods such as EpiXcan. The results are convincing and make the paper more complete. I think the paper is acceptable now.

Reviewer #2: The authors have carefully addressed all my comments. I have no further comments. The authors should be congratulated on presenting this very useful piece of work!

**********

Have all data underlying the figures and results presented in the manuscript been provided?

Large-scale datasets should be made available via a public repository as described in the PLOS Genetics data availability policy, and numerical data that underlies graphs or summary statistics should be provided in spreadsheet form as supporting information.

Reviewer #1: Yes

Reviewer #2: None

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Yun Li

----------------------------------------------------

Data Deposition

If you have submitted a Research Article or Front Matter that has associated data that are not suitable for deposition in a subject-specific public repository (such as GenBank or ArrayExpress), one way to make that data available is to deposit it in the Dryad Digital Repository. As you may recall, we ask all authors to agree to make data available; this is one way to achieve that. A full list of recommended repositories can be found on our website.

The following link will take you to the Dryad record for your article, so you won't have to re‐enter its bibliographic information, and can upload your files directly: 

http://datadryad.org/submit?journalID=pgenetics&manu=PGENETICS-D-20-01247R1

More information about depositing data in Dryad is available at http://www.datadryad.org/depositing. If you experience any difficulties in submitting your data, please contact help@datadryad.org for support.

Additionally, please be aware that our data availability policy requires that all numerical data underlying display items are included with the submission, and you will need to provide this before we can formally accept your manuscript, if not already present.

----------------------------------------------------

Press Queries

If you or your institution will be preparing press materials for this manuscript, or if you need to know your paper's publication date for media purposes, please inform the journal staff as soon as possible so that your submission can be scheduled accordingly. Your manuscript will remain under a strict press embargo until the publication date and time. This means an early version of your manuscript will not be published ahead of your final version. PLOS Genetics may also choose to issue a press release for your article. If there's anything the journal should know or you'd like more information, please get in touch via plosgenetics@plos.org.

Acceptance letter

David Balding, Xiaofeng Zhu

19 Feb 2021

PGENETICS-D-20-01247R1

Power analysis of transcriptome-wide association study: implications for practical protocol choice

Dear Dr Long,

We are pleased to inform you that your manuscript entitled "Power analysis of transcriptome-wide association study: implications for practical protocol choice" has been formally accepted for publication in PLOS Genetics! Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out or your manuscript is a front-matter piece, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Genetics and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Alice Ellingham

PLOS Genetics

On behalf of:

The PLOS Genetics Team

Carlyle House, Carlyle Road, Cambridge CB4 3DN | United Kingdom

plosgenetics@plos.org | +44 (0) 1223-442823

plosgenetics.org | Twitter: @PLOSGenetics

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Causality scenario of EpiXcan and TWAS when expression heritability is high and PVX is low.

    The PVX (phenotypic variance explained by expression) is 0.005, 0.01, 0.025, and 0.05 in the four columns as indicated by the X-axis labels. In each of (a), (b), and (c), the expression heritability values for the top and bottom rows are 0.1 and 0.3 respectively. The numbers of genes contributing to phenotype for (a), (b), and (c) are 4, 9, and 13 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

    (TIFF)

    S2 Fig. Causality scenario of EpiXcan and TWAS when expression heritability is low and PVX is high.

    In each panel, the PVX is 0.05 and 0.1 in the left and right columns as indicated by the X-axis labels. In each of (a), (b), (c), and (d), the numbers of genes contributing to phenotype for the left, center, and right panels are 4, 9, and 13 respectively. The expression heritability levels in (a), (b), (c), and (d) are 0.025, 0.03, 0.04, and 0.08 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

    (TIFF)

    S3 Fig. Pleiotropy scenario of EpiXcan and TWAS when expression heritability is high and trait heritability is low.

    The trait heritability is 0.005, 0.01, 0.025, and 0.05 in the four columns as indicated by the X-axis labels. In each of (a), (b), and (c), the expression heritability for the top and bottom panels are 0.1 and 0.3 respectively. The numbers of genes contributing to phenotype for (a), (b), and (c) are 4, 9, and 13 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

    (TIFF)

    S4 Fig. Pleiotropy scenario of EpiXcan and TWAS when expression heritability is low and trait heritability is high.

    The PVX is 0.05 and 0.1 in the two columns as indicated by the X-axis labels. In each of (a), (b), (c), and (d), the numbers of genes contributing to phenotype for the left, center, and right panels are 4, 9, and 13 respectively. The expression heritability levels in (a), (b), (c), and (d) are 0.025, 0.04, 0.06, and 0.08 respectively. The number of causal variants per gene is randomly sampled from the interval [4,9].

    (TIFF)
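The parameter settings described in the S1 Fig caption can be enumerated as a small grid. The following is a minimal sketch only, not the authors' simulation code: the function and dictionary key names are hypothetical, and it assumes that "randomly sampled from the interval [4,9]" means a uniform draw of an integer with both endpoints included, which the caption does not state explicitly.

```python
import itertools
import random

# Values transcribed from the S1 Fig caption (causality scenario).
PVX_LEVELS = [0.005, 0.01, 0.025, 0.05]     # four columns
EXPRESSION_HERITABILITY = [0.1, 0.3]        # top and bottom rows
GENES_PER_PANEL = {"a": 4, "b": 9, "c": 13}  # panels (a), (b), (c)


def sample_causal_variant_counts(n_genes, rng, low=4, high=9):
    """Draw one causal-variant count per gene.

    Assumption (not stated in the caption): counts are integers drawn
    uniformly from [low, high], endpoints included.
    """
    return [rng.randint(low, high) for _ in range(n_genes)]


def s1_fig_grid(seed=0):
    """Return one dict per (panel, expression heritability, PVX) setting."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    grid = []
    for (panel, n_genes), h2, pvx in itertools.product(
        GENES_PER_PANEL.items(), EXPRESSION_HERITABILITY, PVX_LEVELS
    ):
        grid.append(
            {
                "panel": panel,
                "n_genes": n_genes,
                "expression_heritability": h2,
                "pvx": pvx,
                "causal_variants_per_gene": sample_causal_variant_counts(n_genes, rng),
            }
        )
    return grid


grid = s1_fig_grid()
print(len(grid))  # 3 panels x 2 rows x 4 columns = 24 settings
```

The same pattern would apply to S2 to S4 Figs by substituting the levels listed in their captions.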

    S1 Data. Source data for Figs 2–5 and S1–S4 Figs.

    The six columns are scenarios, protocols, number of genes, expression heritability, trait heritability and power respectively.

    (XLSX)
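For readers who want to work with the source data programmatically, the six columns listed above map naturally onto a typed record. This is a hedged sketch: the class, field, and function names are hypothetical, the column types (strings for scenario and protocol, an integer for number of genes, floats for the rest) are assumptions, and no values from the actual spreadsheet are reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class PowerRecord:
    """One row of S1 Data, in the column order given in the description."""
    scenario: str                    # e.g. causality or pleiotropy
    protocol: str                    # e.g. GWAS, TWAS, emGWAS
    n_genes: int                     # number of genes contributing to phenotype
    expression_heritability: float
    trait_heritability: float
    power: float


# A simulation setting is everything except the protocol and its power.
Setting = Tuple[str, int, float, float]


def best_protocol_per_setting(
    records: Iterable[PowerRecord],
) -> Dict[Setting, PowerRecord]:
    """For each setting, keep the record with the highest power."""
    best: Dict[Setting, PowerRecord] = {}
    for rec in records:
        key = (
            rec.scenario,
            rec.n_genes,
            rec.expression_heritability,
            rec.trait_heritability,
        )
        if key not in best or rec.power > best[key].power:
            best[key] = rec
    return best
```

Grouping by setting in this way makes it straightforward to ask which protocol is most powerful under a given combination of scenario, number of genes, and heritability levels.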

    S1 Text. Supplementary information and detailed mathematical derivations.

    (DOCX)

    Attachment

    Submitted filename: ResponseLetter_Jan11.pdf

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting Information files.


    Articles from PLoS Genetics are provided here courtesy of PLOS
