American Journal of Human Genetics. 2018 May 3;102(5):904–919. doi: 10.1016/j.ajhg.2018.03.019

A Mixed-Effects Model for Powerful Association Tests in Integrative Functional Genomics

Yu-Ru Su 1, Chongzhi Di 1, Stephanie Bien 1, Licai Huang 1, Xinyuan Dong 2, Goncalo Abecasis 3, Sonja Berndt 4, Stephane Bezieau 5, Hermann Brenner 6, Bette Caan 7, Graham Casey 8, Jenny Chang-Claude 9, Stephen Chanock 4, Sai Chen 10, Charles Connolly 1, Keith Curtis 1, Jane Figueiredo 21, Manish Gala 11, Steven Gallinger 12, Tabitha Harrison 1, Michael Hoffmeister 6, John Hopper 13, Jeroen R Huyghe 1, Mark Jenkins 13, Amit Joshi 14, Loic Le Marchand 15, Polly Newcomb 1,20, Deborah Nickerson 16, John Potter 1,20, Robert Schoen 17, Martha Slattery 18, Emily White 1,20, Brent Zanke 19, Ulrike Peters 1,20, Li Hsu 1,2,∗∗
PMCID: PMC5986723  PMID: 29727690

Abstract

Genome-wide association studies (GWASs) have successfully identified thousands of genetic variants for many complex diseases; however, these variants explain only a small fraction of the heritability. Recently, genetic association studies that leverage external transcriptome data have received much attention and shown promise for discovering novel variants. One such approach, PrediXcan, is to use predicted gene expression through genetic regulation. However, there are limitations in this approach. The predicted gene expression may be biased, resulting from regularized regression applied to moderately sample-sized reference studies. Further, some variants can individually influence disease risk through alternative functional mechanisms besides expression. Thus, testing only the association of predicted gene expression as proposed in PrediXcan will potentially lose power. To tackle these challenges, we consider a unified mixed effects model that formulates the association of intermediate phenotypes such as imputed gene expression through fixed effects, while allowing residual effects of individual variants to be random. We consider a set-based score testing framework, MiST (mixed effects score test), and propose two data-driven combination approaches to jointly test for the fixed and random effects. We establish the asymptotic distributions, which enable rapid calculation of p values for genome-wide analyses, and provide p values for fixed and random effects separately to enhance interpretability over GWASs. Extensive simulations demonstrate that our approaches are more powerful than existing ones. We apply our approach to a large-scale GWAS of colorectal cancer and identify two genes, POU5F1B and ATF1, which would have otherwise been missed by PrediXcan, after adjusting for all known loci.

Keywords: mixed-effects score test, functional annotation, expression quantitative trait locus, data-adaptive weight, variance component test, set-based association, genome-wide association study

Introduction

Colorectal cancer (CRC [MIM: 114500]) is the third most common cancer and the second leading cause of cancer deaths,1 yet it is one of the most preventable and treatable cancers if detected early. Genome-wide association studies (GWASs) have identified 56 colorectal cancer-associated common variants to date. This work has laid the groundwork for personalized medicine, in which individually tailored strategies can potentially be devised based on one’s own risk to prevent and treat the disease effectively.2 However, these variants explain only about 10% of the heritability.3 There remain many more genetic loci to be discovered. As is commonly done in GWASs, these variants were discovered by examining the association of disease risk with genetic variants one by one. Since CRC is a polygenic trait, genetic variants individually have small effects, which are difficult to detect through marginal association analysis even in large samples. There is increasing evidence to suggest that the loci harboring GWAS-identified risk variants and the variants correlated with the risk variants are enriched with regulatory elements demarked by regions of open chromatin and transcription factor binding sites.4, 5 Further, the expression of implicated genes in these loci is often linked to multiple regulatory elements.6, 7, 8 Therefore, it is natural to hypothesize that set-based association can potentially improve power by aggregating the effects of functionally related variants. Additionally, aggregating variants into sets reduces the burden of multiple testing by decreasing the number of tests performed.

Recently, many set-based tests have been proposed to evaluate the association of a set of variants with disease risk,9, 10, 11, 12 particularly when the tested allele is infrequently observed and therefore typically underpowered for weak marginal effects. Broadly speaking, these set-based tests can be grouped into two categories: (1) burden tests where the association with disease risk is tested for the overall effect of a weighted sum of variant alleles or dosages when working with imputed genotypes; and (2) variance component tests where the association is tested for the nullity of variance under random effects models on genetic variants.12 Variant weights used in burden tests are usually the minor allele frequency13 or estimated marginal effects,14 but could also be reflective of predicted functional scores like Combined Annotation Dependent Depletion.15 Variance component tests have also been extended to allow for correlation among genetic variants,9 yielding a combination of burden and random effects testing. These tests, though mainly developed for rare variants, can be straightforwardly applied to common genetic variants.

Substantial efforts have been devoted to defining regulatory genetic variation.16, 17, 18, 19, 20, 21, 22, 23 Such knowledge has been used extensively in understanding the mechanism of the GWAS loci discovered through marginal association analyses.24, 25, 26, 27, 28 It also has the promise to improve power for discovering novel variants through a targeted approach of combining GWASs with genome-wide transcriptome or other types of functional data such as methylation and chromatin.29, 30, 31, 32, 33, 34, 35, 36, 37, 38 For example, Gamazon et al.31 proposed a gene-based association method called PrediXcan that tests predicted gene expression through which genetic variation affects a phenotype. In this method, an external reference dataset that has jointly measured genotypes and tissue-specific gene expression is used to identify a set of variants that modulate transcript abundance of a gene. Like other transcriptome-wide association studies (TWASs),32 PrediXcan can be considered a weighted burden test, where each variant in a gene set is weighted by its additive allelic effect on expression. However, this approach does not take into account the potential effect of genetic variants beyond their effect through the expression of a specific gene. 
Complex trait loci typically map to regions of the genome clustered with regulatory elements that in turn have combinatorial effects on the expression of several target genes.39, 40 This has led to the hypothesis that multiple variants in a locus may be functional through their disruption of multiple regulatory elements.39 Furthermore, while variation in chromatin accessibility and transcription factor binding sites are the dominant mechanisms of modulating gene expression, regulatory variants can also impact disease risk through trans-acting transcriptional mechanisms, in which a variant modulates gene expression of one gene through another.41 Consequently, given the potential for variant predictors to be linked to variants with independent functional effects on transcription, splicing, or coding of one or more genes, PrediXcan is likely to capture only part of the total effect.

To overcome these issues, we consider a unified mixed effects model framework11 that explicitly models the effects of a set of genetic variants by two components. The first component is a fixed effect through weighted burden scores to reflect the effects of the set of variants on disease risk through intermediate phenotypes such as gene expression as described above. Other intermediate phenotypes can also be incorporated into the fixed effects, for example, genetically regulated methylation.42 The second component handles the residual effects of individual genetic variants beyond the effects of intermediate phenotypes on disease risk. These residual effects are treated as random effects, which are assumed to follow an arbitrary distribution with mean 0 and variance $\tau^2$. Under this model, a set of two asymptotically independent score statistics can be derived: one for the fixed effects and the other for the variance component test. Together they are generically referred to as MiSTs (mixed effects score tests). To test the global association, these two test statistics need to be combined; however, how to combine them efficiently is not well studied. In this paper, we propose two data-driven approaches, optimally weighted and adaptively weighted linear combinations, for combining these independent score statistics to capture the association signals from both sources. Because of the data-adaptive nature, the asymptotic distributions of combined tests do not follow straightforwardly. We establish the asymptotic distributions of these combined test statistics, allowing for rapid p value calculation in genome-wide analyses. Further, as the asymptotic distribution does not have a closed form, we thoroughly investigate various numerical approximations and provide guidance to obtaining accurate p values across a wide spectrum of significance levels. 
A useful feature of the MiST is that it reports p values for fixed and random effects separately, providing insight into the source of association signals. We have performed extensive simulations under various genetic structures for common variants for which linkage disequilibrium (LD) could play a substantial role. The simulation scenarios also include varying degrees of the association of disease risk attributed to the intermediate phenotypes. The extensive simulation results demonstrate that the combination test statistics have comparable or greater power than both the weighted burden (PrediXcan) and variance component test under all considered scenarios.

We apply the proposed tests to 11,470 case subjects and 11,649 control subjects from the Colon Cancer Family Registry (CCFR) and the Genetics and Epidemiology of Colorectal Cancer Consortium (GECCO), using the expression weights derived from the whole-blood transcriptomes and genomes of 922 subjects from the Depression Genes and Networks (DGN) study.17 We identify two genes associated with CRC risk at the genome-wide significance level even after adjusting for all known CRC loci, providing further understanding of the underlying mechanisms for colorectal cancer.

Material and Methods

A Hierarchical Model to Incorporate Functional Information

Consider outcome D, which can be binary (e.g., disease status) or continuous (e.g., blood pressure). We are interested in the association of a set of P variants $G = (G_1, \ldots, G_P)^T$ with outcome. Assuming there are d confounders X, a generalized linear regression model can be used to assess the association,

$g\{E(D \mid G, X)\} = \beta_0 + X\beta_X + \sum_{p=1}^{P} G_p \beta_p^G$, (Equation 1)

where g is a pre-specified link function. For example, g may be a logit function if D is binary and an identity function if D is continuous. The regression coefficients $\beta_0$, $\beta_X$, and $\{\beta_p^G, p = 1, \ldots, P\}$ are intercept, effects of d confounders, and effects of P genetic variants. We test the hypothesis of no association between G and D, i.e.,

$H_0: \beta_1^G = \cdots = \beta_P^G = 0$. (Equation 2)

A straightforward approach is to treat βpG as fixed effects and perform a likelihood ratio test with P degrees of freedom. However, since the likelihood ratio test is an omnibus test, it may suffer power loss when some knowledge is available about the alternatives. For example, if the association of a set of genetic variants in a gene with outcome is through gene expression, power can be gained by accounting for this knowledge using the predicted gene expression as shown in PrediXcan. As we have learned more about the functional annotations of genetic variants, it is conceivable that incorporating this knowledge will improve power for detecting the association.

To account for known functional annotations, we propose a hierarchical model by additionally modeling regression coefficients $\beta_p^G$, $p = 1, \ldots, P$, as a response variable and the functional annotations as covariates. Specifically, supposing $Z_p = (Z_1, \ldots, Z_R)^T$ is an $R \times 1$ vector of a priori attributes associated with the pth variant, the effect of $Z_p$ on $\beta_p^G$ is postulated by a linear regression model

$\beta_p^G = Z_p^T \gamma + \delta_p$, (Equation 3)

where $\gamma$ is an $R \times 1$ vector of coefficients and $\delta_p$ is the residual variant-specific effect that is not explained by the R attributes. We assume $\delta_p$ follows an arbitrary distribution with mean 0 and variance $\tau^2$ to leverage information across variants.

Plugging Equation 3 into the generalized linear model (Equation 1), the hierarchical model becomes

$g\{E(D \mid G, X)\} = \beta_0 + X\beta_X + \gamma^T \left( \sum_{p=1}^{P} Z_p G_p \right) + \sum_{p=1}^{P} \delta_p G_p$. (Equation 4)

Under this model, $\gamma$ quantifies the association between the outcome and the weighted burden scores $\sum_{p=1}^{P} Z_p G_p$ with the R attributes as weights. Note that $\gamma$ is a fixed effect and $\delta_p$ is a random effect; hence, the model is sometimes called a generalized mixed effects model.

Testing the hypothesis stated in Equation 2 is equivalent to testing for the nullity of $\gamma$ and $\tau^2$ simultaneously, i.e.,

$H_0: \gamma = 0 \text{ and } \tau^2 = 0$. (Equation 5)

This model parametrization is very general and includes, as special cases, the models under which the popular PrediXcan31 and the variance component SKAT12 tests were developed. If we specify $\tau^2 = 0$ and test for the nullity of $\gamma$, this is equivalent to the PrediXcan test. In contrast, if we have no a priori functional information for the variants, we can set $Z_p$ to 0, and the model then reduces to the modeling framework for SKAT.

MiST: Mixed Effects Score Test

We propose a mixed effects score test (MiST) under the hierarchical model (Equation 4) based on score statistics under the null for both parameters $\gamma$ and $\tau^2$, which represent the burden and the variance components, respectively. Consider a sample of N subjects. Let $D = (D_1, \ldots, D_N)^T$, $G = [G_1, \ldots, G_N]^T$, and $X = [X_1, \ldots, X_N]^T$ denote the observations of outcomes, genotypes, and confounders from the N subjects. Define $B = [B_1, \ldots, B_N]^T$, where $B_i = (B_1, \ldots, B_R)^T$ is an $R \times 1$ vector of (weighted) burden scores corresponding to a priori functional information $Z_p$, $p = 1, \ldots, P$. In MiST, we adopt the modification for deriving the score statistics11 such that the two score statistics for $\gamma$ and $\tau^2$ are asymptotically independent. This property substantially eases the development of test statistics for combining the two components, as shown later in this section, that are powerful for detecting the associated sets of variants.

Briefly, for the burden component $\gamma$, the score statistic is derived under $H_0$ as stated in Equation 5. However, the score statistic for the variance component $\tau^2$ is derived under a modified null $H_0: \tau^2 = 0$ with unconstrained $\gamma$. Let $\tilde{\mu}$ and $\hat{\mu}$ be the vectors of fitted values of D under the null models $H_0: \gamma = 0$ and $\tau^2 = 0$, and $H_0: \tau^2 = 0$, respectively. In addition, define $\Delta^{-1}$ as a diagonal matrix with diagonal $\tilde{\mu} \circ (1_N - \tilde{\mu})$, an element-wise product, where $1_N$ is an $N \times 1$ vector of 1s. The two score statistics $U_\gamma$ and $U_{\tau^2}$ for $\gamma$ and $\tau^2$ are

$U_\gamma = (D - \tilde{\mu})^T B V^{-1} B^T (D - \tilde{\mu})$, (Equation 6)
$U_{\tau^2} = (D - \hat{\mu})^T G G^T (D - \hat{\mu})$, (Equation 7)

where V is the estimated variance-covariance matrix of $B^T (D - \tilde{\mu})$. For ease of presentation, we present the detailed form of V in Appendix A.
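As an illustration only, the two score statistics can be sketched for a continuous outcome with the identity link. The detailed form of V is given in Appendix A and is not reproduced here, so the sketch assumes the standard linear-model variance of the score, $\hat{\sigma}^2 B^T (I - H) B$ with H the hat matrix of the intercept and confounders. The function names `residualize` and `mist_score_statistics` are hypothetical, and any dispersion scaling of the variance component statistic is left out.

```python
import numpy as np


def residualize(A, C):
    """Return the residuals of regressing A (vector or columns) on the columns of C."""
    coef, *_ = np.linalg.lstsq(C, A, rcond=None)
    return A - C @ coef


def mist_score_statistics(D, X, G, Z):
    """Score statistics in the spirit of Equations 6 and 7 (identity link).

    D: (N,) outcome; X: (N, d) confounders; G: (N, P) genotypes;
    Z: (P, R) a priori weights. The form of V below is an assumption.
    """
    N = D.shape[0]
    B = G @ Z                                  # (N, R) weighted burden scores
    X1 = np.column_stack([np.ones(N), X])      # intercept + confounders

    # Null model H0: gamma = 0 and tau^2 = 0 (confounders only).
    res0 = residualize(D, X1)                  # D - mu_tilde
    sigma2 = res0 @ res0 / (N - X1.shape[1])   # residual variance estimate
    B_res = residualize(B, X1)
    V = sigma2 * (B_res.T @ B_res)             # assumed Var{B^T (D - mu_tilde)}
    s = B.T @ res0
    U_gamma = float(s @ np.linalg.solve(V, s))

    # Modified null H0: tau^2 = 0 with gamma unconstrained.
    res1 = residualize(D, np.column_stack([X1, B]))   # D - mu_hat
    g = G.T @ res1
    U_tau2 = float(g @ g)                      # (D - mu_hat)^T G G^T (D - mu_hat)
    return U_gamma, U_tau2
```

Both statistics are unchanged when the weights Z are multiplied by a nonzero constant, which is a convenient sanity check for an implementation.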

Under the null hypothesis (Equation 5), $U_\gamma$ follows a $\chi^2_R$ distribution, and $U_{\tau^2}$ follows a weighted sum of independent $\chi^2_1$ distributions, $\sum_{p=1}^{P} \lambda_p \chi^2_{1,p}$, where the weights $\lambda_p$, $p = 1, \ldots, P$, are the eigenvalues of the quadratic-form matrix $(I_N - \mathbb{P}_2)^T \Delta^{-1/2} G G^T \Delta^{-1/2} (I_N - \mathbb{P}_2)$. The half form $(I_N - \mathbb{P}_2)^T \Delta^{-1/2} G$ is equivalent to the residuals of regressing the re-scaled genotypes $\Delta^{-1/2} G$ on the re-scaled intercept, confounders, and (weighted) burden scores $\Delta^{-1/2} [1_N \; X \; B]$. The detailed formula of the projection matrix $\mathbb{P}_2$ is provided in Appendix A. There is no analytical form for the distribution of a weighted sum of $\chi^2_1$ variables. As a result, various numerical algorithms have been proposed to calculate the p values. A detailed description of these algorithms and their comparison is presented in the section Numerical Consideration for p Value Calculation.
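The regression description of the half form suggests a direct way to obtain the weights. The sketch below is an assumption-laden illustration, not the authors' implementation: the function name `mixture_weights` is hypothetical, and `w` is taken to be the diagonal of the variance matrix used for re-scaling (for a binary outcome, fitted probabilities times their complements).

```python
import numpy as np


def mixture_weights(G, X, B, w):
    """Eigenvalue weights lambda_p for the null distribution of U_tau2.

    G: (N, P) genotypes; X: (N, d) confounders; B: (N, R) burden scores;
    w: (N,) diagonal used for re-scaling, e.g. mu * (1 - mu).
    """
    N = G.shape[0]
    root_w = np.sqrt(w)[:, None]
    C = root_w * np.column_stack([np.ones(N), X, B])   # re-scaled [1_N, X, B]
    G_scaled = root_w * G                               # re-scaled genotypes
    coef, *_ = np.linalg.lstsq(C, G_scaled, rcond=None)
    G_res = G_scaled - C @ coef                         # the "half form"
    # G_res^T G_res (P x P) shares its nonzero eigenvalues with the
    # N x N matrix G_res G_res^T and is much cheaper to decompose.
    lam = np.linalg.eigvalsh(G_res.T @ G_res)
    return np.clip(lam, 0.0, None)[::-1]                # descending, non-negative
```

Because the burden scores are linear combinations of the genotype columns, regressing them out removes R directions from G, so at most P − R of the returned weights are numerically nonzero.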

Powerful Tests for Combining Fixed and Random Effects Components

The power of the score tests based on individual components alone is highly sensitive to the relative contribution of signals from the two components. The appealing feature of MiST of the independent statistics for the two components has made it easy to combine the score statistics for fixed effects (burden) and random effects components and to form tests that are powerful uniformly across various scenarios, as would be common in a genome-wide association analysis.

The most widely used approach for combining independent test statistics is perhaps Fisher’s combination, which simply takes the sum of $-2\log(p\ \text{value})$ from the score test for each individual component. The combined test statistic follows a $\chi^2_4$ distribution. Although this simple method gains power when both the burden and variance components contribute to the association, it can lose power if only one of the two components shows association. This motivates us to develop alternative weighted combination methods such that when only one of the two components is associated with disease risk, the weight can better reflect the signal that comes from that particular component. It is worth noting that our combination is different from the usual meta-analysis framework, under which the individual components to be combined are from different studies but all test for the same parameter. In our setting, the two components in MiST test for different parameters and are estimated from the same study. Thus the usual way of determining optimal weights for combining test statistics using sample sizes or variances of parameter estimates in the meta-analysis does not readily apply in our framework. In the following, we propose two data-driven weighted combination methods: optimally weighted linear combination and adaptively weighted linear combination.
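For reference, Fisher’s combination of the two component p values can be sketched in a few lines. The function name `fisher_combination` is hypothetical; the calculation itself is the standard one described above.

```python
import math
from scipy.stats import chi2


def fisher_combination(p_gamma, p_tau2):
    """Fisher's combination of two independent p values (chi-square, 4 df)."""
    stat = -2.0 * (math.log(p_gamma) + math.log(p_tau2))
    return stat, float(chi2.sf(stat, df=4))
```

When only one component carries signal, for example `fisher_combination(1e-4, 0.9)`, the combined p value is larger than the smaller of the two inputs, which illustrates the power loss discussed above.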

An Optimally Weighted Linear Combination

A weighted linear sum of the two score statistics is perhaps the most straightforward combination approach. Consider $T_\rho = \rho U_\gamma + (1 - \rho) U_{\tau^2}$, where $\rho \in [0, 1]$ controls the contribution of the burden component. An intuitive approach to determining the optimal weight $\rho^*$ is to minimize the p value $p_\rho$ based on $T_\rho$, i.e., $\rho^* = \arg\min_{\rho \in [0,1]} p_\rho$. Although there is no analytical form of $\rho^*$ due to the complex expression of $p_\rho$, numerical optimization techniques can be applied to find $\rho^*$. This combination approach is termed oMiST, with “o” standing for “optimal.”

Denote the observed minimal p value as $p_{\rho^*}^{\mathrm{obs}}$, with $\rho^*$ the observed optimal weight. To account for the fact that the minimal p value is used, we calculate the p value of oMiST:

$\Pr(p_{\rho^*} \le p_{\rho^*}^{\mathrm{obs}}) = 1 - \Pr(P_\rho > p_{\rho^*}^{\mathrm{obs}}, \forall \rho \in [0,1]) = 1 - E\left\{ I\left( U_\gamma < Q_{U_\gamma}(1 - p_{\rho^*}^{\mathrm{obs}}) \right) \Pr\left( U_{\tau^2} < \min_{\rho \neq 1} \frac{1}{1 - \rho} \left[ Q_{T_\rho}(1 - p_{\rho^*}^{\mathrm{obs}}) - \rho U_\gamma \right] \,\Big|\, U_\gamma \right) \right\}$, (Equation 8)

where $Q_U(p)$ is the 100p-th percentile of the random variable U, and the expectation is evaluated with respect to $U_\gamma$. For a given $\rho$, the quantile $Q_{T_\rho}(1 - p_{\rho^*}^{\mathrm{obs}})$ has no analytical form; hence numerical approximations are often applied. Various numerical approximations described in the next section for calculating the p values can also be used for calculating the quantiles for the mixture of $\chi^2_1$ distributions. The expectation in Equation 8 can be obtained by numerical integration, which can be calculated quickly using fast algorithms.43
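The step from the minimum-p-value event to the conditional probability in Equation 8 can be spelled out as follows. This is a sketch reconstructed from the definitions above, writing $\rho^*$ for the minimizing weight and introducing $q_\rho$ as shorthand (not used in the original) for the quantile of $T_\rho$ at the observed minimal p value.

```latex
\begin{aligned}
p_\rho > p^{\mathrm{obs}}_{\rho^*}
  &\iff T_\rho < q_\rho,
  \qquad q_\rho := Q_{T_\rho}\!\left(1 - p^{\mathrm{obs}}_{\rho^*}\right), \\
\rho = 1:\quad
  & U_\gamma < Q_{U_\gamma}\!\left(1 - p^{\mathrm{obs}}_{\rho^*}\right), \\
\rho \in [0, 1):\quad
  & \rho\, U_\gamma + (1 - \rho)\, U_{\tau^2} < q_\rho
  \iff U_{\tau^2} < \frac{q_\rho - \rho\, U_\gamma}{1 - \rho}.
\end{aligned}
```

Requiring these inequalities for every $\rho$ at once gives the indicator for $\rho = 1$ and the minimum over $\rho \neq 1$. Because $U_\gamma$ and $U_{\tau^2}$ are asymptotically independent, the inner probability can be evaluated conditionally on $U_\gamma$ and then averaged over its $\chi^2_R$ distribution.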

An Adaptively Weighted Linear Combination

The adaptively weighted linear combination can be seen as a data-adaptive generalization of Fisher’s combination. The test statistic takes the form $T_a = \rho_\gamma Z_\gamma + \rho_\tau Z_{\tau^2}$, where $Z_\gamma = -2\log(p_\gamma)$ and $Z_{\tau^2} = -2\log(p_{\tau^2})$ are the log-transformed p values for the burden and the variance components, respectively. Under this combination, we propose that the two weights $\rho_\gamma$ and $\rho_\tau$ be determined in a data-adaptive way by $Z_\gamma$ and $Z_{\tau^2}$ via the following formulas:

$\rho_\gamma = \frac{Z_\gamma}{\sqrt{Z_\gamma^2 + Z_{\tau^2}^2}}$, (Equation 9)

and

$\rho_\tau = \frac{Z_{\tau^2}}{\sqrt{Z_\gamma^2 + Z_{\tau^2}^2}}$. (Equation 10)

Intuitively, the weights $\rho_\gamma$ and $\rho_\tau$ are the cosine and sine, respectively, of the angle between the direction of the observed $(Z_\gamma, Z_{\tau^2}) \in \mathbb{R}^2$ and the x axis. The test statistic can be further simplified and expressed as $T_a = \sqrt{Z_\gamma^2 + Z_{\tau^2}^2}$. This simplified form of $T_a$ provides insight into its asymptotic null distribution: $T_a^2$ is the sum of squares of two independent $\chi^2_2$ variables. Numerical integration is employed in the p value calculation at a low computational cost. We call the adaptively weighted linear combination test aMiST, with “a” for “adaptive.”
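One way the numerical integration might look is sketched below, under the assumption that the two transformed p values are independent $\chi^2_2$ variables under the null (each with density $e^{-z/2}/2$). The function names are hypothetical and the quadrature routine is a generic choice, not necessarily the one used by the authors.

```python
import math
from scipy.integrate import quad


def amist_statistic(p_gamma, p_tau2):
    """T_a = sqrt(Z_gamma^2 + Z_tau2^2) with Z = -2 log(p)."""
    z_g = -2.0 * math.log(p_gamma)
    z_t = -2.0 * math.log(p_tau2)
    return math.hypot(z_g, z_t)


def amist_pvalue(t_obs):
    """P(sqrt(Z1^2 + Z2^2) >= t_obs) for independent Z1, Z2 ~ chi-square(2)."""
    if t_obs <= 0.0:
        return 1.0

    def integrand(z):
        # density of Z1 at z, times P(Z2 > sqrt(t_obs^2 - z^2))
        gap = math.sqrt(max(t_obs ** 2 - z ** 2, 0.0))
        return 0.5 * math.exp(-0.5 * z) * math.exp(-0.5 * gap)

    inner, _ = quad(integrand, 0.0, t_obs)
    # Second term: Z1 alone already exceeds t_obs.
    return inner + math.exp(-0.5 * t_obs)
```

The result always lies between the p value of the larger single component squared-away bound, $1 - (1 - e^{-t/2})^2$, and the Fisher-type bound $e^{-t/2}(1 + t/2)$ evaluated at the same threshold, which gives a quick plausibility check.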

Numerical Consideration for p Value Calculation

Although it has been shown that the variance component test statistic $U_{\tau^2}$ follows a mixture of $\chi^2_1$ distributions under the null, there is no analytical form for the distribution of this mixture. To calculate the p values quickly without resorting to Monte Carlo simulation, various numerical algorithms have been proposed to calculate the cumulative distribution function. These include the Davies method,44 the moment matching method (Liu’s method),45 and the saddle point approximation.46 The Davies method inverts the characteristic function of the mixture of $\chi^2_1$ distributions to obtain the corresponding cumulative distribution function, and then uses numerical integration to calculate the p value. Liu’s method approximates the mixture distribution by a noncentral $\chi^2$ distribution that matches on the third moment and minimizes the difference of the fourth moment with the mixture of $\chi^2$ distributions. Liu’s method has also been modified to match on the fourth moment while minimizing the difference of the third moment, and it is shown to provide a more accurate approximation of extreme tail probabilities.9 The saddle point approach approximates the distribution function of $U_{\tau^2}$ at a given point by the distribution function of a standard normal variable evaluated at that point, which is determined by the cumulant generating function of $U_{\tau^2}$, its first two derivatives, and its saddle point. Both the moment matching and saddle point methods facilitate very fast computation of p values; however, this speed may come at the cost of inaccuracy in (extreme) tail probabilities.
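To make the moment matching idea concrete, the sketch below implements the skewness-matching variant for a central mixture with positive weights. It is an illustration with a hypothetical function name, not the fourth-moment variant compared in this paper and not the Davies method. For such a mixture the Cauchy–Schwarz inequality implies that the matched noncentrality parameter is zero, so the approximating distribution reduces to a central $\chi^2$ with fractional degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2


def liu_sf(t, lam):
    """Approximate P(sum_p lam_p * chi2_1 > t) by matching mean, variance, skewness."""
    lam = np.asarray(lam, dtype=float)
    c1, c2, c3 = (np.sum(lam ** k) for k in (1, 2, 3))
    dof = c2 ** 3 / c3 ** 2               # degrees of freedom matching the skewness
    mu_q, sigma_q = c1, np.sqrt(2.0 * c2)         # mean and sd of the mixture
    mu_x, sigma_x = dof, np.sqrt(2.0 * dof)       # mean and sd of chi-square(dof)
    t_star = (t - mu_q) / sigma_q * sigma_x + mu_x    # standardize, then map over
    return float(chi2.sf(t_star, dof))
```

When all weights are equal the approximation is exact, for example `liu_sf(7.0, [2.0] * 5)` agrees with `chi2.sf(3.5, 5)`; the inaccuracies discussed in this section arise when the weights are unequal and the threshold is far in the tail.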

Lumley et al.47 recently proposed a hybrid of the saddle point and moment matching numerical approximations to handle a large number of genetic variants so that there is no need to obtain all the eigenvalues. Briefly, the mixture of $\chi^2$ distributions is partitioned into two parts: a sum of $\chi^2_1$ terms for a few leading eigenvalues, and the remainder, which is approximated by a (non-central) $\chi^2$ distribution via moment matching. The saddle point approximation is then employed to obtain the p values for the sum of the two parts together.

Through our extensive simulation, we observe that various approximation methods may have a non-negligible impact on the type I error, especially at the genome-wide level. However, to our knowledge, there has not been a systematic investigation on the performance of different numerical methods at the extreme tails in the context of common genetic variants that are possibly in LD. To gain insight into the performance of these various approximation methods, we conducted a large-scale simulation study to compare their performance on type I error, particularly at the extreme tail.

We selected a gene from the expression quantitative trait loci (eQTL) database from Predictdb Data Repository (see Web Resources), which was obtained based on elastic net regularization regression using the whole-blood transcriptome data from Depression Genes and Networks (DGN).31 The gene has 213 eQTLs. We randomly selected 10,000 subjects with these variants from the CCFR and GECCO GWAS data so that the LD pattern reflects the underlying genetic structure, and generated a continuous phenotype with mean 0 and variance 1. We repeated the process 1 million times. We compared the Davies method, Liu’s 4th-moment matching method, the saddle point approximation, and Thomas Lumley’s hybrid approach with various fractions of leading eigenvalues (top 1%, top 10%, top 20%, and top 50%).

Simulation

We evaluated the finite-sample performance of MiST in comparison to both PrediXcan, which is the same as our score test for γ, and the variance component test for τ2 after adjusting for the burden component, which we term modified SKAT (mSKAT). We selected three genes from the eQTL database from Predictdb Data Repository. These three genes represent different genetic structures. Both CXCR1 (MIM: 146929) and C18orf32 are of moderate set size (42 and 38 genetic variants, respectively), while ARHGAP11A (MIM: 610589) is a larger set including 92 variants. In terms of the LD structure, both C18orf32 and ARHGAP11A show largely independent or weak correlation among variants, but CXCR1 contains several clusters of variants that are nearly perfectly correlated (Figures S1–S3). We used CCFR and GECCO GWAS data as the template, randomly selected subjects, and generated the continuous and binary outcomes using the generalized linear regression model, as described below. Each simulated dataset consists of 10,000 subjects, and for binary outcome, the dataset consists of 5,000 case subjects and 5,000 control subjects, which is about the same magnitude as our real data application.

Type I Error

We examine the type I error of PrediXcan, modified SKAT, and various combinations of MiST for both continuous and dichotomous outcomes. The phenotypes were generated under the null model with intercept only, where the identity link was used for the continuous outcome and the logit link was used for binary outcome. A total of 1 million simulated datasets were generated. We used the prediction weights obtained a priori from DGN to calculate the predicted genetically regulated gene expression or weighted burden score in MiST.

Power

We simulated the outcome under a wide range of scenarios for each of the three genes CXCR1, C18orf32, and ARHGAP11A. Specifically, we generated gene expression M by

$M = c_1 \sum_{p=1}^{P} G_p Z_p + e, \quad e \sim N(0, \sigma_e^2)$,

where $Z_p$, $p = 1, \ldots, P$, are the weights obtained from the elastic-net regularized regression using the DGN transcriptome data and are treated as known. Coefficients $c_1$ and $\sigma_e^2$ are chosen to achieve a desired $R^2$, i.e., the proportion of variation of gene expression M that is explained by variant predictors $G_p$, $p = 1, \ldots, P$, while keeping the total variance of M constant. We denote this $R^2$ by $v_1$ and select $v_1$ to be 0.25 to generate more realistic scenarios under which genetic regulatory variants explain only a modest amount of the variation in gene expression.

We then generated the outcome by the generalized linear model

$g\{E(D \mid G)\} = \beta_0 + c_3 M + c_4 r$,

where r was the residual of $\sum_{p=1}^{P} G_p \delta_p$, with $\delta_p \overset{\text{i.i.d.}}{\sim} N(0, \sigma_\delta^2)$, regressed on $\sum_{p=1}^{P} G_p Z_p$ to mimic the individual variant contributions that are not explained by the effect of genetically predicted expression. We used the identity link for the continuous outcome and the logit link for the binary outcome. Coefficients $c_3$ and $c_4$ were chosen to achieve the desired proportion of the total variability of $c_3 M + c_4 r$ that is contributed by gene expression M, while keeping the total variability constant. We denote this proportion as $v_2$ and vary the value of $v_2$ from 0 to 1 to evaluate how the signal contribution from the two components affects the power performance of various methods. When $v_2 = 1$, the explained variation for the outcome comes only from the gene expression. In contrast, when $v_2 = 0$, the gene expression does not contribute to the overall explained variation. Under this scenario, the variance component $\tau^2$ should dominate the signal. We generated 10,000 simulated datasets for each scenario.
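The data-generating steps above can be sketched for the continuous outcome as follows. This is a minimal illustration under stated assumptions: the function name `simulate_outcome`, the signal variance `total_var`, the unit noise variance of the outcome, and the zero intercept are hypothetical choices that are not taken from the paper, and the variance of M is fixed at 1.

```python
import numpy as np


def simulate_outcome(G, Z, v1=0.25, v2=0.5, total_var=0.01, sigma_delta=1.0, seed=0):
    """Simulate expression M and a continuous outcome D for one gene.

    G: (N, P) genotypes; Z: (P,) expression weights, treated as known.
    v1: share of Var(M) explained by the variants.
    v2: share of the signal c3*M + c4*r that comes from M.
    total_var: hypothetical variance of the signal (not from the paper).
    """
    rng = np.random.default_rng(seed)
    N, P = G.shape
    score = G @ Z
    score = score - score.mean()
    # Scale so that Var(c1 * score) = v1 and Var(e) = 1 - v1, i.e. R^2 is about v1.
    c1 = np.sqrt(v1 / score.var())
    M = c1 * score + rng.normal(scale=np.sqrt(1.0 - v1), size=N)

    # Residual variant effects, made orthogonal to the predicted expression.
    raw = G @ rng.normal(scale=sigma_delta, size=P)
    design = np.column_stack([np.ones(N), score])
    coef, *_ = np.linalg.lstsq(design, raw, rcond=None)
    r = raw - design @ coef

    # Split the signal variance between M and r according to v2 (approximately,
    # since M and r are only uncorrelated in expectation).
    c3 = np.sqrt(v2 * total_var / M.var())
    c4 = np.sqrt((1.0 - v2) * total_var / r.var())
    D = c3 * M + c4 * r + rng.normal(size=N)     # identity link, beta0 = 0
    return M, r, D, (c1, c3, c4)
```

Setting `v2=1.0` removes the residual variant contribution entirely, and `v2=0.0` removes the expression-mediated contribution, matching the two extreme scenarios described above.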

Results

Type I Error of the Variance Component Test with Various Numerical Approximations

Table 1 shows empirical type I error rates of the variance component test. Among all numerical approximations, the Davies method is the only one that keeps the correct type I error across various levels of significance. The moment matching method is conservative at more relaxed significance levels but becomes anti-conservative when the significance levels are more stringent. In contrast, the saddle point approximation exhibits the opposite pattern: it is conservative at the more stringent significance levels but anti-conservative at level 0.05. The hybrid approach provides a good approximation at more stringent levels, but the type I error at the more relaxed levels, e.g., 0.05, is slightly inflated regardless of which fraction of leading eigenvalues is used. Based on these observations, we proceed with the Davies method for calculating the p values for the variance component test.

Table 1.

The Impact of Numerical Approximations on Weighted Sum of $\chi^2$ Distributions on Type I Error Rate of the Variance Component Test

Method 5e−2 1e−2 5e−3 1e−3 5e−4 1e−4 5e−5 1e−5
Davies 4.93e−02 9.82e−03 4.83e−03 9.63e−04 4.77e−04 9.20e−05 4.30e−05 1.00e−05
Liu 4.47e−02 8.63e−03 4.42e−03 1.00e−03 5.52e−04 1.37e−04 7.70e−05 2.10e−05
SaddlePoint 5.27e−02 1.02e−02 4.97e−03 9.55e−04 4.63e−04 8.70e−05 3.90e−05 8.00e−06
Hybrid-1% 5.61e−02 1.12e−02 5.48e−03 1.04e−03 5.15e−04 9.70e−05 4.80e−05 1.10e−05
Hybrid-10% 5.28e−02 1.03e−02 4.98e−03 9.58e−04 4.66e−04 9.00e−05 4.20e−05 1.20e−05
Hybrid-20% 5.27e−02 1.02e−02 4.98e−03 9.57e−04 4.65e−04 8.90e−05 4.10e−05 1.10e−05
Hybrid-50% 5.27e−02 1.02e−02 4.98e−03 9.57e−04 4.65e−04 8.90e−05 4.10e−05 1.10e−05

Type I error rates of the variance component test with different numerical algorithms, Davies method (Davies), Liu’s moment matching method (Liu), saddle point approximation (SaddlePoint), and Lumley’s hybrid method with different percentages of leading eigenvalues (Hybrid-1% to Hybrid-50%), for calculating p values at various significance levels ranging from 0.05 to $10^{-5}$.
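When reading Table 1, it can help to relate each empirical rate to its Monte Carlo uncertainty. The sketch below uses the binomial standard error for 1 million replicates at the nominal level 1e−5, with three entries copied from the last column of Table 1. The helper name `mc_z_score` is hypothetical, and the normal-approximation z score is only a rough guide at this level because the expected number of rejections is just 10.

```python
import math

N_REPLICATES = 1_000_000
ALPHA = 1e-5
# Empirical type I error rates at the 1e-5 level, taken from Table 1.
observed = {"Davies": 1.00e-05, "Liu": 2.10e-05, "SaddlePoint": 8.00e-06}


def mc_z_score(rate, alpha=ALPHA, n=N_REPLICATES):
    """Distance of an empirical rate from the nominal level, in standard errors."""
    se = math.sqrt(alpha * (1.0 - alpha) / n)
    return (rate - alpha) / se


z_scores = {name: mc_z_score(rate) for name, rate in observed.items()}
```

Under this rough yardstick the moment matching entry sits roughly three and a half standard errors above the nominal level, whereas the Davies and saddle point entries are within one standard error of it.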

Type I Errors of MiST

We evaluated the type I error of PrediXcan, modified SKAT, and various combination approaches in MiST at various significance levels ranging from 0.05 to $10^{-5}$ for continuous outcome (Table 2) and binary outcome (Table 3). All combination methods in MiST have correct type I error for all three genetic structures, CXCR1, C18orf32, and ARHGAP11A, at all levels.

Table 2.

Evaluation on Type I Error Rates of PrediXcan, Modified SKAT, and MiST on Continuous Outcomes

Method 5e−2 1e−2 5e−3 1e−3 5e−4 1e−4 5e−5 1e−5
Gene: ARHGAP11A

PrediXcan 4.96e−02 9.91e−03 4.98e−03 9.59e−04 4.89e−04 9.90e−05 4.30e−05 1.10e−05
mSKAT 4.97e−02 9.70e−03 4.84e−03 9.38e−04 4.42e−04 9.10e−05 4.20e−05 1.00e−05
oMiST 4.96e−02 9.76e−03 4.87e−03 9.30e−04 4.67e−04 8.60e−05 4.50e−05 8.00e−06
aMiST 4.96e−02 9.77e−03 4.87e−03 9.32e−04 4.72e−04 8.90e−05 4.40e−05 1.10e−05
fMiST 4.93e−02 9.76e−03 4.88e−03 1.01e−03 4.83e−04 9.30e−05 4.70e−05 7.00e−06

Gene: C18orf32

PrediXcan 5.04e−02 1.00e−02 5.00e−03 9.49e−04 4.82e−04 9.80e−05 4.70e−05 8.00e−06
mSKAT 5.04e−02 9.99e−03 4.98e−03 9.78e−04 4.83e−04 9.70e−05 5.10e−05 1.40e−05
oMiST 5.04e−02 9.90e−03 4.92e−03 9.75e−04 5.21e−04 9.10e−05 4.40e−05 9.00e−06
aMiST 5.04e−02 9.87e−03 4.95e−03 9.79e−04 5.13e−04 9.30e−05 4.50e−05 9.00e−06
fMiST 5.03e−02 9.95e−03 4.90e−03 9.96e−04 5.04e−04 1.02e−04 4.30e−05 7.00e−06

Gene: CXCR1

PrediXcan 5.01e−02 1.01e−02 5.06e−03 1.01e−03 4.88e−04 8.40e−05 3.90e−05 9.00e−06
mSKAT 5.01e−02 9.93e−03 4.92e−03 9.83e−04 4.98e−04 1.03e−04 4.50e−05 5.00e−06
oMiST 5.09e−02 9.49e−03 4.50e−03 9.89e−04 4.83e−04 9.20e−05 3.80e−05 1.10e−05
aMiST 5.00e−02 1.01e−02 5.00e−03 9.87e−04 4.82e−04 8.40e−05 3.80e−05 9.00e−06
fMiST 5.00e−02 1.01e−02 5.00e−03 9.89e−04 5.03e−04 9.60e−05 4.80e−05 1.00e−05

Type I error rates of various tests, PrediXcan, modified SKAT, and the proposed MiST, for continuous outcome and three genetic structures mimicking CXCR1, C18orf32, and ARHGAP11A at significance levels from 0.05 to $10^{-5}$ based on 1 million simulation runs.

Table 3.

Evaluation on Type I Error Rates of PrediXcan, Modified SKAT, and MiST on Dichotomous Outcomes

Method 5e−2 1e−2 5e−3 1e−3 5e−4 1e−4 5e−5 1e−5
Gene: ARHGAP11A

PrediXcan 4.98e−02 9.96e−03 4.96e−03 1.01e−03 4.99e−04 1.03e−04 4.85e−05 7.00e−06
mSKAT 4.96e−02 9.95e−03 4.87e−03 9.70e−04 4.73e−04 8.10e−05 4.00e−05 9.00e−06
oMiST 4.97e−02 9.85e−03 4.95e−03 9.80e−04 4.73e−04 8.75e−05 3.90e−05 1.10e−05
aMiST 4.98e−02 9.87e−03 4.94e−03 9.79e−04 4.80e−04 8.85e−05 4.30e−05 9.00e−06
fMiST 4.97e−02 9.85e−03 4.93e−03 1.00e−03 4.89e−04 9.35e−05 4.25e−05 1.15e−05

Gene: C18orf32

PrediXcan 5.03e−02 9.98e−03 5.06e−03 1.06e−03 5.54e−04 1.21e−04 5.90e−05 1.10e−05
mSKAT 5.03e−02 1.00e−02 5.00e−03 9.57e−04 4.86e−04 9.50e−05 4.70e−05 8.00e−06
oMiST 5.03e−02 1.01e−02 5.09e−03 9.99e−04 5.31e−04 1.09e−04 5.70e−05 1.50e−05
aMiST 5.03e−02 1.01e−02 5.07e−03 1.01e−03 5.26e−04 1.07e−04 5.90e−05 1.50e−05
fMiST 5.02e−02 1.01e−02 5.08e−03 1.03e−03 5.15e−04 1.02e−04 6.00e−05 1.30e−05

Gene: CXCR1

PrediXcan 4.95e−02 9.95e−03 5.12e−03 1.05e−03 5.12e−04 1.04e−04 5.40e−05 9.00e−06
mSKAT 4.99e−02 9.85e−03 4.91e−03 1.02e−03 5.10e−04 1.07e−04 5.60e−05 1.50e−05
oMiST 4.98e−02 9.89e−03 5.04e−03 1.01e−03 4.92e−04 1.09e−04 5.30e−05 1.20e−05
aMiST 4.97e−02 9.93e−03 5.00e−03 1.02e−03 5.05e−04 1.12e−04 5.50e−05 1.30e−05
fMiST 4.99e−02 9.97e−03 5.00e−03 9.89e−04 4.86e−04 1.05e−04 5.70e−05 9.00e−06

Type I error rates of various tests, PrediXcan, modified SKAT, and the proposed MiST, for binary outcome and three genetic structures mimicking CXCR1, C18orf32, and ARHGAP11A at significance levels from 0.05 to 10−5 based on 1 million simulation runs.
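Because each entry in Tables 2 and 3 is an empirical rejection rate from a finite number of simulation runs, it carries Monte Carlo error. The sketch below gives the range within which an empirical rate would be expected to fall if the true type I error equalled the nominal level. It assumes independent runs and a normal approximation to the binomial, both assumptions made here for illustration, and `mc_band` is a hypothetical helper name.

```python
import math

def mc_band(alpha, n_runs, z=1.96):
    """Approximate 95% Monte Carlo band for an empirical rejection rate.

    Assumes n_runs independent simulations whose true rejection
    probability is alpha, with a normal approximation to the binomial.
    """
    se = math.sqrt(alpha * (1.0 - alpha) / n_runs)
    return alpha - z * se, alpha + z * se, se

# Nominal level 1e-5 with 1 million runs, as in the most stringent column
lower, upper, se = mc_band(1e-5, 1_000_000)
print(f"SE = {se:.3e}, band = [{lower:.2e}, {upper:.2e}]")
```

With 1 million runs and a nominal level of 10−5, about 10 rejections are expected, so empirical rates between roughly 4e−06 and 1.6e−05 are within sampling noise of the nominal level.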

Importantly, accurate quantile approximation plays a critical role in obtaining accurate p values for the optimal linear combination method oMiST. Although great effort has been devoted to obtaining precise approximations for the tail probabilities of weighted sums of χ2 distributions, an accurate quantile approximation is, to our knowledge, still lacking in the literature. Liu’s moment matching method is straightforward and has been commonly used to approximate the quantiles of a weighted sum of χ2 distributions; for example, it is implemented in the popular SKAT-O test.9 However, as shown in Table 1, the moment matching method tends to overestimate tail probabilities when the p values are roughly greater than 0.001 and to underestimate them when the p values are less than 0.001. Correspondingly, it overestimates the quantiles at moderate tail probabilities and underestimates them at extreme tail probabilities. As a result, it can lead to type I error inflation at moderate significance levels and to power loss at more stringent significance levels. This phenomenon can be seen in Table S1, where Liu’s moment matching method is used for quantile approximation in oMiST. To overcome this issue, we propose a quantile approximation that directly locates the root of the difference between a given tail probability and the tail probability approximated by the Davies method. The type I error of oMiST shown in Tables 2 and 3 confirms the validity of this approach.
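The root-finding idea can be sketched as follows: given any routine that returns an upper-tail probability, the quantile for a target tail probability p is the point q at which tail(q) − p changes sign. The code below is a minimal illustration, not the authors' implementation. It uses bisection as the root finder and, as a stand-in for the Davies method, the exact tail of an equal-weight sum (a scaled χ2 with even degrees of freedom), so that it runs with the standard library alone. The names `chi2_sf_even` and `quantile_by_root` are hypothetical.

```python
import math

def chi2_sf_even(x, df):
    """Exact upper tail P(chi2_df > x); closed form valid for even df only."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def quantile_by_root(tail_prob, p, tol=1e-12):
    """Find q with tail_prob(q) = p by bracketing and bisection.

    tail_prob must be decreasing in q; in practice it would wrap a
    Davies-type routine for the weighted chi-square sum.
    """
    lo, hi = 0.0, 1.0
    while tail_prob(hi) > p:          # expand until the tail drops below p
        lo, hi = hi, 2.0 * hi
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if tail_prob(mid) > p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Stand-in example: equal weight w on each term, so the sum is w * chi2_df
w, df = 2.0, 4
tail = lambda q: chi2_sf_even(q / w, df)
q = quantile_by_root(tail, 1e-6)
print(q, tail(q))
```

Because the root is located on the tail probability itself, the accuracy of the resulting quantile is limited only by the accuracy of the tail routine that is plugged in.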

Power Comparison

We compared the power of PrediXcan, modified SKAT (mSKAT), and MiST for both continuous and binary outcomes under three genetic structures mimicking CXCR1 (Figure 1), C18orf32 (Figure 2), and ARHGAP11A (Figure 3). To save space, here we show the results only under the scenarios v1 = 0.25 and v2 = 0, 1, and one intermediate value: 0.6 (for genes CXCR1 and C18orf32 with continuous outcomes), 0.5 (for ARHGAP11A with continuous outcomes), or 0.55 (for the three genes with binary outcomes), at various significance levels. We chose slightly different intermediate values of v2 for different genes so that PrediXcan (or the burden test) and the variance component test have roughly similar power. The results for other values of v2 = 0, 0.25, 0.5, 0.75, and 1 at genome-wide significance level 10−6 are provided in Tables S2 and S3.

Figure 1.

Power Comparison under Simulation Scenarios with a Set of Moderate Size and Highly Correlated Genetic Structure on Both Continuous and Dichotomous Outcomes

Power curves of PrediXcan (gray dashed curve), modified SKAT (yellow dashed curve), and three combination methods in MiST (red solid curve for fMiST, and dark and light blue solid curves for aMiST and oMiST, respectively) on CXCR1 for continuous (top) and binary (bottom) outcomes against log10(significance level) under various proportions of signal explained by gene expression (v2=1 for left panel, v2=0.6 [continuous outcome] and v2=0.55 [dichotomous outcome] for central panel, and v2=1 for right panel).

Figure 2.

Power Comparison under Simulation Scenarios with a Set of Moderate Size and Relatively Independent Genetic Structure on Both Continuous and Dichotomous Outcomes

Power curves of PrediXcan (gray dashed curve), modified SKAT (yellow dashed curve), and three combination methods in MiST (red solid curve for fMiST, and dark and light blue solid curves for aMiST and oMiST, respectively) on C18orf32 for continuous (top) and binary (bottom) outcomes against log10(significance level) under various proportions of signal explained by gene expression (v2=1 for left panel, v2=0.6 [continuous outcome] and v2=0.55 [dichotomous outcome] for central panel, and v2=1 for right panel).

Figure 3.

Power Comparison under Simulation Scenarios with a Set of Large Size and Relatively Independent Genetic Structure on Both Continuous and Dichotomous Outcomes

Power curves of PrediXcan (gray dashed curve), modified SKAT (yellow dashed curve), and three combination methods in MiST (red solid curve for fMiST, and dark and light blue solid curves for aMiST and oMiST, respectively) on ARHGAP11A for continuous (top) and binary (bottom) outcomes against log10(significance level) under various proportions of signal explained by gene expression (v2=1 for left panel, v2=0.5 [continuous outcome] and v2=0.55 [dichotomous outcome] for central panel, and v2=1 for right panel).

The various proposed MiST combinations generally have power comparable to the most powerful test across different scenarios. When the signal of association comes only from the gene expression (i.e., v2 = 1), PrediXcan is, as expected, the most powerful test, but both oMiST and aMiST are nearly as powerful as PrediXcan, with a minor power loss due to the weights being estimated. In contrast, when the signal of association comes only from the variance component (i.e., v2 = 0), mSKAT is most powerful and PrediXcan has no power, while both oMiST and aMiST are nearly as powerful as mSKAT with a minor power loss. When the signal of association is contributed by both the gene expression and the variance component (i.e., v2 ranges between 0.5 and 0.6), both PrediXcan and mSKAT suffer substantial power loss compared to all the combination methods of MiST. To comprehensively examine the impact of the LD patterns on the performance of the test statistics, we also conducted simulated “genome-wide” analyses using all the gene sets defined in the PrediXcan whole-blood database (>10,000 sets). For each gene set, we generated a sample of 10,000 subjects (5,000 case and 5,000 control subjects) based on the GECCO genetic data. We considered three different v2 levels: 0, 0.5, and 1. The proportions of genes detected at different significance levels, ranging from 0.05 to the genome-wide significance level, are shown in Figure S4. The power patterns are consistent with those observed for CXCR1, C18orf32, and ARHGAP11A.

An issue in PrediXcan that has not been fully investigated in the literature is the impact of the uncertainty of weights estimated from the reference datasets. Neglecting the inaccuracy in weights and in the prediction of gene expression may lead to an overestimated power of PrediXcan. To address this issue, we examined the impact of estimated weights in the predicted gene expression on power, and whether the variance component test has the potential to improve power over PrediXcan by capturing the effect of residual expression that remains when estimated weights attenuate the predicted gene expression. We considered the scenario in which the association signal comes only from gene expression (i.e., v2 = 1). Specifically, we included an additional step of model building by elastic net (EN) regularization on gene expression to obtain estimated eQTL weights in each simulation replication. Power was investigated under several signal-to-noise ratios for eQTLs by setting R2, i.e., v1, to 0.05, 0.12, and 0.25. We used the eQTL weights obtained from elastic net regularization to estimate the predicted gene expression (EN weights). As a comparison to the most ideal situation, we examined the power of using true eQTL weights (true weights). Since EN regularization may miss some causal variants, we also artificially included causal variants that were not selected by elastic net in the random effects test statistics (EN weights + causal variants). Table 4 shows the power comparison.

Table 4.

Power Comparison with Estimated Weights from Elastic Net Regularization

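Column groups: True Weights (PrediXcan, mSKAT); EN Weights (PrediXcan, mSKAT, fMiST, aMiST, oMiST); EN Weights + Causal Variants (PrediXcan, mSKAT, fMiST, aMiST, oMiST)
Gene PrediXcan mSKAT PrediXcan mSKAT fMiST aMiST oMiST PrediXcan mSKAT fMiST aMiST oMiST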
v1 = 0.05

ARHGAP11A 0.929 0.008 0.865 0.011 0.794 0.822 0.826 0.865 0.011 0.801 0.823 0.825
C18orf32 0.942 0.005 0.817 0.022 0.734 0.767 0.768 0.817 0.021 0.745 0.771 0.772
CXCR1 0.944 0.004 0.908 0.013 0.854 0.876 0.875 0.908 0.016 0.852 0.873 0.871

v1 = 0.12

ARHGAP11A 0.655 0.006 0.564 0.007 0.430 0.477 0.480 0.564 0.008 0.435 0.482 0.481
C18orf32 0.675 0.005 0.474 0.014 0.383 0.408 0.409 0.474 0.018 0.392 0.408 0.411
CXCR1 0.672 0.007 0.603 0.004 0.472 0.512 0.513 0.603 0.009 0.488 0.513 0.511

v1 = 0.25

ARHGAP11A 0.691 0.006 0.632 0.006 0.483 0.530 0.528 0.632 0.005 0.489 0.533 0.532
C18orf32 0.702 0.007 0.607 0.009 0.483 0.516 0.516 0.607 0.009 0.476 0.514 0.515
CXCR1 0.694 0.003 0.653 0.008 0.536 0.574 0.573 0.653 0.009 0.534 0.573 0.571

The power performance of PrediXcan and the proposed methods with accurate weights (true weights), estimated weights from elastic net regularization (EN weights), and estimated weights plus a priori information on causal variants (EN weights + causal variants). Power was simulated under scenarios with v2=1 (signal comes through gene expression only) and v1=0.05, 0.12, and 0.25.

Compared to true eQTL weights, using estimated eQTL weights in PrediXcan loses power; however, the variance component test statistic is not able to recover the power loss due to estimated weights. Further, including all the causal variants (“EN weights + causal variants”) in the variance component does not improve power for the combined test statistics. This is probably because the estimated weights generally yield attenuated predicted gene expression while, at the same time, the variance of the predicted gene expression is also reduced. As a result, the test statistic does not change much. Further, for the variants missed by EN regularization, the effects tend to be modest and the variance component test cannot distinguish them from the null. Taken together, if the signal comes entirely from the gene expression, the proposed combined test statistics do not appear to outperform PrediXcan.
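One way to see why attenuation alone leaves the fixed-effect statistic nearly unchanged is the following worked sketch. It rests on a simplifying assumption introduced here for illustration, not taken from the simulations: shrinkage acts as a uniform rescaling of the predicted expression, $\hat{B} = cB$ with $0 < c < 1$. Writing $U_\gamma = B^T(D - \tilde{\mu})$ for the score and $V$ for its estimated variance, the standardized statistic is invariant to $c$:

```latex
% Assumption (illustrative): estimated weights shrink B uniformly, \hat{B} = cB.
\[
  Z(\hat{B})
  = \frac{\hat{B}^{T}(D-\tilde{\mu})}{\sqrt{\widehat{\operatorname{Var}}\{\hat{B}^{T}(D-\tilde{\mu})\}}}
  = \frac{c\,U_\gamma}{\sqrt{c^{2}V}}
  = \frac{U_\gamma}{\sqrt{V}}
  = Z(B).
\]
```

Under this simplification, any power loss from estimated weights would come from errors in the direction of the weight vector (for example, unselected or mis-weighted variants) rather than from its overall scale.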

The power of the three combination methods, oMiST, aMiST, and fMiST, is generally comparable when the significance levels are less stringent; however, some power differences emerge when the significance level is more stringent. When one of the two components dominates the association, the two data-adaptive combination methods oMiST and aMiST have similar power and are more powerful than fMiST. This is because the data-adaptive weights in oMiST and aMiST allow more weight to be placed on the test statistic that shows more evidence for association. When the signals for the overall association are roughly the same for both components, fMiST has the best power, and oMiST and aMiST show a minimal amount of power loss due to the weights being estimated from the data. In this situation, oMiST and aMiST behave similarly when the significance level is relaxed; however, aMiST appears to suffer more power loss when a more stringent significance level is used.

An Application to a Large-Scale GWAS of Colorectal Cancer

We applied the various combinations of MiST, PrediXcan, and modified SKAT to a large-scale genome-wide colorectal cancer study from multi-center collaborations between CCFR and GECCO to identify functional variant sets associated with CRC risk. The study consists of 11,470 CRC-affected case subjects and 11,649 control subjects of European ancestry from 14 studies. Case and control subjects are matched on age and sex. The summary of each study is provided in Table S4. All studies were approved by their respective Institutional Review Boards.

We downloaded all regulatory variants associated with gene expression from PredictDB Data Repository. These variants were obtained using elastic-net regularized regression to the genotype and blood transcriptome data from DGN. Following the recommendation from Gamazon et al.,31 we focused on genes where the predictive R2, measured by the squared correlation between predicted and observed gene expression, was >1%. This results in a total of 8,909 genes. The number of regulatory variants in each gene ranges from 1 to 213 with mean 33.5 and standard deviation 24.4.

For each gene, we fit a logistic regression model adjusting for study, sex, age, and the four leading principal components of ancestry to account for population substructure. To explore whether any population substructure beyond the adjusted covariates was unaccounted for, we performed a permutation analysis as described in Epstein et al.,48 in which the outcome was permuted in a way that accounts for the influence of the adjusted covariates while breaking the association between the genotypes and the outcome. The genomic control values for all tests after permutation range from 0.99 to 1.036. This observation, along with the Q-Q plot shown in Figure S5, confirms that the population substructure in GECCO and CCFR is well accounted for by the adjusted covariates.
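For readers who want to reproduce this kind of check, the sketch below computes a genomic control value from a list of p values. The text does not state how the values were computed, so the sketch assumes the common definition: each p value is converted to a 1-degree-of-freedom χ2 statistic, and the median of those statistics is divided by the median of the χ2 distribution with 1 degree of freedom. The function name `genomic_control` is hypothetical.

```python
from statistics import NormalDist, median

_ND = NormalDist()
# Median of a chi-square with 1 df (about 0.4549), via the normal quantile
CHI2_1_MEDIAN = _ND.inv_cdf(0.75) ** 2

def genomic_control(p_values):
    """Median-based genomic control value from two-sided p values in (0, 1)."""
    # A p value p corresponds to the 1-df chi-square statistic z^2,
    # where z is the standard normal quantile at p / 2.
    chi2 = [_ND.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / CHI2_1_MEDIAN

# Evenly spaced p values mimic a well-calibrated null distribution
n = 10001
null_p = [(i + 0.5) / n for i in range(n)]
lam = genomic_control(null_p)
print(round(lam, 4))
```

Values close to 1, such as the reported range of 0.99 to 1.036, indicate that the bulk of the test statistics follows the null distribution.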

We performed PrediXcan, mSKAT, and various combinations of MiST on each of the 8,909 genes. Genes for which the p values reached genome-wide significance after Bonferroni correction were further evaluated to see whether they represent independent secondary signals, by adjusting for all previously discovered risk SNPs on the same chromosome as the gene. Since we additionally performed the conditional analyses, to control the overall type I error at 0.05, we allocated 0.04 to the genome-wide association analysis and 0.01 to the conditional analysis. That is, a gene is declared to reach genome-wide significance if the p value reaches the threshold 0.04/8,909 = 4.49e−06, and, among the genes that pass genome-wide significance, a gene is declared significant as a secondary independent signal if the p value reaches the threshold 0.01/p, where p is the number of genes tested in the conditional analyses.

Figure 4 shows a Q-Q plot of −log10(p value) for PrediXcan, mSKAT, and MiST against the expected −log10(p value) under the null. There is a clear departure of extreme p values for MiST and mSKAT compared to PrediXcan, demonstrating the potential power gain from combining PrediXcan (weighted burden score) and the variance component. The top panel of Table 5 shows the genome-wide significant genes identified by all tests. Specifically, PrediXcan identifies only one gene, LAMC1 (MIM: 150290), and the modified variance component test identifies three genes, POU5F1B (MIM: 615739), C11orf92 (MIM: 615693 [COLCA1]), and ATF1 (MIM: 123803). The combined tests oMiST (optimal) and aMiST (adaptive) identify the same three genes, POU5F1B, C11orf92, and ATF1, whereas the Fisher’s combination fMiST identifies POU5F1B and ATF1. It is interesting to note that LAMC1, identified by PrediXcan, has a p value of only 0.767 for the variance component test mSKAT, whereas the p values for the combined tests are all in the range of 10−5–10−6, albeit not genome-wide significant, suggesting that the combined tests can capture the signal even though it comes only from the fixed effect. Similarly, when genes show evidence only in the variance component test, the combined tests also have small p values close to that of the variance component test. Among the three combination tests, when the signal comes primarily from one source, either PrediXcan or the variance component, oMiST and aMiST appear to capture the association with smaller p values than the Fisher’s combination fMiST. On the other hand, when the signals come from both sources, fMiST gives smaller p values than oMiST or aMiST (e.g., POU5F1B). In Table S5, we also provide a list of genes for which the p values reach a false discovery rate (FDR) of 0.2. MiST discovers more genes (37, 31, and 44 genes by oMiST, aMiST, and fMiST, respectively) with FDR < 0.2 than either PrediXcan (5 genes) or mSKAT (28 genes), because it can capture the signals from both PrediXcan and mSKAT. Note that since we do not have laboratory functional follow-up, the associations identified here may implicate the locus and not the gene itself.
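The relation between the component p values and fMiST in Table 5 can be illustrated with a short sketch. It assumes that fMiST is the standard Fisher combination of the two independent p values (PrediXcan for the fixed effect, mSKAT for the variance component), which refers −2 log(p1 p2) to a χ2 distribution with 4 degrees of freedom; that tail has the closed form w(1 − log w) with w = p1 p2. The function name `fisher_two` is hypothetical, and the inputs are the rounded values printed in Table 5.

```python
import math

def fisher_two(p_fixed, p_random):
    """Fisher combination of two independent p values.

    -2*log(p1*p2) follows a chi-square with 4 df under the null, whose
    upper tail at that point simplifies to w * (1 - log(w)), w = p1*p2.
    """
    w = p_fixed * p_random
    return w * (1.0 - math.log(w))

# (PrediXcan p, mSKAT p, reported fMiST p) from the top panel of Table 5
table5 = {
    "LAMC1":    (2.59e-06, 7.67e-01, 2.81e-05),
    "POU5F1B":  (2.79e-02, 6.08e-12, 5.15e-12),
    "C11orf92": (1.97e-01, 1.71e-06, 5.36e-06),
    "ATF1":     (1.80e-01, 1.33e-06, 3.87e-06),
}

combined = {g: fisher_two(pf, pr) for g, (pf, pr, _) in table5.items()}
for gene, (_, _, reported) in table5.items():
    print(f"{gene:9s} computed={combined[gene]:.2e} reported={reported:.2e}")
```

Under this assumption the computed values agree with the tabulated fMiST column to within about 1%, with the small differences attributable to rounding of the inputs.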

Figure 4.

Comparison on Results from Analyses with Different Testing Methods on Colorectal Cancer Data in GECCO and CCFR

Quantile-quantile plots from genome-wide association analyses on GECCO and CCFR by PrediXcan (light gray dots), modified SKAT (yellow dots), and each combination method in MiST (red dots for fMiST, and dark and light blue dots for aMiST and oMiST, respectively). The light gray solid line is a 45 degree line, and the pink horizontal line stands for the Bonferroni corrected significance level.

Table 5.

Top Results from Association Analyses with Various Testing Methods on GECCO and CCFR

Chr Gene Number of SNPs R2 PrediXcan mSKAT oMiST aMiST fMiST
Genome-wide analysis

1 LAMC1 13 0.2311 2.59e−06a 7.67e−01 6.95e−06 5.74e−06 2.81e−05
8 POU5F1B 45 0.0765 2.79e−02 6.08e−12a 5.00e−10a 9.92e−12a 5.15e−12a
11 C11orf92 21 0.4191 1.97e−01 1.71e−06a 4.24e−06a 3.43e−06a 5.36e−06
12 ATF1 16 0.1501 1.80e−01 1.33e−06a 3.13e−06a 2.62e−06a 3.87e−06a
Total number of identified genes by each method 1 3 3 3 2

Conditional analysis (a hyphen marks a method that did not identify the gene in the genome-wide analysis)

1 LAMC1 13 0.2311 4.20e−01 - - - -
8 POU5F1B 45 0.0765 - 1.19e−04b 3.21e−04b 2.68e−04b 5.42e−04b
11 C11orf92 21 0.4191 - 7.42e−01 6.63e−01 3.43e−01 -
12 ATF1 16 0.1501 - 1.73e−03b 1.00e−04b 2.29e−04b 3.04e−05b
Total number of identified genes by each method 0 2 2 2 2

Summary of significant genes that were identified by PrediXcan, mSKAT, and various combination methods in MiST at significance level 0.04/8,909 after Bonferroni correction (top), and the results from conditional analyses conditional on all the known CRC risk SNPs on the same chromosome as the genes (bottom). Basic information, including gene names, chromosome where the genes locate, number of genetic variants in the defined set, and predictive R2 from PrediXcan databases is presented accordingly.

a: p values reached 0.04/8,909.

b: p values reached 0.01/P, where P is the number of identified genes from the top panel by each method.

We further conducted a conditional association analysis on the four genome-wide significant genes by adjusting for all the known CRC-risk variants that have been reported in the literature as reaching genome-wide significance (5 × 10−8) and that are on the same chromosome as the gene under consideration. Note that these known CRC-risk SNPs do not necessarily reach genome-wide significance in the GECCO data. After conditioning on known CRC risk variants, POU5F1B and ATF1 remain significant for mSKAT and all the combination methods in MiST (the bottom panel of Table 5). We also examined the association of individual variants adjusting for known loci by using a forward selection procedure until no variant reached a p value of 0.05. Table S6 shows the joint analysis of these variants conditional on known loci. There are several variants associated with CRC risk in both POU5F1B and ATF1 after adjusting for known CRC loci, suggesting the possibility of several secondary signals that would not have been detected otherwise. Interestingly, LAMC1, the only genome-wide significant gene identified by PrediXcan, is no longer significant after adjusting for the known CRC risk variants on chromosome 1.
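The two-stage significance allocation can be checked against Table 5 with a small sketch. It uses the oMiST p values as an example and assumes that, in the bottom panel, the conditional p values printed for each gene belong to the methods that identified that gene in the top panel (an inference from the table layout). The helper `two_stage` is a hypothetical name.

```python
def two_stage(primary_p, conditional_p, n_genes, alpha_gw=0.04, alpha_cond=0.01):
    """Apply the genome-wide threshold, then the conditional threshold.

    primary_p and conditional_p map gene name -> p value for one method.
    Returns (genome-wide hits, secondary hits, both thresholds).
    """
    gw_threshold = alpha_gw / n_genes
    hits = [g for g, p in primary_p.items() if p <= gw_threshold]
    cond_threshold = alpha_cond / len(hits) if hits else None
    secondary = [g for g in hits if conditional_p[g] <= cond_threshold]
    return hits, secondary, gw_threshold, cond_threshold

# oMiST p values from Table 5 (top panel, then conditional panel)
omist_primary = {"LAMC1": 6.95e-06, "POU5F1B": 5.00e-10,
                 "C11orf92": 4.24e-06, "ATF1": 3.13e-06}
omist_conditional = {"POU5F1B": 3.21e-04, "C11orf92": 6.63e-01, "ATF1": 1.00e-04}

hits, secondary, gw_thr, cond_thr = two_stage(omist_primary, omist_conditional, 8909)
print(hits, secondary, f"{gw_thr:.2e}", f"{cond_thr:.2e}")
```

For oMiST this gives three genome-wide hits and a conditional threshold of 0.01/3, under which POU5F1B and ATF1 remain significant, matching the totals of 3 and 2 reported in Table 5.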

Bioinformatic Results on POU5F1B and ATF1

To further explore POU5F1B and ATF1, we employed an integrative bioinformatic functional follow-up. We first examined the correlation structure of all the variants in the two genes and found that the lead, or most significant, variants in each gene were in moderate to high LD (squared correlation > 0.5). A list of putative functional/causal variants was then defined as all variants in LD with a lead variant (squared correlation > 0.2 in the 1000 Genomes EUR population). The lead variant in the POU5F1B locus, rs7013278, was in LD with 65 variants, including a GWAS-identified risk variant, rs6983267. In the ATF1 locus, the lead variant, rs4388959, was in LD with 355 variants, including the GWAS-identified risk variant rs11169552. We then used regulatory and coding annotations to characterize putative function for variants in each list.
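The LD-based definition of the variant lists can be sketched as follows. The example computes the squared Pearson correlation between genotype dosage vectors (coded 0, 1, or 2) and keeps variants whose squared correlation with the lead variant exceeds 0.2. The genotype vectors and variant names are toy values invented for illustration, and using unphased dosages is an assumption of this sketch; the text specifies only the threshold and the reference population.

```python
def r_squared(g1, g2):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(g1)
    m1, m2 = sum(g1) / n, sum(g2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    v1 = sum((a - m1) ** 2 for a in g1)
    v2 = sum((b - m2) ** 2 for b in g2)
    if v1 == 0 or v2 == 0:        # monomorphic variant: correlation undefined
        return 0.0
    return cov * cov / (v1 * v2)

def variants_in_ld(lead, candidates, threshold=0.2):
    """Names of candidate variants with r^2 above the threshold."""
    return [name for name, g in candidates.items()
            if r_squared(lead, g) > threshold]

# Toy dosages for four individuals (hypothetical data)
lead = [0, 0, 2, 2]
candidates = {
    "snp_a": [0, 0, 2, 2],   # identical to the lead variant, r^2 = 1
    "snp_b": [0, 2, 0, 2],   # uncorrelated with the lead variant, r^2 = 0
    "snp_c": [0, 1, 1, 2],   # partially correlated, r^2 = 0.5
}
ld_list = variants_in_ld(lead, candidates)
print(ld_list)
```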

To investigate the possibility that the putative functional variants were linked to damaging protein-coding variants, we first annotated the two lists with PolyPhen-2 and SIFT. For POU5F1B there were three missense variants, only one of which (rs6998061) was predicted to be possibly damaging. For ATF1 there were no missense or nonsense variants. To explore whether variants in these lists could impact multiple cis-regulatory elements, we then examined variation in chromatin accessibility (DNaseI hypersensitivity), enhancers and repressors (chromatin immunoprecipitation-sequencing, ChIP-seq, signals for H3K4me1, H3K27ac, and H3K27me3), and transcription factor binding sites (ChIP-seq and motif analysis) in relevant tissues and cell types from Roadmap Epigenomics and ENCODE. In both loci, we observed overlaps with many relevant transcription factor binding sites and motifs. We also explored variation in colorectal crypt enhancers gained and lost in CRC cell lines.49 From this we found that in the POU5F1B locus, variants in our list overlapped four enhancers that are gained in CRC cell lines, as well as additional repressors in normal colon and rectal mucosa. In the ATF1 locus we found that variants overlapped nine enhancers gained and four enhancers lost in CRC cell lines.

To investigate the possibility that the regulatory elements harboring variants in our lists of putative functional variants may physically interact with additional genes, we examined ChIP-seq signals for CTCF (chromatin looping factor) in two colorectal cancer cell lines (HCT116 and CACO2). In the POU5F1B locus there was a CTCF peak at both POU5F1B and the known target gene of rs6983267, MYC (MIM: 190080). In the ATF1 locus, peaks overlapped many of the regulatory elements and promoters for LIMA1 (MIM: 608364), LARP4, DIP2B (MIM: 611379), and ATF1.

Discussion

In this paper, we considered a mixed effects model to incorporate the increasingly accumulated knowledge about functional annotations of genetic variants while allowing for alternative pathways and potentially imperfect functional information. We derived two independent score test statistics, MiST, for the fixed effects and random effects, respectively, and proposed two data-adaptive combination methods, oMiST and aMiST, to combine the association signals from both the fixed and random effects. Extensive simulations show that the combined tests are more powerful than or almost as powerful as the existing approaches, PrediXcan, and the variance component tests, under all scenarios considered. We also provide p values for each of the fixed and random effects, leading to further understanding where the association signal may come from, which helps for future follow-up and validation experiments.

Through extensive numerical studies of type I error, we found that the numerical approximations for the weighted sum of χ2 distributions, the asymptotic distribution of the variance component test, play a critical role in the validity of the tests. The inaccuracy in commonly used approximations, for example, Liu’s moment matching method or the saddle point approximation, can yield incorrect type I error rates, and the type I error can be conservative or anti-conservative, depending on the significance level and the type of approximation method. We also confirmed that the Davies method is the only existing approximation that provides valid type I error across a broad range of significance levels for the variance component test and the subsequent combination tests. This finding on numerical approximations is applicable not only to our proposed MiST, but also broadly to all other tests that involve a mixture of χ2 distributions. For example, the popular SKAT-O uses the Davies method to calculate the p value but Liu’s approximation to calculate the quantile in the intermediate step; as a result, SKAT-O can yield somewhat anti-conservative p values, particularly when the significance level is stringent, as in genome-wide association analyses. Our approach of using the Davies method for calculating the quantiles, described in the Numerical Consideration for p Value Calculation section, can be used to improve the accuracy of SKAT-O.

The proposed combination tests are more powerful than either PrediXcan (fixed effect) or the variance component test when the association signals come from both sources. They are almost as powerful as either one even when the association signal comes from only one source, where either PrediXcan or the variance component test can lose substantial power. To understand how the various tests perform in the real data analysis, we further investigated the genetic structures of the genes identified in the association analysis of CRC. For the genes identified by PrediXcan, it is common that most of the associated variants are highly correlated and regulate the transcriptome similarly. For example, in gene LAMC1, as shown in Figure S6, there are nine highly correlated regulatory variants, all functioning as suppressors of the LAMC1 transcriptome and having strong marginal associations with CRC risk. PrediXcan gains power if the association signals with the expression and with CRC risk are homogeneous. On the other hand, as shown in Figures S7–S9, PrediXcan may lose power in the following situations: (1) there exists heterogeneity in the direction of marginal signals after adjusting for the prediction weights (C11orf92 and POU5F1B), (2) suppressors and enhancers are highly correlated (C11orf92 and ATF1), (3) the signal is sparse (C11orf92), and (4) there exist weakly or moderately correlated blocks of variants of small to moderate signal (POU5F1B and ATF1). Under these scenarios, the combined MiST tests are better equipped to detect such associations than PrediXcan because they include the additional variance component test. Meanwhile, the combined MiST tests retain power comparable to that of PrediXcan even when the effects of variants on CRC risk come only through transcriptome abundance.

We studied three combination tests: oMiST, aMiST, and fMiST. The power of all three tests is quite comparable at relaxed significance levels under many of the scenarios considered; however, we observed power differences at stringent significance levels. fMiST is the most powerful test, followed by oMiST, when there are moderate signals coming from both sources. On the other hand, both oMiST and aMiST show their strength when the signals come from only one component, the fixed or the random effects. Among these three tests, oMiST shows the most robust power across a wide spectrum of configurations. Since the underlying mechanisms are diverse and generally unknown in genome-wide association analyses, we suggest using oMiST in practice.

While our test provides a powerful approach for discovering putative disease-associated loci where there may be additional signal beyond the effects through gene expression, interpretation of results will still depend on laboratory follow-up.50, 51, 52 For example, as shown in the bioinformatic analysis of POU5F1B and ATF1, the proposed approaches provide potential targets for further laboratory follow-up. High-throughput CRISPR-Cas9 screens can be used to investigate the role of enhancer and loss-of-function mutations in a gene on growth or expression in relevant cell lines or tissues. However, to identify causal functional variants, more laborious allelic assays will be necessary.

Recent technological advances have expanded the breadth of available -omic data, including transcriptomic, methylomic, and metabolomic data. These data will help elucidate the role of genetic variation in relation to phenotypes and generate important insights into the genetic underpinnings of the heritability of complex traits like colorectal cancer. There is a great need for powerful analysis tools to fully harness the utility of these comprehensive high-throughput data. Our proposed mixed effects score tests incorporate the functional information, while allowing for heterogeneous effects of these variants. Our tests are flexible, robust, and powerful and readily incorporate many different types of functional information as they become available. Our application to the CRC GWAS data demonstrates that further understanding can be gained through incorporating the functional information.

Acknowledgments

This work is supported by NIH grants R01 CA189532, R01 CA195789, and P01 CA53996 (to L.H.). GECCO is supported by National Cancer Institute, National Institutes of Health, and U.S. Department of Health and Human Services (U01 CA164930; U01 CA137088; R01 CA059045). CCFR was supported by National Cancer Institute (UM1 CA167551). Funding for the studies in the GECCO and CCFR is listed in the Supplemental Data.

Published: May 3, 2018

Footnotes

Supplemental Data include description of studies in GECCO and CCFR, eight figures, and five tables and can be found with this article online at https://doi.org/10.1016/j.ajhg.2018.03.019.

Contributor Information

Yu-Ru Su, Email: ysu@fredhutch.org.

Li Hsu, Email: lih@fredhutch.org.

Appendix A

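The Variance-Covariance Matrix in the Score Statistic $U_\gamma$

Under Equation 4 and the null hypothesis (Equation 5), the estimated variance-covariance matrix of $B^T(D-\tilde{\mu})$, $V$, is expressed as

$$B^T \Delta^{-1} (I_N - \mathbb{P}_1)\, \Delta\, (I_N - \mathbb{P}_1)\, \Delta^{-1} B,$$

where $\mathbb{P}_1$ is defined as $\Delta^{-1/2} X_1 (X_1^T \Delta^{-1} X_1)^{-1} X_1^T \Delta^{-1/2}$, with $X_1$ denoting $[1_N, X]$, the design matrix under the null model.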

The Asymptotic Distribution of the Score Statistic $U_{\tau^2}$ under the Null

The score statistic $U_{\tau^2}$, defined in Equation 6, follows an asymptotic null distribution that is a weighted sum of independent $\chi^2$ distributions, with the weights being the eigenvalues of $(I_N - \mathbb{P}_2)^T \Delta^{-1/2} G G^T \Delta^{-1/2} (I_N - \mathbb{P}_2)$. The $N \times N$ matrix $\mathbb{P}_2$ is equivalent to $\Delta^{-1/2} X_2 (X_2^T \Delta^{-1} X_2)^{-1} X_2^T \Delta^{-1/2}$, with $X_2$ defined as $[1_N, X, B]$.

Supplemental Data

Document S1. Description of Studies, Figures S1–S8, and Tables S1–S5
Document S2. Article plus Supplemental Data



Articles from American Journal of Human Genetics are provided here courtesy of American Society of Human Genetics
