Abstract
Twin data are commonly used for studying complex psychiatric disorders, and mixed effects models are one of the most popular tools for modeling the dependence structure within twin pairs. However, for eQTL (expression quantitative trait loci) data, where associations between thousands of transcripts and millions of single nucleotide polymorphisms need to be tested, mixed effects models are computationally inefficient and often impractical. In this paper, we propose a fast analysis approach for twin eQTL data: we randomly split twin pairs into two groups, so that the samples within each group are unrelated, and then apply multiple linear regression separately to each group. A score statistic that automatically adjusts for the (hidden) correlation between the two groups is constructed to combine the results from the two groups. The proposed method has well-controlled type I error. Compared to mixed effects models, the proposed method has similar power but drastically improved computational efficiency. We demonstrate the computational advantage of the proposed method via extensive simulations. The proposed method is also applied to a large twin eQTL dataset from the Netherlands Twin Register.
Keywords: Twin eQTL data, Score statistic, Mixed effects model, Matrix representation, High correlation
Introduction
Genome-wide association studies (GWAS) have been widely used over the last decade for identifying genetic variants associated with a diversity of complex human diseases, such as type 2 diabetes, breast cancer and psychiatric disorders [Garcia-Closas et al., 2013; Hanson et al., 2014; Winham et al., 2013]. These studies have identified a large number of disease-associated SNPs (single nucleotide polymorphisms). However, the majority of the SNPs detected by GWAS individually explain a very small fraction of the total heritability associated with these traits, and have no immediately clear functional or regulatory roles [McCarthy and Hirschhorn, 2008]. Gene expression, as an intermediate molecular phenotype, may provide additional insight into the regulatory roles of SNPs implicated by GWAS. With widely available high-throughput technologies, gene expression and genetic variant data can be collected simultaneously on disease-relevant tissues from the same individuals, and expression quantitative trait loci (eQTL) analysis can be performed to identify which genomic regions and genetic variants affect gene expression [Gilad et al., 2008]. eQTL analyses have been used to identify eQTL hotspots (genomic regions affecting multiple transcripts), construct causal networks, elucidate subclasses of clinical phenotypes, and determine lists of candidate genes for clinical trials [see the reviews of Kendziorski and Wang, 2006 and Wright et al., 2012]. Recent research has also shown that SNPs detected by GWAS are significantly more likely to be eQTLs, which can be used to boost the discovery of trait-associated SNPs and improve understanding of the biology underlying complex traits [Hsu et al., 2010; Nicolae et al., 2010; Schadt et al., 2008; Zhang et al., 2012; Zhong et al., 2010].
Various eQTL analytical approaches have been established in which an association test is performed between one transcript and one SNP at a time by linear regression analysis, analysis of variance [Shabalin, 2012], generalized linear regression [Hernandez et al., 2012], or mixed effects models [Kang et al., 2008], depending on the type of eQTL data. More complicated analytical procedures, such as Bayesian regression [Bottolo et al., 2011; Chipman et al., 2011; Stegle et al., 2010] and partial least squares regression [Chun et al., 2009], have also been applied to eQTL data. In addition, several methods have been proposed for detecting associations between a group of SNPs and the gene expression of each transcript [Hoggart et al., 2008; Michaelson et al., 2009; Zeng, 1994]. User-friendly software, such as Genevar [Yang et al., 2010] and eQTL viewer [Zou et al., 2007], has been developed for eQTL data analysis, output visualization and result interpretation. The high dimensionality of eQTL data makes modern eQTL analysis computationally intensive, because associations between several million SNPs and tens of thousands of transcripts need to be tested. To mitigate this heavy computational burden, analysis may be restricted to a small number of SNP-transcript pairs [Ghazalpour et al., 2008]. Alternatively, Shabalin [2012] has developed a fast eQTL analytical tool, Matrix eQTL, which is reported to be thousands of times faster than previously existing QTL/eQTL software. Matrix eQTL is extremely computationally efficient because it expresses the test statistic for each SNP-transcript pair as a function of the correlation between the SNP and the transcript, which can be computed for many pairs at once by a fast matrix operation.
For complex psychiatric disorders, such as schizophrenia and major depressive disorder, twin studies have received attention for establishing the general extent to which genes and environment are etiologically important [Boomsma et al., 2002; Chou et al., 2009; Neale and Cardon, 1992; Park et al., 2012; Silventoinen et al., 2003; Vaccarino et al., 2008]. Typical twin data include both monozygotic (MZ) twins and dizygotic (DZ) twins, plus unpaired individual twins (singletons). MZ twins are assumed to be genetically identical, while DZ twins share 50% of their genes on average. Assuming that MZ and DZ twins share their environment to the same extent, a higher phenotypic similarity between MZ twins than between DZ twins indicates that the phenotype is under genetic control. Unlike data with independent samples, twin data require more careful statistical modeling, since ignoring genetic relatedness and shared environment within twin pairs may inflate false positive findings and/or reduce true positive findings. Several statistical approaches are available for twin data. One of the most common approaches is structural equation modeling (SEM) [Neale et al., 1989]. Several software programs to perform SEM are available, such as Mx [Neale et al., 1999], Mplus [Muthen and Muthen, 1998], LISREL [Jorsekog and Sorborn, 1986] and OpenMx [Boker et al., 2011, 2012]. Another popular alternative for twin and family data is the mixed effects model, where random effects are used to properly account for the correlations among subjects [Carlin et al., 2005; Kuna et al., 2012; Wang et al., 2011]. Mixed effects models have a well-established theory which is familiar to statisticians. Moreover, they are conveniently implemented in most statistical software and can flexibly adjust for other non-genetic and genetic covariates [Feng et al., 2009; Rabe-Hesketh et al., 2008].
Though powerful, mixed effects models are computationally intensive and impractical for modern twin eQTL analysis. Moreover, the ultra-fast tool Matrix eQTL is not readily applicable to twin data since it does not model the dependence structure within twin pairs. To overcome these computational challenges, we propose a novel fast twin eQTL analysis approach. In this approach, we first randomly split the twin pairs into two groups such that, within each group, the samples are unrelated. We then run a separate analysis for each group using any statistical procedure valid for independent data, such as multiple linear regression or analysis of variance. When combining the results from the two groups, traditional meta-analysis procedures, such as Fisher's test, are no longer applicable, since the two sets of results are not independent due to the correlation within twin pairs. Naively combining the two sets of p-values without considering the dependence structure of the data would lead to inflated false positive findings. In this paper, we propose a novel score test which automatically adjusts for the (hidden) correlation structure between twins, and therefore controls the type I error accurately. To demonstrate the computational advantages and evaluate the performance of the proposed method, we conduct extensive simulations under various settings that mimic real-world twin data.
Our simulation results show that the proposed approach controls type I error rates well, with little power loss compared to the mixed effects model, which is the gold standard for analyzing twin data. Furthermore, the computational efficiency of the proposed method is dramatically improved: it is more than a thousand times faster than mixed effects models. The fast performance of the proposed method is achieved by computing the most computationally intensive part of the score test by matrix operations, similar to what has been done in Matrix eQTL. The utility of the proposed method is further illustrated by analyzing a twin eQTL dataset where the twin samples (~4,000) are from the Netherlands Twin Register.
Materials and Methods
The first step of the proposed approach is to randomly split each twin pair into two groups such that all samples within each group are unrelated. A statistical procedure valid for independent data can then be applied directly to each group to test SNP-transcript associations. For simplicity, and because of its popularity for eQTL data, we consider multiple linear regression. Specifically, for each group, given a transcript and a SNP, we fit the following linear regression model:

$$y_i = g_i\beta + x_i\gamma + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),$$

where for the ith individual, yi is the gene expression, gi is the SNP genotype, which is coded 0, 1 or 2 according to the number of minor alleles in the genotype, β is the fixed effect of the SNP, and γ = (γ0, γ1, …, γq)′ is a vector of parameters corresponding to the vector of non-genetic covariates plus the intercept xi = (1, xi1, …, xiq). Rewriting the above model in matrix form, we get

$$Y = G\beta + X\gamma + \varepsilon,$$

where Y = (y1, y2, …, yn)′, G = (g1, g2, …, gn)′, X = (x′1, x′2, …, x′n)′ is the n × (q + 1) covariate matrix whose ith row is xi, and ε = (ε1, ε2, …, εn)′. The log-likelihood function is therefore

$$\ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(Y - G\beta - X\gamma)'(Y - G\beta - X\gamma),$$
where θ = (β, γ, σ2) and . For the eQTL analysis, the null hypothesis is H0 : β = 0 where the vector of the nuisance parameters is η = (γ, σ2). Following Zou et al. [2004], we get the score function of the ith individual as
where Uβ,i(β, η) and Uη,i(β, η) are defined as
and Σβ,η(β, η) is the limit of , and Ση,η(β, η) is the limit of . The two Hessian matrices are
Let the restricted MLE η̂ be the solution of Σi=1 Uη,i(β, η) = 0, where β is set to 0 in this equation. Specifically, we have
Note that η̂ is estimated under H0 and thus does not depend on the genotypes of any given SNP. Therefore it does not change from SNP to SNP, and needs only to be estimated once for each transcript. Also note that all of the off-diagonal elements in equal to zero. Replacing all unknown parameters by their sample estimators in the score function, we get
For computational efficiency, we express the score function Ûi in terms of a matrix operation below. Denote
we have
Let Û be the vector of the score function of all individuals. That is,
Substituting a, A and c into the above equation, we get
| (*) |
When the covariates X and G are independent of each other, the elements inside {} can be simplified further to
Note that the elements inside {} depend only on Y and the covariates X, and thus are the same across all SNPs. This motivates the above matrix formulation, which computes the score vectors for a large number of SNPs for a given transcript simultaneously. Specifically, we replace the vector G in equation (*) by a matrix H = (G1, …, Gm), where Gj is the G vector corresponding to SNP j (j = 1, …, m). The score matrix Û is then n × m, where the jth column is the score vector at SNP j. Recall that twin pairs are randomly split into two groups. We add superscripts to Û above and denote by Û(1) and Û(2) the score matrices of group 1 and group 2, respectively. Also let nk be the number of individuals in the kth group (k = 1, 2) and the score function of the ith subject for the jth SNP in the kth group be (i = 1, …, nk), or specifically we now have
The linear regression results for the jth SNP from the two groups may be naively combined as
and assumed to follow asymptotically under H0. Here, β̂(k) is the estimate of the SNP effect β with its standard error SE(β̂(k)) from the linear regression analysis of the kth group (k = 1, 2). This test is only valid when the results from the two groups are independent of each other, which is not likely to be true for twin data. To account for the dependence structure of the two groups, we derive a new score statistic to automatically adjust the correlation between the two groups:
where ntwin is the total number of twin pairs, and the individuals in groups 1 and 2 are arranged in such a way that the first ntwin samples in each group are paired twins and the remaining samples are singletons. The proposed test statistic Wproposed(j) follows asymptotically under H0.
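The two-group procedure described in this section can be sketched in code. The following Python sketch is illustrative only (the authors' implementation is in R) and rests on two assumptions that the text here does not confirm. First, the per-individual score is taken to be the covariate-adjusted genotype times the null residual divided by the estimated error variance, which is the usual efficient score for β in a normal linear model. Second, the combined statistic is taken to be the squared sum of all scores divided by a variance estimate that adds twice the cross products of paired twins. The helper names `solve`, `ols_residuals`, `score_vector` and `combined_statistic` are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols_residuals(x_rows, v):
    """Residuals of v after least-squares regression on the rows of x_rows."""
    p = len(x_rows[0])
    xtx = [[sum(r[j] * r[k] for r in x_rows) for k in range(p)] for j in range(p)]
    xtv = [sum(r[j] * vi for r, vi in zip(x_rows, v)) for j in range(p)]
    coef = solve(xtx, xtv)
    return [vi - sum(c * x for c, x in zip(coef, r)) for r, vi in zip(x_rows, v)]


def score_vector(x_rows, y, g):
    """Assumed per-individual scores for beta under H0 within one group."""
    ry = ols_residuals(x_rows, y)                # expression residuals under H0
    rg = ols_residuals(x_rows, g)                # genotype adjusted for covariates
    sigma2 = sum(e * e for e in ry) / len(y)     # null estimate of sigma^2
    return [a * b / sigma2 for a, b in zip(rg, ry)]


def combined_statistic(u1, u2, n_twin):
    """Assumed combined statistic; the first n_twin entries of u1, u2 are twins."""
    total = sum(u1) + sum(u2)
    variance = (sum(u * u for u in u1) + sum(u * u for u in u2)
                + 2.0 * sum(u1[i] * u2[i] for i in range(n_twin)))
    return total * total / variance
```

In this sketch the expression residuals and the error variance depend only on the transcript and the covariates, so they would be computed once per transcript and reused for every SNP, which is the point of the matrix formulation. Setting `n_twin = 0` drops the cross-product term and treats the two groups as independent.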
Results
Simulation studies
The proposed method is applied to simulated twin data to evaluate its performance. Each dataset includes 900 or 1800 individuals consisting of MZ twins, DZ twins and singletons in the ratio 2 : 2 : 1. Two continuous covariates (x1, x2) with effects γ = (0.3, 0.1) are correlated to the response variable y, where x1 and x2 are generated from N(3, 1) and N(5, 2), respectively. The response variable y is generated from the following model,
where yi is the gene expression for the ith individual, μ is the grand mean, gi is the SNP genotype, and xi is a vector of non-genetic covariates. The SNP genotype gi and the vector of covariates xi are generated independently. The random terms ai, di, ci, ei are the additive, dominant, common environment effects and random error, respectively, which are mutually independent and normally distributed with mean zero and variance and , respectively. For subjects i and j who are a twin pair, we have and if they are MZ pair; and if they are DZ pair, while for all twin pairs. According to Neale et al. [1989], the above model is referred to as the ACE or ACDE model depending or not. For each simulation set up, 1000 datasets are generated. We set and , resulting in the additive heritability of 0.462 and the variance explained by the shared environmental c2, of 0.231 under the ACE model. For the ACDE model, we set to , leading to and c2 = 0.158. Since there are 1000 association tests for each simulated dataset, 1000 datasets give a total of 1000 * 1000 = 1 million tests, which are used for the type I error evaluation. The type I error of the proposed method is compared with three other methods: 1) a multiple linear regression model on the full twin data, where the dependence between the twin pairs is ignored; 2) the naive approach; and 3) the mixed effects model: yij = gijβ + xijγ + aij + dij + cij + εij, where i and j are family id and individual index respectively, xij is the vector of non-genetic covariates plus intercept, gij is the SNP genotype, and the definition of aij, cij, dij and their covariance structures are described in the above ACE and ACDE models.
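As an illustration of the data-generating mechanism, the following Python sketch simulates one twin pair under an ACE-type model. It is not the authors' simulation code, and several values are assumptions: the variance components 0.6, 0.3 and 0.4 are chosen only because they reproduce the stated ratios (0.6/1.3 ≈ 0.462 and 0.3/1.3 ≈ 0.231), the second argument of N(5, 2) is read as a variance, the grand mean defaults to 0, the minor allele frequency is a free parameter, and DZ genotypes are drawn by transmitting one allele from each of two simulated parents. The names `simulate_pair` and `child_genotype` are hypothetical.

```python
import math
import random

# Assumed variance components (additive, common environment, error).
VAR_A, VAR_C, VAR_E = 0.6, 0.3, 0.4
GAMMA = (0.3, 0.1)  # covariate effects stated in the text


def child_genotype(mother, father, rng):
    """Minor-allele count of a child: one allele drawn from each parent."""
    return rng.choice(mother) + rng.choice(father)


def simulate_pair(zygosity, beta, maf, rng, mu=0.0):
    """Simulate genotype, covariates and expression for one MZ or DZ pair."""
    mother = [int(rng.random() < maf) for _ in range(2)]
    father = [int(rng.random() < maf) for _ in range(2)]
    g1 = child_genotype(mother, father, rng)
    # MZ twins carry identical genotypes; DZ twins are drawn independently
    # from the same parents.
    g2 = g1 if zygosity == "MZ" else child_genotype(mother, father, rng)

    shared_a = rng.gauss(0.0, 1.0)                 # additive part common to the pair
    c = rng.gauss(0.0, math.sqrt(VAR_C))           # common environment, shared
    twins = []
    for g in (g1, g2):
        if zygosity == "MZ":
            a = math.sqrt(VAR_A) * shared_a        # additive correlation 1
        else:
            own = rng.gauss(0.0, 1.0)              # additive correlation 0.5
            a = math.sqrt(VAR_A) * (math.sqrt(0.5) * shared_a + math.sqrt(0.5) * own)
        x1 = rng.gauss(3.0, 1.0)
        x2 = rng.gauss(5.0, math.sqrt(2.0))        # N(5, 2) read as variance 2
        e = rng.gauss(0.0, math.sqrt(VAR_E))
        y = mu + g * beta + GAMMA[0] * x1 + GAMMA[1] * x2 + a + c + e
        twins.append({"g": g, "x": (x1, x2), "y": y})
    return twins
```

Under these assumed values, the expected correlation of the random part of expression is (0.6 + 0.3)/1.3 ≈ 0.69 within MZ pairs and (0.3 + 0.3)/1.3 ≈ 0.46 within DZ pairs, which is the dependence that a regression on the full data ignores. An ACDE variant would add a dominance term with within-pair correlation 1 for MZ and 0.25 for DZ pairs.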
Results from Tables 1 and 2 demonstrate that the type I error rates of both the proposed method and the mixed effects model are well controlled. In contrast, if the linear regression model (lm) is applied directly to the full twin data, the type I error rates are substantially inflated. For example, with n = 1800 and a target type I error rate of α = 0.05, the type I error rates for data from the ACE and ACDE models are 0.098 and 0.102, respectively, about twice the nominal level. The naive method shows a similar inflation.
Table 1.
Type I error comparison for data from the ACE model
| | n = 900 | | | | n = 1800 | | | |
|---|---|---|---|---|---|---|---|---|
| α | 0.05 | 0.01 | 0.001 | 0.0001 | 0.05 | 0.01 | 0.001 | 0.0001 |
| proposed | 0.0522 | 0.0108 | 0.0011 | 0.0001 | 0.0507 | 0.0100 | 0.0010 | 0.0001 |
| naive | 0.1000 | 0.0310 | 0.0060 | 0.0012 | 0.0989 | 0.0301 | 0.0055 | 0.0010 |
| mixed | 0.0509 | 0.0103 | 0.0011 | 0.0001 | 0.0502 | 0.0100 | 0.0010 | 0.0001 |
| lm | 0.0988 | 0.0301 | 0.0057 | 0.0011 | 0.0984 | 0.0298 | 0.0053 | 0.0010 |
proposed: proposed score test statistic
naive: naive test statistic
mixed: mixed effects model
lm: multiple linear regression
Table 2.
Type I error comparison for data from the ACDE model
| | n = 900 | | | | n = 1800 | | | |
|---|---|---|---|---|---|---|---|---|
| α | 0.05 | 0.01 | 0.001 | 0.0001 | 0.05 | 0.01 | 0.001 | 0.0001 |
| proposed | 0.0524 | 0.0107 | 0.0011 | 0.0001 | 0.0510 | 0.0104 | 0.0010 | 0.0001 |
| naive | 0.1027 | 0.0321 | 0.0064 | 0.0012 | 0.1025 | 0.0320 | 0.0062 | 0.0013 |
| mixed | 0.0509 | 0.0103 | 0.0010 | 0.0001 | 0.0502 | 0.0101 | 0.0010 | 0.0001 |
| lm | 0.1024 | 0.0317 | 0.0061 | 0.0012 | 0.1020 | 0.0317 | 0.0061 | 0.0012 |
proposed: proposed score test statistic
naive: naive test statistic
mixed: mixed effects model
lm: multiple linear regression
For power comparisons, we generate 1000 datasets under the alternative hypothesis, where we set β to 0.32, 0.37, 0.45, and 0.50 in the ACE model and β to 0.40, 0.45, 0.50, and 0.55 in the ACDE model. For both models, the non-genetic effects γ and the variance components and are kept the same as in the above type I error investigation. The results in Tables 3 and 4 show that the proposed method has only a modest power loss compared to the gold standard mixed effects model at both sample sizes (for example, 0.733 versus 0.761 for β = 0.32, n = 900 and α = 0.05 under the ACE model). All simulations and data analyses are conducted in R [R Core Team, 2014].
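The entries of the type I error and power tables are rejection proportions: the fraction of simulated tests whose p-value falls below α. A minimal Python sketch of that calculation follows. It assumes, as an illustration rather than a statement of the authors' procedure, that each test statistic is referred to a chi-square distribution with 1 degree of freedom; the function names are hypothetical.

```python
import math


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom.

    Uses P(Z^2 > w) = P(|Z| > sqrt(w)) = erfc(sqrt(w / 2)) for standard normal Z.
    """
    return math.erfc(math.sqrt(stat / 2.0))


def rejection_rate(stats, alpha):
    """Fraction of test statistics whose p-value falls below alpha."""
    return sum(chi2_1df_pvalue(s) < alpha for s in stats) / len(stats)
```

Applied to statistics simulated under H0, `rejection_rate` estimates the type I error; applied to statistics simulated under a nonzero β, it estimates power.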
Table 3.
Power comparison for data from the ACE model
| | | β = 0.32 | | | β = 0.37 | | | β = 0.45 | | | β = 0.50 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | α | proposed | naive | mixed | proposed | naive | mixed | proposed | naive | mixed | proposed | naive | mixed |
| 900 | 0.05 | 0.733 | 0.817 | 0.761 | 0.834 | 0.886 | 0.865 | 0.933 | 0.966 | 0.951 | 0.970 | 0.985 | 0.986 |
| | 0.01 | 0.503 | 0.662 | 0.557 | 0.664 | 0.790 | 0.710 | 0.830 | 0.903 | 0.882 | 0.899 | 0.951 | 0.933 |
| | 0.001 | 0.225 | 0.416 | 0.286 | 0.363 | 0.584 | 0.434 | 0.629 | 0.796 | 0.696 | 0.745 | 0.873 | 0.811 |
| | 0.0001 | 0.103 | 0.240 | 0.130 | 0.166 | 0.372 | 0.224 | 0.368 | 0.633 | 0.466 | 0.530 | 0.753 | 0.630 |
| 1800 | 0.05 | 0.945 | 0.971 | 0.966 | 0.984 | 0.991 | 0.996 | 1.000 | 1.000 | 0.999 | 1.000 | 1.000 | 1.000 |
| | 0.01 | 0.831 | 0.919 | 0.889 | 0.940 | 0.973 | 0.963 | 0.989 | 0.998 | 0.999 | 0.999 | 1.000 | 0.999 |
| | 0.001 | 0.592 | 0.781 | 0.690 | 0.784 | 0.912 | 0.867 | 0.953 | 0.988 | 0.983 | 0.988 | 0.998 | 0.999 |
| | 0.0001 | 0.344 | 0.600 | 0.459 | 0.567 | 0.796 | 0.693 | 0.864 | 0.953 | 0.928 | 0.951 | 0.989 | 0.983 |
proposed: proposed score test statistic
naive: naive test statistic
mixed: mixed effects model
Table 4.
Power comparison for data from the ACDE model
| | | β = 0.40 | | | β = 0.45 | | | β = 0.50 | | | β = 0.55 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | α | proposed | naive | mixed | proposed | naive | mixed | proposed | naive | mixed | proposed | naive | mixed |
| 900 | 0.05 | 0.740 | 0.849 | 0.786 | 0.842 | 0.906 | 0.873 | 0.904 | 0.950 | 0.932 | 0.949 | 0.974 | 0.963 |
| | 0.01 | 0.501 | 0.670 | 0.561 | 0.633 | 0.792 | 0.688 | 0.755 | 0.870 | 0.804 | 0.844 | 0.926 | 0.890 |
| | 0.001 | 0.231 | 0.426 | 0.268 | 0.330 | 0.569 | 0.395 | 0.459 | 0.695 | 0.538 | 0.597 | 0.805 | 0.672 |
| | 0.0001 | 0.094 | 0.242 | 0.122 | 0.151 | 0.342 | 0.201 | 0.237 | 0.489 | 0.299 | 0.334 | 0.627 | 0.426 |
| 1800 | 0.05 | 0.964 | 0.982 | 0.980 | 0.986 | 0.993 | 0.995 | 0.995 | 0.999 | 0.999 | 0.999 | 1.000 | 0.999 |
| | 0.01 | 0.859 | 0.942 | 0.913 | 0.948 | 0.980 | 0.971 | 0.981 | 0.991 | 0.990 | 0.994 | 0.999 | 0.998 |
| | 0.001 | 0.664 | 0.831 | 0.745 | 0.800 | 0.918 | 0.869 | 0.895 | 0.974 | 0.948 | 0.969 | 0.988 | 0.985 |
| | 0.0001 | 0.431 | 0.676 | 0.511 | 0.605 | 0.813 | 0.703 | 0.756 | 0.907 | 0.844 | 0.865 | 0.972 | 0.927 |
proposed: proposed score test statistic
naive: naive test statistic
mixed: mixed effects model
Computational Efficiency
Our proposed method combines the advantages of multiple linear regression and matrix operations to achieve fast computational performance. The proposed method took 361 seconds on one 2.93 GHz Intel processor to perform 1 million association tests, whereas the mixed effects model took 21,000 seconds on a cluster of 100 Intel processors to run the same number of tests. The R function that implements the proposed method can be found at www.bios.unc.edu/~feizou/software.
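The reported timings can be turned into an approximate speedup as follows. This back-of-the-envelope Python calculation assumes that the cluster job scaled perfectly across its 100 processors and that those processors are comparable to the single 2.93 GHz processor, neither of which is stated above.

```python
proposed_seconds = 361        # one processor, 1 million tests
mixed_seconds = 21_000        # wall-clock time on the cluster, same tests
mixed_processors = 100

# Total processor time of the mixed effects analysis under perfect scaling.
mixed_cpu_seconds = mixed_seconds * mixed_processors
speedup = mixed_cpu_seconds / proposed_seconds

print(f"{mixed_cpu_seconds} processor-seconds, about {speedup:.0f}x")
# prints "2100000 processor-seconds, about 5817x"
```

Under these assumptions the difference in total processor time is a little under 6,000-fold.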
Netherlands Twin Registry (NTR) eQTL Study
A total of 2,752 individuals from the Netherlands Twin Registry (NTR) had their SNP genotypes and gene expression data measured [Wright et al., 2014] on Affymetrix 6.0 SNP arrays and U219 expression arrays. The goals of this project were to quantify the heritability of each transcript, to build a comprehensive list of eQTLs in peripheral blood, and to assess the biomedical relevance of these eQTLs. After quality control [see Wright et al., 2014, for details], 686,895 SNPs and 47,585 transcripts on 2,494 individuals were used for the eQTL analysis. The 2,494 individuals include 638 MZ pairs, 529 DZ pairs and 160 singletons. A total of 96 covariates are included in the eQTL analysis, including the plate number, hybridization well position, age at blood sampling, sex, time intervals from sampling to RNA extraction and hybridization, total white and red cell counts, 5 principal components (PCs) from the gene expression data, and 3 PCs derived from the SNP data [see Wright et al., 2014].
Compared with controlling the family-wise error rate, controlling the false discovery rate (FDR) is less stringent and handles the dependence between test statistics reasonably well [Benjamini and Yekutieli, 2001]. Instead of saving the p-values for all the SNP-transcript pairs, Shabalin [2012] only saves the p-values that pass a pre-specified significance threshold, to which the Benjamini and Hochberg procedure is applied. Following the procedure of Shabalin [2012], we identified 184,474 and 127,891 SNP-transcript pairs at FDR = 0.01 and FDR = 0.001, respectively, with the proposed method. In contrast, the naive method identified 225,419 and 136,509 significant SNP-transcript pairs at FDR = 0.01 and FDR = 0.001, substantially more than were detected by the proposed method. These additional findings are likely to be false positives, given the type I error inflation of the naive method. Based on the results from the proposed method, SNP-transcript pairs with p-values passing certain thresholds are shown in the heat map (upper panel of Figure 1). Significant SNP-transcript pairs along the 45 degree diagonal line in the heat map are cis-eQTLs, whereas those off the diagonal are trans-eQTLs. The lower panel of Figure 1 suggests that there are two master regulator regions, on chromosomes 3 and 17, that affect a large number of transcripts and deserve further investigation.
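The thresholded FDR calculation described above can be sketched as follows. This Python sketch is an illustration, not the Matrix eQTL code: it computes Benjamini and Hochberg q-values from only the saved p-values, using the total number of tests performed as the multiplier. The function name `bh_qvalues` and its arguments are hypothetical, and the sketch assumes that every p-value below the saving threshold has been kept, so that the rank of a saved p-value among all tests equals its rank in the saved list.

```python
def bh_qvalues(saved_pvalues, total_tests):
    """Benjamini-Hochberg q-values for the smallest p-values out of total_tests.

    Returns one q-value per saved p-value, in the input order.
    """
    order = sorted(range(len(saved_pvalues)), key=lambda i: saved_pvalues[i])
    qvalues = [0.0] * len(saved_pvalues)
    running_min = 1.0
    # Walk from the largest saved p-value down so that q-values are monotone.
    for rank in range(len(order), 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, saved_pvalues[idx] * total_tests / rank)
        qvalues[idx] = running_min
    return qvalues
```

For example, `bh_qvalues([1e-6, 0.004, 1e-4], 1000)` gives q-values of about 0.001, 1.0 and 0.05. Because unsaved p-values are ignored, the q-values can only be equal to or larger than those from the full list, and the set of discoveries is unchanged for any FDR level that does not exceed the saving threshold.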
Figure 1.
eQTL hotspot for NTR twin data
Upper panel: heat map of the SNP-transcript pairs with p-values < 10^−15. Green, blue, red and dark dots indicate the pairs whose −log10(p) falls into [15, 25), [25, 50), [50, 100) and [100, ∞), respectively. The colored lines on the 4 edges are used to differentiate the 22 autosomes. X-axis: SNP locations; Y-axis: transcript locations.
Lower panel: the number of transcripts (y-axis) that are significantly associated with each SNP (p < 10^−15). X-axis: SNP locations.
The top 20 SNP-transcript pairs detected by the proposed approach are summarized in Table 5. Note that the p-values of these pairs from the naive method and from Matrix eQTL are too small to be represented precisely and are therefore reported as 0 by R.
Table 5.
Top 20 SNP-transcript pairs identified from the proposed method
| | SNP | chr(SNP) | start(SNP) | end(SNP) | GENE | chr(GENE) | start(GENE) | end(GENE) | −log10(pval) | q value |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | rs3909451 | chr5 | 96295120 | 96295121 | 11745601_a_at | chr5 | 96215266 | 96225192 | 228.4077 | 1.278352e-218 |
| 2 | rs27300 | chr5 | 96363406 | 96363407 | 11745601_a_at | chr5 | 96215266 | 96225192 | 227.2055 | 8.544353e-218 |
| 3 | rs12229020 | chr12 | 10117690 | 10117691 | 11729479_a_at | chr12 | 10124007 | 10138056 | 227.1056 | 8.544353e-218 |
| 4 | rs7313235 | chr12 | 10132274 | 10132275 | 11729479_a_at | chr12 | 10124007 | 10138056 | 226.8456 | 1.166032e-217 |
| 5 | rs2235918 | chr1 | 17422605 | 17422606 | 11727597_at | chr1 | 17393255 | 17445948 | 226.6566 | 1.441486e-217 |
| 6 | rs2076607 | chr1 | 17422659 | 17422660 | 11727597_at | chr1 | 17393255 | 17445948 | 225.8381 | 7.908774e-217 |
| 7 | rs9272346 | chr6 | 32604371 | 32604372 | 11750528_x_at | chr6 | 32410960 | 32411702 | 225.5650 | 1.271430e-216 |
| 8 | rs4808485 | chr19 | 16439497 | 16439498 | 11764269_at | chr19 | 16438314 | 16438642 | 224.0849 | 3.360241e-215 |
| 9 | rs27290 | chr5 | 96350093 | 96350094 | 11745601_a_at | chr5 | 96215266 | 96225192 | 221.9954 | 3.670607e-213 |
| 10 | rs4698634 | chr4 | 17630191 | 17630192 | 11736024_at | chr4 | 17616279 | 17627249 | 221.6546 | 7.240198e-213 |
| 11 | rs27300 | chr5 | 96363406 | 96363407 | 11747137_x_at | chr5 | 96215265 | 96253609 | 221.4142 | 1.145032e-212 |
| 12 | rs3909451 | chr5 | 96295120 | 96295121 | 11747137_x_at | chr5 | 96215265 | 96253609 | 221.2025 | 1.708595e-212 |
| 13 | rs2549794 | chr5 | 96244540 | 96244541 | 11747137_x_at | chr5 | 96215265 | 96253609 | 221.0543 | 2.218743e-212 |
| 14 | rs12229020 | chr12 | 10117690 | 10117691 | 11753241_s_at | chr12 | 10124195 | 10137625 | 220.9615 | 2.550835e-212 |
| 15 | rs7673500 | chr4 | 17621378 | 17621379 | 11736024_at | chr4 | 17616279 | 17627249 | 220.8615 | 2.939277e-212 |
| 16 | rs12229020 | chr12 | 10117690 | 10117691 | 11762101_at | chr12 | 10124033 | 10136838 | 220.8420 | 2.939277e-212 |
| 17 | rs2549794 | chr5 | 96244540 | 96244541 | 11745601_a_at | chr5 | 96215266 | 96225192 | 220.7022 | 3.816818e-212 |
| 18 | rs2076608 | chr1 | 17422301 | 17422302 | 11727597_at | chr1 | 17393255 | 17445948 | 220.1213 | 1.373340e-211 |
| 19 | rs2235914 | chr1 | 17424844 | 17424845 | 11727597_at | chr1 | 17393255 | 17445948 | 220.0841 | 1.417414e-211 |
| 20 | rs9902260 | chr17 | 62399869 | 62399870 | 11757545_x_at | chr17 | 62399863 | 62400230 | 219.5656 | 4.443247e-211 |
Discussion
eQTL analysis usually involves thousands of transcripts and millions of SNPs assayed in thousands of individuals. The challenges for eQTL analysis of twin data are the heavy computational burden and the dependence between twins within pairs. The mixed effects model is the gold standard for analyzing twin data. However, it is extremely time-consuming for modern eQTL data. Although Matrix eQTL is an ultra-fast tool, it is not applicable to eQTL analysis of twin data, since it assumes that the covariance matrix V is known. This assumption does not hold for twin data, where the covariance matrix can be decomposed into several parts. For example, where and are unknown. If the heritability is much larger than random error, the covariance for the additive genetic effect A can be used to approximate the covariance V; otherwise, dropping the second term in the parentheses may result in invalid inference. Alternatively, the heritability for each transcript needs to be estimated from the mixed effects model before Matrix eQTL is applied, in which case much of the computational advantage is lost. In contrast to the mixed effects model, we randomly split the twin pairs into two groups and then apply multiple linear regression in each group to calculate the score vector for examining the association of each SNP-transcript pair. The simplicity of linear regression models greatly reduces the computational burden. Simulation studies demonstrate the computational advantages of the proposed method. Specifically, the proposed method is roughly 6,000 times faster than the mixed effects model in total processor time. The proposed method achieves this fast performance for two reasons: the simplicity of the linear regression model, and the use of large matrix operations (especially matrix multiplication) to express the most computationally intensive part efficiently.
To combine the analytical results from the two groups, we propose a novel score statistic which automatically adjusts for the correlation between the two groups in a natural way. Results from simulations show that the proposed method controls type I error rates at the target levels much better than do the linear regression model and the naive method. Moreover, the proposed method is competitive: it does not lose much power compared to the mixed effects model but is much more computationally efficient. The slight power loss is likely due to the following two reasons: a) the fixed non-genetic covariate effects need to be estimated twice in our analysis, while in the mixed effects model they only need to be estimated once; and b) the shared environment effects are explicitly modeled by the mixed effects model, whereas our method lumps them into the random error term and models them only implicitly. Score statistics have been well studied [Cox and Hinkley, 1974, Section 9.3(iii)] and are widely used in practice. For example, a similar idea based on score statistics has been applied to meta-analysis of GWAS with overlapping samples, where a robust estimator of the variance-covariance matrix was proposed [Lin and Sullivan, 2009]. Although we have focused our attention on twin studies, the proposed method could be applied and extended to multivariate phenotypes, where multiple correlated phenotypes are measured for each subject.
In simulation studies, we have also tried another test statistic,
However, it is less efficient than the proposed one. This is probably due to the fact that our proposed score test follows under H0 while T2 follows under H0. In addition, we examined the robustness of the proposed method under two model misspecification scenarios: (1) the IBD value for DZ pairs is not fixed at 0.5 but is randomly generated from N(0.50, 0.04), to reflect the IBD variation commonly observed among DZ samples; and (2) the random error ei is non-normally distributed, with a skewed distribution or a distribution with heavier tails (see Table 6). Except for these changes, the simulation setups are kept the same as those in Table 1 with sample size n = 900. Both the proposed method and the computationally intensive mixed effects model are robust to these model misspecifications, further suggesting broad applicability of the proposed method.
Table 6.
Type I error for data from misspecified ACE model with n=900
| | ρdz ~ N | | E ~ Gamma1 | | E ~ Gamma2 | | E ~ t1 | | E ~ t2 | |
|---|---|---|---|---|---|---|---|---|---|---|
| α | pro | mix | pro | mix | pro | mix | pro | mix | pro | mix |
| 0.05 | 0.0522 | 0.0510 | 0.0524 | 0.0502 | 0.0524 | 0.0508 | 0.0525 | 0.0519 | 0.0545 | 0.0491 |
| 0.01 | 0.0108 | 0.0103 | 0.0110 | 0.0111 | 0.0108 | 0.0101 | 0.0110 | 0.0113 | 0.0116 | 0.0118 |
| 0.001 | 0.0011 | 0.0011 | 0.0011 | 0.0011 | 0.0011 | 0.0010 | 0.0012 | 0.0011 | 0.0013 | 0.0018 |
| 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0004 | 0.0001 | 0.0001 | 0.0001 | 0.0002 | 0.0001 | 0.0001 |
pro: proposed score test statistic
mix: mixed effects
E: random error
ρdz: IBD value for DZ pairs
N: Normal(0.501,0.0389)
Gamma1: Gamma(2, ) and Gamma2: Gamma(6, )
t1: t distribution with 4 degrees of freedom; t2: t distribution with 3 degrees of freedom
Finally, it is worth mentioning that, because the twin samples are randomly split, the analysis result is not fixed and varies from one split to another. As an example, we split each of 10,000 simulated datasets from the ACE model twice; the scatter plot of the proposed test statistics from the two splits is given in Figure 2. The two sets of results are not identical but are highly consistent, with R² > 0.99. To reduce the splitting variation, we can repeat the splitting of the samples multiple times and combine the analysis results from the multiple splits. How best to combine such highly correlated results deserves further investigation.
Figure 2.
Scatter plot of the proposed test statistics from 10,000 simulated data sets, each randomly split twice. The data are generated from the ACE model and the sample size is 1800.
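The dependence of the result on the random split can be seen in a small sketch. The Python code below illustrates only the splitting step, assuming each twin pair contributes one twin to each group so that samples within a group are unrelated. It does not compute the proposed score statistic. The function name `split_twin_pairs` and the toy phenotype model are hypothetical.

```python
import numpy as np


def split_twin_pairs(y, rng):
    """Randomly send one twin of each pair to group 1 and the co-twin to group 2.

    y is an (n_pairs, 2) array with one row per twin pair.
    Returns (group1, group2, flip), where flip[i] is True when the
    second twin of pair i was placed in group 1.
    """
    flip = rng.random(y.shape[0]) < 0.5
    group1 = np.where(flip, y[:, 1], y[:, 0])
    group2 = np.where(flip, y[:, 0], y[:, 1])
    return group1, group2, flip


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    n_pairs = 900
    # Toy phenotypes: a shared pair effect plus individual noise (assumed model).
    shared = rng.standard_normal(n_pairs)
    y = shared[:, None] + rng.standard_normal((n_pairs, 2))

    g1_a, g2_a, flip_a = split_twin_pairs(y, rng)
    g1_b, g2_b, flip_b = split_twin_pairs(y, rng)

    # Two independent splits agree on the orientation of only about half
    # of the pairs, so group-level summaries differ between splits.
    print("fraction of pairs split the same way:", np.mean(flip_a == flip_b))
    print("group 1 mean, split A:", g1_a.mean())
    print("group 1 mean, split B:", g1_b.mean())
```

Each split uses every individual exactly once, so any statistic computed per group changes only through which twin of each pair lands in which group.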
Acknowledgments
The authors are grateful for constructive comments and suggestions from the reviewers and the editor. Support was provided in part by National Institute of General Medical Sciences R01GM074175.
References
- Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics. 2001;29(4):1165–1188.
- Boker SM, Neale MC, Maes HH, Wilde MJ, et al. OpenMx: An Open Source Extended Structural Equation Modeling Framework. Psychometrika. 2011;76(2):306–317. doi: 10.1007/s11336-010-9200-6.
- Boker SM, Neale MC, Maes HH, Wilde MJ, et al. OpenMx 1.2 User Guide. 2012.
- Boomsma D, Busjahn A, Peltonen L. Classical twin studies and beyond. Nature Reviews Genetics. 2002;3(11):872–882. doi: 10.1038/nrg932.
- Bottolo L, Petretto E, Blankenberg S, Cambien F, Cook SA, Tiret L, Richardson S. Bayesian detection of expression quantitative trait loci hot spots. Genetics. 2011;189(4):1449–1459. doi: 10.1534/genetics.111.131425.
- Carlin JB, Gurrin LC, Sterne JA, Morley R, Dwyer T. Regression models for twin studies: a critical review. International Journal of Epidemiology. 2005;34(5):1089–1099. doi: 10.1093/ije/dyi153.
- Chipman KC, Singh AK. Bayesian detection of expression quantitative trait loci hot spots. BMC Bioinformatics. 2011;12(7).
- Chou YY, Lepor N, Chiang MC, et al. Mapping genetic influences on ventricular structure in twins. Neuroimage. 2009;44(4):1312–1323. doi: 10.1016/j.neuroimage.2008.10.036.
- Chun H, Keles S. Expression quantitative trait loci mapping with multivariate sparse partial least squares regression. Genetics. 2009;182(1):79–90. doi: 10.1534/genetics.109.100362.
- Cox DR, Hinkley DV. Theoretical Statistics. London: Chapman and Hall; 1974.
- Feng R, Zhou G, Zhang M, Zhang H. Analysis of twin data using SAS. Biometrics. 2009;65(2):584–589. doi: 10.1111/j.1541-0420.2008.01098.x.
- Garcia-Closas M, Couch FJ, Lindstrom S, Michailidou K, Schmidt MK, et al. Genome-wide association studies identify four ER negative-specific breast cancer risk loci. Nature Genetics. 2013;45(4):392–398. doi: 10.1038/ng.2561.
- Ghazalpour A, Doss S, Kang H, et al. High-resolution mapping of gene expression using association in an outbred mouse stock. PLoS Genetics. 2008;4(8):e1000149. doi: 10.1371/journal.pgen.1000149.
- Gilad Y, Rifkin SA, Pritchard JK. Revealing the architecture of gene regulation: the promise of eQTL studies. Trends in Genetics. 2008;24(8):408–415. doi: 10.1016/j.tig.2008.06.001.
- Hanson RL, Muller YL, Kobes S, et al. A genome-wide association study in American Indians implicates DNER as a susceptibility locus for type 2 diabetes. Diabetes. 2014;63(1):369–376. doi: 10.2337/db13-0416.
- Hernandez DG, Nalls MA, Moore M, et al. Integration of GWAS SNPs and tissue specific expression profiling reveal discrete eQTLs for human traits in blood and brain. Neurobiology of Disease. 2012;47(1):20–28. doi: 10.1016/j.nbd.2012.03.020.
- Hoggart CJ, Whittaker JC, DeIorio M, Balding DJ. Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics. 2008;4(7):e1000130. doi: 10.1371/journal.pgen.1000130.
- Hsu YH, Zillikens MC, Wilson SG, Farber CR, Demissie S, et al. An integration of genome-wide association study and gene expression profiling to prioritize the discovery of novel susceptibility loci for osteoporosis-related traits. PLoS Genetics. 2010;6(6):e1000977. doi: 10.1371/journal.pgen.1000977.
- Joreskog KG, Sorbom D. LISREL VI. Mooresville, Indiana: Scientific Software; 1986.
- Kang HM, Ye C, Eskin E. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics. 2008;180(4):1909–1925. doi: 10.1534/genetics.108.094201.
- Kendziorski C, Wang P. A review of statistical methods for expression quantitative trait loci mapping. Mammalian Genome. 2006;17(6):509–517. doi: 10.1007/s00335-005-0189-6.
- Kuna ST, Maislin G, Pack FM, Staley B, Hachadoorian R, Coccaro EF, Pack AI. Heritability of performance deficit accumulation during acute sleep deprivation in twins. Sleep. 2012;35(9):1223–1233. doi: 10.5665/sleep.2074.
- Lin DY, Sullivan PF. Meta-analysis of genome-wide association studies with overlapping subjects. The American Journal of Human Genetics. 2009;85(6):862–872. doi: 10.1016/j.ajhg.2009.11.001.
- McCarthy MI, Hirschhorn JN. Genome-wide association studies: potential next steps on a genetic journey. Human Molecular Genetics. 2008;17(Review Issue 2):R156–R165. doi: 10.1093/hmg/ddn289.
- Michaelson JJ, Loguercio S, Beyer A. Detection and interpretation of expression quantitative trait loci (eQTL). Methods. 2009;48(3):265–276. doi: 10.1016/j.ymeth.2009.03.004.
- Muthen LK, Muthen BO. Mplus User’s Guide. Los Angeles, California: Muthen & Muthen; 1998.
- Neale MC, Cardon LR. Methodology for genetic studies of twins and families. Dordrecht, the Netherlands: Kluwer Academic; 1992.
- Neale MC, Heath AC, Hewitt JK, Eaves LJ, Fulker DW. Fitting genetic models with LISREL: Hypothesis testing. Behavior Genetics. 1989;19(1):37–49. doi: 10.1007/BF01065882.
- Neale MC, Boker SM, Xie G, Maes HH. Mx: Statistical Modeling. Richmond, Virginia: Department of Psychiatry, Medical College of Virginia of Virginia Commonwealth University; 1999.
- Nicolae DL, Gamazon E, Zhang W, Duan S, Dolan ME, Cox NJ. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genetics. 2010;6(4):e1000888. doi: 10.1371/journal.pgen.1000888.
- Park JH, Song YM, Sung J, et al. The association between fat and lean mass and bone mineral density: the Healthy Twin Study. Bone. 2012;50(4):1006–1011. doi: 10.1016/j.bone.2012.01.015.
- R Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2014. URL http://www.R-project.org.
- Rabe-Hesketh S, Skrondal A, Gjessing HK. Biometrical modeling of twin and family data using standard mixed model software. Biometrics. 2008;64(1):280–288. doi: 10.1111/j.1541-0420.2007.00803.x.
- Schadt EE, Molony C, Chudin E, Hao K, Yang X, et al. Mapping the Genetic Architecture of Gene Expression in Human Liver. PLoS Biology. 2008;6(5):e107. doi: 10.1371/journal.pbio.0060107.
- Shabalin AA. Matrix eQTL: ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012;28(10):1353–1358. doi: 10.1093/bioinformatics/bts163.
- Silventoinen K, Sammalisto S, Perola M, et al. Heritability of adult body height: a comparative study of twin cohorts in eight countries. Twin Research. 2003;6(5):399–408. doi: 10.1375/136905203770326402.
- Stegle O, Parts L, Durbin R, Winn J. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Computational Biology. 2010;6(5):e1000770. doi: 10.1371/journal.pcbi.1000770.
- Vaccarino V, Brennan ML, Miller AH, et al. Association of major depressive disorder with serum myeloperoxidase and other markers of inflammation: a twin study. Biological Psychiatry. 2008;64(6):476–483. doi: 10.1016/j.biopsych.2008.04.023.
- Wang X, Guo X, He M, Zhang H. Statistical inference in mixed models and analysis of twin and family data. Biometrics. 2011;67(3):987–995. doi: 10.1111/j.1541-0420.2010.01548.x.
- Winham SJ, Cuellar-Barboza AB, Oliveros A, McElroy SL, Crow S, Colby C, Choi D-S, Chauhan M, Frye M, Biernacka JM. Genome-wide association study of bipolar disorder accounting for effect of body mass index identifies a new risk allele in TCF7L2. Molecular Psychiatry. 2013. doi: 10.1038/mp.2013.159.
- Wright FA, Shabalin AA, Rusyn I. Computational tools for discovery and interpretation of expression quantitative trait loci. Pharmacogenomics. 2012;13(3):343–352. doi: 10.2217/pgs.11.185.
- Wright FA, Sullivan PF, Brooks AI, Zou F, et al. Heritability and genomics of gene expression in peripheral blood. Nature Genetics. 2014;46(5):430–437. doi: 10.1038/ng.2951.
- Yang TP, Beazley C, Montgomery SB, et al. Genevar: a database and Java application for the analysis and visualization of SNP-gene associations in eQTL studies. Bioinformatics. 2010;26(19):2474–2476. doi: 10.1093/bioinformatics/btq452.
- Zeng ZB. Precision mapping of quantitative trait loci. Genetics. 1994;136(4):1457–1468. doi: 10.1093/genetics/136.4.1457.
- Zhang M, Liang L, Morar N, Dixon AL, Lathrop GM, Ding J, Moffatt MF, Cookson WO, Kraft P, Qureshi AA, Han J. Integrating pathway analysis and genetics of gene expression for genome-wide association study of basal cell carcinoma. Human Genetics. 2012;131(4):615–623. doi: 10.1007/s00439-011-1107-5.
- Zhong H, Beaulaurier J, Lum PY, Molony C, Yang X, Macneil DJ, et al. Liver and adipose expression associated SNPs are enriched for association to type 2 diabetes. PLoS Genetics. 2010;6:e1000932. doi: 10.1371/journal.pgen.1000932.
- Zou F, Fine JP, Hu J, Lin DY. An efficient resampling method for assessing genome-wide statistical significance in mapping quantitative trait loci. Genetics. 2004;168(4):2307–2316. doi: 10.1534/genetics.104.031427.
- Zou W, Aylor DL, Zeng Z. eQTL viewer: visualizing how sequence variation affects genome-wide transcription. BMC Bioinformatics. 2007;8(7). doi: 10.1186/1471-2105-8-7.


