Skip to main content
Bioinformatics logoLink to Bioinformatics
. 2013 Feb 17;29(8):1026–1034. doi: 10.1093/bioinformatics/btt075

Accounting for non-genetic factors by low-rank representation and sparse regression for eQTL mapping

Can Yang 1, Lin Wang 1,2, Shuqin Zhang 1,3, Hongyu Zhao 1,*
PMCID: PMC3624800  PMID: 23419377

Abstract

Motivation: Expression quantitative trait loci (eQTL) studies investigate how gene expression levels are affected by DNA variants. A major challenge in inferring eQTL is that a number of factors, such as unobserved covariates, experimental artifacts and unknown environmental perturbations, may confound the observed expression levels. This may both mask real associations and lead to spurious association findings.

Results: In this article, we introduce a LOw-Rank representation to account for confounding factors and make use of Sparse regression for eQTL mapping (LORS). We integrate the low-rank representation and sparse regression into a unified framework, in which single-nucleotide polymorphisms and gene probes can be jointly analyzed. Given the two model parameters, our formulation is a convex optimization problem. We have developed an efficient algorithm to solve this problem and its convergence is guaranteed. We demonstrate its ability to account for non-genetic effects using simulation, and then apply it to two independent real datasets. Our results indicate that LORS is an effective tool to account for non-genetic effects. First, our detected associations show higher consistency between studies than recently proposed methods. Second, we have identified some new hotspots that can not be identified without accounting for non-genetic effects.

Availability: The software is available at: http://bioinformatics.med.yale.edu/software.aspx.

Contact: hongyu.zhao@yale.edu

Supplementary information: Supplementary data are available at Bioinformatics online.

1 INTRODUCTION

Nowadays both gene expression levels and hundreds of thousands of single-nucleotide polymorphisms (SNPs) can be measured by high-throughput technologies. This allows us to systematically explore the relationship between gene expression levels and genotypes: whether a gene is differentially expressed with different genotypes (or alleles) at a specific locus. The loci that are associated with gene expression levels are known as ‘expression quantitative trait loci’ (eQTL) (Li et al., 2012). Recently, a large number of eQTLs have been found in eQTL studies (Cookson et al., 2009). These findings provide insights on how gene expression levels are affected by specific genetic variants (Cheung and Spielman, 2009). They may further help to prioritize disease-associated loci and contribute to disease understanding (Nica and Dermitzakis, 2008).

An important issue in eQTL mapping is that a fairly large proportion of the measured gene expression variations may not be caused by genetic variants, but by some other factors, including cellular state (Alter et al., 2000), environmental factors (Gibson, 2008) and experimental conditions (Leek et al., 2010). A typical example is the batch effect, which may arise when sub-groups of samples were processed by different laboratories, different technicians or on different days. Because these factors are unrelated to genetic variants, we call them non-genetic factors in the rest of the article.

Some of the non-genetic effects can be directly measured. For example, when the batch information is available, the batch effects may be adjusted, e.g an empirical Bayes method named ‘Combat’ (Li and Rabinovic, 2007). However, in practice, non-genetic factors may not be directly and completely observable and thus remain hidden. For example, Pastinen et al. (2006) showed that cell culture conditions have an unnegligible influence on a large number of genes. Gagnon-Bartsch and Speed (2012) reported that a substantial within-batch effect exists in the Microarray Quality Control study (Shi et al., 2006). ‘Expression heterogeneity’ (EH) arises when these hidden factors are not taken into account in statistical analysis. Leek and Storey (2007) showed that EH not only leads to the reduction of statistical power but also spurious association signals in eQTL mapping.

Recently, capturing EH in gene expression studies has drawn the attention of researchers. Many methods have been proposed to infer the hidden factors by some forms of factor analysis, and adjust the inferred factors as if they were observed (Alter et al., 2000; Nielsen et al., 2002).

One well-known method that attempts to address these issues is the Surrogate Variable Analysis (SVA; Leek and Storey, 2007). It performs principal component analysis while taking genotypes into consideration and uses permutation to choose the number of principal components. Kang et al. (2008) proposed the intersample correlation emended (ICE) eQTL mapping method, in which a linear mixed model was introduced to model the hidden factors. When modeling EH, Kang et al. (2008) used the covariance matrix of the gene expression data as the EH covariance matrix in their ICE model. However, this estimate is inconsistent and thus reduces the power of eQTL mapping. Listgarten et al. (2010) introduced another linear mixed model, named ‘LMM-EH’, which corrected the inconsistency of the estimated EH covariance matrix. Once the latent covariance matrix has been estimated, LMM-EH can scan every gene-SNP pair. Alternatively, Stegle et al. (2010) jointly modeled SNPs, gene probes and hidden confounders into a Bayesian framework. Despite its greatly increased power in eQTL mapping, its heavy computational burden might limit its usage. Fusi et al. (2012) proposed another model named ‘PANAMA’ and borrowed some computational techniques from Gaussian process (Rasmussen and Williams, 2006) and further improved the performance of eQTL mapping. However, during the model optimization, PANAMA may be trapped in a local optimum because the optimization problem is not convex.

In this article, we introduce an alternative formulation to address this issue. We propose a LOw-Rank representation to account for non-genetic factors and make use of Sparse regression for eQTL mapping (LORS). We integrate the low-rank representation and sparse regression into a unified framework, in which SNPs and gene probes can be jointly analyzed. Given the two regularization parameters, the optimization of the model structure is a convex problem. We have developed an efficient algorithm to solve this convex problem and its convergence is guaranteed. We demonstrate its usefulness through its applications to both synthetic data and real data.

2 MODEL

Before introducing our formulation, we summarize the notations used in this article. We consider the following norms of a vector Inline graphic: the Inline graphic norm defined as Inline graphic; the Inline graphic and the squared Inline graphic norms defined as Inline graphic and Inline graphic, respectively. We use the following three norms of a matrix Inline graphic: the Frobenius norm Inline graphic, the nuclear norm Inline graphic, where Inline graphic are the singular values of Inline graphic and r is the rank of Inline graphic and the ‘elementwise’ Inline graphic norm Inline graphic.

Let Inline graphic be an Inline graphic matrix corresponding to a gene expression dataset, where n is the number of samples and q is the number of genes. Let Inline graphic be an Inline graphic matrix corresponding to a SNP dataset, where p is the number of SNPs. To model the relationship between Inline graphic and Inline graphic, we propose to decompose Inline graphic as:

graphic file with name btt075m1.jpg (1)

where Inline graphic is the coefficient matrix, Inline graphic is a vector whose entries are all 1, Inline graphic is a Inline graphic matrix with Inline graphic being the j-th intercept and Inline graphic is a Gaussian random noise term with zero mean and variance Inline graphic, i.e. Inline graphic. Here we introduce Inline graphic in our model to account for the variations caused by a few hidden factors. This model implies that gene expression levels are influenced by genetic factors, non-genetic factors and random noises.

To make the decomposition (1) possible, we make the following assumptions:

  • There are only a few hidden factors that may influence gene expression levels. Thus, Inline graphic is a low-rank matrix. Here, we also implicitly assume that the hidden factors have global effects rather than local effects.

  • The gene expression level may only be affected by a small fraction of SNPs. This implies that the coefficient matrix Inline graphic should be sparse.

Based on these assumptions, we propose to solve the following optimization problem:

graphic file with name btt075m2.jpg (2)

where Inline graphic is the elementwise Inline graphic norm defined before, r0 and t0 are some fixed constants. To make the minimization problem tractable, we relax the rank operator on Inline graphic with the nuclear norm, which has been proven to be an effective convex surrogate of the rank operator (Recht et al., 2010). Now we rewrite (2) in a Lagrange form

graphic file with name btt075m3.jpg (3)

where Inline graphic is the nuclear norm of Inline graphic, ρ and λ are regularization parameters that control the sparsity of Inline graphic and the rank of Inline graphic, respectively. Now it is a convex optimization problem and can be solved efficiently.

Missing data are commonly encountered when analyzing gene expression data. Here we extend our basic model (3) in the following to handle missing data naturally.

Suppose we only observed a subset of entries in Inline graphic, indexed by Inline graphic. The unobserved entries are indexed by Inline graphic. Mathematically, we can define an orthogonal projection operator Inline graphic that projects a matrix Inline graphic onto the linear space of matrices supported by Inline graphic:

graphic file with name btt075m4.jpg (4)

and Inline graphic is its complementary projection, i.e. Inline graphic (W) = W.

Because we want to find a sparse coefficient matrix Inline graphic and a low-rank matrix Inline graphic based on the observed data, we propose to solve the following optimization problem:

graphic file with name btt075m5.jpg (5)

where the first term Inline graphic is the sum of squared errors on the observed entries indexed by Inline graphic.

3 ALGORITHM

To solve the optimization problem (3) efficiently, we need the following lemma [the proof can be found in (Mazumder et al., 2010)]:

Lemma 1 —

Suppose matrix Inline graphic has rank r. The solution to the optimization problem

graphic file with name btt075m6.jpg (6)

is given by Inline graphic where

graphic file with name btt075m7.jpg (7)

Inline graphic is the Singular Value Decomposition (SVD) of Inline graphic, Inline graphic, and Inline graphic.

We adopt an alternating strategy to solve problem (3). For fixed Inline graphic and Inline graphic, the optimization problem becomes

graphic file with name btt075m8.jpg (8)

By Lemma 1, we have a closed-form solution for Inline graphic:

graphic file with name btt075m9.jpg (9)

For fixed Inline graphic, the optimization problem becomes

graphic file with name btt075m10.jpg (10)

It can be further decomposed into q independent Lasso problems (Tibshirani, 1996):

graphic file with name btt075m11.jpg (11)

where Inline graphic and Inline graphic are the j-th column of Inline graphic and Inline graphic, respectively. The Lasso problem can be solved efficiently by the coordinate descent algorithm (Friedman et al., 2007, 2010). Now we have Algorithm 1:

Algorithm 1 A fast algorithm to solve problem (3).

  • Input: Inline graphic. Initialize Inline graphic, Inline graphic.

  • Iterate until convergence:
    • Inline graphic-step: Inline graphic.
    • Inline graphic-step: Solve q independent Lasso problems (11) by the coordinate descent algorithm.
  • Output: Inline graphic, Inline graphic, Inline graphic.

So far we have developed the algorithm for solving problem (3). To derive an algorithm to solve optimization problem (5), we need the following lemma [its proof was given by (Mazumder et al., 2010)]:

Lemma 2 —

Soft-impute algorithm

graphic file with name btt075m12.jpg (12)

By Lemma 1, the optimal value Inline graphic of the optimization problem (12) can be obtained via updating Inline graphic using

graphic file with name btt075m13.jpg (13)

with an arbitrary initialization.

We also adopt the alternating strategy to solve (5). For fixed Inline graphic and Inline graphic, optimization problem (5) becomes

graphic file with name btt075m14.jpg (14)

By Lemma 2 we have

graphic file with name btt075m15.jpg (15)

For fixed Inline graphic, optimization problem (5) becomes

graphic file with name btt075m16.jpg (16)

Again, this problem can be decomposed into q independent Lasso problems as follows:

graphic file with name btt075m17.jpg (17)

Now we have Algorithm 2:

Algorithm 2 A fast algorithm to solve problem (5).

  • Input: Inline graphic. Initialize Inline graphic, Inline graphic.

  • Iterate until convergence:
    • Inline graphic-step: iteratively update Inline graphic using (15).
    • Inline graphic-step: Solve q independent Lasso problems (17) using the coordinate descent algorithm.
  • Output: Inline graphic, Inline graphic, Inline graphic.

The convergence analysis of our algorithms and the CPU timings are provided in the Supplementary Document.

4 PARAMETER TUNING

We have two parameters that need to be tuned in our models. Here we propose a cross-validation-like strategy to select these two parameters. The idea is as follows: Let Inline graphic be the index of the observed entries of Inline graphic. We randomly divide Inline graphic into training entries Inline graphic and testing entries Inline graphic: Inline graphic and Inline graphic. The sizes of Inline graphic and Inline graphic are roughly the same. We may solve problem (5) on a grid of (Inline graphic) values on the training data:

graphic file with name btt075m18.jpg (18)

Then we evaluate the prediction error (19) on the testing data

graphic file with name btt075m19.jpg (19)

where we write Inline graphic, Inline graphic and Inline graphic as Inline graphic, Inline graphic and Inline graphic to emphasize that Inline graphic, Inline graphic and Inline graphic depend on the parameters ρ and λ. We can then choose the parameter setting Inline graphic, which minimizes the prediction error (19).

However, searching for two parameters on a grid of values may be too computationally expensive when dealing with large datasets. Instead, we search a good λ value with fixing Inline graphic and then perform a one dimensional search on a sequence of ρ values. In our implementation, we first set the maximum rank of Inline graphic, denoted as Inline graphic, equal to Inline graphic. Then, we start from a large Inline graphic, which equals to the second largest singular value of matrix Inline graphic. After solving

graphic file with name btt075m20.jpg (20)

if Inline graphic, we reduce λ by a factor Inline graphic and repeatedly solve (20) until Inline graphic. Using warm-start, this sequential optimization is efficient (Mazumder et al., 2010).

Then we choose a λ value, which minimizes the prediction error

graphic file with name btt075m21.jpg (21)

Let Inline graphic be the value corresponding to the minimal prediction error (21). Now we can perform a one dimensional search for a good value for ρ. We generate a sequence of ρ values with length Inline graphic equally decreasing from Inline graphic to Inline graphic on the log scale, where Inline graphic is the smallest ρ value such that all entries of Inline graphic are zero. Typically, we set Inline graphic and Inline graphic. For each ρ value, we solve

graphic file with name btt075m22.jpg (22)

and evaluate the prediction error:

graphic file with name btt075m23.jpg (23)

Then we choose the ρ value corresponding to the minimal prediction error (23). Now we can solve model (5) using (Inline graphic) as regularization parameters, and obtain a sparse matrix Inline graphic and a low rank matrix Inline graphic.

5 DISCUSSION

5.1 Relationship between our method and other methods

To our knowledge, LMM-EH (Listgarten et al., 2010) proposed the first framework, where multiple gene expression levels and confounder effects can be jointly analyzed in eQTL studies. For the j-th gene expression level in the LMM-EH model, it assumes the following structure:

graphic file with name btt075m24.jpg (24)

where Inline graphic are the expression and SNP data matrices, respectively. Here Inline graphic denotes Gaussian noise, i.e. Inline graphic, and Inline graphic denotes a random effect, i.e. Inline graphic, where τ is a scalar and Inline graphic. Assuming the independence among Inline graphic, and integrate out Inline graphic and Inline graphic, we arrive at the following form:

graphic file with name btt075m25.jpg (25)

LMM-EH adopts the following strategy to estimate the covariance matrix Inline graphic and other model parameters Inline graphic:

  • First, it estimates Inline graphic from the null model, which does not include any SNPs, denoted as Inline graphic (Kang et al., 2008, 2010; Lippert et al., 2011).

  • Second, using Inline graphic in model (24) as a known covariance and estimate Inline graphic for all gene-SNP pairs (one gene versus one SNPs at a time).

PANAMA extends LMM-EH and allows joint analysis of all SNPs (Fusi et al., 2012). Specifically, PANAMA models the relationship between gene expression levels and SNPs as follows:

graphic file with name btt075m26.jpg (26)

here Inline graphic is the intercept, Inline graphic and Inline graphic are the corresponding coefficients representing the effects of SNPs and hidden factors Inline graphic. PANAMA assigns independent Gaussian priors for Inline graphic and Inline graphic:

graphic file with name btt075m27.jpg (27)

where K is the number of hidden factors. Assuming Inline graphic … , q are independent and integrating out Inline graphic and Inline graphic, the model becomes

graphic file with name btt075m28.jpg (28)

where the intercept term Inline graphic is dropped for notation convenience and Inline graphic. In principle, parameter estimation in (28) can be done by borrowing some computational tricks from Gaussian process model optimization (Rasmussen and Williams, 2006). Computation becomes prohibitive when all genome-wide SNPs are included. In this case, PANAMA adopts a heuristic strategy: PANAMA begins with the null model (i.e. the model does not include SNPs). It first uses principal components to initialize Inline graphic and gradually adds significantly associated SNPs into model (28), and re-estimate model parameter Inline graphic and Inline graphic. This process iterates until no significantly associated SNPs are added into the model. In summary, Inline graphic and Inline graphic are jointly optimized during the iterations.

Our model (1) can be regarded as an equivalent form of (26) because a low rank matrix can always be written as Inline graphic with Inline graphic, where r is the rank of Inline graphic. Unlike PANAMA, both our formulations (3) and (5) are joint convex w.r.t (Inline graphic, Inline graphic, Inline graphic). When the tuning parameters (λ and ρ) are given, our algorithms are guaranteed to converge to the optimal solution without any heuristic. Furthermore, we do not assume that Inline graphic are independent. This can be seen from Lemma 1 and Lemma 2: information among multiple gene expression is used jointly by singular value decomposition.

Compared with PANAMA, the proposed method LORS has its disadvantages. Using PANAMA’s formulation, statistical significance of the associations can be evaluated. Currently, we can not provide a rigorous statistical significance test of the estimated coefficient matrix Inline graphic. The difficulty comes from the unknown statistical property of the nuclear norm. How to do statistical tests with the nuclear norm regularization needs to be investigated in the future. In this article, we use permutation to obtain a rough estimate of false discovery rate (FDR) for our method.

5.2 A screening method based on LORS

Although optimization of our LORS model (3) is a convex problem, it is still too computationally intensive to directly use it for analyzing human-size datasets (e.g. the number of genes Inline graphic, the number of SNPs Inline graphic). One can see the computational bottleneck in the (Inline graphic) step of Algorithm 1 and 2. In this step, q Lasso problems need to be solved, each of which involves p variables. To overcome this computational difficulty, we propose to solve the following optimization problem:

graphic file with name btt075m29.jpg (29)

where Inline graphic is the entire data matrix of gene expression, Inline graphic is the j-th SNP, Inline graphic is the coefficient of the j-th SNP corresponding to its effect size on q genes. Here we consider one SNP at a time, and thus, we do not add L1 regularization. Clearly, it can be considered as the single-variable version of LORS. Thus, we call this screening method as ‘LORS-Screening’. The algorithm to solve (29) is given in the Supplementary Documents. The computational time of LORS-Screening is given in the Supplementary Material.

For large datasets (e.g. human datasets), we recommend to use LORS-Screening to reduce the number of SNPs. After the screening process, we may select top d SNPs for each gene (based on the absolute value of the coefficients). Then we can fit the LORS model using the selected SNPs. This strategy is similar to the single-variable screening step followed by joint analysis in linear regression (Fan and Lv, 2008). According to the property of L1 regularization, LORS can identify at most n non-zero coefficients for each gene. Here we may set d = n.

6 RESULTS

6.1 Synthetic data

To avoid the simulation setup favoring our own model, we use LMM-EH model (24). Specifically, we generate genetic effects, non-genetic effects and noises as follows:

  • Genetic effects: each SNP is generated independently and the minor allele frequencies of these SNPs are uniformly distributed in the interval (0.1, 0.4). The coefficient matrix Inline graphic is a sparse matrix with 1% non-zero entries. These non-zero coefficients are generated using standard Gaussian distribution. Let Inline graphic denote the genetic effect Inline graphic.

  • Non-genetic effects: The covariance matrix Inline graphic is generated by Inline graphic, where Inline graphic and Inline graphic. Here K is the number of hidden factors. The random effect Inline graphic is drawn from Inline graphic. Let Inline graphic.

  • Inline graphic.

Now we have

graphic file with name btt075m30.jpg (30)

In the following simulation studies (Sections 6.2 and 6.3), we set n = 100, p = 100 and q = 200. To evaluate the performance under different signal-to-noise ratios, we define Inline graphic and Inline graphic as:

graphic file with name btt075m31.jpg (31)

Parameters τ and Inline graphic can be used to control Inline graphic and Inline graphic. An example of synthesized datasets is given in the Supplementary Document.

6.2 Influence of parameter tuning

Before we compare LORS with some other methods, we would like to empirically evaluate our parameter tuning procedure. Given a dataset, we need to randomly partition the observed entries into two parts: Inline graphic and Inline graphic. Basically, we train our model based on Inline graphic for different parameters and choose a good parameter configuration such that the trained model has an accurate prediction on Inline graphic. There may be two concerns: (i) Because the random partition may introduce randomness in our modeling process, does this strategy provide a stable parameter selection? (ii) Can this strategy adapt to different noise level? To answer these questions, we do 100 random partitions of a synthetic dataset, and run our method based on each partition separately. The distribution of the selected parameters is shown in Figure 1. First, one can see that the selected parameters Inline graphic do not change a lot during 100 random partitions. The stability of our method should be attributed to the continuity property of the Inline graphic norm (Fan and Li, 2001), that is, a small change of dataset will not cause a big change of the optimization solution. Second, when the signal becomes weaker, i.e. Inline graphic (the left panel of Fig. 1) reduces to Inline graphic (the right panel of Fig. 1), a larger ρ will be selected to prevent the noise from entering the model. This shows that our parameter tuning strategy can adapt to different noise levels. In Section 3.2 of the Supplementary Document, we provide more evidence to show that the random partition in our parameter tuning has little effects on eQTL mapping (i.e. the estimation of matrix Inline graphic).

Fig. 1.

Fig. 1.

The distribution of selected parameters (100 random partitions of training and testing data) for synthetic datasets. Left panel: the synthetic dataset is generated with n = 100, p = 100, q = 200, Inline graphic and Inline graphic. For the parameter λ, Inline graphic and Inline graphic are selected in the sequence of λ values in most cases; for parameter ρ, Inline graphic is selected in most cases. Right panel: the synthetic dataset is generated with n = 100, p = 100, q = 200, Inline graphic and Inline graphic. For λ, Inline graphic and Inline graphic are often selected; for ρ, Inline graphic is selected in most cases

6.3 Performance evaluation

We will mainly compare our method LORS with PANAMA. The reasons are: (i) PANAMA can be regarded as an extension of LMM-EH as we discussed above. (ii) Fusi et al. (2012) showed that PANAMA significantly outperforms other related methods, including SVA, PEER and ICE. Here we include the results from standard linear regression as reference, and compare LORS, LORS-Screening and PANAMA with the standard linear regression.

To compare our method with PANAMA under different settings, we vary Inline graphic, Inline graphic and K. For each setting, we report the averaged result from 50 realizations. Figure 2 shows the comparison results for different combinations of Inline graphic, Inline graphic and K (more simulation results can be found in the Supplementary Document). From these simulation results, we can see the following:

  • When the number of hidden factors increases, the performance of both LORS and PANAMA degrades slightly.

  • When the genetic effects and non-genetic effects are small (compared with noises), e.g. Inline graphic, Inline graphic), LORS is only comparable with standard linear regression and PANAMA is even worse. This is because the noise plays a dominant role here, it is difficult to account for non-genetic effects under this situation.

  • As the genetic effects and non-genetic effects become more apparent, both LORS and PANAMA perform better than standard linear regression. As we mentioned in Section 5.1, LORS and PANAMA share the same model structure, and they differ in how the model structure is inferred. We suspect that PANAMA may be trapped at a local optimum during its model optimization. As a result, LORS may have better performance than PANAMA.

  • Regarding to LORS-Screening, it turns out that LORS-Screening is slightly worse than LORS but comparable with PANAMA. Because the computational cost is largely reduced, it is preferred in large data analysis.

Fig. 2.

Fig. 2.

The ROC curves for performance comparison. (A) The number of hidden factors K = 10. (B) The number of hidden factors K = 30. In each panel, we vary Inline graphic to compare the performance of LORS, PANAMA and standard linear regression

6.4 Estimate of false discovery rate

Owing to lack of statistical tests, it is necessary to provide a way to estimate the FDR of our method. Because correlation exists among the rows and columns of Inline graphic, exactly estimating FDR becomes difficult. Here we follow the strategy of (Tibshirani and Wang, 2008; Nowak et al., 2011) to obtain a rough estimator of the true FDR, which may serve as a guideline when applying our method. We use

graphic file with name btt075m32.jpg (32)

as a rough estimator for FDR, where Inline graphic is the number of associations identified at threshold τ under the null distribution, Inline graphic is the number of associations identified at threshold τ in the original dataset. We use permutation to obtain the number of associations identified under the null distribution. Specifically, for a given threshold τ, we do T permutations. At the t-th permutation, we permute the rows of the expression dataset Inline graphic to generate a null dataset, denoted as Inline graphic. Then, we run LORS on the permuted dataset (Inline graphic) and obtain the number of associations by applying the threshold τ to the estimated matrix Inline graphic, denoted as Inline graphic. After T permutations, the final estimation of Inline graphic is given by

graphic file with name btt075m33.jpg (33)

We report estimated FDR based on the simulated data as described in Section 6.1 in Figure 9 of the Supplementary Document. The permutation strategy provides a reasonably good estimate for the true FDR. The estimated FDR often overestimates the true FDR (due to the correlation), and thus, it may serve as a conservative guideline.

6.5 Application to eQTL data from yeast

We applied our method to two yeast datasets for eQTL mapping. The first dataset is from Brem et al. (2002) (GEO accession number GSE 1990), which consists of 7084 probes and 2956 genotyped loci in 112 segregates. The second one comes from Smith and Kruglyak (2008), which includes 5493 probes measured in 109 segregates. Analysis of these two datasets provides us an opportunity to demonstrate the benefit of our method because the two expression data share the same genetic effect but different confounding effects.

The significant linkage peaks given by standard linear regression are shown in Figure 3 (A). We can clearly see the confounding effects that lead to spurious associations. We also show the result given by LORS in Figure 3 (B). In total, LORS has detected about 10 000 associations according to non-zero Inline graphic values. Because LORS does not perform statistical significance tests, we are not able to report our result based on statistical significance. In practice, people may be more interested in the top signals that will be followed up for replications. Thus, we only show the top 1000 associations based on the absolute value of Inline graphic. The plots of all associations are also given in the Supplementary Document for completeness. It can be seen that the confounding effects are successfully accounted by LORS, and thus spurious associations are greatly reduced. To quantitatively evaluate the ability of accounting for confounding effects, we compare the reproducibility of the results given by LORS and PANAMA. We examine the reproducibility based on the following two criteria:

  • The consistency of detected SNP-gene associations. Let Inline graphic and Inline graphic be the sets of SNP-gene associations detected in the two yeast datasets, respectively. The most T significant associations from the two datasets are denoted as Inline graphic and Inline graphic. The consistency is defined as Inline graphic, where Inline graphic denotes the size of Inline graphic. For LORS, the ranking is based on the absolute value of Inline graphic. For PANAMA, the ranking is based on the q-value.

  • The consistency of detected hotspots. For a SNP, we can count the number of associated genes from the detected SNP-gene associations (for LORs, all SNP–Gene pairs with a non-zero B are defined as associations. For PANAMA, SNP–Gene pairs with a q-value <0.001 are defined as associations, we tried different cutoffs from 0.01 to 0.001, the results are similar), i.e. the regulatory degree of the SNP. SNPs with large degrees are often referred to as hotspots. According to SNPs’ regulatory degrees, we sort them in a descending order and denote the sorted SNPs lists as Inline graphic and Inline graphic for the two yeast datasets. Let Inline graphic and Inline graphic be the top T SNPs in the sorted SNP lists, respectively. The consistency of detected hotspots is defined as Inline graphic.

Fig. 3.

Fig. 3.

(A) A plot of significant linkage peaks given by standard linear regression (P < 0.01 after Bonferroni correction) for expression QTL in the study (Smith and Kruglyak, 2008) by marker location (x-axis) and expression trait location (y-axis). (B) The plot of linkage peaks in the study (Smith and Kruglyak, 2008) given by LORS (Top 1000 associations based on Inline graphic are shown here. The plot of All associations are given in the supplementary document)

For Brem’s dataset (Brem et al., 2002), the estimated sparse matrix Inline graphic given by LORS has about 6000 non-zero entries in total. Among them, there are 4500 entries with abs Inline graphic and 2500 entries with abs Inline graphic, respectively. For Simth’s dataset (Smith and Kruglyak, 2008), the estimated Inline graphic has about 10 000 non-zero entries in total. There are about 4500 entries with abs Inline graphic. (To provide a meaningful guideline of the thresholds, we estimate FDR using 50 permutations. The estimated FDR corresponding to different thresholds are provided in the Supplementary Document. It tells us that FDR Inline graphic 0.01 when threshold Inline graphic 0.01). In Figure 4, we show the consistency of the top 4500 associations. The consistencies of hotspots are shown in Figure 5. It seems to be counter-intuitive that the fraction of consistency of PANAMA increases as the number of detected association increases. In fact, the consistency of PANAMA increases to 0.12 and then drops. We provide the detailed information in the Supplementary Document. From Figures 4 and 5, it can be seen that LORS achieves better consistency than PANAMA.

Fig. 4.

Fig. 4.

Consistency of detected associations between two independent yeast eQTL datasets

Fig. 5.

Fig. 5.

Consistency of detected eQTL hotspots between two independent yeast eQTL datasets

So far we have shown that spurious associations can be reduced by successfully accounting for non-genetic effects. Now we are going to show whether it could help to detect more biologically relevant associations. We take a closer inspection of the top 15 hotspots, as listed in Table 1. In most cases (12/15), associated genes are enriched with at least one GO category, which implies that they are biologically relevant findings. In particular, we detect two novel hotspots (NO. 9 and NO. 13), which can not be detected by standard linear regression (adjusted P Inline graphic). For these two hotspots, the associated genes are enriched in GO categories. In detail, for hotspot NO. 9, five of the 18 associated genes are functional in response to toxin, they are AAD4, YDL218W, YLL056C, AAD6 and SPS100. The hotspot eQTL is cis-linked to one of them, AAD4, which apparently explains the detected association. Hotspot NO. 13 locates at transcription factor (TF) CAT8. Based on the transcriptional regulation information in yeast from both direct (Chip-chip) or indirect (Microarrays—wild type versus TF mutant) evidence (Teixeira et al., 2006), 9 (they are ADY2, PUT4, GAP1, ATO3, ALP1, YDR222w, CWP1, ADH2 and LPX1) of the 15 associated genes can be potentially regulated by CAT8 (adjusted P Inline graphic, the details of the p-value calculation is given in the Supplementary Document). Interestingly, five genes (i.e. ADY2, PUT4, GAP1, ATO3, ALP1) are functional in organic acid transport and CAT8 is known to regulate acid transport pathway (Young et al., 2003). Identification of this hotspot provides a positive example and indicates that, when non-genetic effect has been successfully accounted for, we may be able to detect more biologically relevant trans eQTL.

Table 1.

Summary of the detected hotspots

Hotspot index Sizea Locib GO categoryc Hitsd t-test (all)e t-test (hits)f
NO. 1 32 Chr XII:1056103 Telomere maintenance via recombinationInline graphic 5 20 5
NO. 2 27 Chr IV:1525327 Telomere maintenance via recombinationInline graphic 4 5 3
NO. 3 26 Chr XII:662627 Sterol metabolic processInline graphic 7 25 7
NO. 4 24 Chr I:52859 Fatty acid metabolic processInline graphic 10 12 6
NO. 5 24 Chr XV:202370 Response to abiotic stimulusInline graphic 10 11 6
NO. 6 23 Chr III:201166 Response to pheromoneInline graphic 7 16 6
NO. 7 21 Chr VII:402833 Protein foldingInline graphic 8 4 3
NO. 8 19 Chr I:7298 Fatty acid beta-oxidationInline graphic 5 13 4
NO. 9 18 Chr IV:33214 Response to toxinInline graphic 5 0 0
NO. 10 16 Chr II:562415 CytokinesisInline graphic 8 15 8
NO. 11 16 Chr X:698149 3
NO. 12 16 Chr XV:132423 13
NO. 13 15 Chr XIII:843356 Organic acid transportInline graphic 5 0 0
NO. 14 15 Chr V:395442 3
NO. 15 15 Chr XVI:486637 Sexual reproductionInline graphic 6 8 4

aNumber of genes associated with the hotspot. bThe chromosome position of hotspot. cThe most significant GO category enriched in the associated gene set. The enrichment test was performed using DAVID (Huang da and Lempicki, 2008). The gene function is defined by GO Fat category. DAVID outputs the Benjamini–Hochberg adjusted P-value. Adjusted P-values are indicated by Inline graphic, where Inline graphic dNumber of associated genes that are functional in the enriched GO category. eNumber of associated genes that can also be identified using t-test. fNumber of associated genes that are functional in the enriched GO category and can also be identified using t-test. Two novel hotspots (NO. 9 and NO. 13) which cannot be detected by standard linear regression are in bold.

7 CONCLUSIONS

In this article, we have introduced a method named ‘LORS’ to account for non-genetic effects in eQTL mapping. LORS provided a unified framework in which all SNPs and all gene probes can be jointly analyzed. The formulation of LORS is a convex optimization problem and thus its global optimum can be achieved. We also developed an efficient algorithm to solve this problem and guaranteed its convergence. We demonstrated its performance using synthetic datasets and real datasets.

A limitation of LORS is that we do not provide a rigorous statistical significance test of the estimated coefficient matrix Inline graphic. Here we simply rank associations based on the estimated sparse matrix abs(Inline graphic) and estimate FDR.

Funding: This work was supported in part by NSF (DMS 1106738), NIH (R01 GM59507) and NSFC (10901042, 10971075 and 91130032).

Conflict of Interest: none declared.

Supplementary Material

Supplementary Data

REFERENCES

  1. Alter O, et al. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl Acad. Sci. USA. 2000;97:10101–10106. doi: 10.1073/pnas.97.18.10101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Brem RB, et al. Genetic dissection of transcriptional regulation in budding yeast. Science. 2002;296:752–755. doi: 10.1126/science.1069516. [DOI] [PubMed] [Google Scholar]
  3. Cheung V, Spielman R. Genetics of human gene expression: mapping dna variants that influence gene expression. Nat. Rev. Genet. 2009;10:595–604. doi: 10.1038/nrg2630. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Cookson W, et al. Mapping complex disease traits with global gene expression. Nat. Rev. Genet. 2009;10:184–194. doi: 10.1038/nrg2537. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Huang da W, et al. Systematic and integrative analysis of large gene lists using david bioinformatics resources. Nat. Protoc. 2008;4:44–57. doi: 10.1038/nprot.2008.211. [DOI] [PubMed] [Google Scholar]
  6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001;96:1348–1360. [Google Scholar]
  7. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. Series B Stat. Methodol. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Friedman J, et al. Pathwise coordinate optimization. Ann. Appl. Stat. 2007;1:302–332. [Google Scholar]
  9. Friedman J, et al. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010;33:1–22. [PMC free article] [PubMed] [Google Scholar]
  10. Fusi N, et al. Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies. PLoS Comput. Biol. 2012;8:e1002330. doi: 10.1371/journal.pcbi.1002330. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Gagnon-Bartsch J, Speed T. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012;13:539–552. doi: 10.1093/biostatistics/kxr034. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Gibson G. The environmental contribution to gene expression profiles. Nat. Rev. Genet. 2008;9:575–581. doi: 10.1038/nrg2383. [DOI] [PubMed] [Google Scholar]
  13. Kang H, et al. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics. 2008;180:1909–1925. doi: 10.1534/genetics.108.094201. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Kang H, et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010;42:348–354. doi: 10.1038/ng.548. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Leek J, Storey J. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3:e161. doi: 10.1371/journal.pgen.0030161. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Leek J, et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat. Rev. Genet. 2010;11:733–739. doi: 10.1038/nrg2825. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037. [DOI] [PubMed] [Google Scholar]
  18. Li L, et al. eQTL. Methods Mol. Biol. 2012;871:265–279. doi: 10.1007/978-1-61779-785-9_14. [DOI] [PubMed] [Google Scholar]
  19. Lippert C, et al. Fast linear mixed models for genome-wide association studies. Nat. Methods. 2011;8:833–835. doi: 10.1038/nmeth.1681. [DOI] [PubMed] [Google Scholar]
  20. Listgarten J, et al. Correction for hidden confounders in the genetic analysis of gene expression. Proc. Natl Acad. Sci. USA. 2010;107:16465–16470. doi: 10.1073/pnas.1002425107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Mazumder R, et al. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 2010;11:2287–2322. [PMC free article] [PubMed] [Google Scholar]
  22. Nica A, Dermitzakis E. Using gene expression to investigate the genetic basis of complex disorders. Hum. Mol. Genet. 2008;17(R2):R129–R134. doi: 10.1093/hmg/ddn285. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Nielsen T, et al. Molecular characterisation of soft tissue tumours: a gene expression study. Lancet. 2002;359:1301–1307. doi: 10.1016/S0140-6736(02)08270-3. [DOI] [PubMed] [Google Scholar]
  24. Nowak G, et al. A fused lasso latent feature model for analyzing multi-sample aCGH data. Biostatistics. 2011;12:776–791. doi: 10.1093/biostatistics/kxr012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Pastinen T, et al. Influence of human genome polymorphism on gene expression. Hum. Mol. Genet. 2006;15(Suppl. 1):R9. doi: 10.1093/hmg/ddl044. [DOI] [PubMed] [Google Scholar]
  26. Rasmussen C, Williams C. Gaussian Processes in Machine Learning. Cambridge, MA, USA: The MIT Press; 2006. [Google Scholar]
  27. Recht B, et al. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Rev. 2010;52:471–501. [Google Scholar]
  28. Shi L, et al. The microarray quality control (MAQC) project shows inter-and intraplatform reproducibility of gene expression measurements. Nat. Biotechnol. 2006;24:1151–1161. doi: 10.1038/nbt1239. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Smith EN, Kruglyak L. Gene-environment interaction in yeast gene expression. PLoS Biol. 2008;6:e83. doi: 10.1371/journal.pbio.0060083. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Stegle O, et al. A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comput. Biol. 2010;6:e1000770. doi: 10.1371/journal.pcbi.1000770. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Teixeira M, et al. The yeastract database: a tool for the analysis of transcription regulatory associations in Saccharomyces cerevisiae. Nucleic Acids Res. 2006;34(Suppl. 1):D446–D451. doi: 10.1093/nar/gkj013. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Series B. 1996;58:267–288. [Google Scholar]
  33. Tibshirani R, Wang P. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics. 2008;9:18–29. doi: 10.1093/biostatistics/kxm013. [DOI] [PubMed] [Google Scholar]
  34. Young E, et al. Multiple pathways are co-regulated by the protein kinase Snf1 and the TFs Adr1 and Cat8. J. Biol. Chem. 2003;278:26146–26158. doi: 10.1074/jbc.M301981200. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplementary Data

Articles from Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES