Abstract
The study of gene expression quantitative trait loci (eQTL) is an effective approach to illuminate the functional roles of genetic variants. Computational methods have been developed for eQTL mapping using gene expression data from microarray or RNA-seq technology. Application of these methods for eQTL mapping in tumor tissues is problematic because tumor tissues are composed of both tumor and infiltrating normal cells (e.g. immune cells) and eQTL effects may vary between tumor and infiltrating normal cells. To address this challenge, we have developed a new method for eQTL mapping using RNA-seq data from tumor samples. Our method separately estimates the eQTL effects in tumor and infiltrating normal cells using both total expression and allele-specific expression (ASE). We demonstrate that our method controls type I error rate and has higher power than some alternative approaches. We applied our method to study RNA-seq data from The Cancer Genome Atlas and illustrated the similarities and differences of eQTL effects in tumor and normal cells.
Keywords: eQTL, Tumor Purity, RNA-Seq, Allele Specific Expression
1. Introduction
Genetic variants (e.g. Single Nucleotide Polymorphisms (SNPs)) that are associated with the expression of one or more genes are referred to as gene expression quantitative trait loci (eQTLs). Genome-wide eQTL study is a powerful tool for understanding the functional roles of genetic variants. For example, eQTL analyses can help interpret the results of genome-wide association studies (GWASs) [1].
There are two types of eQTL, cis-eQTL and trans-eQTL [2, 3], which are distinguished by the pattern of expression change they induce. To precisely define these eQTL types, we first define the term “allele”. Consider a diploid genome, which has two homologous copies of each chromosome: a maternal copy and a paternal copy. As such, each genetic locus (e.g., a SNP or a gene) has two copies within a cell, which are referred to as the two alleles of this locus. For a gene affected by a cis-eQTL, the expression of each allele is modulated by the genetic content of the corresponding homologous chromosome, which leads to allelic imbalance of gene expression. In contrast, for a gene affected by a trans-eQTL, the expression of both alleles is modified to the same extent.
The concepts of cis- and trans-eQTLs are crucial to our method development, and thus we further illustrate them with two examples. Consider a cis-eQTL, which is a SNP with A and T alleles. The A allele inhibits the binding of a transcription factor that up-regulates the expression of a nearby gene. In contrast, the T allele does not affect transcription factor binding. If we refer to the two alleles of this gene as the A or T allele (based on the known phase between this cis-eQTL and the nearby gene of interest), this cis-eQTL leads to lower expression of the A allele than of the T allele. An example of a trans-eQTL is a SNP that affects the activity of a transcription factor, which in turn regulates the expression of a gene; such a SNP has the same influence on the expression of both alleles of that gene.
Cis-eQTLs are often conflated with local eQTLs because they tend to be located near the genes they affect. Trans-eQTLs, on the other hand, can be located anywhere in the genome relative to the genes that they regulate [2]. We emphasize that the defining characteristics of cis-eQTLs and trans-eQTLs are not based on their proximity to their target genes, as local eQTLs can induce either cis or trans patterns of expression change.
Traditional eQTL mapping methods implicitly assume an eQTL has the same effect on all cells within a sample. This is a reasonable assumption for samples with a relatively homogeneous cell population. However, tumor samples invariably contain both tumor cells and infiltrating normal cells (e.g., immune cells), and eQTL effects could differ between these two types of cells. To quantitatively capture this heterogeneity within a tumor sample, we consider its tumor purity, defined as the proportion of tumor cells within the tumor sample. Previous eQTL studies in tumor samples often ignore tumor purity information and directly apply eQTL mapping methods that assume the tumor samples are composed of homogeneous cells [4, 5, 6, 7, 8]. When tumor and normal eQTL effects are discordant, our results show that ignoring tumor purity may lead to severely inflated type I error in the identification of tumor-specific eQTLs (Figure 1).
Fig. 1.
Type I error (A) and power (B) from simulation studies; the details of the simulations are described in Section 3.1. Here we define a true discovery as a tumor-specific eQTL. In panel (A), the tumor-specific eQTL effect is 1 (corresponding to no eQTL effect), while the normal-specific eQTL effect increases from 1.0 to 3.0. In panel (B), the normal-specific eQTL effect is 1 and the tumor-specific eQTL effect varies from 1 to 1.8. The methods LR, TReC and TReCASE ignore tumor purity information while the other three methods account for tumor purity. The details of these methods are explained in the Method section.
In this paper, we focus on eQTL mapping using germline genetic variants. The proposed methods may be extended to study eQTL mapping using somatic variants, but such extensions must address the challenge of intra-tumor heterogeneity with respect to somatic mutations. To the best of our knowledge, only one previous work has considered a similar problem of cell-type-specific eQTL mapping given cell type proportion estimates [9]. Specifically, Westra et al. [9] identify neutrophil-specific eQTLs using a linear model: y = β0 + β1G + β2P + β3GP, where y is gene expression, G is genotype, and P is an estimate or proxy of neutrophil proportion. Loci where eQTL effects differ between neutrophils and other cell types were identified by testing the hypothesis β3 = 0. This approach does not directly estimate or assess cell-type-specific eQTL effects. We show in our analysis that a variant of this method that explicitly models a tumor-specific eQTL effect has lower power than our proposed method. The proposed methods are applied to the genetic and gene expression data of 547 women with breast cancer provided by The Cancer Genome Atlas (TCGA). We examine the agreement and disagreement among the posited models with respect to eQTL identification and discuss some interesting eQTLs identified by our method.
2. Method
Our model is an extension of the TReCASE method, which performs eQTL mapping using RNA-seq data [10]. The TReCASE method models RNA-seq data along two dimensions, Total Read Count (TReC) and Allele-Specific Expression (ASE), and simultaneously uses these two types of data for eQTL mapping [3, 10]. The TReC for a gene of interest is the total number of RNA-seq reads mapped to this gene. Under the TReCASE framework, TReCs across samples are modeled by a negative binomial distribution. The ASE of a gene is quantified by the number of allele-specific reads that match the genotype of one haplotype, but not the other haplotype of this gene. Thus, an RNA-seq read is allele-specific if it overlaps with at least one SNP that is heterozygous across the two haplotypes. The number of allele-specific reads from one allele given the total number of allele-specific reads follows a beta-binomial distribution in the TReCASE framework.
The TReCASE method jointly analyzes the TReC and ASE data for cis-eQTLs as these two types of data provide consistent information regarding the effect sizes of cis-eQTLs. In contrast, for a trans-eQTL the eQTL effect is non-zero for TReC but zero for ASE, and thus only TReC data are used for mapping trans-eQTLs. The TReCASE model implicitly assumes eQTL effects are the same across all the cells within a sample, which may not be correct for tumor samples. In this paper, we extend the TReCASE model for tumor eQTL studies by incorporating tumor purity and separate tumor- and normal-specific eQTL effects into our likelihood model. We refer to this new model as pTReCASE.
2.1. The Data
We assume that phased germline genotype data and RNA-seq data from tumor samples are available for n independent subjects. Since germline genotype data have been phased, we have genotypes for each of a subject’s two haplotypes. We also assume that an estimate of tumor purity is available for each tumor sample. For example, one could estimate tumor purity using somatic copy number aberration data [11].
Since pTReCASE is designed to analyze each gene-SNP pair separately, in the following discussion, we consider the model for a specific gene of interest and a single potential eQTL of this gene. For clarity and simplicity in the following notation, we suppress subscripts related to gene and eQTL and note that the given structure applies across any gene-SNP pair. Let G(i) be the genotype of subject i at the potential eQTL. G(i) can take values in {AA, AB, BB} where A and B denote two alleles of the potential eQTL. Let ρi, di, and xi = (xi1, …, xip)T be the tumor purity estimate, read depth measurement, and a vector of p covariates for the i-th sample respectively. We set di as the 75-th percentile of the TReCs across all genes in the i-th sample, which is a more robust way to measure read-depth than the summation of the TReCs across all genes.
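As a small illustration of the read-depth definition above, the following Python sketch computes the 75th percentile of the per-gene TReCs of one sample. The function name `read_depth`, the toy counts, and the interpolation rule (`method="inclusive"`) are assumptions made for illustration; the text does not state how the percentile is interpolated.

```python
import statistics

def read_depth(gene_counts):
    """Return the 75th percentile of the per-gene total read counts of one sample."""
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is the upper quartile.
    return statistics.quantiles(gene_counts, n=4, method="inclusive")[2]

# Hypothetical TReCs for five genes in one sample, including one extremely expressed gene.
counts = [0, 10, 20, 30, 4000]
d_i = read_depth(counts)
total = sum(counts)
```

In this toy sample the single highly expressed gene dominates the sum of counts but leaves the upper quartile unchanged, which is the sense in which the percentile is the more robust measure.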
2.2. Purity Corrected Total Read Count (pTReC) Model
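The total read count Yi is defined as the number of RNA-seq reads that are mapped to a given gene. We assume that Yi follows a negative binomial distribution with over-dispersion ϕ and subject-specific mean μi, the likelihood for which is given by:
$$f_{TReC}(Y_i; \mu_i, \phi) = \frac{\Gamma(Y_i + 1/\phi)}{Y_i!\,\Gamma(1/\phi)} \left(\frac{1}{1 + \phi\mu_i}\right)^{1/\phi} \left(\frac{\phi\mu_i}{1 + \phi\mu_i}\right)^{Y_i},$$
with E(Yi) = μi and Var(Yi) = μi + ϕμi². Summarizing across all n subjects, the log-likelihood for the pTReC model is:
$$\ell_{TReC} = \sum_{i=1}^{n} \log f_{TReC}(Y_i; \mu_i, \phi). \tag{1}$$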
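The negative binomial log-likelihood can be sketched in a few lines of Python. This is a minimal illustration that assumes the common mean and over-dispersion parameterization, in which the variance equals μ + ϕμ²; the function names `nb_logpmf` and `trec_loglik` are hypothetical and are not taken from the authors' software.

```python
import math

def nb_logpmf(y, mu, phi):
    """Log-probability of count y under a negative binomial with mean mu and over-dispersion phi."""
    r = 1.0 / phi  # size parameter; variance is mu + phi * mu**2
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu))
            + y * math.log(mu / (r + mu)))

def trec_loglik(counts, means, phi):
    """Sum the per-subject log-probabilities, one mean per subject."""
    return sum(nb_logpmf(y, mu, phi) for y, mu in zip(counts, means))

# Sanity check on the parameterization: probabilities sum to one and the mean is mu.
probs = [math.exp(nb_logpmf(y, 100.0, 0.2)) for y in range(5000)]
total_prob = sum(probs)
mean_count = sum(y * p for y, p in enumerate(probs))
```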
In impure tumor samples, pTReC captures the genetic effects of a potential eQTL on Yi through its specification of μi (equation 5). In order to illuminate the structure of μi, we must first quantitatively define these genetic effects for both tumor and normal cells. Let μiA and μiB be the mean expression of alleles A and B for the i-th subject, and use superscripts (T) and (N) to denote measurements from tumor and normal cells, respectively. The values of $\mu_{iA}^{(T)}$, $\mu_{iB}^{(T)}$, $\mu_{iA}^{(N)}$, and $\mu_{iB}^{(N)}$ are allowed to vary across subjects, but we assume that the ratios of these quantities are population-level parameters that are invariant across subjects. This can be justified by the fact that subject-specific factors (e.g., read depth of the i-th subject, age, gender) have the same effect on the expression of the two alleles and thus cancel out when we examine their ratios. Symbolically:
$$\eta = \frac{\mu_{iB}^{(N)}}{\mu_{iA}^{(N)}}, \qquad \gamma = \frac{\mu_{iB}^{(T)}}{\mu_{iA}^{(T)}}, \qquad \kappa = \frac{\mu_{iA}^{(T)}}{\mu_{iA}^{(N)}}. \tag{2}$$
Thus, γ represents a population-level eQTL effect in tumor cells that is common to all subjects and η is its counterpart for normal cells. The remaining parameter, κ, is a nuisance parameter that models the baseline gene expression difference between tumor and normal tissues. When γ = 1, no eQTL effect exists within the tumor as the mean expression of alleles A and B is identical. When γ < 1 (or γ > 1), the B allele is under-expressed (or over-expressed) relative to the A allele. The parameter η, the eQTL effect in normal tissues, can be interpreted similarly.
Now let ξi = μiB/μiA. Assuming that the mean expression of an allele is the weighted summation of its expression in tumor and normal cells, we have:
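$$\xi_i = \frac{\mu_{iB}}{\mu_{iA}} = \frac{(1-\rho_i)\,\mu_{iB}^{(N)} + \rho_i\,\mu_{iB}^{(T)}}{(1-\rho_i)\,\mu_{iA}^{(N)} + \rho_i\,\mu_{iA}^{(T)}} = \frac{(1-\rho_i)\,\eta + \rho_i\,\kappa\,\gamma}{1-\rho_i + \rho_i\,\kappa} = (1-c_i)\,\eta + c_i\,\gamma, \tag{3}$$
where ci = (ρiκ) / (1−ρi + ρiκ). The third equality is obtained by dividing both the numerator and denominator by $\mu_{iA}^{(N)}$. Therefore, the overall genetic effect in a tumor sample is a mixture of the genetic effects within tumor cells and normal cells.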
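The mixture structure of the sample-level eQTL effect can be made concrete with a short Python sketch. The function names `mixture_weight` and `sample_eqtl_effect` and the numerical values are hypothetical and chosen only for illustration.

```python
def mixture_weight(rho, kappa):
    """Weight c_i placed on the tumor eQTL effect for a sample with purity rho."""
    return rho * kappa / (1.0 - rho + rho * kappa)

def sample_eqtl_effect(rho, kappa, eta, gamma):
    """Sample-level allelic ratio xi_i = (1 - c_i) * eta + c_i * gamma."""
    c = mixture_weight(rho, kappa)
    return (1.0 - c) * eta + c * gamma

# With purity 0.5 and kappa 1.5, tumor cells contribute 60% of the expression.
c_half = mixture_weight(0.5, 1.5)
xi_half = sample_eqtl_effect(0.5, 1.5, eta=1.0, gamma=2.0)
```

Because κ enters the weight, a sample with 50% tumor cells does not weight the two effects equally unless tumor and normal cells have the same baseline expression (κ = 1).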
Next we consider modeling the μi across different genotypes. First, if the i-th subject has genotype AA at the candidate eQTL,
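$$\mu_i = 2\mu_{iA} = 2\left[(1-\rho_i)\,\mu_{iA}^{(N)} + \rho_i\,\mu_{iA}^{(T)}\right] = 2\mu_{iA}^{(N)}\,(1-\rho_i+\rho_i\kappa). \tag{4}$$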
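We model $2\mu_{iA}^{(N)}$ using a linear function of log read-depth and p covariates: $\log(2\mu_{iA}^{(N)}) = \beta_0 + \beta_d \log(d_i) + \sum_{j=1}^{p} \beta_j x_{ij}$. Applying similar derivations for the subjects with genotypes AB and BB, we have:
$$\log(\mu_i) = \beta_0 + \beta_d \log(d_i) + \sum_{j=1}^{p} \beta_j x_{ij} + \log(1-\rho_i+\rho_i\kappa) + \begin{cases} 0 & \text{if } G(i) = AA, \\ \log\{(1+\xi_i)/2\} & \text{if } G(i) = AB, \\ \log(\xi_i) & \text{if } G(i) = BB. \end{cases} \tag{5}$$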
Note that in the above equations, the genetic effects η and γ affect gene expression through ξi, and ξi is used to model the difference of the μi’s across the three genotype groups: AA, AB, and BB. These equations also demonstrate that we cannot treat the observed genotype as an extra covariate in the negative binomial regression because its relation with log(μi) is not linear.
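To make the genotype dependence concrete, the sketch below computes the expected total read count for each genotype. It assumes that the covariate part of the model describes the expected count of an AA subject with purity 0 (passed in as `baseline`), so that an AB subject has mean μiA + μiB and a BB subject has mean 2μiB; the function name `expected_trec` and all numerical values are hypothetical.

```python
import math

def expected_trec(genotype, rho, kappa, eta, gamma, baseline):
    """Mean total read count of one subject.

    baseline is the expected count for genotype AA at tumor purity 0
    (an assumed stand-in for the exponentiated covariate linear predictor).
    """
    c = rho * kappa / (1.0 - rho + rho * kappa)   # mixture weight c_i
    xi = (1.0 - c) * eta + c * gamma              # sample-level allelic ratio
    aa_mean = baseline * (1.0 - rho + rho * kappa)
    factor = {"AA": 1.0, "AB": (1.0 + xi) / 2.0, "BB": xi}[genotype]
    return aa_mean * factor

# A pure tumor sample (rho = 1) with a two-fold tumor eQTL effect.
means = {g: expected_trec(g, 1.0, 1.5, 1.0, 2.0, 100.0) for g in ("AA", "AB", "BB")}
log_step_ab = math.log(means["AB"] / means["AA"])
log_step_bb = math.log(means["BB"] / means["AB"])
```

The two log-scale steps (AA to AB and AB to BB) differ, which illustrates why the genotype cannot simply enter the negative binomial regression as an additive covariate.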
Let β = (β0, βd, β1, …, βp)T. We estimate all the parameters, including β, ϕ, κ, η, and γ, by maximizing the likelihood function in equation (1) using a block coordinate ascent algorithm. If the terms involving κ, η, and γ in equation (5) are treated as offsets, the problem becomes a standard negative binomial regression with regression coefficients β and over-dispersion parameter ϕ. Therefore, we set β and ϕ as one block of parameters and κ, η, and γ as the other block. Within the block coordinate ascent, we estimate one block of parameters by maximizing the likelihood while holding the parameters of the other block at fixed values. The blocks are updated in turn until the parameter estimates converge. Specifically, given β and ϕ, the parameters κ, η, and γ are estimated by a quasi-Newton method (LBFGS). Then, given κ, η, and γ, the parameters β and ϕ are estimated via a negative binomial regression.
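The control flow of block coordinate ascent can be illustrated on a toy problem. The sketch below is not the pTReC fitting routine: it maximizes a simple concave function of two scalar blocks whose block-wise maximizers are available in closed form, whereas the model described above would use a negative binomial regression and LBFGS for the block updates. All names are hypothetical.

```python
def block_coordinate_ascent(update_a, update_b, a, b, tol=1e-10, max_iter=1000):
    """Alternate exact block updates until neither block moves by more than tol."""
    for iteration in range(1, max_iter + 1):
        a_new = update_a(b)        # maximize over block a with block b held fixed
        b_new = update_b(a_new)    # maximize over block b with block a held fixed
        change = max(abs(a_new - a), abs(b_new - b))
        a, b = a_new, b_new
        if change < tol:
            break
    return a, b, iteration

def objective(a, b):
    """Toy concave objective with its maximum at a = b = 1."""
    return -(a * a + b * b + a * b - 3.0 * a - 3.0 * b)

# Setting each partial derivative to zero gives the closed-form block updates.
a_hat, b_hat, n_iter = block_coordinate_ascent(
    update_a=lambda b: (3.0 - b) / 2.0,
    update_b=lambda a: (3.0 - a) / 2.0,
    a=0.0, b=0.0)
```

Each cycle can only increase the objective, so the iterates climb toward the joint maximizer; the stopping rule on parameter change mirrors the convergence criterion described in the text.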
2.3. Purity Corrected Allele Specific Expression (pASE) Model
We first briefly describe the measurement of ASE and refer the readers to Sun and Hu (2013) [3] for more details. For each subject, we assume phased genotype data are available for two arbitrarily labeled haplotypes, haplotype 1 and haplotype 2. We extract all the RNA-seq reads that overlap with at least one heterozygous SNP within the body of the gene and assign each of these reads to the haplotype that matches its nucleotide sequence. As haplotypes 1 and 2 are arbitrarily labeled for each subject, we ensure comparability across subjects by relabeling these haplotypes with respect to the genotype of the candidate eQTL. For subjects who are heterozygous at the candidate eQTL, haplotype A contains the A allele of the candidate eQTL and haplotype B contains the B allele. For subjects who are homozygous at the candidate eQTL, haplotypes A and B may be defined arbitrarily without affecting the likelihood function or statistical inference.
Let RiA and RiB be the number of allele-specific RNA-seq reads assigned to haplotypes A and B, respectively. Let Ri = RiA + RiB be the total number of allele-specific RNA-seq reads. In human populations, usually around 5–10% of RNA-seq reads overlap with at least one heterozygous SNP and hence are allele-specific reads. In other words, Ri is about 5–10% of Yi. We model RiB given Ri using a beta-binomial distribution with probability of success πi and over-dispersion ψ, the likelihood of which is given by:
$$f_{ASE}(R_{iB}; R_i, \pi_i, \psi) = \binom{R_i}{R_{iB}} \frac{\prod_{k=0}^{R_{iB}-1}(\pi_i + k\psi)\,\prod_{k=0}^{R_i-R_{iB}-1}(1-\pi_i + k\psi)}{\prod_{k=1}^{R_i-1}(1 + k\psi)}.$$
Incorporating all individuals, we may express the log-likelihood for the ASE model as:
$$\ell_{ASE} = \sum_{i=1}^{n} \log f_{ASE}(R_{iB}; R_i, \pi_i, \psi).$$
Following the definition of ξi for the pTReC model in equation (3), we define an analogous quantity for ASE data: ξi,ASE = μiB / μiA = (1 − ci) ηASE + ciγASE, where ci = (ρiκ) / (1−ρi + ρiκ). Then
$$\pi_i = \begin{cases} \xi_{i,ASE}/(1+\xi_{i,ASE}) & \text{if } G(i) = AB, \\ 0.5 & \text{if } G(i) \in \{AA, BB\}. \end{cases}$$
We add the subscript ASE to the notation for ξi, η, and γ in the pASE model in order to distinguish cis-acting and trans-acting eQTLs. For a cis-eQTL, ξi,ASE equals ξi as defined in equation (3). For a trans-eQTL, however, ξi,ASE = 1 since expression of the A and B alleles is impacted to the same extent. A consequence of this modeling strategy is that ASE is uninformative regarding κ, ηASE, or γASE when an eQTL is trans-acting. In addition, for a cis-eQTL, subjects who are homozygous at the potential eQTL do not contribute to the estimation of the eQTL parameters κ, ηASE, or γASE. However, such subjects are informative regarding the over-dispersion parameter ψ.
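The ASE component can be sketched in Python as follows. The sketch assumes a beta-binomial written with shape parameters π/ψ and (1 − π)/ψ, which is one common way to express a success probability π with over-dispersion ψ, and it assumes πi = ξi,ASE / (1 + ξi,ASE) for heterozygous subjects and 0.5 otherwise. The function names are hypothetical.

```python
import math

def ase_success_prob(genotype, xi_ase):
    """Expected fraction of allele-specific reads from haplotype B."""
    return xi_ase / (1.0 + xi_ase) if genotype == "AB" else 0.5

def betabinom_logpmf(r_b, r, pi, psi):
    """Log-probability of r_b haplotype-B reads out of r allele-specific reads."""
    a, b = pi / psi, (1.0 - pi) / psi   # assumed shape parameterization
    log_choose = math.lgamma(r + 1) - math.lgamma(r_b + 1) - math.lgamma(r - r_b + 1)
    log_beta_ratio = (math.lgamma(r_b + a) + math.lgamma(r - r_b + b)
                      - math.lgamma(r + a + b)
                      - math.lgamma(a) - math.lgamma(b) + math.lgamma(a + b))
    return log_choose + log_beta_ratio

# Heterozygous subject with xi_ASE = 1.5, i.e. 60% of reads expected from haplotype B.
pi_het = ase_success_prob("AB", 1.5)
probs = [math.exp(betabinom_logpmf(k, 20, pi_het, 0.2)) for k in range(21)]
total_prob = sum(probs)
mean_b = sum(k * p for k, p in enumerate(probs))
```

For a homozygous subject the success probability is fixed at 0.5 whatever the eQTL parameters are, so such a subject informs only ψ, in line with the discussion above.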
As for pTReC, model fitting in pASE is also achieved via a block coordinate ascent algorithm using two blocks of parameters: one block consists of one parameter ψ, and the other block consists of κ, ηASE and γASE. We estimate the parameters of these two blocks iteratively until convergence. Updates for each block are accomplished via LBFGS.
2.4. pTReCASE: Unifying pTReC and pASE Models
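Restricting to cis-eQTLs, the pTReC and pASE models share the κ, η, and γ parameters, allowing for unification into a single likelihood model:
$$L(\Theta) = \prod_{i=1}^{n} f_{TReC}(Y_i; \kappa, \eta, \gamma, \beta, \phi)\; P(R_i \mid Y_i)\; f_{ASE}(R_{iB} \mid R_i; \kappa, \eta_{ASE}, \gamma_{ASE}, \psi),$$
where Θ = (κ, η, γ, βT, ϕ, ηASE, γASE, ψ)T, which includes all the parameters found in the pTReC or pASE models. Recall that β = (β0, βd, β1, …, βp)T.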
Note that the likelihood above relates Yi and Ri, which are the total number of RNA-seq reads and the number of allele-specific RNA-seq reads, respectively. A read is allele-specific if it overlaps with at least one heterozygous SNP. Thus, given Yi, the distribution of Ri is a function of RNA-seq read length (the longer the read, the more likely it overlaps with a heterozygous SNP) and the number of heterozygous SNPs within the gene. It is reasonable to assume both factors are independent of eQTL effects. Therefore, we may remove P(Ri | Yi) from the likelihood function. The log-likelihood of all n subjects is then given by:
$$\ell(\Theta) = \sum_{i=1}^{n} \left[ \log f_{TReC}(Y_i; \kappa, \eta, \gamma, \beta, \phi) + \log f_{ASE}(R_{iB} \mid R_i; \kappa, \eta_{ASE}, \gamma_{ASE}, \psi) \right].$$
Model fitting is achieved via block coordinate ascent using three blocks: block 1 consists of κ, η and γ; block 2 consists of ϕ,βd and βj for j = 0,1, …, p; and block 3 consists of ψ alone. A single update is defined by the following steps. First, given the parameters of blocks 2 and 3, the parameters of block 1 are updated using LBFGS. Then, given the parameters of blocks 1 and 3, the parameters of block 2 are updated via negative binomial regression. Finally, given the other parameters, the parameter of block 3 is updated using LBFGS. These cyclical updates are repeated until convergence.
2.5. Hypothesis Testing
Under the proposed models of sections 2.2 through 2.4, there are three critical questions of interest. Should we use the pTReC or pTReCASE model to assess an eQTL? Does an eQTL exist within normal tissue? Does an eQTL exist within tumor tissue?
Addressing the first question requires consideration of the biological mechanisms driving cis- and trans-eQTLs. For a cis-eQTL, the pTReC and pASE components share the same parameters for eQTL effect sizes, and thus jointly modeling TReC and ASE (i.e., pTReCASE) increases power. For a trans-eQTL, expression of both alleles of the affected gene are altered to the same extent, and thus ASE is not informative in the detection of eQTL or estimating eQTL effect size. Therefore, only TReC data should be used for eQTL mapping of trans-eQTL. We develop a “Cis-Trans” score test to assess whether the eQTL effects estimated using the TReC and ASE data are the same.
Recall that η and γ denote eQTL effects in normal and tumor tissues, respectively, for TReC data, and that ηASE and γASE denote these eQTL effects for ASE data. Let ηASE = η + αη and γASE = γ + αγ, where αη and αγ reflect the discrepancy between ASE and TReC eQTL effects for normal and tumor tissues, respectively. The null hypothesis of equivalent eQTL effects in the TReC and ASE components of the model is then αη = αγ = 0. See the Supplementary Materials Section A.4 for a detailed description and derivation of this “Cis-Trans” score test. The test statistic and its asymptotic distribution under the null hypothesis are provided below:
$$S = \nabla\ell(\tilde{\Theta})^{T}\, I(\tilde{\Theta})^{-1}\, \nabla\ell(\tilde{\Theta}) \sim \chi^2_2,$$
where Θ = (κ, η, γ, βT, ϕ, αη, αγ, ψ)T is a re-parameterization of (κ, η, γ, βT, ϕ, ηASE, γASE, ψ)T that replaces ηASE and γASE with αη and αγ, respectively. Here $\tilde{\Theta}$ is the MLE of the parameters under the null hypothesis αη = αγ = 0, $\nabla\ell(\tilde{\Theta})$ is the gradient of the TReCASE log-likelihood with respect to the parameters, and $I(\tilde{\Theta})$ is the Fisher information matrix, both evaluated at $\tilde{\Theta}$.
After the Cis-Trans score test, the parameters of our model reduce to Θ = (κ, η, γ, βT, ϕ)T for the TReC model or Θ = (κ, η, γ, βT, ϕ, ψ)T for the TReCASE model. Here we use a unified notation ΘT for simplicity. The presence of an eQTL in normal tissue (i.e., η ≠ 1) or tumor tissue (i.e., γ ≠ 1) can be assessed using likelihood ratio tests (LRT). These test statistics and their asymptotic distributions take the form:
$$LRT = -2\left[\ell(\hat{\Theta}_T^{(0)}) - \ell(\hat{\Theta}_T^{(1)})\right] \sim \chi^2_1,$$
where $\hat{\Theta}_T^{(0)}$ represents estimates under the null and $\hat{\Theta}_T^{(1)}$ represents estimates under the alternative. To test for the presence of an eQTL in normal or tumor tissue, $\hat{\Theta}_T^{(0)}$ is obtained by fitting the model under a null hypothesis of η = 1 or γ = 1, respectively.
To identify eQTL for a single gene-SNP pair, we propose the following procedure.
1. Conduct the “Cis-Trans” score test to determine use of the pTReC or pTReCASE model.
2. Under the model prescribed by the “Cis-Trans” test, conduct separate LRTs of γ = 1 and η = 1 to determine the presence of eQTL effects.
The above algorithm is designed to ensure that inconsistent effects in the pTReC and pASE models do not limit the power to detect trans-eQTLs. For a trans-eQTL, the eQTL effect modeled by pASE should be 1 whereas that modeled by pTReC should differ from 1. Thus, joint estimation using pTReCASE would dilute the estimated effect, resulting in a loss of power. Applying the Cis-Trans score test before testing the eQTL effect will not bias the eQTL test because the Cis-Trans score test assesses whether the eQTL effects are consistent between the pTReC and pASE models, regardless of the size of the eQTL effect.
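The two-step procedure can be sketched as follows, assuming a chi-squared reference distribution with 2 degrees of freedom for the Cis-Trans score statistic (two constrained parameters) and 1 degree of freedom for each LRT. The function names and the 0.05 threshold for the Cis-Trans test are hypothetical; the text does not state which threshold is used.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability of a chi-squared variable (closed forms for df 1 and 2)."""
    if df == 1:
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 and df = 2 are handled in this sketch")

def choose_model(cis_trans_score, alpha_ct=0.05):
    """Step 1: drop the ASE component when TReC and ASE effects are inconsistent."""
    p_ct = chi2_sf(cis_trans_score, df=2)
    return ("pTReC" if p_ct < alpha_ct else "pTReCASE"), p_ct

def lrt_pvalue(loglik_null, loglik_alt):
    """Step 2: likelihood ratio test of gamma = 1 (or eta = 1) under the chosen model."""
    stat = max(0.0, 2.0 * (loglik_alt - loglik_null))
    return chi2_sf(stat, df=1)

# Hypothetical score statistics: a small one keeps the joint model, a large one does not.
model_small, p_small = choose_model(1.0)
model_large, p_large = choose_model(20.0)
```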
3. Results
3.1. Simulation Study
We conducted a simulation study to compare the statistical power and type I error rate of pTReCASE and several other methods. Simulations were conducted across a range of eQTL effect sizes in normal and tumor cells. For the purpose of detecting tumor-specific eQTLs, we assessed Type I error by setting γ = 1 while varying η, and assessed power by setting η = 1 while varying γ. For each pair of η and γ, we simulated 400 replicates of gene expression and genotype data for 500 subjects. Genotypes were simulated assuming a minor allele frequency of 0.2. Read counts were simulated according to the pTReCASE model using the following algorithm:
1. Randomly generate tumor purities from a uniform distribution on (0.5, 1) for each of the 500 subjects.
2. Simulate TReC via a negative binomial model with:
- Mean of 100 reads for subjects with genotype AA and tumor purity of 0%.
- κ = 1.5 and ϕ = 0.2.
3. Assume that 5% of the simulated TReC reads are allele-specific reads, rounded to the nearest integer. Partition the allele-specific reads between haplotypes according to the established beta-binomial model using an over-dispersion of ψ = 0.2.
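The simulation steps above can be sketched in plain Python as follows. Several details are assumptions made for illustration rather than taken from the text: genotypes are drawn under Hardy-Weinberg equilibrium with B as the minor allele, the negative binomial is drawn as a gamma-Poisson mixture, no covariates are included, and the function names are hypothetical.

```python
import random

def poisson(lam, rng):
    """Poisson(lam) draw: count unit-rate exponential arrivals in [0, lam]."""
    n, t = 0, rng.expovariate(1.0)
    while t <= lam:
        n += 1
        t += rng.expovariate(1.0)
    return n

def simulate_subject(eta, gamma, rng, kappa=1.5, phi=0.2, psi=0.2,
                     base_mean=100.0, maf=0.2, ase_fraction=0.05):
    """Simulate purity, genotype, TReC and ASE counts for one subject."""
    # Step 1: tumor purity from Uniform(0.5, 1).
    rho = rng.uniform(0.5, 1.0)
    # Genotype: each haplotype carries B with probability maf (assumed HWE).
    n_b = int(rng.random() < maf) + int(rng.random() < maf)
    # Sample-level eQTL effect: purity-weighted mixture of eta and gamma.
    c = rho * kappa / (1.0 - rho + rho * kappa)
    xi = (1.0 - c) * eta + c * gamma
    # Step 2: negative binomial TReC, drawn as a gamma-Poisson mixture.
    genotype_factor = (1.0, (1.0 + xi) / 2.0, xi)[n_b]
    mu = base_mean * (1.0 - rho + rho * kappa) * genotype_factor
    trec = poisson(rng.gammavariate(1.0 / phi, phi * mu), rng)
    # Step 3: 5% of reads are allele-specific; split them with a beta-binomial.
    ase_total = round(ase_fraction * trec)
    pi = xi / (1.0 + xi) if n_b == 1 else 0.5
    p_b = rng.betavariate(pi / psi, (1.0 - pi) / psi)
    ase_b = sum(rng.random() < p_b for _ in range(ase_total))
    return {"purity": rho, "genotype": ("AA", "AB", "BB")[n_b],
            "trec": trec, "ase_total": ase_total, "ase_b": ase_b}

rng = random.Random(2024)
# One replicate of a power scenario: eta = 1 (no normal eQTL), gamma = 1.4.
replicate = [simulate_subject(eta=1.0, gamma=1.4, rng=rng) for _ in range(500)]
```

Repeating the last line 400 times for each pair of η and γ would reproduce the overall layout of the simulation design, although not necessarily the exact random number streams or implementation details of the original study.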
Then we applied each considered eQTL mapping method to the simulated data. The type I error was estimated by the proportion of simulations in which a tumor eQTL was incorrectly identified when none was present. Power was estimated by the proportion of simulations in which the simulated tumor eQTL was recovered.
The competing eQTL models that we considered include the TReC/TReCASE method without correction for tumor purity, and the TReC model with tumor purity correction (pTReC). In addition, we also considered a naïve approach of linear regression ignoring tumor purity, labeled LR (linear regression), and a modification of the approach adopted by Westra et al. [9], denoted by pLR (purity corrected linear regression). To fit a linear model, we first applied a normal quantile transformation to the (read-depth corrected) TReC values of each gene across n samples, and then used the transformed TReC as the response variable for linear regression. Specifically, we first replaced TReC values by their ranks across n samples, and then replaced the ranks by their corresponding normal quantiles. For example, rank r was replaced by the r / (n + 1)-th normal quantile. Letting $\tilde{Y}$ be the transformed TReC data, the linear model is given by $\tilde{Y} = \beta_0 + \beta_1 G + \epsilon$, where G is the genotype of the candidate eQTL.
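The normal quantile transformation described above can be sketched with the Python standard library. The function name `normal_quantile_transform` is hypothetical, and ties are broken by order of appearance, which is an assumption because the text does not say how ties are handled.

```python
from statistics import NormalDist

def normal_quantile_transform(values):
    """Replace each value by the standard normal quantile of rank / (n + 1)."""
    n = len(values)
    # Indices sorted by value; sorted() is stable, so ties keep their input order.
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    transformed = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        transformed[idx] = std_normal.inv_cdf(rank / (n + 1))
    return transformed

# Hypothetical read-depth corrected TReC values for three samples.
z = normal_quantile_transform([30.0, 5.0, 12.0])
```

Dividing by n + 1 rather than n keeps every probability strictly between 0 and 1, so the largest observation maps to a finite quantile.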
To test the genotype and tumor purity interaction using pLR, we fit a linear model $\tilde{Y} = \beta_0 + \beta_1 G + \beta_2 \rho + \beta_3 G\rho + \epsilon$, where $\tilde{Y}$ is the transformed TReC and ρ denotes the tumor purity estimate. The interaction test employed by Westra et al. [9] (i.e., β3 = 0) does not assess the strength of a tumor eQTL. Rather, it tests whether eQTL effects differ between tumor and normal tissues. Under pLR, we assessed tumor eQTL effects by testing β1 + β3 = 0, since β1 + β3 is the genetic effect of the candidate eQTL when tumor purity is 1.
All three methods that account for tumor purity (pTReCASE, pTReC, pLR) control Type I error at the desired level (Figure 1A). In contrast, as eQTL strength in the normal tissue increases, the methods that do not account for tumor purity see a rapid increase in Type I error (Figure 1A). In terms of power, the methods that do not account for tumor purity have higher power (Figure 1B), but such high power is not meaningful because these methods do not control Type I error. These naive methods have higher power because the majority of gene expression is contributed by tumor cells (50–100% of cells are tumor cells and tumor cells have, on average, 1.5 times the expression of normal cells). If we simulate tumor purity to be uniform from 0 to 100% and set the gene expression of tumor and normal cells to be similar, then the methods that account for tumor purity have higher power. Among the methods that control Type I error (i.e., pLR, pTReC, and pTReCASE), pTReCASE has the highest power since it combines the TReC and ASE data.
3.2. The Cancer Genome Atlas (TCGA) Data
3.2.1. Data and Model Fitting
We applied the pTReCASE model to analyze gene expression and germline SNP genotype data from 550 breast cancer patients of The Cancer Genome Atlas (TCGA) project. All the data were downloaded from the TCGA data portal (https://tcga-data.nci.nih.gov/docs/publications/tcga/), which has now been replaced by the NCI Genomic Data Commons (https://portal.gdc.cancer.gov/). We started with 728 patients with RNA-seq data from tumor samples. In order to assess allele-specific gene expression, we downloaded raw RNA-seq data in bam file format. For genotype data, we downloaded the Affymetrix CEL files. We restricted our analysis to the 550 of 728 patients who had available genotype data, passed quality controls for both genotype and RNA-seq data, and were Caucasian females (see Supplementary Materials Section B for details). Males were excluded as breast cancer in men is rare and may have a different disease etiology. The restriction to Caucasian samples is not necessary, but it helps to eliminate possible confounders [12].
For the remaining 550 patients, genotype imputation and haplotype phasing were performed by MACH [13] using reference haplotypes from the 1000 Genomes Project. Starting with ~800,000 SNPs genotyped by the Affymetrix 6.0 array, we imputed the genotypes for ~36 million SNPs. For each sample, we used all the SNPs with heterozygous genotypes to estimate allele-specific expression (see Supplementary Materials Section B for details). For the purposes of eQTL mapping, we restricted our analysis to those SNPs with variant allele frequency (VAF) ≥ 0.02 (6,825,065 SNPs after imputation) because there is limited power to detect eQTLs at lower values of VAF. Tumor purities were estimated using ABSOLUTE [11], which led to the exclusion of three additional subjects lacking valid purity estimates. Estimated haplotypes and tumor purities were treated as truth in the subsequent pTReCASE and linear regression models.
Linear models for eQTL analysis and the revised Westra approach (i.e., purity corrected linear regression or pLR) were fit using matrixEQTL [14] and customized R code, respectively, on normal quantile transformed RNA-seq count data. TReC, TReCASE, pTReC and pTReCASE models were fit using our own R packages. The median analysis time for a single gene-SNP pair using pTReCASE was 2.71 seconds (IQR = 2.93 seconds). The covariates used for eQTL mapping include read-depth of RNA-seq experiments (Supplementary Figure S5), RNA sample plates, age, and the top two principal components derived from the genotype data of the 550 Caucasian samples. Since our method is designed to identify cis-eQTLs and most cis-eQTLs are local to the target gene, we restricted our analysis to SNPs located within 100 kb of the gene of interest.
3.2.2. eQTL Identification
Figures 2A–B illustrate a tumor-specific eQTL identified by the pTReCASE model. The estimates of effect sizes (ratio of gene expression of the B allele versus the A allele) for normal and tumor-specific eQTLs are 0.96 (η) and 3.51 (γ), respectively. The fold change of gene expression in tumor versus normal cells (for genotype AA) is 0.19 (κ) (Figure 2D).
Fig. 2.
(A) Covariate-corrected total expression estimated via pTReCASE plotted against genotype and tumor purity. Outliers were suppressed for clarity. A dot plot instead of a boxplot was used when the sample size of a category was too small. (B) Allele-specific expression corresponding to the case shown in (A). (C) Covariate-corrected total expression estimated via pTReC plotted against genotype and tumor purity. (D) Table providing the gene, SNP, and estimated parameters for the displayed assessments. pCT denotes the p-value of the Cis-Trans score test.
In other words, gene expression in tumor cells is lower than that in normal cells, and the eQTL effect is only present in tumor cells. These numerical estimates were well demonstrated by Figures 2A–B. As tumor purity increases, gene expression measured by TReC decreases (Figure 2A), and the strength of the eQTL increases. Both TReC and ASE show consistent signals that the B allele has higher expression, with a Cis-Trans test p-value of 0.95.
Figure 2C illustrates a tumor-specific eQTL identified by the pTReC model. In this example, gene expression from the ASE model was not used for eQTL mapping due to a significant Cis-Trans test. The gene expression is higher in tumor cells compared to normal cells and the B allele has lower expression than the A allele in tumor cells, but not in normal cells. Note that we can still see some signals of an eQTL in the category with the lowest tumor purity. This results from TCGA samples being selected to have relatively higher tumor purity, thereby creating a categorization schema wherein even the lowest tumor purity category has a nonnegligible amount of tumor cells.
We use another example to demonstrate the utility of the Cis-Trans score test (Figure 3). Considering only the TReC data, the B allele has slightly higher expression than the A allele when tumor purity is high (Figure 3A).
Fig. 3.
Demonstrating the utility of the Cis-Trans score test. (A) Covariate-corrected total expression plotted as a function of genotype and tumor purity. (B) Allele Specific Expression with respect to genotype and tumor purity. (C) Table containing relevant modeling information for A and B. pCT provides the p-value of the Cis-Trans score test.
In contrast, considering only the ASE data, the B allele has much lower expression than A across all tumor purity levels. This inconsistency between TReC and ASE data led to a highly significant Cis-Trans p-value (Figure 3C). In such cases, only the TReC data is trusted and used to estimate eQTL effects. ASE tends to be noisier in real data as mapping biases, incorrect genotype data, and/or other biological and technical factors can lead to the observed ASE imbalance as opposed to eQTL effects. Failure to consider the Cis-Trans test could lead to the acceptance of spurious eQTL results.
Next, we systematically compared the results for all eQTL findings using the pTReCASE, TReCASE, and pLR approaches at various p-value cutoffs. One way to compare the results is to check the overlap of all eQTL findings (i.e., all the gene-SNP pairs) (Supplementary Table S3). However, due to LD, the expression of one gene may be associated with multiple SNPs that are in close proximity to one another and often represent redundant eQTL signals. Therefore, we focused on the eQTL results summarized at the gene level. In other words, for a given p-value cutoff, we counted the number of genes with at least one eQTL with a p-value falling below the cutoff (Supplementary Table S4, Figure 4).
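The gene-level summary described above amounts to collapsing gene-SNP pairs to the set of genes with at least one p-value below the cutoff. A minimal sketch follows; the function name, gene and SNP identifiers, and p-values are all hypothetical.

```python
def genes_with_eqtl(pair_pvalues, cutoff):
    """Return the set of genes having at least one gene-SNP pair with p-value below cutoff."""
    return {gene for (gene, snp), p in pair_pvalues.items() if p < cutoff}

# Hypothetical p-values: two SNPs in LD both tag geneA, one SNP weakly tags geneB.
pvals = {
    ("geneA", "snp1"): 1e-9,
    ("geneA", "snp2"): 3e-7,
    ("geneB", "snp3"): 2e-6,
    ("geneC", "snp4"): 0.4,
}
genes_loose = genes_with_eqtl(pvals, 5e-6)
genes_strict = genes_with_eqtl(pvals, 5e-8)
```

Counting genes rather than pairs means that the two correlated SNPs for the first gene contribute a single finding, which is the redundancy the gene-level comparison is meant to remove.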
Fig. 4.
The number of genes with at least one significant eQTL at p-value cutoffs 5e-6 (A) or 5e-8 (B) using three methods: pLR, TReC(ASE), and pTReC(ASE). The parentheses around ASE in TReC(ASE) indicate that we use ASE information if the Cis-Trans test does not reject the null hypothesis that eQTL effects are consistent between TReC and ASE data.
Across p-value thresholds, the pLR model identifies far fewer eQTLs than TReCASE or pTReCASE. Among the eQTLs identified by the pLR model, 70–90% are also identified by pTReCASE. The pLR model also misses at least 70% of the eQTLs identified by pTReCASE. There are at least two possible reasons for the relatively low power of the pLR model. First, pLR does not use ASE information. Second, pLR assumes that normal quantile transformed gene expression is a linear function of tumor purity, which may not be a good approximation.
Compared to pTReCASE, the TReCASE model identifies eQTLs in a larger number of genes. For those genes where TReCASE identifies a significant eQTL and pTReCASE does not, the significant findings of the TReCASE model are most likely driven by an eQTL in normal cells. TReCASE recaptures around two-thirds of eQTL findings identified by pTReCASE. The one-third missed by TReCASE are likely to be tumor-specific eQTLs or eQTLs with much weaker effect in normal cells than tumor cells.
The eQTLs captured by both TReCASE and pTReCASE likely affect gene expression in both tumor and normal cells. As expected, such eQTLs tend to have smaller p-values than those identified by only one method (Supplementary Figure S6). We grouped the eQTLs identified by TReCASE and/or pTReCASE into three groups (those identified by both methods, by TReCASE only, and by pTReCASE only) and checked whether there are any systematic differences among the three groups. We examined whether the eQTL SNPs in these three groups have equal probability of being located within a certain distance d of any breast cancer GWAS (Genome-wide association study) hit. We used 469 GWAS hits (p-value < 5 × 10⁻⁷) from a recent GWAS of breast cancer with 122,977 cases and 105,974 controls of European ancestry and 14,068 cases and 13,104 controls of East Asian ancestry [15]. When we varied the distance between eQTL SNPs and GWAS hits from 100kb to 10Mb, the most significant association between eQTL SNP groups and GWAS hits was achieved when d is around 4Mb (Supplementary Figure S7). Of the 345 eQTL SNPs identified by both methods, we expect 156 to be within 4Mb of a GWAS hit and observe 185 (Chi-squared test p-value 0.002). This is not a very strong enrichment, but it does suggest that eQTLs shared by tumor and normal cells are more likely to be associated with GWAS hits. In addition, we examined the intersection of eGenes (genes with eQTLs) and 719 cancer related genes as defined by the Cancer Gene Census (https://cancer.sanger.ac.uk/census, Supplementary Table S7–S9). Interestingly, we observed slight enrichment of cancer related genes among the eGenes with an eQTL identified by pTReCASE but not by TReCASE: we observed 7 such genes where 3 were expected (p-value 0.03). These results suggest that eQTLs that exist in tumor but not normal cells may be associated with tumor progression, though more data and analysis are needed to confirm this, for example, a pan-cancer study or a longer list of cancer related genes.
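To make the observed-versus-expected comparison concrete, the sketch below runs a one-degree-of-freedom chi-squared goodness-of-fit calculation on the counts quoted above (185 observed and 156 expected out of 345). This is an assumption about the form of the test, not a description of the exact procedure used for the reported p-value, and the expected count is a rounded figure. The helper name `chisq_two_cells` is hypothetical.

```python
import math


def chisq_two_cells(observed_in, expected_in, total):
    """Chi-squared goodness-of-fit for an in/out split with 1 degree of freedom."""
    observed = [observed_in, total - observed_in]
    expected = [expected_in, total - expected_in]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Counts taken from the text: 345 shared eQTL SNPs, 185 observed within 4Mb
# of a GWAS hit versus 156 expected.
stat, p_value = chisq_two_cells(observed_in=185, expected_in=156, total=345)
print(f"chi-squared = {stat:.2f}, p-value = {p_value:.4f}")
```

Under these assumptions the statistic is about 9.84 and the p-value is about 0.0017, which is of the same order as the reported 0.002.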
We have also compared our results with the eQTLs of breast cancer samples reported by an earlier study [7]. Li et al. (2013) [7] used a smaller sample size (n = 219 vs. n = 547 in our study) and measured gene expression by microarray rather than RNA-seq. They limited their analysis to ~800,000 SNPs genotyped by the Affymetrix 6.0 array and searched for eQTLs within 1Mb of each gene. In contrast, we considered more than 6 million SNPs after imputation and filtering by VAF ≥ 0.02, and searched for eQTLs within 100kb of each gene. Despite these differences, we found significant overlap between the eQTLs found by our TReCASE or pTReCASE model and the results reported by Li et al. (2013) [7] (Supplementary Table S5–S6).
Overall, both TReCASE and pTReCASE identified a large number of eGenes (i.e., genes with at least one significant eQTL). For example, at a p-value cutoff of 5e-6, TReCASE and pTReCASE identified around 3,000 and 1,200 eGenes, respectively, out of 18,000 genes tested. These numbers of eGenes are actually not large compared with some recent studies. For example, in the Genotype-Tissue Expression (GTEx) project, with a sample size of around 300, around 40% of tested genes were identified as eGenes (Figure 1C of a GTEx publication [16]). The relatively low power of our study may be due to the higher level of heterogeneity of tumor samples compared with the normal tissue samples used by the GTEx project.
3.2.3. Assessing Copy Number Effects
Within tumor samples, somatic copy number alterations (SCNAs) are pervasive and they are often strongly associated with gene expression variation [17]. Currently, the pTReC and pTReCASE methods do not account for the effects of SCNAs. Future extension to account for SCNAs is warranted to improve the power of eQTL mapping. However, we do not expect dramatic power improvement because gene expression profiles have high similarity before and after accounting for SCNAs (Supplementary Figure S8). In addition, ignoring SCNAs in eQTL analysis will not lead to many false positive eQTLs because, as shown in the following paragraphs, the genotypes of the identified eQTLs have no or very weak correlations with SCNAs.
Starting with Affymetrix 6.0 array data of both tumor and paired normal samples, we ran the ASCAT pipeline to call copy numbers, as well as tumor purity and ploidy (genome-wide average of copy number) for each tumor sample (https://github.com/Crick-CancerGenomics/ascat/blob/master/ASCAT/R/ascat.R) [18]. Consistent with the findings from previous studies [19], many tumor samples have genome-wide duplication (i.e., with ploidy around 4). Genome-wide duplication should not affect gene expression measurements because gene expression is measured as a relative quantity: the number of RNA-seq reads mapped to a gene versus the total number of RNA-seq reads per sample. In other words, if the expression of all genes is doubled due to genome-wide duplication, the gene expression measurements by RNA-seq remain the same. Therefore, we examined the extent of copy number alteration by Dij = Cij − Ni, where Cij is the total copy number of gene j in sample i, and Ni is the ploidy of the i-th sample.
Because of the limited accuracy in estimating exact copy number, and to improve the robustness of the association analysis, we simply quantified copy number changes as deletion, copy number neutral, and amplification. Specifically, we quantified copy number changes by a variable Gij, which equals −1, 0, or 1 if Dij < −0.5, |Dij| ≤ 0.5, or Dij > 0.5, respectively. Our results are not sensitive to the cutoff choice of 0.5. This cutoff is intuitively easy to interpret. For example, if ploidy is 2, this cutoff means that we call a copy number gain or loss if the estimated copy number is > 2.5 or < 1.5, respectively. Almost all the genes are affected by SCNA events in 30% or more samples (Supplementary Figure S9A). As expected, there are strong positive correlations between the copy number measurement Gij and gene expression across all genes (Supplementary Figure S9B).
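The three-level coding of copy number changes can be written as a small function. The sketch below follows the definitions in the text (Dij = Cij − Ni and the 0.5 cutoff). The function name `copy_number_state` and the example values are hypothetical and are not output from ASCAT.

```python
def copy_number_state(total_copy_number, ploidy, cutoff=0.5):
    """Return G = -1 (deletion), 0 (neutral), or 1 (amplification).

    D = total_copy_number - ploidy is the copy number relative to the
    sample's genome-wide average, as defined in the text.
    """
    d = total_copy_number - ploidy
    if d < -cutoff:
        return -1
    if d > cutoff:
        return 1
    return 0


# Diploid sample (ploidy 2): gain is called above 2.5 copies, loss below 1.5.
states_diploid = [copy_number_state(c, 2.0) for c in (1.0, 2.0, 2.5, 3.0)]

# Sample with genome-wide duplication (ploidy 4): four copies is neutral,
# because copy number is judged relative to the sample's own ploidy.
states_tetraploid = [copy_number_state(c, 4.0) for c in (2.0, 4.0, 6.0)]
```

Measuring copy number relative to ploidy mirrors the argument above: a gene with four copies in a genome-duplicated sample is not counted as amplified.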
Finally, for each of the 1,245 genes with at least one eQTL by pTReCASE at the 5 × 10⁻⁶ level, we selected the most significant eQTL SNP per gene and assessed its correlation with the copy number of the corresponding gene. The distribution of these correlations across the 1,245 genes matches very well with the expected null distribution when these correlations are all 0 (Figure 5). Therefore, there is little correlation between eQTL genotype and copy number changes, and thus the results of pTReC and pTReCASE are unlikely to suffer from false positives due to SCNAs.
Fig. 5.
The distribution of the correlation between gene expression (measured by Yij / di) and copy number (measured by Gij). Yij denotes the number of RNA-seq reads mapped to the j-th gene in the i-th sample, and di is the read-depth measurement for the i-th sample. The red line indicates the expected null distribution when the correlation is 0: Normal .
For a specific gene and SNP, if correcting for SCNAs is desired, one can use a linear regression approach to remove the SCNA effect from total expression before eQTL mapping. However, this approach may create a discrepancy between total expression and allele-specific expression, and in such cases our method will use only the total expression data.
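One possible form of such a regression-based correction is sketched below. It assumes that expression has already been transformed to a roughly linear scale (for example, log of depth-normalized counts) and that copy number is coded as the three-level variable Gij. Both choices, the function name `remove_scna_effect`, and the toy values are assumptions made for illustration and are not part of pTReCASE.

```python
def remove_scna_effect(expression, copy_state):
    """Regress expression on copy number state and return adjusted values.

    The fitted copy number effect is subtracted while the overall mean of
    the expression values is preserved.
    """
    n = len(expression)
    mean_x = sum(copy_state) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in copy_state)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(copy_state, expression))
    # If copy number does not vary across samples, there is nothing to remove.
    slope = sxy / sxx if sxx > 0 else 0.0
    return [y - slope * (x - mean_x) for x, y in zip(copy_state, expression)]


# Toy data for one gene (made-up values): log-scale expression and G in {-1, 0, 1}.
copy_state = [-1, 0, 0, 1, 1]
log_expression = [1.0, 2.1, 1.9, 3.0, 3.2]
adjusted = remove_scna_effect(log_expression, copy_state)
```

After adjustment, the sample covariance between the adjusted expression and the copy number state is zero up to floating-point error. The adjusted values would then replace total expression in eQTL mapping, with the caveat stated above about consistency with ASE.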
4. Discussion
Due to the contamination of tumor samples with infiltrating normal cells, the identification of eQTLs within tumor tissues poses several challenges. First and foremost, one needs to separately estimate the eQTL effects in tumor and normal cells. Second, while total gene expression has been widely used for transcriptome studies, it is important to leverage the additional information provided by allele-specific expression that can be effectively derived using RNA-seq data. We have developed a statistical model and software package, pTReCASE, to address these issues. The desirable performance of pTReCASE has been validated using simulations and real data analysis. In contrast, a naïve approach for eQTL mapping that ignores tumor purity may lead to a large fraction of false positives.
The statistical model utilized by pTReCASE involves two assumptions: (1) expression in the tumor sample can be decomposed into two components, expression from tumor cells and expression from normal cells; and (2) the eQTL effect is additive, rather than dominant or recessive. In fact, tumor cells are not homogeneous: they can be divided into different subclones, the so-called intra-tumor heterogeneity. Nevertheless, assumption (1) allows pTReCASE to identify average eQTL effects in the tumor cells. Further refinement to subclone-specific eQTL effects is very challenging, if not infeasible, because subclones are rarely shared across cancer patients. With regard to assumption (2), the additive structure used by pTReCASE is a natural consequence of cis-acting regulation. Should dominant or recessive relationships exist, they are unlikely to result from cis-acting regulation, and thus one should not incorporate ASE information in the model. The pTReC model can be modified to capture dominant and recessive relationships.
Within the currently established framework for pTReC(ASE), there are three additional avenues for further development and research. The first is to improve the computational efficiency of our software package. With the current implementation, genome-wide local eQTL mapping takes on the order of thousands of CPU hours. This is easily done on a moderately sized computing cluster, but is not computationally feasible on a single computer. High computational costs also prevent us from using permutations to assess the significance of eQTL results for each gene. Thus, we recommend multiple testing correction by Bonferroni correction based on the total number of SNPs or the number of independent SNPs based on the correlation structure of the genotype data [20], or by the Benjamini-Hochberg FDR control procedure [21].
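The two correction strategies mentioned above can be illustrated with a short sketch. The functions below implement the standard Bonferroni threshold and the Benjamini-Hochberg step-up procedure. The function names and the p-values are hypothetical, and the number of tests passed to the Bonferroni helper could be either the total number of SNPs or an estimate of the number of independent SNPs.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value threshold that controls family-wise error at alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking hypotheses rejected by the BH procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= k * fdr / m.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / m:
            max_k = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below max_k.
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


# Toy p-values (made-up values for illustration only).
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
bh_rejected = benjamini_hochberg(p_values, fdr=0.05)
bonf_rejected = [p < bonferroni_threshold(0.05, len(p_values)) for p in p_values]
```

With these toy values, the Benjamini-Hochberg procedure rejects the two smallest p-values, while the Bonferroni threshold of 0.005 rejects only the smallest.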
We have assumed that the haplotypes connecting candidate eQTLs and the gene of interest are known. In practice, such haplotypes are imputed or phased using statistical methods. Phasing is usually accurate within short genetic distances around the gene of interest. However, if we would like to consider potential eQTLs further from the gene, there is a possibility of phasing error. The second avenue for improving the posited model is to allow for uncertainty in the haplotype phasing by following the approach of Hu et al. [22].
In this paper, we focus on germline genetic variants as potential eQTLs. However, somatic mutations/alterations such as SCNAs, DNA methylation, or somatic point mutations (single nucleotide variants or indels) may also affect gene expression. Among all of these factors, SCNAs are likely to have the largest impact on gene expression variation. We have discussed the potential consequences of ignoring SCNAs in section 3.2.3. A recent paper [17] has shown that the associations between gene expression and DNA methylation in tumor samples are often due to confounding by tumor purity, and a new method has been proposed to correct for such confounding. After such correction, the associations between DNA methylation and gene expression are not strong for most genes. In addition, given copy number, DNA methylation is often conditionally independent of gene expression. Therefore, we expect that the impact of DNA methylation on gene expression is relatively minor. We have also illustrated that, for the examples shown in Figure 2, the associations between germline SNPs and gene expression are similar before and after conditioning on SCNAs or DNA methylation (Supplementary Figures S10–S11).
Association studies of somatic point mutations merit new method development because most somatic point mutations are rare or even private in the cancer patient population, and thus a simple mutation-by-mutation or gene-by-gene association analysis may have limited power [23, 24, 25]. For example, in a pan-cancer study of 14 cancer types, Fredriksson et al. [23] identified only a few somatic point mutations as eQTLs. Another study, which used a more sophisticated model to borrow information across genes, identified somatic mutations as local eQTLs in 65 genes across 12 cancer types [24]. An additional challenge in studying somatic mutation associations is intra-tumor heterogeneity. Even the estimation of intra-tumor heterogeneity is a very challenging task with one tumor sample per subject [26, 27].
It is desirable to systematically study the genetic basis of gene expression in tumor samples using multiple genetic factors including germline SNPs, SCNAs, somatic DNA methylation variation, and even somatic point mutations while accounting for intra-tumor heterogeneity. Such explorations warrant a series of future studies, for example, to study the uncertainty of somatic mutation calling or intra-tumor heterogeneity inference, and assess how such estimation uncertainty affects association analysis.
Acknowledgments
This work was partially supported by NIH grants R01 GM105785, R01 GM07335, and Cancer Genomics Training Grant. The authors wish to thank the Associate Editor and two reviewers for their constructive comments and suggestions.
Footnotes
SUPPLEMENTARY MATERIAL
Supplement to “Mapping Tumor-Specific eQTLs in Impure Tumor Samples”: Supplementary document containing RNA-seq and genotype array processing information, mathematical details for the optimization of the pTReC and pTReCASE models, and the derivation of the Cis-Trans score test. (PDF)
pTReCASE: Open source R package pTReCASE containing code to perform the pTReCASE analysis presented in the simulation studies and TCGA data examination. (GNU zipped tar file). This R package will also be posted at GitHub: https://github.com/Sun-lab/.
Contributor Information
Douglas R. Wilson, Doug R. Wilson is a graduate student, Department of Biostatistics, UNC Chapel Hill, NC 27599.
Joseph G. Ibrahim, Joseph G. Ibrahim is Alumni Distinguished Professor of Biostatistics, Department of Biostatistics, UNC Chapel Hill, NC 27599.
Wei Sun, Wei Sun is an Associate Member in Biostatistics Program at Fred Hutchinson Cancer Research Center.
References
- [1] Cookson W, Liang L, Abecasis G, Moffatt M, and Lathrop M, “Mapping complex disease traits with global gene expression,” Nature Reviews Genetics, vol. 10, no. 3, pp. 184–194, 2009.
- [2] Rockman M and Kruglyak L, “Genetics of global gene expression,” Nature Reviews Genetics, vol. 7, no. 11, pp. 862–872, 2006.
- [3] Sun W and Hu Y, “eQTL mapping using RNA-seq data,” Statistics in biosciences, vol. 5, no. 1, pp. 198–219, 2013.
- [4] Loo LWM, Cheng I, Tiirikainen M, Lum-Jones A, Seifried A, Dunklee LM, Church JM, Gryfe R, Weisenberger DJ, Haile RW, et al., “cis-expression qtl analysis of established colorectal cancer risk variants in colon tumors and adjacent normal tissue,” PLoS ONE, vol. 7, no. 2, 2012.
- [5] Grisanzio C, Werner L, Takeda D, Awoyemi BC, Pomerantz MM, Yamada H, Sooriakumaran P, Robinson BD, Leung R, Schinzel AC, et al., “Genetic and functional analyses implicate the nudt11, hnf1b, and slc22a3 genes in prostate cancer pathogenesis,” Proceedings of the National Academy of Sciences, vol. 109, no. 28, pp. 11252–11257, 2012.
- [6] Chen Q-R, Hu Y, Yan C, Buetow K, and Meerzaman D, “Systematic genetic analysis identifies cis-eQTL target genes associated with glioblastoma patient survival,” PLoS ONE, vol. 9, no. 8, 2014.
- [7] Li Q, Seo J-H, Stranger B, McKenna A, Pe’Er I, LaFramboise T, Brown M, Tyekucheva S, and Freedman ML, “Integrative eQTL-based analyses reveal candidate causal genes and loci across five tumor types,” Cell, vol. 152, no. 3, pp. 633–641, 2013.
- [8] Li Q, Stram A, Chen C, Kar S, Gayther S, Pharoah P, Haiman C, Stranger B, Kraft P, Freedman ML, et al., “Expression qtl-based analyses reveal candidate causal genes and loci across five tumor types,” Human Molecular Genetics, vol. 23, pp. 5294–5302, 6 2014.
- [9] Westra H-J, Arends D, Esko T, Peters MJ, Schurmann C, Schramm K, Kettunen J, Yaghootkar H, Fairfax BP, Andiappan AK, et al., “Cell specific eQTL analysis without sorting cells,” PLoS Genetics, vol. 11, 5 2015.
- [10] Sun W, “A statistical framework for eQTL mapping using RNA-seq data,” Biometrics, vol. 68, pp. 1–11, 12 2011.
- [11] Carter SL, Cibulskis K, Helman E, McKenna A, Shen H, Zack T, Laird PW, Onofrio RC, Winckler W, Weir BA, et al., “Absolute quantification of somatic DNA alterations in human cancer,” Nature biotechnology, vol. 30, no. 5, pp. 413–421, 2012.
- [12] Baquet CR, Mishra SI, Commiskey P, Ellison GL, and DeShields M, “Breast cancer epidemiology in blacks and whites: disparities in incidence, mortality, survival rates and histology,” Journal of the National Medical Association, vol. 100, no. 5, pp. 480–489, 2008.
- [13] Li Y, Willer CJ, Ding J, Scheet P, and Abecasis GR, “MaCH: using sequence and genotype data to estimate haplotypes and unobserved genotypes,” Genetic epidemiology, vol. 34, no. 8, pp. 816–834, 2010.
- [14] Shabalin AA, “Matrix eQTL: ultra fast eQTL analysis via large matrix operations,” Bioinformatics, vol. 28, no. 10, pp. 1353–1358, 2012.
- [15] Michailidou K, Lindström S, Dennis J, Beesley J, Hui S, Kar S, Lemaçon A, Soucy P, Glubb D, Rostamianfar A, et al., “Association analysis identifies 65 new breast cancer risk loci,” Nature, vol. 551, no. 7678, p. 92, 2017.
- [16] Consortium G et al., “Genetic effects on gene expression across human tissues,” Nature, vol. 550, no. 7675, p. 204, 2017.
- [17] Sun W, Bunn P, Jin C, Little P, Zhabotynsky V, Perou CM, Hayes DN, Chen M, and Lin D-Y, “The association between copy number aberration, DNA methylation and gene expression in tumor samples,” Nucleic acids research, vol. 46, no. 6, pp. 3009–3018, 2018.
- [18] Van Loo P, Nordgard SH, Lingjærde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B, et al., “Allele-specific copy number analysis of tumors,” Proceedings of the National Academy of Sciences, vol. 107, no. 39, pp. 16910–16915, 2010.
- [19] Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, Lawrence MS, Zhang C-Z, Wala J, Mermel CH, et al., “Pan-cancer patterns of somatic copy number alteration,” Nature genetics, vol. 45, no. 10, p. 1134, 2013.
- [20] Gao X, Becker LC, Becker DM, Starmer JD, and Province MA, “Avoiding the high bonferroni penalty in genome-wide association studies,” Genetic Epidemiology, 1 2010.
- [21] Benjamini Y and Hochberg Y, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B Methodological, vol. 57, no. 1, pp. 289–300, 1995.
- [22] Hu Y-J, Sun W, Tzeng J-Y, and Perou CM, “Proper use of allele-specific expression improves statistical power for cis-eQTL mapping with RNA-seq data,” Journal of the American Statistical Association, vol. 110, pp. 962–974, 3 2015.
- [23] Fredriksson NJ, Ny L, Nilsson JA, and Larsson E, “Systematic analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types,” Nature genetics, vol. 46, no. 12, pp. 1258–1263, 2014.
- [24] Ding J, McConechy MK, Horlings HM, Ha G, Chan FC, Funnell T, Mullaly SC, Reimand J, Bashashati A, Bader GD, et al., “Systematic analysis of somatic mutations impacting gene expression in 12 tumour types,” Nature communications, vol. 6, 2015.
- [25] Liu Y, He Q, and Sun W, “Association analysis using somatic mutations,” PLoS genetics, vol. 14, no. 11, p. e1007746, 2018.
- [26] Loo PV, Nordgard SH, Lingjaerde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B, et al., “Allele-specific copy number analysis of tumors,” Proceedings of the National Academy of Sciences, vol. 107, no. 39, pp. 16910–16915, 2010.
- [27] Shen R and Seshan VE, “FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing,” Nucleic acids research, vol. 44, no. 16, pp. e131–e131, 2016.