Abstract
Motivation: RNA sequencing enables allele-specific expression (ASE) studies that complement standard genotype expression studies for common variants and, importantly, also allow measuring the regulatory impact of rare variants. The Genotype-Tissue Expression (GTEx) project is collecting RNA-seq data on multiple tissues of a same set of individuals and novel methods are required for the analysis of these data.
Results: We present a statistical method to compare different patterns of ASE across tissues and to classify genetic variants according to their impact on the tissue-wide expression profile. We focus on strong ASE effects that we are expecting to see for protein-truncating variants, but our method can also be adjusted for other types of ASE effects. We illustrate the method with a real data example on a tissue-wide expression profile of a variant causal for lipoid proteinosis, and with a simulation study to assess our method more generally.
Availability and implementation: http://www.well.ox.ac.uk/~rivas/mamba/. R-sources and data examples http://www.iki.fi/mpirinen/
Contact: matti.pirinen@helsinki.fi or rivas@well.ox.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Advancements in sequencing technologies are enabling unprecedented possibilities to study the transcriptome (Montgomery et al., 2010; Pickrell et al., 2010). In RNA-sequencing studies, it is possible to distinguish between transcripts from the two haplotypes of an individual using heterozygous sites. This approach, called allele-specific expression (ASE) analysis, allows an alternative way to quantify cis-regulatory variation, complementary to eQTL analysis. In addition, ASE has been utilized to analyze transcriptome effects of nonsense-mediated decay (NMD) triggered by predicted loss-of-function variants (Lappalainen et al., 2013; MacArthur et al., 2012; Montgomery et al., 2011).
The Genotype-Tissue Expression (GTEx) project is establishing a resource database and tissue bank for the scientific community to study the relationship between genetic variation and gene expression in human tissues (GTEx-Consortium, 2013), with an aim to interpret GWAS findings for translational research. The project is analyzing gene expression from various perspectives, including transcript structure, expression quantity and diversity, eQTLs, and ASE differences. Furthermore, as medical genetics pursues exploration of rare variants, insights gained from the study of DNA and RNA sequencing data in the GTEx project will become important for functional interpretation of rare variants (Rivas et al., 2011, 2013; Zuk et al., 2014).
To date, some methods have been proposed for the analysis of ASE data, but these methods largely focus on a single tissue (Degner et al., 2009; Ronald et al., 2005; Sun, 2012; Zhang et al., 2009) although some could be applied also to multiple tissues (Skelly et al., 2011). The multi-tissue aspect is important for interpreting disease association findings (Grundberg et al., 2012; McCarroll et al., 2008) since eQTL and early ASE studies suggest widespread tissue-specific effects of cis-regulatory variants (Dimas et al., 2009; Gutierrez-Arcelus et al., 2013). Currently, more sophisticated methods for cross-tissue eQTL analysis are emerging (Flutre et al., 2013). However, eQTL analysis requires large sample sizes, while ASE analyses can be conducted with significantly smaller datasets.
In this article, we present novel statistical methods for analyzing ASE patterns from RNA-seq data across multiple tissues. Our main motivation is phenomena such as NMD that can lead to a substantial imbalance of transcripts from the two alleles (Fig. S37 in Lappalainen et al., 2013) and can vary considerably between cell types (Linde et al., 2007). From a statistical point of view, an advantage of prominent ASE is that it can be detected even with a single individual and with as small a read count as 10. (See Middle panel of Fig. 4 for an example dataset.)
We address three main questions:
In which tissues does a heterozygous site show ASE?
Which tissues show similar ASE effects at the site studied?
What proportion of a certain class of variants, such as protein truncating variants (PTV), show ASE in all tissues, only in some tissues, or in no tissues?
For the first question, a standard frequentist version of the binomial test is commonly used. However, an interpretation of such a test depends on factors like the read count and simultaneous analysis of multiple tissues is very challenging. Hence, we require that our statistical framework allow a simultaneous comparison between several cross-tissue models for observed data. For example, we want to weigh relative support of the model where all tissues show ASE to the models where only a single tissue shows ASE and to the null model where none of the tissues show ASE. For this purpose, we adopt a Bayesian model comparison framework. Among its favorable properties are natural ways to compare models with differing number of parameters and to fully account for the amount of available data when evaluating relative support of the models.
2 Methods
2.1 Grouped tissue model
We consider RNA-seq read counts overlapping a particular genomic position from multiple tissues of one individual who is heterozygous at that position. (See Section 4 and Supplementary Information for extensions to multiple sites and individuals.) For tissue , let and be the number of reads supporting the reference and non-reference allele, respectively, and let . We classify the tissues into three groups: no ASE (), moderate ASE () and strong ASE () and denote the group of tissue s by . Motivation for introducing a separate group for strong ASE is to identify the variants with the most extreme ASE effects. Such effects require careful checking for possible biases in the data (Degner et al., 2009; Panousis et al., 2014) and have been seen, for example, among PTVs that exhibit a whole spectrum from none to complete ASE (Fig. S37 in Lappalainen et al., 2013; Kukurba et al., 2014). For each group , we denote the proportion of the transcripts with the reference allele by . We use a binomial sampling model for the data conditional on the group indicators:
(1) |
Possible expression states of the tissues differ in the prior assumptions about parameters . We use the following priors (Fig. 1) to describe different groups:
Under no ASE model both alleles are expressed (almost) equally and hence . As in Skelly et al. (2011), our model allows small deviations from the exact point value of 0.5 to be robust against some technical measurement bias (Fig. S28 in Lappalainen et al., 2013) as well as very small ASE effects that are not a main focus in this study. The parameters for Beta distributions have been chosen in a way that clearly separates the three groups from each other (Fig. 1) and thus gives an informative framework to classify the tissues into three groups In supplementary information, we further discuss the prior specification, extensions to truncated priors and independence across tissues and a restriction to only one ASE group. Our implementation allows user to easily modify the prior parameters as well as use one-sided versions of the prior distributions instead of the two-sided versions used in our simulation experiments. For example, when studying NMD, we may require that the reference allele is expressed more strongly, and hence consider only the one-sided ASE states. Conversely, the two-sided ASE states are tailored for the situations where we do not want to make an assumption about which allele might be dominating. This is useful, for example, when studying imprinting.
Configurations
For fixed prior distributions, the model space consists of configurations represented by vector where each tissue-specific indicator
We partition the space of configurations into five ASE states:
where the states represent configurations for no ASE (NOASE), moderate ASE (MODASE), strong ASE (SNGASE), heterogeneity with at least one tissue showing no ASE (HET0) and heterogeneity with all tissues showing some ASE (HET1). We also consider a tissue-specific sub state of the heterogeneity states:
To do probabilistic comparison between the states we need to define prior probabilities for each state. We do this by defining prior for each configuration in a way which depends on its distance from homogeneity. We define a distance for each configuration as the smallest number of tissues whose group indicators need to be changed in order to turn into a homogeneous configuration (either NOASE, MODASE or SNGASE). Formally, , where and are the number of tissues that assigns to and , respectively. In particular, the configurations with d = 0 are the three homogeneous configurations and the configurations with d = 1 form the set TIS_SPE.
We specify the total prior probability for each possible value of and then distribute it equally among the configurations with the same distance. This prior allows us to easily implement the idea that among the vast space of configurations we favor a priori the parsimonious ones, i.e. those where many tissues are similar. Our prior distribution extends the one recently used for multi-tissue eQTL setting (Flutre et al., 2013) to the case of more than two expression states. In the results reported in this work, we have set a prior mass of 0.75 to d = 0 (i.e. 0.25 for each of NOASE, MODASE and SNGASE) and the remaining 0.25 has been divided equally among all possible values of . This choice was made because it gives an equal prior weight for the four main patterns of ASE: three homogeneous states and general heterogeneity. If a previous estimate of configuration probabilities were available, it could be utilized instead.
2.2 Multi-locus grouped tissue model
For settings where many variants are available for a joint analysis, we extend grouped tissue model (GTM) to a hierarchical multi-locus grouped tissue model (GTM*) that learns from the data the proportion of variants belonging to each of the five states, , and thus avoids the need to fix prior probabilities of the states. Our default prior for the proportions is
We describe the technical details of GTM* and compare it to GTM in Supplementary Information.
2.3 Computation
We use a standard Gibbs sampler to explore the posterior distribution of the configuration indicators under GTM (see Supplementary Information for details).
The basic building block for model comparison is the Bayes factor between configuration and the NOASE state
(2) |
where . Evaluation of can be done analytically by using Beta-binomial likelihood separately for the numerator and the denominator. Thus we can quickly evaluate the Bayes factor for any particular configuration and compare even hundreds of configurations. However, when the number of tissues is large (say T > 10), the number of possible configurations grows too large to be exhaustively evaluated (analogous to a problem with eQTLs in Flutre et al., 2013). This becomes problematic in particular when assessing heterogeneity (either HET0 or HET1), which in principle would require a consideration of all those groupings that assume differences between some tissues. Therefore, we introduce an approximation that avoids enumerating all groupings when assessing heterogeneity, by focusing on only configurations that are strongly supported by the data. Thus we assume that the configurations with d > 1 that have not been visited by the Gibbs sampler and are not among a few dozen top heterogeneity models defined by tissue-specific group membership probabilities, have negligible marginal likelihood and can be ignored. This leads to a lower bound for the marginal likelihoods of the heterogeneity states. In practice, a comparison to the exact values for cases where exact values can be calculated () shows that the lower bound is actually a good estimate in most cases (see Section 3).
By combining the Bayes factors with the prior probabilities of the states we have the posterior probabilities of the states.
2.4 Q-statistic for heterogeneity
We compare GTM to a standard heterogeneity measure (Cochran, 1950)
where and are the empirical proportion of the reference allele at tissue s and across all tissues, respectively. (If , we set and to avoid numerical problems.) The idea is that this measure increases with the heterogeneity of the tissue-specific θs parameters, and can thus be used as a measure of heterogeneity. The Q-statistic is typically used for statistical testing for heterogeneity rather than for quantifying the amount of heterogeneity, and is known to have low power when sample size is small (Higgins et al., 2003). An empirical P-value of the Q-statistic is estimated by simulating datasets where the number of tissues and reads match with the observed data and where all tissues have the same value for .
3 Results
We first assess performance of GTM on simulated data and compare it with a standard heterogeneity measure. Second, we illustrate GTM on a real data example taken from the GTEx project. Results of GTM* are presented in Supplementary Information.
Data simulation
For three values for the number of tissues (T = 5, 10, 30) and four values for the number of reads per each tissue (, for all s), we simulated 1000 datasets for nine scenarios given in Table 1. The reference allele read count for each tissue s was sampled from , where θs is 0.5 for group , 0.75 for group and 0.99 for group .
Table 1.
STATE | ||||
---|---|---|---|---|
1 | T | 0 | 0 | NOASE |
2 | 0 | T | 0 | MODASE |
3 | 0 | 0 | T | SNGASE |
4 | T – 1 | 0 | 1 | HET0 |
5 | 1 | 0 | T – 1 | HET0 |
6 | 0 | T – 1 | 1 | HET1 |
7 | 0 | 1 | T – 1 | HET1 |
8 | 0 | HET0 | ||
9 | 0 | HET1 |
For each of the nine scenarios, the table shows how many tissues (out of total T) belong to each of the three possible groups, and , whose θ parameters are 0.5, 0.75 and 0.99, respectively. The STATE column shows to which of the five combined states each scenario belongs
GTM results
We applied GTM on the simulated datasets with 2000 Gibbs sampler iterations. The results for read counts 10 and 50 are shown in Figure 2 and for the remaining read counts in Supplementary Figure S4. The run time was 8, 16 and 53 s per dataset, for T = 5, 10 and 30, respectively (Intel i7/3.40 GHz).
For the homogeneous scenarios 1, 2 and 3, the detectability of the true state increases with the number of tissues, and in general is already high with only 5 tissues and 10 reads per each tissue. In particular, when the true state is SNGASE (scenario 3), there is no noticeable uncertainty about the correct state in any of the datasets, while for NOASE (scenario 1) and MODASE (scenario 2) the data are not equally decisive. This is related to the fact that the variance in the read counts is larger for (NOASE) and (MODASE) than it is for the extreme value of (SNGASE).
Our scenarios 4, 5, 6 and 7 consider the smallest possible amount of heterogeneity: a single tissue is different from the others. In these scenarios, we see an opposite trend from the homogeneous ones: the true heterogeneous state becomes harder to distinguish from the closest homogeneous one as the number of tissues increases. This is because, according to our prior, a configuration where a single tissue differs from the others becomes less probable as the number of other tissues increases. Information about the true state increases with the number of reads per tissue, and the overall heterogeneity probability (HET0 + HET1) is high for all heterogeneous scenarios for read count 50.
The scenarios 8 and 9 represent stronger heterogeneity where about half of the tissues belong to one group and the remaining half to another. When the true underlying groups are and (scenario 8), then 10 reads is not yet enough to clearly separate the true pattern from the homogeneous states (NOASE and MODASE), while 50 reads is enough for this purpose. In the scenario 9, where tissues are divided between and , the overall heterogeneity probability is always fairly large, but it is difficult to exclude the possibility that at least one tissue belongs to , especially when many tissues are available. As a consequence, the HET0 state gets a considerable probability even though the true underlying state is HET1. To distinguish between the two heterogeneity states in this scenario requires read counts larger than 50 (Supplementary Fig. S4).
Taken together, the results in Figure 2 and Supplementary Figure S4 show that the model is correctly able to distinguish between all five combined states but an amount of information required for accurate classification varies between scenarios.
Comparisons with Q-statistic
To show differences between our GTM heterogeneity probability and Q-statistic in detecting heterogeneity we show ROC curves for two settings (Fig. 3). In the first one, we use the 1000 datasets from simulation scenario 1 to represent a homogeneous state and the 1000 datasets from scenario 4 to represent a heterogeneous state, with T = 30 tissues and n = 10 reads per tissue. The ROC curve on the left panel of Figure 3 shows how those 2000 datasets are ranked by the posterior probability of HET0 + HET1 state from GTM and by the empirical P-value of Q. The right panel in Figure 3 shows similar results when comparing the homogeneous scenario 3 to the heterogeneous scenario 7. In both settings GTM is slightly better in detecting heterogeneity than the Q-statistic: ROC curve of GTM is consistently above that of Q. We discuss the difference between the two approaches in Section 4.
Accuracy of approximation
To assess how accurate our approximation for marginal likelihoods of the heterogeneity states is, we compared the posterior probabilities of the five states from GTM with the exact values on all 9000 datasets simulated with T = 10 tissues and n = 10 reads per tissue. As a measure of accuracy we use the total variation (TV) distance, which for discrete distributions and is defined as . TV describes how much of the probability mass needs to be relocated in order to turn the first distribution into the other, and also gives an upper bound for the maximal difference in probability that the two distributions assign to any one state.
We observed (Table 2) that 95% of the datasets had , and 0.5% had , with maximal TV being 0.168. As TV less than 0.10 is unlikely to change our inference on the underlying state, our approximation works well for a great majority of the datasets tested. However, there are large differences in the accuracy between the scenarios with scenario 8 (5 tissues in and other 5 in ) showing the strongest discrepancy. An explanation for this is that with only 10 reads per tissue the scenario 8 assigns non-negligible posterior probability to so many of the configurations showing heterogeneity between and groups that some of them are missed during our default number of 2000 Gibbs sampler iterations.
Table 2.
SCENAR | TV <0.01 | TV <0.05 | TV <0.10 | |
---|---|---|---|---|
1 | 0.462 | 0.972 | 1 | 0.092 |
2 | 0.452 | 0.959 | 0.995 | 0.129 |
3 | 1 | 1 | 1 | 0.006 |
4 | 0.160 | 0.986 | 0.999 | 0.105 |
5 | 0.997 | 1 | 1 | 0.011 |
6 | 0.327 | 0.973 | 1 | 0.070 |
7 | 0.994 | 1 | 1 | 0.012 |
8 | 0.081 | 0.663 | 0.962 | 0.168 |
9 | 0.565 | 1 | 1 | 0.025 |
For each of the nine scenarios, Table shows the proportion of simulated datasets (out of total 1000 with T = 10 tissues and n = 10 reads per tissue) for which TV distance between GTM estimates and the exact posterior distribution of the five states is less than 0.01, 0.05 or 0.10. The column shows the maximum TV distance observed
In Section 4, we propose an additional heterogeneity measure to complement our approximation for the posterior probability of heterogeneity states, especially for datasets with large number of tissues.
3.1 A PTV in ECM1
We consider read count data on SNV rs121909115 in chromosome 1, whose non-reference allele (T) introduces a stop codon in some transcripts of the gene ECM1. This is an example of a protein-truncating mutation that we expect to experience NMD leading to a reduction in the transcripts from the non-reference allele. However, the strength of NMD and its consistency across different tissue types is unknown. In the currently available GTEx data, we have one individual who is heterozygous at this SNV and Figure 4 shows the data and GTM results across seven tissues.
The results show that, as expected, most tissue types show ASE where the non-reference allele has lower read counts than the reference allele. In addition, there is evidence of heterogeneity between the tissue () and that heterogeneity could result from a tissue-specific effect where the skin tissue escapes ASE (). None of the tissues shows strong evidence for complete ASE (group ), but rather the other six tissue types (apart from skin) are likely to belong to either the moderate ASE group , or to no ASE group. In particular, the nerve and adipose samples cannot be assigned to any one of our groups with high confidence, which can be seen from their membership probabilities being evenly split between the groups and .
The non-reference allele (T) at the SNV rs121909115 is one of the known protein truncating mutations in ECM1 that in homozygous carriers lead to lipoid proteinosis, also known as Urbach-Wiethe disease (OMIM 247100; Hamada et al., 2003). The symptoms of this disease include scarring and infiltration of skin and mucosae (Hamada, 2002). Therefore, it is an interesting observation that in our data the observed allelic ratio of the nonsense mutation is higher in the skin tissue than in several other tissue types.
The isoform view of the GTEx portal (2014-01-17, dbGaP phs000424.v4.p1) shows four isoforms for ECM1 of which only one (ENST00000470432) does not annotate rs121909115 as stop-gain variant. However, the relative abundance of this transcript does not seem particularly high in skin tissue compared with the other tissues. Another observation is that one of the remaining three transcripts (ENST00000346569) is expressed in only a few tissues. These tissues include mucous membrane and skin tissue where symptoms of lipoid proteinosis are seen.
4 Discussion
We have introduced a statistical framework to assess similarities and differences in ASE between tissue types. A motivation for our work comes from ongoing RNA-sequencing projects such as the GTEx project (GTEx-Consortium, 2013) that generates data on up to 30 tissue types per individual with read counts per tissue and per site starting from around 10.
We have chosen a Bayesian approach because it leads to a consistent probabilistic quantification of the support that the data provide for each of the competing models. We see this as an advantage over a series of separate analyses, such as, for example, would be needed by an approach that first assessed heterogeneity using the Q-statistic and if no (statistically significant) heterogeneity was observed, would further classify the dataset into one of the homogeneous states. Previously, two excellent studies on Bayesian models for expression data have been published by Skelly et al. (2011) and Flutre et al. (2013).
Skelly et al. (2011) consider a three-stage hierarchical model for allele read counts from genome-wide RNA-seq data of one individual. They observe allele counts at each heterozygous SNV (1st level) which are assigned to genes (2nd level) whose common properties are controlled by genome-wide parameters (3rd level). Their model would be directly applicable also to multi-tissue RNA-seq data where tissue-specific allele counts replace SNV-specific allele counts and sites replace genes at the second level of the model. Suppose, for example, that we had multi-tissue RNA-seq data on a set of individuals who are heterozygous for at least one PTV. The model of Skelly et al. (2011) would produce posterior distribution on the global proportion of the PTVs that show ASE in at least one tissue type, as well as variant-specific posterior probability of ASE. However, it would not give as refined characterization of ASE at each PTV as our hierarchical model GTM* (Supplementary Information) and it would not do inference on tissue-specific group indicator parameters (γs) that allow probabilistic model comparison between different patterns of ASE.
Flutre et al. (2013) developed a method for a joint eQTL analysis across multiple tissues. They work with micro-array expression data using a linear model that is not directly applicable to the datasets we have in mind: read count data from several tissues of a single individual. They use tissue-specific binary indicator parameters that tell whether a variant is an eQTL in each tissue and introduce three ways to assign prior probabilities to different configurations of the indicators. Their ‘lite’ model gives positive prior on only those configurations whose distance from homogeneity is at most 1. Their ‘BMA’ model is similar to what we have used in that the prior of a configuration depends only on its distance from homogeneity and that the total prior probability corresponding to each value of distance is the same. Finally, their most complex ‘BMA-HM’ model treats the weights of the configurations as random variables and estimates them across genes using a hierarchical model. Similar hierarchical model, that would learn joint ASE patterns between tissues by a simultaneous analysis of a set of genetic variants (e.g. PTVs) is also an important topic for further development of our hierarchical model GTM*.
Even though hierarchical models, such as our GTM* and those of Skelly et al. (2011) and Flutre et al. (2013), are conceptually attractive, we believe that GTM’s ability to analyze one PTV at a time has its merits from a practical point of view: it is quick to run, easy to understand and requires read count data on only one variant.
Heterogeneity measures. Our model assumes that all tissues in a same group have the same reference allele read frequency θ. In practice, we expect that our model is robust to some heterogeneity within each group, since the priors for different groups are so clearly separable from each other (Fig. 1 and Supplementary Fig. S1). Our main interest is to assign tissue types into (three) broad categories of ASE, and consequently we call a dataset heterogeneous only if some pair of tissues show fairly strong difference in the non-reference allele frequency. Standard heterogeneity measures, such as Q, ask a slightly different question of whether parameters θs have exactly the same value across the tissues. A frequentist answer to this latter question is given by an empirical P-value of heterogeneity measures Q or estimated under the null hypothesis that all tissues have the same value for θ parameter which is estimated from the data. By this approach, heterogeneity P-values in the ECM1 example of Figure 4 are 0.011 and 0.006 by using Q and , respectively, as a test statistic. While these P-values point to general heterogeneity between the tissue types, our GTM analysis leads to more detailed information considering the type of heterogeneity: we see heterogeneity where at least one tissue type escapes ASE (our HET0 state).
Our heterogeneity probabilities are based on a lower bound of the true marginal likelihoods of the heterogeneity states. We showed that in a large majority of our datasets (with 10 tissues) the approximation is accurate, but the approximation may not always work as well with larger number of tissues. Therefore, in addition to the heterogeneity probabilities, we also compute, for each pair of tissues, a posterior expectation of the distance between them. Here we define the distance between two tissues to be 0 if they belong to a same group and 1 otherwise. Maximum of all pairwise distances gives an indication whether there is heterogeneity between tissues. By further defining the distance between the ASE groups and to be 0, we have a measure for particular kind of heterogeneity (HET0) where at least one tissue belongs to the no ASE group .
4.1 Application of the method
Nonsense-mediated decay
We envisage that a primary application of our method will be in analyzing NMD. PTVs are usually subject to NMD, a cellular mechanism that detects premature termination codons and prevents expression of truncated proteins. Integrated genome and transcriptome sequencing studies in lymphoblastoid cell lines have demonstrated that ASE can be used for testing variants predicted to trigger NMD (Lappalainen et al., 2013; MacArthur et al., 2012; Montgomery et al., 2010, 2011). To test whether the predicted NMD truly happens, we can use the one-sided ASE models rather than the two-sided ones to explicitly require that an ASE signal is present only if the minor allele, and not the major allele, is silenced. We applied the one-sided approach in our example analysis of a PTV in ECM1.
Imprinting
Genomic imprinting is a phenomenon where only an allele inherited from the parent of a particular sex is expressed (Babak et al., 2008). To assess imprinting for one tissue type one could consider a common coding variant of the gene of interest in that tissue across multiple heterozygous individuals. If the gene is imprinted, then in about half of the individuals the allele 1 should be silenced, because we expect that in about half of the individuals the allele 1 at this locus is inherited from the mother and in the other half it originates from the father. We could extend our framework to such a case by allowing each ASE group to be divided into two subgroups representing parental origin of the allele 1. If one of the subgroups had proportion θ of the reads from allele 1 then the corresponding proportion for the other subgroup should be . We could model θ with a Beta-prior as in our GTM model.
Modest ASE effects
An interesting application of ASE is in the study of cis-regulatory variants in LD with coding variants. When regulatory variants have only modest effects (Dimas et al., 2009), we could modify our prior on ASE states to reflect this (see Supplementary Information section S8 and Fig. S6). With this approach we envision that researchers are able to study the effect of cis-regulatory variants on transcription across a broad range of tissues where the number of samples per tissue may be limited. However, compared to strong ASE effects, more modest effects require much larger read counts per tissue and decrease the ratio between biological signal and possible technical noise (Degner et al., 2009).
Multiple individuals
Suppose we have RNA-seq data on same tissue types from several individuals who are heterozygous at a particular variant. We could first assess, for each tissue type, whether the individuals are heterogeneous in their ASE status. If there is no evident heterogeneity, a simple approach is to combine the reads from the same tissue type across the individuals before analyzing the data across the tissues. A more refined model that accounts for possible individual-specific effects that are shared across the tissues requires further work.
5 Conclusion
We have presented GTM and its multi-site extension (GTM*), to (i) classify tissues into three groups at each site according to their ASE patterns and (ii) classify the sites into five combined ASE states according to their tissue-wide ASE profiles. We see major applications of our approach in assessing homogeneity, heterogeneity and tissue specificity across a group of genetic variants such as variants predicted to trigger NMD, variants in genes with evidence of imprinting, or variants in LD with GWAS loci for a particular disease.
As an example, we presented an application of the method to read count data from the GTEx project for one heterozygote carrier of a premature truncating mutation (p.R53X, rs121909115) in the ECM1 gene. For this variant, which in homozygous form is known to cause lipoid proteinosis, we identified heterogeneous gene expression effects across tissues and evidence for a complete escape from NMD in skin tissue.
The identification and characterization of ASE, in particular for PTVs with a putative complete loss of function effect, will provide a better understanding of the biological mechanisms that are involved in transcriptional regulation and will improve our computational models for annotating variants identified in case–control or clinical genome sequencing studies.
Supplementary Material
Acknowledgements
The GTEx RNA-Seq data used for the analyses described in this manuscript were from dbGaP accession number phs000424.v3.p1 and the Whole Exome Sequencing data will be part of release phs00424.v5.p1. The GTEx Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health (commonfund.nih.gov/GTEx). Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI-SAIC-Frederick, Inc. (SAIC-F) subcontracts to the National Disease Research Interchange (10XS170), Roswell Park Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data Analysis, and Coordinating Center (LDACC) was funded through a contract (HHSN268201000029C) to The Broad Institute, Inc. Biorepository operations were funded through an SAIC-F subcontract to Van Andel Institute (10ST1035). Additional data repository and project management were provided by SAIC-F (HHSN261200800001E). The Brain Bank was supported by a supplement to University of Miami grant DA006227. Statistical Methods development grants were made to the University of Geneva (MH090941), the University of Chicago (MH090951 & MH090937), the University of North Carolina - Chapel Hill (MH090936) and to Harvard University (MH090948).
Funding
This work was supported by the Academy of Finland [257654 to M.P.]; Wellcome Trust [095552/Z/11/Z to P.D., 098381 and 090532 to M.I.M.]; National Institutes of Health [R01-MH101814 and MH-090941 to M.I.M.]; Royal Society Wolfson Merit Award to P.D.; and a Clarendon Scholarship, NDM Studentship, and Green Templeton College Award from the University of Oxford to M.R.
Conflict of interest: none declared.
References
- Babak T., et al. (2008) Global survey of genomic imprinting by transcriptome sequencing. Curr. Biol. , 18, 1735–1741. [DOI] [PubMed] [Google Scholar]
- Cochran W.G. (1950) The comparison of percentages in matched samples. Biometrika , 37, 256–266. [PubMed] [Google Scholar]
- Degner J.F., et al. (2009) Effect of read-mapping biases on detecting allele-specific expression from RNA-sequencing data. Bioinformatics , 25, 3207–3212. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dimas A.S., et al. (2009) Common regulatory variation impacts gene expression in a cell type–dependent manner. Science , 325, 1246–1250. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Flutre T., et al. (2013) A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genet. , 9, e1003486. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Grundberg E., et al. (2012) Mapping cis-and trans-regulatory effects across multiple tissues in twins. Nat. Genet. , 44, 1084–1089. [DOI] [PMC free article] [PubMed] [Google Scholar]
- GTEx-Consortium (2013) The genotype-tissue expression (GTEx) project. Nat. Genet. , 45, 580–585. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gutierrez-Arcelus M., et al. (2013) Passive and active DNA methylation and the interplay with genetic variation in gene regulation. eLife, 2:e00523. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hamada T. (2002) Lipoid proteinosis. Clin. Exp. Dermatol. , 27, 624–629. [DOI] [PubMed] [Google Scholar]
- Hamada T., et al. (2003) Extracellular matrix protein 1 gene (ECM1) mutations in lipoid proteinosis and genotype-phenotype correlation. J. Invest. Dermatol. , 120, 34–350. [DOI] [PubMed] [Google Scholar]
- Higgins J.P., et al. (2003) Measuring inconsistency in meta-analyses. BMJ , 327, 557–560. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kukurba K.R., et al. (2014) Allelic expression of deleterious protein-coding variants across human tissues. PLoS Genet. , 10, e1004304. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lappalainen T., et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature , 501, 506–511. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Linde L., et al. (2007) The efficiency of nonsense-mediated mRNA decay is an inherent character and varies among different cells. Eur. J. Hum. Genet. , 15, 1156–1167. [DOI] [PubMed] [Google Scholar]
- MacArthur D.G., et al. (2012) A systematic survey of loss-of-function variants in human protein-coding genes. Science , 335, 823–828. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McCarroll S.A., et al. (2008) Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn’s disease. Nat. Genet. , 40, 1107–1112. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Montgomery S.B., et al. (2010) Transcriptome genetics using second generation sequencing in a Caucasian population. Nature , 464, 773–777. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Montgomery S.B., et al. (2011) Rare and common regulatory variation in population-scale sequenced human genomes. PLoS Genet. , 7, e1002144. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Panousis N.I., et al. (2014) Allelic mapping bias in RNA-sequencing is not a major confounder in eQTL studies. Genome Biol. , 15, 467. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pickrell J.K., et al. (2010) Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature , 464, 768–772. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rivas M.A., et al. (2011) Deep resequencing of GWAS loci identifies independent rare variants associated with inflammatory bowel disease. Nat. Genet. , 43, 1066–1073. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rivas M.A., et al. (2013) Assessing association between protein truncating variants and quantitative traits. Bioinformatics , 29, 2419–2426. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ronald J., et al. (2005) Simultaneous genotyping, gene-expression measurement, and detection of allele-specific expression with oligonucleotide arrays. Genome Res. , 15, 284–291. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Skelly D.A., et al. (2011) A powerful and flexible statistical framework for testing hypotheses of allele-specific gene expression from RNA-seq data. Genome Res. , 21, 1728–1737. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sun W. (2012) A statistical framework for eQTL mapping using RNA-seq data. Biometrics , 68, 1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang K., et al. (2009) Digital RNA allelotyping reveals tissue-specific and allele-specific gene expression in human. Nat. Methods , 6, 613–618. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zuk O., et al. (2014) Searching for missing heritability: Designing rare variant association studies. PNAS , 111, E455–E464. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.