Abstract
Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g. negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders. Here, we present a Poisson mixed model with two random effects terms that account for both independent over-dispersion and sample non-independence. We also develop a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. With simulations, we show that our method properly controls for type I error and is generally more powerful than other widely used approaches, except in small samples (n < 15) with other unfavorable properties (e.g. small effect sizes). We also apply our method to three real datasets that contain related individuals, population stratification or hidden confounders. Our results show that our method increases power in all three datasets compared to other approaches, though the power gain is smallest in the smallest sample (n = 6). Our method is implemented in MACAU, freely available at www.xzlab.org/software.html.
INTRODUCTION
RNA sequencing (RNAseq) has emerged as a powerful tool for transcriptome analysis, thanks to its many advantages over previous microarray techniques (1–3). Compared with microarrays, RNAseq has increased dynamic range, does not rely on a priori-chosen probes, and can thus identify previously unknown transcripts and isoforms. It also yields allele-specific expression estimates and genotype information inside expressed transcripts as a useful by-product (4–7). Because of these desirable features, RNAseq has been widely applied in many areas of genomics and is currently the gold standard method for genome-wide gene expression profiling.
One of the most common analyses of RNAseq data involves identification of differentially expressed (DE) genes. Identifying DE genes that are influenced by predictors of interest—such as disease status, risk factors, environmental covariates or genotype—is an important first step toward understanding the molecular basis of disease susceptibility as well as the genetic and environmental basis of gene expression variation. Progress toward this goal requires statistical methods that can handle the complexities of the increasingly large and structurally complex RNAseq datasets that are now being collected from population and family studies (8,9). Indeed, even in classical treatment-control comparisons, the importance of larger sample sizes for maximizing power and reproducibility is increasingly well appreciated (10,11). However, identifying DE genes from such studies presents several key statistical and computational challenges, including accounting for ambiguously mapped reads (12), modeling uneven distribution of reads inside a transcript (13) and inferring transcript isoforms (14).
A fundamental challenge shared by all DE analyses in RNAseq, though, is accounting for the count nature of the data (3,15,16). In most RNAseq studies, the number of reads mapped to a given gene or isoform (following appropriate data processing and normalization) is often used as a simple and intuitive estimate of its expression level (13,14,17). As a result, RNAseq data display an appreciable dependence between the mean and variance of estimated gene expression levels: highly expressed genes tend to have high read counts and subsequently high between-sample variance, and vice versa (15,18). To account for the count nature of the data and the resulting mean-variance dependence, most statistical methods for DE analysis model RNAseq data using discrete distributions. For example, early studies showed that gene expression variation across technical replicates can be accurately described by a Poisson distribution (19–21). More recent methods also take into account over-dispersion across biological replicates (22,23) by replacing Poisson models with negative binomial models (15,16,24–28) or other related approaches (18,29–32). While non-count based methods are also commonly used (primarily relying on transformation of the count data to more flexible, continuous distributions (33,34)), recent comparisons have highlighted the benefits of modeling RNAseq data using the original counts and accounting for the resulting mean-variance dependence (11,35–37), consistent with observations from many count data analyses in other statistical settings (38). Indeed, accurate modeling of mean-variance dependence is one of the keys to enable powerful DE analysis with RNAseq, especially in the presence of large sequencing depth variation across samples (25,33,39).
A second important feature of many RNAseq datasets, which has been largely overlooked in DE analysis thus far, is that samples often are not independent. Sample non-independence can result from individual relatedness, population stratification or hidden confounding factors. For example, it is well known that gene expression levels are heritable. In humans, the narrow-sense heritability of gene expression levels averages 15–34% in peripheral blood (40–44) and is about 23% in adipose tissue (40), with a maximum heritability in both tissues as high as 90% (40,41). Similarly, in baboons, gene expression levels are about 28% heritable in the peripheral blood (7). Some of these effects are attributable to nearby, putatively cis-acting genetic variants: indeed, recent studies have shown that the expression levels of almost all genes are influenced by cis-eQTLs and/or display allele-specific expression (3,7,45–47). However, the majority of heritability is often explained by distal genetic variants (i.e. trans-QTLs, which account for 63–84% of heritability in humans (40) and baboons (7)). Because gene expression levels are heritable, they will covary with kinship or population structure. Besides kinship or population structure, hidden confounding factors, commonly encountered in sequencing studies (48–51), can also induce similarity in gene expression levels across many genes even when individuals are unrelated (52–56). Failure to account for this gene expression covariance due to sample non-independence could lead to spurious associations or reduced power to detect true DE effects. This phenomenon has been extensively documented in genome-wide association studies (9,57,58) and more recently, in bisulfite sequencing studies (59), but is less explored in RNAseq studies. In particular, none of the currently available count-based methods for identifying DE genes in RNAseq can appropriately control for sample non-independence. Consequently, even though count-based methods have been shown to be more powerful, recent RNAseq studies have turned to linear mixed models (LMMs), which are specifically designed for quantitative traits, to deal with the confounding effects of kinship, population structure or hidden confounders (7,41,60).
Here, we present a Poisson mixed model (PMM) that can explicitly model both over-dispersed count data and sample non-independence in RNAseq data for effective DE analysis. To make our model scalable to large datasets, we also develop an accompanying efficient inference algorithm based on an auxiliary variable representation of the Poisson model (61–63) and recent advances in mixed model methods (9,58,64). We refer to the combination of the statistical method and the computational algorithm developed here as MACAU (Mixed model Association for Count data via data AUgmentation), which effectively extends our previous method of the same name on the simpler binomial model (59) to the more difficult Poisson model. MACAU works directly on RNAseq count data and introduces two random effects terms to both control for sample non-independence and account for additional independent over-dispersion. As a result, MACAU properly controls for type I error in the presence of sample non-independence and, in a variety of settings, is more powerful for identifying DE genes than other commonly used methods. We illustrate the benefits of MACAU with extensive simulations and real data applications to three RNAseq studies.
MATERIALS AND METHODS
Methods for comparison
We compared the performance of seven different methods in the main text: (i) our PMM implemented in the MACAU software package (59); (ii) the linear model implemented in the lm function in R; (iii) the LMM implemented in the GEMMA software package (9,58,65); (iv) the Poisson model implemented in the glm function in R (66); (v) the negative binomial model implemented in the glm.nb function in R; (vi) edgeR implemented in the edgeR package in R (25); (vii) DESeq2 implemented in the DESeq2 package in R (24). All methods were used with default settings. The performance of each method in simulations was evaluated using the area under the curve (AUC) function implemented in the pROC package in R (67), a widely used benchmark for RNAseq method comparisons (68).
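As a rough illustration of the AUC criterion, the sketch below computes the rank-based (Mann-Whitney) form of the AUC in plain Python. It is not the pROC implementation: the function name `auc` and the idea of scoring each gene by, for example, the negative log P-value are assumptions made here for illustration only.

```python
def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen DE gene
    (label 1) receives a higher score than a randomly chosen non-DE gene
    (label 0), with ties counted as one half."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the block of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, lab in zip(ranks, labels) if lab)
    # Mann-Whitney U statistic divided by the number of (DE, non-DE) pairs.
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

In a simulation where the truly DE genes are known, `scores` could hold a per-gene evidence measure from one method and `labels` the true DE status; a value near 0.5 indicates no discrimination and a value near 1 indicates near-perfect ranking.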
Both the linear model and the LMM require quantitative phenotypes. Here, we considered six different transformations of count data to quantitative values, taking advantage of several methods proposed to normalize RNAseq data (e.g. (12–14,17,22,33,69)): (i) quantile normalization (TRCQ), where we first divided the number of reads mapped to a given gene by the total number of read counts for each individual, and then for each gene, quantile normalized the resulting proportions across individuals to a standard normal distribution (7); (ii) total read count (TRC) normalization, where we divided the number of reads mapped to a given gene by the total number of read counts for each individual (i.e. counts per million, CPM, without further transformation to a standard normal within genes (25)); (iii) upper quantile (UQ) normalization, where we divided the number of reads mapped to a given gene by the UQ (75th percentile) of all genes for each individual (70); (iv) relative log expression normalization (15); (v) the trimmed mean of M-values (TMM) method (39), where we divided the number of reads mapped to a given gene by the normalization factor output from TMM; and (vi) VOOM normalization (33). Simulation results presented in a supplementary figure (see ‘Results’ section) showed that TRCQ, VOOM and TRC worked better than the other three methods, with TRCQ showing a small advantage. Therefore, we report results using TRCQ throughout the text.
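The TRCQ transformation can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `trcq`, the rank-to-quantile mapping rank/(n + 1) and the handling of ties (broken by input order) are assumptions.

```python
from statistics import NormalDist


def trcq(counts):
    """counts[i][j] is the read count of gene j in individual i.
    Returns a matrix of the same shape with TRCQ-transformed values."""
    n, p = len(counts), len(counts[0])
    # Step 1: divide each count by the individual's total read count.
    totals = [sum(row) for row in counts]
    props = [[counts[i][j] / totals[i] for j in range(p)] for i in range(n)]
    # Step 2: for each gene, map the ranks of the proportions across
    # individuals to standard normal quantiles.
    std_normal = NormalDist()
    out = [[0.0] * p for _ in range(n)]
    for j in range(p):
        order = sorted(range(n), key=lambda i: props[i][j])
        for rank, i in enumerate(order, start=1):
            out[i][j] = std_normal.inv_cdf(rank / (n + 1))  # assumed offset
    return out
```

Because only within-gene ranks survive the second step, the transformed values are insensitive to outlying counts but also discard the mean-variance information carried by the raw counts.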
Simulations
To make our simulations as realistic as possible, we simulated the gene expression count data based on parameters inferred from a real baboon dataset that contains 63 samples (see the next section for a detailed description of the data). We varied the sample size (n) in the simulations (n = 6, 10, 14, 63, 100, 200, 500, 800 or 1000). For n = 63, we used the baboon relatedness matrix $K$ (7). For simulations with n > 63, we constructed a new relatedness matrix $K$ by filling in its off-diagonal elements with randomly drawn off-diagonal elements from the baboon relatedness matrix following (59). For simulations with n < 63, we constructed a new relatedness matrix $K$ by randomly sub-sampling individuals from the baboon relatedness matrix. In cases where the resulting $K$ was not positive definite, we used the nearPD function in R to find the closest positive definite matrix as the final $K$. In most cases, we simulated the TRC $N_i$ for each individual from a discrete uniform distribution with a minimum (1 770 083) and a maximum (9 675 989) TRC (i.e. summation of read counts across all genes) equal to the minimum and maximum TRCs from the baboon data. We scaled the TRCs to ensure that the coefficient of variation (CV; i.e. the standard deviation of the TRCs across individuals divided by their mean) was small (CV = 0.3), moderate (CV = 0.6) or high (CV = 0.9) and then discretized them. In the special case where CV = 0.3 and n = 63, we directly used the observed TRCs per individual from the baboon data (which has a CV = 0.33).
We then repeatedly simulated a continuous predictor variable $x$ from a standard normal distribution (without regard to the pedigree structure). We estimated the heritability of the continuous predictor using GEMMA, and retained $x$ if the heritability ($h_x^2$) estimate (within a tolerance) was 0, 0.4 or 0.8, representing non-heritable, moderately heritable and highly heritable predictors, respectively. Using this procedure, ∼30 percent of the $x$ values generated were retained, with different retention percentages for different heritability values.
Based on the simulated sample size, TRCs and continuous predictor variable, we simulated gene expression values using the following procedure. For the expression of each gene in turn, we simulated the genetic random effects $g$ from a multivariate normal distribution with covariance $K$. We simulated the environmental random effects $e$ based on independent normal distributions. We scaled the two sets of random effects to ensure a fixed value of heritability ($h^2 = V(g)/(V(g) + V(e))$ = 0, 0.3 or 0.6) and a fixed value of over-dispersion variance ($\sigma^2 = V(g) + V(e)$, taking one of three values, the largest being 0.4, chosen close to the lower, median and upper quantiles of the over-dispersion variance inferred from the baboon data, respectively), where the function V(•) denotes the sample variance. We then generated the effect size $\beta$ of the predictor variable on gene expression. The effect size was either 0 (for non-DE genes) or generated so that the predictor explains a certain percentage of the variance in $\log(\lambda)$ (for DE genes). Proportion of variance explained (PVE) values were 15, 20, 25, 30 or 35% to represent different effect sizes. The predictor effects $x\beta$, genetic effects $g$, environmental effects $e$, and an intercept $\mu$ (chosen to ensure that the expected simulated count is 100) were then summed together to yield the latent variable $\log(\lambda) = \mu + x\beta + g + e$. Note that $h^2$ does not include the contribution of $x\beta$, which in many cases represents non-genetic effects. Finally, the read counts were simulated based on a Poisson distribution with rate determined by the TRCs and the latent variable $\lambda$, or $y_i \sim \mathrm{Poi}(N_i \lambda_i)$ for the $i$th individual.
With the above procedure, we first simulated data for n = 63, CV = 0.3, $h_x^2$ = 0, PVE = 0.25, $h^2$ = 0.3 and $\sigma^2$ = 0.25. We then varied one parameter at a time to generate different scenarios for comparison. In each scenario, conditional on the sample size, TRCs and continuous predictor variable, we performed 10 simulation replicates, where ‘replication’ is at the level described in the paragraph above. Each replicate consisted of 10 000 genes. For examining type I error control, all 10 000 genes were non-DE. For the power comparison, 1000 genes were DE while 9000 were non-DE.
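The per-gene simulation procedure described above can be sketched as follows. This is an illustrative sketch in plain Python rather than the code used in the study. Several details are assumptions: the function names, centering the random effects before rescaling, the chunked Poisson sampler, and in particular the intercept `mu = log(target_count / mean(N))`, which only approximately yields an expected count of 100.

```python
import math
import random


def cholesky(K):
    """Lower-triangular L with L L^T = K (K must be positive definite)."""
    n = len(K)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = K[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def sample_var(v):
    m = sum(v) / len(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)


def rescale(v, target_var):
    """Center v and rescale it to the requested sample variance."""
    m = sum(v) / len(v)
    scale = math.sqrt(target_var / sample_var(v)) if target_var > 0 else 0.0
    return [(a - m) * scale for a in v]


def poisson(rate, rng):
    """Knuth's method, applied in chunks so exp(-rate) never underflows."""
    k = 0
    while rate > 0:
        chunk = min(rate, 30.0)
        limit, prod = math.exp(-chunk), rng.random()
        while prod > limit:
            k += 1
            prod *= rng.random()
        rate -= chunk
    return k


def simulate_gene(K, N, x, beta, h2, sigma2, rng, target_count=100.0):
    """Simulate read counts for one gene; returns (counts, g, e)."""
    n = len(N)
    L = cholesky(K)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    g_raw = [sum(L[i][k] * z[k] for k in range(i + 1)) for i in range(n)]
    g = rescale(g_raw, h2 * sigma2)            # V(g) = h2 * sigma2
    e = rescale([rng.gauss(0.0, 1.0) for _ in range(n)],
                (1.0 - h2) * sigma2)           # V(e) = (1 - h2) * sigma2
    mu = math.log(target_count / (sum(N) / n))  # assumed form of the intercept
    lam = [math.exp(mu + x[i] * beta + g[i] + e[i]) for i in range(n)]
    counts = [poisson(N[i] * lam[i], rng) for i in range(n)]
    return counts, g, e
```

Setting `beta = 0` gives a non-DE gene; for a DE gene, `beta` would be chosen so that the predictor explains the desired PVE.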
RNAseq datasets
We considered three published RNAseq datasets in this study, which include small (n < 15), medium (15 ≤ n ≤ 100) and large (n > 100) sample sizes (based on current RNAseq sample sizes in the literature).
The first RNAseq dataset was collected from blood samples of yellow baboons (7) from the Amboseli ecosystem of southern Kenya as part of the Amboseli Baboon Research Project (ABRP) (71). The data are publicly available on GEO with accession number GSE63788. Read counts were measured on 63 baboons and 12 018 genes after stringent quality control as in (7). As in (7), we computed pairwise relatedness values from previously collected microsatellite data (72,73) using the software COANCESTRY (74). The data contains related individuals: 16 pairs of individuals have a kinship coefficient exceeding 1/8 and 48 pairs exceed 1/16. We obtained sex information for each individual from GEO. Sex differences in health and survival are major topics of interest in medicine, epidemiology and evolutionary biology (72,75). Therefore, we used this dataset to identify sex-related gene expression variation. In the analysis, we included the top five expression principal components (PCs) as covariates to control for potential batch effects following the original study (7).
The second RNAseq dataset was collected from skeletal muscle samples of Finnish individuals (60) as part of the Finland-United States Investigation of NIDDM Genetics (FUSION) project (76,77). The data are publicly available in dbGaP with accession code phs001068.v1.p1. Among the 271 individuals in the original study, we selected 267 individuals who have both genotypes and gene expression measurements. Read counts were obtained on these 267 individuals and 21 753 genes following the same stringent quality control as in the original FUSION RNAseq study. For genotypes, we excluded SNPs with minor allele frequency < 0.05 and Hardy-Weinberg equilibrium P-value < $10^{-6}$. We used the remaining 5 696 681 SNPs to compute the relatedness matrix using GEMMA. The data contain remotely related individuals (three pairs of individuals have a kinship coefficient exceeding 1/32 and six pairs exceed 1/64) and are stratified by the municipality from which samples were collected (see ‘Results’ section). Two predictors from the data were available to us: the oral glucose tolerance test (OGTT), which classifies n = 162 individuals as either type II diabetes (T2D) patients (n = 66) or as having normal glucose tolerance (NGT; i.e. controls, n = 96); and a T2D-related quantitative trait, fasting glucose levels (GL), measured on all n = 267 individuals. We used these data to identify genes whose expression level is associated with either T2D or GL. In the analysis, we included age, sex and batch labels as covariates following the original study (60).
The third RNAseq dataset was collected from lymphoblastoid cell lines (LCLs) derived from 69 unrelated Nigerian individuals (YRI) (3). The data are publicly available on GEO with accession number GSE19480. Following the original study (3), we aligned reads to the human reference genome (version hg19) using Burrows-Wheeler Aligner (BWA) (78). We counted the number of reads mapped to each gene on either autosomes or the X chromosome using Ensembl gene annotation information obtained from the UCSC genome browser. We then filtered out lowly expressed genes with zero counts in over 90% of individuals. In total, we obtained gene expression measurements on 13 319 genes. Sex is the only phenotype available in the data and we used sex as the predictor variable to identify sex-associated genes. To demonstrate the efficacy of MACAU in small samples, we randomly subsampled individuals from the data to create small datasets with n = 6 (3 males and 3 females), n = 10 (5 males and 5 females) or n = 14 individuals (7 males and 7 females). For each sample size n, we performed 20 replicates of subsampling and we evaluated method performance by averaging across these replicates. In each replicate, following previous studies (52–56), we used the gene expression covariance matrix as $K$ (i.e. $K = XX^T/p$, where $X$ is the normalized gene expression matrix and p is the number of genes) and applied MACAU to identify sex-associated genes. Note that the gene expression covariance matrix $K$ contains information on sample non-independence caused by hidden confounding factors (52–56). By incorporating $K$, MACAU can be used to control for hidden confounding factors that are commonly observed in sequencing datasets (48–51).
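A gene expression covariance matrix of the form K = XXᵀ/p, with X an n-by-p matrix of normalized expression values (individuals by genes), can be computed as in the sketch below. The function names are hypothetical, and standardizing each gene to mean 0 and variance 1 across individuals is an assumption about what "normalized" means here; constant genes are assumed to have been filtered out beforehand.

```python
import math


def standardize_genes(X):
    """Give each gene (column) mean 0 and variance 1 across individuals."""
    n, p = len(X), len(X[0])
    Z = [[0.0] * p for _ in range(n)]
    for j in range(p):
        col = [X[i][j] for i in range(n)]
        m = sum(col) / n
        sd = math.sqrt(sum((a - m) ** 2 for a in col) / n)
        for i in range(n):
            Z[i][j] = (col[i] - m) / sd
    return Z


def expression_covariance(X):
    """K = X X^T / p for an n-by-p matrix X (individuals by genes)."""
    n, p = len(X), len(X[0])
    return [[sum(X[i][k] * X[j][k] for k in range(p)) / p for j in range(n)]
            for i in range(n)]
```

With this standardization the diagonal of K averages to 1, so K is on the same scale as a standardized relatedness matrix.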
For each of these RNAseq datasets and each trait, we used a constrained permutation procedure to estimate the empirical false discovery rate (FDR) of a given analytical method. In the constrained permutation procedure, we permuted the predictor across individuals, estimated the heritability of the permuted predictor and retained the permutation only if the permuted predictor had a heritability estimate ($h_x^2$) similar to that of the original predictor, within a tolerance of ±0.01 (for the original predictors, $h_x^2$ = 0.0002 for sex in the baboon data; 0.0121 for T2D and 0.4023 for GL in the FUSION data; and values all close to zero, with small variations depending on the sub-sample size, in the YRI data). We then analyzed all genes using the permuted predictor. We repeated the constrained permutation procedure and analysis 10 times, and combined the P-values from these 10 constrained permutations. We used this set of P-values as a null distribution from which to estimate the empirical FDR for any given P-value threshold (59). This constrained procedure thus differs from the usual unconstrained permutation procedure (every permutation retained) (79) in that it constrains the permuted predictor to have the same $h_x^2$ as the original predictor. We chose to use the constrained permutation procedure here because the unconstrained procedure is invalid under the mixed model assumption: the subjects are not exchangeable in the presence of sample non-independence (individual relatedness, population structure or hidden confounders) (79,80). To validate our constrained permutation procedure and test its effectiveness in estimating FDR, we performed a simulation with 1000 DE genes and 9000 non-DE genes as described above. We considered three predictor variables with different heritability: $h_x^2 = 0$, $h_x^2 = 0.4$ and $h_x^2 = 0.8$. For each predictor variable and each P-value threshold, we computed the true FDR and then estimated the FDR based on either the constrained or unconstrained permutation procedures. The simulation results presented in a supplementary figure (see ‘Results’ section) demonstrate that the constrained permutation procedure provides a much more accurate estimate of the true FDR, while the unconstrained permutation procedure often under-estimates the true FDR. Therefore, we applied the constrained permutation procedure for all real data analysis.
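The constrained permutation procedure can be outlined in code as below. This is a sketch under stated assumptions, not the authors' implementation: `estimate_h2` is a hypothetical callable standing in for the heritability estimation done with GEMMA, and the FDR estimator shown (expected null discoveries per permutation divided by observed discoveries) is one common choice that the text does not spell out.

```python
import random


def constrained_permutations(x, estimate_h2, target_h2, n_keep=10,
                             tol=0.01, rng=None, max_tries=100000):
    """Permute x repeatedly and keep only permutations whose estimated
    heritability is within `tol` of the original predictor's value."""
    rng = rng or random.Random(0)
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_keep:
            break
        perm = list(x)
        rng.shuffle(perm)
        if abs(estimate_h2(perm) - target_h2) <= tol:
            kept.append(perm)
    return kept


def empirical_fdr(observed_p, null_p, n_perm, threshold):
    """Estimated FDR at a P-value threshold, using P-values pooled over
    n_perm permutations as the null distribution."""
    n_observed = sum(p <= threshold for p in observed_p)
    if n_observed == 0:
        return 0.0
    n_null_per_perm = sum(p <= threshold for p in null_p) / n_perm
    return min(1.0, n_null_per_perm / n_observed)
```

Dropping the heritability check in `constrained_permutations` (for example, by passing a tolerance large enough to accept every permutation) recovers the usual unconstrained procedure.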
Finally, we investigated whether the methods we compared were sensitive to outliers (31,81,82) in the first two datasets. To examine outlier sensitivity, we first identified genes with potential outliers using BBSeq (18). In total, we identified 8 genes with potential outliers in the baboon data, 130 genes with potential outliers in the FUSION data (n = 267) and 43 genes with potential outliers in the subset of the FUSION data for which we had T2D diagnoses (n = 162). We counted the number of genes with potential outliers in the top 1000 genes with strong DE association evidence. In the baboon data, 4 genes with potential outliers are in the top 1000 genes with the strongest sex association determined by various methods: two of them by the negative binomial model, three of them by the Poisson model, but zero of them by MACAU, linear model or GEMMA. In the FUSION data, for T2D analysis, 9 genes with potential outliers are in the top 1000 genes with the strongest T2D association determined by various methods: one by MACAU, three by negative binomial, six by Poisson, one by linear and one by GEMMA. For GL analysis, 15 genes with potential outliers are in the top 1000 genes with the strongest GL association determined by various methods: two by MACAU, seven by negative binomial, nine by Poisson, three by linear and three by GEMMA. All outliers are presented in supplementary figures (see ‘Results' section). Therefore, the influence of outliers on DE analysis is small in the real data.
RESULTS
MACAU overview
Here, we provide a brief overview of the PMM; more details are available in the Supplementary Data. To identify DE genes with RNAseq data, we examine one gene at a time. For each gene, we model the read counts with a Poisson distribution
$$y_i \sim \mathrm{Poi}(N_i \lambda_i), \quad i = 1, 2, \ldots, n,$$
where for the $i$th individual, $y_i$ is the number of reads mapped to the gene (or isoform); $N_i$ is the TRCs for that individual summing read counts across all genes; and $\lambda_i$ is an unknown Poisson rate parameter. We model the log-transformed rate $\lambda_i$ as a linear combination of several parameters
$$\log(\lambda_i) = w_i^T \alpha + x_i \beta + g_i + e_i,$$
$$g = (g_1, \ldots, g_n)^T \sim \mathrm{MVN}(0, \sigma^2 h^2 K),$$
$$e = (e_1, \ldots, e_n)^T \sim \mathrm{MVN}(0, \sigma^2 (1 - h^2) I),$$
where $w_i$ is a c-vector of covariates (including the intercept); $\alpha$ is a c-vector of corresponding coefficients; $x_i$ represents the predictor variable of interest (e.g. experimental perturbation, sex, disease status or genotype); $\beta$ is its coefficient; $g$ is an n-vector of genetic effects; $e$ is an n-vector of environmental effects; $K$ is an n by n positive semi-definite matrix that models the covariance among individuals due to individual relatedness, population structure or hidden confounders; $I$ is an n by n identity matrix that models independent environmental variation; $\sigma^2 h^2$ is the genetic variance component; $\sigma^2 (1 - h^2)$ is the environmental variance component; and MVN denotes the multivariate normal distribution. In the above model, we assume that $K$ is known and can be computed based on either pedigree, genotype or the gene expression matrix (9). For pedigree/genotype data, when $K$ is standardized to have $\mathrm{tr}(K)/n = 1$, $h^2 \in [0, 1]$ has the usual interpretation of heritability (9), where tr(⋅) denotes the trace of a matrix. Importantly, unlike several other DE methods (15,25), our model can deal with both continuous and discrete predictor variables.
Both of the random effects terms $g$ and $e$ model over-dispersion, the extra variance not explained by a Poisson model. However, the two terms $g$ and $e$ model two different aspects of over-dispersion. Specifically, $g$ models the fraction of the extra variance that is explained by sample non-independence while $e$ models the fraction of the extra variance that is independent across samples. For example, let us consider a simple case in which all samples have the same sequencing depth (i.e. $N_i = N$) and there is only one intercept term $\mu$ included as the covariate. In this case, the random effects term $e$ models the independent over-dispersion: without $g$, the variance $V(y_i) = E(y_i)\left(1 + E(y_i)(\exp(\sigma^2) - 1)\right)$ is still larger than the mean $E(y_i) = N \exp(\mu + \sigma^2/2)$, with the difference between the two increasing with increasing $\sigma^2$. In a similar fashion, the random effects term $g$ models the non-independent over-dispersion by accounting for the sample covariance matrix $K$. By modeling both aspects of over-dispersion, our PMM effectively generalizes the commonly used negative binomial model (which only models independent extra variance) to account for sample non-independence. In addition, our PMM naturally extends the commonly used LMM (9,64,83,84) to modeling count data.
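The claim that the independent random effect inflates the variance above the Poisson mean can be checked with a short derivation. The steps below are a sketch under the simplifying assumptions of the example only (common depth $N$, a single intercept $\mu$, no $g$ term, so that $e_i \sim N(0, \sigma^2)$); they use the law of total variance and the moments of a log-normal variable.

```latex
\begin{aligned}
\lambda_i &= \exp(\mu + e_i), \qquad e_i \sim N(0, \sigma^2), \qquad
  y_i \mid \lambda_i \sim \mathrm{Poi}(N \lambda_i), \\
E(y_i) &= N\, E(\lambda_i) = N \exp(\mu + \sigma^2/2), \\
V(\lambda_i) &= \exp(2\mu + \sigma^2)\,\bigl(\exp(\sigma^2) - 1\bigr), \\
V(y_i) &= E\bigl[V(y_i \mid \lambda_i)\bigr] + V\bigl[E(y_i \mid \lambda_i)\bigr]
        = E(y_i) + N^2\, V(\lambda_i) \\
       &= E(y_i) + E(y_i)^2 \bigl(\exp(\sigma^2) - 1\bigr).
\end{aligned}
```

The excess over the Poisson variance, $E(y_i)^2(\exp(\sigma^2) - 1)$, is zero when $\sigma^2 = 0$ and grows with both $\sigma^2$ and the mean, which is the quadratic mean-variance relationship familiar from the negative binomial model.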
Our goal here is to test the null hypothesis that gene expression levels are not associated with the predictor variable of interest, or $H_0: \beta = 0$. Testing this hypothesis requires estimating parameters in the PMM (as has previously been done in other settings (85,86), including for modeling uneven RNAseq read distribution inside transcripts (13); details in Supplementary Data). The PMM belongs to the generalized LMM family, where parameter estimation is notoriously difficult because of the random effects and the resulting intractable n-dimensional integral in the likelihood. Standard estimation methods rely on numerical integration (87) or Laplace approximation (88,89), but neither strategy scales well with the increasing dimension of the integral, which in our case equals the sample size. As a consequence, standard approaches often produce biased estimates and overly narrow (i.e. anti-conservative) confidence intervals (90–96). To overcome the high-dimensionality of the integral, we instead develop a novel Markov Chain Monte Carlo (MCMC) algorithm, which, with enough iterations, can achieve high inference accuracy (97,98). We use MCMC to draw posterior samples but rely on the asymptotic normality of both the likelihood and the posterior distributions (99) to obtain the approximate maximum likelihood estimate $\hat{\beta}$ and its standard error se($\hat{\beta}$). With $\hat{\beta}$ and se($\hat{\beta}$), we can construct approximate Wald test statistics and P-values for hypothesis testing (Supplementary Material). Although we use MCMC, our procedure is frequentist in nature.
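The final testing step can be illustrated with a small sketch. The Wald statistic and its two-sided normal P-value are standard; how the estimate and its standard error are extracted from the MCMC output is described only in the Supplementary Data, so the `summarize_draws` helper below (posterior mean and standard deviation as stand-ins) is purely an assumption for illustration, as are both function names.

```python
from statistics import NormalDist, mean, stdev


def summarize_draws(beta_draws):
    """Assumed stand-in: treat the posterior mean and standard deviation
    of the MCMC draws as the estimate and its standard error."""
    return mean(beta_draws), stdev(beta_draws)


def wald_test(beta_hat, se):
    """Wald z statistic and two-sided P-value under a standard normal null."""
    z = beta_hat / se
    p_value = 2.0 * NormalDist().cdf(-abs(z))
    return z, p_value
```

For example, `wald_test(*summarize_draws(draws))` would return the statistic and P-value for one gene given its list of sampled coefficients.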
At the technical level, our MCMC algorithm is also novel, taking advantage of an auxiliary variable representation of the Poisson likelihood (61–63) and recent linear algebra innovations for fitting LMMs (9,58,64). Our MCMC algorithm introduces two continuous latent variables for each individual to replace the count observation, effectively extending our previous approach of using one latent variable for the simpler binomial distribution (59). Compared with a standard MCMC, our new MCMC algorithm reduces the computational complexity of each MCMC iteration from cubic to quadratic with respect to the sample size. Therefore, our method is orders of magnitude faster than the popular Bayesian software MCMCglmm (100) and can be used to analyze hundreds of samples and tens of thousands of genes with a single desktop PC (Supplementary Figure S1). Although our procedure is stochastic in nature, we find the MCMC errors are often small enough to ensure stable P-values across independent MCMC runs (Supplementary Figure S2). We summarize the key features of our method along with other commonly used approaches in Table 1.
Table 1. Current approaches for identifying differentially expressed genes in RNAseq.
Statistical method | Directly models counts? | Controls for biological covariates? | Controls for sample non-independence? | Example software that implements the method |
---|---|---|---|---|
Linear regression | No | Yes | No | R and many others |
Linear mixed model | No | Yes | Yes | GEMMA (9) and EMMA (84) |
Poisson model | Yes | Some methods do | No | GLMP (66) and DEGseq (20) |
Negative binomial model | Yes | Some methods do | No | edgeR (25), DESeq (15) and GLMNB (66) |
Poisson mixed model | Yes | Yes | Yes | MACAU |
Simulations: control for sample non-independence
We performed a series of simulations to compare the performance of the PMM implemented in MACAU with four other commonly used methods: (i) a linear model; (ii) the LMM implemented in GEMMA (9,58); (iii) a Poisson model; and (iv) a negative binomial model. We used quantile-transformed data for linear model and GEMMA (see ‘Materials and Methods’ section for normalization details and a comparison between various transformations; Supplementary Figure S3) and used raw count data for the other three methods. To make our simulations realistic, we use parameters inferred from a published RNAseq dataset on a population of wild baboons (7,71) to perform simulations (‘Materials and Methods’ section); this baboon dataset contains known related individuals and hence invokes the problem of sample non-independence outlined above.
Our first set of simulations was performed to evaluate the effectiveness of MACAU and the other four methods in controlling for sample non-independence. To do so, we simulated expression levels for 10 000 genes in 63 individuals (the sample size from the baboon dataset). Simulated gene expression levels are influenced by both independent environmental effects and correlated genetic effects, where genetic effects are simulated based on the baboon kinship matrix (estimated from microsatellite data (7)) with either zero ($h^2 = 0$), moderate ($h^2 = 0.3$) or high ($h^2 = 0.6$) heritability values. We also simulated a continuous predictor variable x that is itself moderately heritable ($h_x^2 = 0.4$). Because we were interested in the behavior of the null in this set of simulations, gene expression levels were not affected by the predictor variable (i.e. no genes were truly DE).
Figure 1 and Supplementary Figures S4 and S5 show quantile–quantile plots for analyses using MACAU and the other four methods against the null (uniform) expectation, one for each simulated value of $h^2$. When genes are heritable and the predictor variable is also correlated with individual relatedness, the resulting P-values from the DE analysis are expected to be uniform only for a method that properly controls for sample non-independence. If a method fails to control for sample non-independence, then the P-values would be inflated, resulting in false positives.
Our results show that, because MACAU controls for sample non-independence, the P-values from MACAU follow the expected uniform distribution closely (and are slightly conservative) regardless of whether gene expression is moderately or highly heritable. The genomic control factors from MACAU are close to 1 (Figure 1 and Supplementary Figure S4). Even if we use a relatively relaxed q-value cutoff of 0.2 to identify DE genes, we do not incorrectly identify any genes as DE with MACAU. In contrast, the P-values from negative binomial are inflated and skewed toward low (significant) values, especially for gene expression levels with high heritability. With negative binomial, 27 DE genes (when $h^2 = 0.3$) or 21 DE genes (when $h^2 = 0.6$) are erroneously detected at the q-value cutoff of 0.2. The inflation of P-values is even more acute in Poisson, presumably because the Poisson model accounts for neither individual relatedness nor over-dispersion. For non-count-based models, the P-values from a linear model are slightly skewed towards significant values, with three DE genes (when $h^2 = 0.3$) and one DE gene (when $h^2 = 0.6$) erroneously detected at q < 0.2. In contrast, because the LMM in GEMMA also accounts for individual relatedness, it controls for sample non-independence well. Finally, when genes are not heritable, all methods except Poisson correctly control type I error (Supplementary Figure S5).
Two important factors influence the severity of sample non-independence in RNAseq data (Figure 2). First, the inflation of P-values in the negative binomial, Poisson and linear models becomes more acute with increasing sample size. In particular, with heritable gene expression and a large sample size, the genomic control factor from the negative binomial, Poisson and linear models reaches 1.71, 82.28 and 1.41, respectively. In contrast, even in this setting, the genomic control factors from both MACAU and GEMMA remain close to 1 (0.97 and 1.01, respectively). Second, the inflation of P-values in the three models also becomes more acute when the predictor variable is more correlated with population structure. Thus, for a highly heritable predictor variable, the genomic control factor from the negative binomial, Poisson and linear models increases to 2.13, 101.43 and 1.81, respectively, whereas that from MACAU and GEMMA remains close to 1 (1.02 and 1.05).
We also compared MACAU with edgeR (25) and DESeq2 (15), two commonly used methods for DE analysis (11,101). Because edgeR and DESeq2 were designed for discrete predictor variables, we discretized the continuous predictor into 0/1 based on the median predictor value across individuals. We then applied all methods to the same binarized predictor values for comparison. Results are shown in Supplementary Figure S6. For the five methods compared above, the results on binarized values are comparable with those for continuous variables (i.e. Supplementary Figure S6 versus Figure 1). Both edgeR and DESeq2 produce anticonservative P-values and perform similarly to the negative binomial model in terms of type I error control.
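The median split described above can be illustrated with a short sketch. The function name is hypothetical, and the handling of values exactly equal to the median (assigned to group 0 here) is an assumption, since the text does not specify it.

```python
from statistics import median


def binarize_at_median(values):
    """Code each value as 1 if it lies above the sample median, otherwise 0."""
    cut = median(values)
    return [1 if v > cut else 0 for v in values]


predictor = [0.2, 1.5, 0.7, 3.1]
binary_predictor = binarize_at_median(predictor)
```

With an odd number of individuals, as in a sample of 63, the individual at the median falls into group 0 under this convention, so the two groups differ in size by one.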
Finally, we explored the use of PCs from the gene expression matrix or the genotype matrix to control for sample non-independence. Genotype PCs have been used as covariates to control for population stratification in association studies (102). However, recent comparative studies have shown that using PCs is less effective than using LMMs (83,103). Consistent with the poorer performance of PCs in association studies (83,103), using the top PCs from either the gene expression matrix or the genotype matrix does not improve type I error control for negative binomial, Poisson, linear, edgeR or DESeq2 approaches (Supplementary Figures S7 and 8).
Simulations: power to identify DE genes
Our second set of simulations was designed to compare the power of different methods for identifying DE genes, again based on parameters inferred from real data. This time, we simulated a total of 10 000 genes, among which 1000 genes were truly DE and 9000 were non-DE. For the DE genes, simulated effect sizes corresponded to a fixed PVE in gene expression levels that ranged from 15 to 35%. For each set of parameters, we performed 10 replicate simulations and measured model performance based on the AUC (as in (35,68,104)). We also examined several key factors that could influence the relative performance of the alternative methods: (i) gene expression heritability (h2); (ii) correlation between the predictor variable and genetic relatedness (measured by the heritability of the predictor variable x); (iii) variation of the TRCs across samples (measured by the CV); (iv) the over-dispersion parameter; (v) the effect size (PVE); and (vi) sample size (n). To do so, we first performed simulations using a default set of values (h2 = 0.3, predictor heritability = 0, CV = 0.3, over-dispersion parameter = 0.25, PVE = 0.25 and n = 63) and then varied them one at a time to examine the influence of each factor on the relative performance of each method.
Our results show that MACAU works either as well as or better than other methods in almost all settings (Figure 3 and Supplementary Figures S9–14), probably because it both models count data directly and controls for sample non-independence. In contrast, the Poisson approach consistently fared the worst across all simulation scenarios, presumably because it fails to account for any sources of over-dispersion (Figure 3 and Supplementary Figures S9–14).
Among the factors that influence the relative rank of various methods, the most important factor was heritability (Figure 3A). While all methods perform worse with increasing gene expression heritability, heritability disproportionately affects the performance of models that do not account for relatedness (i.e. negative binomial, Poisson and linear), whereas when heritability is zero (h2 = 0), these approaches tend to perform slightly better. Therefore, for non-heritable genes, linear models perform slightly better than GEMMA, and negative binomial models work similarly or slightly better than MACAU. This observation most likely arises because linear and negative binomial models require fewer parameters and thus have a greater number of degrees of freedom. However, even in this setting, the difference between MACAU and negative binomial is small, suggesting that MACAU is robust to model misspecification and works reasonably well even for non-heritable genes. On the other hand, when heritability is moderate (h2 = 0.3) or high (h2 = 0.6), the methods that account for sample non-independence are much more powerful than the methods that do not. Because almost all genes are influenced by cis-eQTLs (46,47) and are thus likely heritable to some extent, MACAU's robustness for non-heritable genes and its high performance gain for heritable genes make it appealing.
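Because the simulations label each gene as truly DE or non-DE, the AUC can be computed directly from a per-gene evidence score. The following is a minimal rank-based sketch (equivalent to the Mann-Whitney statistic), assuming larger scores mean stronger evidence, for example -log10 of the P-value; the function name is illustrative.

```python
def auc_from_scores(scores, is_de):
    """Probability that a random true-DE gene outscores a random non-DE gene.

    Ties receive the average rank, so tied pairs count as one half.
    """
    order = sorted(range(len(scores)), key=lambda idx: scores[idx])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for d in is_de if d)
    n_neg = len(is_de) - n_pos
    rank_sum = sum(r for r, d in zip(ranks, is_de) if d)
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

An AUC of 0.5 corresponds to random ranking and an AUC of 1 to a ranking that places every truly DE gene ahead of every non-DE gene.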
The second most important factor in relative model performance was the variation of TRCs across individuals (CV; Figure 3B). While all methods perform worse with increasing CV, CV particularly affects the performance of GEMMA. Specifically, when CV is small (0.3, as in the baboon data), GEMMA works well and is the second best method behind MACAU. However, when CV is moderate (0.6) or high (0.9), the performance of GEMMA quickly decays: it becomes only the fourth best method when CV = 0.9. GEMMA performs poorly in high CV settings presumably because the LMM fails to account for the mean-variance dependence observed in count data, which is in agreement with previous findings (59,105).
The other four factors we explored had small impacts on the relative performance of the alternative methods, although they did affect their absolute performance. For example, as one would expect, power increases with large effect sizes (PVE) (Supplementary Figure S9) or large sample sizes (Supplementary Figure S10), and decreases with large over-dispersion (Supplementary Figure S11) or large predictor variable heritability (Supplementary Figure S12).
Finally, we included comparisons with edgeR (25) and DESeq2 (15). In the basic parameter simulation setting (n = 63, CV = 0.3, predictor heritability = 0, PVE = 0.25, h2 = 0.3 and over-dispersion parameter = 0.25), we again discretized the continuous predictor into a binary 0/1 variable based on the median predictor value across individuals. Results for all methods are shown in Supplementary Figure S13A. For the five methods also tested on a continuous predictor variable, the power on binarized values is much reduced compared with the power when the predictor variable is modeled without binarization (e.g. Supplementary Figure S13A versus Figure 3). Further, neither edgeR nor DESeq2 performs well, consistent with the recent move from these methods towards linear models in differential expression analysis (3,7,45–47,106). This result is not contingent on having large sample sizes. In small sample size settings (n = 6, n = 10 and n = 14, with samples balanced between the two classes, 0 or 1), MACAU again outperforms the other methods, though the power difference is much smaller (n = 10 and n = 14; Supplementary Figures S13C and 13D) and sometimes negligible (n = 6, Supplementary Figure S13B).
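The mean-variance dependence mentioned above can be made explicit with a short derivation. The notation here is an assumption introduced for illustration: $y_i$ is the read count for individual $i$, $N_i$ is the total read count, $\eta_i$ collects the fixed effects on the log scale, and $\varepsilon_i$ is a normally distributed over-dispersion term with variance $\sigma^2$.

```latex
\begin{align*}
y_i \mid \varepsilon_i &\sim \mathrm{Poisson}\!\left(N_i\, e^{\eta_i + \varepsilon_i}\right),
  \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \\
m_i := \mathrm{E}[y_i] &= N_i\, e^{\eta_i + \sigma^2/2}, \\
\mathrm{Var}(y_i) &= \mathrm{E}\!\left[\mathrm{Var}(y_i \mid \varepsilon_i)\right]
  + \mathrm{Var}\!\left(\mathrm{E}[y_i \mid \varepsilon_i]\right)
  = m_i + m_i^2\left(e^{\sigma^2} - 1\right).
\end{align*}
```

Under these assumptions the variance of each count depends on its mean, and the mean depends on $N_i$, so samples with very different total read counts have very different variances. A linear model fitted to normalized data with a single residual variance does not capture this, which is consistent with the decline of GEMMA at high CV.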
In summary, the power of MACAU and other methods, as well as the power difference between methods, is influenced in a continuous fashion by multiple factors. Larger sample sizes, larger effect sizes, lower read depth variation, lower gene expression heritability, lower predictor variable heritability and lower over-dispersion all increase power. However, MACAU's power is less diminished by high gene expression heritability and high read depth variability than the non-mixed model methods, while retaining the advantage of modeling the count data directly. In challenging data analysis settings (e.g. when sample size is low and effect size is low: Supplementary Figure S13B for n = 6), no method stands out and using MACAU results in no or negligible gains in power relative to other methods. When the sample size is low (n = 6) and effect sizes are large, however, MACAU consistently outperforms the other methods (n = 6, Supplementary Figure S14).
Real data applications
To gain insight beyond simulation, we applied MACAU and the other six methods to three recently published RNAseq datasets.
The first dataset we considered is the baboon RNAseq study (7) used to parameterize the simulations above. Expression measurements on 12 018 blood-expressed genes were collected by the Amboseli Baboon Research Project (ABRP) (71) for 63 adult baboons (26 females and 37 males), among which some were relatives. Here, we applied MACAU and the six other methods to identify genes with sex-biased expression patterns. Sex-associated genes are known to be enriched on sex chromosomes (107,108), and we use this enrichment as one of the criteria to compare method performance, as in (18). Because the same nominal P-value from different methods may correspond to different type I errors, we compared methods based on empirical FDR. In particular, we permuted the data to construct an empirical null, estimated the FDR at any given P-value threshold, and counted the number of discoveries at a given FDR cutoff (see ‘Materials and Methods’ section for permutation details and a comparison between two different permutation procedures; Supplementary Figure S15).
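The permutation-based comparison described above can be sketched as follows. This is one common estimator of empirical FDR and is only an assumption about the general approach; the exact procedure used in the study is given in its ‘Materials and Methods’ section, and the function names here are hypothetical.

```python
def empirical_fdr(observed_p, permuted_p_sets, threshold):
    """Estimated FDR at a P-value threshold.

    Numerator: average number of permuted-null P-values at or below the threshold.
    Denominator: number of observed P-values at or below the threshold.
    """
    n_obs = sum(1 for p in observed_p if p <= threshold)
    if n_obs == 0:
        return 0.0
    null_counts = [sum(1 for p in perm if p <= threshold) for perm in permuted_p_sets]
    expected_false = sum(null_counts) / len(null_counts)
    return min(1.0, expected_false / n_obs)


def discoveries_at_fdr(observed_p, permuted_p_sets, fdr_cutoff):
    """Largest number of discoveries whose estimated FDR stays within the cutoff."""
    best = 0
    for t in sorted(set(observed_p)):
        if empirical_fdr(observed_p, permuted_p_sets, t) <= fdr_cutoff:
            best = sum(1 for p in observed_p if p <= t)
    return best
```

The threshold scan is quadratic in the number of genes, which is acceptable for illustration but would need a sorted-merge implementation for genome-wide data.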
In agreement with our simulations, MACAU was the most powerful method of those we considered. Specifically, at an empirical FDR of 5%, MACAU identified 105 genes with sex-biased expression patterns, 40% more than that identified by the linear model, the second best method at this FDR cutoff (Figure 4A). At a more relaxed FDR of 10%, MACAU identified 234 sex-associated genes, 47% more than that identified by the negative binomial model, the second best method at this FDR cutoff (Figure 4A). Further, as expected, the sex-associated genes detected by MACAU are enriched on the X chromosome (the Y chromosome is not assembled in baboons and is thus ignored), and this enrichment is stronger for the genes identified by MACAU than by the other methods (Figure 4B). Of the remaining approaches, the negative binomial, linear model and GEMMA all performed similarly and are ranked right after MACAU. The Poisson model performs the worst, and edgeR and DESeq2 fall between the Poisson model and the other methods (Figure 4A and B).
The second dataset we considered is an RNAseq study on T2D collected as part of the FUSION study (60). Here, the data were collected from skeletal muscle samples from 267 individuals with expression measurements on 21 753 genes. Individuals are from three municipalities (Helsinki, Savitaipale and Kuopio) in Finland. Individuals within each municipality are more closely related than individuals between municipalities (e.g. the top genotype PCs generally correspond to the three municipalities; Supplementary Figure S16). Two related phenotypes were available to us: 162 individuals with T2D or NGT status (i.e. case/control) based on the OGTT and 267 individuals with the quantitative trait fasting GL, a biologically relevant trait of T2D.
We performed analyses to identify genes associated with T2D status as well as genes associated with GL. To accommodate edgeR and DESeq2, we also discretized the continuous GL values into binary 0/1 categories based on the median GL value across individuals. We refer to the resulting values as GL01. Therefore, we performed two sets of analyses for GL: one on the continuous GL values and the other on the discretized GL01 values. Consistent with simulations and the baboon data analysis, MACAU identified more T2D-associated genes and GL-associated genes than other methods across a range of empirical FDR values. For the T2D analysis, MACAU identified 23 T2D-associated genes at an FDR of 5%, while GEMMA and the linear model, the second best methods at this FDR cutoff, identified only 1 T2D-associated gene (Figure 4C). Similarly, at an FDR of 10%, MACAU identified 123 T2D-associated genes, 51% more than that identified by the linear model, the second best method at this FDR cutoff (Figure 4C). For GL analysis, based on an FDR of 5%, MACAU detected 12 DE genes, while the other methods did not identify any DE genes at this FDR cutoff. At an FDR of 10%, MACAU identified 100 GL associated genes, while the second best methods—the linear model and GEMMA—identified 12 DE genes (Figure 4E). For the dichotomized GL01, none of the methods detected any DE genes even at a relaxed FDR cutoff of 20%, highlighting the importance of modeling the original continuous predictor variable in DE analysis.
Several lines of evidence support the biological validity of the genes detected by MACAU. First, we performed gene ontology (GO) analysis using LRpath (109) on T2D and GL associated genes identified by MACAU, as in the FUSION study (60) (Supplementary Figure S17). The GO analysis results for T2D and GL are consistent with previous studies (60,110) and are also similar to each other, as expected given the biological relationship between the two traits. In particular, T2D status and high GL are associated with decreased expression of cellular respiratory pathway genes, consistent with previous observations (60,110). T2D status and GL are also associated with several pathways that are related to mTOR, including generation of precursor metabolites, poly-ubiquitination and vesicle trafficking, in agreement with a prominent role of mTOR pathway in T2D etiology (111–114).
Second, we performed overlap analyses between T2D-associated and GL-associated genes. We reasoned that T2D-associated genes are likely associated with GL because T2D shares a common genetic basis with GL (115–117) and T2D status is determined in part by fasting GL. Therefore, we used the overlap between genes associated with T2D and genes associated with GL as a measure of method performance. In the overlap analysis, genes with the strongest T2D association identified by MACAU show a larger overlap with the top 1000 genes that have the strongest GL association than did genes identified by other methods (Figure 4D). For instance, among the top 100 genes with the strongest T2D-association evidence from MACAU, 63 also show strong association evidence with GL. In contrast, only 55 of the top 100 genes with the strongest T2D association identified by GEMMA, the second best method, show strong association evidence with GL. We observed similar results, with MACAU performing the best, when performing the reciprocal analysis (overlap between genes with the strongest GL association and the top 1000 genes that have the strongest T2D association: Figure 4F). To include the comparison with edgeR and DESeq2, we further examined the overlap between T2D-associated genes and GL01-associated genes for all methods (Supplementary Figure S18). Again, MACAU performs the best, followed by GEMMA and the linear model, and neither edgeR nor DESeq2 performs well in this context (Supplementary Figure S18). Therefore, MACAU appears to both confer more power to identify biologically relevant DE genes and be more consistent across analyses of related phenotypes.
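The overlap measure used in this comparison amounts to intersecting two ranked gene lists. A minimal sketch is shown below, assuming each method reports one P-value per gene for each trait; the function names and the toy gene identifiers are hypothetical.

```python
def top_k_genes(pvalue_by_gene, k):
    """Identifiers of the k genes with the smallest P-values."""
    ranked = sorted(pvalue_by_gene.items(), key=lambda item: item[1])
    return [gene for gene, _ in ranked[:k]]


def overlap_count(p_trait_a, p_trait_b, k_a, k_b):
    """Number of top-k_a genes for trait A that are also among the top-k_b genes for trait B."""
    return len(set(top_k_genes(p_trait_a, k_a)) & set(top_k_genes(p_trait_b, k_b)))


# Toy example with hypothetical genes and P-values.
p_t2d = {"g1": 0.001, "g2": 0.01, "g3": 0.2, "g4": 0.5}
p_gl = {"g1": 0.02, "g2": 0.6, "g3": 0.03, "g4": 0.9}
shared = overlap_count(p_t2d, p_gl, k_a=2, k_b=2)
```

In the analysis described in the text, `k_a` would be 100 and `k_b` would be 1000, applied to the P-values from each method in turn.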
To assess the type I error rate of various methods, we permuted the trait data from the baboon and the FUSION studies. Consistent with our simulation results, the P-values from MACAU and GEMMA under the permuted null were close to uniformly distributed (slightly conservative) in both datasets, whereas the other methods were not (Supplementary Figures S19 and 20). In addition, none of the methods compared here are sensitive to outliers in the two datasets (Supplementary Figures S21–23).
Finally, although large, population-based RNAseq datasets are becoming more common, MACAU's flexible PMM modeling framework allows it to be applied to DE analysis in small datasets with unrelated individuals as well. In this setting, MACAU can use the gene expression covariance matrix as the relatedness matrix to control for hidden confounding effects that are commonly observed in sequencing studies (48–51). Hidden confounders can induce similarity in gene expression levels across many genes even though individuals are unrelated (52–56), similar to the effects of kinship or population structure. Therefore, by defining the relatedness matrix using a gene expression (instead of genetic) covariance matrix, MACAU can effectively control for sample non-independence induced by hidden confounders, thus extending the LMM widely used to control for hidden confounders in array-based studies (52–56) to sequencing count data.
To illustrate this application, we analyzed a third dataset on LCLs derived from 69 unrelated Nigerian individuals (YRI) (3) from the HapMap project (118), with expression measurements on 13 319 genes. We also aimed to identify sex-associated genes in this dataset. To demonstrate the effectiveness of MACAU in small samples, we randomly subsampled individuals from the data to create small datasets with either n = 6 (3 males and 3 females), n = 10 (5 males and 5 females) or n = 14 individuals (7 males and 7 females). For each sample size n, we performed 20 replicates of random subsampling and then evaluated method performance by averaging across replicates. In each replicate, we used the gene expression covariance matrix as the relatedness matrix and compared MACAU's performance against other methods. Because of the small sample size, none of the methods were able to identify DE genes at an FDR cutoff of 10%, consistent with recent arguments that at least 6–12 biological replicates are needed to ensure sufficient power and replicability in DE analysis (11). We therefore used enrichment of genes on the sex chromosomes to compare the performance of different methods (Supplementary Figure S24). The enrichment of top ranked sex-associated genes on sex chromosomes has previously been used for method comparison and is especially suitable for comparing methods in the presence of batch effects and other hidden confounding factors (119).
In this comparison, MACAU performs the best of all methods when the sample size is either n = 10 or n = 14, and is ranked among the best (together with the negative binomial model) when n = 6. For instance, when n = 6, among the top 50 genes identified by each method, the average number of genes on the sex chromosomes for MACAU, negative binomial, Poisson, edgeR, DESeq2, GEMMA and the linear model is 3.3, 2.7, 3.1, 1.8, 3.0, 2.0 and 2.4, respectively. The advantage of MACAU becomes larger when the sample size increases: for example, when n = 14, an average of 10.6 genes in the top 50 genes from MACAU are on the sex chromosomes, which is again larger than that from the negative binomial (8.3), Poisson (6.0), edgeR (6.65), DESeq2 (8.8), GEMMA (9.8) or the linear model (8.05). These results suggest that MACAU can also perform better than existing methods in relatively small sample study designs with unrelated individuals by controlling for hidden confounders. However, MACAU's power gain is much smaller in this setting than in the first two datasets we considered (the baboon and FUSION data). In addition, MACAU's power gain is negligible in the case of n = 6 when compared with the second best method, though its power gain over the commonly used edgeR and DESeq2 is still substantial. MACAU's small power gain in this dataset presumably stems from both the small sample size and the small effect size of sex in the data, consistent with previous reports for blood cell-derived gene expression (3,7,120).
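The sex-balanced subsampling design described above can be sketched as follows. The sample identifiers, the sex labels and the 35/34 split between males and females are hypothetical placeholders; only the balanced design and the 20 replicates come from the text.

```python
import random


def balanced_subsample(sample_ids, sex_by_sample, per_group, rng):
    """Draw per_group males and per_group females without replacement."""
    males = [s for s in sample_ids if sex_by_sample[s] == "M"]
    females = [s for s in sample_ids if sex_by_sample[s] == "F"]
    return rng.sample(males, per_group) + rng.sample(females, per_group)


# Hypothetical cohort of 69 individuals with made-up identifiers and sex labels.
sample_ids = ["ind%02d" % i for i in range(69)]
sex_by_sample = {s: ("M" if i < 35 else "F") for i, s in enumerate(sample_ids)}

rng = random.Random(0)
replicates = {
    n: [balanced_subsample(sample_ids, sex_by_sample, n // 2, rng) for _ in range(20)]
    for n in (6, 10, 14)
}
```

Each method would then be run on every replicate and its performance averaged across the 20 replicates for each sample size.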
DISCUSSION
Here, we present an effective Poisson mixed effects model, together with a computationally efficient inference method and software implementation in MACAU, for identifying DE genes in RNAseq studies. MACAU directly models count data and, using two random effects terms, controls for both independent over-dispersion and sample non-independence. Because of its flexible modeling framework, MACAU controls for type I error in the presence of individual relatedness, population structure and hidden confounders, and MACAU achieves higher power than several other methods for DE analysis across a range of settings. In addition, MACAU can easily accommodate continuous predictor variables and biological or technical covariates. We have demonstrated the benefits of MACAU using both simulations and applications to three recently published RNAseq datasets.
MACAU is particularly well-suited to datasets that contain related individuals or population structure. Several major population genomic resources contain structure of these kinds. For example, the HapMap population (118), the Human Genome Diversity Panel (121), the 1000 Genomes Project in humans (122) as well as the 1001 Genomes Project in Arabidopsis (123) all contain data from multiple populations or related individuals. Several recent large-scale RNAseq projects also collected individuals from genetically differentiated populations (45). MACAU is also well-suited to analyzing genes with moderate to high heritability. Previous studies in humans have shown that, while heritability varies across genes, many genes are moderately or highly heritable, and almost all genes have detectable eQTL (46,124). Analyzing these data with MACAU can reduce false positives and increase power. Notably, even when genes exhibit zero heritability, our results show that MACAU incurs minimal loss of power compared with other approaches.
While we have mainly focused on illustrating the benefits of MACAU for controlling for individual relatedness and population stratification, we note that MACAU can be used to control for sample non-independence that occurs in other settings, as we have demonstrated with the third real data application. For example, cell type heterogeneity (54) or other hidden confounding factors (52) are commonly observed in sequencing studies and can induce gene expression similarity even when individuals are unrelated (48–51). Because the gene expression covariance matrix contains information on sample non-independence caused by hidden confounding factors (52–56), MACAU could be applied to control for hidden confounding effects by using the gene expression covariance as the relatedness matrix. Therefore, MACAU provides a natural avenue for extending the commonly used mixed effects model for controlling for hidden confounding effects (52–55) in array-based studies to sequencing studies. In addition, although we have designed MACAU for differential expression analysis, we note that MACAU may also be effective in other common settings. For example, MACAU could be readily applied in QTL mapping studies to identify genetic variants that are associated with gene expression levels estimated using RNAseq or related high-throughput sequencing methods.
In the present study, we have focused on demonstrating the performance of MACAU in three published RNAseq datasets with sample sizes ranging from small (n = 6) to medium (n = 63) to large (n = 267), relative to the size of most current RNAseq studies. Compared with small sample studies, RNAseq studies with medium or large sample sizes are better powered and more reproducible and are thus becoming increasingly common in genomics (10,11). For example, a recent comparative study makes explicit calls for medium to large sample RNAseq studies performed with at least 12 replicates per condition (i.e. n ≥ 24) (11). However, we recognize that many RNAseq studies are still carried out with a small number of samples (e.g. 3 replicates per condition). As our simulations make clear, the power of all analysis methods is dramatically reduced with decreasing sample size, conditional on fixed values of other factors that influence power (e.g., effect size, gene expression heritability). Thus, MACAU's advantage is no longer obvious in simulated data with only three replicates per condition when the effect size is also small (although its advantage becomes apparent when the simulated effect size increases: Supplementary Figures S13B and 14). In addition, MACAU's advantage is much smaller and sometimes negligible in the small real dataset when compared with the medium and large datasets analyzed here. Furthermore, because MACAU requires estimating one more parameter than other existing methods, MACAU requires at least five samples to run while existing DE methods require at least four. Therefore, MACAU may not confer benefits to power in some settings, and is especially likely (like all methods) to be underpowered in very small sample sizes with small effect sizes. Future extensions of MACAU are likely needed to ensure its robust performance in small as well as moderate to large samples. 
For example, further power improvements could be achieved by borrowing information across genes to estimate the over-dispersion parameter (15,22,25) or building in a hierarchical structure to model many genes at once.
Like other DE methods (24,25), MACAU requires data pre-processing to obtain gene expression measurements from raw sequencing read files. This data pre-processing step may include read alignment, transcript assembly, alternative transcript quantification, transcript measurement and normalization. Many methods are available to perform these tasks (12,14,68,125–130) and different methods can be differentially advantageous across settings (68,125,131). Importantly, MACAU can be paired with any pre-processing method that retains the count nature of the data. While we provide a preliminary comparison of several methods here (see ‘Materials and Methods’ section; Supplementary Figure S3), a full analysis of how different data pre-processing choices affect MACAU's performance in alternative study designs is beyond the scope of this paper. Notably, recent results suggest that a recommended approach is to incorporate data pre-processing and DE analysis into the same, joint statistical framework (132), which represents an important next step for the MACAU software package.
We note that, like many other DE methods (15,25), we did not model gene length in MACAU. Because gene length does not change from sample to sample, it does not affect differential expression analysis on any particular gene (15,25). However, gene length will affect the power of DE analysis across different genes: longer genes tend to have a larger number of mapped reads and more accurate expression measurements, and as a consequence, DE analysis on these genes tends to have higher statistical power (2,70,133). Gene length may also introduce sample-specific effects in certain datasets (134). Therefore, understanding the impact of gene length, and taking gene length effects into account in MACAU DE analysis, represents another possible future extension.
Currently, despite the newly developed computationally efficient algorithm, applications of MACAU can still be limited by its relatively heavy computational cost. The MCMC algorithm in MACAU scales quadratically with the number of individuals/samples and linearly with the number of genes. Although MACAU is two orders of magnitude faster than the standard software MCMCglmm for fitting Poisson mixed effects models (Supplementary Table S1), it can still take close to 20 h to analyze a dataset of the size of the FUSION data we considered here (267 individuals and 21 753 genes). Therefore, new algorithms will be needed to use MACAU for datasets that are orders of magnitude larger.
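The stated scaling (quadratic in the number of samples, linear in the number of genes) allows a rough back-of-envelope extrapolation from the FUSION run time quoted above. The sketch below is only a proportional estimate under that assumption; it ignores hardware, the number of MCMC iterations and constant overheads, and the function name is hypothetical.

```python
def estimated_hours(n_samples, n_genes, ref_samples=267, ref_genes=21753, ref_hours=20.0):
    """Scale a reference run time by (n / n_ref)^2 * (genes / genes_ref)."""
    return ref_hours * (n_samples / ref_samples) ** 2 * (n_genes / ref_genes)


# Doubling the sample size at a fixed gene count quadruples the estimate.
doubled = estimated_hours(534, 21753)
```

Under this rough model, a tenfold increase in sample size would imply about a hundredfold increase in run time, which illustrates why new algorithms would be needed for much larger datasets.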
URLs
The software implementation of MACAU is freely available at: www.xzlab.org/software.html.
ACKNOWLEDGEMENTS
We thank Matthew Stephens for insight and support on previous versions of MACAU. We thank Baylor College of Medicine Human Genome Sequencing Center for access to the current version of the baboon genome assembly (Panu 2.0). We thank FUSION investigators for access to the FUSION expression data.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online.
FUNDING
National Institutes of Health [R01HG009124 to X.Z.; R01GM102562 to J.T.; R21AG049936 to J.T.; U01DK062370 to L.S.]; China Scholarship Council (to S.S.). Funding for open access charge: University of Michigan.
Conflict of interest statement. None declared.
REFERENCES
- 1. Nagalakshmi U., Wang Z., Waern K., Shou C., Raha D., Gerstein M., Snyder M.. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008; 320:1344–1349. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2. Mortazavi A., Williams B.A., Mccue K., Schaeffer L., Wold B.. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods. 2008; 5:621–628. [DOI] [PubMed] [Google Scholar]
- 3. Pickrell J.K., Marioni J.C., Pai A.A., Degner J.F., Engelhardt B.E., Nkadori E., Veyrieras J.B., Stephens M., Gilad Y., Pritchard J.K.. Understanding mechanisms underlying human gene expression variation with RNA sequencing. Nature. 2010; 464:768–772. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Wang Z., Gerstein M., Snyder M.. RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009; 10:57–63. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 5. Oshlack A., Robinson M.D., Young M.D.. From RNA-seq reads to differential expression results. Genome Biol. 2010; 11:220. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Ozsolak F., Milos P.M.. RNA sequencing: advances, challenges and opportunities. Nat. Rev. Genet. 2011; 12:87–98. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Tung J., Zhou X., Alberts S.C., Stephens M., Gilad Y.. The genetic architecture of gene expression levels in wild baboons. Elife. 2015; 4:e04729. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8. Bennett B.J., Farber C.R., Orozco L., Kang H.M., Ghazalpour A., Siemers N., Neubauer M., Neuhaus I., Yordanova R., Guan B. et al. A high-resolution association mapping panel for the dissection of complex traits in mice. Genome Res. 2010; 20:281–290.
- 9. Zhou X., Stephens M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 2012; 44:821–824.
- 10. Wood A.R., Esko T., Yang J., Vedantam S., Pers T.H., Gustafsson S., Chu A.Y., Estrada K., Luan J., Kutalik Z. et al. Defining the role of common variation in the genomic and biological architecture of adult human height. Nat. Genet. 2014; 46:1173–1186.
- 11. Schurch N.J., Schofield P., Gierlinski M., Cole C., Sherstnev A., Singh V., Wrobel N., Gharbi K., Simpson G.G., Owen-Hughes T. et al. How many biological replicates are needed in an RNA-seq experiment and which differential expression tool should you use? RNA. 2016; 22:839–851.
- 12. Li B., Ruotti V., Stewart R.M., Thomson J.A., Dewey C.N. RNA-Seq gene expression estimation with read mapping uncertainty. Bioinformatics. 2010; 26:493–500.
- 13. Hu M., Zhu Y., Taylor J.M.G., Liu J.S., Qin Z.H.S. Using Poisson mixed-effects model to quantify transcript-level gene expression in RNA-Seq. Bioinformatics. 2012; 28:63–68.
- 14. Li B., Dewey C.N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics. 2011; 12:323.
- 15. Anders S., Huber W. Differential expression analysis for sequence count data. Genome Biol. 2010; 11:R106.
- 16. Trapnell C., Roberts A., Goff L., Pertea G., Kim D., Kelley D.R., Pimentel H., Salzberg S.L., Rinn J.L., Pachter L. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 2012; 7:562–578.
- 17. Li J., Jiang H., Wong W.H. Modeling non-uniformity in short-read rates in RNA-Seq data. Genome Biol. 2010; 11:R50.
- 18. Zhou Y.H., Xia K., Wright F.A. A powerful and flexible approach to the analysis of RNA sequence count data. Bioinformatics. 2011; 27:2672–2678.
- 19. Marioni J.C., Mason C.E., Mane S.M., Stephens M., Gilad Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 2008; 18:1509–1517.
- 20. Wang L.K., Feng Z.X., Wang X., Wang X.W., Zhang X.G. DEGseq: an R package for identifying differentially expressed genes from RNA-seq data. Bioinformatics. 2010; 26:136–138.
- 21. Langmead B., Hansen K.D., Leek J.T. Cloud-scale RNA-sequencing differential expression analysis with Myrna. Genome Biol. 2010; 11:R83.
- 22. Li J., Witten D.M., Johnstone I.M., Tibshirani R. Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Biostatistics. 2012; 13:523–538.
- 23. Auer P.L., Doerge R.W. A two-stage Poisson model for testing RNA-Seq data. Stat. Appl. Genet. Mol. 2011; 10:1–26.
- 24. Love M.I., Huber W., Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014; 15:550.
- 25. Robinson M.D., McCarthy D.J., Smyth G.K. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010; 26:139–140.
- 26. McCarthy D.J., Chen Y.S., Smyth G.K. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012; 40:4288–4297.
- 27. Di Y.M., Schafer D.W., Cumbie J.S., Chang J.H. The NBP negative binomial model for assessing differential gene expression from RNA-Seq. Stat. Appl. Genet. Mol. 2011; 10:1–28.
- 28. Wu H., Wang C., Wu Z. A new shrinkage estimator for dispersion improves differential expression detection in RNA-seq data. Biostatistics. 2013; 14:232–243.
- 29. Van De Wiel M.A., Leday G.G.R., Pardo L., Rue H., Van Der Vaart A.W., Van Wieringen W.N. Bayesian analysis of RNA sequencing data by estimating multiple shrinkage priors. Biostatistics. 2013; 14:113–128.
- 30. Hardcastle T.J., Kelly K.A. baySeq: Empirical Bayesian methods for identifying differential expression in sequence count data. BMC Bioinformatics. 2010; 11:422.
- 31. Li J., Tibshirani R. Finding consistent patterns: a nonparametric approach for identifying differential expression in RNA-Seq data. Stat. Methods Med. Res. 2013; 22:519–536.
- 32. Tarazona S., Garcia-Alcalde F., Dopazo J., Ferrer A., Conesa A. Differential expression in RNA-seq: A matter of depth. Genome Res. 2011; 21:2213–2223.
- 33. Law C.W., Chen Y.S., Shi W., Smyth G.K. voom: precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 2014; 15:R29.
- 34. Zwiener I., Frisch B., Binder H. Transforming RNA-Seq data to improve the performance of prognostic gene signatures. PLoS One. 2014; 9:e85150.
- 35. Soneson C., Delorenzi M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinformatics. 2013; 14:91.
- 36. Kvam V.M., Lu P., Si Y.Q. A comparison of statistical methods for detecting differentially expressed genes from RNA-Seq data. Am. J. Bot. 2012; 99:248–256.
- 37. Zhang Z.H., Jhaveri D.J., Marshall V.M., Bauer D.C., Edson J., Narayanan R.K., Robinson G.J., Lundberg A.E., Bartlett P.F., Wray N.R. et al. A comparative study of techniques for differential expression analysis on RNA-Seq data. PLoS One. 2014; 9:e103207.
- 38. McCullagh P., Nelder J.A. Generalized Linear Models. 1989; London: Chapman and Hall/CRC.
- 39. Robinson M.D., Oshlack A. A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biol. 2010; 11:R25.
- 40. Price A.L., Helgason A., Thorleifsson G., McCarroll S.A., Kong A., Stefansson K. Single-tissue and cross-tissue heritability of gene expression via identity-by-descent in related or unrelated individuals. PLoS Genet. 2011; 7:e1001317.
- 41. Wright F.A., Sullivan P.F., Brooks A.I., Zou F., Sun W., Xia K., Madar V., Jansen R., Chung W.I., Zhou Y.H. et al. Heritability and genomics of gene expression in peripheral blood. Nat. Genet. 2014; 46:430–437.
- 42. Monks S.A., Leonardson A., Zhu H., Cundiff P., Pietrusiak P., Edwards S., Phillips J.W., Sachs A., Schadt E.E. Genetic inheritance of gene expression in human cell lines. Am. J. Hum. Genet. 2004; 75:1094–1105.
- 43. Emilsson V., Thorleifsson G., Zhang B., Leonardson A.S., Zink F., Zhu J., Carlson S., Helgason A., Walters G.B., Gunnarsdottir S. et al. Genetics of gene expression and its effect on disease. Nature. 2008; 452:423–428.
- 44. Yang S.J., Liu Y.Y., Jiang N., Chen J., Leach L., Luo Z.W., Wang M.H. Genome-wide eQTLs and heritability for gene expression traits in unrelated individuals. BMC Genomics. 2014; 15:13.
- 45. Lappalainen T., Sammeth M., Friedlander M.R., 't Hoen P.A.C., Monlong J., Rivas M.A., Gonzalez-Porta M., Kurbatova N., Griebel T., Ferreira P.G. et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013; 501:506–511.
- 46. Ardlie K.G., DeLuca D.S., Segre A.V., Sullivan T.J., Young T.R., Gelfand E.T., Trowbridge C.A., Maller J.B., Tukiainen T., Lek M. et al. The genotype-tissue expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science. 2015; 348:648–660.
- 47. Battle A., Mostafavi S., Zhu X.W., Potash J.B., Weissman M.M., McCormick C., Haudenschild C.D., Beckman K.B., Shi J.X., Mei R. et al. Characterizing the genetic basis of transcriptome diversity through RNA-sequencing of 922 individuals. Genome Res. 2014; 24:14–24.
- 48. Stegle O., Parts L., Piipari M., Winn J., Durbin R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nat. Protoc. 2012; 7:500–507.
- 49. Leek J.T. Svaseq: removing batch effects and other unwanted noise from sequencing data. Nucleic Acids Res. 2014; 42:e161.
- 50. Leek J.T., Storey J.D. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007; 3:1724–1735.
- 51. Risso D., Ngai J., Speed T.P., Dudoit S. Normalization of RNA-seq data using factor analysis of control genes or samples. Nat. Biotechnol. 2014; 32:896–902.
- 52. Kang H.M., Ye C., Eskin E. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics. 2008; 180:1909–1925.
- 53. Listgarten J., Kadie C., Schadt E.E., Heckerman D. Correction for hidden confounders in the genetic analysis of gene expression. Proc. Natl. Acad. Sci. U.S.A. 2010; 107:16465–16470.
- 54. Zou J., Lippert C., Heckerman D., Aryee M., Listgarten J. Epigenome-wide association studies without the need for cell-type composition. Nat. Methods. 2014; 11:309–311.
- 55. Rahmani E., Zaitlen N., Baran Y., Eng C., Hu D.L., Galanter J., Oh S., Burchard E.G., Eskin E., Zou J. et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat. Methods. 2016; 13:443–445.
- 56. McGregor K., Bernatsky S., Colmegna I., Hudson M., Pastinen T., Labbe A., Greenwood C.M.T. An evaluation of methods correcting for cell-type heterogeneity in DNA methylation studies. Genome Biol. 2016; 17:84.
- 57. Price A.L., Patterson N.J., Plenge R.M., Weinblatt M.E., Shadick N.A., Reich D. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 2006; 38:904–909.
- 58. Zhou X., Stephens M. Efficient multivariate linear mixed model algorithms for genome-wide association studies. Nat. Methods. 2014; 11:407–409.
- 59. Lea A.J., Alberts S.C., Tung J., Zhou X. A flexible, efficient binomial mixed model for identifying differential DNA methylation in bisulfite sequencing data. PLoS Genet. 2015; 11:e1005650.
- 60. Scott L.J., Erdos M.R., Huyghe J.R., Welch R.P., Beck A.T., Boehnke M., Collins F.S., Parker S.C.J. The genetic regulatory signature of type 2 diabetes in human skeletal muscle. Nat. Commun. 2016; 7:11764.
- 61. Fruhwirth-Schnatter S., Wagner H. Auxiliary mixture sampling for parameter-driven models of time series of counts with applications to state space modelling. Biometrika. 2006; 93:827–841.
- 62. Scott S.L. Data augmentation, frequentist estimation, and the Bayesian analysis of multinomial logit models. Stat. Pap. 2011; 52:87–109.
- 63. Fruhwirth-Schnatter S., Fruhwirth R. Data Augmentation and MCMC for Binary and Multinomial Logit Models. Statistical Modelling and Regression Structures. 2010; NY: Springer.
- 64. Lippert C., Listgarten J., Liu Y., Kadie C.M., Davidson R.I., Heckerman D. FaST linear mixed models for genome-wide association studies. Nat. Methods. 2011; 8:833–835.
- 65. Zhou X., Carbonetto P., Stephens M. Polygenic modeling with Bayesian sparse linear mixed models. PLoS Genet. 2013; 9:e1003264.
- 66. Venables W.N., Ripley B.D. Modern Applied Statistics with S. 2002; NY: Springer.
- 67. Robin X., Turck N., Hainard A., Tiberti N., Lisacek F., Sanchez J.C., Muller M. pROC: an open-source package for R and S plus to analyze and compare ROC curves. BMC Bioinformatics. 2011; 12:77.
- 68. Teng M., Love M.I., Davis C.A., Djebali S., Dobin A., Graveley B.R., Li S., Mason C.E., Olson S., Pervouchine D. et al. A benchmark for RNA-seq quantification pipelines. Genome Biol. 2016; 17:74.
- 69. Ritchie M.E., Phipson B., Wu D., Hu Y.F., Law C.W., Shi W., Smyth G.K. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015; 43:e47.
- 70. Bullard J.H., Purdom E., Hansen K.D., Dudoit S. Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments. BMC Bioinformatics. 2010; 11:94.
- 71. Alberts S., Altmann J. In: Kappeler P.M., Watts D.P. (eds). Long-Term Field Studies of Primates. 2012; Berlin Heidelberg: Springer; 261–287.
- 72. Alberts S.C., Buchan J.C., Altmann J. Sexual selection in wild baboons: from mating opportunities to paternity success. Anim. Behav. 2006; 72:1177–1196.
- 73. Buchan J.C., Alberts S.C., Silk J.B., Altmann J. True paternal care in a multi-male primate society. Nature. 2003; 425:179–181.
- 74. Altmann J., Altmann S., Hausfater G. Physical maturation and age estimates of yellow baboons, Papio-Cynocephalus, in Amboseli National-Park, Kenya. Am. J. Primatol. 1981; 1:389–399.
- 75. Archie E.A., Tung J., Clark M., Altmann J., Alberts S.C. Social affiliation matters: both same-sex and opposite-sex relationships predict survival in wild female baboons. Proc. R. Soc. B. 2014; 281:20141261.
- 76. Valle T., Ehnholm C., Tuomilehto J., Blaschak J., Bergman R.N., Langefeld C.D., Ghosh S., Watanabe R.M., Hauser E.R., Magnuson V. et al. Mapping genes for NIDDM: design of the Finland United States Investigation of NIDDM Genetics (FUSION) study. Diabetes Care. 1998; 21:949–958.
- 77. Vaatainen S., Keinanen-Kiukaanniemi S., Saramies J., Uusitalo H., Tuomilehto J., Martikainen J. Quality of life along the diabetes continuum: a cross-sectional view of health-related quality of life and general health status in middle-aged and older Finns. Qual. Life Res. 2014; 23:1935–1944.
- 78. Li H., Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009; 25:1754–1760.
- 79. Churchill G.A., Doerge R.W. Naive application of permutation testing leads to inflated type I error rates. Genetics. 2008; 178:609–610.
- 80. Abney M. Permutation testing in the presence of polygenic variation. Genet. Epidemiol. 2015; 39:249–258.
- 81. Zhou X., Lindsay H., Robinson M.D. Robustly detecting differential expression in RNA sequencing data using observation weights. Nucleic Acids Res. 2014; 42:e91.
- 82. George N.I., Bowyer J.F., Crabtree N.M., Chang C.W. An iterative leave-one-out approach to outlier detection in RNA-seq data. PLoS One. 2015; 10:e0125224.
- 83. Kang H.M., Sul J.H., Service S.K., Zaitlen N.A., Kong S.Y., Freimer N.B., Sabatti C., Eskin E. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 2010; 42:348–354.
- 84. Kang H.M., Zaitlen N.A., Wade C.M., Kirby A., Heckerman D., Daly M.J., Eskin E. Efficient control of population structure in model organism association mapping. Genetics. 2008; 178:1709–1723.
- 85. Tempelman R.J., Gianola D. A mixed effects model for overdispersed count data in animal breeding. Biometrics. 1996; 52:265–279.
- 86. Tempelman R.J. Generalized linear mixed models in dairy cattle breeding. J. Dairy Sci. 1998; 81:1428–1444.
- 87. Pinheiro J.C., Chao E.C. Efficient Laplacian and adaptive Gaussian quadrature algorithms for multilevel generalized linear mixed models. J. Comput. Graph. Stat. 2006; 15:58–81.
- 88. Goldstein H. Nonlinear multilevel models, with an application to discrete response data. Biometrika. 1991; 78:45–51.
- 89. Breslow N.E., Clayton D.G. Approximate Inference in Generalized Linear Mixed Models. J. Am. Stat. Assoc. 1993; 88:9–25.
- 90. Breslow N.E., Lin X.H. Bias correction in generalized linear mixed models with a single-component of dispersion. Biometrika. 1995; 82:81–91.
- 91. Browne W.J., Draper D. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Anal. 2006; 1:473–513.
- 92. Lin X.H., Breslow N.E. Bias correction in generalized linear mixed models with multiple components of dispersion. J. Am. Stat. Assoc. 1996; 91:1007–1016.
- 93. Goldstein H., Rasbash J. Improved approximations for multilevel models with binary responses. J. R. Stat. Soc. A. 1996; 159:505–513.
- 94. Rodriguez G., Goldman N. Improved estimation procedures for multilevel models with binary response: a case-study. J. R. Stat. Soc. A. 2001; 164:339–355.
- 95. Jang W., Lim J. A numerical study of PQL estimation biases in generalized linear mixed models under heterogeneity of random effects. Commun. Stat. 2009; 38:692–702.
- 96. Fong Y.Y., Rue H., Wakefield J. Bayesian inference for generalized linear mixed models. Biostatistics. 2010; 11:397–412.
- 97. Smith A.F.M., Roberts G.O. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Stat. Soc. B. 1993; 55:3–23.
- 98. Gelman A., Shirley K. Inference from simulations and monitoring convergence. Handbook of Markov Chain Monte Carlo. 2011; 163–174.
- 99. Schwartz L. On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete. 1965; 4:10–26.
- 100. Hadfield J.D. MCMC methods for multi-response generalized linear mixed models: The MCMCglmm R package. J. Stat. Softw. 2010; 33:1–22.
- 101. Seyednasrollah F., Laiho A., Elo L.L. Comparison of software packages for detecting differential expression in RNA-seq studies. Brief Bioinform. 2015; 16:59–70.
- 102. Patterson N., Price A.L., Reich D. Population structure and eigenanalysis. PLoS Genet. 2006; 2:2074–2093.
- 103. Yang J., Zaitlen N.A., Goddard M.E., Visscher P.M., Price A.L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 2014; 46:100–106.
- 104. Rapaport F., Khanin R., Liang Y.P., Pirun M., Krek A., Zumbo P., Mason C.E., Socci N.D., Betel D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013; 14:R95.
- 105. Chen H., Wang C., Conomos M.P., Stilp A.M., Li Z., Sofer T., Szpiro A.A., Chen W., Brehm J.M., Celedón J.C. et al. Control for population structure and relatedness for binary traits in genetic association studies via logistic mixed models. Am. J. Hum. Genet. 2016; 98:653–666.
- 106. Zhou X., Cain C.E., Myrthil M., Lewellen N., Michelini K., Davenport E.R., Stephens M., Pritchard J.K., Gilad Y. Epigenetic modifications are associated with inter-species gene expression variation in primates. Genome Biol. 2014; 15:547.
- 107. Vawter M.P., Evans S., Choudary P., Tomita H., Meador-Woodruff J., Molnar M., Li J., Lopez J.F., Myers R., Cox D. et al. Gender-specific gene expression in post-mortem human brain: Localization to sex chromosomes. Neuropsychopharmacol. 2004; 29:373–384.
- 108. Lemos B., Branco A.T., Jiang P.P., Hartl D.L., Meiklejohn C.D. Genome-wide gene expression effects of sex chromosome imprinting in Drosophila. G3. 2014; 4:1–10.
- 109. Kim J.H., Karnovsky A., Mahavisno V., Weymouth T., Pande M., Dolinoy D.C., Rozek L.S., Sartor M.A. LRpath analysis reveals common pathways dysregulated via DNA methylation across cancer types. BMC Genomics. 2012; 13:526.
- 110. Mootha V.K., Lindgren C.M., Eriksson K.-F., Subramanian A., Sihag S., Lehar J., Puigserver P., Carlsson E., Ridderstråle M., Laurila E. et al. PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat. Genet. 2003; 34:267–273.
- 111. Leibowitz G., Cerasi E., Ketzinel-Gilad A. The role of mTOR in the adaptation and failure of beta-cells in type 2 diabetes. Diabetes Obes. Metab. 2008; 10:157–169.
- 112. Ost A., Svensson K., Ruishalme I., Brannmark C., Franck N., Krook H., Sandstrom P., Kjolhede P., Stralfors P. Attenuated mTOR signaling and enhanced autophagy in adipocytes from obese patients with type 2 diabetes. Mol. Med. 2010; 16:235–246.
- 113. Laplante M., Sabatini D.M. mTOR signaling in growth control and disease. Cell. 2012; 149:274–293.
- 114. Zoncu R., Efeyan A., Sabatini D.M. mTOR: from growth signal integration to cancer, diabetes and ageing. Nat. Rev. Mol. Cell Biol. 2011; 12:21–35.
- 115. Matthews D.R., Hosker J.P., Rudenski A.S., Naylor B.A., Treacher D.F., Turner R.C. Homeostasis model assessment: insulin resistance and beta-cell function from fasting plasma-glucose and insulin concentrations in man. Diabetologia. 1985; 28:412–419.
- 116. Lyssenko V., Nagorny C.L.F., Erdos M.R., Wierup N., Jonsson A., Spegel P., Bugliani M., Saxena R., Fex M., Pulizzi N. et al. Common variant in MTNR1B associated with increased risk of type 2 diabetes and impaired early insulin secretion. Nat. Genet. 2009; 41:82–88.
- 117. Dupuis J., Langenberg C., Prokopenko I., Saxena R., Soranzo N., Jackson A.U., Wheeler E., Glazer N.L., Bouatia-Naji N., Gloyn A.L. et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nat. Genet. 2010; 42:105–116.
- 118. Gibbs R.A., Belmont J.W., Hardenbol P., Willis T.D., Yu F.L., Yang H.M., Ch’ang L.Y., Huang W., Liu B., Shen Y. et al. The International HapMap Project. Nature. 2003; 426:789–796.
- 119. Gagnon-Bartsch J.A., Speed T.P. Using control genes to correct for unwanted variation in microarray data. Biostatistics. 2012; 13:539–552.
- 120. Powell J.E., Henders A.K., McRae A.F., Wright M.J., Martin N.G., Dermitzakis E.T., Montgomery G.W., Visscher P.M. Genetic control of gene expression in whole blood and lymphoblastoid cell lines is largely independent. Genome Res. 2012; 22:456–466.
- 121. Cann H.M., de Toma C., Cazes L., Legrand M.F., Morel V., Piouffre L., Bodmer J., Bodmer W.F., Bonne-Tamir B., Cambon-Thomsen A. et al. A human genome diversity cell line panel. Science. 2002; 296:261–262.
- 122. Landi M.T., Wang Y.F., Mckay J.D., Rafnar T., Wang Z.M., Timofeeva M., Broderick P., Stefansson K., Risch A., Chanock S.J. et al. Imputation from the 1000 Genomes Project identifies rare large effect variants of BRCA2-K3326X and CHEK2-I157T as risk factors for lung cancer; a study from the TRICL consortium. Cancer Res. 2014; 74:942–942.
- 123. Weigel D., Mott R. The 1001 genomes project for Arabidopsis thaliana. Genome Biol. 2009; 10:107.
- 124. Lonsdale J., Thomas J., Salvatore M., Phillips R., Lo E., Shad S., Hasz R., Walters G., Garcia F., Young N. et al. The genotype-tissue expression (GTEx) project. Nat. Genet. 2013; 45:580–585.
- 125. Kanitz A., Gypas F., Gruber A.J., Gruber A.R., Martin G., Zavolan M. Comparative assessment of methods for the computational inference of transcript isoform abundance from RNA-seq data. Genome Biol. 2015; 16:150.
- 126. Langmead B., Salzberg S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods. 2012; 9:357–359.
- 127. Trapnell C., Williams B.A., Pertea G., Mortazavi A., Kwan G., van Baren M.J., Salzberg S.L., Wold B.J., Pachter L. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 2010; 28:511–515.
- 128. Dobin A., Davis C.A., Schlesinger F., Drenkow J., Zaleski C., Jha S., Batut P., Chaisson M., Gingeras T.R. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013; 29:15–21.
- 129. Kim D., Pertea G., Trapnell C., Pimentel H., Kelley R., Salzberg S.L. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013; 14:R36.
- 130. Bray N.L., Pimentel H., Melsted P., Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 2016; 34:525–527.
- 131. Conesa A., Madrigal P., Tarazona S., Gomez-Cabrero D., Cervera A., McPherson A., Szczesniak M.W., Gaffney D.J., Elo L.L., Zhang X. et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016; 17:13.
- 132. Trapnell C., Hendrickson D.G., Sauvageau M., Goff L., Rinn J.L., Pachter L. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nat. Biotechnol. 2013; 31:46–53.
- 133. Oshlack A., Wakefield M.J. Transcript length bias in RNA-seq data confounds systems biology. Biol. Direct. 2009; 4:14.
- 134. Hansen K.D., Irizarry R.A., Wu Z.J. Removing technical variability in RNA-seq data using conditional quantile normalization. Biostatistics. 2012; 13:204–216.