Abstract
Heterosis, also known as hybrid vigor, occurs when the mean phenotype of hybrid offspring is superior to that of their two inbred parents. The heterosis phenomenon is extensively utilized in agriculture, although its molecular basis is still unknown. In an effort to understand phenotypic heterosis at the molecular level, researchers have begun to compare expression levels of thousands of genes between parental inbred lines and their hybrid offspring to search for evidence of gene expression heterosis. Standard statistical approaches for separately analyzing expression data for each gene can produce biased and highly variable estimates and unreliable tests of heterosis. To address these shortcomings, we develop a hierarchical model to borrow information across genes. Using our modeling framework, we derive empirical Bayes estimators and an inference strategy to identify gene expression heterosis. Simulation results show that our proposed method outperforms the more traditional strategy used to detect gene expression heterosis. This article has supplementary material online.
Keywords: Empirical Bayes, Gene expression, Heterosis, Hierarchical model, Microarray, Mixture model
1 INTRODUCTION
Heterosis, or hybrid vigor, refers to the enhanced phenotype of hybrid progeny relative to their inbred parents. Taking maize as an example, the offspring from crossing the inbred lines B73 and Mo17 are taller, mature faster, and produce greater yields than their parental lines (Hallauer and Miranda, 1981). Since heterosis was scientifically documented by Darwin (1876), it has been successfully manipulated to improve many species for food, feed, and fuel industries, such as rice (Yu et al., 1997), alfalfa (Riday and Brummer, 2002), tomatoes (Krieger et al., 2010), and fish (Wohlfarth, 1993). Despite the intensive study and successful utilization of heterosis, the basic genomic mechanisms remain unclear (Coors and Pandey, 1999; Lippman and Zamir, 2007). Researchers speculate that gene expression heterosis could be among the mechanisms responsible for the phenotypic heterosis (Swanson-Wagner et al., 2006; Springer and Stupar, 2007).
Due to advancements in high-throughput genomics technology (such as microarray and next-generation sequencing of RNA), it is now possible to simultaneously measure and compare expression levels of thousands of genes in parental lines and their hybrid offspring to search for evidence of gene expression heterosis. It is of particular interest to test if a gene exhibits any of the following three forms of gene expression heterosis: high-parent heterosis (HPH), low-parent heterosis (LPH), or mid-parent heterosis (MPH). A gene is said to exhibit HPH if the mean expression level of the offspring is greater than the maximum of the two parental means, LPH if the mean expression level of the offspring is smaller than the minimum of the two parental means, and MPH if the mean expression level of the offspring is not equal to the average of parental means. Let i index the genotypes of the two parents (i = 1, 2) and the offspring (i = 3). Let j (j = 1, …, J) index the genes, where J denotes the total number of genes under study. We use μij to denote the mean expression level of gene j of genotype i. Let hj = μ3j − max{μ1j, μ2j}, lj = min{μ1j, μ2j}− μ3j, and mj = μ3j − (μ1j + μ2j)/2. With these notations, gene j exhibits HPH, LPH, or MPH if and only if hj > 0, lj > 0, or mj ≠ 0, respectively.
Past work on estimating gene expression heterosis using microarray data (Swanson-Wagner et al., 2006; Wang et al., 2006; Bassene et al., 2010) has used separate estimates for each gene obtained by replacing population means (μij, i = 1, 2, 3, j = 1, ···, J) with corresponding sample averages. These sample average estimators of hj and lj are problematic because they are biased and tend to underestimate hj and lj (see Appendix A). Though the sample average estimator of mj is unbiased, with only a few observations for each gene in a typical microarray experiment, the sample average estimators of mj, hj, and lj may each be highly variable.
Because high-throughput technologies measure expression of hundreds of thousands of genes simultaneously, we can utilize information across genes to improve estimation and testing of gene expression heterosis for each individual gene. For gene j, we define two latent variables αj = (μ1j − μ2j)/2 and δj = μ3j − (μ1j + μ2j)/2. Notice that all three types of gene expression heterosis can be written as functions of |αj| and δj; that is, hj = δj − |αj|, lj = − |αj|− δj, and mj = δj. Thus, modeling of |αj| and δj helps to develop statistical inferences for all three types of gene expression heterosis. We model αj, the half parental difference, as a draw from a mixture of a point-mass-at-0 distribution and a normal distribution. This implies that |αj| is equal to 0 with some probability πα and equal to the absolute value of a draw from a normal distribution with probability 1 − πα. The point-mass distribution in the mixture model represents the case where the parental gene expression levels are equal, while the normal component corresponds to genes whose expression levels differ between the two parental lines. Similarly, we model δj, the difference between the offspring mean and the average of the parental means, with another mixture model that has normal and point-mass-at-0 component distributions. We estimate the parameters for these mixture distributions based on observed data from all genes. Under an empirical Bayes framework, we derive posterior distributions of αj and δj and draw inferences about gene expression heterosis from estimates of these posteriors.
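The identities hj = δj − |αj| and lj = −|αj| − δj stated above follow from writing the maximum and minimum of two numbers in terms of their average and half their absolute difference. A short derivation, using only the definitions already given, can be sketched as:

```latex
\begin{align*}
\max\{\mu_{1j},\mu_{2j}\} &= \tfrac{1}{2}(\mu_{1j}+\mu_{2j}) + \tfrac{1}{2}\lvert\mu_{1j}-\mu_{2j}\rvert
  = \tfrac{1}{2}(\mu_{1j}+\mu_{2j}) + \lvert\alpha_j\rvert, \\
\min\{\mu_{1j},\mu_{2j}\} &= \tfrac{1}{2}(\mu_{1j}+\mu_{2j}) - \lvert\alpha_j\rvert, \\
h_j &= \mu_{3j} - \max\{\mu_{1j},\mu_{2j}\}
  = \bigl[\mu_{3j} - \tfrac{1}{2}(\mu_{1j}+\mu_{2j})\bigr] - \lvert\alpha_j\rvert
  = \delta_j - \lvert\alpha_j\rvert, \\
l_j &= \min\{\mu_{1j},\mu_{2j}\} - \mu_{3j}
  = -\lvert\alpha_j\rvert - \bigl[\mu_{3j} - \tfrac{1}{2}(\mu_{1j}+\mu_{2j})\bigr]
  = -\lvert\alpha_j\rvert - \delta_j .
\end{align*}
```

In particular, hj + lj = −2|αj| ≤ 0, so a gene cannot exhibit HPH and LPH at the same time.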
We compare the empirical Bayes method with the sample average method through simulation studies where datasets were generated based on real heterosis microarray experiments or hypothetical probability models. Simulation studies show that the empirical Bayes estimators of hj, lj, and mj have smaller mean square errors (MSEs) than the sample average estimators that have been used previously. Furthermore, the empirical Bayes estimators of hj and lj are less biased than the sample average estimators, and the inferences we draw using our empirical Bayes approach are superior to traditional approaches for detecting all forms of heterosis.
The remainder of the paper proceeds as follows. Section 2 presents the proposed hierarchical model in full detail. Section 3 derives the empirical Bayes estimators and inference strategy based on the framework constructed in section 2. Section 4 summarizes analysis results of two real experiments. Section 5 presents results of several simulation studies. Section 6 summarizes our work. R code and C code for the analysis of real experiments in section 4, the simulation studies in section 5, and the implementation of all our algorithms are available upon request.
2 HIERARCHICAL GENE EXPRESSION HETEROSIS MODEL
Let yijk denote the normalized log-scale gene expression measurement for genotype i, gene j, and biological replicate k, where k = 1, ···, ni, and ni is the total number of replicates for genotype i. As is common in microarray data analysis, we assume the dataset for gene j (yijk, i = 1, 2, 3, k = 1, ···, ni) consists of independent observations, and that $y_{ijk} \sim N(\mu_{ij}, \sigma_j^2)$. The sample average method estimates hj, lj, and mj by ĥj = ȳ3j· − max{ȳ1j·, ȳ2j·}, l̂j = min{ȳ1j·, ȳ2j·} − ȳ3j·, and m̂j = ȳ3j· − (ȳ1j· + ȳ2j·)/2, where $\bar{y}_{ij\cdot} = n_i^{-1} \sum_{k=1}^{n_i} y_{ijk}$. Furthermore, $\sigma_j^2$ is estimated by $s_j^2 = \sum_{i=1}^{3} \sum_{k=1}^{n_i} (y_{ijk} - \bar{y}_{ij\cdot})^2 / (n_1 + n_2 + n_3 - 3)$.
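As an illustration of the sample average method just described, the following minimal Python sketch computes ĥj, l̂j, and m̂j for a single gene from its replicate measurements. The function name, the toy numbers, and the use of the pooled within-genotype residual variance as the estimate of the error variance are assumptions made for this example only.

```python
from statistics import mean


def sample_average_estimates(y1, y2, y3):
    """Sample average estimates for one gene.

    y1, y2: replicate log-scale measurements for the two inbred parents.
    y3:     replicate log-scale measurements for the hybrid offspring.
    Returns (h_hat, l_hat, m_hat, s2), where s2 is the pooled residual
    variance (an assumed estimator of the gene-specific error variance).
    """
    m1, m2, m3 = mean(y1), mean(y2), mean(y3)
    h_hat = m3 - max(m1, m2)        # high-parent heterosis
    l_hat = min(m1, m2) - m3        # low-parent heterosis
    m_hat = m3 - (m1 + m2) / 2      # mid-parent heterosis
    groups = ((y1, m1), (y2, m2), (y3, m3))
    rss = sum((y - m) ** 2 for ys, m in groups for y in ys)
    df = len(y1) + len(y2) + len(y3) - 3
    return h_hat, l_hat, m_hat, rss / df


# Toy (made-up) data: three replicates per genotype.
h_hat, l_hat, m_hat, s2 = sample_average_estimates(
    [1.0, 1.2, 0.8], [2.0, 2.1, 1.9], [3.0, 2.8, 3.2]
)
print(h_hat, l_hat, m_hat, s2)
```

Applying the function gene by gene reproduces the "separate analysis for each gene" strategy that the hierarchical model is meant to improve upon.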
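In the previous section, we defined αj = (μ1j − μ2j)/2 and δj = μ3j − (μ1j + μ2j)/2. In order to share information across genes to improve estimation of gene expression heterosis, we propose the following models (1) – (3) for αj, δj, and the error variance $\sigma_j^2$. Suppose
(1) $\alpha_j \sim \pi_\alpha I_{\{0\}} + (1 - \pi_\alpha)\, N(\mu_\alpha, \sigma_\alpha^2)$,
(2) $\delta_j \sim \pi_\delta I_{\{0\}} + (1 - \pi_\delta)\, N(\mu_\delta, \sigma_\delta^2)$,
(3) $\sigma_j^2 \sim d_0 s_0^2 / \chi^2_{d_0}$,
where $I_{\{0\}}$ denotes a point mass at 0, and all αj’s, δj’s, and $\sigma_j^2$’s are mutually independent.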
The scaled inverse χ2 model for the error variances given in (3) follows Smyth (2004). The mixture model for αj in (1) covers both the case where the parental means are equal and the case where they differ. The hyperparameter πα specifies the proportion of genes that are equally expressed between the two parents. Similarly, the mixture model for δj in (2) covers the cases where mean gene expression in the offspring equals or differs from the average of the two parental means. When necessary, the model (1)–(3) may be modified to better capture the features of a given dataset. For example, the mixture model could include more than one normal distribution component for αj or δj. Although all subsequent derivations are for the model specified in (1)–(3), it is straightforward to modify our proposed approach to handle more complex models.
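To make the generative structure concrete, the sketch below simulates one gene from a hierarchical model of the kind described in this section. It is an illustration under stated assumptions, not the authors' code: it assumes each latent variable is 0 with its point-mass probability and otherwise normal with its own mean and standard deviation, that the error variance equals d0·s0² divided by a χ² draw with d0 degrees of freedom, and that the three genotype means are built around an arbitrary baseline (`mu_base`, a hypothetical nuisance level that does not affect heterosis). All names and numeric values are hypothetical.

```python
import random


def simulate_gene(rng, pi_alpha, mu_alpha, sd_alpha,
                  pi_delta, mu_delta, sd_delta,
                  d0, s0_sq, n=(3, 3, 3), mu_base=0.0):
    """Simulate latent variables and replicate data for one gene."""
    # Point mass at 0 with probability pi, otherwise a normal draw.
    alpha = 0.0 if rng.random() < pi_alpha else rng.gauss(mu_alpha, sd_alpha)
    delta = 0.0 if rng.random() < pi_delta else rng.gauss(mu_delta, sd_delta)
    # Scaled inverse chi-square: a chi-square with d0 degrees of freedom
    # is a gamma variate with shape d0/2 and scale 2.
    sigma_sq = d0 * s0_sq / rng.gammavariate(d0 / 2.0, 2.0)
    # Genotype means chosen so that (mu1 - mu2)/2 = alpha and
    # mu3 - (mu1 + mu2)/2 = delta.
    means = (mu_base + alpha, mu_base - alpha, mu_base + delta)
    sd = sigma_sq ** 0.5
    y = [[rng.gauss(m, sd) for _ in range(k)] for m, k in zip(means, n)]
    return alpha, delta, sigma_sq, y


rng = random.Random(2024)
alpha, delta, sigma_sq, y = simulate_gene(
    rng, pi_alpha=0.8, mu_alpha=0.0, sd_alpha=0.3,
    pi_delta=0.6, mu_delta=0.0, sd_delta=0.5,
    d0=2.8, s0_sq=0.03,
)
print(alpha, delta, sigma_sq)
```

Repeating the call for many genes yields a dataset whose histograms of α̂j and δ̂j can be compared with fitted marginal densities.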
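With no loss of information about expression heterosis, the data can be summarized by the sufficient statistics α̂j ≡ (ȳ1j· − ȳ2j·)/2, δ̂j ≡ ȳ3j· − (ȳ1j· + ȳ2j·)/2, and the pooled variance estimate $s_j^2$. Clearly, α̂j and δ̂j are the natural sample average estimators of αj and δj, respectively. Based on the normality assumption for yijk, the conditional distributions of α̂j, δ̂j, and $s_j^2$, given αj, δj, and $\sigma_j^2$, are
(4) $\hat{\alpha}_j \mid \alpha_j, \sigma_j^2 \sim N\!\left(\alpha_j,\ \tfrac{1}{4}\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)\sigma_j^2\right)$,
(5) $\hat{\delta}_j \mid \delta_j, \sigma_j^2 \sim N\!\left(\delta_j,\ \left[\tfrac{1}{n_3} + \tfrac{1}{4}\left(\tfrac{1}{n_1} + \tfrac{1}{n_2}\right)\right]\sigma_j^2\right)$,
(6) $s_j^2 \mid \sigma_j^2 \sim \sigma_j^2\, \chi^2_{d} / d$, where $d = n_1 + n_2 + n_3 - 3$ is the residual degrees of freedom.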
By combining (1), (3), and (4), it follows that the marginal distribution of α̂j is a two-component mixture distribution, where each component density is itself an infinite mixture of normal distributions with common mean but varying variance. This marginal distribution is determined by the hyperparameters πα, μα, $\sigma_\alpha^2$, d0, and $s_0^2$. Similarly, the marginal distribution of δ̂j has an analogous form and is determined by the hyperparameters πδ, μδ, $\sigma_\delta^2$, d0, and $s_0^2$.
Figures 1(a) and 1(b) present histograms of empirical marginal distributions and scatter-plots for α̂j and δ̂j from an alfalfa experiment and a maize experiment, respectively. Each of these datasets is discussed in more detail in section 4, but we introduce the plots here to provide some empirical support for the model described in this section. Using methods discussed in Appendix C, we obtain estimates of our model hyperparameters, and the hyperparameter estimates determine fitted marginal densities which are plotted on top of the histograms as red lines. The fitted marginal distributions adequately capture the shape of the empirical distributions. Furthermore, the lack of correlation between α̂j and δ̂j in the scatterplots supports our model assumption of independence between αj and δj. Thus, for both datasets, the model presented in section 2 appears to be consistent with the main features of the data illustrated in these plots.
Figure 1.
Scatterplots of α̂j vs. δ̂j and histograms of empirical marginal distributions of α̂j and δ̂j (j = 1, ···, J) based on two real heterosis experiments. The relative sizes of αj’s and δj’s partition the two-dimensional space virtually into subsets based on the mean expression levels of two inbred parents and their hybrid offspring as shown by dashed lines. Fitted curves represent estimated marginal densities based on the assumed model described in section 2. (a) Alfalfa dataset. B2, B5 and F1 denote the genotypes of the two parental inbred lines and the hybrid offspring, respectively. (b) Maize dataset. B73, Mo17 and F1 denote the genotypes of the two parental inbred lines and the hybrid offspring, respectively.
3 EMPIRICAL BAYES ESTIMATION AND TESTING OF GENE EXPRESSION HETEROSIS
Obtaining estimates of our model hyperparameters is the first step in our empirical Bayes approach. We use the method of Smyth (2004) to estimate d0 and $s_0^2$. We estimate other hyperparameters by a combined approach of the moment method and the marginal maximum likelihood method using data from all genes. The details of our proposed approach are provided in Appendix C. Because thousands of genes in one experiment are used to obtain the estimates of the hyperparameters, we claim that adopting the usual empirical Bayes strategy (i.e., treating these unknown hyperparameters as known and equal to their estimates) does not seriously affect the performance of the inferential procedures we describe in this section. This claim is supported by simulation studies presented in sections 4 and 5.
Once estimates of the hyperparameters have been obtained, our goal is to draw inferences regarding expression heterosis for individual genes. Based on (1) – (6), an expression for the joint posterior distribution of (αj, δj) given α̂j, δ̂j, and $s_j^2$ is derived and illustrated in Appendix B. Sampling from the joint posterior distribution of (αj, δj) allows us to approximate the posterior distributions of hj, lj, and mj via the relationships hj = δj − |αj|, lj = −|αj|− δj, and mj = δj. Based on the form of the posterior of (αj, δj), one common method for sampling αj and δj is a Markov chain Monte Carlo (MCMC) method, such as the Metropolis-Hastings algorithm. We have developed and implemented such a Metropolis-Hastings algorithm as illustrated in the online supplement. A good approximation of the posterior distributions of hj, lj, and mj requires a large number of draws from the joint posterior distribution of (αj, δj) for each gene j. By using the Metropolis-Hastings algorithm, an analysis of simulated data for only 1,000 genes took around 5 hours to complete (see more details in the online supplement). Although parallelism and/or more sophisticated sampling algorithms could help to reduce the computing time, the large number of genes in a typical transcript profiling experiment motivates us to find a faster alternative.
To substantially reduce the computing requirement and maintain good approximations of the posterior distributions of hj, lj, and mj, we derive in Appendix B an approximation to the joint posterior distribution of (αj, δj) given by
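(7a) $p(\alpha_j, \delta_j \mid \hat{\alpha}_j, \hat{\delta}_j, s_j^2) \approx P_{1j}\, I(\alpha_j = 0)\, I(\delta_j = 0)$
(7b) $\quad +\ P_{2j}\, \phi(\alpha_j \mid \tilde{\mu}_{\alpha j}, \tilde{\sigma}_{\alpha j}^2)\, I(\delta_j = 0)$
(7c) $\quad +\ P_{3j}\, I(\alpha_j = 0)\, \phi(\delta_j \mid \tilde{\mu}_{\delta j}, \tilde{\sigma}_{\delta j}^2)$
(7d) $\quad +\ P_{4j}\, \phi(\alpha_j \mid \tilde{\mu}_{\alpha j}, \tilde{\sigma}_{\alpha j}^2)\, \phi(\delta_j \mid \tilde{\mu}_{\delta j}, \tilde{\sigma}_{\delta j}^2)$,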
where ϕ(x|μ, σ2) denotes the normal density with mean μ and variance σ2 evaluated at x,
(8a) |
(8b) |
(8c) |
(8d) |
(8e) |
and the probabilities P1j, P2j, P3j, and P4j sum to 1 and are defined in Appendix B. The approximation to the joint posterior distribution of αj and δj in (7) is a mixture of four joint distributions, where both αj and δj are from point-mass-at-0 as in (7a); δj is from point-mass-at-0 and αj is from a normal distribution as in (7b); αj is from point-mass-at-0 and δj is from a normal distribution as in (7c); and both αj and δj are from normal distributions as in (7d). The approximate posterior mixture distribution combines information from prior models and empirical observations. For example, μ̃αj can be expressed as a weighted average of μα (the prior mean of αj given αj ≠ 0) and α̂j (an estimate of αj based on sample means), where the weight on μα is proportional to the prior precision of αj given , and the weight on α̂j is proportional to an estimate of the conditional precision of α̂j given . Similarly, is the inverse of the average of the precisions and .
The approximation of the joint posterior distribution in (7) allows us to substantially reduce the computing requirement because we no longer need to go through a large number of MCMC iterations, but can instead directly sample from either a point-mass-at-0 distribution or a normal distribution. In addition, this leads to accurate approximations of the posterior distributions of hj, lj, and mj, as demonstrated by simulation studies in section 5 and in the online supplement.
Given the fully specified approximate posteriors of αj and δj and plugging in estimated hyperparameters, it is straightforward to approximate posterior distributions of hj, lj, and mj by simulation. We propose to use the estimated posterior expectations of hj, lj, and mj as their respective point estimators. Tests of HPH, LPH, and MPH for each gene j are based on the estimated posterior probabilities p̃hj, p̃lj, and p̃mj that hj > 0, lj > 0, and mj ≠ 0, respectively. For any cutoff c ∈ (0, 1), we declare that gene j exhibits HPH, LPH, or MPH if and only if p̃hj ≥ c, p̃lj ≥ c, or p̃mj ≥ c, respectively.
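The simulation step described in this section can be sketched in a few lines of Python. The sketch assumes that, for one gene, the four mixture probabilities and the means and standard deviations of the two normal components of the approximate posterior are already available; the function names and all numeric inputs below are hypothetical placeholders, since the formulas that produce them are given in (8) and Appendix B.

```python
import random


def sample_posterior(rng, probs, mu_a, sd_a, mu_d, sd_d, n_draws=10000):
    """Draw (alpha, delta) pairs from a four-component mixture.

    probs = (P1, P2, P3, P4) for the components:
      0: alpha = 0,      delta = 0
      1: alpha ~ normal, delta = 0
      2: alpha = 0,      delta ~ normal
      3: alpha ~ normal, delta ~ normal
    """
    draws = []
    for _ in range(n_draws):
        comp = rng.choices(range(4), weights=probs)[0]
        alpha = rng.gauss(mu_a, sd_a) if comp in (1, 3) else 0.0
        delta = rng.gauss(mu_d, sd_d) if comp in (2, 3) else 0.0
        draws.append((alpha, delta))
    return draws


def summarize(draws, k=0.0):
    """Posterior means and tail probabilities for h, l, and m."""
    n = len(draws)
    h = [d - abs(a) for a, d in draws]    # h = delta - |alpha|
    l = [-abs(a) - d for a, d in draws]   # l = -|alpha| - delta
    m = [d for _, d in draws]             # m = delta
    return {
        "h_mean": sum(h) / n,
        "l_mean": sum(l) / n,
        "m_mean": sum(m) / n,
        "p_h": sum(x > k for x in h) / n,     # P(h > k | data)
        "p_l": sum(x > k for x in l) / n,     # P(l > k | data)
        "p_m": sum(x != 0.0 for x in m) / n,  # P(m != 0 | data)
    }


rng = random.Random(7)
draws = sample_posterior(rng, probs=(0.1, 0.2, 0.3, 0.4),
                         mu_a=0.2, sd_a=0.1, mu_d=0.5, sd_d=0.2)
print(summarize(draws))
```

Passing a positive `k` to `summarize` gives the estimated posterior probability that heterosis exceeds a chosen threshold rather than merely 0.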
We also use the estimated posterior probabilities to estimate false discovery rates (FDRs) for any family of tests that involves one test per gene. The number of positives, R(c), is the number of genes declared to exhibit a type of gene expression heterosis given the cutoff c. Taking HPH as an example, $R(c) = \sum_{j=1}^{J} 1(\tilde{p}_{hj} \ge c)$. The number of false positives, V(c), is estimated as $\hat{V}(c) = \sum_{j=1}^{J} (1 - \tilde{p}_{hj})\, 1(\tilde{p}_{hj} \ge c)$, and the estimated FDR for HPH based on estimated posterior probabilities is $\widehat{\mathrm{FDR}}(c) = \hat{V}(c)/R(c)$ given cutoff c. Calculations of estimated FDRs for testing LPH and MPH are similar.
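A common way to turn posterior probabilities into an FDR estimate, and the one assumed in this sketch, is to count the genes whose posterior probability meets the cutoff and to estimate the number of false positives among them by summing one minus their posterior probabilities. The function name and the toy probabilities below are hypothetical.

```python
def estimated_fdr(post_probs, c):
    """Estimate the FDR for the rule 'declare heterosis if p >= c'.

    post_probs: estimated posterior probabilities of heterosis, one per gene.
    Returns (R, V_hat, FDR_hat), where R is the number of declared genes,
    V_hat the estimated number of false positives, and FDR_hat = V_hat / R
    (taken to be 0 when no gene is declared).
    """
    selected = [p for p in post_probs if p >= c]
    r = len(selected)
    if r == 0:
        return 0, 0.0, 0.0
    v_hat = sum(1.0 - p for p in selected)
    return r, v_hat, v_hat / r


# Toy (made-up) posterior probabilities for five genes.
toy_probs = [0.99, 0.95, 0.90, 0.50, 0.10]
for cutoff in (0.9, 0.5):
    print(cutoff, estimated_fdr(toy_probs, cutoff))
```

Scanning a grid of cutoffs and keeping the smallest one whose estimated FDR stays at or below a target such as 0.05 gives a list of genes at that approximate FDR level.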
4 EXAMPLE DATA ANALYSIS
4.1 Analysis of an Alfalfa Dataset
We used our method to analyze an alfalfa dataset on gene expression in parental lines B2 and B5 and the hybrid genotype (B2×B5). The data are available in the Gene Expression Omnibus (GEO) database (Barrett et al., 2011) with series number GSE25034. Each genotype had 3 biological replicates measured with Affymetrix Medicago Genome Array (Platform GPL4652). The robust multi-array average (RMA) method (Irizarry et al., 2003) was used to obtain normalized expression measures for each probeset on the array. Non-alfalfa probesets associated with the bacterial genome Sinorhizobium meliloti, along with all other probesets called absent by Affymetrix microarray suite version 5 software in all samples were filtered from the dataset (McClintick and Edenberg, 2006) to leave 31,865 probesets for analysis. The hyperparameters estimated from our proposed method are summarized in row 1 of Table 1.
Table 1.
Estimated hyperparameters (obtained by using the methods described in Appendix C) and empirical estimates of bias and MSE of our hyperparameter estimators based on analysis of 1,000 datasets simulated with hyperparameters estimated from the alfalfa and maize datasets as the true hyperparameter values.
Parameters | πα | μα | $\sigma_\alpha^2$ | πδ | μδ | $\sigma_\delta^2$ | d0 | $s_0^2$ |
---|---|---|---|---|---|---|---|---|
Alfalfa Exp | 0.870 | 0.011 | 0.087 | 0.405 | −0.020 | 0.232 | 2.52 | 0.035 |
Bias | −5.33e-2 | −2.92e-3 | −1.12e-2 | −2.72e-2 | 6.77e-4 | −3.53e-3 | 1.60e-3 | 9.36e-6 |
MSE | 2.85e-3 | 2.54e-5 | 1.31e-4 | 7.64e-4 | 1.34e-5 | 2.38e-5 | 6.11e-4 | 6.34e-8 |
Maize Exp | 0.331 | 0.002 | 0.022 | 0.647 | −0.008 | 0.046 | 2.34 | 0.030 |
Bias | 2.85e-3 | −1.31e-5 | 1.19e-4 | 1.48e-3 | 6.10e-5 | 4.50e-4 | −1.20e-3 | 4.14e-7 |
MSE | 6.45e-5 | 2.90e-6 | 1.99e-7 | 5.73e-5 | 1.31e-5 | 2.12e-6 | 9.20e-4 | 7.67e-8 |
A simulation study was conducted to assess the estimation of hyperparameters. We used the estimated hyperparameter values in Table 1 as the true parameter values to simulate data for 31,865 genes based on the hierarchical model described in sections 2 and 3. Then, we re-estimated the hyperparameters using the simulated data. We repeated this procedure 1,000 times. The estimated bias and MSE in Table 1 for each hyperparameter estimator based on these 1,000 replications show that our hyperparameter estimators are reasonably accurate and precise.
For any gene j, we sample hj, lj, and mj by simulating αj and δj from the approximate joint posterior distribution (7). As an example, the contour plot of 10,000 random draws of α20 and δ20 from the approximate joint posterior distribution of gene “AFFX-Msa-ubq11-3_at” (gene number 20) is plotted in Figure 2. This gene has been reported to be one of the polyubiquitin genes involved in directing protein recycling and related functions (Geer et al., 2010). Based on these draws, , which gives strong evidence of HPH for this gene. As described in section 3, we can also use the estimated posterior distributions of αj and δj to test for any given type of heterosis while controlling FDR at a specified level. For example, we color-coded points in Figure 2(a) of the online supplement to highlight genes significant at approximate FDR level 0.05 when testing for HPH (red), LPH (blue), or MPH (red, blue, or green), respectively.
Figure 2.
Example estimated posterior distribution for a gene exhibiting significant evidence of HPH (gene “AFFX-Msa-ubq11-3_at” in the alfalfa dataset).
We also used a traditional approach based on a separate analysis for each gene to analyze the alfalfa dataset. Sample average estimates and ordinary t-tests were used to identify significant evidence of heterosis. Taking HPH as an example, if ȳ1j· ≥ ȳ2j·, then ĥj = ȳ3j· − ȳ1j·, and the t statistic for the one-sided ordinary t-test is . Similarly, we tested for LPH using a one-sided ordinary t-test, and we tested for MPH using a two-sided ordinary t-test of mj = 0. Given the p-values from the ordinary t-tests, we controlled FDR for the sample average method using the q-value method described by Storey and Tibshirani (2003).
The numbers of genes exhibiting significant evidence of the three types of gene expression heterosis when controlling FDR at approximately 0.05 by the sample average method and the empirical Bayes method, respectively, are in Table 2. Our empirical Bayes method identifies far more significant genes than the sample average approach.
Table 2.
Number of genes declared to exhibit gene expression heterosis by the sample average method and the empirical Bayes method.
Datasets | Heterosis | Sample Average | Empirical Bayes |
---|---|---|---|
Alfalfa Dataset | HPH | 2475 | 3529 |
Alfalfa Dataset | LPH | 2121 | 4077 |
Alfalfa Dataset | MPH | 4813 | 8046 |
Maize Dataset | HPH | 55 | 390 |
Maize Dataset | LPH | 197 | 595 |
Maize Dataset | MPH | 1181 | 1447 |
4.2 Analysis of a Maize Dataset
Swanson-Wagner et al. (2009) compared gene expression of maize inbred lines B73 and Mo17 and their hybrid offspring. They studied a total of 13,999 genes in their microarray experiment with 10 biological replicates for each of the three genotypes. The dataset is downloadable in GEO with series number GSE16136.
Log-scale expression measurements were lowess normalized within each slide and median centered. The normalized data were analyzed with our empirical Bayes method, and the estimated hyperparameters are summarized in Table 1 row 4. The simulation described in section 4.1 was repeated for the maize results to estimate the bias and MSE of the hyperparameter estimators. The results are summarized in the last two rows of Table 1.
Based on posterior distributions of αj and δj, we color-coded points in Figure 2(b) of the online supplement to highlight genes significant at approximate FDR level 0.05 when testing for HPH (red), LPH (blue), or MPH (red, blue, or green), respectively. The reported numbers of genes exhibiting each of the three types of gene expression heterosis identified by the sample average method and the empirical Bayes method, respectively, are listed in Table 2 where FDR was controlled at the 0.05 level. Once again, the empirical Bayes method reported more significant genes for all three types of gene expression heterosis than the sample average method.
5 ADDITIONAL SIMULATION STUDIES
5.1 Simulation Study Based on the Alfalfa Experiment
We simulated 100 datasets based on the hierarchical model defined by (1) – (6) using hyperparameters equal to the estimated values from the alfalfa experiment in Table 1. For each dataset, we simulated 31,865 genes (the same number of genes as in the alfalfa experiment) and 3 biological replicates for each genotype.
We used the empirical Bayes method to estimate hj, lj, and mj for all j. For each dataset and each type of heterosis, we ranked the estimation errors from most negative to most positive, then we averaged the estimation errors of the same rank across the 100 datasets. We used the same approach for the sample average method. The box plots of averages of ranked estimation errors are plotted in Figure 3(a) for hj’s, Figure 3(b) for lj’s, and Figure 3(c) for mj’s. These box plots suggest that the empirical Bayes method on average has smaller ranked estimation errors than the sample average method. The box plots also show that the averages of ranked estimation errors by the empirical Bayes method have narrower interquartile ranges than the sample average method for estimating each type of heterosis. Table 3 summarizes the averaged estimation biases and MSEs across all genes in all datasets. The empirical Bayes estimators have smaller biases and MSEs than the sample average estimators for all types of heterosis. Both the plots and statistics show substantial improvement of the empirical Bayes method over the sample average method.
Figure 3.
Plots for the simulation study 5.1 based on the alfalfa data. Top row: box plots of ranked estimation errors averaged over 100 simulated datasets. Middle row: ROC curves averaged over 100 simulated datasets. Bottom row: estimated FDRs based on posterior probabilities versus true FDRs. Left column: HPH. Middle column: LPH. Right column: MPH.
Table 3.
Comparison of the average bias and MSE of the sample average estimators and the empirical Bayes estimators.
Simulations | Variables | Bias ×10⁴ (Sample Average) | Bias ×10⁴ (Empirical Bayes) | MSE ×10³ (Sample Average) | MSE ×10³ (Empirical Bayes) |
---|---|---|---|---|---|
Alfalfa Dataset | hj | −830 | −2.76 | 111 | 31.6 |
Alfalfa Dataset | lj | −827 | 1.18 | 109 | 31.7 |
Alfalfa Dataset | mj | −2.02 | −1.97 | 83.1 | 28.1 |
Maize Dataset | hj | −252 | 1.44 | 39.5 | 7.10 |
Maize Dataset | lj | −254 | 0.212 | 38.8 | 7.10 |
Maize Dataset | mj | 0.697 | 0.616 | 30.9 | 4.89 |
Probability Models | hj | −596 | 47.2 | 55.0 | 20.8 |
Probability Models | lj | −598 | 44.5 | 55.6 | 20.8 |
Probability Models | mj | 0.945 | 1.36 | 41.5 | 15.8 |
For each dataset, we computed the true positive rate (TPr) given a set of fixed levels of false positive rate (FPr) for testing each type of gene expression heterosis by the sample average method and the empirical Bayes method, respectively. Then, we averaged the TPrs across 100 datasets for each given level of FPr for each of the two methods. The resulting average receiver operating characteristic (ROC) curves are plotted in Figures 3(d)–3(f) for testing HPH, LPH, and MPH, respectively. We only plotted over the range of FPr between 0 and 0.05 because FPr>0.05 is rarely of interest in practice. The ROC curves demonstrate that our proposed tests identify more true positives than the sample average method given any fixed level of FPr for testing each type of gene expression heterosis.
By the empirical Bayes method, we estimated the FDRs for testing each type of gene expression heterosis as described in section 3. Then, for each level of estimated FDR, the true FDRs were calculated by averaging the proportions of false positives among the declared heterosis genes across 100 datasets for each type of gene expression heterosis. We plotted the estimated FDRs against the true FDRs in Figures 3(g)–3(i) for testing HPH, LPH, and MPH, respectively. The plots show results for the range of estimated FDR from 0 to 0.25 because only the region of small FDRs is relevant in practice. All three curves show that the estimated FDRs based on posterior probabilities are very close to the true levels, which demonstrates that the proposed method controls FDR as desired.
All results presented above and throughout the paper are based on the approximate joint posterior density in (7). We compared this fast approximate method with sampling from the posterior distribution via the Metropolis-Hastings algorithm. Comparison results are discussed in the online supplement. In summary, we found that while the estimated posterior probabilities of exhibiting HPH, LPH, and MPH are very similar for both methods, our approximate method is more than 1,000 times faster than the Metropolis-Hastings approach.
5.2 Simulation Study Based on the Maize Experiment
The estimated hyperparameters of the maize experiment were used as the true parameter values to simulate 100 microarray datasets, each with 13,999 genes (the number of genes in the maize experiment) and 10 biological replicates for each gene of each genotype.
We analyzed these 100 datasets by the empirical Bayes method and the sample average method. The estimated bias and MSE of the hj, lj, and mj estimators averaged across all genes in all datasets are summarized in Table 3. Table 3 shows that the empirical Bayes estimators are more accurate and more precise than the sample average estimators for all types of heterosis. Figure 3 of the online supplement provides box plots, ROC curves, and FDR plots for the maize simulation results that are very similar to those displayed in Figure 3 for the alfalfa simulation in section 5.1.
5.3 Simulation Study Based on Probability Models
To further assess the performance of the proposed empirical Bayes method, we simulated data using distributions different from those proposed in (1) and (2). Specifically, we simulated αj’s from a mixture distribution with a point-mass-at-0 and a t distribution with a small number of degrees of freedom (2) and a non-centrality parameter (ncp) 0.01. Independently from αj’s, we simulated δj’s from a mixture model with a point-mass-at-0 and two normal distributions N(−0.05, 0.2) and N(0, 0.2). We simulated data for 100 microarray datasets, where each dataset contains 5,000 genes with 3 biological replicates for each of three genotypes. Based on the estimated hyperparameters for the alfalfa experiment and the maize experiment, we set πα=0.8, πδ=0.6, and simulated from a scaled inverse χ2 distribution with parameters d0=2.8 and .
Though the data were not simulated from the proposed model, our empirical Bayes estimators, compared to the sample average estimators, have substantially smaller average bias and MSE for hj and lj as shown in Table 3. Although the averaged estimated bias for mj is slightly bigger than that of the sample average method, the averaged estimated MSE is reduced by the empirical Bayes method. Figure 4 of the online supplement provides box plots, ROC curves, and FDR plots (analogous to those in Figure 3 of section 5.1) that show the empirical Bayes method improves upon the sample average method even though the data-generating model differs from the assumptions in (1) and (2).
6 DISCUSSION
Gene expression heterosis is speculated to be one possible explanation for phenotypic heterosis of traits like plant height or grain yield. One natural strategy for estimation (called the sample average method in this paper) is to simply use the sample means to replace the population means when estimating the three types of gene expression heterosis. Because there are often few observations for each gene in a microarray experiment, such estimates have high standard errors. In addition, the sample average estimators for high-parent heterosis and low-parent heterosis are also biased estimators. Furthermore, the natural t-based testing strategies that accompany the sample average method yield low detection power for all forms of gene expression heterosis.
A shrinkage method based on the sample average estimators can improve inferences on gene expression heterosis by sharing information across genes. We developed hierarchical models by placing a mixture prior model on each of two latent variables. Using an empirical Bayes method, the sample average estimates of gene expression heterosis were adjusted and shrunk towards prior means estimated from the data. The extent of shrinkage was also estimated empirically based on data. Through simulation studies based on real datasets and different probability models, we demonstrated that our empirical Bayes estimators have substantially smaller bias and MSE than the sample average estimators, and the inferences for all three types of gene expression heterosis based on the posterior probabilities also yield higher TPrs given any level of FPr than the ordinary t-tests based on the sample average estimates. We also showed that using posterior probabilities of exhibiting any type of gene expression heterosis to estimate FDR yields accurate estimates of the actual FDR. Thus, the methods we have developed provide researchers with substantially improved statistical tools for studying gene expression heterosis.
The results presented in section 4 focus on identifying individual genes that show significant evidence of expression heterosis of various types. Rather than attempting to identify individual genes, our approach can also be used to estimate global values like the proportion of all genes that exhibit a given type of heterosis. For example, the proportion of maize genes exhibiting HPH is estimated by the average posterior probability of HPH, . This estimated proportion includes genes where expression in the hybrid is only slightly higher than the maximum parental expression. In some cases, scientists prefer to concentrate on large changes in expression. With our empirical Bayes approach, it is straightforward to estimate the posterior probability of hj > k for any constant k. For example, with k = log(1.5), the average posterior probability of hj > k in the maize data is 0.0006. This indicates that genes with hybrid expression (on the original scale) more than 1.5 times that of the high parent are relatively rare.
Our work has focused on the use of gene expression measurements that can be modeled, at least approximately, by linear models with normally distributed errors. This is a standard modeling approach for microarray data. While there are thousands of existing microarray datasets and more are generated nearly every day, next-generation sequencing of RNA (RNA-Seq) is an increasingly popular technology for obtaining gene expression measurements. At the present state of the technology, RNA-Seq data are perhaps best treated as counts and modeled with generalized linear models involving overdispersed Poisson or negative binomial distributions (see, for example, Anders and Huber, 2010; Robinson et al., 2010; Lund et al., 2012; McCarthy et al., 2012). We believe the hierarchical modeling ideas we have proposed in the linear model framework are also likely to be very useful in a generalized linear model framework for the study of gene expression heterosis using RNA-Seq data. Developing the details of such an extension is the subject of our ongoing and future research.
Supplementary Material
Acknowledgments
Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM109458. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
APPENDIX A: BIAS OF THE SAMPLE AVERAGE ESTIMATORS OF HIGH AND LOW PARENT HETEROSIS
Based on the definitions in section 2, the sample average estimator of hj = δj − |αj| is ĥj = δ̂j − |α̂j|. Although α̂j and δ̂j are unbiased estimators of αj and δj, respectively, |α̂j| is not an unbiased estimator of |αj|. By Jensen's inequality, E(|α̂j|) > |E(α̂j)| = |αj|, where the inequality is strict because the normally distributed α̂j takes both positive and negative values with positive probability.
Thus, E(ĥj) = E(δ̂j − |α̂j|) < δj − |αj| = hj. Likewise, E(l̂j) = E(−|α̂j| − δ̂j) < −|αj| − δj = lj. The sample average estimators of hj and lj are therefore both biased and, on average, underestimate high-parent and low-parent heterosis, respectively.
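The size of this bias can be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original derivation: it assumes α̂j ~ N(αj, sd²) and uses the standard formula for the mean of a folded normal distribution to compute E|α̂j|, then compares the implied bias of ĥj with a small simulation. All function names and parameter values are hypothetical.

```python
import math
import random

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mean_abs_normal(mu, sd):
    """E|X| for X ~ N(mu, sd^2), i.e. the mean of the folded normal."""
    return (sd * math.sqrt(2.0 / math.pi) * math.exp(-mu ** 2 / (2.0 * sd ** 2))
            + mu * (1.0 - 2.0 * normal_cdf(-mu / sd)))

def bias_of_h_hat(alpha, sd_alpha):
    """Bias of h_hat = delta_hat - |alpha_hat| when delta_hat is unbiased:
    E(h_hat) - h = |alpha| - E|alpha_hat|, which is negative."""
    return abs(alpha) - mean_abs_normal(alpha, sd_alpha)

def simulated_bias(alpha, delta, sd_alpha, sd_delta, n_sim=200_000, seed=1):
    """Monte Carlo estimate of E(h_hat) - h under normal sampling errors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        a_hat = rng.gauss(alpha, sd_alpha)
        d_hat = rng.gauss(delta, sd_delta)
        total += d_hat - abs(a_hat)
    return total / n_sim - (delta - abs(alpha))

# Hypothetical parameter values, chosen only for illustration.
analytic = bias_of_h_hat(alpha=0.3, sd_alpha=0.5)
simulated = simulated_bias(alpha=0.3, delta=1.0, sd_alpha=0.5, sd_delta=0.5)
```

The bias is largest in magnitude when αj = 0 and shrinks toward zero as |αj| grows relative to the standard error of α̂j, which is why genes with similar parental expression are the most affected.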
APPENDIX B: DERIVATION AND APPROXIMATION OF THE JOINT POSTERIOR DISTRIBUTION OF αj AND δj
Let p(·) denote a generic probability density function. We have
(9) |
by the conditional independence of α̂j, δ̂j, and given αj, δj, ; the independence of αj, δj, and ; the independence of α̂j and δj; the independence of δ̂j and αj; and the independence of from αj and δj.
It can be shown that
(10) |
where
Similarly,
(11) |
where
Substituting (10) and (11) into (9) and noting that yields
(12a) |
(12b) |
(12c) |
(12d) |
To obtain reliable statistical inferences about αj and δj, as well as about hj, lj, and mj, we need to draw a sufficiently large sample from the posterior distribution proportional to (12) for each gene j. One approach is to use the Metropolis-Hastings algorithm (see the online supplement). However, because the Metropolis-Hastings algorithm is inefficient and (12) has a complex structure, obtaining a sufficiently large sample for each of the tens of thousands of genes in a typical microarray experiment requires extensive computing power. Methods such as parallel computing could reduce the computing time, but the total amount of required computing power would remain substantial.
Here, we propose a novel method to approximate the joint posterior density, which dramatically decreases the required computing power and, at the same time, maintains accurate estimation of the posterior distribution. Specifically, we define as the inverse of the posterior mean of given as in (8a). We use in place of in the conditional distributions of αj and δj; that is, we replace with in , and to obtain μ̃αj, , μ̃δj, and given in (8b)–(8e). This simple replacement of by in the above four terms leads to the form of (7). We then approximate the posterior of αj and δj by (7) where Pkj = Ckj/(C1j + C2j + C3j + C4j) (k = 1, ···, 4) with
(13a) |
(13b) |
(13c) |
and
(13d) |
After this simplification, we no longer need to draw samples from the joint posterior distribution using an iterative algorithm, such as the Metropolis-Hastings method. Instead, we can sample directly from a point-mass-at-0 distribution or a normal distribution as shown in (7). Although we still need to estimate constants C1j, C2j, C3j, and C4j by simulation, the required computations are straightforward and efficient. Thus, the required computing power is dramatically reduced. The online supplement contains a comparison of results for sampling via Metropolis-Hastings and the approximation (7).
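Direct sampling from a mixture of this kind can be sketched as follows. This is only an illustration under explicit assumptions: the constants C1j, ..., C4j and the conditional posterior means and variances are taken as given inputs, the ordering of the four components (which of αj and δj is set to 0 in each) is assumed rather than taken from (7), and the same normal parameters are reused across components. The function name and the numbers are hypothetical.

```python
import random

def sample_alpha_delta(C, mu_alpha, var_alpha, mu_delta, var_delta, n, seed=0):
    """Draw n pairs (alpha, delta) from a four-component mixture.

    C: four nonnegative constants; P_k = C_k / (C_1 + C_2 + C_3 + C_4).
    ASSUMED component order (not confirmed by the source):
      1: alpha = 0, delta = 0        2: alpha = 0, delta normal
      3: alpha normal, delta = 0     4: alpha normal, delta normal
    """
    rng = random.Random(seed)
    total = sum(C)
    P = [c / total for c in C]
    nonzero = [(False, False), (False, True), (True, False), (True, True)]
    draws = []
    for _ in range(n):
        k = rng.choices(range(4), weights=P)[0]   # pick a mixture component
        alpha_on, delta_on = nonzero[k]
        # Point mass at 0 or a normal draw, depending on the component.
        alpha = rng.gauss(mu_alpha, var_alpha ** 0.5) if alpha_on else 0.0
        delta = rng.gauss(mu_delta, var_delta ** 0.5) if delta_on else 0.0
        draws.append((alpha, delta))
    return P, draws

# Hypothetical inputs for one gene, for illustration only.
P, draws = sample_alpha_delta(C=[2.0, 1.0, 1.0, 4.0],
                              mu_alpha=0.3, var_alpha=0.01,
                              mu_delta=0.8, var_delta=0.04, n=20000)

# Posterior probability of high-parent heterosis, h = delta - |alpha| > 0.
prob_hph = sum(1 for a, d in draws if d - abs(a) > 0) / len(draws)
```

Because every draw is independent, no burn-in or convergence diagnostics are needed, in contrast to an iterative sampler.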
APPENDIX C: ESTIMATION OF HYPERPARAMETERS
The hyperparameters to be estimated are πα, μα, , πδ, μδ, , d0, and . As noted in section 3, we use the method of Smyth (2004) to estimate d0 and . In all subsequent calculations, we replace the unknown values of d0 and with their estimates. To estimate the remaining hyperparameters, we initially suppose that are fixed, known constants. Then, based on the proposed model in section 2, we have
(14) |
By equating the first and the second distribution moments with the corresponding sample moments of (α̂j |πα, μα, ), we have
(15) |
Based on (15), μα and can be written as functions of πα as follows.
(16) |
Plugging (16) in (14) and replacing with , we can approximate the distribution of (α̂j |πα, μα, ) as a function with only one unknown parameter πα. We then estimate πα by maximizing the resulting approximate joint likelihood of the α̂j’s for all genes with constraint πα ∈ (0, 1). The estimates of μα and are computed by replacing πα with its estimate and replacing with in (16). A completely analogous procedure is used to estimate μδ, , and πδ.
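The two-stage idea (moment equations that express two hyperparameters as functions of the mixing probability, followed by a one-dimensional likelihood maximization) can be sketched as below. The sketch rests on assumptions that are not confirmed by the text above: it takes α̂j to follow pi0 · N(0, vj) + (1 − pi0) · N(mu, tau2 + vj), where pi0 is the probability of the point mass at 0, vj is the known sampling variance of α̂j, and tau2 is the variance of the nonzero component. A simple grid search stands in for a numerical optimizer. All names and simulation settings are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def moment_constrained(pi0, m1, m2, v_bar):
    """Solve the two moment equations for (mu, tau2) at a given pi0:
    m1 = (1 - pi0) * mu,  m2 = v_bar + (1 - pi0) * (mu^2 + tau2)."""
    mu = m1 / (1.0 - pi0)
    tau2 = (m2 - v_bar) / (1.0 - pi0) - mu ** 2
    return mu, tau2

def profile_loglik(pi0, a_hat, v, m1, m2, v_bar):
    """Log-likelihood in pi0 alone, with (mu, tau2) tied to pi0 by moments."""
    mu, tau2 = moment_constrained(pi0, m1, m2, v_bar)
    if tau2 <= 0.0:
        return -math.inf          # moment equations infeasible at this pi0
    total = 0.0
    for a, vj in zip(a_hat, v):
        mix = pi0 * normal_pdf(a, 0.0, vj) + (1.0 - pi0) * normal_pdf(a, mu, tau2 + vj)
        total += math.log(max(mix, 1e-300))
    return total

def estimate_hyperparameters(a_hat, v, grid_size=99):
    """Grid search over pi0 in (0, 1); returns (pi0_hat, mu_hat, tau2_hat)."""
    n = len(a_hat)
    m1 = sum(a_hat) / n
    m2 = sum(a * a for a in a_hat) / n
    v_bar = sum(v) / n
    grid = [(i + 1) / (grid_size + 1) for i in range(grid_size)]
    best = max(grid, key=lambda p: profile_loglik(p, a_hat, v, m1, m2, v_bar))
    if profile_loglik(best, a_hat, v, m1, m2, v_bar) == -math.inf:
        raise ValueError("moment equations have no feasible solution on the grid")
    mu, tau2 = moment_constrained(best, m1, m2, v_bar)
    return best, mu, tau2

# Simulated data with hypothetical true values, for illustration only.
rng = random.Random(2024)
true_pi0, true_mu, true_tau2, v_known = 0.7, 1.0, 0.25, 0.04
a_hat = []
for _ in range(4000):
    alpha = 0.0 if rng.random() < true_pi0 else rng.gauss(true_mu, math.sqrt(true_tau2))
    a_hat.append(rng.gauss(alpha, math.sqrt(v_known)))

pi0_hat, mu_hat, tau2_hat = estimate_hyperparameters(a_hat, [v_known] * len(a_hat))
```

Tying two hyperparameters to the mixing probability through the moment equations reduces a three-parameter optimization to a search over a single bounded interval, which is what makes the procedure cheap enough to run on all genes at once.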
Footnotes
Evaluation of the approximation of the joint posterior distribution and additional figures.
Contributor Information
Tieming Ji, Email: jit@missouri.edu, Department of Statistics, University of Missouri at Columbia, Columbia, MO 65211, USA.
Peng Liu, Email: pliu@iastate.edu, Department of Statistics, Iowa State University, Ames, IA 50011, USA.
Dan Nettleton, Email: dnett@iastate.edu, Department of Statistics, Iowa State University, Ames, IA 50011, USA.
References
- Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biology. 2010;11:R106. doi: 10.1186/gb-2010-11-10-r106.
- Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Muertter RN, Holko M, Ayanbule O, Yefanov A, Soboleva A. NCBI GEO: archive for functional genomics data sets – 10 years on. Nucleic Acids Research. 2011;39:D1005–D1010. doi: 10.1093/nar/gkq1184.
- Bassene JB, Froelicher Y, Dubois C, Ferrer RM, Navarro L, Ollitrault P, Ancillo G. Non-additive gene regulation in a citrus allotetraploid somatic hybrid between C. reticulata Blanco and C. limon (L.) Burm. Heredity. 2010;105:299–308. doi: 10.1038/hdy.2009.162.
- Coors JG, Pandey S. The Genetics and Exploitation of Heterosis in Crops. Crop Science Society of America; Madison, WI: 1999.
- Darwin CR. The Effects of Cross and Self Fertilization in the Vegetable Kingdom. Murray; London: 1876.
- Geer LY, Marchler-Bauer A, Geer RC, Han L, He J, He S, Liu C, Shi W, Bryant SH. The NCBI BioSystems database. Nucleic Acids Research. 2010;38:D492–D496. doi: 10.1093/nar/gkp858.
- Hallauer AR, Miranda JB. Quantitative Genetics in Maize Breeding. Iowa State University Press; Ames, IA: 1981.
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003;4:249–264. doi: 10.1093/biostatistics/4.2.249.
- Krieger U, Lippman ZB, Zamir D. The flowering gene SINGLE FLOWER TRUSS drives heterosis for yield in tomato. Nature Genetics. 2010;42:459–463. doi: 10.1038/ng.550.
- Lippman ZB, Zamir D. Heterosis: revisiting the magic. Trends in Genetics. 2007;23:60–66. doi: 10.1016/j.tig.2006.12.006.
- Lund SP, Nettleton D, McCarthy DJ, Smyth GK. Detecting differential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates. Statistical Applications in Genetics and Molecular Biology. 2012;11:Article 8. doi: 10.1515/1544-6115.1826.
- McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-seq experiments with respect to biological variation. Nucleic Acids Research. 2012;40:4288–4297. doi: 10.1093/nar/gks042.
- McClintick JN, Edenberg HJ. Effects of filtering by Present call on analysis of microarray experiments. BMC Bioinformatics. 2006;7:49. doi: 10.1186/1471-2105-7-49.
- Riday H, Brummer EC. Forage yield heterosis in alfalfa. Crop Science. 2002;42:716–723.
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–140. doi: 10.1093/bioinformatics/btp616.
- Smyth G. Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004;3:Article 3. doi: 10.2202/1544-6115.1027.
- Springer NM, Stupar RM. Allelic variation and heterosis in maize: how do two halves make more than a whole? Genome Research. 2007;17:264–275. doi: 10.1101/gr.5347007.
- Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100.
- Swanson-Wagner R, DeCook R, Jia Y, Bancroft T, Ji T, Zhao X, Nettleton D, Schnable PS. Paternal dominance of trans-eQTL influences gene expression patterns in maize hybrids. Science. 2009;326:1118–1120. doi: 10.1126/science.1178294.
- Swanson-Wagner R, Jia Y, DeCook R, Borsuk LA, Nettleton D, Schnable PS. All possible modes of gene action are observed in a global comparison of gene expression in a maize F1 hybrid and its inbred parents. Proceedings of the National Academy of Sciences. 2006;103:6805–6810. doi: 10.1073/pnas.0510430103.
- Wang J, Tian L, Lee H, Wei N, Jiang H, Watson B, Madlung A, Osborn TC, Doerge RW, Comai L, Chen ZJ. Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics. 2006;172:507–517. doi: 10.1534/genetics.105.047894.
- Wohlfarth GW. Heterosis for growth rate in common carp. Aquaculture. 1993;113:31–46.
- Yu SB, Li JX, Xu CG, Tan YF, Gao YJ, Li XH, Zhang Q, Saghai Maroof MA. Importance of epistasis as the genetic basis of heterosis in an elite rice hybrid. Proceedings of the National Academy of Sciences. 1997;94:9226–9231. doi: 10.1073/pnas.94.17.9226.