Abstract
Heterosis is defined as the superiority of a hybrid cross over its two parents. Plant and animal breeders have long been exploiting heterosis, but the causes of this phenomenon are as yet only partly understood. Recently, chip technology has opened up the opportunity to study heterosis at the gene expression level. This article considers the cDNA chip technology, which allows assaying two genotypes simultaneously on the same chip. Heterosis involves the response of at least three genotypes (two parents and their hybrid), so a chip or microarray constitutes an incomplete block, which raises a design problem specific to heterosis studies. The question to be answered is how genotype pairs should be allocated to chips. We address this design problem for two types of heterosis: midparent heterosis and better-parent heterosis. The general picture emerging from our results is that most of the resources should be allocated to parent-hybrid pairs, while chips with parent-parent pairs or hybrid-reciprocal pairs should be used sparingly or not at all.
PROGRESS in plant and animal breeding is often made by exploiting nonadditive gene action. For example, when two maize inbred lines are crossed, the resulting hybrid is frequently found to be superior to the midparent value, i.e., the average of the two parent means (Falconer and Mackay 1996; Lynch and Walsh 1998). This phenomenon is commonly denoted as midparent heterosis or hybrid vigor. Historically, heterosis was first studied at the phenotypic level of agronomically relevant traits such as yield. Several theories have been put forward to explain heterosis (e.g., Stuber et al. 1992), but a consensus has not yet emerged. The advent of chip technologies has now opened up the scope to study heterosis at the gene expression level (Ni et al. 2000; Kollipara et al. 2002; Guo et al. 2003), thus increasing our understanding of the underlying molecular basis of heterosis (Birchler et al. 2003). This article is concerned with the optimal design of gene expression studies aimed at assessing heterosis.
The notion of heterosis may be associated with a linear model as follows. The expected phenotypic values of two parent genotypes A and B and their hybrid AB can be expressed as
$$\mu_A = \varphi + \tau_A, \tag{1}$$
$$\mu_B = \varphi + \tau_B, \tag{2}$$
and
$$\mu_{AB} = \varphi + \tau_{AB}, \tag{3}$$
where φ is a general effect and τ is the genotypic effect. Midparent heterosis may be defined by the linear contrast
$$\delta_{AB} = \tau_{AB} - \tfrac{1}{2}(\tau_A + \tau_B). \tag{4}$$
Midparent heterosis occurs whenever δAB ≠ 0. Often, it matters which inbred line is the male parent. It is then important to also study the reciprocal cross, which we denote as BA. The linear model for this genotype is
$$\mu_{BA} = \varphi + \tau_{BA}. \tag{5}$$
The reciprocal's midparent heterosis is
$$\delta_{BA} = \tau_{BA} - \tfrac{1}{2}(\tau_A + \tau_B). \tag{6}$$
Heterosis of an agronomic trait is economically useful when the hybrid outperforms both parents. This type of heterosis is also known as better-parent heterosis. It occurs for hybrid AB when τAB > τA and τAB > τB, assuming nonnegative coefficients and that an increase in average phenotype is considered advantageous.
Heterosis is thought to be associated with nonadditive gene action or dominance. In fact, dominance may be regarded as midparent heterosis at the gene level. Similarly, overdominance occurs when there is better-parent heterosis at the gene level. If an expression product for a specific gene can be measured for the inbred parents and their hybrids, dominance can be estimated on the basis of (4) and (6). Similarly, overdominance can be assessed on the basis of the contrasts τAB − τA and τAB − τB at the expression level.
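As a minimal illustration of these contrasts, the following sketch computes midparent heterosis and the two better-parent contrasts from genotypic effects. The function names and the numerical values are hypothetical and serve only as an example; in practice the effects would be estimates from a fitted linear model.

```python
def midparent_heterosis(tau_hybrid, tau_a, tau_b):
    """Contrast of the hybrid against the midparent value, as in (4) and (6)."""
    return tau_hybrid - (tau_a + tau_b) / 2.0


def better_parent_contrasts(tau_hybrid, tau_a, tau_b):
    """Differences of the hybrid from each parent: (hybrid - A, hybrid - B)."""
    return tau_hybrid - tau_a, tau_hybrid - tau_b


# Hypothetical log-expression effects for one gene (illustrative values only).
tau_a, tau_b, tau_ab = 8.0, 10.0, 11.0

delta_ab = midparent_heterosis(tau_ab, tau_a, tau_b)
lambda_a, lambda_b = better_parent_contrasts(tau_ab, tau_a, tau_b)

print("midparent heterosis:", delta_ab)
print("better-parent heterosis:", lambda_a > 0 and lambda_b > 0)
```

With these illustrative values the hybrid exceeds both the midparent value and the better parent, so both types of heterosis are present.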
This article is concerned with cDNA chip technology, where each of a large number of genes is represented by a cDNA spot on a glass slide. Expression profiles of two mRNA samples representing two different genotypes are assayed on a slide in parallel. Genotypes are labeled by fluorescent dyes, resulting in a green signal for the one genotype and a red signal for the other genotype. To account for dye effects, it is customary to swap dyes on about half of the chips assigned to the same genotype pair. Statistically, a microarray may be considered as an incomplete block accommodating only two treatments (genotypes) (Kerr and Churchill 2001; Kerr 2003). The design problem is how to allocate different genotype pairs to chips.
Most of the current literature on experimental designs for identifying differentially expressed genes deals with the case where two or more treatments of equal interest are to be compared. Efficient designs in this context are the reference design, the loop design, and balanced block designs (Dobbin and Simon 2002; Kerr 2003; Dobbin et al. 2003a,b; Simon et al. 2003). The objective of heterosis studies differs from those commonly considered in that the treatment contrast of interest involves three treatments, so efficiency regarding all pairwise comparisons is irrelevant. Also, most of the theory of optimal designs revolves around criteria such as A-optimality or E-optimality (John and Williams 1995; Yang et al. 2002), which strive for optimality relative to a broad class of contrasts. In the case of heterosis, such approaches are not optimal, because the class of contrasts of interest is much more limited: there is only one type of contrast. While other designs such as the loop design may provide good heterosis estimates (Gibson et al. 2004), they are not usually optimal (Keller et al. 2005). By analogy, a balanced block design optimal with respect to all pairwise comparisons is not optimal regarding multiple comparison with a control. Generally, it will be more efficient to directly optimize the design with respect to the particular contrast(s) of interest (John and Williams 1995).
This article is concerned with the problem of finding a design by which heterosis or dominance can be estimated with minimal standard error. Specifically, we search for the optimal allocation of a fixed number of chips among all possible genotype pairs. We first consider midparent heterosis and then turn to better-parent heterosis. With both types of heterosis, we study the case of two hybrids as well as that of a single hybrid (no reciprocal tested). The derivations for different cases are organized as follows: first an appropriate linear model is formulated and contrasts of interest are defined in terms of the parameters of that model. Optimality is then defined in terms of the variance of a contrast of interest. Minimization of this criterion leads to the optimal allocation.
MIDPARENT HETEROSIS
Hybrids and reciprocal:
We assume that analysis of normalized gene expression data is done in standard fashion on the basis of a linear model for log measurements. The model accounts for all relevant effects, including dye, chip, and genotype (treatment). For details the reader is referred to Kerr and Churchill (2001), Wolfinger et al. (2001), and Keller et al. (2005).
It is assumed throughout that chip effects are taken as fixed, implying that interchip information is not recovered. This approach corresponds to the usual assumption made when deriving optimal incomplete block designs (John and Williams 1995). Since there are only two genotypes per chip, all information on genotype contrasts is contained in pairwise differences of genotypic expression levels per chip. Clearly, the analysis of differences of log measurements is equivalent to analysis of actual log measurements when chip effects are fixed. We express the model in terms of genotype differences, because this greatly simplifies our study of optimal allocation. In applications, one will usually analyze actual log intensities rather than differences.
Let yji denote the ith observed genotype difference for the jth genotype pair. Specifically, let
y1i = ith observation (chip) on difference A − B (i = 1,…, n1),
y2i = ith observation (chip) on difference A − AB (i = 1,…, n2),
y3i = ith observation (chip) on difference A − BA (i = 1,…, n3),
y4i = ith observation (chip) on difference B − AB (i = 1,…, n4),
y5i = ith observation (chip) on difference B − BA (i = 1,…, n5),
y6i = ith observation (chip) on difference AB − BA (i = 1,…, n6),
where nj is the number of chips used for the jth genotype pair. The differences have the following expected values:
$$\begin{aligned}
E(y_{1i}) &= \tau_A - \tau_B, \\
E(y_{2i}) &= \tau_A - \tau_{AB}, \\
E(y_{3i}) &= \tau_A - \tau_{BA}, \\
E(y_{4i}) &= \tau_B - \tau_{AB}, \\
E(y_{5i}) &= \tau_B - \tau_{BA}, \\
E(y_{6i}) &= \tau_{AB} - \tau_{BA}.
\end{aligned} \tag{7}$$
The total sample size is given by $n = \sum_{j=1}^{6} n_j$. For symmetry reasons, we require the same number n0 of observations for each parent-hybrid pair, i.e., n2 = n3 = n4 = n5 = n0. Thus, the optimal allocation is given by (n0, n1, n6). To ensure identifiability, we set τBA = 0. To account for dye effects, one commonly swaps dyes for half the chips of a genotype pair. The dye swap can be accommodated by extending the linear model with dye effects and dye-by-genotype interactions. To derive a design optimal with respect to contrasts among genotype main effects, it suffices to use model (7) and require that the number of arrays for a particular genotype pair be allocated in equal parts to both possible dye swaps.
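Model (7) may be expressed as
$$E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}, \tag{8}$$
where $\mathbf{y} = (y_{11}, \ldots, y_{6n_6})'$, $\boldsymbol{\beta} = (\tau_A, \tau_B, \tau_{AB})'$, and X is the appropriate design matrix with dummies 1 and −1. The heterosis contrast of BA can be written as $\delta_{BA} = \mathbf{c}'\boldsymbol{\beta}$, where
$$\mathbf{c} = \left(-\tfrac{1}{2}, -\tfrac{1}{2}, 0\right)'. \tag{9}$$
The least-squares estimator is
$$\hat{\delta}_{BA} = \mathbf{c}'\hat{\boldsymbol{\beta}} = \mathbf{c}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \tag{10}$$
which has variance
$$\mathrm{var}(\hat{\delta}_{BA}) = \sigma^2\,\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{c} = \frac{\sigma^2}{4}\,\mathbf{1}_2'\mathbf{D}\mathbf{1}_2, \tag{11}$$
where
$$\mathbf{D} = \frac{1}{2(n_0 + n_1)}\left[\mathbf{I}_2 + \frac{n_0^2 + 2n_0n_1 + n_1n_6}{2n_0(n_0 + n_6)}\,\mathbf{J}_2\right] \tag{12}$$
is the upper-left 2 × 2 block of $(\mathbf{X}'\mathbf{X})^{-1}$, I2 is a 2 × 2 identity matrix,
$\mathbf{J}_2 = \mathbf{1}_2\mathbf{1}_2'$ with 12 = (1, 1)′, and σ2 is the variance of a difference yji (j = 1,…, 6). A derivation of Equation 12 is given in the appendix.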
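The variance of any genotype contrast under model (7) can also be checked numerically. The sketch below is an illustration only: it builds X′X from the chip counts, solves the normal equations with exact fractions, and returns c′(X′X)⁻¹c in units of σ2. The helper names (`information_matrix`, `solve`, `contrast_variance`) and the chip counts are hypothetical, and the constraint τBA = 0 is assumed as in the text.

```python
from fractions import Fraction

# Rows of X for the six genotype pairs; columns are (tau_A, tau_B, tau_AB),
# with tau_BA = 0 as the identifiability constraint.
PAIR_ROWS = [
    (1, -1, 0),   # A - B   (n1 chips)
    (1, 0, -1),   # A - AB  (n2 chips)
    (1, 0, 0),    # A - BA  (n3 chips)
    (0, 1, -1),   # B - AB  (n4 chips)
    (0, 1, 0),    # B - BA  (n5 chips)
    (0, 0, 1),    # AB - BA (n6 chips)
]


def information_matrix(counts):
    """Return X'X for chip counts (n1, ..., n6)."""
    xtx = [[Fraction(0)] * 3 for _ in range(3)]
    for n_j, row in zip(counts, PAIR_ROWS):
        for r in range(3):
            for c in range(3):
                xtx[r][c] += n_j * row[r] * row[c]
    return xtx


def solve(matrix, rhs):
    """Solve matrix * x = rhs exactly by Gauss-Jordan elimination."""
    size = len(rhs)
    aug = [list(matrix[r]) + [Fraction(rhs[r])] for r in range(size)]
    for col in range(size):
        pivot = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[r][size] for r in range(size)]


def contrast_variance(counts, coef):
    """Return c'(X'X)^{-1}c, the variance of c'beta-hat in units of sigma^2."""
    solution = solve(information_matrix(counts), coef)
    return sum(Fraction(c) * x for c, x in zip(coef, solution))


# Midparent heterosis of BA: delta_BA = -(tau_A + tau_B) / 2 when tau_BA = 0.
C_BA = (Fraction(-1, 2), Fraction(-1, 2), Fraction(0))

print(contrast_variance((1, 1, 1, 1, 1, 1), C_BA))
print(contrast_variance((5, 1, 1, 1, 1, 1), C_BA))
```

Both calls use one chip per parent-hybrid pair and one AB-BA chip and differ only in the number of A-B chips, so comparing the two printed values shows whether the parent-parent pairs affect the variance of the heterosis contrast.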
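A design for a given sample size n involves an allocation (n0, n1, n6) to the different genotype pairings. We now derive an allocation that minimizes $\mathrm{var}(\hat{\delta}_{BA})$. It is first shown that $\mathrm{var}(\hat{\delta}_{BA})$ does not depend on n1. Thus, we set n1 = 0. In the next step, we find the optimal value of n0 subject to the constraint n = 4n0 + n6.
It can be shown that
$$\mathbf{1}_2'\mathbf{D}\mathbf{1}_2 = \frac{2n_0 + n_6}{n_0(n_0 + n_6)}, \tag{13}$$
which is free of n1. Thus, for any fixed values of n0 and n6, the variance of the heterosis contrast does not change with n1. This proves that parent-parent chips (A-B pairs) do not add any information with regard to the heterosis contrast (6), and so the optimal design must have n1 = 0. Setting n1 = 0 and n6 = n − 4n0, we obtain
$$\mathbf{1}_2'\mathbf{D}\mathbf{1}_2 = \frac{n - 2n_0}{n_0(n - 3n_0)} \tag{14}$$
and
$$\frac{\partial(\mathbf{1}_2'\mathbf{D}\mathbf{1}_2)}{\partial n_0} = -\frac{6n_0^2 - 6nn_0 + n^2}{n_0^2(n - 3n_0)^2} = 0, \tag{15}$$
yielding a quadratic equation in n0 with roots
$$n_0 = \frac{(3 \pm \sqrt{3})\,n}{6}. \tag{16}$$
Since n0 ≤ n/4, the only feasible solution is
$$n_0 = \frac{(3 - \sqrt{3})\,n}{6} \approx 0.2113\,n. \tag{17}$$
Thus, for a given total sample size n, the quantity 1′D1, and hence the variance of the heterosis estimator, $\mathrm{var}(\hat{\delta}_{BA})$, is minimized for the allocation
$$(n_0, n_1, n_6) = \left(\frac{(3 - \sqrt{3})\,n}{6},\; 0,\; \frac{(2\sqrt{3} - 3)\,n}{3}\right) \approx (0.2113\,n,\; 0,\; 0.1547\,n). \tag{18}$$
This same allocation also minimizes the variance of the other heterosis contrast (4), $\mathrm{var}(\hat{\delta}_{AB})$.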
The optimal allocation was derived by looking at a single gene, while in gene expression studies thousands of genes are studied simultaneously. It is perhaps useful to point out that generally the optimal allocation derived here is independent of the variance σ2, which may be gene specific. Thus, the optimal allocation applies to all genes simultaneously. Differences among genes in variance affect only the optimal total sample size needed to achieve a desired accuracy, which may be determined by standard procedures (Steel and Torrie 1980).
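Because chips come in whole numbers, the continuous optimum has to be rounded in practice. The following sketch searches all integer allocations with n1 = 0 for a hypothetical total of n = 100 chips. It assumes the variance expression var/σ2 = (2n0 + n6)/[4n0(n0 + n6)], which follows from model (7) with n2 = n3 = n4 = n5 = n0; the function names are illustrative.

```python
import math


def midparent_variance(n0, n6):
    """var(delta-hat)/sigma^2 for n0 chips per parent-hybrid pair and n6 AB-BA chips."""
    return (2 * n0 + n6) / (4 * n0 * (n0 + n6))


def best_integer_allocation(n):
    """Search all integer n0 with n1 = 0 and n6 = n - 4*n0 for the smallest variance."""
    candidates = ((midparent_variance(n0, n - 4 * n0), n0, n - 4 * n0)
                  for n0 in range(1, n // 4 + 1))
    return min(candidates)


n = 100  # hypothetical number of chips
continuous_n0 = (3 - math.sqrt(3)) / 6 * n
variance, n0, n6 = best_integer_allocation(n)
print(f"continuous optimum n0 = {continuous_n0:.2f}")
print(f"integer optimum: n0 = {n0}, n6 = {n6}, var/sigma^2 = {variance:.5f}")
```

The integer search is a simple safeguard: the variance is flat near its minimum, so neighboring integer allocations lose very little efficiency.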
Only one hybrid:
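When only one of the two possible hybrids is tested (hybrid AB, say), the model simplifies to
$$E(\mathbf{y}) = \mathbf{X}\boldsymbol{\beta}, \tag{19}$$
with $\boldsymbol{\beta} = (\tau_A, \tau_B)'$ and the constraint τAB = 0. It can be shown that
$$(\mathbf{X}'\mathbf{X})^{-1} = \frac{1}{n_0(n_0 + 2n_1)}\begin{pmatrix} n_0 + n_1 & n_1 \\ n_1 & n_0 + n_1 \end{pmatrix}, \tag{20}$$
where n1 is the number of A-B pairs and n0 is the number of A-AB or of B-AB pairs. Noting that $\delta_{AB} = \mathbf{c}'\boldsymbol{\beta}$ with $\mathbf{c} = -\tfrac{1}{2}\mathbf{1}_2$, it can be shown that
$$\mathrm{var}(\hat{\delta}_{AB}) = \frac{\sigma^2}{2n_0}, \tag{21}$$
i.e., the variance does not depend on n1, the number of parent-parent (A-B) pairs. Because n = 2n0 + n1, the variance is minimized when n0 = n/2, where n is the total sample size. Thus, all microarrays should be allocated to parent-hybrid pairs.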
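The single-hybrid case is small enough to verify directly. The sketch below, with hypothetical function names and chip counts, inverts the 2 × 2 matrix X′X for parameters (τA, τB) under the constraint τAB = 0 and evaluates the variance of the midparent heterosis estimator in units of σ2.

```python
from fractions import Fraction


def one_hybrid_inverse(n0, n1):
    """Inverse of X'X for parameters (tau_A, tau_B) with tau_AB = 0."""
    diag, off = n0 + n1, -n1            # diagonal and off-diagonal of X'X
    det = Fraction(diag * diag - off * off)
    return [[diag / det, -off / det], [-off / det, diag / det]]


def midparent_variance(n0, n1):
    """var(delta_AB-hat)/sigma^2 = (1/4) * sum of all elements of (X'X)^{-1}."""
    inv = one_hybrid_inverse(n0, n1)
    return sum(inv[r][c] for r in range(2) for c in range(2)) / 4


# Same number of parent-hybrid chips, with and without parent-parent chips.
print(midparent_variance(5, 0))
print(midparent_variance(5, 7))
```

Since the two results coincide, any chips spent on A-B pairs could have been used to increase n0 instead, which is why the optimal design uses none.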
Additive gene effects:
In deriving an optimal allocation, we have focused on the accuracy in estimating δ. It is sometimes of interest to also estimate the additive gene effect. The accuracy of such estimates in designs optimized for δ is now considered for the two cases studied (Hybrids and reciprocals as well as Only one hybrid).
Design for hybrids and reciprocals:
By not allocating any chips to A-B pairs, we have no direct comparisons of the parents. It turns out, however, that the A-B comparison can be made with good accuracy. More specifically, it may be of interest to estimate the additive gene effect defined by
$$\alpha = \tfrac{1}{2}(\tau_A - \tau_B) = \mathbf{k}'\boldsymbol{\beta}, \tag{22}$$
where
$$\mathbf{k} = \left(\tfrac{1}{2}, -\tfrac{1}{2}, 0\right)'. \tag{23}$$
The additive gene effect is of interest when studying the mode of dominance. When |δ| = |α|, there is complete dominance, while dominance is only partial when |δ| < |α| and overdominance occurs when |δ| > |α| (Kearsey and Pooni 1996). To study the mode of dominance it is desirable to estimate α with about the same accuracy as δ. It turns out that with n1 = 0 we have
$$\mathrm{var}(\hat{\alpha}) = \frac{\sigma^2}{4n_0} \tag{24}$$
and
$$\mathrm{var}(\hat{\delta}_{AB}) = \mathrm{var}(\hat{\delta}_{BA}) = \frac{\sigma^2}{4n_0}\cdot\frac{2n_0 + n_6}{n_0 + n_6} \geq \mathrm{var}(\hat{\alpha}). \tag{25}$$
So generally, the additive genetic effect α will be estimated more accurately than both δAB and δBA, when the design is optimized with respect to these two heterosis contrasts.
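As a worked check of how large this difference in accuracy is, the following calculation evaluates both variances at the allocation that is optimal for midparent heterosis. It assumes $n_1 = 0$, $n_0 = (3 - \sqrt{3})n/6$, and $n_6 = n - 4n_0$, together with the variance expressions $\sigma^2/(4n_0)$ for the additive effect and $\sigma^2(2n_0 + n_6)/[4n_0(n_0 + n_6)]$ for the heterosis contrast.

```latex
\begin{aligned}
\mathrm{var}(\hat{\alpha}) &= \frac{\sigma^2}{4n_0}
  = \frac{3 + \sqrt{3}}{4}\,\frac{\sigma^2}{n} \approx 1.183\,\frac{\sigma^2}{n}, \\
\mathrm{var}(\hat{\delta}) &= \frac{\sigma^2}{4}\cdot\frac{n - 2n_0}{n_0(n - 3n_0)}
  = \frac{2 + \sqrt{3}}{2}\,\frac{\sigma^2}{n} \approx 1.866\,\frac{\sigma^2}{n}, \\
\frac{\mathrm{var}(\hat{\delta})}{\mathrm{var}(\hat{\alpha})} &= \frac{2n_0 + n_6}{n_0 + n_6}
  = \frac{3 + \sqrt{3}}{3} \approx 1.58 .
\end{aligned}
```

Under these assumptions the variance of the heterosis estimator is roughly 1.6 times that of the additive effect, so the standard error of the additive effect is about 20% smaller.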
Design for only one hybrid:
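The additive gene effect $\alpha = \mathbf{k}'\boldsymbol{\beta}$, with $\mathbf{k} = \left(\tfrac{1}{2}, -\tfrac{1}{2}\right)'$, is estimated with variance
$$\mathrm{var}(\hat{\alpha}) = \frac{\sigma^2}{2(n_0 + 2n_1)}. \tag{26}$$
Thus, when all microarrays are allocated to parent-hybrid pairs (n1 = 0), the additive effect is estimated with the same accuracy as the dominance effect.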
BETTER-PARENT HETEROSIS
Hybrids and reciprocal:
It is most convenient to consider the hybrid BA. Results for the other hybrid, AB, are analogous. Assessing better-parent heterosis of hybrid BA requires good estimates of the contrasts λBA(A) = τBA − τA and λBA(B) = τBA − τB. The coefficient vector for the first of these contrasts equals $(-1, 0, 0)'$, and the associated variance is
$$\mathrm{var}(\hat{\lambda}_{BA(A)}) = \sigma^2 D_{11}, \tag{27}$$
where $D_{11} = [3n_0^2 + 2n_0(n_1 + n_6) + n_1n_6]/[4n_0(n_0 + n_1)(n_0 + n_6)]$ is the first diagonal element of D in Equation 12. The variance D11 is seen to be symmetric in n1 and n6; i.e., the equation remains unaltered if n1 and n6 are exchanged. Therefore the optimal design should be such that n1 = n6. The common sample size is denoted as n00, i.e., n1 = n6 = n00; whence
$$D_{11} = \frac{3n_0 + n_{00}}{4n_0(n_0 + n_{00})}. \tag{28}$$
After some algebra using n00 = (n − 4n0)/2 this becomes
$$D_{11} = \frac{n + 2n_0}{4n_0(n - 2n_0)}. \tag{29}$$
Setting the derivative to zero, $\partial D_{11}/\partial n_0 = 0$, yields a quadratic equation in n0, which can be shown to have roots
$$n_0 = \frac{(-1 \pm \sqrt{2})\,n}{2}. \tag{30}$$
Obviously, the only feasible solution is
$$n_0 = \frac{(\sqrt{2} - 1)\,n}{2} \approx 0.2071\,n. \tag{31}$$
Thus, ∼20% of the total sample size is to be used with each of four parent-hybrid pairs, leaving a little <20% (∼17%) for the parent-parent pair A-B and the hybrid-reciprocal pair AB-BA. As n1 = n6 for the optimal design, a little <10% (∼9%) should therefore be allocated to each of these two pairings. As in the case of midparent heterosis, most of the resources (∼83%) should be used on the hybrid-parent pairs.
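The continuous solution can be compared with an exhaustive search over integer allocations. The sketch below is illustrative only: it assumes the expression D11 = [3n0² + 2n0(n1 + n6) + n1n6]/[4n0(n0 + n1)(n0 + n6)] for the variance of τBA − τA in units of σ2, which follows from model (7) with n2 = n3 = n4 = n5 = n0, and it uses a hypothetical total of n = 100 chips.

```python
from fractions import Fraction
import math


def better_parent_variance(n0, n1, n6):
    """First diagonal element of D: var(lambda-hat)/sigma^2 for the contrast tau_BA - tau_A."""
    numerator = 3 * n0 * n0 + 2 * n0 * (n1 + n6) + n1 * n6
    return Fraction(numerator, 4 * n0 * (n0 + n1) * (n0 + n6))


def best_integer_allocation(n):
    """Exhaustive search over integer (n0, n1, n6) with 4*n0 + n1 + n6 = n."""
    best = None
    for n0 in range(1, n // 4 + 1):
        rest = n - 4 * n0
        for n1 in range(rest + 1):
            candidate = (better_parent_variance(n0, n1, rest - n1), n0, n1, rest - n1)
            if best is None or candidate < best:
                best = candidate
    return best


n = 100  # hypothetical number of chips
variance, n0, n1, n6 = best_integer_allocation(n)
print("continuous n0:", round((math.sqrt(2) - 1) / 2 * n, 2))
print("integer optimum (n0, n1, n6):", (n0, n1, n6), "var/sigma^2:", float(variance))
```

The search does not impose n1 = n6, so it also serves as a check on the symmetry argument used in the derivation.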
Only one hybrid:
The variance of the contrast λAB(A) = τAB − τA is
$$\mathrm{var}(\hat{\lambda}_{AB(A)}) = \sigma^2 v_{11}, \tag{32}$$
where $v_{11} = (n_0 + n_1)/[n_0(n_0 + 2n_1)]$ is the first diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$ in Equation 20. Using n1 = n − 2n0, this can be shown to equal
$$\mathrm{var}(\hat{\lambda}_{AB(A)}) = \sigma^2\,\frac{n - n_0}{n_0(2n - 3n_0)}. \tag{33}$$
Minimization again leads to a quadratic equation in n0, which has roots
$$n_0 = \frac{(3 \pm \sqrt{3})\,n}{3}. \tag{34}$$
Since n0 ≤ n/2, the only feasible solution is therefore given by
$$n_0 = \frac{(3 - \sqrt{3})\,n}{3} \approx 0.4226\,n. \tag{35}$$
Thus, ∼85% of the sample size (2n0 ≈ 0.845n) is allocated to the parent-hybrid pairs, while only ∼15% of the chips (n1 ≈ 0.155n) are spent on the parent-parent pair.
It is worth mentioning that the design problem here is equivalent to that of a multiple comparison with a control. For a completely randomized design and when two treatments are to be compared with a control, the optimal allocation is known to be $m_h/m_p = \sqrt{2}$, where mh is the number of observations per hybrid and mp is the number of observations per parent. This allocation minimizes the variance of a hybrid-parent contrast. Using a somewhat different optimality criterion, Dunnett (1955) found the same optimal allocation. Note that complete randomization would imply a single genotype per chip. By comparison, the optimal allocation (35) implies that $m_h/m_p = 2(\sqrt{3} - 1) \approx 1.46$, where mh = 2n0 and mp = n0 + n1, which is rather close but not equal to $\sqrt{2} \approx 1.41$. The difference is mainly due to the incomplete blocking, with blocks corresponding to chips.
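The comparison with the completely randomized case can be illustrated numerically. The sketch below assumes the single-hybrid variance var/σ2 = (n0 + n1)/[n0(n0 + 2n1)] for the contrast τAB − τA and a hypothetical total of n = 100 chips; it finds the best integer allocation and the implied ratio of observations per hybrid to observations per parent.

```python
import math


def better_parent_variance(n0, n1):
    """var(lambda-hat)/sigma^2 with one hybrid: first diagonal element of (X'X)^{-1}."""
    return (n0 + n1) / (n0 * (n0 + 2 * n1))


n = 100  # hypothetical number of chips
variance, n0 = min((better_parent_variance(k, n - 2 * k), k)
                   for k in range(1, n // 2 + 1))
n1 = n - 2 * n0

# Observations per hybrid and per parent implied by the chip allocation.
m_h, m_p = 2 * n0, n0 + n1
continuous_ratio = 2 * (math.sqrt(3) - 1)   # m_h / m_p at the continuous optimum

print("integer optimum (n0, n1):", (n0, n1))
print("m_h / m_p for the integer design:", round(m_h / m_p, 3))
print("continuous optimum ratio:", round(continuous_ratio, 3))
print("completely randomized design:", round(math.sqrt(2), 3))
```

The ratio for the chip design is slightly larger than the square root of 2, reflecting that every observation on a chip is paired with exactly one other genotype.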
DISCUSSION
In this article we have derived formulas for the optimal allocation of resources in cDNA expression studies to reveal midparent heterosis or better-parent heterosis at the gene level. A common feature of both of these cases is that most of the resources are allocated to the parent-hybrid pairs. The researcher needs to decide which type of heterosis is to be assessed. In the case of midparent heterosis, the parent-parent pair need not be tested at all, while with better-parent heterosis a small fraction of the total resources should be devoted to both parent-parent pairs and hybrid-reciprocal pairs.
We have not addressed the question of optimal sample size n. This may be determined by standard procedures (Steel and Torrie 1980). The sample size needed to detect heterosis will, among other things, depend on the variance. It should be stressed that variance will usually be gene specific, so optimal sample size will differ among genes. In designs with small sample sizes, efficient estimation of the variance is critical, and it may be useful to borrow strength from other genes (Wright and Simon 2003), trading variance for some bias. As pointed out by a referee, it may also be necessary to account for dye bias in variance estimation.
The result that, in optimal designs, parent-parent pairs provide no information regarding midparent heterosis contrasts may seem trivial at first sight. It should be pointed out, however, that it does not generally hold in suboptimal designs and is therefore not as trivial as it may seem. The reason is that parent-parent pairs provide indirect information regarding heterosis contrasts. For example, data on the parent pair A-B and on the parent-hybrid pair BA-A allow an indirect comparison for the pairing BA-B, since (BA − A) + (A − B) = BA − B. Therefore, it is often found (results not shown) that with suboptimal designs the parent-parent pair provides information for the heterosis contrast. For optimal designs, this information vanishes in much the same way as information from indirect comparisons vanishes in a complete block design.
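A small numerical illustration of this point, under model (7) with τBA = 0, is sketched below. The allocations are hypothetical: the first pair is balanced across the four parent-hybrid pairings, whereas the second pair omits the direct B-BA comparison. The function evaluates the variance of the midparent heterosis estimator for BA, in units of σ2, from the cofactors of the 3 × 3 matrix X′X.

```python
from fractions import Fraction


def midparent_variance_ba(n1, n2, n3, n4, n5, n6):
    """var(delta_BA-hat)/sigma^2 for chip counts of A-B, A-AB, A-BA, B-AB, B-BA, AB-BA."""
    # Entries of the symmetric matrix X'X for (tau_A, tau_B, tau_AB).
    a = n1 + n2 + n3      # (A, A)
    e = n1 + n4 + n5      # (B, B)
    i = n2 + n4 + n6      # (AB, AB)
    b, c, f = -n1, -n2, -n4   # (A, B), (A, AB), (B, AB)
    det = a * (e * i - f * f) - b * (b * i - f * c) + c * (b * f - e * c)
    cof11 = e * i - f * f
    cof22 = a * i - c * c
    cof12 = -(b * i - f * c)
    # delta_BA = -(tau_A + tau_B)/2, so the variance is (1/4) * 1'D1.
    return Fraction(cof11 + cof22 + 2 * cof12, 4 * det)


balanced_without = midparent_variance_ba(0, 1, 1, 1, 1, 1)
balanced_with = midparent_variance_ba(1, 1, 1, 1, 1, 1)
unbalanced_without = midparent_variance_ba(0, 1, 2, 1, 0, 1)
unbalanced_with = midparent_variance_ba(1, 1, 2, 1, 0, 1)

print(balanced_without, balanced_with)
print(unbalanced_without, unbalanced_with)
```

In the balanced design the added A-B chip leaves the variance unchanged, while in the unbalanced design it reduces the variance because it supplies the missing comparison with parent B indirectly.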
In many experiments, the linear model needs to account for several fixed and random sources of variation, giving rise to a complex mixed linear model (Wolfinger et al. 2001). In this case, finding an optimal allocation will typically require numerical search strategies such as simulated annealing (Keller et al. 2005). On the basis of the examples given in Keller et al. (2005) it may be conjectured that the optimal allocation in more complex settings will not deviate dramatically from that derived in this article.
To study heterosis, one may estimate the dominance ratio, θ = δ/α (Kearsey and Pooni 1996). Using the δ-method (Johnson et al. 1993) and exploiting the fact that dominance and additive gene effect estimates are stochastically independent, the approximate variance of the dominance ratio is
$$\mathrm{var}(\hat{\theta}) \approx \frac{\mathrm{var}(\hat{\delta}) + \theta^2\,\mathrm{var}(\hat{\alpha})}{\alpha^2}. \tag{36}$$
One might consider finding a design that minimizes this variance. This approach is not usually feasible, however, unless a priori information is available on both α and δ, which will rarely be the case. The same problem would apply if one were to work with the exact distribution of $\hat{\theta}$, assuming normality (Hinkley 1969), or Fieller's (1954) method (Piepho and Emrich 2005). Thus, it is preferable to optimize the design for contrasts related to either midparent heterosis or better-parent heterosis.
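To see why prior information on α and δ would be needed, the following sketch evaluates the first-order δ-method approximation var(θ̂) ≈ [var(δ̂) + θ² var(α̂)]/α² for independent estimates. All numerical values are hypothetical and chosen only to show how strongly the result depends on the unknown α.

```python
def dominance_ratio_variance(delta, alpha, var_delta, var_alpha):
    """First-order (delta-method) variance of delta-hat / alpha-hat for independent estimates."""
    theta = delta / alpha
    return (var_delta + theta ** 2 * var_alpha) / alpha ** 2


# Hypothetical effects and variances; only alpha is varied.
for alpha in (0.5, 1.0, 2.0):
    value = dominance_ratio_variance(delta=1.0, alpha=alpha,
                                     var_delta=0.04, var_alpha=0.01)
    print(f"alpha = {alpha}: approximate var(theta-hat) = {value:.6f}")
```

The same pair of estimator variances gives very different precision for the ratio depending on α, so a design minimizing this criterion cannot be chosen without knowing the effects in advance.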
Acknowledgments
I thank two anonymous referees for several helpful suggestions.
APPENDIX
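We here derive Equation 12 for matrix D. As we require n2 = n3 = n4 = n5 = n0, the matrix X′X is given by
$$\mathbf{X}'\mathbf{X} = \begin{pmatrix} \mathbf{M}_{11} & \mathbf{m}_{12} \\ \mathbf{m}_{12}' & m_{22} \end{pmatrix}, \tag{A1}$$
with
$$\mathbf{M}_{11} = 2(n_0 + n_1)\,\mathbf{I}_2 - n_1\mathbf{J}_2, \tag{A2}$$
$$\mathbf{m}_{12} = -n_0\mathbf{1}_2, \tag{A3}$$
and
$$m_{22} = 2n_0 + n_6, \tag{A4}$$
where 12 = (1, 1)′, I2 is a 2 × 2 identity matrix, and $\mathbf{J}_2 = \mathbf{1}_2\mathbf{1}_2'$. Using results on the inverse of a partitioned matrix (Harville 2000, p. 99) we find
$$(\mathbf{X}'\mathbf{X})^{-1} = \begin{pmatrix} \mathbf{D} & \mathbf{e} \\ \mathbf{e}' & f \end{pmatrix}, \tag{A5}$$
where
$$\mathbf{D} = \left(\mathbf{M}_{11} - \mathbf{m}_{12}m_{22}^{-1}\mathbf{m}_{12}'\right)^{-1}, \tag{A6}$$
$$\mathbf{e} = -\mathbf{D}\mathbf{m}_{12}m_{22}^{-1}, \tag{A7}$$
and
$$f = m_{22}^{-1} + m_{22}^{-1}\mathbf{m}_{12}'\mathbf{D}\mathbf{m}_{12}m_{22}^{-1}. \tag{A8}$$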
To study the heterosis contrast (6), which involves only τA and τB when τBA = 0, it is sufficient to find D. Using
![]() |
(A9) |
and
![]() |
(A10) |
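(Searle et al. 1992, p. 443), it can be shown that
$$\mathbf{D} = \frac{1}{2(n_0 + n_1)}\left[\mathbf{I}_2 + \frac{n_0^2 + 2n_0n_1 + n_1n_6}{2n_0(n_0 + n_6)}\,\mathbf{J}_2\right]. \tag{A11}$$
Now the least-squares estimator $\hat{\delta}_{BA} = \mathbf{c}'\hat{\boldsymbol{\beta}}$, with $\mathbf{c} = \left(-\tfrac{1}{2}, -\tfrac{1}{2}, 0\right)'$, has variance
$$\mathrm{var}(\hat{\delta}_{BA}) = \sigma^2\,\mathbf{c}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{c} = \frac{\sigma^2}{4}\,\mathbf{1}_2'\mathbf{D}\mathbf{1}_2. \tag{A12}$$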
References
- Birchler, J. A., D. L. Auger and N. C. Riddle, 2003. In search of the molecular basis of heterosis. Plant Cell 15: 2236–2239.
- Dobbin, K., and R. Simon, 2002. Comparison of microarray designs for class comparison and class discovery. Bioinformatics 18: 1438–1445.
- Dobbin, K., J. H. Shih and R. Simon, 2003a. Questions and answers on design of dual-label microarrays for identifying differentially expressed genes. J. Natl. Cancer Inst. 95: 1362–1369.
- Dobbin, K., J. H. Shih and R. Simon, 2003b. Statistical design of reverse dye microarrays. Bioinformatics 19: 803–810.
- Dunnett, C. W., 1955. A multiple comparison procedure for comparing several treatments with a control. J. Am. Stat. Assoc. 50: 1096–1121.
- Falconer, D. S., and T. F. C. Mackay, 1996. Introduction to Quantitative Genetics, Ed. 4. Longman, Harlow, UK.
- Fieller, E., 1954. Some problems in interval estimation. J. R. Stat. Soc. B 16: 175–185.
- Gibson, G., R. Riley-Berger, L. Harshman, A. Kopp, S. Vacha et al., 2004. Extensive sex-specific nonadditivity of gene expression in Drosophila melanogaster. Genetics 167: 1791–1799.
- Guo, M., M. A. Rupe, O. N. Danilevskaya, X. Yang and Z. Hu, 2003. Genome-wide mRNA profiling reveals heterochronic allelic variation and a new imprinted gene in hybrid maize endosperm. Plant J. 36: 30–44.
- Harville, D. A., 2000. Matrix Algebra from a Statistician's Perspective. Springer, Berlin.
- Hinkley, D. V., 1969. On the ratio of two correlated normal random variables. Biometrika 56: 635–639 (correction: Biometrika 57: 683).
- John, J. A., and E. R. Williams, 1995. Cyclic and Computer Generated Designs. Chapman & Hall, London.
- Johnson, N. L., S. Kotz and A. W. Kemp, 1993. Univariate Discrete Distributions, Ed. 2. Wiley, New York.
- Kearsey, M., and H. S. Pooni, 1996. The Genetical Analysis of Quantitative Traits. Chapman & Hall, London.
- Keller, B., K. Emrich, N. Hoecker, M. Sauer, F. Hochholdinger et al., 2005. Designing a microarray experiment to estimate dominance in maize (Zea mays L.). Theor. Appl. Genet. 111: 57–64.
- Kerr, M. K., 2003. Design considerations for efficient and effective microarray studies. Biometrics 59: 822–828.
- Kerr, M. K., and G. A. Churchill, 2001. Experimental design for gene expression microarrays. Biostatistics 2: 183–201.
- Kollipara, K. P., I. N. Saab, R. D. Wych, M. J. Lauer and G. W. Singletary, 2002. Expression profiling of reciprocal maize hybrids divergent for cold germination and desiccation tolerance. Plant Physiol. 129: 974–992.
- Lynch, M., and B. Walsh, 1998. Genetics and the Analysis of Quantitative Traits. Sinauer, Sunderland, MA.
- Ni, N. Z., Q. Sun, Z. Liu, L. Wu and X. Wang, 2000. Identification of a hybrid-specific expressed gene encoding novel RNA-binding protein in wheat seedling leaves using differential display of mRNA. Mol. Gen. Genet. 263: 934–938.
- Piepho, H. P., and K. Emrich, 2005. Simultaneous confidence intervals for two estimable functions and their ratio under a linear model (in press).
- Searle, S. R., G. Casella and C. E. McCulloch, 1992. Variance Components. Wiley, New York.
- Simon, R., E. Korn, L. McShane, M. Rademacher, G. Wright et al., 2003. Design and Analysis of DNA Microarray Investigations. Springer, New York.
- Steel, R. G. D., and J. H. Torrie, 1980. Principles and Procedures of Statistics: A Biometrical Approach. McGraw-Hill, New York.
- Stuber, C. W., S. E. Lincoln, D. W. Wolff, T. Helentjaris and E. S. Lander, 1992. Identification of genetic factors contributing to heterosis in a hybrid from two elite maize inbred lines using molecular markers. Genetics 132: 823–839.
- Wolfinger, R., G. Gibson, E. D. Wolfinger, L. Bennett, H. Hamadeh et al., 2001. Assessing gene significance from cDNA microarray expression data via mixed models. J. Comput. Biol. 8: 625–637.
- Wright, G. W., and R. M. Simon, 2003. A random variance model for detection of differential gene expression in small microarray experiments. Bioinformatics 19: 2448–2455.
- Yang, X., K. Ye and I. Hoeschele, 2002. Some E-optimal designs for cDNA microarray experiments. ASA Proceedings of the Joint Statistical Meetings, New York, pp. 3853–3954.