## Abstract

Heterosis is defined as the superiority of a hybrid cross over its two parents. Plant and animal breeders have long been exploiting heterosis, but the causes of this phenomenon are as yet only partly understood. Recently, chip technology has opened up the opportunity to study heterosis at the gene expression level. This article considers the cDNA chip technology, which allows assaying two genotypes simultaneously on the same chip. Heterosis involves the response of at least three genotypes (two parents and their hybrid), so a chip or microarray constitutes an incomplete block, which raises a design problem specific to heterosis studies. The question to be answered is how genotype pairs should be allocated to chips. We address this design problem for two types of heterosis: midparent heterosis and better-parent heterosis. The general picture emerging from our results is that most of the resources should be allocated to parent-hybrid pairs, while chips with parent-parent pairs or hybrid-reciprocal pairs should be used sparingly or not at all.

PROGRESS in plant and animal breeding is often made by exploiting nonadditive gene action. For example, when two maize inbred lines are crossed, the resulting hybrid is frequently found to be superior to the midparent value, *i.e.*, the average of the two parent means (Falconer and Mackay 1996; Lynch and Walsh 1998). This phenomenon is commonly denoted as midparent heterosis or hybrid vigor. Historically, heterosis was first studied at the phenotypic level of agronomically relevant traits such as yield. Several theories have been put forward to explain heterosis (*e.g.*, Stuber *et al.* 1992), but a consensus has not yet emerged. The advent of chip technologies has now opened up the scope to study heterosis at the gene expression level (Ni *et al.* 2000; Kollipara *et al.* 2002; Guo *et al.* 2003), thus increasing our understanding of the underlying molecular basis of heterosis (Birchler *et al.* 2003). This article is concerned with the optimal design of gene expression studies aiming at heterosis.

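The notion of heterosis may be associated with a linear model as follows. The expected phenotypic values of two parent genotypes *A* and *B* and their hybrid *AB* can be expressed as

$$E(y_{A}) = \phi + \tau_{A}, \tag{1}$$

$$E(y_{B}) = \phi + \tau_{B}, \tag{2}$$

and

$$E(y_{AB}) = \phi + \tau_{AB}, \tag{3}$$

where ϕ is a general effect and τ is the genotypic effect. Midparent heterosis may be defined by the linear contrast

$$\delta_{AB} = \tau_{AB} - \tfrac{1}{2}\left(\tau_{A} + \tau_{B}\right). \tag{4}$$

Midparent heterosis occurs whenever δ_{AB} ≠ 0. Often, it matters which inbred line is the male parent. It is then important to also study the reciprocal cross, which we denote as *BA*. The linear model for this genotype is

$$E(y_{BA}) = \phi + \tau_{BA}. \tag{5}$$

The reciprocal's midparent heterosis is

$$\delta_{BA} = \tau_{BA} - \tfrac{1}{2}\left(\tau_{A} + \tau_{B}\right). \tag{6}$$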

Heterosis of an agronomic trait is economically useful when the hybrid outperforms both parents. This type of heterosis is also known as better-parent heterosis, and it occurs for hybrid *AB* when τ_{AB} > τ_{A} and τ_{AB} > τ_{B}, assuming nonnegative coefficients and that an increase in average phenotype is considered advantageous.

Heterosis is thought to be associated with nonadditive gene action or dominance. In fact, dominance may be regarded as midparent heterosis at the gene level. Similarly, overdominance occurs when there is better-parent heterosis at the gene level. If an expression product for a specific gene can be measured for the inbred parents and their hybrids, dominance can be estimated on the basis of (4) and (6). Similarly, overdominance can be assessed on the basis of the contrasts τ_{AB} − τ_{A} and τ_{AB} − τ_{B} at the expression level.
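As a small illustration of the two kinds of contrast, the following Python sketch evaluates the midparent contrast and the two better-parent contrasts for a single gene. The function names and the numerical effects are hypothetical and serve only to make the definitions concrete.

```python
def midparent_heterosis(tau_hybrid, tau_a, tau_b):
    """Hybrid effect minus the mean of the two parent effects (contrasts 4 and 6)."""
    return tau_hybrid - (tau_a + tau_b) / 2.0


def better_parent_contrasts(tau_hybrid, tau_a, tau_b):
    """Return (hybrid - A, hybrid - B); better-parent heterosis needs both > 0."""
    return tau_hybrid - tau_a, tau_hybrid - tau_b


# Hypothetical genotypic effects on the log scale (illustrative values only).
tau_a, tau_b, tau_ab = 1.0, 2.0, 2.5

delta_ab = midparent_heterosis(tau_ab, tau_a, tau_b)
lambda_a, lambda_b = better_parent_contrasts(tau_ab, tau_a, tau_b)
shows_better_parent_heterosis = lambda_a > 0 and lambda_b > 0
```

With these made-up values the hybrid exceeds the midparent value and both parents, so both types of heterosis would be present.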

This article is concerned with cDNA chip technology, where each of a large number of genes is represented by a cDNA spot on a glass slide. Expression profiles of two mRNA samples representing two different genotypes are assayed on a slide in parallel. Genotypes are labeled by fluorescent dyes, resulting in a green signal for the one genotype and a red signal for the other genotype. To account for dye effects, it is customary to swap dyes on about half of the chips assigned to the same genotype pair. Statistically, a microarray may be considered as an incomplete block accommodating only two treatments (genotypes) (Kerr and Churchill 2001; Kerr 2003). The design problem is how to allocate different genotype pairs to chips.

Most of the current literature on experimental designs for identifying differentially expressed genes deals with the case where two or more treatments of equal interest are to be compared. Efficient designs in this context are the reference design, the loop design, and balanced block designs (Dobbin and Simon 2002; Kerr 2003; Dobbin *et al.* 2003a,b; Simon *et al.* 2003). The objective of heterosis studies differs from those commonly considered in that the treatment contrast of interest involves three treatments, so efficiency regarding all pairwise comparisons is irrelevant. Also, most of the theory of optimal designs revolves around criteria such as A-optimality or E-optimality (John and Williams 1995; Yang *et al.* 2002), which strive for optimality relative to a broad class of contrasts. In the case of heterosis, such approaches are not optimal, because the class of contrasts of interest is much more limited. Clearly there is only one type of contrast. While other designs such as the loop design may provide good heterosis estimates (Gibson *et al.* 2004), they are not usually optimal (Keller *et al.* 2005). By analogy, a balanced block design optimal with respect to all pairwise comparisons is not optimal regarding multiple comparison with a control. Generally, it will be more efficient to directly optimize the design with respect to the particular contrast(s) of interest (John and Williams 1995).

This article is concerned with the problem of finding a design by which heterosis or dominance can be estimated with minimal standard error. Specifically, we search for the optimal allocation of a fixed number of chips among all possible genotype pairs. We first consider midparent heterosis and then turn to better-parent heterosis. With both types of heterosis, we study the case of two hybrids as well as that of a single hybrid (no reciprocal tested). The derivations for different cases are organized as follows: first an appropriate linear model is formulated and contrasts of interest are defined in terms of the parameters of that model. Optimality is then defined in terms of the variance of a contrast of interest. Minimization of this criterion leads to the optimal allocation.

## MIDPARENT HETEROSIS

#### Hybrids and reciprocal:

We assume that analysis of normalized gene expression data is done in standard fashion on the basis of a linear model for log measurements. The model accounts for all relevant effects, including dye, chip, and genotype (treatment). For details the reader is referred to Kerr and Churchill (2001), Wolfinger *et al.* (2001), and Keller *et al.* (2005).

It is assumed throughout that chip effects are taken as fixed, implying that interchip information is not recovered. This approach corresponds to the usual assumption made when deriving optimal incomplete block designs (John and Williams 1995). Since there are only two genotypes per chip, all information on genotype contrasts is contained in pairwise differences of genotypic expression levels per chip. Clearly, the analysis of differences of log measurements is equivalent to analysis of actual log measurements, when chip effects are fixed. We express the model in terms of genotype differences, because this greatly simplifies our study of optimal allocation. In applications one will usually analyze actual log intensities instead of differences.

Let *y*_{ji} denote the *i*th observed genotype difference for the *j*th genotype pair. Specifically, let

- *y*_{1i} = *i*th observation (chip) on difference *A − B* (*i* = 1,…, *n*_{1}),
- *y*_{2i} = *i*th observation (chip) on difference *A − AB* (*i* = 1,…, *n*_{2}),
- *y*_{3i} = *i*th observation (chip) on difference *A − BA* (*i* = 1,…, *n*_{3}),
- *y*_{4i} = *i*th observation (chip) on difference *B − AB* (*i* = 1,…, *n*_{4}),
- *y*_{5i} = *i*th observation (chip) on difference *B − BA* (*i* = 1,…, *n*_{5}),
- *y*_{6i} = *i*th observation (chip) on difference *AB − BA* (*i* = 1,…, *n*_{6}),

where *n*_{j} is the number of chips used for the *j*th genotype pair. The differences have the following expected values:

$$
\begin{aligned}
E(y_{1i}) &= \tau_{A} - \tau_{B}, \\
E(y_{2i}) &= \tau_{A} - \tau_{AB}, \\
E(y_{3i}) &= \tau_{A} - \tau_{BA}, \\
E(y_{4i}) &= \tau_{B} - \tau_{AB}, \\
E(y_{5i}) &= \tau_{B} - \tau_{BA}, \\
E(y_{6i}) &= \tau_{AB} - \tau_{BA}.
\end{aligned}
\tag{7}
$$

The total sample size is given by $n = \sum_{j=1}^{6} n_{j}$. For symmetry reasons, we require the same number *n*_{0} of observations for each parent-hybrid pair, *i.e.*, *n*_{2} = *n*_{3} = *n*_{4} = *n*_{5} = *n*_{0}. Thus, an allocation is fully described by the triple (*n*_{0}, *n*_{1}, *n*_{6}), and the total sample size is *n* = 4*n*_{0} + *n*_{1} + *n*_{6}. To ensure identifiability, we set τ_{BA} = 0. To account for dye effects, one commonly swaps dyes for half the chips of a genotype pair. The dye swap can be accommodated by extending the linear model with dye effects and dye-by-genotype interactions. To derive a design optimal with respect to contrasts among genotype main effects, it suffices to use model (7) and require that the number of arrays for a particular genotype pair be allocated in equal parts to both possible dye swaps.

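Model (7) may be expressed as

$$E(\mathbf{y}) = X\boldsymbol{\beta}, \tag{8}$$

where $\mathbf{y} = (y_{11}, \ldots, y_{6n_{6}})'$, $\boldsymbol{\beta} = (\tau_{A}, \tau_{B}, \tau_{AB})'$, and *X* is the appropriate design matrix with dummies 1 and −1. The heterosis contrast of *BA* can be written as $\delta_{BA} = \mathbf{c}'\boldsymbol{\beta}$, where

$$\mathbf{c}' = \left(-\tfrac{1}{2},\; -\tfrac{1}{2},\; 0\right). \tag{9}$$

The least-squares estimator is

$$\hat{\delta}_{BA} = \mathbf{c}'\hat{\boldsymbol{\beta}} = \mathbf{c}'(X'X)^{-1}X'\mathbf{y}, \tag{10}$$

which has variance

$$\operatorname{var}(\hat{\delta}_{BA}) = \sigma^{2}\,\mathbf{c}'(X'X)^{-1}\mathbf{c} = \frac{\sigma^{2}}{4}\,\mathbf{1}_{2}' D\, \mathbf{1}_{2}, \tag{11}$$

where

$$D = \frac{1}{2(n_{0}+n_{1})}\left[I_{2} + \frac{n_{0}^{2} + n_{1}(2n_{0}+n_{6})}{2n_{0}(n_{0}+n_{6})}\,\mathbf{1}_{2}\mathbf{1}_{2}'\right] \tag{12}$$

is the upper-left 2 × 2 block of $(X'X)^{-1}$, *I*_{2} is a 2 × 2 identity matrix, 1_{2} = (1, 1)′, and σ^{2} is the variance of a difference *y*_{ji} (*j* = 1,…, 6). A derivation of Equation 12 is given in the appendix.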
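The variance of a contrast under model (7) can also be checked numerically for any allocation. The sketch below is one possible way to do this in plain Python with exact fractions; it assumes the parameter order (τ_A, τ_B, τ_AB) with τ_BA = 0, and the helper names (`information_matrix`, `solve`, `contrast_variance`) are illustrative, not part of any published software.

```python
from fractions import Fraction

# One row of X per chip type; columns are (tau_A, tau_B, tau_AB), with
# tau_BA set to 0 for identifiability. Row order follows y_1 ... y_6.
PAIR_ROWS = [
    (1, -1, 0),   # A - B
    (1, 0, -1),   # A - AB
    (1, 0, 0),    # A - BA
    (0, 1, -1),   # B - AB
    (0, 1, 0),    # B - BA
    (0, 0, 1),    # AB - BA
]


def information_matrix(counts):
    """Return X'X for chip counts (n1, ..., n6), one count per genotype pair."""
    xtx = [[Fraction(0)] * 3 for _ in range(3)]
    for n_chips, row in zip(counts, PAIR_ROWS):
        for r in range(3):
            for s in range(3):
                xtx[r][s] += n_chips * row[r] * row[s]
    return xtx


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gauss-Jordan elimination (nonsingular input)."""
    size = len(rhs)
    aug = [list(matrix[r]) + [Fraction(rhs[r])] for r in range(size)]
    for col in range(size):
        pivot_row = next(r for r in range(col, size) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [value / pivot for value in aug[col]]
        for r in range(size):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[r][size] for r in range(size)]


def contrast_variance(counts, coefficients):
    """Return c'(X'X)^{-1}c, the variance of c'beta-hat in units of sigma^2."""
    z = solve(information_matrix(counts), coefficients)
    return sum(Fraction(c) * zi for c, zi in zip(coefficients, z))


# Coefficients of delta_BA = tau_BA - (tau_A + tau_B)/2 when tau_BA = 0.
C_BA = (Fraction(-1, 2), Fraction(-1, 2), Fraction(0))

without_parent_chips = contrast_variance((0, 3, 3, 3, 3, 2), C_BA)
with_parent_chips = contrast_variance((5, 3, 3, 3, 3, 2), C_BA)
```

For this small example with three chips per parent-hybrid pair and two hybrid-reciprocal chips, the two values coincide, so adding five *A-B* chips leaves the variance of the heterosis estimator unchanged.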

A design for a given sample size *n* involves an allocation (*n*_{0}, *n*_{1}, *n*_{6}) to the different genotype pairings. We now derive an allocation that minimizes $\operatorname{var}(\hat{\delta}_{BA})$. It is first shown that $\operatorname{var}(\hat{\delta}_{BA})$ does not depend on *n*_{1}. Thus, we set *n*_{1} = 0. In the next step, we find the optimal value of *n*_{0} subject to the constraint *n* = 4*n*_{0} + *n*_{6}.

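It can be shown that

$$\mathbf{1}_{2}' D\, \mathbf{1}_{2} = \frac{2n_{0}+n_{6}}{n_{0}(n_{0}+n_{6})}, \tag{13}$$

which is free of *n*_{1}. Thus, for any fixed values of *n*_{0} and *n*_{6}, the variance of the heterosis contrast does not change with *n*_{1}. This proves that parent-parent chips (*A-B* pairs) do not add any information with regard to the heterosis contrast (6), and so the optimal design must have *n*_{1} = 0. Setting *n*_{1} = 0 and *n*_{6} = *n* − 4*n*_{0}, we obtain

$$\mathbf{1}_{2}' D\, \mathbf{1}_{2} = \frac{n-2n_{0}}{n_{0}(n-3n_{0})} \tag{14}$$

and

$$\frac{\partial\,(\mathbf{1}_{2}' D\, \mathbf{1}_{2})}{\partial n_{0}} = \frac{-6n_{0}^{2}+6nn_{0}-n^{2}}{n_{0}^{2}(n-3n_{0})^{2}} = 0, \tag{15}$$

yielding a quadratic equation in *n*_{0} with roots

$$n_{0} = \frac{n\left(3 \pm \sqrt{3}\right)}{6}. \tag{16}$$

Since *n*_{0} ≤ *n*/4, the only feasible solution is

$$n_{0} = \frac{n\left(3 - \sqrt{3}\right)}{6} \approx 0.2113\,n. \tag{17}$$

Thus, for a given total sample size *n*, the quantity 1′*D*1, and hence the variance of the heterosis estimator, $\operatorname{var}(\hat{\delta}_{BA})$, is minimized for the allocation

$$(n_{0},\, n_{1},\, n_{6}) = \left(\frac{n(3-\sqrt{3})}{6},\; 0,\; \frac{n(2\sqrt{3}-3)}{3}\right) \approx (0.2113\,n,\; 0,\; 0.1547\,n). \tag{18}$$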

This same allocation also minimizes the variance of the other heterosis contrast (4), $\operatorname{var}(\hat{\delta}_{AB})$.
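In practice chips come in whole numbers, so the continuous optimum has to be rounded. The following sketch assumes the closed form var/σ² = (2n₀ + n₆)/(4n₀(n₀ + n₆)) implied by model (7) with *n*_{1} = 0, and compares the continuous solution with an exhaustive search over integer allocations; the function names are hypothetical.

```python
import math


def midparent_variance(n0, n6):
    """var(delta-hat)/sigma^2 for n0 chips per parent-hybrid pair and n6 AB-BA chips."""
    return (2 * n0 + n6) / (4 * n0 * (n0 + n6))


def optimal_midparent_allocation(n):
    """Continuous optimum (n0, n1, n6) for a total of n chips."""
    n0 = n * (3 - math.sqrt(3)) / 6
    return n0, 0.0, n - 4 * n0


def best_integer_allocation(n):
    """Exhaustive search over whole chips with n1 = 0 and n6 = n - 4*n0."""
    candidates = [(n0, n - 4 * n0) for n0 in range(1, n // 4 + 1)]
    return min(candidates, key=lambda pair: midparent_variance(*pair))


n0_opt, n1_opt, n6_opt = optimal_midparent_allocation(100)
n0_int, n6_int = best_integer_allocation(100)
```

For 100 chips the continuous solution is close to 21 chips per parent-hybrid pair and 15 to 16 hybrid-reciprocal chips, and the integer search lands on the neighboring whole-number allocation. The dye-swap requirement (equal halves per pair) may call for further rounding to even counts, which this sketch does not attempt.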

The optimal allocation was derived by looking at a single gene, while in gene expression studies thousands of genes are studied simultaneously. It is perhaps useful to point out that generally the optimal allocation derived here is independent of the variance σ^{2}, which may be gene specific. Thus, the optimal allocation applies to all genes simultaneously. Differences among genes in variance affect only the optimal total sample size needed to achieve a desired accuracy, which may be determined by standard procedures (Steel and Torrie 1980).

#### Only one hybrid:

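When only one of the two possible hybrids is tested (hybrid *AB*, say), the model simplifies to

$$E(\mathbf{y}) = X\boldsymbol{\beta} \tag{19}$$

with $\boldsymbol{\beta} = (\tau_{A}, \tau_{B})'$ and the constraint τ_{AB} = 0. It can be shown that

$$(X'X)^{-1} = \frac{1}{n_{0}(n_{0}+2n_{1})}\begin{pmatrix} n_{0}+n_{1} & n_{1} \\ n_{1} & n_{0}+n_{1} \end{pmatrix}, \tag{20}$$

where *n*_{1} is the number of *A-B* pairs and *n*_{0} is the number of *A-AB* or of *B-AB* pairs. Noting that $\delta_{AB} = \mathbf{c}'\boldsymbol{\beta}$ with $\mathbf{c}' = (-\tfrac{1}{2}, -\tfrac{1}{2})$, it can be shown that

$$\operatorname{var}(\hat{\delta}_{AB}) = \sigma^{2}\,\mathbf{c}'(X'X)^{-1}\mathbf{c} = \frac{\sigma^{2}}{2n_{0}}, \tag{21}$$

*i.e.*, the variance does not depend on *n*_{1}, the number of parent-parent (*A-B*) pairs. Obviously, the variance is minimized when *n*_{0} = *n*/2, where *n* is the total sample size. Thus, all microarrays should be allocated to parent-hybrid pairs.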
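The single-hybrid result is easy to verify by hand, because X′X is only 2 × 2. The sketch below assumes the parameter vector (τ_A, τ_B)′ with τ_AB = 0, so that X′X has diagonal elements *n*_{0} + *n*_{1} and off-diagonal elements −*n*_{1}; the function names are illustrative.

```python
from fractions import Fraction


def one_hybrid_inverse(n0, n1):
    """(X'X)^{-1} for beta = (tau_A, tau_B)' under the constraint tau_AB = 0.

    X'X has diagonal n0 + n1 and off-diagonal -n1 (n1 A-B chips, n0 A-AB
    chips, n0 B-AB chips), so its determinant is n0 * (n0 + 2*n1).
    """
    det = Fraction(n0 * (n0 + 2 * n1))
    return [[(n0 + n1) / det, n1 / det],
            [n1 / det, (n0 + n1) / det]]


def midparent_variance_one_hybrid(n0, n1):
    """var(delta_AB-hat)/sigma^2 with coefficient vector c = (-1/2, -1/2)."""
    inv = one_hybrid_inverse(n0, n1)
    c = (Fraction(-1, 2), Fraction(-1, 2))
    return sum(c[r] * inv[r][s] * c[s] for r in range(2) for s in range(2))


# Same n0, increasing numbers of parent-parent chips.
values = [midparent_variance_one_hybrid(5, n1) for n1 in (0, 3, 7)]
```

All three values equal 1/(2 · 5), illustrating that chips spent on the *A-B* pair do not reduce the variance of the midparent heterosis estimate in this setting.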

#### Additive gene effects:

In deriving an optimal allocation, we have focused on the accuracy in estimating δ. It is sometimes of interest to also estimate the additive gene effect. The accuracy of such estimates in designs optimized for δ is now considered for the two cases studied (*Hybrids and reciprocals* as well as *Only one hybrid*).

##### Design for hybrids and reciprocals:

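By not allocating any chips to *A-B* pairs, we have no direct comparisons of the parents. It turns out, however, that the *A-B* comparison can be made with good accuracy. More specifically, it may be of interest to estimate the additive gene effect defined by

$$\alpha = \tfrac{1}{2}\left(\tau_{A} - \tau_{B}\right) = \mathbf{k}'\boldsymbol{\beta}, \tag{22}$$

where

$$\mathbf{k}' = \left(\tfrac{1}{2},\; -\tfrac{1}{2},\; 0\right). \tag{23}$$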

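The additive gene effect is of interest when studying the mode of dominance. When |δ| = |α|, there is complete dominance, while dominance is only partial when |δ| < |α| and overdominance occurs when |δ| > |α| (Kearsey and Pooni 1996). To study the mode of dominance it is desirable to estimate α with about the same accuracy as δ. It turns out that with *n*_{1} = 0 we have

$$\operatorname{var}(\hat{\alpha}) = \frac{\sigma^{2}}{4n_{0}}, \tag{24}$$

so that from (11) and (13)

$$\operatorname{var}(\hat{\alpha}) = \frac{n_{0}+n_{6}}{2n_{0}+n_{6}}\,\operatorname{var}(\hat{\delta}_{BA}) \le \operatorname{var}(\hat{\delta}_{BA}). \tag{25}$$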

So generally, the additive genetic effect α will be estimated more accurately than both δ_{AB} and δ_{BA}, when the design is optimized with respect to these two heterosis contrasts.
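To give a sense of scale, the relative accuracy of the two estimates can be evaluated at the midparent-optimal allocation. The sketch assumes the closed forms var(α̂)/σ² = 1/(4n₀) and var(δ̂)/σ² = (2n₀ + n₆)/(4n₀(n₀ + n₆)) for *n*_{1} = 0; the function names are hypothetical.

```python
import math


def var_delta(n0, n6):
    """var(delta-hat)/sigma^2 for the two-hybrid design with n1 = 0."""
    return (2 * n0 + n6) / (4 * n0 * (n0 + n6))


def var_alpha(n0):
    """var(alpha-hat)/sigma^2 for the two-hybrid design with n1 = 0."""
    return 1 / (4 * n0)


# Work with proportions of the total sample size (n = 1).
n0 = (3 - math.sqrt(3)) / 6
n6 = 1 - 4 * n0
relative_variance = var_alpha(n0) / var_delta(n0, n6)
```

At this allocation the ratio works out to (3 − √3)/2, so the additive effect is estimated with roughly 63% of the variance of the heterosis contrast.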

##### Design for only one hybrid:

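The additive gene effect $\alpha = \mathbf{k}'\boldsymbol{\beta}$, with $\mathbf{k}' = (\tfrac{1}{2}, -\tfrac{1}{2})$, is estimated with variance

$$\operatorname{var}(\hat{\alpha}) = \frac{\sigma^{2}}{2(n_{0}+2n_{1})}. \tag{26}$$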

Thus, when all microarrays are allocated to parent-hybrid pairs (*n*_{1} = 0), the additive effect is estimated with the same accuracy as the dominance effect.

## BETTER-PARENT HETEROSIS

#### Hybrids and reciprocal:

It is most convenient to consider the hybrid *BA*. Results for the other hybrid, *AB*, are analogous. Assessing better-parent heterosis of hybrid *BA* requires good estimates of the contrasts λ_{BA(A)} = τ_{BA} − τ_{A} and λ_{BA(B)} = τ_{BA} − τ_{B}. The coefficient vector for the first of these contrasts equals $(-1, 0, 0)'$, and the associated variance is

$$\operatorname{var}(\hat{\lambda}_{BA(A)}) = \sigma^{2} D_{11}, \tag{27}$$

where *D*_{11} is the first diagonal element of *D* in Equation 12. The variance *D*_{11} is seen to be symmetric in *n*_{1} and *n*_{6}; *i.e.*, the equation remains unaltered if *n*_{1} and *n*_{6} are exchanged. Therefore the optimal design should be such that *n*_{1} = *n*_{6}. The common sample size is denoted as *n*_{00}, *i.e.*, *n*_{1} = *n*_{6} = *n*_{00}; whence

$$D_{11} = \frac{3n_{0}+n_{00}}{4n_{0}(n_{0}+n_{00})}. \tag{28}$$

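After some algebra using *n*_{00} = (*n* − 4*n*_{0})/2 this becomes

$$D_{11} = \frac{n+2n_{0}}{4n_{0}(n-2n_{0})}. \tag{29}$$

Setting the derivative $\partial D_{11}/\partial n_{0}$ equal to zero yields the quadratic equation $4n_{0}^{2} + 4nn_{0} - n^{2} = 0$ in *n*_{0}, which has roots

$$n_{0} = \frac{n\left(-1 \pm \sqrt{2}\right)}{2}. \tag{30}$$

Since *n*_{0} must be positive, the only feasible solution is

$$n_{0} = \frac{n\left(\sqrt{2}-1\right)}{2} \approx 0.2071\,n. \tag{31}$$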

Thus, ∼21% of the total sample size is to be used with each of the four parent-hybrid pairs, leaving a little less than 20% (∼17%) for the parent-parent pair *A-B* and the hybrid-reciprocal pair *AB-BA* together. As *n*_{1} = *n*_{6} for the optimal design, a little less than 10% (∼8.6%) should therefore be allocated to each of these two pairings. As in the case of midparent heterosis, most of the resources (∼83%) should be used on the hybrid-parent pairs.
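The better-parent allocation can be checked in the same spirit. The sketch below assumes that the first diagonal element of the inverse information matrix under model (7) is D₁₁ = (3n₀² + 2n₀(n₁ + n₆) + n₁n₆)/(4n₀(n₀ + n₁)(n₀ + n₆)), which reduces to the symmetric form when *n*_{1} = *n*_{6}; it compares the continuous optimum with a brute-force search over whole chips. The function names are hypothetical.

```python
import math


def d11(n0, n1, n6):
    """var(lambda-hat)/sigma^2 for the contrast tau_BA - tau_A (two-hybrid design)."""
    numerator = 3 * n0 ** 2 + 2 * n0 * (n1 + n6) + n1 * n6
    return numerator / (4 * n0 * (n0 + n1) * (n0 + n6))


def optimal_better_parent_allocation(n):
    """Continuous optimum (n0, n1, n6) with n1 = n6."""
    n0 = n * (math.sqrt(2) - 1) / 2
    n00 = (n - 4 * n0) / 2
    return n0, n00, n00


def best_integer_allocation(n):
    """Search all whole-chip allocations with 4*n0 + n1 + n6 = n."""
    candidates = [
        (n0, n1, n - 4 * n0 - n1)
        for n0 in range(1, n // 4 + 1)
        for n1 in range(0, n - 4 * n0 + 1)
    ]
    return min(candidates, key=lambda alloc: d11(*alloc))


continuous = optimal_better_parent_allocation(100)
integer = best_integer_allocation(100)
```

The search does not impose *n*_{1} = *n*_{6}, yet for 100 chips it returns a symmetric allocation next to the continuous optimum, in line with the symmetry argument above.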

#### Only one hybrid:

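The variance of the contrast λ_{AB(A)} = τ_{AB} − τ_{A} is

$$\operatorname{var}(\hat{\lambda}_{AB(A)}) = \sigma^{2}\left[(X'X)^{-1}\right]_{11}, \tag{32}$$

where $\left[(X'X)^{-1}\right]_{11} = (n_{0}+n_{1})/[n_{0}(n_{0}+2n_{1})]$ is the first diagonal element of $(X'X)^{-1}$ in Equation 20. Using *n*_{1} = *n* − 2*n*_{0}, this can be shown to equal

$$\operatorname{var}(\hat{\lambda}_{AB(A)}) = \sigma^{2}\,\frac{n-n_{0}}{n_{0}(2n-3n_{0})}. \tag{33}$$

Minimization again leads to a quadratic equation in *n*_{0}, $3n_{0}^{2} - 6nn_{0} + 2n^{2} = 0$, which has roots

$$n_{0} = \frac{n\left(3 \pm \sqrt{3}\right)}{3}. \tag{34}$$

Since *n*_{0} ≤ *n*/2, the only feasible solution is therefore given by

$$n_{0} = \frac{n\left(3-\sqrt{3}\right)}{3} \approx 0.4226\,n. \tag{35}$$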

Thus, ∼84% of the sample size is allocated to the parent-hybrid pairs, while only 16% of the chips are spent on the parent-parent pair.

It is worth mentioning that the design problem here is equivalent to that of a multiple comparison with a control. For a completely randomized design and when two treatments are to be compared with a control, the optimal allocation is known to be $m_{h}/m_{p} = \sqrt{2}$, where *m*_{h} is the number of observations per hybrid and *m*_{p} is the number of observations per parent. This allocation minimizes the variance of a hybrid-parent contrast. Using a somewhat different optimality criterion, Dunnett (1955) found the same optimal allocation. Note that complete randomization would imply a single genotype per chip. By comparison, the optimal allocation (35) implies that $m_{h}/m_{p} = 2(\sqrt{3}-1) \approx 1.46$, where *m*_{h} = 2*n*_{0} and *m*_{p} = *n*_{0} + *n*_{1}, which is rather close but not equal to $\sqrt{2} \approx 1.41$. The difference is mainly due to the incomplete blocking, with blocks corresponding to chips.
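The comparison with the completely randomized case can be reproduced numerically. The sketch assumes the single-hybrid variance var/σ² = (n₀ + n₁)/(n₀(n₀ + 2n₁)) for the contrast τ_AB − τ_A and the feasible root n₀ = n(1 − 1/√3); the function names are hypothetical.

```python
import math


def better_parent_variance_one_hybrid(n0, n1):
    """var(lambda-hat)/sigma^2 for tau_AB - tau_A with a single hybrid."""
    return (n0 + n1) / (n0 * (n0 + 2 * n1))


def optimal_one_hybrid_allocation(n):
    """Continuous optimum (n0, n1): n0 chips per parent-hybrid pair, n1 A-B chips."""
    n0 = n * (1 - 1 / math.sqrt(3))
    return n0, n - 2 * n0


n0, n1 = optimal_one_hybrid_allocation(1.0)
share_parent_hybrid = 2 * n0            # fraction of chips on A-AB and B-AB pairs
ratio_blocked = 2 * n0 / (n0 + n1)      # observations per hybrid / per parent
ratio_randomized = math.sqrt(2)         # optimum without blocking
```

Under these assumptions about 84.5% of the chips go to parent-hybrid pairs, and the hybrid-to-parent observation ratio is about 1.46, compared with about 1.41 for a completely randomized design.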

## DISCUSSION

In this article we have derived formulas for the optimal allocation of resources in cDNA expression studies to reveal midparent heterosis or better-parent heterosis at the gene level. A common feature of both of these cases is that most of the resources are allocated to the parent-hybrid pairs. The researcher needs to decide which type of heterosis is to be assessed. In the case of midparent heterosis, the parent-parent pair need not be tested at all, while with better-parent heterosis a small fraction of the total resources should be devoted to both parent-parent pairs and hybrid-reciprocal pairs.

We have not addressed the question of optimal sample size *n*. This may be determined by standard procedures (Steel and Torrie 1980). The sample size needed to detect heterosis will, among other things, depend on the variance. It should be stressed that variance will usually be gene specific, so optimal sample size will differ among genes. In designs with small sample sizes, efficient estimation of the variance is critical, and it may be useful to borrow strength from other genes (Wright and Simon 2003), trading variance for some bias. As pointed out by a referee, it may also be necessary to account for dye bias in variance estimation.

The result that in optimal designs parent-parent pairs provide no information regarding midparent heterosis contrasts may seem trivial at first sight. It should be pointed out, however, that it does not generally hold in suboptimal designs and is therefore not as trivial as it may seem. The reason is that parent-parent pairs provide indirect information regarding heterosis contrasts. For example, data on the parent pair *A-B* and on the parent-hybrid pair *BA-A* allow an indirect comparison for the pairing *BA-B*, since (*BA − A*) + (*A − B*) = *BA − B*. Therefore, it is often found (results not shown) that with suboptimal designs, the parent-parent pair provides information for the heterosis contrast. For optimal designs, this information vanishes in much the same way as information from indirect comparisons vanishes in a complete block design.
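A small numerical illustration of this point, not taken from the article's own results, is sketched below. It builds X′X for arbitrary chip counts under model (7) (parameter order τ_A, τ_B, τ_AB with τ_BA = 0) and evaluates the variance of the heterosis contrast by Cramer's rule; the function names and the chip counts are hypothetical.

```python
from fractions import Fraction


def det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def midparent_variance(n1, n2, n3, n4, n5, n6):
    """var(delta_BA-hat)/sigma^2 for arbitrary chip counts per genotype pair.

    Pair order: A-B, A-AB, A-BA, B-AB, B-BA, AB-BA.
    """
    xtx = [[n1 + n2 + n3, -n1, -n2],
           [-n1, n1 + n4 + n5, -n4],
           [-n2, -n4, n2 + n4 + n6]]
    c = [Fraction(-1, 2), Fraction(-1, 2), Fraction(0)]
    determinant = det3(xtx)
    # Cramer's rule: z solves (X'X) z = c, hence c'z = c'(X'X)^{-1}c.
    z = []
    for col in range(3):
        replaced = [[c[r] if s == col else xtx[r][s] for s in range(3)]
                    for r in range(3)]
        z.append(det3(replaced) / determinant)
    return sum(ci * zi for ci, zi in zip(c, z))


# Balanced parent-hybrid counts: A-B chips add nothing.
balanced_without = midparent_variance(0, 3, 3, 3, 3, 2)
balanced_with = midparent_variance(5, 3, 3, 3, 3, 2)

# Unbalanced counts (parent A over-represented): A-B chips now help.
unbalanced_without = midparent_variance(0, 4, 4, 1, 1, 2)
unbalanced_with = midparent_variance(4, 4, 4, 1, 1, 2)
```

In the unbalanced example, parent *B* is observed on only two chips, and the added *A-B* chips let the better-estimated parent *A* serve as a bridge, which lowers the variance of the heterosis estimate.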

In many experiments, the linear model needs to account for several fixed and random sources of variation, giving rise to a complex mixed linear model (Wolfinger *et al*. 2001). In this case, finding an optimal allocation will typically require numerical search strategies such as simulated annealing (Keller *et al*. 2005). On the basis of the examples given in Keller *et al*. (2005) it may be conjectured that the optimal allocation in more complex settings will not deviate dramatically from that derived in this article.

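To study heterosis, one may estimate the dominance ratio, θ = δ/α (Kearsey and Pooni 1996). Using the δ-method (Johnson *et al.* 1993) and exploiting the fact that dominance and additive gene effect estimates are stochastically independent, the approximate variance of the dominance ratio is

$$\operatorname{var}(\hat{\theta}) \approx \frac{\operatorname{var}(\hat{\delta})}{\alpha^{2}} + \frac{\delta^{2}\,\operatorname{var}(\hat{\alpha})}{\alpha^{4}}. \tag{36}$$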

One might consider finding a design that minimizes this variance. This approach is not usually feasible, however, unless *a priori* information is available on both α and δ, which will rarely be the case. The same problem would apply if one were to work with the exact distribution of $\hat{\theta}$, assuming normality (Hinkley 1969), or Fieller's (1954) method (Piepho and Emrich 2005). Thus, it is preferable to optimize the design for contrasts related to either midparent heterosis or better-parent heterosis.
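The dependence on the unknown parameters is easy to see when the first-order δ-method approximation for a ratio of two independent estimates is written out. The sketch below assumes that standard approximation; the function name and the input values are hypothetical.

```python
def dominance_ratio_variance(delta, alpha, var_delta, var_alpha):
    """First-order delta-method variance of delta-hat / alpha-hat.

    Assumes the two estimates are independent and alpha is not close to zero.
    """
    return var_delta / alpha ** 2 + delta ** 2 * var_alpha / alpha ** 4


# Same estimator variances, different (unknown) true effects.
weak_dominance = dominance_ratio_variance(0.5, 2.0, 0.04, 0.02)
strong_dominance = dominance_ratio_variance(3.0, 2.0, 0.04, 0.02)
```

With identical estimator variances, the approximate variance of the ratio changes with δ and α, so a design minimizing it cannot be chosen without prior values for both.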

## APPENDIX

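We here derive Equation 12 for matrix *D*. As we require *n*_{2} = *n*_{3} = *n*_{4} = *n*_{5} = *n*_{0}, the matrix *X*′*X* is given by

$$X'X = \begin{pmatrix} T_{11} & T_{12} \\ T_{12}' & T_{22} \end{pmatrix} \tag{A1}$$

with

$$T_{11} = 2(n_{0}+n_{1})\,I_{2} - n_{1}J_{2}, \tag{A2}$$

$$T_{12} = -n_{0}\mathbf{1}_{2}, \tag{A3}$$

and

$$T_{22} = 2n_{0}+n_{6}, \tag{A4}$$

where 1_{2} = (1, 1)′, *I*_{2} is a 2 × 2 identity matrix, and $J_{2} = \mathbf{1}_{2}\mathbf{1}_{2}'$. Using results on the inverse of a partitioned matrix (Harville 2000, p. 99) we find

$$(X'X)^{-1} = \begin{pmatrix} D & E \\ E' & F \end{pmatrix}, \tag{A5}$$

where

$$D = \left(T_{11} - T_{12}T_{22}^{-1}T_{12}'\right)^{-1}, \tag{A6}$$

$$E = -D\,T_{12}T_{22}^{-1}, \tag{A7}$$

and

$$F = T_{22}^{-1} + T_{22}^{-1}T_{12}'\,D\,T_{12}T_{22}^{-1}. \tag{A8}$$

To study the heterosis contrast (6), it is sufficient to find *D*. Using

$$\left(aI_{2} + bJ_{2}\right)^{-1} = \frac{1}{a}\left(I_{2} - \frac{b}{a+2b}J_{2}\right) \tag{A9}$$

and

$$\mathbf{1}_{2}'\mathbf{1}_{2} = 2 \tag{A10}$$

(Searle *et al.* 1992, p. 443), it can be shown that

$$D = \frac{1}{2(n_{0}+n_{1})}\left[I_{2} + \frac{n_{0}^{2} + n_{1}(2n_{0}+n_{6})}{2n_{0}(n_{0}+n_{6})}\,J_{2}\right]. \tag{A11}$$

Now the least-squares estimator $\hat{\delta}_{BA}$ has variance

$$\operatorname{var}(\hat{\delta}_{BA}) = \frac{\sigma^{2}}{4}\,\mathbf{1}_{2}'D\,\mathbf{1}_{2} = \frac{\sigma^{2}(2n_{0}+n_{6})}{4n_{0}(n_{0}+n_{6})}. \tag{A12}$$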

## Acknowledgments

I thank two anonymous referees for several helpful suggestions.

## Footnotes

Communicating editor: J. B. Walsh

- Received November 11, 2004.
- Accepted May 30, 2005.

- Copyright © 2005 by the Genetics Society of America