## Abstract

Eigenvalues and eigenvectors of covariance matrices are important statistics for multivariate problems in many applications, including quantitative genetics. Estimates of these quantities are subject to different types of bias. This article reviews and extends the existing theory on these biases, considering a balanced one-way classification and restricted maximum-likelihood estimation. Biases are due to the spread of sample roots and arise from ignoring selected principal components when imposing constraints on the parameter space, to ensure positive semidefinite estimates or to estimate covariance matrices of chosen, reduced rank. In addition, it is shown that reduced-rank estimators that consider only the leading eigenvalues and -vectors of the “between-group” covariance matrix may be biased due to selecting the wrong subset of principal components. In a genetic context, with groups representing families, this bias is inversely proportional to the degree of genetic relationship among family members, but is independent of sample size. Theoretical results are supplemented by a simulation study, demonstrating close agreement between predicted and observed bias for large samples. It is emphasized that the rank of the genetic covariance matrix should be chosen sufficiently large to accommodate all important genetic principal components, even though, paradoxically, this may require including a number of components with negligible eigenvalues. A strategy for rank selection in practical analyses is outlined.

TRAITS of interest in quantitative genetics are seldom independent of each other. Hence, in analyses of “complex” phenotypes it is desirable to consider all components simultaneously, in particular when considering the effects of selection and its impact on evolution (Blows and Walsh 2008). However, analyses to estimate genetic parameters are often limited to a few traits only. This can be attributed to the burden imposed by multivariate estimation, due both to computational requirements and to the need for sufficiently large data sets to support accurate estimation of the numerous parameters involved.

By and large, covariance matrices are considered to be unstructured; *i.e*., for *q* traits of interest we have *q*(*q* + 1)/2 distinct variance and covariance components among them. In a genetic context, there are at least two covariance matrices to be determined, namely the covariance matrix due to additive genetic effects and the corresponding matrix due to residual effects. This yields *q*(*q* + 1) parameters to be estimated; *i.e*., the number of parameters increases quadratically with the number of traits considered. Recently, improvements in computing facilities together with advances in the implementation of modern inference procedures, such as residual or restricted maximum likelihood (REML), have made routine multivariate analyses involving numerous traits and large data sets feasible. In addition, availability of corresponding software, specialized toward quantitative genetic analyses fitting the so-called “animal model,” has made analyses conceptually straightforward, even for scenarios with complex pedigrees, many fixed effects, additional random effects, or arbitrary patterns of missing observations.

Yet, the “curse of dimensionality” remains. This has kindled interest in estimation imposing a structure, in particular for genetic covariance matrices; see Meyer (2007a) for a recent review. Principal component (PC) analysis is a widely used method to summarize multivariate information, dating back as far as Pearson (1901) and Hotelling (1933) (both reprinted in Bryant and Atchley 1975). Moreover, PCs are invaluable in reducing the dimension of analyses, *i.e*., the number of variables to be considered. For a set of *q* variables, the PCs are the *q* linear combinations of the variables that are independent of each other and successively explain a maximum amount of variation. Hence, if the (*m* + 1)th PC explains negligible variation, PCs *m* + 1 to *q* convey little information that is not already contained in PCs 1 to *m*. We can then safely ignore PCs *m* + 1 to *q*, reducing the number of variables from *q* to *m*. In many applications, *m* can be considerably smaller than *q*.

The PCs for a set of variables are estimated from the eigen decomposition of the corresponding covariance matrix: The eigenvectors provide the linear functions of the original variables while the corresponding eigenvalues give the amount of variance explained by each PC. PCs are usually given in descending order of the eigenvalues and the matrix of eigenvectors is scaled to be orthonormal. The latter implies that the matrix of eigenvectors (with *q*^{2} distinct elements) is described by *q*(*q* – 1)/2 parameters, with the remaining *q*(*q* + 1)/2 elements defined by the orthonormality constraints. Ignoring PCs *m* + 1 to *q* thus reduces the number of parameters to describe the covariances among the *q* traits to *m*(2*q* – *m* + 1)/2, comprising *m* eigenvalues and the *m*(2*q* – *m* – 1)/2 “free” elements of the first *m* eigenvectors. The resulting covariance matrix, of size *q* × *q*, has reduced rank *m*.

Technically closely related, but based on a somewhat different underlying concept, is factor analysis (FA). While PC analysis aims at identifying variables that explain maximum variation, FA is primarily concerned with finding the common “factors” that cause covariances between traits. Like PCs, the predictors of such factors are independent, linear combinations of the original traits. In addition, FA allows for specific effects that are generally assumed to be uncorrelated. More importantly, FA implies a latent variable model for the original traits—modeling each as the sum of common factors and specific effects—while PC analysis does not. In the general case, specific variances for all *q* traits are assumed to be nonzero and different from each other. Considering a FA model with *m* factors, this yields a full rank covariance matrix modeled by *q* + *m*(2*q* – *m* + 1)/2 parameters. This quantity cannot exceed the number in the unstructured case, *q*(*q* + 1)/2, which limits the maximum number of common factors that can be fitted. If specific effects are assumed to have zero variance, the FA model with *m* factors yields the same, reduced-rank covariance structure as considering the leading *m* PCs only.
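The parameter counts in this and the preceding paragraph are simple arithmetic; the small helper below (ours, for illustration only) makes them explicit and verifies the bound on the number of common factors.

```python
def n_par_unstructured(q):
    # distinct (co)variances of an unstructured q x q covariance matrix
    return q * (q + 1) // 2

def n_par_pc(q, m):
    # m eigenvalues plus the m(2q - m - 1)/2 "free" eigenvector elements
    return m * (2 * q - m + 1) // 2

def n_par_fa(q, m):
    # q specific variances plus the reduced-rank (common factor) part
    return q + n_par_pc(q, m)
```

For *q* = 6, for instance, an FA model with *m* = 3 factors already uses all 21 parameters of the unstructured matrix, so at most three common factors can be fitted.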

Hence FA models are readily incorporated in the linear mixed models commonly fitted for the estimation of genetic parameters. This enables direct estimation of only the leading principal components (*i.e*., factors, assuming no specific effects) of genetic covariance matrices, as proposed by Kirkpatrick and Meyer (2004). Here “direct” refers to estimation directly from the data instead of the more customary two-step procedure that involves estimation of the complete, unstructured genetic covariance matrix before considering its eigen decomposition. For highly correlated traits with PCs that have eigenvalues close to zero, such reduced-rank analyses avoid estimation of unnecessary parameters. This not only makes more efficient use of the data, but also can reduce computational requirements markedly over those in a full rank analysis. Suitable REML algorithms have been described by Thompson *et al*. (2003), Meyer and Kirkpatrick (2005), and Meyer (2008), and Bayesian estimation via Gibbs sampling has been outlined by De los Campos and Gianola (2007).

Classic PC analysis considers a single covariance matrix at a time. Because PCs are orthogonal, the number fitted for a single matrix can be altered successively; *i.e*., estimates of the *i*th PC are expected to remain more or less the same when the number of PCs considered is increased or reduced. For quantitative genetic analyses, however, with at least two covariance matrices estimated simultaneously, this is not necessarily the case. If genetic PCs with nonnegligible eigenvalues are omitted while a full-rank residual covariance matrix is estimated, part of the genetic covariance can be repartitioned into the environmental components, leading to biased estimates. This article examines the properties of estimates of genetic covariance matrices and their eigenvalues and -vectors from reduced-rank analyses, showing that up to three sources of bias may affect estimates.

## THEORETICAL CONSIDERATIONS

Here we present a brief review of the pertinent statistical literature on multivariate estimation, focusing on problems associated with sampling variation and with constraining estimates to the parameter space. Subsequently, we examine the bias arising from estimating genetic covariance matrices of reduced rank.

### Bias due to sampling variance

#### Spread of sample roots:

Consider *N* sets of observations for *q* traits from a multivariate normal distribution with population covariance matrix **Σ**. The sample covariance matrix **S**, estimated from the sums of squares and cross-products among observations, then has a central Wishart distribution. It is well known that the eigenvalues (latent roots) of such a sample covariance matrix are spread farther than the population values. This results in an upward bias of estimates of the largest eigenvalues and a downward bias of estimates of the smallest values, while their mean is expected to be unbiased. Let λ_{i} denote the *i*th eigenvalue of **Σ** and ℓ_{i} the *i*th eigenvalue of **S**. For λ_{i} distinct from all other roots, Lawley (1956) gave the expected value of the corresponding sample root as

E(ℓ_{i}) = λ_{i} + (1/*N*) Σ_{j≠i} λ_{i}λ_{j}/(λ_{i} – λ_{j}) + *O*(1/*N*^{2})  (1)

with *O*(1/*N*^{2}) a residual term proportional to 1/*N*^{2}. Equation 1 shows that the bias decreases inversely proportionally to the sample size, becoming small as *q*/*N* becomes negligible. Furthermore, bias tends to be more pronounced the closer together the population roots are. Lawley (1956) suggested that an estimate of λ_{i} less biased than ℓ_{i} might be obtained from Equation 1, substituting sample for population values, as λ̂_{i} = ℓ_{i} – (1/*N*) Σ_{j≠i} ℓ_{i}ℓ_{j}/(ℓ_{i} – ℓ_{j}).
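The spread of sample roots is easy to demonstrate numerically. The sketch below (illustrative only; the values of *q*, *N*, and the number of replicates are arbitrary) uses an identity population covariance matrix, so that all population eigenvalues equal 1 and any spread of the average sample roots is purely a sampling effect.

```python
import numpy as np

# Spread of sample roots: population covariance = identity, so all
# population eigenvalues equal 1 and any spread is a sampling artifact.
rng = np.random.default_rng(0)
q, N, reps = 5, 50, 500

mean_roots = np.zeros(q)
for _ in range(reps):
    X = rng.standard_normal((N, q))          # N observations, q traits
    S = X.T @ X / (N - 1)                    # sample covariance matrix
    mean_roots += np.sort(np.linalg.eigvalsh(S))[::-1]
mean_roots /= reps                           # average roots, descending
```

The average largest root comes out well above 1 and the average smallest root well below 1, while the mean of all roots stays close to 1, as the theory predicts.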

An obvious alternative to improve estimates of covariance matrices for small samples or of high dimensions is to squeeze the sample eigenvalues together. A number of improved estimators have been proposed, based on minimizing a loss or risk function. These differ in the functions utilized and resulting properties, *e.g*., whether the original order of eigenvalues is maintained or whether the resulting estimates are guaranteed to be nonnegative; see Muirhead (1987) for a review of early work and Srivastava and Kubokawa (1999) for more recent references. Minimizing the squared error loss to determine an optimal amount of shrinkage, Ledoit and Wolf (2004) derived an estimator that regresses sample eigenvalues toward their mean and yields a weighted combination of the sample covariance matrix and an identity matrix. While this has seen diverse applications, including the analysis of high-dimensional genomic data (Schäfer and Strimmer 2005), Daniels and Kass (2001) reported overshrinkage of the smallest roots when eigenvalues were spread far apart and suggested shrinking the log sample eigenvalues toward their posterior mean as an alternative.
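A minimal version of such a shrinkage estimator, with a fixed weight rather than the optimal weight derived by Ledoit and Wolf (2004), can be sketched as follows (our code, for illustration):

```python
import numpy as np

def shrink_toward_identity(S, w):
    """Regress the eigenvalues of S toward their mean.

    Returns w * S + (1 - w) * mean_eigenvalue * I: the eigenvectors of S
    are retained, each eigenvalue moves toward the average, and the
    trace (total variance) is unchanged.
    """
    q = S.shape[0]
    mu = np.trace(S) / q        # mean of the sample eigenvalues
    return w * S + (1 - w) * mu * np.eye(q)

S = np.diag([4.0, 1.0])                      # sample roots 4 and 1
S_shrunk = shrink_toward_identity(S, w=0.5)  # roots become 3.25 and 1.75
```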

Corresponding work has considered the properties of the simultaneous distribution of the roots of two sample covariance matrices, **S**_{1} and **S**_{2}, in particular for the products **S**_{1}**S**_{2}^{−1} or **S**_{1}(**S**_{1} + **S**_{2})^{−1} (*e.g*., Chang 1970; Krishnaiah and Chang 1971; Venables 1973; Muirhead and Verathaworn 1985; Bilodeau and Srivastava 1992). These matrices and their eigenvalues play an important role in the multivariate analysis of variance (MANOVA) and hypothesis testing. However, the asymptotic distributions and expansions given are by and large complicated and cumbersome to evaluate, or derivations are limited to special cases. As in the single-sample case, a number of estimators have been suggested that involve some form of shrinkage or truncation and minimize the quadratic or entropy loss (*e.g*., Dey 1988; Loh 1991; Srivastava and Kubokawa 1999).

#### Analysis of variance:

The simplest scenario for a MANOVA is the balanced one-way classification, and estimation for this case has received substantial attention in the statistical literature. A particular problem is that the standard quadratic, unbiased estimator of the between-group covariance matrix is not guaranteed to be positive semidefinite (p.s.d.), *i.e*., to have eigenvalues that are nonnegative. Let **Σ**_{B} and **Σ**_{W} denote the population matrices of the between- and within-group covariances in a one-way classification, with *s* groups and *n* observations per group. Further, let **W** and **B** be the corresponding matrices of mean squares and cross-products (MSCP) within and between groups, with expected values E(**W**) = **Σ**_{W} and E(**B**) = **Σ**_{W} + *n***Σ**_{B}. Estimates of **Σ**_{W} and **Σ**_{B} are then obtained by equating **B** and **W** to their expectations, **Σ̂**_{W} = **W** and **Σ̂**_{B} = (**B** – **W**)/*n*.

In a quantitative genetic context, groups are families with a degree of genetic relationship of ρ (*e.g*., ρ = 0.25 for half-sib and ρ = 0.50 for full-sib families), and estimates of the genetic (**Σ**_{G}), environmental (**Σ**_{E}), and phenotypic (**Σ**_{P}) covariance matrices are obtained as **Σ̂**_{G} = ρ^{−1}**Σ̂**_{B}, **Σ̂**_{E} = **Σ̂**_{W} – (ρ^{−1} – 1)**Σ̂**_{B}, and **Σ̂**_{P} = **Σ̂**_{W} + **Σ̂**_{B}, respectively. An extensive simulation study by Hill and Thompson (1978) demonstrated that the probability of obtaining nonpositive definite estimates of **Σ**_{B}, and thus **Σ**_{G}, is high, increasing with the number of traits considered and with decreasing sample size, in particular the number of groups. Bhargava and Disch (1982) reported similar results, presenting an analytical method to determine this probability. In addition, Hill and Thompson (1978) showed that this probability does not depend on the individual elements of **Σ**_{B} and **Σ**_{W}, but is determined entirely by the eigenvalues of the product **Σ**_{W}^{−1}**Σ**_{B}.

Hayes and Hill (1981) considered the use of the estimated covariance matrices to derive weights in a genetic selection index. This involves the product **Σ̂**_{P}^{−1}**Σ̂**_{G}. The authors thus proposed to shrink the roots of this product toward their mean to reduce the effect of sampling errors and showed in a simulation study that this could improve the achieved response to selection. Rather than manipulating the roots of **Σ̂**_{P}^{−1}**Σ̂**_{G} directly, Hayes and Hill (1981) modified **W**^{−1}**B**, using that for γ_{i} a root of **W**^{−1}**B**, ρ^{−1}(γ_{i} – 1)/(γ_{i} – 1 + *n*) is a root of **Σ̂**_{P}^{−1}**Σ̂**_{G}. This was referred to as “bending” the matrix of between-group MSCP toward the matrix of within-group MSCP. In particular, the authors suggested choosing the shrinkage or bending factor so that the smallest, modified root of **Σ̂**_{G} was zero or equal to a small positive value. Similarly, if λ_{i} is a root of **Σ**_{W}^{−1}**Σ**_{B}, then 1 + *n*λ_{i} is a root of E(**W**)^{−1}E(**B**) and ρ^{−1}λ_{i}/(λ_{i} + 1) is a root of **Σ**_{P}^{−1}**Σ**_{G}; *i.e*., Hayes and Hill's (1981) procedure is also equivalent to squeezing the roots of **Σ̂**_{P}^{−1}**Σ̂**_{G} together. Note that bending, as proposed originally, relates to two matrices. However, it is often used to describe the corresponding modification of the eigenvalues of a single matrix, a procedure more appropriately termed “squeezing” by Kirkpatrick and Lofsvold (1992, Appendix B).
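A sketch of bending in this sense (our implementation, with an arbitrary fixed bending factor; Hayes and Hill chose the factor so that the smallest modified root was zero or slightly positive):

```python
import numpy as np

def bend(B, W, beta):
    """Shrink the eigenvalues of W^{-1}B toward their mean ("bending").

    beta = 1 leaves B unchanged; beta = 0 moves all roots of W^{-1}B to
    their mean. Returns the bent matrix of between-group MSCP.
    """
    d, E = np.linalg.eigh(W)
    W_half = E * np.sqrt(d) @ E.T              # symmetric square root of W
    W_half_inv = E * (1.0 / np.sqrt(d)) @ E.T
    Q = W_half_inv @ B @ W_half_inv            # same roots as W^{-1}B
    g, EQ = np.linalg.eigh(Q)
    g_bent = beta * g + (1 - beta) * g.mean()  # squeeze roots together
    return W_half @ (EQ * g_bent) @ EQ.T @ W_half

W = np.eye(2)
B = np.diag([3.0, 0.5])
B_bent = bend(B, W, beta=0.5)  # roots 3.0 and 0.5 become 2.375 and 1.125
```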

Earlier, Klotz and Putter (1969) derived maximum-likelihood (ML) and REML estimators of **Σ**_{B} for the balanced one-way classification that were constrained to be p.s.d. Anderson *et al*. (1986) extended this to a p.s.d. ML estimator for **Σ**_{B} of maximum rank, while Amemiya (1985) simply considered how to modify the estimated matrix of between-group components when estimates were not positive definite. In essence, all these again considered **W**^{−1}**B** and its latent roots. Rather than applying shrinkage, however, any eigenvalues less than unity were fixed at unity, yielding a p.s.d. estimator of **Σ**_{B} of rank equal to the number of eigenvalues greater than unity. This is equivalent to setting any negative eigenvalues of **Σ̂**_{B} to zero, *i.e*., results in the nearest symmetric p.s.d. matrix in the Frobenius norm (Higham 1988). For ML the divisor used to obtain **B** from the sums of squares and cross-products of group means is *s*, rather than *s* – 1 as in the MANOVA.
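Truncating eigenvalues at zero is simple to express in code; a sketch (ours) of the nearest symmetric p.s.d. matrix in the Frobenius norm:

```python
import numpy as np

def nearest_psd(A):
    """Nearest symmetric p.s.d. matrix in the Frobenius norm (Higham
    1988): truncate any negative eigenvalues of the symmetric input A
    at zero and reconstruct."""
    d, E = np.linalg.eigh(A)
    return E * np.maximum(d, 0.0) @ E.T

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])     # eigenvalues 3 and -1: not p.s.d.
A_psd = nearest_psd(A)         # eigenvalues 3 and 0: rank 1
```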

In the balanced case, REML estimators without restrictions on the parameter space have closed form and are identical to their counterparts from (M)ANOVA (*e.g*., Corbeil and Searle 1976; Lee and Kapadia 1984). For normally distributed data, restricting the eigenvalues of **W**^{−1}**B** in the one-way classification to a minimum of unity is in fact equivalent to REML estimation of **Σ**_{B} and **Σ**_{W} constrained to the parameter space (Amemiya 1985, Appendix). Note, however, that in a genetic context where we estimate **Σ**_{E} as **Σ̂**_{W} – (ρ^{−1} – 1)**Σ̂**_{B}, this does not guarantee that **Σ̂**_{E} is positive definite (except for clones, *i.e*., ρ = 1). In contrast, animal model REML analyses estimate **Σ**_{G} and **Σ**_{E} directly, constraining both estimates to have positive eigenvalues, both for full-rank multivariate analyses and for reduced-rank estimation imposing a FA structure on **Σ**_{G} (Meyer and Kirkpatrick 2005). While yielding the same estimates if **Σ̂**_{E} is positive definite, the two approaches are thus not strictly equivalent.

Corresponding arguments apply for more complicated scenarios. For instance, extensions to nested two-way classifications have been examined by Hill and Thompson (1978), Amemiya (1985), and Das (1996). Similarly, Calvin and Dykstra (1991, 1992) considered constrained ML and REML estimation for the class of MANOVA models in which restrictions can be formulated as nonnegativity of matrices of MSCP and of pairwise differences between matrices of expected MSCP. This class includes all nested and two-factor models. This suggests that the same mechanisms are operational in analyses utilizing several types of covariances between groups simultaneously, such as animal model REML analyses to estimate genetic variances for complex pedigrees.

### Bias due to reduced-rank estimation

Reduced-rank estimators of covariance matrices yield estimates with one or more eigenvalues equal to zero, so that their rank is less than their dimension. Amemiya (1985) proposed to constrain estimated covariance matrices between groups to be p.s.d. by discarding any PCs with negative eigenvalues. More recently, generalized reduced-rank estimators have been suggested to reduce the number of parameters to be estimated (Kirkpatrick and Meyer 2004; Meyer and Kirkpatrick 2005). This section examines the bias in reduced-rank estimators for a balanced one-way classification. After reviewing the canonical transformation on which the estimators are based, we describe the estimators themselves and show that they are subject to two sources of bias.

#### Canonical transformation:

As outlined above, estimates of the genetic covariance matrix are obtained by estimating the between-group covariance matrix, **Σ**_{B}. Examination of the relationship between the two matrices involved, **B** and **W**, is made easier by applying a transformation so that

**Σ̂**_{B} = (**B** – **W**)/*n* = **T**(**Λ**_{Q} – **I**)**T**′/*n*  (2)

with **Λ**_{Q} the matrix of eigenvalues of **W**^{−1}**B**.

This is the so-called canonical decomposition or transformation: For any two symmetric (real) matrices, **W** and **B**, of size *q* × *q* with **W** positive definite and **B** p.s.d. there exists a matrix **T** such that **TT**′ = **W** and **TDT**′ = **B**, with **D** diagonal (Anderson 1984). The canonical transformation has been used to simplify various multivariate problems in quantitative genetics, such as examination of selection indexes (Hayes and Hill 1980), and to reduce computational requirements in genetic evaluation (Ducrocq and Chapuis 1997) or estimation of genetic parameters (Meyer 1985). In addition, it has been instrumental in the derivation of constrained REML or ML estimators of covariance matrices, reviewed above. A detailed description of the computational steps involved can be found in Seal (1964) or Amemiya (1985).

The transformation matrix **T** is given by

**T** = **W**^{1/2}**E**_{Q}  (3)

with **W**^{1/2} a matrix square root of **W** and **E**_{Q} the matrix of eigenvectors of

**Q** = (**W**^{1/2})^{−1}**B**(**W**^{1/2}′)^{−1}.  (4)

Let **W** = **E**_{W}**Λ**_{W}**E**_{W}′ and **B** = **E**_{B}**Λ**_{B}**E**_{B}′ represent the eigen-decompositions of **W** and **B**. Matrix square roots are not uniquely defined. Suitable forms are **W**^{1/2} = **E**_{W}**Λ**_{W}^{1/2}**E**_{W}′ or, as **E**_{W} is orthonormal, **W**^{1/2} = **E**_{W}**Λ**_{W}^{1/2}, or the Cholesky factor of **W**. For **W**^{1/2} = **E**_{W}**Λ**_{W}^{1/2}, **Q** becomes

**Q** = **Λ**_{W}^{−1/2}**E**_{W}′**E**_{B}**Λ**_{B}**E**_{B}′**E**_{W}**Λ**_{W}^{−1/2}.  (5)
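The computational steps above can be sketched numerically (our code, using the nonsymmetric square root **W**^{1/2} = **E**_{W}**Λ**_{W}^{1/2}):

```python
import numpy as np

def canonical_transform(W, B):
    """Canonical decomposition: return T and lam_Q such that
    T T' = W and T diag(lam_Q) T' = B, with lam_Q in descending order.
    W must be positive definite and B symmetric."""
    d, E = np.linalg.eigh(W)
    W_half = E * np.sqrt(d)                  # W^{1/2} = E_W Lam_W^{1/2}
    W_half_inv = (E / np.sqrt(d)).T          # its inverse
    Q = W_half_inv @ B @ W_half_inv.T
    lam_Q, E_Q = np.linalg.eigh(Q)
    idx = np.argsort(lam_Q)[::-1]            # descending canonical roots
    lam_Q, E_Q = lam_Q[idx], E_Q[:, idx]
    return W_half @ E_Q, lam_Q

rng = np.random.default_rng(1)
X, Y = rng.standard_normal((10, 3)), rng.standard_normal((10, 3))
W, B = X.T @ X, Y.T @ Y                      # random p.d. test matrices
T, lam_Q = canonical_transform(W, B)
```

The two defining identities, **TT**′ = **W** and **TΛ**_{Q}**T**′ = **B**, can be verified directly on the returned matrices.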

#### Reduced-rank estimators:

From Equation 2 it follows directly that **Σ̂**_{B} has negative eigenvalues if **Q** has any eigenvalues λ_{Qi} less than unity. Amemiya (1985) proposed an estimator, constrained to be p.s.d., that utilizes this relationship.

##### Constrained estimator:

Assume that **Q** has *m*_{0} eigenvalues λ_{Qi} greater than or equal to unity, with the remaining *q* – *m*_{0} eigenvalues less than unity. We can then think of **Σ̂**_{B} in Equation 2 as the sum of a p.s.d. matrix **M** and a negative definite matrix **M***,

**Σ̂**_{B} = **M** + **M*** = (1/*n*) Σ_{i=1}^{m_{0}} (λ_{Qi} – 1)**t**_{i}**t**_{i}′ + (1/*n*) Σ_{i=m_{0}+1}^{q} (λ_{Qi} – 1)**t**_{i}**t**_{i}′  (6)

where **t**_{i} denotes the *i*th column of **T**. A natural interpretation is then to consider **M** as an estimator of **Σ**_{B} and to regard **M*** as sampling noise; *i.e*., an estimator of **Σ**_{B} constrained to be p.s.d. is obtained by omitting **M*** (Amemiya 1985). This gives an estimator that has rank *m*, with *m* ≤ *m*_{0} the number of roots of **Q** that are >1:

**Σ̃**_{B} = (1/*n*) Σ_{i=1}^{m} (λ_{Qi} – 1)**t**_{i}**t**_{i}′.  (7)

The corresponding estimator for **Σ**_{W} is the combination of **W** and the portion of **B** not used to estimate **Σ**_{B}, each weighted by the appropriate degrees of freedom:

**Σ̃**_{W} = [*s*(*n* – 1)**W** + (*s* – 1)(**B** – *n***Σ̃**_{B})]/(*sn* – 1).  (8)

It is readily shown that (*sn* – 1)**Σ̃**_{W} + *n*(*s* – 1)**Σ̃**_{B} = *s*(*n* – 1)**W** + (*s* – 1)**B**, *i.e*., that the new estimators comprise the total sums of squares and cross-products.
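Equation 7 translates directly into code. In the sketch below (ours), `m=None` gives the constrained estimator, retaining all canonical roots above unity, while an explicit `m` gives the general reduced-rank estimator described next:

```python
import numpy as np

def reduced_rank_sigma_b(B, W, n, m=None):
    """Reduced-rank estimate of the between-group covariance matrix:
    (1/n) * sum over retained PCs of (lam_Qi - 1) t_i t_i'."""
    d, E = np.linalg.eigh(W)
    W_half = E * np.sqrt(d)
    W_half_inv = (E / np.sqrt(d)).T
    Q = W_half_inv @ B @ W_half_inv.T
    lam_Q, E_Q = np.linalg.eigh(Q)
    idx = np.argsort(lam_Q)[::-1]             # descending canonical roots
    lam_Q, E_Q = lam_Q[idx], E_Q[:, idx]
    T = W_half @ E_Q                          # columns t_i
    keep = lam_Q > 1.0                        # roots that exceed unity
    if m is not None:
        keep &= np.arange(len(lam_Q)) < m     # at most the leading m PCs
    w = np.where(keep, (lam_Q - 1.0) / n, 0.0)
    return (T * w) @ T.T

B, W, n = np.diag([5.0, 2.0, 0.5]), np.eye(3), 2
S_b_constrained = reduced_rank_sigma_b(B, W, n)    # rank 2 here
S_b_rank1 = reduced_rank_sigma_b(B, W, n, m=1)     # rank 1
```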

##### General reduced-rank estimators:

Amemiya (1985) considered *m* and *m*_{0} to be determined by the sample. In some circumstances, however, we may wish to select a specific value of *m* for a particular analysis (Kirkpatrick and Meyer 2004). Such a general reduced-rank estimator of **Σ**_{B} is obtained by substituting *m* for *m*_{0} in Equation 7. Note that, for simplicity of argument, we assume here that *m* ≤ *m*_{0}. In practice, we may have fewer than *m* values λ_{Qi} ≥ 1; *i.e*., strictly speaking, we can choose only an estimator of **Σ**_{B} that is p.s.d. and has at most a rank of *m*.

#### Bias due to rank reduction:

It is well known, though often ignored, that constraining REML estimates of (co)variance components to the parameter space gives biased results. For instance, if the population value for a variance is zero, constraining estimates to be nonnegative yields estimates with expectation greater than zero. Similarly, estimates of covariance matrices forced to be p.s.d. are biased. From Equations 7 and 8, the expected values of the estimators are

E(**Σ̃**_{B}) = **Σ**_{B} – E(**M***) and E(**Σ̃**_{W}) = **Σ**_{W} + [*n*(*s* – 1)/(*sn* – 1)]E(**M***)  (9)

*i.e*., they are biased in proportion to E(**M***). The amount and sign of this bias depend on how the eigenvalues of **Q** used to obtain **Σ̃**_{B} are determined.

For the constrained estimator (Amemiya 1985), **M*** is negative definite, as λ_{Qi} < 1 for all *i* > *m*_{0}. Thus, from Equation 9, the diagonal elements of **Σ̃**_{B} are biased upward, and those of **Σ̃**_{W} are correspondingly biased downward. For the general, reduced-rank estimator, depending on the choice of *m*, we may ignore principal components with nonnegligible eigenvalues. In the simplest scenario, *m*_{0} = *q*; *i.e*., all λ_{Qi} are greater than or equal to unity. Consequently, for *m* < *m*_{0}, diagonal elements of **Σ̃**_{B} and thus tr(**Σ̃**_{B}) are biased downward, while tr(**Σ̃**_{W}) is biased upward [with tr(·) denoting the trace operator, *i.e*., the sum of the diagonal elements of a matrix]. As the trace of a matrix is equal to the sum of its eigenvalues, this implies that the estimated eigenvalues are biased.

The general, reduced-rank estimator is inconsistent if the chosen rank *m* is less than the true rank of **Σ**_{B} (Remadi and Amemiya 1994); *i.e*., it does not converge to its true value as sample size increases. This is perhaps not surprising, since it is the nature of the reduced-rank estimator that it discards part of the quantity that it seeks to estimate. A heuristic argument for this inconsistency is as follows. From Equation 9, the bias in **Σ̃**_{B} is

–E(**M***) = –(1/*n*)E[Σ_{i=m+1}^{q} (λ_{Qi} – 1)**t**_{i}**t**_{i}′] = –E[Σ_{i=m+1}^{q} λ_{i}**t**_{i}**t**_{i}′]  (10)

(using that λ_{Qi} = 1 + *n*λ_{i}). One can show that the expected values of λ_{i} and **t**_{i} are independent of sample size. Our conclusion is therefore that, for fixed *m*, the bias in **Σ̃**_{B} does not decline as the sample size increases. In contrast, the constrained estimator of Amemiya (1985) does not have this problem of inconsistency, as *m* is not fixed but rather converges on *q* as the sample size increases, causing **M*** to vanish.

For genetic analyses, with **Σ̃**_{G} = ρ^{−1}**Σ̃**_{B} and **Σ̃**_{E} = **Σ̃**_{W} – (ρ^{−1} – 1)**Σ̃**_{B}, the corresponding biases are

–ρ^{−1}E(**M***) and [ρ^{−1} – (*n* – 1)/(*sn* – 1)]E(**M***), respectively.  (11)

Hence, the bias in both **Σ̃**_{G} and **Σ̃**_{E} is inversely proportional to the degree of relationship among family members. For **Σ̃**_{E} this is tempered by a downward bias proportional to the group size, which results in a slight downward bias in the estimate of the phenotypic covariance matrix.

#### Bias due to subset selection:

In addition, reduced-rank estimators of **Σ**_{B} can suffer from another source of bias, introduced when the estimation procedure retains—or “picks up”—one or more of the PCs of **Σ**_{B} belonging to the subset that, on the basis of the size of the corresponding eigenvalues, should have been discarded.

To gain further insight, consider the scenario where genetic and environmental eigenvectors are collinear. Let **Σ**_{G} = **EΛ**_{G}**E**′, with **E** the matrix of eigenvectors and **Λ**_{G} the diagonal matrix of genetic eigenvalues, in descending order. Further, let **Σ**_{E} = **EΛ**_{E}**E**′, with **Λ**_{E} the matrix of environmental eigenvalues, in appropriate order. Assume an infinitely large sample, so that **W** and **B** are equal to their population values; *i.e*., **W** = **Σ**_{E} + (1 – ρ)**Σ**_{G} and **B** = **Σ**_{E} + (1 + (*n* – 1)ρ)**Σ**_{G}. Consequently, **Q** (Equation 4) is a diagonal matrix with elements λ_{Qi} = (λ_{Ei} + (1 + (*n* – 1)ρ)λ_{Gi})/(λ_{Ei} + (1 – ρ)λ_{Gi}). These are the eigenvalues of **Q**, and the corresponding matrix of eigenvectors is an identity matrix. However, unless the diagonal elements of **Q** are in strictly descending order, the sequence of both eigenvalues and eigenvectors is changed when arranging the λ_{Qi} in descending order of magnitude. This results in **E**_{Q} being equal to an identity matrix with permuted columns. Only if **E**_{Q} is equal to a nonpermuted identity matrix are the leading PCs—eigenvalues and corresponding eigenvectors—of **Σ**_{G} estimated correctly, regardless of the order of fit, and only then is the bias in tr(**Σ̃**_{G}) proportional to the smallest eigenvalues λ_{Gm+1} to λ_{Gq}. When increasing the rank of **Σ̃**_{G} by one, this results in an increase in tr(**Σ̃**_{G}) equal to the estimate of the last genetic eigenvalue fitted. In all other scenarios, the permutations will cause us to pick up the wrong PC in at least some reduced-rank analyses. Generalizations are difficult, but in essence, we estimate the first *k* ≤ *m* PCs correctly from an analysis of rank *m* if the subset of *m* PCs considered comprises all PCs from 1 to *k*, *i.e*., if the first *m* columns of **E**_{Q} include all *k* elementary vectors **u**_{j} for *j* = 1, …, *k* (with **u**_{j} a vector of length *q* with a single nonzero element of unity in position *j*).
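The permutation argument can be checked numerically. With hypothetical eigenvalues (chosen for illustration only) for half-sib families, the canonical roots λ_{Qi} can come out in the reverse order of the genetic eigenvalues, so a rank-1 analysis would pick up the smallest genetic PC:

```python
import numpy as np

# Hypothetical collinear example: half-sib families (rho = 0.25) of
# size n = 10; the eigenvalues below are arbitrary illustrative values.
rho, n = 0.25, 10
lam_G = np.array([4.0, 1.0, 0.1])    # genetic eigenvalues, descending
lam_E = np.array([20.0, 1.0, 0.05])  # environmental eigenvalues

# canonical roots for collinear eigenvectors (formula in the text)
lam_Q = (lam_E + (1 + (n - 1) * rho) * lam_G) / (lam_E + (1 - rho) * lam_G)

# the ranking of the canonical roots decides which genetic PCs a
# reduced-rank analysis retains first
order = np.argsort(lam_Q)[::-1]
```

Here the largest canonical root belongs to the third (smallest) genetic PC, so the ranking is completely reversed.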

##### Example:

For illustration, say we have *q* = 3 and **Q** with diagonal elements λ_{Q1} < λ_{Q2} < λ_{Q3}. Rearranging eigenvalues and -vectors in descending magnitude of the λ_{Qi} then gives **Λ**_{Q} with elements λ_{Q3} in position (1, 1), λ_{Q2} in position (2, 2), and λ_{Q1} in position (3, 3), while the columns of **E**_{Q} are equal to the corresponding elementary vectors; *i.e*., **E**_{Q} has nonzero elements of unity in positions (1, 3), (2, 2), and (3, 1). Let **e**_{i} denote the *i*th column of **E**. With **Σ**_{G} = Σ_{i} λ_{Gi}**e**_{i}**e**_{i}′, for *m* = 1 this gives (from Equation 7) **Σ̃**_{G} = λ_{G3}**e**_{3}**e**_{3}′; *i.e*., we would obtain an estimate of λ_{G1} equal to the true value for λ_{G3}, with the corresponding estimated eigenvector at an angle of 90° to the true first eigenvector. In other words, our estimate of the first PC would be equal to the third PC. Similarly, for *m* = 2 estimates λ̃_{G1} and λ̃_{G2} would be equal to λ_{G2} and λ_{G3}, respectively, with both estimated eigenvectors orthogonal to the true vectors. The sum of estimated genetic eigenvalues would increase from λ_{G3} for *m* = 1 to λ_{G2} + λ_{G3} for *m* = 2. Only for *m* = 3 would we estimate λ_{G1} correctly. Hence, we would severely underestimate tr(**Σ**_{G}) in both reduced-rank analyses and would be likely to conclude that all *q* PCs were required to model **Σ**_{G} adequately. Paradoxically, this would be independent of the size of λ_{Q3}, *i.e*., holding even if the last genetic eigenvalue were close to zero and thus explained negligible variation.

In a more general scenario, genetic and environmental eigenvectors are not the same, **E**_{G} ≠ **E**_{E}, and the effect of permutations is reduced. To investigate this situation, it is convenient to describe the orientation of a set of eigenvectors in terms of the angles by which they deviate from the axes [the so-called “Givens angles” (*e.g*., Pinheiro and Bates 1996)]. It is easy to see that for *q* = 2 a single angle describes the orientation of the first eigenvector relative to the *x*-axis and, simultaneously, the orientation of the second eigenvector relative to the *y*-axis. For an arbitrary number of dimensions, *q*(*q* – 1)/2 angles are required to describe the orientation of all the eigenvectors. Any orthonormal matrix, such as a matrix of eigenvectors, can be written as a product of rotation matrices **R**(α_{ij}), where α_{ij} is the angle of rotation in the plane defined by the *i*th and *j*th axes. **R**(α_{ij}) has diagonal elements *r*_{ii} = *r*_{jj} = cos(α_{ij}) and *r*_{kk} = 1 for all *k* ≠ *i*, *j*, off-diagonal elements *r*_{ij} = –sin(α_{ij}) and *r*_{ji} = sin(α_{ij}), and all other elements zero. When all angles α_{ij} are 0, the eigenvectors coincide with the axes and all traits are uncorrelated. A useful property of the rotation matrices is that **R**(α_{ij})′**R**(β_{ij}) = **R**(α_{ij} – β_{ij})′. This means that the product **E**_{W}′**E**_{B} in Equation 5 depends only on the differences in the corresponding angles between the matrices **E**_{W} and **E**_{B}.

This can be used to examine how a disparity in the orientation of the genetic and environmental eigenvectors affects reduced-rank estimation for the case of *q* = 2. Let α_{W} and α_{B} denote the single angles describing the orientation of **E**_{W} and **E**_{B}, respectively, and let δ = 2(α_{W} – α_{B}). For λ_{Wi} and λ_{Bi} the eigenvalues of **W** and **B**, this gives (from Equation 5)

*q*_{11} = [λ_{B1}(1 + cos δ) + λ_{B2}(1 – cos δ)]/(2λ_{W1}), *q*_{22} = [λ_{B1}(1 – cos δ) + λ_{B2}(1 + cos δ)]/(2λ_{W2}), and *q*_{12} = *q*_{21} = –(λ_{B1} – λ_{B2})sin δ/(2(λ_{W1}λ_{W2})^{1/2}).

As discussed above, for equal angles (α_{B} = α_{W}), **Q** is a diagonal matrix with elements λ_{Bi}/λ_{Wi}, and any bias is determined by the relative magnitude of these ratios. Similarly, **Q** is diagonal for equal roots of **B**, λ_{B1} = λ_{B2}. Any difference in angles then does not affect the estimate of λ_{G1} from a reduced-rank (*m* = 1) analysis, although the estimated direction of the corresponding eigenvector may be distorted if λ_{W1} ≠ λ_{W2}.
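These special cases can be verified numerically for *q* = 2 by constructing **Q** directly from the angles and eigenvalues (our sketch, with arbitrary values, using the square root **W**^{1/2} = **E**_{W}**Λ**_{W}^{1/2} as in Equation 5):

```python
import numpy as np

def rot(a):
    """2 x 2 rotation (Givens) matrix for angle a (in radians)."""
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s], [s, c]])

def Q_from_angles(lam_W, lam_B, a_W, a_B):
    """Q of Equation 5 for q = 2, with W and B parameterized by their
    eigenvalues and the angles of their eigenvectors."""
    B = rot(a_B) @ np.diag(lam_B) @ rot(a_B).T
    inv_sqrt = np.diag(1.0 / np.sqrt(lam_W))
    return inv_sqrt @ rot(a_W).T @ B @ rot(a_W) @ inv_sqrt

# equal angles: Q is diagonal with elements lam_Bi / lam_Wi
Q_equal_angles = Q_from_angles([2.0, 1.0], [3.0, 0.5], 0.3, 0.3)
# equal roots of B: Q is diagonal whatever the angles are
Q_equal_roots = Q_from_angles([2.0, 1.0], [0.7, 0.7], 0.0, 0.9)
```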

For practical analyses, the effects of sampling variation may modify the bias observed or yield biased estimates when population values would not predict so. This is illustrated in Figure 1, which shows the distribution of estimates of λ_{G1} and the angle (θ_{1}) between the corresponding true and estimated eigenvectors for two traits with equal phenotypic variances (of σ^{2}_{P} = 100) and heritabilities (of *h*^{2} = 0.4). Due to the equality of the phenotypic variances and heritabilities, the first eigenvector of both **Σ**_{G} and **Σ**_{E} (or, equivalently, **Σ**_{B} and **Σ**_{W}) is characterized by an angle of 45°; *i.e*., α_{B} = α_{W}, independent of the level of correlations. Assuming a genetic correlation of *r*_{G} = 0.6, population values for the genetic roots are λ_{G1} = σ^{2}_{P}*h*^{2}(1 + *r*_{G}) = 64 and λ_{G2} = σ^{2}_{P}*h*^{2}(1 – *r*_{G}) = 16. For an environmental correlation of *r*_{E} = 0.58 this gives λ_{Q1} = 1.672 and λ_{Q2} = 1.645. For such population values (λ_{Q1} > λ_{Q2}) we expect to estimate the first PC correctly in a reduced-rank analysis fitting this component only (*m* = 1). However, with considerable sampling variation, we pick up the wrong PC in a substantial number of cases. This results in a bimodal frequency distribution for λ̃_{G1}, with modes close to the true population values for λ_{G1} and λ_{G2}.

## MATERIALS AND METHODS

### Simulation:

A simulation study was conducted to examine the joint effects of sampling variation, constraints on the parameter space, and bias due to reduced-rank estimation on estimates of genetic PCs, contrasting sample results with values predicted from the population parameters.

#### Samples:

The simulation assumed data with a paternal half-sib structure, comprising *s* sire families of size *n*. Simulations were carried out by sampling matrices of between- and within-family MSCP, **B** and **W**, from central Wishart distributions with respective degrees of freedom of *s* – 1 and *s*(*n* – 1), as described by Odell and Feiveson (1966). Combinations considered were *s* = 5000 with *n* = 40 (referred to as S5000N40) for a very large sample, *s* = 1000 with *n* = 20 or *n* = 6 (S1000N20 or S1000N6) for (moderately) large samples, *s* = 200 with *n* = 20 or *n* = 6 (S200N20 or S200N6) for moderately small samples, and *s* = 100 with *n* = 20 or *n* = 6 (S100N20 or S100N6) for a small sample size. A total of 10,000 replicates were carried out for each combination of population values and sample size.
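Rather than sampling Wishart matrices directly via the algorithm of Odell and Feiveson (1966), an equivalent (if slower) sketch simulates data from the one-way model and forms the MSCP matrices explicitly; this is our illustration, not the procedure used in the study, and it requires both covariance matrices to be positive definite for the Cholesky factorization:

```python
import numpy as np

def sample_mscp(sigma_b, sigma_w, s, n, rng):
    """One replicate of between- and within-group MSCP matrices,
    obtained by simulating data from the balanced one-way model."""
    Lb = np.linalg.cholesky(sigma_b)
    Lw = np.linalg.cholesky(sigma_w)
    q = sigma_b.shape[0]
    group = rng.standard_normal((s, q)) @ Lb.T       # group effects
    resid = rng.standard_normal((s, n, q)) @ Lw.T    # residuals
    y = group[:, None, :] + resid                    # data: s x n x q
    gmean = y.mean(axis=1)                           # group means
    dev_w = y - gmean[:, None, :]
    W = np.einsum('sni,snj->ij', dev_w, dev_w) / (s * (n - 1))
    dev_b = gmean - gmean.mean(axis=0)
    B = n * dev_b.T @ dev_b / (s - 1)
    return B, W

# E(W) = Sigma_W = I and E(B) = Sigma_W + n * Sigma_B = 4I here
B, W = sample_mscp(0.5 * np.eye(2), np.eye(2), s=2000, n=6,
                   rng=np.random.default_rng(3))
```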

##### Population values:

The number of traits considered was *q* = 6 throughout. Population values for genetic and environmental covariance matrices were parameterized in terms of their eigenvalues and Givens angles (α_{G} and α_{E}) to determine the corresponding eigenvectors. Six different scenarios were considered, chosen to represent different spreads of eigenvalues, and thus rates of decline in genetic roots, and different ratios of genetic to environmental roots; the absolute values are not important. Genetic eigenvalues for case A were 53, 52, 51, 49, 48, and 47, *i.e*., they represented a scenario with very similar eigenvalues. For case B, population roots were moderately spread, with values for λ_{Gi} of 100, 75, 50, 35, 25, and 15. Cases C, E, and F assumed values of 175, 50, 35, 20, 15, and 5, *i.e*., a fairly wide spread, and case D, with values of 220, 45, 15, 10, 7, and 3, mimicked an even more extreme spread.

This yielded Σ_{G} of true rank *q* and the same trace and average eigenvalue of 50 for all scenarios. For cases A–D, eigenvalues of Σ_{E} were simulated as twice the value of their genetic counterparts, which yielded equal λ_{Qi} (*i* = 1, *q*). For cases E and F, λ_{Ei} = 2λ_{Gi} – 5 and λ_{Ei} = 5λ_{Gi} – 5 were used. This was aimed at creating constellations where population values for λ_{Qi} were different and in an order that was likely to cause reduced-rank estimates to pick up the wrong subset of principal components, while still allowing sampling variation to have an effect. Again, the choice of actual values was somewhat arbitrary and was made after considering a range of possibilities. In addition, a scenario where Σ_{G} had a true rank of *q*/2 = 3 was simulated for all six constellations, by setting λ_{Gi} = 0 for *i* = 4, 6.

Motivated by the analytic results, we also examined the effects of the orientation of the eigenvectors of Σ_{G} and Σ_{E}. As noted earlier, with *q* variables, *q*(*q* – 1)/2 angles are needed to describe the orientation of a set of eigenvectors. For simplicity, we assumed all these angles were equal to α_{G} for the genetic eigenvectors. For cases A–D we set α_{G} = 0°, which is equivalent to all genetic correlations being zero. For cases E and F, we took α_{G} = 45°, meaning that the genetic principal components point in directions that lie between the trait axes. All angles for the environmental eigenvectors were also assumed equal. These were expressed as differences relative to the genetic eigenvectors; thus α_{E} = 0° means that the environmental and genetic eigenvectors coincide. Values ranging from α_{E} = –45° to α_{E} = 90° were used to parameterize Σ_{E}.
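The construction of population matrices from eigenvalues and Givens angles can be sketched as follows; the ordering of the rotation planes is an assumption here, as it is not restated in the text, and all *q*(*q* – 1)/2 angles are set equal, as in the simulation:

```python
import numpy as np

def givens_rotation(q, i, j, theta):
    """q x q identity with a rotation by theta (radians) in the (i, j) plane."""
    G = np.eye(q)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

def covariance_from_angles(eigenvalues, angle_deg):
    """Covariance matrix with the given eigenvalues and an eigenvector matrix
    built as a product of q(q-1)/2 Givens rotations, all with the same angle."""
    q = len(eigenvalues)
    E = np.eye(q)
    for i in range(q - 1):
        for j in range(i + 1, q):
            E = E @ givens_rotation(q, i, j, np.radians(angle_deg))
    return E @ np.diag(eigenvalues) @ E.T

# An angle of 0 gives a diagonal matrix, i.e., all correlations zero (cases A-D)
Sigma_G = covariance_from_angles([100, 75, 50, 35, 25, 15], 0.0)
```

Any nonzero common angle leaves the eigenvalues (and hence the trace of 300) unchanged while rotating the eigenvectors away from the trait axes.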

#### Analyses:

##### Estimation:

For each replicate, REML estimates of Σ_{G} and Σ_{E} were obtained for Σ_{G} of rank *m* = 1, *q* and Σ_{E} of rank *q*, using Amemiya's (1985) procedure (see Equations 7 and 8 above) to derive the corresponding estimates. If this yielded an estimate of Σ_{G} that was not positive semidefinite, a derivative-free search strategy was employed to maximize the REML log likelihood (Equation 12), constraining Σ_{E} to have full rank and Σ_{G} to have rank no larger than *m*. While computationally more demanding than a maximization procedure using derivatives of the likelihood, this was chosen for its ease of implementation.

##### Summary statistics:

Bias in estimates of Σ_{G} was quantified by considering the estimated PCs, *i.e*., both eigenvalues and eigenvectors. Rather than examining the *m*(2*q* – *m* – 1)/2 individual elements or Givens angles of the estimated eigenvectors, differences between estimates and population values were summarized by considering the *m* angles between estimated and corresponding population vectors. For **e**_{i} the *i*th eigenvector and ê_{i} its estimate, the angle (in degrees) is θ_{i} = (180/π) arccos(|**e**′_{i}ê_{i}|). Taking the absolute value of the inner product projects all angles to the first quadrant; *i.e*., θ_{i} has a value between 0° and 90°.

The effect of sampling and bias on the estimates of covariance matrices was summarized by considering the Frobenius norm of the matrix difference between estimate and population value (*X* = *G*, *E*) [for a matrix **M** with elements *m _{ij}*, the Frobenius norm is ‖**M**‖ = (Σ_{i}Σ_{j} *m _{ij}*^{2})^{1/2} (Golub and Van Loan 1996)]. For all statistics, means and standard deviations across replicates were calculated. In addition, mean square errors (MSE) were obtained as the average (over replicates) of squared deviations of individual parameter estimates from the corresponding population values.
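Both summary statistics are simple to compute; a sketch (`e` and `e_hat` denote a population eigenvector and its estimate):

```python
import numpy as np

def angle_deg(e, e_hat):
    """Angle between true and estimated eigenvectors, folded into [0, 90]
    degrees by taking the absolute value of the inner product."""
    c = abs(np.dot(e, e_hat)) / (np.linalg.norm(e) * np.linalg.norm(e_hat))
    return np.degrees(np.arccos(min(c, 1.0)))  # guard against rounding > 1

def frobenius_error(Sigma_hat, Sigma):
    """Frobenius norm of the matrix difference, sqrt(sum_ij m_ij^2)."""
    return np.sqrt(np.sum((Sigma_hat - Sigma) ** 2))
```

Note that a sign flip of an estimated eigenvector gives an angle of 0°, not 180°, as intended by the projection onto the first quadrant.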

Predicted values for estimates of λ_{Gi} and the corresponding θ_{i} were obtained from the eigendecomposition given in Equation 11, evaluated at the population values for the given rank *m*.

##### Rank tests:

For each sample, the rank of Σ_{G}, *i.e*., the number of eigenvalues significantly different from zero, was determined using several procedures. Where applicable, an error probability of 5% was used. On the basis of Equation 12, likelihood-ratio tests (LRT) were carried out, comparing minus twice the difference in log likelihood between analyses fitting *m* (for *m* = 2, *q*) and *m* – 1 genetic PCs against a χ^{2}-criterion with *q* – *m* + 1 d.f. In addition, the LRT described by Anderson *et al*. (1986), testing the hypothesis that Σ_{G} in the balanced one-way classification had rank of at most *m* *vs*. the alternative that it had rank greater than *m*, was applied. Quantiles of the distribution of the test statistic, which does not follow a χ^{2}-distribution, have been tabulated by Amemiya *et al*. (1990) and Kuriki (1993). The test criterion, *Y*, is a function of the eigenvalues λ_{Qi} of **W**^{−1}**B** greater than unity (from Amemiya *et al*. 1990). If *Y* exceeds the tabulated value for a probability of 0.95 and rank difference *q* – *m*, the alternative hypothesis is accepted, with an error probability of 5%; see Hine and Blows (2006) for a detailed description.

Further, the Akaike information criterion (AIC), adjusted for small sample size, and the Bayesian information criterion (BIC) were obtained for each analysis from minus twice the log likelihood and the number of parameters estimated, *p* (Wolfinger 1996; Burnham and Anderson 2004), with *p* = *m*(2*q* – *m* + 1)/2 + *q*(*q* + 1)/2 for an analysis estimating Σ_{G} of rank *m* and Σ_{E} of full rank. For *q* = 6 and *m* = 1, 6, values were *p* = 27, 32, 36, 39, 41, and 42, respectively. The rank of Σ_{G} was then chosen as the value of *m* with the smallest value of the information criterion.
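The parameter counts and the step-up LRT can be sketched as follows (a simplified illustration; the exact small-sample correction used for AIC is not restated here, so only the parameter count and the χ² comparison are shown):

```python
from scipy.stats import chi2

def n_parameters(q, m):
    """Parameters estimated: Sigma_G of rank m contributes m(2q - m + 1)/2,
    a full-rank Sigma_E contributes q(q + 1)/2."""
    return m * (2 * q - m + 1) // 2 + q * (q + 1) // 2

def lrt_rank(logl, q, alpha=0.05):
    """Simplified step-up LRT: logl[m] is the REML log likelihood for an
    analysis fitting m genetic PCs; the test of m vs. m - 1 uses
    q - m + 1 degrees of freedom."""
    rank = 1
    for m in range(2, q + 1):
        stat = 2.0 * (logl[m] - logl[m - 1])
        if stat > chi2.ppf(1.0 - alpha, df=q - m + 1):
            rank = m
    return rank

# For q = 6, ranks m = 1..6 give p = 27, 32, 36, 39, 41, 42, as in the text
print([n_parameters(6, m) for m in range(1, 7)])
```

As noted below, these χ² quantiles ignore the boundary problem, so the test is expected to be conservative.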

## RESULTS

This section describes results of the simulation study. To begin with, we examine the bias in reduced-rank estimates, showing that, in practice, reduced-rank estimates of Σ_{G} are subject to bias from up to three sources: the spread of sample roots, the constraints imposed to ensure that estimates are within the parameter space, and picking up the wrong PC. We then demonstrate that the latter can dominate the MSE, so that omitting PCs with negligible eigenvalues is not always as advantageous as we might hope. Finally, we examine the scope for likelihood-based tests to determine the rank of Σ_{G} correctly.

#### Bias:

Figure 2 summarizes mean estimates of genetic eigenvalues from full-rank analyses, for cases A and B with α_{E} = α_{G} = 0, and a number of sample sizes. Such angles yield population covariance matrices that are diagonal. However, as emphasized by Hill and Thompson (1978), this can be thought of as representing a variety of nondiagonal constellations that have the same eigenvalues. For the case of equal roots, the authors presented a number of combinations of genetic and phenotypic correlations and heritabilities that resulted in the same eigenvalues as diagonal population values (Table 3 in Hill and Thompson 1978). Results clearly show the upward bias of the largest and corresponding downward bias of the smallest sample roots. This effect is more pronounced the more similar the population roots are and, for case A, increases dramatically with decreasing sample size. Estimates of the corresponding eigenvalues λ_{Qi} exhibited an analogous pattern (not shown).

Mean estimates of the first genetic eigenvalue and the direction of the corresponding eigenvector are contrasted in Figure 3 to their predicted values, for analyses restricting Σ_{G} to rank *m* = 1 and allowing it to have full rank (*m* = 6), for cases A–D. Graphs do not include expected values for α_{E} = 0, as for these cases all population roots λ_{Qi} are the same, which makes the order in which PCs are picked up arbitrary. Inspecting the (log) profile likelihood (at population values) for individual λ_{Gi}, this is exemplified by a horizontal portion of the curve; *i.e*., there is no clear maximum. For case A, the predicted downward bias in estimates of λ_{G1} due to reduced-rank estimation (*m* = 1) is small, for all values of α_{E}. Mean estimates, however, are consistently higher than predicted and higher than the population values, even for a relatively large sample involving 1000 sires with 20 progeny each. There is little difference between reduced- and full-rank estimates; *i.e*., this reflects the bias due to the spread in sample roots that, as demonstrated above, is largest when the true eigenvalues are close together. Even at full rank, estimates of the direction of the first genetic eigenvector differ markedly from the population direction, indicating that the wrong PC is picked up in a substantial number of replicates. Due to the closeness of the population roots, however, this has little effect on the corresponding estimates of eigenvalues. For cases B and C, there is good agreement between predicted and observed bias for the larger sample size. For the small sample in case B, there is again a marked upward bias in estimates of λ_{G1} due to the spread in sample roots.

For case C, however, estimates for the small sample are lower than expected, increasingly so as the population values for α_{E} increase. This can be attributed to constraining the estimate of Σ_{G} to be positive semidefinite. This causes genetic variation to be partitioned into the environmental components, counteracting the upward bias in estimates of λ_{G1} due to dispersion in sample roots. For this scenario (case C, S200N6), the proportion of samples in which such constraints were required increased from 3% for α_{E} = 10° to 48% for α_{E} = 90°. A similar pattern is evident for case D, with even higher proportions of samples needing to be constrained, so that a downward bias in estimates of λ_{G1} was notable for *m* = 1 at higher values of α_{E}, even for the large sample size.

Corresponding results for case E and various orders of fit are shown in Figure 4. For this constellation of population values, we expect a complete reversal in the order of λ_{Qi} at equal genetic and environmental angles (α_{E} = 0). Hence, for orders of fit of *m* = 1, 2, 4, and 5, predicted values of the estimate of λ_{Gm} are λ_{G6} = 5, λ_{G5} = 15, λ_{G3} = 35, and λ_{G2} = 50, respectively. Correspondingly, we expect to find an angle between the true and the estimated eigenvector of 90° for all reduced-rank analyses. As the graphs show, predicted bias decreases rapidly with increasing difference of angle α_{E} from α_{G}. Due to sampling variation, the probability of having equal angles in a sample is exceedingly small, and mean estimates for the first PC at α_{E} = 0 are thus much less biased than expected, in particular for the direction of the first eigenvector. Moreover, the deviation from predicted values for similar genetic and environmental eigenvectors appears to increase with the number of genetic PCs fitted. As above, simulation results are again modulated by the effects of constraining estimates to the parameter space, in particular for S200N6 and large differences between α_{E} and α_{G}.

#### Mean square error:

Figure 5 illustrates the combined effects of bias and sampling variation on estimates of λ_{G1} and Σ_{G}, for Σ_{G} with a true rank of 3. Shown are the MSE for estimates of λ_{G1} and the mean error for estimates of Σ_{G}, *i.e*., the average over replicates of the Frobenius norm of the difference between estimate and population value. As demonstrated above, bias in estimates of λ_{G1} is largest for α_{G} = α_{E} and large samples, when Σ_{G} is estimated at less than its true rank. Observed MSEs for case E follow a similar pattern, indicating that the bias dominates over any reduction in sampling variances due to a reduction in the number of parameters estimated. Overall there appears to be relatively little increase in MSE when attempting to fit more PCs than the true rank of Σ_{G}.

Results may, to some extent at least, reflect that the population values considered implied moderate to moderately high heritabilities. Additional simulations (not presented here) demonstrated some advantages of estimating Σ_{G} at rank *m* below the true rank, and bigger penalties for fitting too many PCs, at low levels of heritability. For case F and a small sample size we do observe a reduction in the MSE of estimates of λ_{G1} for analyses fitting fewer than three genetic PCs. Case F differs from case E only by much higher environmental covariances. A similar pattern, in particular a notable reduction in MSE for reduced-rank analyses when Σ_{E} was considerably larger than Σ_{G}, has been reported by Remadi and Amemiya (1994).

#### Rank:

An important question in conjunction with reduced-rank estimation is whether we can reliably identify the number of PCs that need to be fitted. Figure 6 summarizes the proportion of replicates with different numbers of sample roots λ_{Qi} greater than unity, together with the rank determined on the basis of LRTs and information criteria. As expected, due to the sample spread in the roots of **Q** (see Hill and Thompson 1978), sampling at full rank yielded a substantial number of replicates with fewer than *q* eigenvalues of **Q** greater than unity. For case C, this occurred even for a large sample size, and increasingly so as genetic and environmental eigenvectors deviated in direction. Conversely, for population values of Σ_{G} with rank 3, more than three roots of **Q** exceeded unity in most samples. Similarly, Remadi and Amemiya (1994) found a much greater frequency of samples with more eigenvalues λ_{Qi} > 1 than the true rank of Σ_{G}, when Σ_{G} did not have full rank.

Rank tests performed reasonably well at the larger sample size, except for case C and larger values of α_{E}. The standard LRT used ignores the fact that the hypotheses tested involve parameter values at the boundary of the parameter space and is thus expected to be too conservative (Self and Liang 1987), while the test of Anderson *et al*. (1986) should account for the resulting, nonstandard conditions. However, for both LRTs, the proportion of samples classified at a rank other than that simulated frequently exceeded the nominal error rate of 5%. For a small sample, rank tests underestimated the true rank of Σ_{G} in a substantial number of cases. In particular, BIC proved very stringent and seldom provided a correct estimate, while AIC and LRTs yielded comparable results. Corresponding results were obtained for the other cases (not shown). For a simulation with similarly small samples, Hine and Blows (2006) also reported a substantial frequency of underestimates of the rank of Σ_{G}, based on the *Y* criterion. The tests applied rely on asymptotic, large-sample properties; not surprisingly, these appear not to hold all that well for the smaller samples.

## DISCUSSION

Principal components are an important tool for multivariate statistical analyses. Their utility, however, comes at a price, as estimates of both eigenvalues and eigenvectors are subject to bias. It is well known that sampling variation causes the leading eigenvalues to be overestimated and the smallest eigenvalues to be underestimated. One consequence is that estimated genetic covariance matrices can lie outside the parameter space (Hill and Thompson 1978). Reduced-rank estimators were initially proposed to constrain estimates to be p.s.d. (Amemiya 1985), but can be generalized to estimators of given maximum rank *m*. Both are biased, with magnitude and sign of the bias depending on the choice of rank and the PCs of Σ_{G} that have been discarded.

#### Importance of bias:

Our results have shown that the bias in reduced-rank estimates of genetic PCs can be large. It has to be emphasized, however, that by considering the first genetic eigenvalue and reduced-rank analyses estimating Σ_{G} at well below its true rank, we have concentrated, for the purpose of illustration, on the worst possible scenario. In practice, we are often interested in the first two to three genetic PCs; as shown, bias in the leading genetic PCs declines rapidly as the number of PCs fitted increases. Examining a number of multivariate estimates from the literature, Kirkpatrick (2008) postulated that the “effective number of dimensions” of Σ_{G} is generally less than two and showed that the ratio of λ_{G1} to λ_{G2} is often in the vicinity of 5:1. This is synonymous with a large proportion of the genetic variance being explained by the leading PCs and a fairly wide spread of eigenvalues. Hence, we might expect a substantial proportion of applications to fall somewhere in the range spanned by cases C and D considered in the simulation study. Simulation results further showed that the effects of sampling variation on the spread of eigenvalues can be pronounced, especially for small samples, and that this can counteract the bias due to estimating Σ_{G} at rank *m* below its true rank. The bias due to reduced-rank estimation we are likely to encounter in practice could thus be substantially less than demonstrated here.

As shown, biases are affected not only by the spread of sample roots, but also by the relative orientation of the genetic and environmental eigenvectors. This issue has not been examined for practical data sets. Cheverud (1988) reported that genetic and phenotypic correlations are often similar, which implies similarity of genetic and environmental correlations. Common eigenvectors for two matrices generate the same correlation structure if the corresponding eigenvalues are proportional. For cases with similar genetic and environmental correlations, this suggests that genetic and environmental eigenvectors may have directions that are also similar, *i.e*., that differences in α_{G} and α_{E} are small. If so, this will tend to exacerbate the problems of bias due to reduced-rank estimation.

This study has been motivated by results from large-scale, reduced-rank multivariate animal model REML analyses of data from beef cattle, where estimated genetic eigenvalues changed dramatically with the number of genetic PCs fitted (Meyer 2005, 2007b). In particular, it was observed that the eigenvalue for the last PC fitted tended to be underestimated. Our findings clearly demonstrate that this was due to the inherent bias in estimates when imposing rank constraints. Moreover, they explain the paradox that some PCs with apparently negligible eigenvalues needed to be fitted to obtain reduced-rank estimates of the genetic covariance matrix that adequately described the dispersion structure in the data. With most relationships in the data due to paternal half-sibs, Meyer and Kirkpatrick (2007) showed that, substituting estimates from a full-rank analysis for the population values, Equation 11 provided a good prediction of estimates of genetic PCs obtained from the practical, reduced-rank analyses.

#### Implications for reduced-rank estimation:

Biases have been examined for the case of a balanced one-way classification, under the surmise that similar mechanisms affect REML estimates of the genetic covariance matrix with reduced rank in more general cases. The surprising result has been that the bias in reduced-rank estimates of Σ_{G} can be markedly higher than expected from simply ignoring PCs *m* + 1 to *q*. As outlined above, the underlying mechanism for this additional bias is that the REML estimator picks up the wrong subset of PCs, with some rotation of estimates determined by the differences in orientation of genetic and environmental eigenvectors and the spread in genetic eigenvalues. This “extra” bias then tends to dominate the MSE, so that even for small samples, MSEs are often not reduced for analyses fitting fewer than the true number of PCs. Comparing convergence rates for reduced-rank analyses, Meyer (2008) found that severe underfitting of the rank of Σ_{G} tended to increase total computational requirements of REML analyses, in spite of markedly reduced requirements for individual iterates. This suggests that the biases may also affect the topography of the likelihood surface, making standard numerical maximization techniques less effective.

The idea of fitting only as many genetic PCs, or, equivalently, parameters to be estimated, as can be supported by the data (Kirkpatrick and Meyer 2004; Blows 2007) is appealing, in particular for small data sets. However, without further qualification it is applicable only in specific cases and thus needs to be utilized cautiously. A “safer” recommendation is to ensure that we do not underfit Σ_{G}, *i.e*., to fit sufficient genetic PCs to capture all important genetic PCs. If this comprises PCs with negligible eigenvalues, we may omit these in subsequent applications, *e.g*., when obtaining breeding values based on the estimated genetic covariance matrix. In other words, the optimal number of genetic PCs considered for genetic evaluation may differ from that for variance component estimation.

##### Selecting the rank of Σ_{G}:

As computational requirements of REML analyses decrease with the number of genetic PCs fitted, Kirkpatrick and Meyer (2004) proposed a scheme that increased the rank of Σ_{G} successively, until no further nonnegligible eigenvalues were found. However, results from this study suggest that this may be misleading. For instance, we may fit *m* genetic PCs and find that the estimate of the *m*th eigenvalue is close to zero. As we may, in fact, have picked up one of the subsequent PCs, this estimate may increase substantially when increasing the number of PCs considered to *m* + 1. Hence, a “step-down” procedure to determine the rank of Σ_{G} to be fitted may be preferable for practical applications. If the estimate of λ_{Gm} from an analysis fitting *m* genetic PCs is markedly less than the corresponding estimate from an analysis fitting *m* + 1 PCs, we may suspect that we have picked up one of the subsequent PCs, *i.e*., one of PCs *m* + 1 to *q* instead of PC *m*, and that reducing the number of PCs further would be futile.

An additional criterion to monitor is the sum of the estimated genetic eigenvalues, *i.e*., the trace of the estimate of Σ_{G}. This may drop slightly, in particular for small samples, as we successively ignore PCs with negligible eigenvalues, due to the small eigenvalues omitted and a somewhat reduced spread in the sample roots. Ignoring PC *m* + 1 with eigenvalue λ_{Gm+1} should reduce this sum, from an analysis fitting *m* + 1 genetic PCs to an analysis fitting *m* PCs, by that amount (*i.e*., λ_{Gm+1}). A larger difference then again indicates that we have encountered an analysis in which we do not estimate the correct subset of PCs, *i.e*., that we should fit no fewer than *m* + 1 genetic PCs, even if the estimate of λ_{Gm+1} is close to zero. This should be accompanied by a significant decrease in the likelihood. For higher-dimensional multivariate analyses it is good practice to carry out a series of preliminary, bivariate analyses, pooling results to obtain starting values of Σ_{G} and Σ_{E} for REML analyses considering all traits. Even if we have a more complicated pedigree structure than that in a one-way classification, substituting these for population values in Equation 11 may then give an indication at what order of fit problems might occur. In addition, such calculations may provide more appropriate starting values for analyses with reduced rank.
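The step-down checks described above can be collected into a small routine (a sketch only; `eig_by_rank[m]` stands for the estimated genetic eigenvalues from a hypothetical rank-*m* REML analysis, and the tolerance is an arbitrary illustration, not a recommendation from the text):

```python
def step_down_rank(eig_by_rank, q, tol=0.1):
    """eig_by_rank[m]: estimated genetic eigenvalues (descending, length m)
    from a rank-m analysis.  Warning signs, judged relative to the total
    genetic variance: (i) the m-th eigenvalue shrinks markedly when rank m
    rather than m + 1 is fitted, or (ii) the eigenvalue sum drops by clearly
    more than the omitted eigenvalue."""
    total = sum(eig_by_rank[q])
    rank = q
    for m in range(q - 1, 0, -1):
        shrink = eig_by_rank[m + 1][m - 1] - eig_by_rank[m][m - 1]
        drop = sum(eig_by_rank[m + 1]) - sum(eig_by_rank[m])
        omitted = eig_by_rank[m + 1][m]
        if shrink > tol * total or drop - omitted > tol * total:
            break  # wrong subset suspected: do not reduce below m + 1
        rank = m
    return rank
```

A significance test or information criterion would of course accompany such checks in practice; the routine only formalizes the two warning signs discussed above.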

Identification of the effective dimension of the genetic covariance matrix, *i.e*., the number of PCs with eigenvalues greater than zero, is of considerable interest to quantitative geneticists. These include animal breeders, who would like to know the simplest reduced-rank model that is appropriate, and evolutionary biologists, for whom matrices of less than full rank indicate constraints imposed by nature on the response to selection (Kirkpatrick and Lofsvold 1992; Mezey and Houle 2005; Hine and Blows 2006; Blows and Walsh 2008; Kirkpatrick 2008). A number of tests for matrix rank are commonly used. Disconcertingly, the simulation studies available generally show somewhat inconsistent results, both between different tests and in the ability to find the correct dimension (Jackson 1993; Ferré 1995; Peres-Neto *et al*. 2005; Dray 2007). Similarly, identification of the correct rank in our study, based on the log likelihood and information criteria, has been only moderately successful, with substantial underestimates of the true rank for smaller samples. On the one hand, LRTs are known to favor the most detailed model. On the other hand, there has been some concern that use of AIC and BIC in a random-effects model violates some of the underlying assumptions (Ripley 2004). However, LRTs and AIC yielded, by and large, comparable results, while BIC appeared to be far too stringent. Reliable identification of the dimension of Σ_{G} hence remains an open challenge.

#### Conclusions:

Reduced-rank estimation of genetic covariance matrices is appealing and readily accommodated in standard, mixed-model-based estimation procedures such as REML. However, reduced-rank estimation can yield estimates biased by picking up the wrong subset of genetic principal components. It is thus important to choose the rank judiciously, *i.e*., sufficiently large to avoid such problems even if this does comprise a number of components with small eigenvalues.

## Acknowledgments

This work was supported by Meat and Livestock Australia under grant BFGEN. 100B (K.M.) and by National Science Foundation grant EF-0328598 (M.K.).

## Footnotes

2 This work is a joint venture with NSW Agriculture.

Communicating editor: J. B. Walsh

- Received April 16, 2008.
- Accepted July 18, 2008.

- Copyright © 2008 by the Genetics Society of America