Determining the Effective Dimensionality of the Genetic Variance–Covariance Matrix
Emma Hine, Mark W. Blows

Abstract

Determining the dimensionality of G provides an important perspective on the genetic basis of a multivariate suite of traits. Since the introduction of Fisher's geometric model, the number of genetically independent traits underlying a set of functionally related phenotypic traits has been recognized as an important factor influencing the response to selection. Here, we show how the effective dimensionality of G can be established, using a method for the determination of the dimensionality of the effect space from a multivariate general linear model introduced by Amemiya (1985). We compare this approach with two other available methods, factor-analytic modeling and bootstrapping, using a half-sib experiment that estimated G for eight cuticular hydrocarbons of Drosophila serrata. In our example, eight pheromone traits were shown to be adequately represented by only two underlying genetic dimensions by Amemiya's approach and factor-analytic modeling of the covariance structure at the sire level. In contrast, bootstrapping identified four dimensions with significant genetic variance. A simulation study indicated that while the performance of Amemiya's method was more sensitive to power constraints, it performed as well or better than factor-analytic modeling in correctly identifying the original genetic dimensions at moderate to high levels of heritability. The bootstrap approach consistently overestimated the number of dimensions in all cases and performed less well than Amemiya's method at subspace recovery.

GENETIC variance–covariance (G) matrices conveniently summarize the genetic relationships among a suite of traits and are a central parameter in the determination of the multivariate response to selection (Lande 1979). Although individual elements of G, single-trait genetic variances along the diagonal and bivariate genetic covariances off the diagonal, are frequently the subject of hypothesis testing and biological interpretation, such a compartmentalized approach to the analysis of a set of functionally related traits is of limited value (Pease and Bull 1988; Charlesworth 1990; Blows and Hoffmann 2005). The multivariate pattern of genetic covariance represented by G can be analyzed by determining the eigenstructure of this symmetrical matrix and, subsequently, how the genetic variance is partitioned among genetically independent traits. Of particular importance is the rank of G, that is, how many genetically independent traits are represented by a set of functionally related phenotypic traits, a question that can be traced back to Fisher's (1930) geometric model (Orr 2000).

The distribution of the eigenvalues of G (the genetic variances of the genetically independent traits defined by the eigenvectors) allows an unambiguous definition of a genetic constraint. A singular G matrix (one or more of the eigenvalues are zero) suggests that an absolute genetic constraint exists (Pease and Bull 1988; Blows and Hoffmann 2005; Mezey and Houle 2005) and the effective dimensionality of G may be less than the number of traits measured. In turn, the eigenvectors of G have become important components of a number of approaches to the study of multivariate genetic constraints (Schluter 1996; Blows et al. 2004; McGuigan et al. 2005) and investigations of how selection may change the genetic variance (Blows et al. 2004; Hine et al. 2004).

Importantly, the singular nature (or otherwise) of G cannot be determined from a cursory viewing of the elements of G; G may be still be singular in the presence of genetic variance in all individual traits, for instance (Dickerson 1955; Charlesworth 1990). Determining the dimensionality of a covariance matrix, and the subsequent estimation of nonnegative definite covariance matrices, has received considerable attention in the statistical literature (Amemiya 1985; Anderson and Amemiya 1991; Calvin and Dykstra 1991; Mathew et al. 1994; Sun et al. 2003). However, the effective dimensionality of G has been addressed only on a few occasions, primarily using resampling approaches to generate confidence intervals for the eigenvalues of G (Kirkpatrick et al. 1990; Mezey and Houle 2005) and to find whether one or more eigenvalues of G were significantly different from zero (Kirkpatrick et al. 1990). In a recent departure from this approach, factor-analytic modeling has been adopted as a way of fitting genetic principal components in a framework that will also allow direct hypothesis tests of the number of genetic dimensions required to explain genetic covariation among environment and traits (Thompson et al. 2003; Kirkpatrick and Meyer 2004).

Here, we assess the utility of three methods for determining the dimensionality of G. First, we introduce how the effective dimensionality of G can be established, using a method for the determination of the dimensionality of the effect space from a multivariate general linear model (Amemiya 1985; Amemiya et al. 1990; Anderson and Amemiya 1991). We outline how the effective number of dimensions of a genetic covariance matrix obtained from either one or two random factor experimental designs commonly used in evolutionary studies can be determined. We then show how the original genetic covariance matrix can be partitioned so that a new, nonnegative definite covariance matrix can be constructed that contains only those genetic dimensions that have strong statistical support. Conveniently, this method requires calculating only the eigenvectors and eigenvalues of a symmetrical matrix without iteration, an approach that can be implemented in many commonly used software packages. Second, factor-analytic modeling in a restricted maximum-likelihood framework is utilized to demonstrate how dimensionality may be determined in many genetic experimental designs using a mixed-model approach, again resulting in a guaranteed positive semidefinite estimate of G. Finally, the bootstrap approach, recently used to establish full rank in Drosophila melanogaster wing shape by Mezey and Houle (2005), is implemented.

We compare the performance of all three methods in a simulation study conducted on 100 simulated data sets of known dimensionality, as well as applying them to a multivariate set of contact pheromones (cuticular hydrocarbons) from a standard half-sib breeding design using D. serrata. We have been investigating the effect of directional selection on the genetic variance in male sexually selected display traits of D. serrata (Blows et al. 2004; Hine et al. 2004). In this system, multiple male cuticular hydrocarbons act as contact pheromones and are under linear sexual selection. The vast majority of genetic variance in these male traits was found to be orientated in such a way that it was almost orthogonal to the direction of sexual selection in both a laboratory-adapted (Blows et al. 2004) and a field (Hine et al. 2004) population. These results suggested that sexual selection might have depleted genetic variance, resulting in an ill-conditioned G matrix that was effectively not of full rank.

MATERIALS AND METHODS

Method I: the dimensionality of the effect space from a linear model:

The dimensionality of a covariance matrix:

To begin, consider the simple case of an experimental design that could be analyzed appropriately by a one-way multivariate analysis of variance; for example, multiple traits are measured on a number of individuals from a set of families. In this case, extraction of the between- and within-mean square matrices (Math and Math, respectively) is the first step in the estimation of the between-source (family) variance–covariance matrix (Math), usingMath(1)where r is the coefficient of the variance components at the between-source level. The characteristic roots (λi) of Math in the metric of Math can be found by solvingMath(2)(Amemiya 1985). If λi ≥ 1 for all i = 1, 2,…, p, then the covariance matrix Math is nonnegative definite. If any λi < 1, then Math is indefinite as will often be the case with estimates of G (Hill and Thompson 1978). Solving the polynomial in λ represented by (2) requires specialized software. Instead, one can use linear algebra to find the character roots of Math in the metric of Math as defined by the eigenvalues (λi) ofMathwhere L is a lower triangular matrix and defined as the transpose inverse of U (upper triangular), which in turn is the Cholesky root of Math (i.e., Math). Note that the notation here differs from that of Amemiya (1985) to avoid confusion with lower (L) and upper (U) matrix notation. The classical Cholesky decomposition is a lower triangular matrix (Math), but many common mathematical and statistical programs (such as Matlab and SAS IML) use the upper triangular matrix convention (Math). Be aware of which convention your program of choice uses and ensure that the inverse of the Cholesky root is arranged in lower triangular form to define L.

When the existence of genetic constraints is of interest, the initial null hypothesis will be full rank (p) as a G matrix of full rank indicates there are no genetic constraints on the evolution of the set of traits. Let k be the number of characteristic roots such that λi ≥ 1. Amemiya et al. (1990; Anderson and Amemiya 1991) provide a way of determining how many of the k dimensions are supported statistically to determine the effective number of dimensions (m). As a first step, the null hypothesis that mk can be accepted immediately, because, at most, there is evidence for the same number of dimensions as there are λi ≥ 1. Next, we test the new null hypothesis that mk − 1. If this null hypothesis is accepted, as outlined below, we continue to test mk − 2 and so on until one in the series of null hypotheses is rejected or the null hypothesis that m ≤ 0 is accepted, indicating a lack of support for any effect.

To test the null hypothesis that mb (0 < b < k − 1), order the λi ≥ 1 from greatest to smallest (λ1 ≥ λ2 ≥ ⋯ ≥ λk ≥ 1) and to each apply the equationMathwhere M and N are the degrees of freedom at the effect (between) and error (within) levels, respectively. The test statistic for the current null hypothesis is thenMath(3)To determine whether to reject the null hypothesis, the test statistic can be compared to the distribution of Y for Math in Table 1 of Amemiya et al. (1990). Rejecting the null hypothesis mb indicates that there is evidence for b + 1 dimensions.

Constructing a nonnegative estimate of a covariance matrix:

It is possible to construct a nonnegative matrix of rank k from the eigenvectors corresponding to those λi ≥ 1 (Amemiya 1985). Let the characteristic vectors of Math in the metric of Math from (2) form the columns of a matrix T. Define the Math matrix P asMath(Amemiya 1985). Using P we can now express the original difference Math asMathwhere Math and Ipp is the Math identity matrix. The characteristic roots (λi) of Math in the metric of Math can be used to partition Math into a nonnegative definite matrix and a negative definite matrix,Math(4)where Pk is a Math matrix containing the first k columns of P, Pl contains the remaining pk columns of P, Math, and Math. The first term on the right-hand side of (4) then represents the nonnegative definite portion of Math and the second term represents the negative definite portion. By dropping the second term representing the negative definite portion, the negative eigenvalues are set to zero. The remaining term is then the projection of Math in the metric of Math onto the set of all nonnegative definite matrices (Amemiya 1985).

Again, in practice it is easier not to deal with estimating the characteristic vectors of Math in the metric of Math using (2). Alternatively, let Q be the matrix containing the eigenvectors of Math as columns (Amemiya 1985). Define the Math matrix P (Amemiya 1985):Math

Having established the effective number of dimensions, we can construct a new covariance matrix of rank m that is nonnegative definite and contains only the dimensions supported by the hypothesis-testing procedure, rather than of rank k from the eigenvectors corresponding to those λi ≥ 1. The first m columns of P (forming the Math matrix Pm) are associated with the characteristic roots of Math in the metric of Math that are statistically supported. Pm is used to construct a nonnegative definite covariance matrix of rank m byMath(5)where Math and Imm is the Math identity matrix. The first m eigenvectors of Math represent the dimensions of the original covariance matrix that have strong statistical support as defined by the hypothesis-testing procedure outlined above. The remaining eigenvectors will have eigenvalues equal to zero. Amemiya (1985, appendix) goes on to show that (5) is equivalent to the restricted maximum-likelihood estimator of Math for balanced experimental designs.

Method II: factor-analytic modeling:

The dimensionality of a covariance matrix:

Although Amemiya's approach allows the determination of the number of dimensions of a G matrix, it is restricted in the experimental designs that can be accommodated by the approach. Factor-analytic modeling is commonly used to determine the minimum number of factors (dimensions) required to explain the pattern of covariation among a set of variables (Krzanowski and Lai 1988). Although not explicitly the goal of their development of factor-analytic modeling for fitting genetic principal components (Kirkpatrick and Meyer 2004; Meyer and Kirkpatrick 2005), factor-analytic modeling provides an alternative way to determine the dimensionality of G matrices because likelihood-ratio tests can be made of the improvement in fit as each principal component is added (or subtracted) from the model (Kirkpatrick and Meyer 2004; Meyer and Kirkpatrick 2005). This approach displays great promise as it allows for almost any experimental design to be used, and tests of dimensionality can be conducted within a restricted maximum-likelihood framework.

Kirkpatrick and Meyer (2004) introduced a new algorithm to fit genetic principal components that we do not attempt to implement here. Rather, as suggested by these authors, alternative algorithms are available, and here we outline a readily available approach using standard statistical software (Proc Mixed in SAS). In particular, the reduced-rank model in which the specific variances are assumed to be zero can be specified in Proc Mixed for the level of interest that contains the additive genetic variance components (the sire level in a half-sib design, for example), using the factor-analytic model [FA0(m)] covariance structure. At the genetic level of interest then, a reduced-rank covariance matrix Math is given byMath(6)where Math(Math) is a lower triangular matrix of constants that represent the factor loadings of the m latent variables. A series of nested hypothesis tests may be conducted to determine how many genetic dimensions are required to explain the observed patterns of genetic covariance. A full model is first fit (m = p), and factors are sequentially dropped until the model achieves a significantly worse fit compared to the previous model (in which it is nested), indicating that the amount of variation accounted for by the tested factor is sufficient for the factor to be retained. Since Math is a lower triangular matrix, one less parameter is estimated in each subsequent column. Therefore, the degrees of freedom associated with each log-likelihood-ratio test are determined by the change in the number of parameters required in Math given by pm + 1.

Constructing a nonnegative estimate of a covariance matrix:

Conveniently, fitting the reduced model will guarantee a nonnegative definite variance–covariance matrix with the desired number of dimensions. For example, fitting mp factors will result in a G matrix that is positive semidefinite with m dimensions, even if a number of the eigenvalues of the original G matrix were negative. Once the number of factors to be retained is established, the reduced-rank covariance matrix, Math, is obtained by multiplying the matrix of factor loadings by its transpose as in (6).

Method III: bootstrapping of the eigenvalues of G:

Mezey and Houle (2005) attempted to determine the dimensionality of G by generating a bootstrapped sample of the eigenvalues of the original G matrix. Here, we implement this bootstrap procedure to enable a comparison with the other two methods we present. Bootstrapping was conducted by sampling sires (and all their progeny as blocks) with replacement, generating 1000 estimates of G. The eigenvalues of each G were then calculated and mean eigenvalues and their 95% confidence intervals are presented. Constructing a nonnegative definite matrix from this approach is not possible as it does not allow a determination of the exact dimensions that have received statistical support, a limitation that we discuss in detail below.

Simulation:

To assess the relative effectiveness of the three methods, we conducted a simulation using 100 data sets. For simplicity and to reduce computational times we did not simulate a hierarchical data structure as in the half-sib experiment, but rather each data set consisted of 100 “sires,” each with 6 “offspring,” resulting in 600 offspring for which eight traits were generated. To generate data with known dimensionality at the sire level for each data set, we first generated an 8 × 2 factor matrix, Math, to represent the factor loadings of two latent variables, with randomly drawn constants in all elements except Math, which was set to zero as the pivot. We then multiplied Math by its transpose (as in factor-analytic modeling, Equation 6) to construct the 8 × 8 “genetic” covariance matrix. One hundred random linear combinations of the first 2 eigenvectors of G, where the random multipliers were drawn from distributions of mean zero and variance equal to the corresponding eigenvalue, were produced to represent 100 sire “breeding values.” One hundred simulated sets of breeding values for eight traits that could be completely described by two different latent variables were generated using this process.

We conducted the simulation on four different approximate levels of heritability: 0.75, 0.50, 0.25, and 0.10. To generate offspring phenotypes with an average heritability of ∼0.75, we generated for each simulated data set 600 (100 sires by six offspring) rows and eight columns of random numbers independently distributed with mean 0 and variance 1.5. The breeding value of each sire was added to six of these random numbers to create six offspring values for that sire. To obtain heritabilities of ∼0.50, 0.25, and 0.10, the diagonals of the “error” covariance matrix were set at 10, 30, and 100, respectively.

We applied Amemiya's method and factor analytic modeling to the same 100 data sets for each heritability level. Bootstrapping was initially applied to the 10 and 75% heritability treatments, and because the results differed little between the two extreme treatments, and to save on computational time, we did not proceed to apply this method to the intermediate heritability treatments. For Amemiya's method, variance components were estimated using least-squares, using Proc GLM in SAS as the data sets were balanced. Factor-analytic modeling was conducted using Proc Mixed in SAS, with an unconstrained covariance structure modeled at the “within-sire” level and the factor-analytic structure at the sire level. Bootstrapping of each data set was conducted by sampling sires with replacement, and 1000 bootstrap replicates of each of the 100 data sets were generated.

RESULTS

Here, we have reanalyzed the data from the half-sib experiment described in Blows et al. (2004). In this experiment 66 sires were each mated to up to three dams, resulting in 367 offspring for which eight contact pheromones were analyzed. To this data set we have applied each of the three methods for determining the dimensionality of the additive genetic variance–covariance matrix. Although this experiment has not been designed to allow an exhaustive search for the presence of very small levels of genetic variance (Mezey and Houle 2005), it serves to illustrate the implementation and relative performance of the three methods in an experiment of a size commonly found in evolutionary studies.

Method I: the dimensionality of the effect space from a linear model:

The model for this experimental design was the standard nested MANOVA model, with dams nested within sires. The mean square matrices Math (r = 5.4218), Math (r = 1.8326), and Math can be found in Table 1. The approach outlined by Amemiya (1985) for such a model differed in one aspect from that presented above for the one-way model. Here, the approach was applied in a two-step process, by first finding that part of the dam-level covariance matrix (Math) that was nonnegative definite and had strong statistical support. By using Math and Math in the above procedure, the ordered eigenvalues of Math were 3.260, 2.464, 2.115, 1.913, 1.726, 1.277, 1.164, and 1.003. Since all λi > 1, Σdam is nonnegative definite, although a number of the eigenvalues are quite small. We began the hypothesis-testing procedure by accepting the null hypothesis that m ≤ 8, and each subsequent null hypothesis was accepted until m ≤ 2, which was rejected (Y = 22.032, P < 0.05), suggesting that Math has three dimensions that have statistical support. A new dam-level covariance matrix Math was constructed with a rank of 3, which allowed the construction of Math by rearranging (1) to Math, where Math.

View this table:
TABLE 1

Mean squares matrices for the sire, dam, and error levels (msire, mdam, and mwithin) of the genetic example

The dimensionality of Math (Table 2) was determined by following the same procedure, now using Math and the Cholesky decomposition of Math. The ordered eigenvalues of Math were 3.542, 3.001, 2.580, 1.506, 1.436, 1.046, 0.941, and 0.638. Since two roots were <1, the null hypothesis that m ≤ 6 was accepted. Each subsequent null hypothesis was accepted until m ≤ 1, which was marginally nonsignificant (Y = 24.410, P = 0.059), where the exact probability level was derived from Equation 4 in Amemiya et al. (1990), suggesting that there was evidence for two dimensions of genetic variance in the set of eight pheromones.

View this table:
TABLE 2

Original (italics, above diagonal) and reduced-rank (non-italics, below diagonal) estimates of the sire covariance matrix (Σsire and Formula, respectively)

The two genetic dimensions supported by the analysis are the first two principal components of the constructed nonnegative definite matrix Math (Table 2) and are displayed in Table 3. The original G matrix presented in Blows et al. (2004) was indefinite with three negative eigenvalues. Here, it can be seen that the two dimensions supported by the current analysis are very similar to the first two principal components of the original G matrix (Table 3). A more formal subspace comparison (Krzanowski 1979; Blows et al. 2004) indicated the two two-dimensional subspaces in Table 3 are indeed very similar. The comparison yields a value of 1.76 for the sum of the eigenvalues of the S matrix (hereon represented as Math) defined in Equation 3 in Blows et al. (2004), which in this case ranges from 0 for orthogonal subspaces to 2 for coincident subspaces.

View this table:
TABLE 3

The first and second principal components (PC) of the original Σsire, the Amemiya (1985) reduced-rank estimate of Formula, and the factor-analytic model containing two genetic dimensions

Method II: factor-analytic modeling:

Factor-analytic models were run constraining the covariance matrix at the sire level to be from a single dimension to full rank (eight dimensions), and model fitting statistics are displayed in Table 4. Of the eight models, the two-dimensional model, FA0(2), displayed the lowest value of the Akaike information criterion (AIC), suggesting that overall, this was the model of best fit. The series of nested hypothesis tests indicated that moving from two dimensions to one resulted in the first significant increase in the log likelihood (χ2 = 17.4, d.f. = 7, P = 0.015), again suggesting that the two-dimensional model was the favored model. Eigenanalysis of the resulting covariance matrix constrained to be two-dimensional revealed that the two significant genetic dimensions were almost identical to the two dimensions found to have statistical support by Amemiya's approach (Table 3). A formal subspace comparison indicated that the two dimensions from each of these covariance matrices described almost coincident subspaces (Math = 1.93 of 2).

View this table:
TABLE 4

Model fit statistics for the nested series of factor-analytic models testing the dimensionality of the sire-level covariance matrix using REML

Method III: bootstrapping of the eigenvalues of G:

Mean eigenvalues from the bootstrapped sample of each of the eigenvectors are presented in Table 5, along with the fifth percentile value that represents the lower 95% confidence interval in this circumstance as the significance of a variance component from zero is a one-tailed hypothesis. The lower 95% confidence interval does not overlap zero for the first four eigenvectors, indicating that significant genetic variance is present in these four dimensions. The next eigenvector, g5, has a positive mean eigenvalue, but the lower 95% confidence interval overlaps zero and equates to a significance level of P = 0.187, indicating that there is no evidence for genetic variance in this dimension.

View this table:
TABLE 5

Mean bootstrap eigenvalue and 5th percentile bootstrap eigenvalue for the half-sib genetic data

Simulation:

Amemiya's method successfully identified two dimensions at the sire level in 99 of the 100 data sets in the 75% heritability treatment. The sole remaining case was identified as having only a single dimension and was the data set that displayed the most extreme condition index (here defined as the ratio of the first to the second eigenvalue). This method was less effective at lower heritability, with only 22 cases correctly identified as two-dimensional in the 10% heritability treatment. All other cases were identified as one-dimensional, indicating Amemiya's approach consistently underestimated the number of dimensions when heritability was low.

In contrast, factor-analytic modeling was relatively consistent in its ability to identify the correct number of dimensions across heritability treatments (Figure 1), although rejection of the two-dimensional model occurred in at least 30% of cases in each heritability treatment. The behavior of factor-analytic modeling with a change in heritability was more complex than Amemiya's approach, with overestimation of the number of dimensions at high heritability, with increasing underestimation of dimensionality at lower heritabilities (Table 6). One possible explanation for the failure of factor-analytic modeling to identify the correct number of dimensions was if the search algorithm had been captured by a local peak on the likelihood surface. To test this, we reran a subset of cases using the correct factor-analytic structure as the starting values. REML iterated to the same solution as when the default MIVQUE starting values were used, suggesting that the global maximum likelihood had been found in these cases.

Figure 1.—

Relationship between heritability and the correct identification of a two-dimensional subspace in the simulation. Open circles, factor-analytic modeling; solid circles, Amemiya's approach.

View this table:
TABLE 6

Summary of simulation results

To determine the effectiveness of Amemiya's method and factor-analytic modeling in identifying the correct genetic dimensions, we investigated the ability of the two approaches to identify the correct two-dimensional subspace in the 75% heritability treatment in which Amemiya's approach had a 99% success rate. We compared the two-dimensional subspaces of the estimated positive semidefinite covariance matrices from the two methods with the subspace defined by the first two eigenvectors of the rescaled simulated genetic covariance matrix (to account for orientation changes due to the standardization of the traits) for each data set. Both Amemiya's method and factor-analytic modeling were highly successful in uncovering the correct two-dimensional genetic subspace, with the mean values of Math (±SD) being 1.982 ± 0.102 and 1.944 ± 0.106, respectively. Figure 2 shows the relationship between the two methods in their ability to estimate the correct two-dimensional subspace. When factor-analytic modeling incorrectly indicated three dimensions, the two-dimensional subspace recovered by this method was consistently a poorer match to the correct subspace (Figure 2A). Including the third dimension in the subspace comparison (Figure 2B) improved the recovery of the correct subspace to a level comparable to that in Amemiya's method. Therefore, in the data sets in which factor-analytic modeling incorrectly identified three dimensions, the third dimension does in fact contain some proportion of the originally defined two-dimensional genetic subspace. Figure 2C shows that for those cases where factor-analytic modeling correctly indicated that two dimensions existed, there is an almost perfect relationship between the two methods.

Figure 2.—

Relationship between Amemiya's method and factor-analytic modeling in recovering the original genetic subspace in the simulation. (A) Two-dimensional factor-analytic model. (B) Three-dimensional factor-analytic model. (C) Magnification of region of high subspace recovery scores from A. Simulated data sets that factor-analytic modeling correctly identified as having two dimensions are indicated by solid circles, and those data sets incorrectly identified as having three dimensions are indicated by open circles.

Bootstrapping did not recover the correct number of dimensions in any of the 100 data sets, consistently overestimating the number of dimensions with 21, 78, and 1 cases having three, four, and five dimensions, respectively, at the 75% heritability level (Table 6). At the 10% heritability level bootstrapping again consistently overestimated the rank of G, although the number of dimensions identified decreased overall, with 42 and 58 cases of three and four dimensions, respectively. To assess the performance of bootstrapping in recovering the correct subspace within the dimensions identified as being significant, the mean bootstrap estimate of G was compared to simulated (rescaled, as above) G. Bootstrapping performed better for the higher heritability level, with mean values of Math (±SD) being 1.994 ± .017 and 1.781 ± 0.245 at the 75 and 10% heritability levels, respectively.

DISCUSSION

Determining the dimensionality of G matrices:

The effective dimensionality of a set of traits plays a central role in the genetics of adaptation (Fisher 1930; Orr 2000) and the definition of absolute multivariate genetic constraints (Pease and Bull 1988; Blows and Hoffmann 2005). Here, we have investigated three methods for the determination of the effective dimensionality of a G matrix. Of the three methods, the simulation study indicated that Amemiya's approach performed best at higher heritabilities, but consistently underestimated dimensionality at lower heritabilities and was particularly biased in the 10% heritability treatment. Factor-analytic modeling tended to overestimate the number of genetic dimensions by 1 at higher heritabilities and underestimate at lower heritabilities by the same magnitude in a substantial number of cases. Although the performance of factor analysis in recovering the original genetic dimensions dropped off at the lowest heritability level, the two methods recovered the same two-dimensional subspace when the correct number of dimensions was indicated by factor-analytic modeling for the 25, 50, and 75% heritability treatments, suggesting a close mathematical association between the two approaches. The precise mathematical relationship between the two methods deserves further attention.

The simulation study indicated that bootstrapping will consistently overestimate the dimensionality of the G matrix, which is likely to be a consequence of two related limitations. First, each bootstrapped estimate of the covariance matrix will have a unique set of eigenvectors. Although the first eigenvalue will always be the largest in each replicate, the eigenvectors associated with each of the eigenvalues will be different; how different will depend mainly on the sample size, but even with very large samples the same trait combination (eigenvector) is unlikely to be replicated exactly in any of the bootstrapped samples. Since the eigenvalue is simply an estimate of the genetic variance, this is analogous to estimating the genetic variance of a different trait in each of the bootstrapped replicates, but treating the eigenvalues for hypothesis testing as if they estimated the genetic variance in the same trait each time. Second, if the eigenvalues of the original covariance matrix are distributed in such a way that some are very close in size, the order of the eigenvectors may change in different bootstrapped replicates (Cohn 1999; Peres-Neto et al. 2005), adding to the variation in the trait combinations that will be associated with each ranked eigenvalue. This is likely to be a particular problem with eigenvectors associated with small eigenvalues.

Either of these problems will result in like-sized eigenvalues of bootstrapped replicates being binned together for the calculation of the confidence interval for an estimate of an original eigenvalue, even though each estimate reflects a different trait combination. In this way, confidence intervals will be estimated as being small as the variance in each bin (i.e., first eigenvalue, second eigenvalue, etc.) is determined only by the rank of the eigenvalue for each bootstrapped replicate, rather than by how much genetic variance is present for a particular trait combination in each bootstrapped replicate. Such an approach will tend to uncover as many statistically significant dimensions as there were positive eigenvalues in the original estimate of the G matrix; in our case, five positive eigenvalues were present in the original G matrix, and four dimensions had nonzero eigenvalues determined by bootstrapping. This pattern was also present in the results from the simulation study; in 63 (75% heritability treatment) and 70 (10% heritability treatment) cases, one less dimension than the number of positive eigenvalues was recovered.

Although the factor-analytic approach provides the most flexibility for determining the dimensionality of G from virtually any experimental design, Amemiya's approach may provide an important alternative under some conditions. One such situation is that in practice REML often does not converge with the inclusion of a large number of traits. For laboratory-based quantitative genetic or genomic experiments based on simple breeding designs with relatively balanced data sets, Amemiya's approach supplies a noniterative alternative to determining dimensionality. For example, microarray analysis tends to involve a very large number of gene expression traits. One approach to future microarray experiments will be to conduct these experiments within a breeding design. Such an experimental design will allow the determination of the number of independent genetic factors underlying the typically large number of gene expression changes among treatment groups. With so many traits to be included in a factor-analytic model, convergence problems are likely to be encountered and therefore these experiments may benefit from applying Amemiya's approach.

It is important to note that the approaches outlined above are subject to the usual power constraints due to small sample sizes. Genetic experiments conducted with small sample sizes are likely to be capable of detecting only one or two dimensions that explain a substantial amount of genetic variance. The poor performance of Amemiya's approach at low heritability suggests that this method will be particularly adversely affected by small sample size. Factor-analytic modeling also underestimated the number of dimensions as heritability dropped, with a less dramatic loss of performance than Amemiya's approach as the information content of the data was reduced. Consequently, we recommend these approaches as objective aids in defining a genetic subspace, but they should not be applied as the sole determining factor in subspace selection. In particular, in cases where fewer genetic dimensions included may increase the likelihood of rejecting a null hypothesis of interest (Blows et al. 2004 would be a good example), these approaches should be used with caution.

Are G matrices of full rank?

In our example, eight pheromone traits were shown to be adequately represented by only two underlying genetic dimensions by two of these methods: Amemiya's approach and factor-analytic modeling of the covariance structure at the sire level. The bootstrap method identified four dimensions that had eigenvalues significantly different from zero, but as indicated by the simulation, this is likely to be an overestimate. The sample size in this data set is not sufficient to allow an exhaustive search for dimensions describing small levels of genetic variance, so a definitive conclusion that only two dimensions have genetic variance would be premature. In the only other statistical determination of the rank of G, Mezey and Houle (2005) reported G matrices of full rank for wing shape in D. melanogaster, using the bootstrap approach in an experiment with a very large sample. Since the bootstrap approach is likely to overestimate dimensionality, it is uncertain if G matrices of full rank may be a feature of multivariate descriptions of morphology.

A major advantage of Amemiya's approach and factor-analytic modeling is that both can result in the identification of the reduced subspace for which there is strong statistical support for the presence of genetic variance. It is important to note that in our example, the two dimensions supported by these analyses were not exactly the same as the first two eigenvectors of estimated G, suggesting that directions in the space of G that describe substantial amounts of the additive genetic variance may also be associated with large nonadditive or environmental sources of variance.

In conclusion, determining the dimensionality of G provides an important perspective on the genetic basis of a multivariate suite of traits. Understanding the genetic basis of adaptation will require a multivariate approach, as single traits are rarely under selection in isolation (Lande and Arnold 1983; Schluter and Nychka 1994; Blows and Brooks 2003). In combination with emerging genomic technologies, statistical approaches to determining the number of genetically independent traits may potentially allow an investigation of how many independent genetic changes underlie a response to selection involving complex adaptations.

Acknowledgments

We thank Y. Amemiya, D. Houle, H. Rundle, and four anonymous reviewers for comments on previous drafts of this manuscript. M.W.B. was supported by a grant from the Australian Research Council.

Footnotes

  • Communicating editor: D. Houle

  • Received December 12, 2005.
  • Accepted March 14, 2006.

References

View Abstract