The availability of dense molecular markers has made possible the use of genomic selection (GS) for plant breeding. However, the evaluation of models for GS in real plant populations is very limited. This article evaluates the performance of parametric and semiparametric models for GS using wheat (Triticum aestivum L.) and maize (Zea mays) data in which different traits were measured in several environmental conditions. The findings, based on extensive cross-validations, indicate that models including marker information had higher predictive ability than pedigree-based models. In the wheat data set, and relative to a pedigree model, gains in predictive ability due to inclusion of markers ranged from 7.7 to 35.7%. Correlation between observed and predictive values in the maize data set achieved values up to 0.79. Estimates of marker effects were different across environmental conditions, indicating that genotype × environment interaction is an important component of genetic variability. These results indicate that GS in plant breeding can be an effective strategy for selecting among lines whose phenotypes have yet to be observed.
PEDIGREE-BASED prediction of genetic values based on the additive infinitesimal model (Fisher 1918) has played a central role in genetic improvement of complex traits in plants and animals. Animal breeders have used this model for predicting breeding values either in a mixed model (best linear unbiased prediction, BLUP) (Henderson 1984) or in a Bayesian framework (Gianola and Fernando 1986). More recently, plant breeders have incorporated pedigree information into linear mixed models for predicting breeding values (Crossa et al. 2006, 2007; Oakey et al. 2006; Burgueño et al. 2007; Piepho et al. 2007).
The availability of thousands of genome-wide molecular markers has made possible the use of genomic selection (GS) for prediction of genetic values (Meuwissen et al. 2001) in plants (e.g., Bernardo and Yu 2007; Piepho 2009; Jannink et al. 2010) and animals (Gonzalez-Recio et al. 2008; VanRaden et al. 2008; Hayes et al. 2009; de los Campos et al. 2009a). Implementing GS poses several statistical and computational challenges, such as how models can cope with the curse of dimensionality, colinearity between markers, or the complexity of quantitative traits. Parametric (e.g., Meuwissen et al. 2001) and semiparametric (e.g., Gianola et al. 2006; Gianola and van Kaam 2008) methods address these problems differently.
In standard genetic models, phenotypic outcomes, , are viewed as the sum of a genetic value, , and a model residual, ; that is, . In parametric models for GS, is described as a regression on marker covariates (j = 1, … , p molecular markers) of the form , such that(or , in matrix notation), where is the regression of on the jth marker covariate .
Estimation of via multiple regression by ordinary least squares (OLS) is not feasible when p > n. A commonly used alternative is to estimate marker effects jointly using penalized methods such as ridge regression (Hoerl and Kennard 1970) or the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani 1996) or their Bayesian counterpart. This approach yields greater accuracy of estimated genetic values and can be coupled with geostatistical techniques commonly used in plant breeding to model multienvironments trials (Piepho 2009).
In ridge regression (or its Bayesian counterpart) the extent of shrinkage is homogeneous across markers, which may not be appropriate if some markers are located in regions that are not associated with genetic variance, while markers in other regions may be linked to QTL (Goddard and Hayes 2007). To overcome this limitation, many authors have proposed methods that use marker-specific shrinkage. In a Bayesian setting, this can be implemented using priors of marker effects that are mixtures of scaled-normal densities. Examples of this are methods Bayes A and Bayes B of Meuwissen et al. (2001) and the Bayesian LASSO of Park and Casella (2008).
An alternative to parametric regressions is to use semiparametric methods such as reproducing kernel Hilbert spaces (RKHS) regression (Gianola and van Kaam 2008). The Bayesian RKHS regression regards genetic values as random variables coming from a Gaussian process centered at zero and with a (co)variance structure that is proportional to a kernel matrix K (de los Campos et al. 2009b); that is, , where , are vectors of marker genotypes for the ith and jth individuals, respectively, and is a positive definite function evaluated in marker genotypes. In a finite-dimensional setting this amounts to modeling the vector of genetic values, , as multivariate normal; that is, where is a variance parameter. One of the most attractive features of RKHS regression is that the methodology can be used with almost any information set (e.g., covariates, strings, images, graphs). A second advantage is that with RKHS the model is represented in terms of n unknowns, which gives RKHS a great computational advantage relative to some parametric methods, especially when p ≫ n.
This study presents an evaluation of several methods for GS, using two extensive data sets. One contains phenotypic records of a series of wheat trials and recently generated genomic data. The other data set pertains to international maize trials in which different traits were measured in maize lines evaluated under severe drought and well-watered conditions.
MATERIALS AND METHODS
Two distinct data sets were used: the first one comprises information from a collection of 599 historical CIMMYT wheat lines, and the second one includes information on 300 CIMMYT maize lines.
Wheat data set:
This data set includes 599 wheat lines developed by the CIMMYT Global Wheat Breeding program. Environments were grouped into four target sets of environments (E1–E4). The trait was grain yield (GY). Hereinafter we refer to this data set as wheat-grain yield (W-GY). A pedigree was used for deriving the additive relationship matrix A among the 599 lines, as described in http://cropwiki.irri.org/icis/index.php/TDM_GMS_Browse (McLaren et al. 2005). The entries of this matrix equal twice the kinship coefficient (or coefficient of parentage) between pairs of lines.
Wheat lines were genotyped using 1447 Diversity Array Technology markers (hereinafter generically referred to as markers) generated by Triticarte Pty. Ltd. (Canberra, Australia; http://www.triticarte.com.au). These markers may take on two values, denoted by their presence (1) or absence (0). In this data set, the overall mean frequency of the allele coded as 1 was 0.561, with a minimum of 0.008 and a maximum of 0.987. Markers with allele frequency <0.05 or >0.95 were removed. Missing genotypes were imputed using samples from the marginal distribution of marker genotypes, that is, , where is the estimated allele frequency computed from the nonmissing genotypes. After edition, 1279 markers were retained.
Maize data set:
The maize data set is from the Drought Tolerance Maize for Africa project of CIMMYT's Global Maize Program. The original data set included 300 tropical lines genotyped with 1148 single-nucleotide polymorphisms (hereinafter generically referred to as markers). For each marker, the allele with lowest frequency was coded as one.
No pedigree was available for these data. Traits analyzed for this study were GY, female flowering (FFL) (or days to silking), male flowering (MFL) (or days to anthesis), and the anthesis-silking interval (ASI), each evaluated under severe drought stress (SS) and well-watered (WW) conditions. Hereinafter we refer to these data sets as maize-grain yield (M-GY) and maize-flowering (M-F), respectively. The number of lines in the M-F data set was 284, whereas 264 lines were available in M-GY. The average minor allele frequency in these data sets was 0.20. After editing (with the same procedures as those described above), the numbers of markers available for analysis were 1148 and 1135 in M-F and M-GY, respectively.
This study evaluated several models for GS that differ depending on the type of information used for constructing predictions (pedigree, markers, or both) and on how molecular markers were incorporated into the model (parametric vs. semiparametric). All the unknowns in the model were trait–environment specific. Consequently, separate models were fitted to each trait–environment combination. For ease of presentation, models are described for a generic trait–environment.
In all models, phenotypic records were described aswhere is the average performance of the ith line, is the number of replicates used for computing the mean value of the ith genotype, is an intercept, is the genetic value of the ith genotype, and is a model residual. In all environments, the response variable was standardized to a sample variance equal to one. The joint distribution of model residuals was . With this assumption, the likelihood function becomes(1)Models differed on how pedigree and molecular marker information was included in .
Standard infinitesimal model:
In this model, denoted as pedigree (P), and , where is the additive relationship matrix computed from the pedigree and is the infinitesimal additive genetic variance. Following standard assumptions, the joint prior of model unknowns in P was(2a)where are scaled inverse chi-square priors assigned to the variance parameters. The prior scale and degrees of freedom parameters were set to and , respectively. This prior has finite variance and an expectation of 0.5. Combining (1) and (2a), the joint posterior density of P is(2b)
Above, denotes all hyperparameters indexing the prior distribution. This posterior distribution does not have a closed form; however, samples from the above model can be obtained from a Gibbs sampler, as described, for example, in Sorensen and Gianola (2002). No pedigree data were available for the maize data set; therefore, this model was only in the wheat data set.
Parametric genomic models:
For parametric regression, we use the Bayesian LASSO (BL) (Park and Casella 2008), extended by inclusion of an infinitesimal effect, as described in de los Campos et al. (2009a). In this model,and the joint prior density of the model unknowns (upon assigning a flat prior to ) is(3a)
Above, marker effects are assigned independent Gaussian priors with marker-specific variances (). At the next level of the hierarchical model, the 's are assigned iid exponential priors . At a deeper level of the hierarchy is assigned a Gamma prior with rate (δ) and shape (r), which in this study were set to and , respectively. Finally, independent scaled inverse chi-square priors were assigned to the variance parameters, and the scale and degree of freedom parameters were set to and , respectively. The above model is referred to as pedigrees plus markers BL (PM)-BL.
The effect of the prior choice for in the BL has been addressed in de los Campos et al. (2009a). These authors studied the influence of the choice of hyperparameters for on inference of several items and concluded that, even when the prior for had influence on inferences about this unknown, model goodness-of-fit and estimates of genetic values were robust with respect to the choice of . Figure A1 (appendix a) depicts the prior density of λ, , corresponding to the hyperparameter values used in this study; this prior gave a high density over a wide range of values of . Also, as shown later, the posterior mean of λ changed between traits and data sets, indicating that Bayesian learning took place.
Combining the assumptions of the likelihood (1) and the prior described in (3a), the joint posterior density is(3b)
This density does not have a closed form; however, samples from the above model can be obtained from a Gibbs sampler, as described in de los Campos et al. (2009a). Inferences for the regularization parameter are presented in terms of , which were obtained by taking the positive square root of samples from the posterior distribution of .
A marker-based model, M-BL, can be obtained from (3b) by setting , which implies .
BLUP using marker genotypes:
Prediction of genetic values using BLUP (e.g., Robinson 1991) of marker effects is commonly used in GS (e.g., Meuwissen et al. 2001; Bernardo and Yu 2007). We include this method as a reference. BLUP estimates are derived from the modelwhere D = . From these assumptions, the BLUP estimates of marker effects are
Computation of BLUPs requires knowledge of . To this end, we fitted a random-effects modelwhere is the observed phenotype of the kth replicate of the ith genotype (; ), and . This model yields estimates of , where . An estimate of was obtained by plugging the estimate of in (e.g., Meuwissen et al. 2001; VanRaden 2007), where is the estimated allelic frequency of the jth marker, and is the average (across markers) allele frequency, which in our case was estimated from the marker data.
Semiparametric models (RKHS):
In RKHS, genetic values are viewed as a Gaussian process. When markers and a pedigree are available, genetic values can be modeled as the sum of two componentswhere is as before and is a Gaussian process with a (co)variance function proportional to the evaluations of a reproducing kernel, , evaluated in marker genotypes; here and are vectors of marker genotype codes for the ith and jth individuals, respectively. The joint prior distribution of , , and the associated variance parameters , , , and , are as follows:(4a)
Above, K is a kernel matrix, which is symmetric and positive definite. In this study, the entries of these matrices were the evaluations of a Gaussian kernel, , where is a squared-Euclidean distance, and is a bandwidth parameter that controls how fast the prior correlation drops as lines get farther apart in the sense of . The values of the distance function depend on p, on allele frequencies, and on how related the lines are. The choice of the bandwidth parameter should consider the observed distribution of to avoid situations where K is either a matrix full of ones or an identity matrix. In this study we chose , where is the sample median of . This choice yields at the median distance. Higher (lower) prior correlation is assigned to pairs of lines that are closer (farther apart) than , as measured by . Addressing the optimal choice of bandwidth parameter is not within the scope of this study; see de los Campos et al. (2010). The scale and degree of freedom parameters of the prior described in (4a) were and .
Combining the assumptions in (1) and (4a), the joint posterior density of this marker and pedigree RKHS model (PM-RKHS) is(4b)
This density does not possess a closed form; however, samples from this posterior distribution can be obtained using a slightly modified version of the Gibbs sampler that implements the pedigree model in (2a).
In the RKHS regression of (4b), the variances of and can gauge the relative contribution of each of these components to the conditional expectation function. From (4a), , where is the ith diagonal element of matrix A, and . Here, is a standardized kernel, with . This does not occur in ; here , where is the coefficient of inbreeding of the ith individual. In the wheat population, the average value of was 1.98.
As with parametric methods, a marker-based model, M-RKHS, can be obtained as a particular case of (4b), with , which implies .
Models were first fitted using all lines in the data set, and inferences for each fit were based on 30,000 samples (obtained after discarding 5000 samples as burn-in). Convergence was checked by inspecting trace plots of variance parameters.
Prediction of performance of lines whose phenotypes are yet to be observed is a central problem in plant breeding. Such prediction can be used, for example, to decide which of the newly generated lines will be evaluated in field trials. Cross-validation (CV) methods were used to evaluate the ability of a model to predict future outcomes. To this end, data were divided into 10 folds; this was done by using an index variable, , i = 1, … , n, that randomly assigns observations to 10 disjoint folds, , j = 1, … , 10. CV predictions of the observations in the first fold, , are obtained by omitting phenotypic data on all lines in the first fold. This yields CV predictions of lines in the first fold, that is, . Repeating this exercise for the second, third, … , 10th folds yields a whole set of CV predictions that can be compared with actual observations to assess predictive ability.
Principal component analysis of estimated marker effects:
Parametric models such as the BL yield estimates of marker effects, which, in our case, are environment specific. These estimates can be used to assess and visualize genetic effect × environment interaction. Biplots from principal component analysis of the matrix of estimated marker effects in each trait–environment combination were obtained. The methodology is briefly explained in appendix b. Use of biplots to assess genetic effect × environment interaction is further described in Cornelius et al. (2001).
This section begins by presenting estimates of variance parameters and of the regularization parameters of BL and RKHS that were obtained when models were fitted using all available records (i.e., full data analysis). Next, results from the principal components analysis of estimated marker effects (also obtained from the full data analysis) for the W-GY data set are given (results for the maize data set are provided in appendix c). Subsequently, estimates of measures of predictive ability obtained from cross-validation are presented.
Variance and regularization parameters:
Tables 1 and 2 give the estimates of posterior means of variance parameters and of λ in the BL. The posterior mean of the residual variance () can be used to assess model goodness-of-fit. Since the response variable was standardized within trait–environment combinations, the estimate of gives an indication of the fraction of the phenotypic variance that can be attributable to model residuals. In the GY-W data set (Table 1), RKHS models fitted data markedly better (smaller ) than P, M-BL, or PM-BL. Model M-BL had a posterior mean of residual variance that was either similar to or slightly larger than that of P, while PM-BL fitted the data better than P. Results from the maize data sets (Table 2) were mixed: M-BL fitted the data much better than M-RKHS for FFL and MFL, regardless of environmental conditions, but the opposite was observed (i.e., M-RKHS fitted data better than M-BL) for ASI and GY (Table 2).
For the W-GY data set, the posterior means of in PM-BL and PM-RKHS were smaller than that obtained in P (Table 1). This indicates that the inclusion of markers reduces the relative contribution of the regression on the pedigree, . In PM-RKHS, the ratio , evaluated at and at the posterior mean of and , was always >2 (Table 1), indicating that in PM-RKHS models, the regression on the markers made a much more important contribution to the conditional expectation than the regression on the pedigree.
The multivariate analysis of estimated marker effects for the W-GY data set indicated that the first two principal components explained 74% of the total variability in estimated marker effects (Figure 1). Sample correlations between phenotypes in the four environments (E) showed that E2 and E3 had a correlation of 0.661, whereas E2 and E4 and E3 and E4 had correlations of 0.411 and 0.388, respectively. The correlation patterns of estimated marker effects were similar, but the strength of the association was slightly weaker. For instance, the correlations between estimates of marker effects were 0.633 (E2–E3), 0.388 (E2–E4), and 0.384 (E3–E4). Correlations between E1 and the other environments were low and negative for phenotypic and estimated marker effect data.
The variance of estimated marker effects was slightly smaller in E4; this can be inferred by the length of the corresponding vector in Figure 1. The vast majority of the estimated effects are located around the center of Figure 1 (i.e., estimated effects were small, in absolute value), which reflects shrinkage of the BL model. However, some markers had estimated effects that were large in absolute value; some of those markers are identified by their name in Figure 1, and the estimated effects are given in Table S1. An approximation to the estimated effect of the presence of a marker in GY for a given environment can be obtained by orthogonal projection of the marker effect displayed in Figure 1 on the vector of the corresponding environment. To illustrate this, consider E1, where the presence of markers wPt.9256, wPt.6047, and wPt.3904 is expected to increase GY (Figure 1); in contrast, the presence of markers wPt.3462, wPt.3922, and wPt.4988 (located in the opposite direction of E1) is expected to reduce GY.
The multivariate analysis of estimated marker effects allows identifying which markers contribute to positive/negative genetic correlation between environments. Markers whose presence is expected to increase or decrease GY across environments can be viewed as contributing to positive genetic correlations in GY between environments. Examples of this group are markers wPt.9256, wPt.6047, and c.373879, whose presence increased GY in the four environments, and wPt.3393, c.380591, and c.381717, whose presence decreased GY in all environments. However, some markers act in an “antagonistic” fashion; that is, the presence of a marker increases (decreases) GY in some environments and decreases (increases) GY in others.
Results from the multivariate analysis of marker effects in the maize data sets (M-F and M-GY) were similar to those observed in the wheat data set in regard to the following: (1) the first two principal components explained a large proportion (85.8%) of the observed variability of estimated marker effects; (2) due to shrinkage, most estimated marker effects clustered around zero; and (3) although the overall correlation patterns between estimated marker effects reflected the type of association observed between phenotypes, it was possible to identify subsets of markers that contributed to positive genetic correlation and others that induced negative genetic associations. A detailed discussion of these results is given in appendix c.
Tables 3 and 4 show the estimated correlations between phenotypic outcomes and CV predictions for W-GY, M-F, and M-GY data sets. Overall, the values of these correlations, especially those obtained with BL or RKHS methods, were large for all models, data sets, and traits, indicating that genomic selection can be effective for predicting the performance of lines with yet-to-be observed phenotypes. Predictive ability was different between models and data sets: for W-GY correlations ranged from 0.355 to 0.608, for M-F correlations varied from 0.464 to 0.79, and for M-GY they ranged from 0.415 to 0.514.
Wheat data set:
In the W-GY, correlations ranged from 0.355 (BLUP in E3) to 0.608 (PM-RKHS in E1) (Table 3), and relative to the P model, the PM-RKHS model produced the highest relative gain in CV correlation in three of four environments. BLUP was outperformed by BL and RKHS methods across environments. In these data, PM models had better predictive ability than P models, and the magnitude of the gain in predictive ability attained by including markers in the model varied from a modest 7.7% (PM-BL in GY-E3) to a very important 35.7% (PM-RKHS in GY-E1) (Table 3). In general, RKHS outperformed BL both in M and PM, and BLUP outperformed P models in three of four environments (all but E3); however, as stated, BLUP was outperformed by BL and RKHS.
In the M-F, correlations ranged from 0.464 (BLUP for MFL-SS) to 0.790 (M-BL for MFL-WW) (Table 4). For these traits, BLUP was systematically outperformed by BL and RKHS. Also for these traits, M-BL yielded better predictions than M-RKHS, with relatively high correlation values that ranged from 0.774 to 0.790. However, for ASI under severe drought stress and well-watered conditions, correlations were not as strong as those found for the other flowering-time traits, and M-RKHS outperformed M-BL, with correlation values of 0.547 and 0.572, respectively (Table 4).
Maize grain yield:
Predictive correlations in M-GY (Table 4) were smaller than those obtained in flowering traits, and the differences between methods were not clear as in the M-F data set. Here, CV correlations ranged from 0.415 (M-BL GY under drought stress) to 0.525 (M-BL GY well watered). These traits did not yield a clear ranking of models: BL was best for GY under well-watered conditions, and RKHS was best for GY under drought stress. However, as stated, in M-GY the differences in predictive ability between models were not large.
Several simulation studies (Bernardo and Yu 2007; Wong and Bernardo 2008; Mayor and Bernardo 2009; Zhong et al. 2009) have reported important gains in genetic progress associated with the use of GS in plant breeding. Recently, Heffner et al. (2009) concluded that the high correlation between true breeding values and the genomic estimated breeding values found in several simulation studies is sufficient for considering selection based on molecular markers alone; however, evaluation of these methods with real plant data is still very limited.
Empirical evaluation of GS:
The results of this study indicate that, even with a modest number of molecular markers, models for GS can attain relatively high predictive ability for genetic values of traits of economic interest in contrasting environmental conditions. These findings are in agreement with simulation-based studies such as those mentioned above and with empirical evidence reported in animal breeding (e.g., Gonzalez-Recio et al. 2008; VanRaden et al. 2008; Hayes et al. 2009; Weigel et al. 2009).
Evaluation of predictive ability indicated that models using marker and pedigree data jointly (PM) outperformed pedigree models (P) across traits and environments, regardless of the choice of model (BL, RKHS). These results are consistent with those reported by Crossa et al. (2010), who evaluated P, M, and PM models using the BL and RKHS for grain yield in wheat (n = 170) and several disease traits in maize.
Despite the gains in predictive ability obtained with PM models, our results suggest that there is room for improving predictive ability even further. To illustrate this, and as an exercise, let us assume that the model holds, and consider as the best (unlikely) scenario that CV predictions, , are such that . If so, the maximum attainable correlation is , where h is the square root of the heritability of the trait. Thus, if heritability is 0.5, then the maximum correlation is 0.707. This will hold if only one replicate is available; for data involving repeated measures, as was the case in this study, the maximum correlation is . CV correlations in this study ranged from 0.40 to 0.79; these values are well below the theoretical maxima given the heritability of the traits and the number of replicates available. We therefore conclude that larger gains in predictive ability can be expected (1) when more markers are available or (2) by improving upon the methods used to implement GS.
Choice of model:
There are different ways of incorporating markers into models for GS. Here we evaluated the BL, BLUP, and RKHS methods. BLUP and BL use parametric regression on marker covariates, whereas RKHS is a semiparametric method. In general, BL outperformed BLUP, which may be attributed to at least two reasons: (1) similar to other methods for GS such as methods Bayes A and Bayes B of Meuwissen et al. (2001), BL performs marker-specific shrinkage of effects, whereas BLUP penalizes all marker effects equally; and (2) in BL, variance parameters and marker effects are inferred jointly, whereas BLUP typically involves two steps (a first one in which variance parameters are inferred and a second one in which marker effects are estimated).
The comparison between BL and RKHS yielded mixed results; this finding is in agreement with those of Zhong et al. (2009), who evaluated different models in different scenarios (mating systems) and did not find one method that performed best across scenarios. For grain yield and anthesis-silking interval, RKHS methods performed either similarly or better than the BL; however, for female and male flowering traits in maize, BL outperformed RKHS markedly. The BL is an additive model, whereas RKHS may be able to capture complex epistatic interactions better (e.g., Gianola and van Kaam 2008). Therefore, one could expect the BL to perform well in traits where additive effects play a central role and RKHS to perform better in traits where epitasis is more relevant. Buckler et al. (2009) provide evidence suggesting that female and male flowering traits in maize are, for the most part, additive traits. The good performance of the BL observed in this study for those traits is consistent with this finding.
Marker vs. pedigree plus marker models:
In general, PM models in W-GY had a slight but consistent superiority in all four environments for predictive ability as compared to the M model; this is in agreement with previous findings (e.g., de los Campos et al. 2009a). The advantage of considering pedigree and markers jointly is small because there is some redundancy between regression on the pedigree and regression on markers (e.g., Habier et al. 2009). It is reasonable to expect that as the number of molecular markers increases, the relative contribution of pedigree information will decrease.
Assessment of genetic effect × environment interaction with estimates of marker effects:
Parametric methods such as M-BL, PM-BL, or BLUP provide estimates of “marker effects” that may be used to gain a better understanding of the underlying architecture of the traits. The results obtained here with W-GY are consistent with those reported by Crossa et al. (2007) and indicate that markers such as wPt.6047, wPt.3393, wPt3462, and wPt.3904 (located in chromosome 3B, the long arm of chromosome 7A, chromosome 1A, and the short arm of chromosome 1A, respectively) are indeed associated with GY in wheat.
Estimates of marker effects can be also used to gain insights on the sources of genetic effect × environment interaction. Here, we used principal component analysis of estimates of marker effects as a way of assessing sources of marker effect × environment interaction. Overall, the correlation patterns of estimated marker effects were similar to those observed at the phenotypic level; however, in all trait–environment combinations it was possible to detect markers that made contributions to positive or negative genetic correlation. For example, for the M-F data set, results indicate important molecular marker effect × environment interactions, which translate into genotype × environment interaction. In this respect, our results are different from those of Buckler et al. (2009), who reported low levels of genotype × environment interaction for the same traits.
Results of this study showed that models including markers or markers and pedigrees yield relatively high correlations between predicted and observed phenotypic outcomes. The superiority of models using markers or markers and pedigree was clear regardless of the choice of method (BL, RKHS). Moreover, we did not find a method (BL or RKHS) that was consistently superior across environments and traits. Differences in the underlying genetic architecture of the traits may well explain these results.
The relatively promising results from RKHS indicate that designing methods to address the problem of kernel choice is a relevant area of research in the context of semiparametric models for GS. In this study, separate models were fitted to each trait–environment combination. Multiple-environment (multiple-trait) models are ubiquitous in plant and animal breeding, and the development and evaluation of multiple-environment models for GS where marker effects and genomic values for several traits are estimated jointly appears to be a relevant area of research.
The Bayesian LASSO was fitted using the BLR package which is available in R (R Development Core Team 2010; G. de los Campos and P. Pérez) and described in Pérez et al. (2010). The wheat and maize experimental data, and other computer programs written in R for fitting the RKHS models using the Gibbs sampler described in this article, are available in File S1.
APPENDIX B: MULTIVARIATE ANALYSIS OF ESTIMATED MARKER EFFECTS
Consider a matrix of estimated molecular marker effects, , whose columns, , , are estimates of the effects of p markers in q different environments. The singular value decomposition of this matrix is , where and are ortho-normal matrices that span the row (marker) and column (environment) spaces of , respectively, and is a diagonal matrix whose nonnull entries are the singular values of ; that is, .
The biplot is constructed using the first two principal components axis of (, and , ). Points in the biplot are the marker effects projected in the first two components and are displayed using the coordinates provided by and . The “environmental effects” are displayed as vectors whose coordinates are given by and . The length of the vectors approximates the variance accounted for by the specific molecular marker and environmental effect. Molecular markers represented in the same direction as the environments had positive effects on those environments, whereas molecular markers located in the opposite direction to the environmental vectors had negative effects on those environments. The cosine of the angle between the vectors representing a pair of environments (or molecular marker effect) approximates the correlation of the two environments (or molecular marker), with an angle of zero indicating a correlation of +1, an angle of 90° (or −90°) a correlation of 0, and an angle of 180° a correlation of −1.
Marker effects for maize flowering data:
The display of the first two component axes (accounting for 85.79% of the total variability in estimated marker effects) on estimated effects of the markers in the six trait–environment combinations (MFL-SS, MFL-WW, FFL-SS, FFL-WW, ASI-SS, and ASI-WW) of the M-F data set obtained from the BL model is depicted in Figure C1. Clearly the two groups of trait–environment combinations are dominated more by the trait (ASI vs. FFL and MFL) and less by the environmental condition (SS and WW). Phenotypic outcomes and estimates of marker effects for ASI showed relatively small correlations with those of FFL and MFL. Phenotypic correlations between MFL in WW and SS, ASI in WW and SS, and FFL in SS and WW were positive and high, ranging from 0.686 to 0.728. Correlations ASI-MFL and ASI-FFL at the different water regimes (SS and WW) ranged from −0.123 to 0.446.
Interpretation of the estimated marker effect on these traits should be different from that for grain yield. For FFL and MFL, the favorable allele is the one whose estimated effect is negative (i.e., it decreases FFL and MFL), whereas for ASI, selection seeks to set this trait as close to zero as possible. Alleles coded as 1 of markers whose estimated effects are located on the left side and in the top left corner of Figure C1 (i.e., PZA03551.1, PZA03578.1, PZA03222.1, PZA03385.1, PZB01201.1, and PZB00118.2) increase FFL, MFL, and ASI (they all have positive effects in all trait–environment combinations), whereas those markers located on the opposite side of the biplot (bottom right corner) (i.e., PZA02587.16, PZA00236.7, PZB0255.1, and PZA00676.2) decrease the value of FFL, MFL, and ASI. Those markers whose presence is expected to increase or decrease traits across environments can be viewed as contributing to positive genetic correlations in FFL, MFL, and ASI between environments.
Despite the high heritability (between 0.74 and 0.87) found for flowering time and ASI in this maize trial, results show substantial interaction between molecular marker effects and environment. The biplot in Figure C1 shows markers that had very contrasting effects across environments. For example, the minor alleles of markers whose estimated effects are located in the top right corner of the biplot (PZA03592.3, PZB01077.3, and PZB02076.1) increase the anthesis-silking interval under drought and well-watered conditions, but decrease days to male and female flowering. In contrast, the minor alleles of markers whose estimated effects are located in the opposite quadrant of the biplot (bottom left corner) (PZB00592.1, PHM13183.12, and PZB01964.5) showed a complete rank reversal with respect to the effects of markers PZA03592.3, PZB01077.3, and PZB01077.3 on those trait–environment combinations, i.e., a decrease in ASI under SS and WW and an increase in male and female flowering times.
The estimated effects used to perform the multivariate analysis included in this section are provided in Table S2.
Marker effects for maize grain yield under stress and well-watered environments:
Since only two trait–environment combinations (GY-WW and GY-SS) are available for the M-GY data set, no principal component analysis was performed. The phenotypic correlations between GY-WW and GY-SS (0.260), as well as the correlations between the estimated marker effects for grain yield (0.251), were low. Also, none of the 10 markers with the largest/smallest estimated effects in GY-WW was among those with the largest/smallest effects under GY-SS conditions. This indicates important context-dependent effects due to genotype × environment interaction. Estimates of marker effects for GY-WW and GY-SS are provided in Table S3.
This article benefited from valuable comments from two associate editors and two anonymous reviewers. The maize data set used in this study comes from the Drought Tolerance Maize for Africa project financed by the Bill and Melinda Gates Foundation. We thank the numerous cooperators in national agricultural research institutes who carried out the maize trials in Africa and the Elite Spring Wheat Yield Trials and provided the phenotypic data analyzed in this article. We also thank the International Nursery and Seed Distribution Units in the International Maize and Wheat Improvement Center (CIMMYT, Mexico), for preparing and distributing the seed and digitalizing the data. Gustavo de los Campos and Daniel Gianola acknowledge support by the Wisconsin Agriculture Experiment Station and from grant DMS-11044371 made by the Division of Mathematical Sciences of the National Science Foundation.
Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.118521/DC1.
↵2 These authors contributed equally to this work.
Communicating editor: M. Kirst
- Received May 5, 2010.
- Accepted July 28, 2010.
- Copyright © 2010 by the Genetics Society of America