Selection plans in plant and animal breeding are driven by genetic evaluation. Recent developments suggest using massive genetic marker information, known as “genomic selection.” There is little evidence of its performance, though. We empirically compared three strategies for selection: (1) use of pedigree and phenotypic information, (2) use of genomewide markers and phenotypic information, and (3) the combination of both. We analyzed four traits from a heterogeneous mouse population (http://gscan.well.ox.ac.uk/), including 1884 individuals and 10,946 SNP markers. We used linear mixed models, using extensions of association analysis. Cross-validation techniques were used, providing assumption-free estimates of predictive ability. Sampling of validation and training data sets was carried out across and within families, which allows comparing across- and within-family information. Use of genomewide genetic markers increased predictive ability up to 0.22 across families and up to 0.03 within families. The latter is not statistically significant. These values are roughly comparable to increases of up to 0.57 (across family) and 0.14 (within family) in accuracy of prediction of genetic value. In this data set, within-family information was more accurate than across-family information, and populational linkage disequilibrium was not a completely accurate source of information for genetic evaluation. This fact questions some applications of genomic selection.
This work evaluates the empirical performance of the so-called genomewide selection strategy for marker-assisted selection (MAS) in a mouse population, using massive molecular marker information. MAS techniques are advocated as a tool for more efficient selection schemes in plant and animal populations (Dekkers and Hospital 2002). In general, MAS techniques are based on tracing the inheritance of the quantitative trait loci (QTL) of interest throughout the pedigree with the help of molecular markers. However, the use of MAS techniques in animal populations is still not widespread, because of their complexity in practice (Boichard et al. 2006) and relatively small additional gains. For example, Chamberlain and Goddard (2006) estimated by cross-validation an increase in prediction accuracy over pedigree index (no use of MAS) of, at best, 0.02. Recent developments in massive single-nucleotide polymorphism (SNP) marker genotyping have increased interest in MAS techniques. Dense marker maps capture much richer information, including not only recombination events in the genotyped pedigree (i.e., linkage analysis) but also the populational linkage disequilibrium pattern in the genome, i.e., the possibility of predicting alleles at some loci on the basis of alleles at other (possibly close) loci. This allows for a much finer description of the genome.
Genetic evaluation methods and application issues in MAS in livestock have been extensively described (Fernando and Grossman 1989; Fernando and Totir 2003; Boichard et al. 2006). Roughly, this is a two-step process: first, putative QTL locations have to be found in a resource population; later, inheritance of these QTL loci is traced through linkage analysis, and this information is used to estimate breeding values. There are two sources of inefficiency in this approach: first, the fact of having to “declare” (usually by a statistical test in a QTL detection experiment) QTL locations implies that only a few QTL are used, due to lack of power, and their sizes are usually biased upward by the “Beavis” effect (Lynch and Walsh 1998). Second, at least at the first stage, linkage equilibrium has to be assumed between markers and QTL at the founders; this results in loss of across-family information and lower accuracies. However, linkage disequilibrium (LD) analysis (i.e., association between markers and QTL) is more effective because it applies to within- and across-families selection and because the phase of the QTL can be predicted across families (Boichard et al. 2006).
Marker-assisted selection techniques considering several QTL loci exist (Lande and Thompson 1990; Villanueva et al. 2005). Genomewide selection or genomic selection is a term used by Meuwissen et al. (2001). These authors overcome the problems of linkage analysis by fitting a mixed linear model with the effects of thousands of two-marker haplotypes or individual marker loci. In their simulations, breeding values were estimated with high accuracies, up to 0.85. There were two major insights in their work: to reject the previous stage of QTL position testing and to assume that, for dense marker maps, LD information alone is enough to inform about QTL effects. Piyasatian et al. (2007) showed that genomic selection is also valuable in the case of crossings between two inbred lines, requiring a much smaller number of genetic markers.
The main assumption of both methods is that most QTL explaining genetic variation are in linkage disequilibrium with available genetic markers, an assumption that is met in the simulations (Meuwissen et al. 2001), in the inbred populations (Piyasatian et al. 2007), and, to some extent, in crossings of outbred populations (Pérez-Enciso and Varona 2000). It is unknown to what extent this holds in outbred populations. Nevertheless, if promises from genomewide selection are fulfilled, very economically efficient selection schemes can be set up (Schaeffer 2006; Dekkers 2007).
There is little empirical evidence of practical performance of genomewide selection. Sölkner et al. (2007) compiled several approaches with evidence of high accuracies in dairy bulls, using progeny-test estimated breeding values as a proxy for true genetic values.
The objective of this article is to test the performance of genomewide selection and genetic evaluation, using data from a heterogeneous stock mouse population (Valdar et al. 2006a), including 1884 individuals (168 full-sib families) and 10,946 SNP markers and four different traits (weight, growth slope, body length, and body mass index). In addition, to gain insight on models and traits, variance components were estimated for different linear mixed models including genomic information.
Predictive ability (the correlation between predicted and observed phenotypes) was estimated using cross-validation. Its connection with genetic gain and accuracy (correlation between true and predicted genetic value) is shown. Cross-validation techniques considered sampling across families (i.e., choosing entire full-sib families) or within families (splitting families into two), thus disentangling across- and within-family information.
MATERIALS AND METHODS
Recently (Valdar et al. 2006a), a population of heterogeneous stock mice was used to finely describe the sources of quantitative genetic variation. This population has been extensively described and analyzed (Mott et al. 2000; Mott 2006; Solberg et al. 2006; Valdar et al. 2006a,b). We refer here to the relevant aspects for this work. The data are freely available at http://gscan.well.ox.ac.uk/. The origin of this population is a crossing of eight inbred strains, followed by 50 generations of pseudorandom mating. This population is valuable for testing genomewide selection because, due to the high number of markers, it is expected that many (about three of every five) QTL loci will be in complete LD with marker loci (Mott et al. 2000). Indeed, the extent of LD in this population is small (Valdar et al. 2006a), which indicates high resolution: average $R^2$ between two loci falls from 0.5 within 2 Mb to 0.2 within 8 Mb, and average $R^2$ between adjacent loci is 0.62. The family structure and history of the population are known and therefore interpretation of the results is straightforward.
Only animals with available phenotype and genotype were retained for data analysis. Details on the genotyping techniques and choice of SNP can be found in Valdar et al. (2006a). We discarded animals with <10,000 genotyped SNPs. Our data set was composed of 1884 individuals with 10,946 polymorphic loci (SNPs). Of these, some genotypes were missing, at a very low frequency of 0.001. These missing values should have a negligible effect on the analysis. To simplify the analysis, we imputed them at random from their allelic frequencies; no attempt at reconstruction based on family information was made. The pedigree extended over 2272 individuals. Genealogical information is available on parents of phenotyped mice but not on their grandparents. No parent of a phenotyped animal has been phenotyped itself. This genealogy is roughly organized into 168 full-sib families with 11.21 offspring on average.
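The random imputation step can be sketched as follows. This is a minimal illustration, not the software used in the study: the function name `impute_missing`, the 0/1/2 genotype coding, and the independent draw of the two alleles (i.e., Hardy-Weinberg proportions) are assumptions made here.

```python
import random

def impute_missing(genotypes, missing=None, seed=0):
    """Fill missing calls at one SNP by sampling from observed allele frequencies.

    genotypes: list with entries 0, 1, 2 (copies of allele "2") or `missing`.
    Returns a new list; observed calls are left untouched.
    """
    rng = random.Random(seed)
    observed = [g for g in genotypes if g is not missing]
    # Frequency of allele "2" among observed calls (two alleles per animal).
    p = sum(observed) / (2 * len(observed))
    imputed = []
    for g in genotypes:
        if g is missing:
            # Assumption: the two alleles are drawn independently.
            g = int(rng.random() < p) + int(rng.random() < p)
        imputed.append(g)
    return imputed

# Toy SNP with two missing calls (made-up data).
snp = [0, 1, 2, None, 1, 1, None, 2]
filled = impute_missing(snp)
```

With a missing rate near 0.001, as in this data set, the choice of imputation scheme should matter little.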
We chose four morphological traits: weight at 6 weeks (hereinafter weight), growth slope, body mass index, and body length. The heritabilities of these traits are 0.74, 0.30, 0.21, and 0.13 (Valdar et al. 2006b). Environmental covariates affecting those traits include sex for weight and growth slope, and body weight, season, month, and day for body mass index and body length; moreover, there is a “cage” effect considered as random (Valdar et al. 2006b). To simplify the analysis, we used the precorrected (by fixed effects, but not cage) data, which are available at http://gscan.well.ox.ac.uk/. Analysis with true phenotypes for weight gave very similar results. An overall mean was added to these residuals and included in the model.
A note of caution has to be made about the cage effect. The allocation of animals to cages is not at random—most animals in the cage are full sibs. From 359 cages in the data, there are just 8 cages with offspring from more than one sire; conversely, each full-sib group is allocated to an average of 2.84 cages. Therefore, it can be considered that cage is a random effect almost nested within the sire effect. This means that, in the absence of a polygenic additive effect, the cage effect might take into account part of the (genetic) family effect.
Two types of methods were needed in this study. The first type is the use of different statistical/genetic models to estimate genetic value, conditionally on marker genotypes, phenotypes, and pedigree. These models are described in the following. The second type is the empirical evaluation of these estimates by cross-validation.
Models for genetic evaluation:
The following describes the linear mixed models (in the spirit of BLUP; Lynch and Walsh 1998) that were used for prediction of genetic and environmental values. In short, the models were as follows: model 1, including polygenic (or infinitesimal) effects, without using genomic information (this is the most typical model in genetic improvement nowadays); model 2, including genomic information (SNP genotypes) but not polygenic effects; and model 3, which considers both. In addition to the genetic effects, all models included a random cage effect. The details of the models are as follows.
Model 1—classical polygenic model:
This is the model of choice nowadays in applied animal breeding and does not rely on molecular information. This model can be expressed, in matrix algebra notation, as
$$\mathbf{y} = \mathbf{Xb} + \mathbf{Sc} + \mathbf{Tu} + \mathbf{e},$$
where b is a vector of environmental effects (an overall mean), c is a vector of cage effects, and u is a vector of additive genetic polygenic effects; and X, S, and T are the corresponding design matrices. As usual, residuals e are assumed independent and to follow a normal distribution, $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$. We assumed c and u to be random normal effects with a priori normal distributions
$$\mathbf{c} \sim N(\mathbf{0}, \mathbf{I}\sigma_c^2), \qquad \mathbf{u} \sim N(\mathbf{0}, \mathbf{G}\sigma_u^2),$$
where I is the identity matrix and G is the additive genetic relationship matrix (Lynch and Walsh 1998). It is worth remarking that the purpose of genomewide selection and in general of any MAS strategy is to be of better predictive ability than this model.
Model 2—marker-locus effects model:
The basic model including SNP effects can be described as follows. Consider n SNP loci. In the jth locus, there are two possible alleles for each SNP (say 1 and 2), and there are three possible genotypes: “11,” “12,” and “22.” We arbitrarily assign the value $-\frac{1}{2}a_j$ to the allele 1 and the value $+\frac{1}{2}a_j$ to the allele 2. This follows a classical parameterization in which $a_j$ is half the difference between the two homozygotes (Lynch and Walsh 1998). These are the additive effects of the SNPs and they can be thought of as classical substitution effects in the polygenic model. It is possible to further postulate a “dominant” effect, assigning the value $d_j$ to the heterozygous genotype, 12. After preliminary analysis we discarded this option as it did not increase predictive ability of the different methods (not shown). Therefore the effects of the different genotypes are $-a_j$ for 11, 0 for 12, and $+a_j$ for 22. The effects of the different genotypes at the n loci sum up to form the genetic effect. The model for the phenotype (ignoring other effects for the sake of clarity) is
$$y_i = \sum_{j=1}^{n} z_{ij} a_j + e_i,$$
where $y_i$ is the phenotype of the ith animal, $z_{ij}$ is an indicator covariate for the ith animal and the jth SNP locus, and $e_i$ is a residual term. Hereinafter and for the sake of clarity we refer to $a_j$ as “marker-locus effects.” A marker-locus effect represents the effects on phenotype of unobserved genes (QTL) that are in partial or complete linkage disequilibrium with the marker locus.
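The additive coding can be illustrated with a minimal sketch. It assumes that the indicator covariate takes the values -1, 0, and 1 for genotypes 11, 12, and 22, so that the genotype effects are -a, 0, and +a; the effect values and the names `CODE` and `genetic_value` are made up for illustration.

```python
# Assumed coding of the indicator covariate z_ij for each genotype.
CODE = {"11": -1, "12": 0, "22": 1}

def genetic_value(genotypes, effects):
    """Sum of marker-locus effects for one animal: sum over j of z_ij * a_j."""
    return sum(CODE[g] * a for g, a in zip(genotypes, effects))

effects = [0.5, -0.2, 0.1]        # made-up marker-locus effects a_j
animal = ["11", "12", "22"]       # genotypes of one animal at three loci
value = genetic_value(animal, effects)   # -0.5 + 0 + 0.1
```

The heterozygote contributes nothing under this coding, which is what dropping the dominance term amounts to.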
If environmental effects are included, and in matrix notation, the model becomes
$$\mathbf{y} = \mathbf{Xb} + \mathbf{Sc} + \mathbf{Za} + \mathbf{e},$$
where a is the vector of marker-locus effects and Z is the corresponding design matrix. It is possible to fit a as a “fixed” effect, but with a large number of effects and a small number of records, the predictive ability will be very poor (Lande and Thompson 1990; Miller 1990; Meuwissen et al. 2001). Therefore, we assume that a follows a normal distribution, $\mathbf{a} \sim N(\mathbf{0}, \mathbf{I}\sigma_a^2)$. Meuwissen et al. (2001) used a priori information for $\sigma_a^2$ (they divided the polygenic variance by the number of SNP loci), which matched their simulated population. Our attempt to do so resulted in worse predictive abilities (not shown). As for the BayesA and BayesB approaches, substantial a priori information is needed (number of segregating loci and variances) that we did not try to guess at.
Model 3—marker-locus effects model and polygenic component:
A simple extension of the previous model is to consider, in addition to marker-locus effects, polygenic components:
$$\mathbf{y} = \mathbf{Xb} + \mathbf{Sc} + \mathbf{Tu} + \mathbf{Za} + \mathbf{e}.$$
The polygenic component u here can be thought of as fitting the genes not accounted for by the marker-locus effects in a.
There is extensive literature on model selection techniques, some of whose criteria have been applied in animal breeding. In this work we used cross-validation. This is a robust, nonparametric technique for model selection. The method consists of splitting the data y into a training data set ($\mathbf{y}_1$) and a validation data set ($\mathbf{y}_2$). Model parameters are estimated in the training data set. Parameter estimates from $\mathbf{y}_1$ are then used to predict observations in the validation data set (i.e., $\hat{\mathbf{y}}_2$). A function of the predicted and true observations summarizes the performance of the model and is assumption free and comparable across models. We used Pearson's correlation between predicted and realized observations in the validation data set, $r(\mathbf{y}_2, \hat{\mathbf{y}}_2)$. Cross-validation has also an interpretation in terms of efficiency of genetic improvement; this is further developed in the discussion.
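The procedure can be sketched in a few lines. This is an illustration only: a least-squares line on a single made-up covariate stands in for the mixed models of this study, and the names `pearson` and `fit_line` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares intercept and slope, standing in for any fitted model."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

# Made-up covariate and records; the first half trains, the second half validates.
x = [0, 1, 2, 3, 4, 5, 6, 7]
y = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2, 5.8, 7.1]
x_train, y_train, x_valid, y_valid = x[:4], y[:4], x[4:], y[4:]

intercept, slope = fit_line(x_train, y_train)            # estimate on training data
y_pred = [intercept + slope * xi for xi in x_valid]      # predict validation records
predictive_ability = pearson(y_valid, y_pred)            # correlate with realized values
```

Only the validation records enter the correlation, so the estimate does not reward a model for fitting the training data closely.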
In the following, we refer to the correlation $r(y, \hat{y})$ as the “predictive ability” (of unobserved records), whereas we keep the term “accuracy” for the correlation between total genetic value of an individual (g) and its estimate (ĝ). Accuracy can be approximately estimated from predictive ability, as shown in the appendix. In our work, , where c is the cage effect. Differences in accuracies among models can be estimated by , where and . Estimates for these variance components were obtained from model 1 in Table 1.
Training and validation sets:
A key feature in cross-validation is the choice of the training and validation sets. The first choice is the size of each set, as there is a trade-off between precision of the model in the training set and overfitting in the validation set. Usual recommendations are for the validation set to be one-fifth or one-tenth of the full data set. However, we chose to split the data set into halves, one for training and one for validation, because we consider that, in this context, 1000 animals should be enough to get good estimates of the model. For example, Meuwissen et al. (2001) fitted 50,000 effects to a data set comprising 2000 records.
The second and more critical choice is how to split the data into training and validation sets. As explained above, the mouse population is composed of several full-sib families with little or no known relationship among them. We devised two options. The first option is to sample whole families; i.e., we use across-family information. The second option is to randomly split every family into two; i.e., we use within-family information. Note that prediction is based on different kinds of information in each setting. For example, when within-family information is used, performance is predicted basically from full sibs using relationships (in models 1 and 3) and from full sibs and other families via genomic information (in models 2 and 3). On the other hand, when across-family information is used, performance is predicted from other families via genomic information (in models 2 and 3), but it is not possible to use relationships across families (in models 1 and 3) as these are unknown. Loosely speaking, across families, the genetic ties between training and validation data sets are distant relationships (i.e., at the level of grandparents) and populational LD. However, within families, the genetic ties are much stronger, being close relationships as well as populational LD. In practical selection schemes, the within-family setting is the more likely one; for instance, prospective bulls are chosen among sons of bulls with good estimated breeding values. Splitting the data in two different ways implicitly evaluates the relative weight of each source of information. Splitting was repeated at random 10 times to ensure that the results were not due to random sampling of the data, providing empirical estimates of the standard errors.
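The two splitting schemes can be sketched as follows. The function names, the toy family labels, and the exact half-and-half allocation are assumptions made for illustration; the software used in the study may have sampled differently.

```python
import random

def split_across(family_of, rng):
    """Assign whole families to training or validation (half of the families each)."""
    families = sorted(set(family_of.values()))
    rng.shuffle(families)
    train_fams = set(families[: len(families) // 2])
    train = [a for a, f in family_of.items() if f in train_fams]
    valid = [a for a, f in family_of.items() if f not in train_fams]
    return train, valid

def split_within(family_of, rng):
    """Split every family at random into two halves."""
    members = {}
    for animal, fam in family_of.items():
        members.setdefault(fam, []).append(animal)
    train, valid = [], []
    for fam in sorted(members):
        sibs = members[fam][:]
        rng.shuffle(sibs)
        half = len(sibs) // 2
        train += sibs[:half]
        valid += sibs[half:]
    return train, valid

# Toy data: 4 full-sib families of 4 animals each (made-up identifiers).
family_of = {f"m{i}": f"fam{i // 4}" for i in range(16)}
rng = random.Random(1)
across_train, across_valid = split_across(family_of, rng)
within_train, within_valid = split_within(family_of, rng)
```

Calling either function repeatedly with the same generator gives independent replicates, in the spirit of the 10 random repetitions described above.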
Variance component estimates with the full data set:
The first set of analyses consisted of parameter estimation for the different models, using the whole data set. Although these estimates are not of direct interest for the main subject of this work (efficiency of genomic selection), they help to clarify the different models. Parameter estimates were obtained in a Bayesian framework from the posterior distributions, using MCMC (in particular, Gibbs sampling).
Cross-validation of the genomewide predictive ability:
Values for the different unknowns (marker locus effects, polygenic effects, cage effects) were estimated, from the training data set, using Henderson's mixed-model equations (Lynch and Walsh 1998). Means of the marginal posterior distributions for the unknowns in the model were estimated in a Bayesian framework, using Gibbs sampling as well. This marginalization maximizes accuracy and expected genetic progress (Gianola and Fernando 1986), accounting for uncertainty in parameters (in particular, variance components), which might be very high for such small data sets. As discussed before, it was difficult to come up with adequate priors. Flat priors were thus used for variance components and fixed effects. In the cross-validation step, the correlation between the observed and predicted—from the estimates in the previous steps—performances in the validation data set was computed.
In variance component estimation and cross-validation, we used Henderson's mixed-model equations (Lynch and Walsh 1998) to compute solutions for the different models, using in-house software (available on request from the authors). Flat priors were used for variance components. Computing requirements (time and memory) of the mixed-model equations under these models are formidable, because the matrix of cross-products Z′Z to be included is large (10,946 × 10,946) and almost 100% dense. To alleviate this problem, the Gauss–Seidel with residual updating strategy was used (Legarra and Misztal 2008).
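The idea of Gauss-Seidel with residual updating can be sketched for a simplified model containing marker-locus effects only, y = Za + e. This is an assumption-laden illustration rather than the authors' software: the function name `gsru`, the toy data, and the fixed variance ratio `lam` (residual variance divided by marker-effect variance) are made up here. The point is that Z′Z is never formed; each effect is updated from the current residuals.

```python
def gsru(Z, y, lam, n_iter=200):
    """Solve (Z'Z + lam*I) a = Z'y by Gauss-Seidel, updating residuals in place."""
    n, m = len(Z), len(Z[0])
    a = [0.0] * m
    e = list(y)                                   # residuals y - Za, with a = 0
    zz = [sum(Z[i][j] ** 2 for i in range(n)) for j in range(m)]  # diagonal of Z'Z
    for _ in range(n_iter):
        for j in range(m):
            # Right-hand side for effect j, with its own contribution added back.
            rhs = sum(Z[i][j] * e[i] for i in range(n)) + zz[j] * a[j]
            new = rhs / (zz[j] + lam)
            delta = new - a[j]
            for i in range(n):
                e[i] -= Z[i][j] * delta           # keep residuals current
            a[j] = new
    return a

# Toy data: 4 animals, 3 SNPs coded -1/0/1 (made-up values).
Z = [[-1, 0, 1], [1, 1, 0], [0, -1, 1], [1, 0, -1]]
y = [1.0, 2.0, 0.5, -1.0]
solution = gsru(Z, y, lam=1.0)
```

Each sweep costs on the order of the number of records times the number of markers, and memory is limited to Z and the residual vector.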
RESULTS
Variance component estimates with the full data set:
Table 1 summarizes estimates. The results differ for weight with respect to the other three traits. For growth slope, body length, and body mass index, estimates of cage and residual variance are fairly constant across models. This is not the case, though, for weight, where cage variance is inflated when the polygenic effect was not included in the model (model 2). Estimates of the residual variance indicate roughly the same fit for all models for growth slope, body length, and body mass index, but a loss of fit for weight when the polygenic term was not fit.
As for the effects of individual SNPs (a), they roughly follow normal distributions a posteriori. This is as expected because by the nature of the mixed model, they are severely shrunken toward a mean of 0. As an example, for body length the estimated (posterior means) a effects in model 2 range between $-1.61 \times 10^{-3}$ and $1.51 \times 10^{-3}$, for a trait with a polygenic additive variance of 0.02.
Cross-validation of the genomewide predictive ability:
Results are shown in Table 2. Model 1 is the reference, as it is based on phenotype and pedigree information only. Across families, model 1 has a low predictive ability, because there is no family information to rely on, just the common cage environmental effect. Therefore, models 2 and 3 are expected to perform better, due to the molecular information. This is the case and the difference is significant, going up to an increase of 0.22 in predictive ability.
Within families, models including genomewide information slightly outperform model 1 in predictive ability. This gain in predictive ability is not significant, but nevertheless suggestive and fairly consistent across traits. Model 2 always has the better predictive ability in spite of being the simpler one.
Changes in accuracy of the genetic value are shown in Table 3. For the across-families case, accuracies increased up to ∼0.5. This is actually not surprising as these accuracies were close to 0 in model 1. For the within-families case, the increase in accuracy in prediction of genetic value ranged from 0 up to 0.14 by using genomic information. It has to be kept in mind, though, that in this case these values are not significant.
DISCUSSION
Genomewide selection tools show in general similar or better predictive ability than classical polygenic methods. This is as expected if data are adequate and underlying assumptions (additivity of QTL effects, strong linkage disequilibrium among at least some markers and QTL) are true (Meuwissen et al. 2001). The increase in accuracy of prediction of the genetic value in our study varied, but it was at best at ∼0.14. This value is comparable to values found by simulations or other data analysis (Meuwissen et al. 2001; Sölkner et al. 2007). However, it has to be kept in mind that we have only suggestive results for the within-families case. The estimators of accuracy are approximated and dependent on variance components, which are estimated (see materials and methods). This might be a source of error. However, this is not the case for the predictive ability shown in Table 2. This difference in predictive ability is assumption free and exact (up to numerical error) and clearly shows better predictive abilities of the models including genomic information.
Model 3 shows lower predictive ability than model 2 and even sometimes than model 1 (Table 2). One could expect the opposite, because the polygenic term is expected to catch all the genetic variability not traced by genetic markers. The most likely explanation is twofold: first, markers capture polygenic resemblance between relatives (see discussion below); second, for this reason, polygenic genetic values and “marker-explained” global genetic values are expected to be extremely collinear, which deteriorates performance of the estimation.
Intriguingly, results (Tables 2 and 3) show more benefit from using genomic selection for decreasing heritabilities. Whereas predictive ability is, as expected, lower for low-heritable traits, the difference in predictive ability and accuracy of the models including SNP information increases with decreasing heritability, with respect to the polygenic model (model 1). If this is confirmed, genomic selection would be a good tool for selecting low-heritable traits.
Role of within- and across-family information:
We have split the data in two ways for the cross-validation approach: across and within families. The comparison between these two ways shows the relative performance of each source of information, either distant or close relatives, respectively. In this work, the information from distant relatives is equivalent to the population-level information (indeed even the most distant individuals in a population are distant relatives). Clearly, in this data set, information from distant relatives has a poorer predictive ability (and thus accuracy in prediction of the genetic value) than information from close relatives, as shown in Table 2. Therefore family information should not be discarded for practical use. This fact partly invalidates the assumption of Meuwissen et al. (2001) that most genetic variation can be traced by the use of populational linkage disequilibrium between markers and QTL. This also invalidates some proposals of genomic selection (Schaeffer 2006) that assume no need of genotyping and phenotyping close-relative animals. The point has been explored in further detail by Habier et al. (2007), who show that some methods (BayesB and a fixed regression) capture better the population LD, which is more useful in the long range, i.e., after several generations.
Model 2, using genomewide SNP information, shows good accuracies and predictive abilities, in spite of not using explicitly pedigree information. This implies that genomewide selection might be of interest for species with difficult or no pedigree tracing, like fish (Martinez 2006), self-pollinating crops, or trees (Bauer et al. 2006). The molecular information would not be used to reconstruct the pedigree (probably with errors) but would be used “as is.” The accuracy will depend on whether individuals analyzed together are close relatives or not, but these relationships do not need to be known.
In our work we estimated by cross-validation the increase in predictive ability (Table 2). These estimates can be compared to accuracies found in simulations (Meuwissen et al. 2001) or other real data analysis (Sölkner et al. 2007). These authors showed accuracies in the prediction of genetic value of 0.81–0.85, for traits of heritability of 0.5 (Meuwissen et al. 2001) and ∼1 (progeny-tested estimated breeding values in bulls; Sölkner et al. 2007).
Unfortunately, they did not compare their genomewide genetic evaluations with a polygenic model strategy without genomic information, such as model 1. They suggest comparison with parents' information in a polygenic model, which is at best 0.71. Further increase could be achieved, in a polygenic model framework, only by means of the progeny information. While this is true for the simulations in Meuwissen et al. (2001), we consider that this assertion is false in Sölkner et al. (2007), who found accuracies of ∼0.8. They used a complex, real pedigree of dairy bulls. In a cross-validation approach, they sampled four-fifths of these bulls for training and one-fifth for validation. However, most of these bulls are related. It is thus likely that bulls in the validation data set have some descendants in the training data set. For example, assume a bull breeding value is estimated from the information from his father, maternal grandfather, and four sons. This would result in a theoretical accuracy of 0.81 (Van Vleck et al. 1987), without using genomic information.
At any rate, their increase in accuracy might be considered to be ∼0.10–0.14. Our results (Table 3) are comparable. This is uncertain, though, for differences are not always significant and are higher for low-heritable traits. Conversely, for a trait with heritability of 0.50, the increase of 14% in accuracy found by Meuwissen et al. (2001) would be reflected in an increase in predictive ability of 10%, which is comparable as well to our results in Table 2.
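As a rough check of this conversion, the relation between accuracy and predictive ability derived in the appendix can be applied with $H^2 = 0.50$. Reading the 14% increase as an absolute increase of 0.14 in accuracy is an assumption made here:

$$\Delta r(y,\hat{y}) \approx \Delta r(g,\hat{g})\, H = 0.14 \times \sqrt{0.50} \approx 0.099 \approx 0.10 .$$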
Cross-validation in a genetic improvement context:
We addressed validation of the genomewide genetic evaluation by cross-validation. Cross-validation has a clear interpretation in a genetic improvement context because it mimics a genetic improvement process. The objective of any breeding program is to improve future performance of the individuals in the population. In practice, the breeding process goes on through the analysis of a series of phenotypes ($\mathbf{y}_1$), to estimate breeding values, these estimators being used to select the next generation that in turn will express its phenotypes (“future” performances, $\mathbf{y}_2$). If these predicted future performances are used to make breeding decisions (e.g., producing selected animals in the population expressing $\mathbf{y}_2$), the observed phenotypic gain depends on the correlation r between $\hat{\mathbf{y}}_2$ and $\mathbf{y}_2$, $\Delta P = i\, r\, \sigma_y$, where i is the selection intensity. This equation reduces to the usual breeders' equation (Lynch and Walsh 1998) under the usual assumptions of the polygenic model P = G + E, if there is no other random effect.
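The reduction to the breeders' equation can be sketched as follows. The symbols $\Delta P$ (expected phenotypic gain), $\sigma_y$ (phenotypic standard deviation), and $\sigma_g$ (genetic standard deviation) are notation assumed here for illustration, and the second step assumes that predictions are uncorrelated with the residual E, so that $r(y,\hat{y}) = r(g,\hat{g})\,\sigma_g/\sigma_y$:

$$\Delta P = i\, r(y,\hat{y})\, \sigma_y = i\, r(g,\hat{g})\, \frac{\sigma_g}{\sigma_y}\, \sigma_y = i\, r(g,\hat{g})\, \sigma_g .$$

The last expression has the familiar form of selection intensity times accuracy times genetic standard deviation.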
This holds as well for the across-family cross-validation, where we might want to select individuals on the basis of information from other families. Therefore the correlation among predicted and observed performances in a cross-validation setting is a direct measure of the efficiency of a breeding scheme applying the proposed model to this set of data. Although this approach is robust and assumption free, its interpretation in genetic terms remains problematic as far as there are environmental effects. Another possibility for validation is the use of quasi-true estimated breeding values (Sölkner et al. 2007), although this neglects the role of other phenomena such as genotype–environment interactions, epistasis, or dominance.
Other models for genetic evaluation:
There is an equivalence between the genomewide marker-locus effects models and models using markers as indicators of relatedness (identity-by-state, IBS) (Caballero and Toro 2002; Habier et al. 2007). We can summarize the overall genetic value due to marker-locus effects (v) of individual i as $v_i = \sum_{j=1}^{n} z_{ij} a_j$, where $a_j$ are individual SNP effects ($-\frac{1}{2}a_j$ for allele “1” and $+\frac{1}{2}a_j$ for allele “2”) and $z_{ij}$ are indicator variables. Assuming small effects, it comes out that the joint distribution of v is approximately multivariate normal, with mean = 0 and variance = $\mathbf{ZZ}'\sigma_a^2$. Note that v is a “genomic” counterpart of u, the polygenic breeding value; v might be called genomewide or genomic breeding value. It can be shown that $\mathbf{ZZ}'$ is related to the matrix of IBS probabilities, as . The IBS matrix would give the same information as the identical-by-descent (IBD) (the coefficient of coancestry) probabilities if SNPs were fully informative about their origin [i.e., infinite loci and different alleles at each locus for every individual in the base population (Caballero and Toro 2002)]. However, this is not the case here and two copies of the same locus might be IBS without being IBD.
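One way in which such a relation can arise is sketched below; it is not necessarily the expression the authors had in mind. The sketch assumes that the covariate is coded -1, 0, and 1 for genotypes 11, 12, and 22, and that the IBS coefficient of two individuals at a locus is the probability that one allele drawn at random from each is identical in state. Under these assumptions, the per-locus cross-product of covariates equals twice the IBS coefficient minus one, so that summing over loci makes each element of ZZ′ a linear function of the average IBS coefficient.

```python
from itertools import product

# Assumed coding of the covariate for each genotype.
CODE = {"11": -1, "12": 0, "22": 1}

def ibs(g1, g2):
    """Probability that one allele drawn from each genotype is identical by state."""
    matches = sum(x == y for x, y in product(g1, g2))
    return matches / 4

# Check z_i * z_k == 2 * IBS - 1 for every pair of genotypes at one locus.
relation_holds = all(
    CODE[g1] * CODE[g2] == 2 * ibs(g1, g2) - 1
    for g1, g2 in product(CODE, repeat=2)
)
```

The check covers all nine genotype pairs, including an individual paired with itself.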
Using genomewide breeding values v, a linear model can be constructed as follows:
$$\mathbf{y} = \mathbf{Xb} + \mathbf{Sc} + \mathbf{v} + \mathbf{e}.$$
The model is equivalent to model 2 in the sense of Henderson (1985), i.e., after appropriate transformation solutions for v are identical. The joint distribution of v is well defined for positive-definite $\mathbf{ZZ}'$; i.e., its rank is equal to the size of v (otherwise, likelihood is 0). We tried this approach, but it led to serious numerical problems ($\mathbf{ZZ}'$ positive definite but of extremely small determinant) and we did not further pursue this option.
IBS can be thought of as a molecular counterpart of the additive genetic relationship matrix G, with more and less information at the same time: more, because we can trace meiosis sampling among full sibs; less, because this matrix assumes that two animals sharing the same molecular information are identical in spite of the possibility of this being just by chance, without one being related to the other. A more refined approach is to use as well the IBD information conditional on molecular and genealogical information, to reflect relationships among relatives. Similar ideas have been used in different contexts: genomic control (Yu et al. 2006), refining of polygenic models (Visscher et al. 2006), or genetic evaluation (Villanueva et al. 2005). The latter three condition the IBD state to the available genealogy. The fact that predictive ability is better using within-family information suggests that within-family IBS (i.e., IBD) is important and might be worth modeling.
Two such models are (1) the segment-mapping approach (Pérez-Enciso and Varona 2000), which models total breeding value as a sum of small segments of the genome, allowing for linkage, and (2) the use of marker-assisted relatedness (Villanueva et al. 2005; Visscher et al. 2006), which models the covariance between relatives using molecular information conditional on genealogy. Another possibility is to avoid explicit modeling; nonparametric methods might have better predictive abilities and are robust to departures from the assumed theory (Fox 2000; Gianola et al. 2006).
Our results suggest, but do not prove, that genomewide genetic evaluation and selection have better accuracies and predictive abilities than the classical polygenic model. More traits need to be analyzed, and studies carried out across different species, to confirm this hypothesis. Results also show that within-family information, for this data set, is a more accurate source of information than across-family information. This is relevant for the setup of genetic improvement programs. Cross-validation has been shown to be a valuable tool for this study. Results also show good properties of genomic selection for the case of unrecorded pedigrees, where available tools are scarce. As for its practical implementation, the use of genomic selection will depend on a cost-benefit analysis of the cost of recording DNA samples against the expected additional economic gains.
APPENDIX: RELATION BETWEEN ACCURACY IN THE ESTIMATION OF GENETIC VALUE AND PREDICTIVE ABILITY
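Let us assume $y$ and $\hat{y}$ are random variables denoting the realization and the prediction of a phenotype. For a realization of $y$, we can write the simple model $y = g + e$, where $g$ is the overall genetic value and $e$ is a residual term. Variables $g$ and $e$ are assumed uncorrelated. On the other hand, $\hat{y} = \hat{g}$. Therefore, the correlation between observed and predicted phenotype
$$r(y, \hat{y}) = \frac{\mathrm{cov}(y, \hat{y})}{\sigma_y \sigma_{\hat{y}}} = \frac{\mathrm{cov}(g + e, \hat{g})}{\sigma_y \sigma_{\hat{g}}}$$
can be reduced to
$$r(y, \hat{y}) = \frac{\mathrm{cov}(g, \hat{g})}{\sigma_g \sigma_{\hat{g}}} \, \frac{\sigma_g}{\sigma_y} = r(g, \hat{g}) \, H,$$
where $H^2 = \sigma_g^2 / \sigma_y^2$, i.e., the broad-sense heritability. Therefore, the “true” accuracy might be obtained as $r(g, \hat{g}) = r(y, \hat{y}) / H$.
In the case of our work we assumed three random variables in $y$: a genetic effect $g$ (with different modelizations across models), the cage effect $c$, and the residual $e$. Therefore $\hat{y} = \hat{g} + \hat{c}$. Following a development as above, we have
$$r(y, \hat{y}) = \frac{\mathrm{cov}(g + c + e, \hat{y})}{\sigma_y \sigma_{\hat{y}}},$$
which might be simplified and split as
$$r(y, \hat{y}) = \frac{\mathrm{cov}(g, \hat{y})}{\sigma_y \sigma_{\hat{y}}} + \frac{\mathrm{cov}(c, \hat{y})}{\sigma_y \sigma_{\hat{y}}}.$$
As an approximation, the second term might be assumed to be constant across models, because all of them fitted the cage effect. As for the first term, it can be further developed by expanding $\hat{y}$ into $\hat{g} + \hat{c}$:
$$\frac{\mathrm{cov}(g, \hat{y})}{\sigma_y \sigma_{\hat{y}}} = r(g, \hat{g}) \, \frac{\sigma_g}{\sigma_y} \, \frac{\sigma_{\hat{g}}}{\sigma_{\hat{y}}} + \frac{\mathrm{cov}(g, \hat{c})}{\sigma_y \sigma_{\hat{y}}}.$$
Assuming $\mathrm{cov}(g, \hat{c}) = 0$ (i.e., the cage effect does not capture genetic information), the second term can be again neglected (this is likely an approximation). In this expression there are pseudoheritability terms that indicate the amount of variation in the prediction explained by the genetic part. This can be assumed to be fairly constant across models.
It is thus possible to refer, at least approximately, differences in the predictive ability among models to differences in the predictive ability for the genetic component, $r(g, \hat{g})$, as
$$\Delta r(y, \hat{y}) \approx \Delta r(g, \hat{g}) \, H \hat{H},$$
where $H = \sigma_g / \sigma_y$ and $\hat{H} = \sigma_{\hat{g}} / \sigma_{\hat{y}}$.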
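The simplest relation in this appendix (phenotype equal to genetic value plus residual, no cage effect) can be checked by simulation. The sketch below is an illustration under assumed values, not an analysis of the mouse data: the variances, the sample size, and the noisy predictor are all hypothetical, and `predictive_ability_to_accuracy` is a made-up helper name. It verifies that dividing the correlation between phenotype and prediction by the square root of the heritability recovers, up to sampling error, the correlation between true and predicted genetic value.

```python
import numpy as np

def predictive_ability_to_accuracy(r_y_yhat, h2):
    """Convert cor(y, y_hat) to cor(g, g_hat) by dividing by H = sqrt(h2)."""
    return r_y_yhat / np.sqrt(h2)

rng = np.random.default_rng(2)
n = 200_000
var_g, var_e = 0.4, 0.6                  # assumed variances, so H^2 = 0.4
g = rng.normal(0.0, np.sqrt(var_g), n)   # true genetic values
e = rng.normal(0.0, np.sqrt(var_e), n)   # residuals, independent of g
y = g + e

# Hypothetical predictor: true genetic value plus independent prediction noise,
# so that it is uncorrelated with the residual e.
g_hat = g + rng.normal(0.0, 0.8, n)

r_y = np.corrcoef(y, g_hat)[0, 1]        # predictive ability
r_g = np.corrcoef(g, g_hat)[0, 1]        # accuracy
h2 = var_g / (var_g + var_e)

recovered = predictive_ability_to_accuracy(r_y, h2)
assert abs(recovered - r_g) < 0.02
```

The check relies on the prediction being uncorrelated with the residual of the predicted record, which is what cross-validation is meant to guarantee; with the cage effect in the model, only the approximate relation between differences applies.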
We thank Johann Sölkner, Zulma Vitezica, and Miguel Ángel Toro for discussions. We gratefully acknowledge The Wellcome Trust Centre for Human Genetics, Oxford, for making the heterogeneous stock data available at http://gscan.well.ox.ac.uk.
Communicating editor: C. Haley
- Received February 27, 2008.
- Accepted July 10, 2008.
- Copyright © 2008 by the Genetics Society of America