

* Animal Breeding and Genomics Centre, Animal Sciences Group, Wageningen University and Research Centre, 8200 AB Lelystad, The Netherlands,
University of Life Sciences, Department of Animal and Aquacultural Sciences, N-1432 Ås, Norway and
HG, 6802 EB Arnhem, The Netherlands
1 Corresponding author: Animal Breeding and Genomics Centre, Animal Sciences Group, Wageningen University and Research Centre, P.O. Box 65, 8200 AB Lelystad, The Netherlands.
E-mail: mario.calus{at}wur.nl
| ABSTRACT |
|---|
Genomic selection as described by MEUWISSEN et al. (2001) predicts total breeding values on the basis of a large number of marker haplotypes across the entire genome. The underlying assumption of genomic selection is that haplotypes at some loci are in linkage disequilibrium (LD) with QTL alleles that affect the traits that are subject to selection. Different ways of deriving haplotypes of combinations of marker alleles, and the relationship between haplotypes at a locus, have been described. One method (SNP1) is to consider each different marker allele at a single locus to be a different haplotype, considering no relationships between different haplotypes, and thus breeding values are estimated directly for the marker alleles (XU 2003). A second method is to construct haplotypes from two alleles at adjacent markers, assuming a zero relation between haplotypes at the same locus (SNP2) (MEUWISSEN et al. 2001). A third method is to construct haplotypes (HAP_IBD) using two or more surrounding marker alleles and derive identical-by-descent (IBD) probabilities between the different haplotypes at the same locus (MEUWISSEN and GODDARD 2001).
The SNP1 model considers only two haplotypes at a locus and therefore may be suited for applications in, for instance, double-haploid populations with only two segregating genotypes at each locus (XU 2003). For outbred populations, where the association between markers and QTL might be different in different families, the SNP1 model is perhaps less well suited. The advantage of the SNP1 approach is that determining the linkage phase of the haplotypes is not required and the markers do not need to be mapped. A disadvantage of the SNP1 model is that no new haplotypes arise as a result of recombination, while such an event actually might change the linkage between the marker and the QTL alleles. SNP1 and SNP2 do not make a distinction between haplotypes that are alike-in-state (AIS) due to a common ancestor (i.e., IBD) or simply due to chance. The benefit of the HAP_IBD approach is that the common background of haplotypes, and thus the probability that different haplotypes are associated with the same QTL allele, is modeled more accurately. The HAP_IBD approach, as well as SNP2, however, does require an accurate marker map and the determination of the linkage phase. A disadvantage of the HAP_IBD approach is that it is likely to yield many more effects at a single locus that need to be estimated.
The three different approaches have been compared before for their ability to fine map a single QTL (GRAPES et al. 2004). Although it was shown that the SNP1 method was able to compete with the HAP_IBD method, the HAP_IBD method gave more accurate results at the same number of markers (GRAPES et al. 2004). Arguably, genomewide selection could be seen as a special application of multiple-QTL fine mapping. The main difference is that QTL fine mapping aims at determining the position of the QTL, whereas in genomewide selection the aim is to predict accurate breeding values.
The objective of this study was to compare the accuracy of predicted breeding values used in genomic selection for an outbred population with these three different ways of including genomewide marker information. Since it is expected that the difference in marker density is an important factor, these methods are compared at five marker densities ranging from 1 marker/0.13 cM to 1 marker/2.52 cM.
| MATERIALS AND METHODS |
|---|
The considered genome comprised three chromosomes of 1 M each. The positions of 300,000 marker loci and 50,000 QTL loci were randomly determined, with all possible positions on the genome having equal chance. In the first generation, all QTL and marker loci had an allele coded as 1. The probability of having a recombination between two adjacent loci on the same chromosome was calculated using Haldane's mapping function based on the distance between the loci. In generations 1–1000, on average 300 marker and 50 QTL mutations per generation were simulated in the population, yielding mutated alleles coded as 2. Each locus had one mutation during the 1000 generations in a randomly drawn animal. The mutation rates for the markers and QTL were determined on the basis of the number of polymorphic loci in generation 1000 in preliminary analysis, targeting 2500 polymorphic SNPs and 75 QTL per 3 M. Simulating a whole genome was not realistic, but the value for the markers is comparable to a density of 25,000 SNPs on a 30-M genome. The value of 75 QTL per 3 M was chosen to ensure that the simulated variance would not differ too much across replicates due to a limited number of contributing QTL. All marker loci with a minor allele frequency in generations 1001–1003 of <0.02 were discarded. Different marker densities were created for each simulated data set by randomly selecting 100, 50, 20, 10, or 5% of the polymorphic markers.
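The recombination and marker-filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function names are hypothetical, distances are assumed to be in Morgans, and marker identifiers are assumed to be ordered by map position.

```python
import math
import random


def haldane_recombination(distance_morgan: float) -> float:
    """Recombination probability between two loci under Haldane's mapping function."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_morgan))


def minor_allele_frequency(alleles: list[int]) -> float:
    """MAF of a biallelic locus whose alleles are coded 1 (original) or 2 (mutated)."""
    freq_2 = sum(1 for a in alleles if a == 2) / len(alleles)
    return min(freq_2, 1.0 - freq_2)


def thin_markers(marker_ids: list[int], fraction: float, rng: random.Random) -> list[int]:
    """Randomly keep a fraction of the markers (assumed helper; ids sorted by position)."""
    n_keep = round(len(marker_ids) * fraction)
    return sorted(rng.sample(marker_ids, n_keep))
```

A marker would be discarded when `minor_allele_frequency(...) < 0.02`, and `thin_markers` would then be called with fractions 1.0, 0.5, 0.2, 0.1, and 0.05 to create the five densities.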
All original QTL alleles were assumed to have no influence on the considered trait. All mutated QTL alleles received an effect drawn from a gamma distribution (with shape parameter 0.4 and scale parameter of 1.0), being positive or negative with equal chance, following MEUWISSEN et al. (2001). After the first 1000 generations, 3 additional generations (1001–1003) were simulated in which no mutations occurred. The simulated additive genetic variance at each locus i ($\sigma^2_{a_i}$) was calculated using allele frequencies calculated from those three additional generations, using the formula $\sigma^2_{a_i} = 2p(1-p)a^2$ (FALCONER and MACKAY 1996), where p is the allele frequency of one of the two alleles at a QTL locus, and a is the allele substitution effect. The total simulated genetic variance ($\sigma^2_a$) was obtained by summing up the variance across all QTL loci, assuming no correlation between QTL. To obtain a heritability of 0.50 (0.10), the residuals were drawn from a random distribution $N(0, \sigma^2_a)$ ($N(0, 9\sigma^2_a)$). All animals in generations 1001 and 1002 received one phenotypic record, obtained by adding a random residual to the true breeding value of the animals. All phenotypic records were scaled back, such that the phenotypic variance was 1.0. In the 1002nd generation, the population was expanded to 1000 animals and produced one more generation of 1000 offspring. Thus, 1100 animals (generations 1001 and 1002) with known phenotype and genotype were simulated, as well as 1000 juvenile animals with unknown phenotype and known genotype (generation 1003).
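As a rough sketch of the quantities in this paragraph: the per-locus variance is the standard $2p(1-p)a^2$ expression, and the residual variance follows from the target heritability as $\sigma^2_e = \sigma^2_a (1-h^2)/h^2$, which gives $\sigma^2_a$ for $h^2 = 0.50$ and $9\sigma^2_a$ for $h^2 = 0.10$. The function names below are illustrative assumptions, not part of the original study.

```python
import random


def draw_qtl_effect(rng: random.Random) -> float:
    """Effect of a mutated QTL allele: gamma(shape=0.4, scale=1.0) with a random sign."""
    magnitude = rng.gammavariate(0.4, 1.0)
    return magnitude if rng.random() < 0.5 else -magnitude


def locus_variance(p: float, a: float) -> float:
    """Additive genetic variance at one biallelic QTL: 2p(1 - p)a^2."""
    return 2.0 * p * (1.0 - p) * a * a


def residual_variance(genetic_variance: float, heritability: float) -> float:
    """Residual variance that yields the requested heritability."""
    return genetic_variance * (1.0 - heritability) / heritability
```

Summing `locus_variance` over all QTL gives the total simulated genetic variance under the stated assumption of no correlation between QTL.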
Models:
The general model to estimate the breeding values in the simulated data set was
[Equation image not recovered: the general model, with polygenic and haplotype effects.]
Four different variants of this general model were used for the estimation of genomic selection breeding values. A fifth model was used to estimate traditional breeding values using a polygenic model without haplotype effects. Animals without phenotypic information (i.e., the 1003rd generation) were included in all analyses and obtained their estimated breeding values through the mixed-model equations based on estimated breeding values of related animals and haplotypes. The differences between the four genomic selection models, SNP1, SNP2, HAP_IBD2, and HAP_IBD10, lie in the putative QTL positions, the definition of the haplotype effects, and the assumed relation between haplotypes at the same locus. For models SNP2, HAP_IBD2, and HAP_IBD10, estimated haplotype effects applied to the midpoint of a marker bracket, while for SNP1 the estimated haplotype effects applied to the marker loci. In the SNP1 model, a haplotype was defined as a marker allele on a single locus, yielding two random haplotype effects per locus. In the SNP2 model, a haplotype was defined as a combination of marker alleles of two adjacent loci, yielding four possible haplotypes per locus (i.e., 1_1, 1_2, 2_1, and 2_2). The SNP1 model is a model applicable in a practical situation where the linkage phases of the animals cannot be reconstructed. For SNP2, it was assumed that the linkage phase was known without error, to resemble the procedure applied by MEUWISSEN et al. (2001). In models SNP1 and SNP2 all haplotypes within loci that were not AIS were assumed to be unrelated. In the HAP_IBD models, linkage phases were assumed to be unknown and were reconstructed using the procedure described by WINDIG and MEUWISSEN (2004), resembling a practical situation where linkage phases can be reconstructed. 
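The SNP1 and SNP2 haplotype definitions can be illustrated with a short sketch. It assumes one phased chromosome is given as a list of marker alleles coded 1 or 2, in map order; the function names are hypothetical.

```python
def snp1_haplotypes(chromosome: list[int]) -> list[str]:
    """SNP1: one 'haplotype' per marker locus, namely the allele itself."""
    return [str(allele) for allele in chromosome]


def snp2_haplotypes(chromosome: list[int]) -> list[str]:
    """SNP2: one haplotype per marker bracket, built from the two flanking alleles."""
    return [f"{left}_{right}" for left, right in zip(chromosome, chromosome[1:])]
```

For a chromosome with m markers this yields m SNP1 loci and m - 1 SNP2 brackets, each bracket taking one of the four labels 1_1, 1_2, 2_1, or 2_2.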
A haplotype in the HAP_IBD2 (HAP_IBD10) model was defined as a combination of marker alleles of one (five) loci to the "left" of the midpoint of a marker bracket and marker alleles of one (five) loci to the "right" of the midpoint of a marker bracket. Between all haplotypes at the same locus, the probability of being IBD was calculated, combining linkage disequilibrium and linkage analysis information. The IBD probabilities between haplotypes of the first generation of genotyped animals were predicted using a simplified coalescence process, with the assumptions that 100 generations were between the current and base population and that the effective population size during those 100 generations was 100. The number of generations since the base population, i.e., the number of generations since the first marker mutation caused segregation of haplotypes at a locus, was generally <<1000 generations for any of the loci. Since the applied method to calculate IBD matrices proved to be quite robust for the assumption of the number of generations since the base generations (MEUWISSEN and GODDARD 2000), we used 100 for each situation. Haplotypes of animals in later generations were added to the IBD matrices using the recursive formulas as described by FERNANDO and GROSSMAN (1989). A full description of the method to predict the IBD probabilities is given by MEUWISSEN and GODDARD (2001). All pairs of haplotypes that had an IBD probability >0.95 were assumed to contain the same QTL allele and were therefore clustered, which reduced the number of haplotypes. The IBD matrix was used to model the covariances between haplotypes. As mentioned, in the SNP1 and SNP2 models the covariance between different haplotypes was considered to be zero.
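The haplotype windows and the clustering step can be sketched as below. Two points are assumptions made for illustration only: windows are truncated at chromosome ends, and clustering is transitive (if A and B, and B and C, exceed the threshold, all three are merged). The text states only that pairs with an IBD probability >0.95 were clustered.

```python
def window_haplotype(chromosome: list[int], bracket: int, flank: int) -> tuple[int, ...]:
    """Alleles of `flank` markers on each side of the midpoint of bracket `bracket`.

    Bracket b lies between markers b and b + 1 (0-based). flank=1 corresponds
    to HAP_IBD2 and flank=5 to HAP_IBD10.
    """
    start = max(0, bracket - flank + 1)
    return tuple(chromosome[start:bracket + 1 + flank])


def cluster_by_ibd(ibd: list[list[float]], threshold: float = 0.95) -> list[int]:
    """Label each haplotype with the smallest index of its cluster (union-find)."""
    n = len(ibd)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if ibd[i][j] > threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[max(root_i, root_j)] = min(root_i, root_j)
    return [find(i) for i in range(n)]
```

The number of distinct labels returned by `cluster_by_ibd` is the reduced number of haplotype effects to be estimated at that locus.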
The polygenic effects and variances were estimated in each of the four alternative models, using an inverse relationship matrix based on the pedigree of the last four generations of animals. Haplotype variances were estimated for each alternative per locus. Therefore, the number of QTL variances estimated was equal to the number of marker loci for model SNP1 and equal to the number of marker brackets for models HAP_IBD2, HAP_IBD10, and SNP2. The estimated haplotype variance at each locus was calculated as the heterozygosity of the haplotypes at that locus multiplied by the estimated variance of the effects at a locus. The heterozygosity was calculated as the frequency of heterozygote animals for each locus in models SNP1 and SNP2. The haplotype variance at a locus for the HAP_IBD models was calculated analogously to estimating the additive genetic variance in a polygenic model, relative to a base population of unrelated animals. In that case, the additive genetic variance is calculated as $(1 - F) \times \hat{\sigma}^2_a$, where $\hat{\sigma}^2_a$ is the estimated additive genetic variance in the base population, and F is the inbreeding in the current population (FALCONER and MACKAY 1996). We calculated the haplotype variance at a bracket as heterozygosity $\times\ \hat{\sigma}^2_h$, where heterozygosity is the heterozygosity in the analyzed population and $\hat{\sigma}^2_h$ is the estimated haplotype variance for the base population. In our situation we assume that the animals were unrelated in the base population considered in the prediction of the IBD probabilities (100 generations ago), meaning that the heterozygosity was assumed to be 1.0 and that the IBD probability between paternal and maternal haplotypes at a locus was 0.0. Across generations, animals became related, and some IBD probabilities between paternal and maternal haplotypes at a locus became >0.0. Following this reasoning, the heterozygosity for the HAP_IBD models at a locus was estimated as follows:
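The expression itself is not reproduced in this copy. One form consistent with the reasoning above, offered as an assumption rather than as the authors' exact equation, is one minus the average IBD probability between the paternal and maternal haplotypes of the analyzed animals at that locus. The symbols $n$ (number of animals), $h^{\mathrm{pat}}_{ij}$, $h^{\mathrm{mat}}_{ij}$, and $P_{\mathrm{IBD}}$ are notation introduced here for illustration.

```latex
\text{heterozygosity}_j \;=\; 1 - \frac{1}{n}\sum_{i=1}^{n}
  P_{\mathrm{IBD}}\!\left(h^{\mathrm{pat}}_{ij},\, h^{\mathrm{mat}}_{ij}\right)
```

Under this form the heterozygosity equals 1.0 when all paternal-maternal IBD probabilities are 0.0, matching the base-population assumption, and decreases as animals become related.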
The four models were compared by the accuracy of the estimated breeding values for animals with (generations 1001 and 1002) and without a phenotypic record (generation 1003) and by regression of the simulated breeding values on the estimated breeding values for the animals of generation 1003. Accuracies were calculated as the correlation between simulated and estimated breeding values. Each simulated data set and model analysis were replicated 10 times.
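The two comparison criteria can be computed as in the sketch below, which assumes simulated and estimated breeding values are supplied as equal-length lists for the same animals. The function names are illustrative.

```python
import math


def accuracy(simulated: list[float], estimated: list[float]) -> float:
    """Pearson correlation between simulated and estimated breeding values."""
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_e = sum(estimated) / n
    cov = sum((s - mean_s) * (e - mean_e) for s, e in zip(simulated, estimated))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov / math.sqrt(var_s * var_e)


def regression_slope(simulated: list[float], estimated: list[float]) -> float:
    """Slope of the regression of simulated (y) on estimated (x) breeding values."""
    n = len(simulated)
    mean_s = sum(simulated) / n
    mean_e = sum(estimated) / n
    cov = sum((s - mean_s) * (e - mean_e) for s, e in zip(simulated, estimated))
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov / var_e
```

A slope of 1.0 means the estimated breeding values are on the same scale as the simulated ones; the correlation is unaffected by such scaling.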
| RESULTS |
|---|
25–500 times higher for the HAP_IBD models compared to the SNP1 and SNP2 models (Table 2). The number of haplotypes decreased nearly linearly with decreasing total number of markers for models SNP1 and SNP2. The number of haplotypes for the HAP_IBD models decreased relatively less with increasing total number of markers.
Comparison of simulated and estimated total breeding values:
The accuracies of the estimated total breeding values were plotted as a function of r2-values between adjacent markers (Figures 1 and 2). The accuracies of the estimated breeding values for the high- (low-) heritability trait using any of the genomic selection models compared to the polygenic model were between 0.03 and 0.10 (0.09 and 0.29) higher for animals with phenotypes and between 0.10 and 0.29 (0.22 and 0.34) higher for juvenile animals (Figures 1 and 2). Differences in accuracies of estimated breeding values for the high- (low-) heritability trait for phenotyped animals between the different genomic selection models ranged from 0.0 to 0.05 (0.03 to 0.04), whereas for juvenile animals the differences ranged from 0.01 to 0.11 (0.01 to 0.04).
0.12 was reached. For the low-heritability trait, the HAP_IBD10 model yielded the highest accuracies at lower marker densities, but at the highest marker density the accuracy of the SNP1 model was slightly better (Figure 2). Differences between the models were, however, small at all marker densities. The coefficients of the regression of simulated on estimated breeding values were for the high-heritability trait in nearly all cases close to 1.0 for the HAP_IBD model and >1.0 for the SNP1 and SNP2 models (Table 3). This indicates that the SNP models overestimated the total genetic variance when many markers were included, which is in agreement with the estimated variances (discussed below). The coefficients of the regression of simulated on estimated breeding values were for the low-heritability trait in all cases and for all models <1.0 (Table 3), which is in agreement with the generally overestimated total genetic variances (discussed below).
In Figure 3, the ratio of cumulated estimated haplotype variance to total simulated QTL variance, across loci with decreasing estimated haplotype variance, is plotted against the cumulative number of loci for marker densities with r2 between markers of 0.15 (Figure 3, A and C) and 0.21 (Figure 3, B and D). The points of the curves at the largest number of loci indicate the proportion of the simulated QTL variance that is explained by the total estimated haplotype variance across all loci; i.e., at marker densities with r2 between markers of 0.15 and 0.21, respectively, the estimated haplotype variances explained across models on average 38 (Figure 3A) and 54% (Figure 3B) of the simulated QTL variance for the trait with high heritability and 41 (Figure 3C) and 71% (Figure 3D) of the simulated QTL variance for the trait with low heritability. The initial steep progression of the curves indicates that at low marker density in all models a large proportion of the total estimated haplotype variance is fitted on a limited number of loci (Figure 3, A and C). At high marker density, still a few loci have a large estimated variance, but the contribution of these loci to the total explained haplotype variance is less (Figure 3, B and D). The linear progression of the curves for the HAP_IBD models in Figure 3, following the steep initial progression, indicates that for all situations most of the loci in the HAP_IBD models explain more or less the same amount of variance. The curvilinear progression of the curves for the SNP models, following the steep initial progression, indicates that the contribution of loci to the total estimated haplotype variance for the SNP models eventually becomes smaller. The average posterior probabilities that a QTL was sampled at a locus, for 30 loci with the highest posterior probabilities, are plotted in Figure 4. 
These results show that for the high-heritability trait the HAP_IBD models sampled QTL with a high posterior probability at a few loci, while the SNP models had much lower posterior probabilities on the loci with the largest estimated haplotype variance (Figure 4A). For the low-heritability trait, average posterior probabilities were much lower for the loci that had the largest estimated haplotype variance, and the highest average posterior probability was actually found for the SNP1 model (Figure 4B).
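The curves described for Figure 3 can be reproduced from per-locus estimates with a short helper. This is a sketch under the assumption that one estimated haplotype variance per locus and the total simulated QTL variance are available; the function name is hypothetical.

```python
def cumulative_variance_ratio(
    locus_variances: list[float], simulated_qtl_variance: float
) -> list[float]:
    """Cumulative estimated haplotype variance, loci sorted from largest to
    smallest, expressed as a ratio to the total simulated QTL variance."""
    ratios: list[float] = []
    running = 0.0
    for variance in sorted(locus_variances, reverse=True):
        running += variance
        ratios.append(running / simulated_qtl_variance)
    return ratios
```

The last element of the returned list is the proportion of simulated QTL variance explained across all loci, and a steep start followed by a flat tail corresponds to a few loci carrying most of the estimated variance.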
| DISCUSSION |
|---|
MEUWISSEN et al. (2001) discussed that their simulated microsatellite markers, spaced at 1-cM distances, resembled approximately three to five SNP markers. Thus, their marker distance of 1 cM would be comparable to the average distance between SNP markers of 0.26 cM in our study. The correlation between simulated and estimated breeding values of 0.82 for the high-heritability trait using model SNP2 in our study was comparable to the correlation of 0.79 reported by MEUWISSEN et al. (2001) when their breeding values were estimated on the basis of 1000 animals with phenotypes for a trait with heritability of 50%. SOLBERG et al. (2006) applied genomic selection to simulated data, estimating effects of single-marker alleles comparable to our SNP1 model for a trait with heritability of 50%. At marker distances of 1 and 0.5 cM the accuracies of their estimated breeding values were respectively 0.66 and 0.72, comparable to accuracies in our study of 0.67 and 0.75 at marker distances of respectively 1.3 and 0.65 cM. It should be noted that in our study polygenic effects were included in the model, whereas they were not in the studies by MEUWISSEN et al. (2001) and SOLBERG et al. (2006). Reported r2-values between markers of 0.3 in, for instance, dairy and beef cattle (HAYES et al. 2006) are comparable to the r2-values in our simulations and indicate a large potential benefit of applying genomic selection. It should, however, be noted that our simulated markers were relatively uniformly distributed, while commercially available SNPs might be less uniformly distributed. This implies that next to the LD between adjacent markers, also the distribution of the markers along the genome has to be taken into account to predict the potential benefit of genomic selection.
Four different models were compared in this study. The HAP_IBD2 and SNP2 models both used two markers to construct haplotypes, with the difference that the HAP_IBD2 model included IBD probabilities between different haplotypes. Thus, in the HAP_IBD2 model different linkage phases between marker haplotypes and QTL are considered in different (families of) animals, while the SNP2 approach assumes that a certain marker haplotype is always linked to the same QTL allele. The differences found in this study indicate that including linkage analysis information in the model considerably increases the accuracy of breeding values for a high-heritability trait. For the low-heritability trait, there was little difference between the models across the range of marker densities.
Using more markers per haplotype in the HAP_IBD model (HAP_IBD10 vs. HAP_IBD2) slightly improved the accuracy for both traits at all marker densities. For QTL mapping, it has been shown that predictive ability of an IBD-based model was largest for an intermediate number of markers; i.e., at some stage additional markers led to lower predictive ability of the model (GRAPES et al. 2006). Since in our study only two haplotype sizes were considered, it remains unclear whether, for genomic selection applications, the accuracy of IBD models is likewise highest at an intermediate number of markers.
None of the models was clearly superior for the low-heritability trait. The most striking observation is that the SNP1 model yielded the lowest accuracy at low marker density and the highest accuracy at high marker density. To investigate this apparent inconsistency, we compared the results between the different models at different marker densities within replicates. Results of all replicates for the SNP1 and HAP_IBD10 models, as well as the average accuracy, are shown in Figure 5 for marker densities with r2 between adjacent markers of 0.15 and 0.21. Results in, for instance, replicates 4 and 6 appeared to be rather inconsistent, as the SNP1 model was clearly superior at the highest marker density, while at the other marker density (r2 = 0.15) the SNP1 model was clearly inferior (see dashed lines in Figure 5). In replicate 4, at the highest marker density the SNP1 model found the highest posterior probability for a QTL that explained 35% of the simulated genetic variance, while at the lower marker density the other models found higher posterior probabilities for this QTL. In replicate 6, a few important SNPs were lost at the lower SNP density, which prevented the SNP1 model from detecting a very large QTL that explained 75% of the simulated genetic variance. The other three models, however, were still well able to pick up this QTL. Replicate 6 was rather extreme in having a QTL that explained such a large amount of variance; both the simulated QTL effect and the heterozygosity (0.47) were large. However, when discarding the results from replicate 6, the SNP1 and HAP_IBD10 models on average gave similar results. Thus, at higher marker density, the SNP1 model may actually yield the highest accuracy when some SNPs are (expected to be) closely linked to some important QTL, while the HAP_IBD models are a better choice if there are no SNPs (expected to be) closely linked to important QTL.
In Figure 1, the top horizontal solid line indicates that the accuracy for juvenile animals of the HAP_IBD10 model for the high-heritability trait at r2 between adjacent markers of 0.12 was comparable to the accuracy obtained with the SNP2 model at an r2 between adjacent markers of 0.15. The bottom horizontal solid line indicates that the accuracy for juvenile animals of the HAP_IBD10 model at an r2 between markers of 0.10 was comparable to the accuracy obtained with the SNP2 model at an r2 between markers of 0.14. Translated into numbers of markers in our simulated data sets, in the most extreme situations the SNP2 model needed two to three times as many markers to yield the same results as the HAP_IBD10 model for the high-heritability trait.
In conclusion, there is a clear advantage of genomic selection even at low marker densities and with a simple model that uses marker alleles as haplotypes. Unless there is an expectation that some SNPs are in high linkage disequilibrium with large QTL, the HAP_IBD model is the safest option. However, the results suggest that combining alleles of SNPs that have a known effect with reconstructed haplotypes for the parts of the genome with unknown effect might prove to be the best solution.
| ACKNOWLEDGEMENTS |
|---|
| LITERATURE CITED |
|---|
FALCONER, D. S., and T. F. C. MACKAY, 1996 Introduction to Quantitative Genetics. Longman Group, Essex, UK.
FERNANDO, R. L., and M. GROSSMAN, 1989 Marker assisted selection using best linear unbiased prediction. Genet. Sel. Evol. 21: 467–477.
GRAPES, L., J. C. M. DEKKERS, M. F. ROTHSCHILD and R. L. FERNANDO, 2004 Comparing linkage disequilibrium-based methods for fine mapping quantitative trait loci. Genetics 166: 1561–1570.
GRAPES, L., M. Z. FIRAT, J. C. M. DEKKERS, M. F. ROTHSCHILD and R. L. FERNANDO, 2006 Optimal haplotype structure for linkage disequilibrium-based fine mapping of quantitative trait loci using identity by descent. Genetics 172: 1955–1965.
HAYES, B. J., A. J. CHAMBERLAIN and M. E. GODDARD, 2006 Use of markers in linkage disequilibrium with QTL in breeding programs. Proceedings of the 8th World Congress on Genetics Applied to Livestock Production, Belo Horizonte, MG, Brazil, Communication 30–06.
MEUWISSEN, T. H. E., and M. E. GODDARD, 2000 Fine mapping of quantitative trait loci using linkage disequilibria with closely linked marker loci. Genetics 155: 421–430.
MEUWISSEN, T. H. E., and M. E. GODDARD, 2001 Prediction of identity by descent probabilities from marker-haplotypes. Genet. Sel. Evol. 33: 605–634.
MEUWISSEN, T. H. E., and M. E. GODDARD, 2004 Mapping multiple QTL using linkage disequilibrium and linkage analysis information and multitrait data. Genet. Sel. Evol. 36: 261–279.
MEUWISSEN, T. H. E., B. J. HAYES and M. E. GODDARD, 2001 Prediction of total genetic value using genomewide dense marker maps. Genetics 157: 1819–1829.
SCHAEFFER, L. R., 2006 Strategy for applying genome-wide selection in dairy cattle. J. Anim. Breed. Genet. 123: 218–223.
SOLBERG, T. R., A. K. SONESSON, J. A. WOOLLIAMS and T. H. E. MEUWISSEN, 2006 Genomic selection using different marker types and density. Proceedings of the 8th World Congress on Genetics Applied to Livestock Production, Belo Horizonte, MG, Brazil, Communication 22–13.
WINDIG, J. J., and T. H. E. MEUWISSEN, 2004 Rapid haplotype reconstruction in pedigrees with dense marker maps. J. Anim. Breed. Genet. 121: 26–39.
XU, S. Z., 2003 Estimating polygenic effects using markers of the entire genome. Genetics 163: 789–801.