Abstract
We suggest a new approximation for the prediction of genetic values in marker-assisted selection. The new approximation is compared to the standard approach. It is shown that the new approach will often provide substantially better prediction of genetic values; furthermore, the new approximation avoids some of the known statistical problems of the standard approach. The advantages of the new approach are illustrated by a simulation study in which the new approximation outperforms both the standard approach and phenotypic selection.
MARKER-ASSISTED selection (MAS), like many quantitative trait loci (QTL) mapping techniques (Haley and Knott 1992; Zeng 1994; Jansen 2001), exploits the linkage disequilibrium between markers and QTL produced when inbred lines are crossed. However, in MAS we do not aim to map the QTL and estimate their positions and effect sizes. The goal of marker-assisted selection is the improvement of certain traits in breeding programs for plants or animals. We therefore want to predict the genetic values z for certain traits of each individual and select the best individuals according to their genetic values for further breeding. A considerable literature on this topic now exists and is reviewed in Whittaker (2001). The key conclusion is that, since marker-assisted selection employs both marker and phenotypic information for the prediction of the genetic values, it can outperform selection based solely on phenotypic information, especially when the sample size is large (Gimelfarb and Lande 1994; Whittaker et al. 1995; Hospital et al. 1997).
The standard approach to marker-assisted selection for inbred lines, due to Lande and Thompson (1990), is a two-stage procedure. First, phenotypes are regressed on marker information to give a prediction of the genetic value, known as the marker score, for each individual. This score is then combined with phenotypic information in a selection index to give a final prediction of genetic merit.
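The two stages can be illustrated with a minimal sketch for a single trait. It is not the Lande and Thompson (1990) implementation: the toy data, the variable names, and the use of the simulated true genetic values z to compute the stage-two weights are assumptions made for illustration only (in real data z is unobserved and the covariances have to be estimated).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 300, 6                                        # hypothetical sample size and marker count

# Toy data (assumption): unlinked markers coded 0/1/2, three of them tagging QTL.
x = rng.integers(0, 3, size=(n, p)).astype(float)
z = x @ np.array([1.0, 0.6, 0.3, 0.0, 0.0, 0.0])     # part of the genetic value tagged by markers
z = z + rng.normal(scale=0.5, size=n)                # part of the genetic value not tagged by markers
y = z + rng.normal(scale=2.0, size=n)                # phenotype = genetic value + environment

# Stage 1: regress the phenotype on the markers; the fitted values are the marker scores.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
s = design @ coef

# Stage 2: combine phenotype and marker score in a selection index.
# The weights are those of the best linear predictor of z from (y, s).
w = np.column_stack([y, s])
wc = w - w.mean(axis=0)
zc = z - z.mean()
b = np.linalg.solve(wc.T @ wc, wc.T @ zc)            # index weights for y and s
index = z.mean() + wc @ b                            # predicted genetic merit
```

Individuals would then be ranked on `index`. Because the marker score enters only as a single derived variable, any information in the markers that the regression in stage 1 does not capture is unavailable to stage 2.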
Here we introduce an alternative single-stage procedure, which essentially treats information at each individual marker as a separate trait. Thus all marker information can be entered, together with phenotypic information, into a single selection index, which is then used to predict the genetic merit. This was first suggested for a single marker and trait by Smith and Simpson (1986); here we describe a method suitable for multiple markers and multiple traits. The approach is computationally simpler than the standard approach by Lande and Thompson (1990), and simulation studies show that the new approach can give substantial improvements in selection response. Further, the new approach can be seen as a natural extension of the method proposed by Smith and Simpson (1986).
METHODS
We start with n individuals from an F_{2} generation derived by crossing two inbred parental lines. For each individual we record phenotypes for m traits to give the vector y_{i} = (y_{i1}, …, y_{im})^{T}, i = 1, …, n, and also a set of p marker values x_{i} = (x_{i1}, …, x_{ip}). We assume that the same genetic markers are typed in every individual.
Suppose we wish to improve all m traits simultaneously in the breeding program. We can rank and select individuals for further breeding only when the genetic “value” of each individual is characterized by a single scalar and not by an m-dimensional vector. Hence one has to combine the distinct trait values and define an overall genetic “value” for each individual.
This is typically done by computing a linear index
The individuals with the largest index values are selected for further breeding. We therefore have to predict the index vector
The parameters β_{kj} in Equation 6 are typically estimated by regressing the phenotype on the marker information. The best linear predictor
We now show how the two-stage Lande and Thompson procedure (5) described above, with s_{i} calculated by regression of y_{i} on x_{i} and then y_{i} and s_{i} combined to predict the genetic value z_{i}, can be replaced by a single-step procedure. Instead of the separate steps we compute directly the best linear predictor
This essentially sets up a selection index including all markers and phenotypes, where each marker is treated as a separate trait. To compare the Lande and Thompson approximation (7) with the approximation proposed in (8), we consider the spaces in which the best linear predictors are computed.
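As a minimal sketch of such a single-stage index for one trait, the following uses the standard best-linear-predictor form Cov(Z, W) Var(W)^{-1} (w − mean of W) with W = (y, x). The function name `best_linear_predictor`, the toy data, and the effect sizes are hypothetical. The covariances are computed from simulated true genetic values, which corresponds to the case without estimation error; with real data Cov(Z, W) would have to be estimated.

```python
import numpy as np

def best_linear_predictor(z, w):
    """Best linear predictor of z from the columns of w, using empirical moments."""
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    wc = w - w.mean(axis=0)
    zc = z - z.mean()
    n = len(z)
    var_w = wc.T @ wc / (n - 1)                  # empirical Var(W)
    cov_wz = wc.T @ zc / (n - 1)                 # empirical Cov(W, Z)
    weights = np.linalg.solve(var_w, cov_wz)     # index weights
    return z.mean() + wc @ weights

rng = np.random.default_rng(0)
n, p = 200, 5
x = rng.integers(0, 3, size=(n, p)).astype(float)   # typed markers (toy, unlinked)
q = rng.integers(0, 3, size=(n, 2)).astype(float)   # untyped loci (toy)
z = x @ np.array([1.0, 0.5, 0.25, 0.0, 0.0]) + q @ np.array([0.8, 0.4])  # true genetic value
y = z + rng.normal(scale=2.0, size=n)               # phenotype

# Phenotype-only index versus the single-stage index on (y, x_1, ..., x_p),
# where every marker enters the index as if it were a separate trait.
pred_pheno = best_linear_predictor(z, y[:, None])
pred_single = best_linear_predictor(z, np.column_stack([y, x]))

mse_pheno = np.mean((z - pred_pheno) ** 2)
mse_single = np.mean((z - pred_single) ** 2)
```

With the same empirical moments, the predictor built on the larger set (y, x) cannot have a larger in-sample mean squared error than one built on a subset of it, which is the nesting argument in miniature.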
Because of the definition of S_{i} (Equation 6) it is obvious that
Finally, note that a further drawback of the Lande and Thompson approach is the appearance of the marker score S_{i} in the approximation (7). The marker score S_{i} is unknown and has to be predicted by Equation 6. Since the marker subsets
Now we can apply either the linear approximation (7) of E(Z_{i} | y_{i}, s_{i}) or (8) of E(Z_{i} | y_{i}, x_{i}) to the index prediction problem (9) and predict the index value
For either approach a number of matrices have to be estimated. To use the Lande and Thompson (1990) approach the matrices Var(Y), Cov(Y, S), Var(S), Var(Z), and Cov(Z, S) must be computed; a detailed discussion on the different ways of estimating these matrices is given in Whittaker (2001), but here we give only key details.
The estimation of the variance matrices Var(Y), Var(S), and Var(Z) is straightforward. Var(Y) is estimated directly by the empirical covariance matrix of the phenotypic information and Var(S) by
Now consider our new approach. To compute the weight matrices
In conclusion, theory suggests that approach (9) must be at least as good as the Lande and Thompson approach. Often, when the number of marker loci is relatively large compared to the number of traits, we would expect it to be substantially better. In the next section we compare the two methods by simulation experiments.
SIMULATION EXPERIMENTS
Simulation experiment when more records than markers are given: To compare the performance of marker-assisted selection based on the Lande and Thompson approximation (9) with marker-assisted selection based on our approximation (10), we simulate 20 chromosomes of an F_{2} generation from two inbred parental lines and distribute 11 markers uniformly over each chromosome. Each marker interval is 10 cM in length, giving a total chromosome length of 100 cM. We generate two Gaussian traits, each influenced by 23 QTL, so that the genome contains 46 QTL in total. QTL locations are obtained by drawing random samples from a uniform distribution. As Lande and Thompson (1990) suggested, we compute the QTL effects a_{ij} using geometric series and choose the effects of the QTL so that the total heritability of each trait is 0.2.
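One way to obtain geometric-series QTL effects with a target heritability is sketched below. The ratio of the series (0.9), the environmental variance (1.0), and the function name are assumptions, since these details are not stated here. The genetic variance is computed as if the QTL were unlinked and purely additive, so that each F_{2} locus coded −1/0/1 contributes a_{ij}^2/2; linkage between QTL on the same chromosome would add covariance terms that this sketch ignores.

```python
import math

def scaled_geometric_effects(n_qtl, h2, ratio=0.9, v_e=1.0):
    """Geometric-series effects rescaled so that V_G / (V_G + V_E) equals h2."""
    raw = [ratio ** k for k in range(n_qtl)]            # 1, r, r^2, ...
    v_g_raw = 0.5 * sum(a * a for a in raw)             # additive F2 variance, unlinked loci
    v_g_target = h2 / (1.0 - h2) * v_e                  # from h2 = V_G / (V_G + V_E)
    scale = math.sqrt(v_g_target / v_g_raw)
    return [scale * a for a in raw]

effects = scaled_geometric_effects(n_qtl=23, h2=0.2)    # 23 QTL per trait, heritability 0.2
```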
We use a population size of 600 individuals, with index a = (1, 1)^{T}, and assume that there is no environmental correlation (i.e., the ε_{ij} in Equation 4 are independent). We run the index selection for 20 generations and repeat the experiment 100 times. On the basis of their predicted index values we select the best 20% of the individuals in each generation and mate them randomly to produce the next generation.
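The selection step in each generation can be sketched as truncation selection on the predicted index followed by random pairing of the selected parents. The function names are hypothetical, the index values here are random placeholders, and sampling parents independently with replacement (which allows selfing) is an assumption about how random mating is carried out.

```python
import random

def select_top_fraction(index_values, fraction=0.2):
    """Return positions of the individuals with the largest predicted index values."""
    n_selected = max(1, int(round(fraction * len(index_values))))
    ranked = sorted(range(len(index_values)),
                    key=lambda i: index_values[i], reverse=True)
    return ranked[:n_selected]

def random_mating_pairs(selected, n_offspring, rng):
    """Draw one pair of parents per offspring, independently and with replacement."""
    return [(rng.choice(selected), rng.choice(selected)) for _ in range(n_offspring)]

rng = random.Random(42)
predicted_index = [rng.gauss(0.0, 1.0) for _ in range(600)]   # placeholder index values
selected = select_top_fraction(predicted_index, fraction=0.2)  # best 20% of 600 individuals
pairs = random_mating_pairs(selected, n_offspring=600, rng=rng)
```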
We conduct the simulation experiment twice. First we use the true empirical covariance matrices Cov(Y, S) and Var(Z) in (9) and (10). This gives the performance of the two approaches in the absence of estimation error. Second, the marker scores s_{i} are predicted by linear regression of the phenotypic information on the marker information. For the Lande and Thompson approach the covariance matrix Cov(Y, S) is estimated by the cross-validation method proposed by Whittaker et al. (1997). All parameters of the new approach and of the Lande and Thompson approach are reestimated in each generation, using the new marker and phenotypic data. On the basis of these estimates the indices are also recalculated in each generation.
The mean responses for the first and second simulation experiments are shown in Figures 1 and 2, respectively, with the Lande and Thompson approach (9) denoted by “L&T MAS” and approach (10) by “Opt MAS” in each case. Classical selection based only on phenotypic information is denoted by “Pheno.”
In the absence of estimation error, the plots show a clear superiority of approach (10) over the Lande and Thompson approach (9), with both being superior to classical phenotypic selection. Similar results are obtained in the presence of estimation error (Figure 2). The performance of all three methods is reduced, but the ordering of the methods is unaffected.
Simulation experiment when more markers than records are given: We repeat the previous simulation experiment with sample size 50, 10 chromosomes of length 1 M, and 11 markers spaced uniformly over each chromosome. The total heritability is assumed to be 0.05. Since there are more markers (110) than records (50),
the empirical variance matrix of the markers is now singular and we have to compute the generalized inverse matrix of
For the simulation study without estimation error the plots of the mean responses are shown in Figure 3. Figure 4 shows the same plots in the presence of estimation error.
In the absence of estimation error, the overall ordering of the methods is maintained, although response to selection is lower than in the previous section. However, in the presence of estimation error the response to our approach is reduced far more than the response to the Lande and Thompson approach. For the first five generations the Lande and Thompson approach even performs slightly better than our approach.
DISCUSSION
In this article we propose a new approximation for the prediction of genetic values that has theoretical advantages over the standard approach. We have shown that our new approach is also applicable when the sample size is smaller than or equal to the number of markers. The marker variance matrix Var(X) can be estimated by the empirical variance matrix even when the sample size is smaller than the number of markers. However, the empirical variance matrix will then be singular and instead of the standard inverse the generalized inverse matrix has to be computed. When the sample size is substantially smaller than the number of markers, the empirical variance matrix might be a poor estimate and alternative estimators may be considered; e.g., since we start with an F_{2} generation of two inbred lines, the marker variance matrix can also be computed analytically when the marker interval lengths are known. Alternatively the marker variance matrix might also be estimated by bootstrapping or Monte Carlo simulation experiments. These issues will be a topic of further research.
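The singular case can be illustrated with a small sketch using 50 records and 110 markers. The toy genotypes, the effect sizes, and the choice of the Moore-Penrose pseudo-inverse (one common generalized inverse, not necessarily the one used in our simulations) are assumptions, as are the explicit tolerances.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 110                                       # fewer records than markers
x = rng.integers(0, 3, size=(n, p)).astype(float)    # toy marker genotypes coded 0/1/2
xc = x - x.mean(axis=0)
var_x = xc.T @ xc / (n - 1)                          # empirical Var(X), a p x p matrix

# Centered data from n records span at most n - 1 dimensions, so Var(X) is singular.
rank = np.linalg.matrix_rank(var_x, tol=1e-8)

# Moore-Penrose pseudo-inverse; rcond discards numerically zero singular values.
var_x_ginv = np.linalg.pinv(var_x, rcond=1e-10, hermitian=True)

# Example use: index weights on the markers for a toy genetic value z.
z = x[:, :3] @ np.array([1.0, 0.5, 0.25])
cov_xz = xc.T @ (z - z.mean()) / (n - 1)
weights = var_x_ginv @ cov_xz
```

The pseudo-inverse returns finite weights even though the ordinary inverse does not exist, but it does not address how poorly a rank-deficient empirical matrix may estimate the true marker variance matrix.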
The theoretical advantages of the new approach are confirmed by our simulation experiments, where the new approximation clearly outperforms both the classical Lande and Thompson approach and phenotypic selection when the number of records exceeds the number of markers to be included in the index. We also performed simulation studies for smaller sample sizes, more traits, different heritabilities, and nonzero environmental correlation (Lange 2000). In all simulation studies we observed the same pattern. The superiority of the new approach over both the classical Lande and Thompson approach and phenotypic selection was always substantial, provided the number of records exceeds the number of markers to be included in the index. In theory, the differences between the MAS approaches will be largest when the heritability of the traits is low or dense marker maps are given, since in these cases modeling the marker scores s_{i} is more difficult (e.g., it becomes a model selection problem for the marker scores). This expectation was supported by the simulation experiments in the absence of estimation error (Figures 1 and 3). However, in the presence of estimation error the advantages of the new approach partly vanish (Figures 2 and 4). In particular, where the number of records is less than the number of markers, our approach can be inferior to the standard two-stage method in early generations. This indicates the need for more sophisticated estimation methods than the ones used for the new approach here.
In practice, selection solely on individual phenotype, as described here, is seldom used. Rather, phenotypic information on relatives would be incorporated, either via selection indices or, more usually, by the use of BLUP breeding value estimates: This of course reduces the advantage of MAS. We used phenotypic selection here solely to provide a point of reference for the MAS results, since our focus was on the comparison of the MAS approaches.
We also assumed that all markers are typed in all generations. However, it would be possible to reduce the typing cost by selecting a subset of markers in the F_{2} generation and genotyping only these markers in all subsequent generations. Then the advantages of the new methods will be slightly reduced but still be of practical relevance.
Finally, note that our approximation exploits the whole marker map. The variance of the predicted values can therefore be calculated by
In conclusion, marker-assisted selection based on approximation (10) for the prediction of the genetic values has a number of advantages over the standard approach by Lande and Thompson.
Acknowledgments
We thank two referees for their constructive comments on an earlier draft of this article. This research was supported in part by the Biotechnology and Biological Sciences Research Council (United Kingdom) and in part by grant MH59532 of the National Institutes of Health.
Footnotes

Communicating editor: C. Haley
 Received August 31, 2000.
 Accepted September 5, 2001.
 Copyright © 2001 by the Genetics Society of America