Abstract
The existence of a quantitative trait locus (QTL) is usually tested using the likelihood of the quantitative trait on the basis of phenotypic character data plus the recombination fraction between QTL and flanking markers. When doing this, the likelihood is calculated for all possible locations on the linkage map. When multiple QTL are suspected close by, it is impractical to calculate the likelihood for all possible combinations of numbers and locations of QTL. Here, we propose a genetic algorithm (GA) for the heuristic solution of this problem. GA can globally search the optimum by improving the “genotype” with alterations called “recombination” and “mutation.” The “genotype” of our GA is the number and location of QTL. The “fitness” is a function based on the likelihood plus Akaike’s information criterion (AIC), which helps avoid falsepositive QTL. A simulation study comparing the new method with existing QTL mapping packages shows the advantage of the new GA. The GA reliably distinguishes multiple QTL located in a single marker interval.
MANY traits of plants and animals are quantitative; that is, the phenotype of the trait is continuous and described by measurement values. In many cases, one quantitative trait is controlled by multiple genetic loci with relatively small effects each. Furthermore, as quantitative traits fluctuate under the influence of environmental effects, we cannot infer a genotype from a phenotype unambiguously. Therefore, genetic and environmental effects contributing to the target traits must be estimated using information from phenotypic values and molecular markers to locate quantitative trait loci (QTL) on linkage maps.
The simple interval mapping (SIM) approach of Lander and Botstein (1989) examines the existence of a QTL at each location on the linkage map. The likelihood is calculated from (a) the conditional probability of the QTL genotype given the marker genotype and (b) the conditional distribution of the phenotypic value given the QTL genotype. The existence of the QTL is evaluated by the loglikelihood ratio (called the LOD score), comparing the proposed QTL with the “no QTL” hypothesis. The LOD score is calculated at all the locations on the linkage map, and, if a peak of the LOD score is higher than a given threshold, a QTL is mapped at the point. SIM uses the expectation maximization (EM) algorithm to estimate the maximumlikelihood (ML) point, and the calculation is fairly computationally expensive. Haley and Knott (1992) described a regression method that can decrease the computation. This regression method examines the existence of a QTL using the information of flanking markers and phenotypic value, but estimates parameters using multiple regression rather than explicit ML. The regression method reduces the calculation task, but the approximation is not so good (Kao 2000).
The SIM method is biased, when there are multiple QTL, because it assumes only one QTL at a time. To correct for this bias, Zeng (1993, 1994) and Jansen (1993) independently developed composite interval mapping (CIM). CIM is a combination of SIM and multiple regression. When testing a QTL using the method of interval mapping, CIM uses the genotypes of the markers (except for the flanking markers of the target QTL) as covariates to control genetic effects at other than the target QTL. However, if there are multiple QTL within a marker interval, neither CIM nor SIM can detect them separately.
It is possible to construct a genetic model that considers multiple QTL simultaneously. Because we do not know the number of QTL in advance, it is impractical to calculate the LOD scores for all possible combinations of number and locations of QTL. We require a practical method to simultaneously estimate the optimum number of QTL. Because the LOD score drops at the positions of markers and consequently shows many local optima, most numerical optimization procedures using information based on the local shape of the LOD score, such as the NewtonRaphson procedure and the simplex method, do not perform so well.
Bayesian analysis of QTL using Markov chain Monte Carlo (MCMC) has been reported recently (Uimari and Hoeschele 1997; Stephens and Fisch 1998; Sillanpää and Arjas 1998; Yi and Xu 2000). This method searches for the best combination of the QTL number and locations by using a prior distribution to update the parameters of the model step by step, which enables us to evaluate the reliability of the estimates in terms of a posterior distribution. However, the approach is presently computationally prohibitive, and thus it is often difficult to detect QTL correctly in a realistic time period.
Kao and Zeng (1997) and Kao et al. (1999) described a general model for handling multiple QTL simultaneously. This model includes not only additive effects of QTL but also epistatic effects between QTL. Their proposed multiple interval mapping (MIM) searches for the best model with a stepwise selection using likelihoodratio statistics to add or to delete a QTL. Threshold values for QTL detection were calculated assuming a χ^{2} distribution of the ratio statistics plus a Bonferroni correction for multiple testing of marker intervals (Lander and Botstein 1989; Zenget al. 1999).
As an alternative with high potential, we propose and test a genetic algorithm (GA) for QTL mapping. We use a multipleQTL model and Akaike’s information criterion (AIC, Akaike 1974) as the evaluation function. Because AIC measures predictive power for the distribution of quantitative traits, it is particularly useful for the application to breeding such as markerassisted selection. Candidates for the QTL numbers and locations are encoded as “genotypes” of the GA. “Selection” on the genotype then produces an efficient global search after maintaining diversity in the GA population through “crossover” and “mutation” type operations. A combination of “drastic mutation” and “slight modification” enables efficient global search for QTL number and locations.
A major advantage of using AIC for QTL mapping comes when the prime interest is markerassisted breeding. Here a major problem is not detecting closely linked QTL of opposite effect (a type II error), and so not checking the results of offspring where rearrangement has occurred. In general, it is very difficult to control type II error rates. Most of the proposed QTL mapping methods control type I error rates with no regard to type II error rates. However, by maximizing predictive power, AIC creates a balance of type I and type II errors, and in this sense is desirable in markerassisted breeding QTL prediction.
MODEL AND GENETIC ALGORITHM
Let us consider an F_{2} population from two completely homozygous parents P_{1} and P_{2}. Let the number of the QTL be M. The phenotypic value Y of one F_{2} individual is expressed as
Given a set of the locations of QTL, we can evaluate the plausibility of the set using likelihood. Therefore, QTL mapping is a case of combinatorial optimization, which must search for the combination of locations on a linkage map, which maximizes the likelihood.
The GA is a numerical optimization procedure, based on an analogy to organic evolution. Candidates for the optimum solution are coded as the “genotypes” of virtual creatures (GA individuals), and candidates have a probability of survival described by the “fitness.” First, a population of GA individuals with random genotypes is generated. Then, “mating” among the GA individuals, “mutation,” and “selection” by the fitness are repeated to improve fitness until the fitness in many succeeding generations changes by less than a threshold value. At this time, the genotype of the best GA individual is adopted as the optimum solution. “Mating” and “mutation” increase the diversity of the GA population, while “selection” reduces it. For GA to reliably find a good global optimum, the factors in quotation marks need to be appropriately matched to the problem.
GA genotype: In our GA, the genotype is the number and location of QTL. Figure 1 shows the GA genotype. M denotes the number of QTL. The term (C_{x}, L_{x}) denotes the location of the xth QTL, where C_{x} is the chromosome on which the xth QTL exists, and L_{x} is the location measured in centimorgans on the chromosome C_{x}.
Selection function and alternating generations: The tournament method is adopted as the “selection” of parent GA individuals that breed offspring GA individuals. Figure 2 gives an example where some constant number N_{t} (called the tournament size) of the GA individuals is extracted (tournament 1) from the GA population of the current generation. Next, the GA individual with the highest fitness (here, the lowest AIC) among tournament 1 is adopted as a parent GA individual (parent 1). This parent selection is repeated again without replacement (so we get parent 2 from tournament 2), and the genotype of the offspring GA individual is created from the genotypes of the selected two parents through the “crossover” and mutation. The breeding procedures are repeated without replacement until the number of the GA individuals in the next generation becomes the same as that of the current generation (dotted line in Figure 2). This alternation of GA generations continues until the GA individual with the highest fitness appears in successive generations, indicating convergence (bold solid line in Figure 2).
GA crossover: GA crossover creates an offspring GA individual as shown in Figure 3. There are two parent GA individuals selected through the tournament method. Parent 1 has two QTL located at (C_{a}, L_{a}) and (C_{a}, L_{b}). Parent 2 has four QTL located at (C_{a}, L_{a}), (C_{a}, L_{c}), (C_{b}, L_{d}), and (C_{b}, L_{f}). The number of QTL for the offspring is selected randomly from the range determined by the numbers of QTL of the parents. The upper limit of the range is the larger QTL number of the parents, and the lower limit is the smaller one. In the example in Figure 3, the upper limit is four (QTL number of parent 2), and the lower limit is two (QTL number of parent 1). Then, the offspring’s QTL number is selected randomly from two, three, or four, and three is selected in this example. There are six QTL locations among the genotypes of the parents, but the location (C_{a}, L_{a}) is duplicated, so there are five available locations, namely, (C_{a}, L_{a}), (C_{a}, L_{b}), (C_{a}, L_{c}), (C_{b}, L_{d}), and (C_{b}, L_{f}). Three locations are selected randomly from these five locations. In this example, (C_{a}, L_{b}), (C_{a}, L_{c}), and (C_{b}, L_{f}) are selected.
GA mutation of drastic change: The offspring that the GA individual created through the crossover may be changed further by a random mutation event. GA mutation has three patterns (Figure 4), which mimic real genomic evolution: insertion, deletion, and substitution (relocation). When a mutation occurs, one of these is randomly selected and applied to the offspring GA individual. Insertion adds a random location to the list of QTL locations in the GA genotype, deletion randomly selects a QTL from the GA genotype and deletes it, and substitution randomly selects a QTL from the GA genotype and moves it to a random location on a random chromosome.
GA mutation of slight modification: A serious problem of GA is premature convergence. In the alternation of the GA generation, a GA individual with higher fitness breeds more offspring than the GA individuals with lower fitness. This procedure decreases genotypic diversity of the GA population. If the GA selection drives diversity too low, the GA easily falls into local optima. In our GA system, diversity means the number of the candidates of the QTL location contained in the GA population. Since the GA mutations mentioned above change GA genotypes drastically, most of them may be deleted from the population. To help maintain the diversity of the GA population, we introduce very slight mutation. This additional mutation randomly selects a QTL from a GA individual and moves it a little from its present location. When an offspring GA individual is created through crossover and drastic mutation, if GA individuals of identical genotype already exist in the newly created population, the slight shift mutation occurs with the probability proportional to the number of identical GA individuals.
GAcontrolling parameters: In our GA, five parameters control the performance: size of GA population, tournament size, probabilities of drastic and slight mutations, and shift distance for the slight mutation. GA population size affects the speed of the search and the diversity. Small population size takes less time than a large one, but a large population contains more candidates of QTL locations in the GA population, that is, it allows high diversity, while in a small population, diversity can quickly decrease. Tournament size affects the selection pressure. Large tournament size increases the chance to select a GA individual with high fitness at the expense of diversity. Mutation affects the rate at which a promising new QTL location enters into the GA population. A high mutation rate and large shift distance quickly generate many QTL locations not contained in the earlier GA populations, but can result in a very slow convergence. Despite the many studies that have attempted to balance these conflicting requirements (e.g., Schraudolpf and Belew 1992; Myers and Hancock 1997; Streifelet al. 1999), there is no general criterion for setting up these parameters. Accordingly, GA parameters are typically determined empirically. We repeated many simulations, changing the GA controlling parameters, and then selected parameter values that showed good QTL detection. We settled on the number of GA individuals being 100, the tournament size was two, the rate of the drastic GA mutation was 0.05, and the rate of slight mutation was set to be 0.25 of the number of identical GA individuals to a newly created GA individual. The slight mutation moves the selected QTL 1 cM. To maximize the speed and certainty of convergence to the global optimal point, further study of GAcontrolling parameters is desirable.
Fitness of GA: Fitness is measured by the LOD score, which expresses the goodness of fit of the model to the data. However, the LOD score tends to increase with the number of QTL, simply because of an additional degree of freedom. To avoid the falsepositive detection of QTL, we need a penalty on the number of QTL. Since GA repeats the evaluation of fitness so many times, we need a criterion that is quickly calculated. Further, in the actual application of QTL analysis to the field of breeding such as markerassisted selection, it is often of most concern to predict the distribution of quantitative traits. Accordingly, we adopt AIC as the fitness criterion. AIC is very simple and was developed to select the model with the highest expected predictive power (see Akaike 1974; Sakamotoet al. 1983). The fitting of a model with a parameter vector θ to the data X is expressed as the maximum loglikelihood
Hence,
AIC is more liberal than likelihoodratio hypothesis testing procedures, because it may allow the model to incorporate “nonsignificant” QTL, as long as they contribute to increasing the predictive power. Like the hypothesistesting procedure, the procedure based on AIC has some probability of making type I errors. However, the predictive power and the accuracy of the estimated genetic effects are maximized, and, in this sense, an optimal balance of type I and type II errors is achieved. In practical situations, such as markerassisted selection, it may be advantageous to examine the best AIC model and exclude QTL with negligible genetic effects.
Suppose we have a sample of size N from the F_{2} population. Each of the M QTL has three possible genotypes, Q_{m}Q_{m}, Q_{m}q_{m}, and q_{m}q_{m}, indexed as k_{m} = 1, 2, and 3 (m = 1, 2,... M). The genetic effect g_{m}_{,}_{km} of the mth QTL is a_{m}, d_{m}, and a_{m} for k_{m} = 1, 2, and 3. The loglikelihood function is
When each interval between adjacent markers has at most one QTL, P_{j}_{,}_{k1}_{,...}_{kM} is the product of the conditional probabilities of the QTL genotypes:
Kao and Zeng (1997) and Kao et al. (1999) described an EM algorithm to estimate the genetic effects of multiple QTL. With the explicit formula of conditional probabilities of genotypes of up to two QTL in a marker interval, their procedure can detect multiple QTL successfully. Instead of extending their formula to a general case, we adopt the following twostep moment method for simplicity. First, we obtain the crude estimates of the genetic effects ã_{m}, d˜_{m}, the constant
SIMULATION EXPERIMENTS
We examined the performance of our GA through a set of numerical experiments using simulated F_{2} populations. We used Haldane’s model to generate the marker genotype data assuming no interference among markers. The variance of the residual was set at 1.0, and the constant value was set at 0. That is, phenotypic value data were generated following a normal distribution with variance 1.0 and mean determined by the summation of the genetic effects of QTL. The sample size of the F_{2} population was set at 500, each F_{2} individual had one or three pairs of chromosomes of 100cM length each, and 11 markers were located every 10 cM on each chromosome.
In these experiments, the number of GA individuals was 100, and the tournament size was two. The rate of the ordinary GA mutation was 0.05, and the rate of the additional one was set to be 0.25 of the number of GA individuals with the same GA genotype as the one newly created. These GAcontrolling parameters were exactly those found to work well empirically.
Behavior of the GA: In this experiment, three QTL were located on the same chromosome at positions of 17, 43, and 85 cM from the end of a chromosome. Each QTL had an additive effect of 1.0 and a dominance effect of 0 (broadsense heritability is 0.6). Figure 5 and Table 2 show that the GA converged by the 20th GA generation. The estimates of three QTL had locations of 16, 84, and 43 cM, and additive effects of 1.424, 1.358, and 1.084, all close to the true values. The dominance effects were correctly estimated to be negligible. The residual variance was also reliably estimated (0.977 vs. 1.0). Repeating the same experiment 100 times (Figure 6), we detected three true QTL in 96 out of 100 simulations, and four QTL, including one false positive, in the remaining cases.
Effects of QTL: Next, we altered the pattern of QTL, so seven QTL were located on three chromosomes at 17, 43, and 72 cM along the first chromosome, 24 and 85 cM along the second chromosome, and 62 and 91 cM along the third chromosome. Each QTL had the same genetic effect. Simultaneously, additive effects were varied from 0.2 to 1.0 in steps of 0.2, and the dominance effects were fixed at 0.
The experiments were repeated 100 times. Table 3 shows the mean and the standard deviation of the estimates when a QTL was detected at the relevant region. It also shows the empirical probabilities of the detection. For example, when the additive effect of each QTL was 0.2, the first QTL was detected 71 times out of 100 trials, and the second, 57 times. With increasing genetic effects, it becomes easier to correctly detect QTL. When the additive effects are 1.0, the average probability of detection is 0.964, but when the additive effects are 0.2, the average probability is 0.624. Falsepositive QTL, where QTL have negligible genetic effects or are located outside the true relevant regions, occurred with an average frequency of 0.32 with small effects, dropping to 0.00 with genetic effects of 1.0. This question of whether anything can be done about AIC being too liberal at lower genetic effects requires further study. The tradeoff will likely be decreasing type I errors but more rapidly rising type II errors (i.e., failure to include the true QTL).
Closely linked QTL: We examined the performance of our GA in the situations of closely linked QTL (Table 4). In this situation, two QTL were located on the same chromosome at 43 and 47 cM. As the markers are located every 10 cM, no marker exists between the two QTL. These two QTL had reverse genetic effects on each other. The additive and dominance effects of the QTL at 43 cM were changed from 0.4 and 0.2 to 2.0 and 1.0, and those of the QTL at 47 cM were changed from 0.4 and 0.2 to 2.0 and 1.0.
The simulations were repeated 100 times. Table 4 shows the results. For example, when the additive and dominance effects of the QTL are 1.6 and 0.8 for the QTL at 43 cM and 1.6 and 0.8 for the QTL at 47 cM, two QTL with reverse genetic effects were detected 87 times out of 100 trials in the marker interval 4050 cM, one QTL was detected 10 times, and no QTL was detected 3 times. No cases of three QTL being detected were observed.
Here, the estimates of the genetic effects are smaller than their true values. This bias of the estimated genetic effects comes from the high correlation between the two QTL, even conditional on the marker information. However, unlike multicolinearity in regression analysis, the estimates are conservative rather than liberal, preventing practitioners from being too optimistic.
Comparison with other methods: We compared the precision of our GA method with those of SIM, CIM, and Bayesian estimation. Software packages used in these experiments were QTL Cartographer (Bastenet al. 1998) for simple and composite interval mapping and Multimapper (Sillanpää 1998) for Bayesian estimation. For SIM and CIM, the critical value of the LOD score for claiming QTL detection with a 5% probability of a type I error was based on 1000 replicates of the simulation under the null hypothesis of no QTL. Three separate experiments were repeated 100 times, and the results are summarized in Tables 5 and 6.
Conditions for generating the data were as above. In the first experiment (Table 5), three QTL were located on the same chromosome at 17, 43, and 85 cM from the end of a chromosome. Each QTL had an additive effect of 1.0 and a dominance effect of 0 (broadsense heritability is 0.6). CIM and GA always detected the three QTL correctly, but SIM could not detect the three QTL separately. Bayesian analysis sometimes detected three QTL, but in other cases it detected only two. The probability of success for the Bayesian analysis was 81/100 for the QTL at 17 cM, 73/100 for the QTL at 43 cM, and 95/100 for the QTL at 85 cM.
In the second experiment (Table 6), two QTL were located on the same chromosome at 43 and 47 cM. As the markers are located every 10 cM, no marker exists between the two QTL. The QTL at 43 cM had additive and dominance effects of 2.0 and 1.0, respectively, whereas the QTL at 47 cM had the opposite effects, 2.0 and 1.0 (broadsense heritability is ∼0.82). SIM, CIM, and the Bayesian method all detected only one QTL with very little genetic effect because of the two opposite QTL effects. However, our GA detected the two QTL in 98 cases out of 100 trials.
DISCUSSION
Numerical simulations clearly showed the high level of detection of our GA. Composite interval mapping is well known as an improved interval mapping. However, if there are two QTL in the same marker interval, even composite interval mapping cannot detect QTL separately, as in the case of simple interval mapping. Bayesian analysis can detect QTL separately in such a case, but will not work well if there are many QTL or if the amount of information decreases due to QTL having opposite net effects. Moreover, Bayesian analysis is time intensive, because it uses the MCMC method. We performed our experiments on a DEC Alpha 21164A 500MHz workstation, and for the threeQTL experiment (Table 5), Bayesian analysis took 150200 min, while our GA took only 13 min.
Since the GA repeats the evaluation of the fitness so many times, a criterion that is easily calculated is required. And as mentioned above, it is often of primary importance to predict the distribution of quantitative traits in the field of breeding such as markerassisted selection. AIC was developed to select the model with the highest predictive power (Akaike 1974; Sakamotoet al. 1983) and is very simple. The GA detects QTL with higher probability than the other methods. That is, it shows a significant improvement in type II error rates, while type I errors are not overly apparent. For the purpose of markerassisted breeding, this is a major advantage. However, it does not necessarily mean that our GA has better performance than all the others, if type I error rate is the sole focus of attention (and in fact it should not be, since these methods explicitly aim to control type I errors). AIC is a relatively liberal criterion, while software packages of interval mapping and Bayesian analysis use even more conservative criteria. Accordingly, our GA picked up more falsepositive QTL than the other methods. The question of how to trade off type I (false detections) and type II (failures to detect) is an important topic that is yet to be studied.
In this study, we adopted a moment method to estimate the genetic effects. On the other hand, Kao and Zeng (1997) and Kao et al. (1999) proposed the MIM using a general EM algorithm for multiple QTL. MIM directly implements simultaneous estimation of genetic effects. Since the moment method is less statistically efficient than ML, MIM may give more precise MLEs than our method. On the other hand, a merit of our moment method is simplicity. Our method is easier to implement than MIM. To check the difference between true MLEs and the results of our method, we applied the NewtonRaphson procedure to the results of our GA and searched for genetic parameters that maximize the fitness function. The differences found were negligible, but we should leave this as an open question for more extensive comparisons.
Dense marker linkage maps presently exist only for some major plants and animals. Worse, genes controlling the same trait are often clustered (Meyerset al. 1998; Simonset al. 1998). Even though rice is fairly densely sampled, two important genes in one interval were regarded as a single gene using standard QTL mapping (Wanget al. 1999). This suggests that multiple QTL may well exist within a given marker interval. If there are multiple QTL in one marker interval, the conditional probabilities of QTL genotypes given marker genotypes are not described by Table 1 and Equation 5. For example, let two QTL (QTL_{1} and QTL_{2}) exist between two markers (marker A and marker B) in the following order: marker A, QTL_{1}, QTL_{2}, and marker B. The conditional probability of QTL genotypes given marker genotypes is
Evaluation of the accuracy of the estimation is an important problem. For instance, in the Bayesian model, the accuracy is easy to evaluate, because the result of the estimation is described by a posterior distribution. With our GA, however, only one GA individual with the best fitness is adopted as the optimum solution. A possible solution is to calculate Fisher’s information matrix (FIM; also called the Hessian matrix), which can be used as a rough evaluation of the precision. The variance of the estimates is approximated by the inverse of the FIM. We tried to evaluate the accuracy of our GA with FIM in the experiment shown in Figure 5 and Table 2. Variances of the estimates of QTL locations were ∼0.5 (1 SD = 0.7 cM), and they are well in accordance with the differences between correct locations and estimated locations. However, estimated variances of genetic parameters are too small (∼10^{4}10^{5}), so this measure of accuracy may be too optimistic. Further, our aims are to develop a method to evaluate the accuracy of the estimation using information from lowerranked GA individuals.
The GA method allows the user to construct models for complicated problems and gain the optimum solution within a practical time. One major goal is to extend our GA to many complex cases, such as epistasis and outbred lines, and to develop a widely applicable QTL mapping method.
Acknowledgments
We thank the communicating editor Dr. ZhaoBang Zeng and two anonymous reviewers for their many helpful comments. Dr. Hiroyuki Hirano and Dr. Hiroyoshi Iwata provided useful information in running the simulations. We thank Stephane ArisBrosou, Douglas M. Robinson, and Peter J. Waddell for their assistance in writing this article. This work is supported by grants 09788, 12554037, and BSAR497 from the Japan Society for the Promotion of Science. Software of our GA use, written in the C++ language, is available from Reiichiro Nakamichi (http://peach.ab.a.utokyo.ac.jp/~naka).
Footnotes

Communicating editor: ZB. Zeng
 Received August 17, 2000.
 Accepted February 5, 2001.
 Copyright © 2001 by the Genetics Society of America