Abstract
This article is devoted to the problem of ordering in linkage groups with many dozens or even hundreds of markers. The ordering problem belongs to the field of discrete optimization on a set of all possible orders, amounting to n!/2 for n loci; hence it is considered an NP-hard problem. Several authors have attempted to employ methods developed for the well-known traveling salesman problem (TSP) for multilocus ordering, using the assumption that, for a set of linked loci, the true order will be the one that minimizes the total length of the linkage group. A novel, fast, and reliable algorithm developed for the TSP and based on evolution-strategy discrete optimization was applied in this study to multilocus ordering on the basis of pairwise recombination frequencies. The quality of derived maps under various complications (dominant vs. codominant markers, marker misclassification, negative and positive interference, and missing data) was analyzed using simulated data with ∼50-400 markers. The high performance of the employed algorithm allows systematic treatment of the problem of verifying the obtained multilocus orders on the basis of computing-intensive bootstrap and/or jackknife approaches for detecting and removing questionable marker scores, thereby stabilizing the resulting maps. Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm. Real-data analysis (on maize chromosome 1 with 230 markers) is provided to illustrate the proposed methodology.
AN important step in generating multilocus genetic maps using the results of linkage analysis is the determination of the true marker order. One possibility for addressing this problem is to recover the linear marker order from the known pairwise marker distance matrix d_{ij}. A primary difficulty in ordering genetic loci using linkage analysis is the large number of possible orders: for n loci on a chromosome, n!/2 distinct orders should be evaluated. In real problems, n might vary from dozens to 200-500 markers and more (e.g., www.maizemap.org/ibm_frameworkmaps.htm; see also Ott 1991). Clearly, even for n ∼ 30, it would not be feasible to evaluate all n!/2 possible orders using two-point linkage data; an exhaustive solution on a Pentium IV (1500 MHz) computer takes ∼1 hr even for a case as modest as n = 10. This is why multilocus ordering is considered an NP-hard combinatorial problem (Wilson 1988; Olson and Boehnke 1990; Falk 1992; Ellis 1997).
Several methods have been proposed for determination of marker order (Lathrop et al. 1985; Lander and Green 1987; Knapp et al. 1995; Newell et al. 1995; Liu 1998) and implemented in software packages such as LINKAGE (Lathrop et al. 1984), MapMaker (Lander et al. 1987), FastMap (Curtis and Gurling 1993), and JoinMap (Stam 1993). Historically, the main approach to ordering markers within linkage groups was based on multipoint maximum-likelihood analysis. Several effective algorithms have been proposed using various optimization tools, including the branch-and-bound method (Lathrop et al. 1985), simulated annealing (Thompson 1984; Weeks and Lange 1987; Stam 1993; Jansen et al. 2001), and seriation (Buetow and Chakravarti 1987).
Olson and Boehnke (1990) compared eight different methods for marker ordering. In addition to multilocus likelihood, they also considered simpler criteria for preliminary multipoint marker ordering in large-scale problems based on two-point linkage data (minimizing the sum of adjacent recombination rates or adjacent genetic distances). The simple criteria are founded on the biologically reasonable assumption that the true order of a set of linked loci will be the one that minimizes the total map length of the chromosome segment. Simple methods work quickly, but their accuracy may depend on the number of markers, the distribution of recombination frequencies (presence of large gaps), the percentage of missing data, the type of optimization criterion employed, noise caused by misclassification, and genetic interference. That is why there is a tendency to combine two-point analysis with more general multipoint methods. However, even for simple methods based on pairwise analysis, there is a pressing need for efficient algorithms enabling high-quality "preliminary" multipoint ordering. Keeping in mind the large number of markers employed in mapping projects of different organisms (humans, experimental model organisms, and agricultural plants and animals), such algorithms should cope with many dozens and even hundreds of markers (e.g., 100-1000) per chromosome within a reasonable execution time.
We present in this article a new, highly efficient algorithm for multilocus ordering based on two-locus linkage data that employs the evolutionary optimization strategy (ES). ES is a heuristic algorithm mimicking natural population processes. The numerical procedures in such optimization are based on simulation of mutation and reproduction, followed by selection of the fittest "genotypes," representing the obtained values of the optimization criterion. Together with genetic algorithms (Holland 1975) and evolutionary programming (Fogel 1992), evolution strategies form the class of evolutionary algorithms (Nissen 1994). Evolution strategies were proposed in the 1970s (Rechenberg 1973; Schwefel 1977, 1987) to solve optimization problems with real-value variables. A recent survey of search strategies for combinatorial problems was provided by Muhlenbein et al. (1998). ES for optimization problems is implemented as a random search by asexual reproduction, which uses mutation-derived variation and selection. The mutational change of the current vector of parameters can be introduced by adding a vector of normally distributed variables with zero means. The level of change can be adapted via the variances of these disturbances.
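For illustration, the real-value scheme just described can be sketched as a minimal (1 + 1)-ES; the sphere objective, fixed step size, and iteration budget below are our own illustrative choices, not taken from the article:

```python
import random

def es_1plus1(f, x, sigma=0.5, iters=2000, seed=0):
    """Minimal (1 + 1) evolution strategy: mutate the parent by adding
    zero-mean normally distributed disturbances; the offspring replaces
    the parent only if it is fitter (lower objective value)."""
    rng = random.Random(seed)
    fx = f(x)
    for _ in range(iters):
        # mutation: parent vector plus Gaussian noise with variance sigma^2
        y = [xi + rng.gauss(0.0, sigma) for xi in x]
        fy = f(y)
        if fy < fx:  # selection: keep the better of parent and offspring
            x, fx = y, fy
    return x, fx

# illustrative objective: sphere function, minimum at the origin
sphere = lambda v: sum(vi * vi for vi in v)
best, val = es_1plus1(sphere, [3.0, -2.0, 1.0])
```

Adapting sigma online (e.g., Rechenberg's 1/5 success rule) would speed convergence; the fixed step size here keeps the sketch minimal.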
In contrast to ES, genetic algorithms, introduced by Holland (1975), simulate sexual reproduction, which is characterized by recombination of two parental strings to build the offspring generation. Clearly, the contributions of mutation and recombination as sources of variation in the search strategy are different: mutation is based on chance only, and the success of a single mutation is largely unpredictable. Crossover can be viewed as a history-preserving operation, which at the same time introduces a new structure to be tested in competition. Homberger and Gehring (1999), Mester (1999, 2000), and D. Mester (unpublished results) adopted the ES algorithm to solve the vehicle routing problem with time-window restrictions, which is similar, to some extent, to multipoint analysis of markers belonging to several chromosomes (linkage groups). In this article, we applied the ES algorithm to multipoint marker ordering, using the similarity between this problem and the well-known traveling salesman problem (TSP; Press et al. 1986; Weeks and Lange 1987; Falk 1992; Schiex and Gaspin 1997).
EVOLUTION STRATEGIES AND THE HEURISTICS IN THE DEVELOPED ALGORITHM
The employed procedure as a simulated analog of evolutionary processes: Usually, the optimization process of an objective function f(x) with n real-value variables x = (x_{1}, x_{2},..., x_{n}) can be represented as an evolution of the solution vector x ∈ R^{n}. The main elements of ES algorithms and their correspondence with the elements and processes of an "evolving population" are presented in Table 1.
The common ES algorithm steps: Evolution strategies define the size of a population and the rules of the selection process. Various approaches have been proposed for choosing the population size in ES, including the (1 + 1) strategy (Rechenberg 1973) and the (μ, λ) strategy (Schwefel 1977). With the (1 + 1) strategy, the population consists of a single individual, from which offspring are obtained via the mutation operation. If a new individual is better than the "parent," it replaces the parent. The (μ, λ) strategy works with a population of size λ. The selection operator chooses the μ best individuals to establish the new generation. Both versions, (1 + 1) and (μ, λ), employ the following steps:

1. Create λ individuals (x^{k}) of the initial population P^{0}.
2. Compute the fitness f(x^{k}), k = 1,..., λ.
3. If the optimization process is terminated, then stop.
4. Select the μ ≤ λ best individuals.
5. Create λ/μ offspring x^{k+1} of each of the μ individuals by small variation.
6. Return to step 2.
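The six steps above can be sketched generically as follows (a schematic (μ, λ)-ES; the toy permutation objective and swap mutation at the bottom are our own illustrative choices, not the article's mutator):

```python
import random

def es_mu_lambda(init, fitness, mutate, mu=5, lam=20, generations=100, seed=1):
    """Schematic (mu, lambda) evolution strategy following steps 1-6:
    create lambda individuals, evaluate their fitness, select the mu
    best, and produce lambda/mu offspring of each by small variation."""
    rng = random.Random(seed)
    population = [init(rng) for _ in range(lam)]            # step 1
    best = None
    for _ in range(generations):                            # step 3: fixed budget
        scored = sorted(population, key=fitness)            # step 2
        if best is None or fitness(scored[0]) < fitness(best):
            best = scored[0]                                # remember best-so-far
        parents = scored[:mu]                               # step 4
        population = [mutate(p, rng)                        # step 5
                      for p in parents
                      for _ in range(lam // mu)]            # step 6: loop again
    return best

# toy combinatorial demo: drive a permutation toward the identity order
target_cost = lambda perm: sum(abs(v - i) for i, v in enumerate(perm))

def random_perm(rng, n=8):
    p = list(range(n)); rng.shuffle(p); return p

def swap_mutate(perm, rng):
    p = perm[:]
    i, j = rng.randrange(len(p)), rng.randrange(len(p))
    p[i], p[j] = p[j], p[i]
    return p

best = es_mu_lambda(random_perm, target_cost, swap_mutate)
```

Note that (μ, λ) selection itself is non-elitist; the sketch tracks the best-so-far solution separately so that it is never lost between generations.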
Peculiarities of the combinatorial version of ES: Clearly, the multilocus ordering problem cannot be directly represented in terms of ES with real-value formulation. Combinatorial versions of ES differ from the real-value formulation by the specific representation of the solution vector x and the mutation mechanisms (Homberger and Gehring 1999). In combinatorial formulation, the solution (an "individual") can be represented as a vector x = (x_{1}, x_{2},..., x_{n}) that consists of n ranked discrete coordinates (chromosomes) or as a directed graph G(A, B) with a set of nodes A = {a_{1}, a_{2},..., a_{n}} and a set of arcs B = A × A, where node a_{j}, j > 0, represents the chromosome. The fitness function assigns to each of the n(n - 1)/2 arcs (a_{i}, a_{j}) [or pairs of coordinates (x_{i}, x_{j})] a nonnegative cost d_{ij} of moving from element i to element j. The problem is symmetric if and only if d_{ij} = d_{ji} for all arcs. To optimize a combinatorial problem, one needs to find the order of the vector coordinates (or nodes) that provides the minimum total cost.
The mutation operator (referred to hereafter as the mutator) changes the vector x^{k}, thereby producing a new solution vector x^{k+1}. For this goal, one can use the move-generation and solution-generation mechanisms (Osman 1995; Homberger and Gehring 1999) or the remove-insert mechanism (Mester 1999). Our version of the combinatorial ES algorithm employs a multiparametric mutator (MPM), which changes the solution vector by removing and inserting β coordinates of x^{k} (Mester 1999; D. Mester, unpublished results). The common heuristic remove defines a random proportion β = (0.1 + 0.5r)n of rejected coordinates in the solution vector, where n is the number of coordinates in the solution and r is a random value (e.g., evenly) distributed between 0 and 1. The heuristic also defines a set of removing rules R (to take out specific parts of x^{k} or the full vector). At this mutation stage, the solution vector x^{k} is divided into two subvectors: x^{k}_{remainder} and x^{k}_{reject}. Another common heuristic, insert, defines a set of rules I for inserting, successively one by one, all x_{i} ∈ x^{k}_{reject} into x^{k}_{remainder}. This is the construction phase of the mutator, which builds new solution vectors x^{k+1} using the variation of the problem-specified criterion (Mole and Jameson 1976; Or 1976; Osman 1993; Mester 1999).
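The remove-insert mechanism can be sketched as follows (a minimal version with random removal and greedy cheapest reinsertion; the rule sets R and I of the actual MPM are richer, and the function name is ours):

```python
import random

def remove_insert_mutate(order, dist, rng):
    """Remove a random proportion beta = (0.1 + 0.5 r) * n of the
    coordinates of the current order, then reinsert them one by one
    at the position that minimizes the total adjacent-pair cost."""
    n = len(order)
    beta = int((0.1 + 0.5 * rng.random()) * n)
    rejected = rng.sample(order, max(1, beta))      # x^k_reject
    remainder = [x for x in order if x not in rejected]  # x^k_remainder
    for x in rejected:                              # insert phase
        best_pos, best_cost = 0, float("inf")
        for pos in range(len(remainder) + 1):
            trial = remainder[:pos] + [x] + remainder[pos:]
            cost = sum(dist[a][b] for a, b in zip(trial, trial[1:]))
            if cost < best_cost:
                best_pos, best_cost = pos, cost
        remainder.insert(best_pos, x)
    return remainder
```

Because every rejected coordinate is reinserted exactly once, the offspring is always a permutation of the parent, as a mutator for an ordering problem must guarantee.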
At the mutation stage, the mutator M(R, I, β, x^{k}) produces an offspring x^{k+1} from the parent x^{k}. If the first offspring surpasses the parent, the mutator with the same parameters is applied again to the new parent, and so on. If the offspring does not surpass the parent, then, to generate the new offspring, the algorithm uses the mutator with other parameters. After mutation, the vector x^{k+1} "is improved" by standard combinatorial procedures of order O(n^{2}): (1) 2-Opt (Lin and Kernighan 1973), (2) Or-Opt (Or 1976), and (3) 1-interchange (Osman 1993).
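For illustration, the 2-Opt improvement step can be sketched as follows (a textbook O(n^2) pass over an open path; not the authors' exact implementation):

```python
def two_opt(order, dist):
    """Repeatedly reverse sub-segments of the path while this reduces
    the total length (sum of distances between adjacent elements)."""
    def length(p):
        return sum(dist[a][b] for a, b in zip(p, p[1:]))

    best = order[:]
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 2, len(best) + 1):
                # candidate move: reverse the segment best[i:j]
                trial = best[:i] + best[i:j][::-1] + best[j:]
                if length(trial) < length(best):
                    best, improved = trial, True
    return best
```

The loop terminates because each accepted move strictly shortens the path; the result is a 2-Opt local optimum, which the ES mutation phase then perturbs again.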
This two-phase approach (mutation-improving) reflects the principles of solution diversification and upgrading (Rochat and Taillard 1995). We combine the last three improving procedures into one composite procedure (Composite). At the initial solution phase, Composite is applied five times. We refer to such an algorithm (multiple application of the Composite procedure starting from random initial points) as the MultiStart procedure. In Table 2 we compare the solutions of a standard TSP obtained by four different powerful heuristics: guided local search (GLS), simulated annealing (SA), tabu search (TS; for a comparison of these three algorithms, see Voudoris and Tsang 1999), and the ES-MPM algorithm proposed by Mester (1999, 2000, and unpublished results). In addition, we present for comparison three simple heuristics: the 3-Opt of Lin and Kernighan (1973), the Composite, and the MultiStart (Table 2). ES-MPM is a two-phase algorithm that first produces an initial solution using the simple MultiStart procedure and then moves to a more powerful, albeit slower, ES-search (ES-phase). The presented benchmark clearly demonstrates that the ES-MPM algorithm provides quality solutions and is faster than the other adaptive algorithms (GLS, SA, and TS).
Multipoint marker ordering as a TSP problem: The proposed algorithm of multipoint ordering employs two-point linkage data (see also Press et al. 1986; Weeks and Lange 1987; Falk 1992; Schiex and Gaspin 1997). Although this approach is usually considered "preliminary ordering," the good quality of the maps produced by our version of the ES algorithm (see below) allows us to consider it not only as a complement to the more sophisticated multilocus maximum-likelihood (ML) ordering, but also, to some extent, as a competitor to ML algorithms (especially for a large number of marker loci and various complications like missing data, misclassification, etc.). We consider n markers enumerated arbitrarily by n coordinates x_{i} ∈ x and, for each marker pair (x_{i}, x_{j}), a "distance" ρ_{ij}. As ρ_{ij}, either pairwise recombination fractions r_{ij} or map distances d_{ij} (e.g., in Haldane or Kosambi metrics) are employed.
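The Haldane and Kosambi map functions mentioned here are standard; the conversion from a recombination fraction r (0 ≤ r < 0.5) to a map distance d in morgans can be written as:

```python
import math

def haldane_d(r):
    """Haldane map distance (morgans) from recombination fraction r:
    d = -(1/2) ln(1 - 2r), assuming no crossover interference."""
    return -0.5 * math.log(1.0 - 2.0 * r)

def kosambi_d(r):
    """Kosambi map distance (morgans) from recombination fraction r:
    d = (1/4) ln[(1 + 2r) / (1 - 2r)], allowing moderate interference."""
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))
```

Both functions reduce to d ≈ r for small r, which is why either metric can serve as the pairwise "distance" ρ_{ij} in the ordering criterion.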
Different criteria can be used to discriminate between competitive orders, for example, the total distance measured as a sum of distances between consecutive adjacent markers or the total number of recombination events. These criteria are founded on the biologically reasonable assumption that the true order of a set of linked loci will be the one that minimizes the total length of the chromosomal map (Press et al. 1986; Weeks and Lange 1987; Falk 1992; Schiex and Gaspin 1997). In our model, the minimum of the sum of distances between adjacent markers was applied as the optimization criterion (OC), i.e., OC = Σ_{i=1}^{n-1} ρ_{x_i, x_{i+1}} → min over all candidate orders (Equation 1).
The program for the simulations was written in Visual Basic 6.0. Monte Carlo testing experiments were conducted on a double-processor Pentium III (800 MHz). To compare different situations, a coefficient of restoration quality K_{r} [proximity between the "true" (simulated) and estimated orders] was employed (Equation 2); K_{r} equals unity when the estimated order coincides with the true one.
SIMULATED DATA SETS
The data for analysis were produced using a pseudorandom generator. The simulation algorithm repeatedly generated a single-chromosome F_{2} mapping population for a chosen number of markers with:

1. variation of recombination rates between adjacent markers along the chromosome;
2. a defined proportion of dominant vs. codominant markers;
3. a chosen proportion of missing data;
4. a chosen proportion of markers with erroneous classification and a chosen level of errors; and
5. a chosen mode of recombination interference for adjacent markers: Haldane, Kosambi, or arbitrary interference. In the last case, we define a few ranges of coincidence values and the probabilities of sampling the coincidence values from these ranges (with even distribution of the coincidence values within the chosen range).
The following are the numerical values (ranges) of the main parameters in the majority of experiments:

1. The number of markers per chromosome: m = 80.
2. Probability distributions for distances between adjacent markers: P(3 < d (cM) < 5) = 0.8, P(5 < d (cM) < 10) = 0.15, P(10 < d (cM) < 20) = 0.05, with even distribution within each of the three ranges.
3. Proportions of codominant and dominant markers (in coupling phase, unless noted otherwise): 0.5 and 0.5, respectively.
4. Three levels of missing data: 0, 10, and 20%.
5. Two levels of the proportion of loci with classification errors, 0 and 40%; in the latter case, two levels of misclassification, 10 and 20%.
6. In the case of arbitrary interference, the distributions of coincidence coefficients: P(0 < c < 1) = 0.6 (positive interference), P(1 < c < 2) = 0.2 (slight-to-moderate negative interference), and P(2 < c < I_{c}) = 0.2 (moderate-to-strong negative interference), where I_{c} = 5, 10, 20, and 40.
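A simplified sketch of such a data generator follows; it covers only the interval-distance sampling, missing scores, and misclassification from the list above (dominance and interference are omitted for brevity, and all names and the {0, 1, 2} genotype coding are our own choices):

```python
import random

def simulate_intervals(n_markers, rng):
    """Sample distances (cM) between adjacent markers from the mixture
    P(3-5 cM) = 0.8, P(5-10 cM) = 0.15, P(10-20 cM) = 0.05,
    with an even distribution within the chosen range."""
    ranges = [((3, 5), 0.8), ((5, 10), 0.15), ((10, 20), 0.05)]
    dists = []
    for _ in range(n_markers - 1):
        (lo, hi), = rng.choices([rg for rg, _ in ranges],
                                weights=[w for _, w in ranges])
        dists.append(rng.uniform(lo, hi))
    return dists

def corrupt_scores(genotypes, miss=0.10, err_loci=0.40, err_rate=0.20, rng=None):
    """Apply missing data and misclassification to a genotype matrix
    (rows = individuals, columns = markers, scores coded 0/1/2):
    each score is lost with probability `miss`; a fixed subset of
    `err_loci` of the loci gets wrong scores at rate `err_rate`."""
    rng = rng or random.Random()
    n_mark = len(genotypes[0])
    noisy_cols = set(rng.sample(range(n_mark), int(err_loci * n_mark)))
    out = []
    for row in genotypes:
        new = []
        for j, g in enumerate(row):
            if rng.random() < miss:
                new.append(None)                      # missing score
            elif j in noisy_cols and rng.random() < err_rate:
                new.append(rng.choice([s for s in (0, 1, 2) if s != g]))
            else:
                new.append(g)
        out.append(new)
    return out
```

Separating the clean-genotype generation from the corruption step mirrors the design of the experiments above, where the same underlying map is degraded at several noise levels.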
Therefore, the efficiency of the preliminary multilocus ordering was considered under complications caused by negative interference, erroneous marker scoring, and incomplete mapping information due to dominant markers and missing data, all known to affect the quality of multipoint ordering. The motivation to consider such complications derives from the simple fact that in real mapping work no one can guarantee that the data are free of them. Moreover, in numerous previous attempts at building efficient multilocus ordering tools, some of these problems were usually ignored.
RESULTS
The considered types of disturbances (see the end of SIMULATED DATA SETS) proved to affect the quality of restoration of the true marker order. These disturbances mainly caused local distortions of the order, e.g., interchanges of two to three neighboring markers (referred to as "local disturbances"). There could be several inverted islands per linkage group. As expected, the number of these islands increases with the percentage of missing data, classification errors, and the level of negative interference. Less frequent were violations caused by excision of a large segment and its transposition to another place, with or without inversion within the segment ("global disturbances"). Clearly, such violations result in an appreciable reduction of the coefficient of restoration quality (Equation 2).
Dominance: When all dominant markers were in coupling phase, the proportion of dominant and codominant markers had no effect on the quality of marker ordering. For three proportions of dominant markers (50, 66, and 100%) with Kosambi, Haldane, and slight negative interference, nearly full recovery of marker order was reached (K_{r} ≈ 0.997-0.999). A different result was obtained with dominant markers in repulsion phase. It appears that the higher the proportion of repulsion-phase markers, the lower the quality of multilocus ordering (Mester et al. 2003). Employing the ES-phase of the ES-MPM algorithm (see above, Peculiarities of the combinatorial version of ES) after obtaining an initial solution through MultiStart positively affected the quality of the final solution. It is noteworthy that the application of the ES-phase also stabilizes the ordering results (as displayed by the reduction of σ_{K_r}, the standard deviation of K_{r} between the Monte Carlo experiments). High precision of ordering in the coupling-phase data and low precision in the repulsion-phase data justify splitting the data into two sets, each with coupling-phase markers only, and generating two complementary maps for each linkage group (Knapp et al. 1995; Peng et al. 2000; Mester et al. 2003). Clearly, the next step should be integration of the two maps. This last step may encounter difficulties caused by local and global map disturbances affecting codominant markers common to both maps, if the density of such codominant markers is relatively low (e.g., in cases when codominant markers serve as anchors). In fact, the availability of shared codominant markers enables mutual control during multilocus ordering, which, together with computing-intensive jackknife and bootstrap techniques (Efron 1979), significantly improves the quality of the resulting map (Mester et al. 2003).
Negative interference: As expected, negative interference complicates the ordering problem, as manifested in the reduction of K_{r} (Table 3). However, the decline in K_{r} with an increase in the maximum value I_{c} of the coefficient of coincidence c is unexpectedly slow, pointing to the robustness of the employed ordering procedure. A detailed anatomy of misordered situations shows that deviations from the true order in such cases are due mainly to interchanges of adjacent markers. The relatively low effect of negative interference can be explained by a stabilizing role of the neighboring intervals. This can be illustrated by the following example (Figure 2), in which the recombination rate between the flanking markers M_{1} and M_{3} is smaller than that for the subinterval M_{2}-M_{3} (see Figure 2a).
Without taking into account the information from neighboring intervals, the criterion "minimum of total distance between markers" will give the local order M_{2}-M_{1}-M_{3} (Figure 2b), which differs from the true one. The stabilizing effect of the neighbor M_{0} allows us to obtain the true order. Indeed, the optimization criterion value for the true order (Figure 2c) is OC = 0.048 + 0.024 + 0.059 = 0.131, whereas the order corresponding to the foregoing inversion of M_{1} and M_{2} (Figure 2d) caused by negative interference results in OC = 0.072 + 0.024 + 0.052 = 0.148. Therefore, despite high negative interference on the interval M_{1}-M_{2}-M_{3} (c = 15.8), which violates the rule that "the entire entity is supposed to be larger than its parts," the algorithm recovers the true order.
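The arithmetic in this example can be checked directly; a minimal sketch (markers M_{0}-M_{3} encoded as 0-3, with the pairwise recombination values taken from the Figure 2 example in the text):

```python
def oc(order, r):
    """Optimization criterion (Equation 1): sum of pairwise recombination
    fractions between consecutive markers in the given order."""
    return sum(r[(min(a, b), max(a, b))] for a, b in zip(order, order[1:]))

# pairwise recombination fractions from the Figure 2 example
r = {(0, 1): 0.048, (1, 2): 0.024, (2, 3): 0.059,
     (0, 2): 0.072, (1, 3): 0.052}

oc_true = oc([0, 1, 2, 3], r)  # true order M0-M1-M2-M3
oc_inv = oc([0, 2, 1, 3], r)   # M1/M2 inverted by negative interference
```

Minimizing OC over the four-marker neighborhood selects the true order (0.131 < 0.148), exactly as argued above.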
Misclassification: Errors in marker scoring inflate recombination distances and can also violate the principle that "the entire entity is supposed to be larger than its parts" by imitating negative interference. This is why some mapping packages allow for error filtration by selecting out double recombinants. Our simulations showed that in the majority of such local violations the true order could be recovered, owing to the stabilizing effect of the neighboring markers (Table 4).
In a typical example (Table 5) with a maximum level of noise (20% marker-scoring errors simulated for 40% of the marker loci), there were 14 pairs of adjacent intervals (out of 49 possible pairs) in which either r_{i,i+1} or r_{i+1,i+2} was larger than r_{i,i+2}, but in only 4 of these pairs could the true order not be recovered. We conclude from the obtained results that, despite the biases in pairwise estimates of recombination rates and the inflation of the map length, the employed ordering criteria are fairly robust to errors in marker scoring, unless the errors occur at a catastrophic level (say, at half of the loci and with a rate ≫20%).
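The screening for such suspect adjacent-interval pairs can be sketched as follows (function and variable names are ours; adj holds the n - 1 adjacent recombination estimates and flank the n - 2 flanking ones):

```python
def find_violating_islands(adj, flank):
    """Return the indices i for which r[i,i+1] or r[i+1,i+2] exceeds
    r[i,i+2], i.e., where 'the entire entity is larger than its parts'
    is violated.
    adj[i]   = recombination fraction between markers i and i+1
    flank[i] = recombination fraction between markers i and i+2"""
    return [i for i in range(len(flank))
            if adj[i] > flank[i] or adj[i + 1] > flank[i]]
```

Applied to the estimated recombination fractions of a linkage group, this yields exactly the islands (like the 14 pairs in Table 5) that deserve closer inspection for misclassification or negative interference.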
Missing data: The results presented in Table 6 for several levels of missing marker scores (m = 0, 10, and 20%) show the same tendencies as those found for the complicating factors considered above. Thus, participation of the ES-phase in the optimization procedure increased the precision of ordering (reducing the deviation of K_{r} from unity) and stabilized the ordering among Monte Carlo runs (as displayed in the reduction of σ_{K_r}).
Comparison with multilocus algorithms: The foregoing results illustrate the advantages of the ordering procedure based on minimization of the total length of the map (the sum of recombination rates or distances between consecutive pairs of markers). Combined with our novel, highly efficient method of discrete optimization, it provides unique performance and rather high robustness with respect to various disturbances (such as classification errors, negative interference, and missing data). It is noteworthy that ordering 100, 200, 400, and 800 markers takes ∼1.3 sec, 14 sec, 2 min, and 9 min, respectively, on a Pentium 4 2.0-GHz computer in the most complicated of the aforementioned situations. Note that even better performance was found in the first trials of our new optimizer based on guided evolution strategies (GES): on the same computer, map ordering for the foregoing variants proved fivefold (!) faster (Mester and Braysy 2003). It would be of interest to compare our algorithm with other procedures, like those of Ott (1991) and Lander and Green (1987). Ott (1991) proposed a criterion based on sliding summation of three-locus LODs along the chromosome. This criterion was compared with the foregoing OC criterion (see Equation 1), using our optimization tools, on the basis of 10 Monte Carlo samples. The simulated F_{2} data were for a 100-marker map (total length 500-600 cM) and a population size of n = 200, with very high noise caused by misclassification (40% of the markers were simulated with 20% typing errors!). The pairwise comparison shows (Table 7) that OC invariably does better than S_{LOD} (higher values of the coefficient of restoration K_{r} were obtained for OC).
To compare the efficiency of the OC criterion (Equation 1) with the multilocus-likelihood method, the MapMaker 3.0 software was employed on a simulated data set of 200 markers with high negative interference in several regions. First, we revealed on the simulated map all islands where, for three consecutive markers i, i + 1, and i + 2, either r_{i,i+1} or r_{i+1,i+2} was larger than r_{i,i+2}. For each such island, three "windows" involving 5, 7, and 9 markers, respectively, were analyzed using MapMaker. Simultaneously, the entire set of 200 markers was ordered with our program. Despite the fact that a local order that one could derive by comparing multilocus likelihoods for all possible candidate orders of such local neighborhoods (of 5, 7, or 9 markers) cannot be considered a final solution, it makes sense to compare the local properties of the MapMaker solutions and those of the OC-based procedure (with OC defined by Equation 1). This is especially important for situations in which the natural condition r_{i,i+1} and r_{i+1,i+2} < r_{i,i+2} is violated, causing the highest instability of the result under sampling variation, e.g., when using jackknife or bootstrap procedures. It should be noted, however, that application of these techniques seems impossible with MapMaker for ∼100 and more markers because of CPU limitations. The following are the details of this comparison. The simulated data of 200 markers included (a) for 95% of the intervals, L = 0.75 cM, for 2.5%, L = 30 cM, and for the remaining 2.5%, L = 60 cM; (b) 80% of the markers dominant in coupling phase and 20% codominant; (c) a population size of N = 400; and (d) interference, with probability P = 0.6, c ∈ (0, 1); with P = 0.2, c ∈ (1, 2); and with P = 0.2, c ∈ (2, 20).
This example included ten 3-marker islands with violation of the condition {r_{i,i+1} and r_{i+1,i+2} < r_{i,i+2}}. In addition to negative interference or classification errors, such a violation may derive from sampling fluctuations, especially when two adjacent intervals are of very different lengths. In 8 out of 10 such islands, our algorithm recovered the true order (the entire solution for 200 markers took <1 sec). MapMaker recovered the true order in 5 out of 10 islands on the basis of the 5-marker window; the remaining 5 islands were treated using the 7-marker window, which recovered the true order in an additional 3 islands, and the last 2 were treated using the 9-marker window with 50% success. These last two tasks took 6 hr.
Possibilities to validate the solution: Clearly, the foregoing comparisons using simulated data only illustrate the quality of the solution provided by the simple OC-based procedure. In dealing with real data, one needs tools to validate the obtained order, and it is hard to choose among several (sometimes dozens of) candidate solutions (like those provided by MapMaker). To cope with this problem, some authors have proposed computing-intensive procedures based on various combinations of jackknifing and bootstrapping (Efron 1979; Mott et al. 1993; Wang et al. 1994; Liu 1998). With a sufficiently large number of markers, the feasibility of such analysis strongly depends on the performance of the employed ordering algorithm and the quality of the solution. We believe that our algorithm fits both of these demands: its high performance allows us to conduct the ordering procedure many times under different jackknife or bootstrap iterations of the initial sample (Figure 3).
The first step is ordering the markers using the whole data set. To validate (or correct) the obtained map, the following analysis is conducted on the basis of a large series of jackknife runs (e.g., 1000-10,000). In each run, based on a subsample of individuals (e.g., 90%), we first order the markers and determine, for each marker, its two (left and right) neighbors. Then, for each marker, the frequency distribution of its closest left and right neighbors is calculated, and unstable neighborhoods are detected using the entire set of generated jackknife runs. Such cases are classified according to the putative causal factor of instability: (a) double recombinants in adjacent intervals (resulting from negative interference or misclassification) and (b) sampling variation of recombination in large intervals. The first problem can be treated by taking out marker scores qualified as "double recombinants" (without affecting the scores of other markers of the same individual). The revised data set is then reanalyzed by a repeated ordering procedure using the same jackknife approach. If the instability was caused by the second problem, one could split the map into two linkage groups until additional markers become available for the revealed gap. The following simulated example illustrates the application of the algorithm. The simulated data of 100 codominant markers included (i) for 80% of the intervals, L ∈ (5, 10) cM, and for 20%, L ∈ (10, 20) cM; (ii) a population size of N = 300; and (iii) interference, with P = 0.6, c ∈ (0, 1); with P = 0.2, c ∈ (1, 2); and with P = 0.2, c ∈ (2, 20).
Each jackknife run employed 275 (92%) of the individuals at both steps: initial ordering and validation based on the revised data set. One thousand runs were analyzed. A typical fragment of the matrix characterizing the stability of neighborhoods is shown in Table 8a. It can easily be seen from this fragment that two local orders are possible for this part of the map, s_{1} = (67, 68, 69, 70) and s_{2} = (67, 69, 68, 70), with probabilities P(OC(s_{1}) < OC(s_{2})) = 0.737 and P(OC(s_{1}) > OC(s_{2})) = 0.263. The recombination rates for the two orders, calculated on the initial data set, are shown in Figure 4. Thus, the OC values are OC(s_{1}) = 0.099 + 0.082 + 0.069 = 0.250 and OC(s_{2}) = 0.143 + 0.082 + 0.029 = 0.254. Therefore, on the basis of the OC values, one would choose the true order s_{1}, but it is clear that for another sample s_{2} might be selected as well, owing to sampling variation of recombination rates. This is why it is important not only to detect such questionable neighborhoods, but also to evaluate the probabilities of the competing local orders. The same is true for any other ordering criterion, e.g., maximum likelihood, because P(L(s_{1}) > L(s_{2})) is also a priori unknown. For our numerical example, the probabilities of the compared alternative orders do not differ significantly, so further steps are needed to obtain a solution with higher confidence.
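The neighbor-frequency bookkeeping behind this analysis can be sketched as follows (a simplified tally; the ordering algorithm that produces each jackknife order is abstracted away, and the toy run set mimics the 0.737/0.263 split discussed above):

```python
from collections import Counter

def neighbor_stability(orders):
    """Given the marker orders from many jackknife runs, return for
    each marker the relative frequency with which every other marker
    appeared as its immediate (left or right) neighbor."""
    tally = {}
    for order in orders:
        for i, m in enumerate(order):
            nbrs = tally.setdefault(m, Counter())
            if i > 0:
                nbrs[order[i - 1]] += 1       # left neighbor
            if i < len(order) - 1:
                nbrs[order[i + 1]] += 1       # right neighbor
    # convert raw counts into relative frequencies per marker
    return {m: {n: c / sum(cnt.values()) for n, c in cnt.items()}
            for m, cnt in tally.items()}

# toy example: markers 68 and 69 swap places in a quarter of the runs
runs = [[67, 68, 69, 70]] * 75 + [[67, 69, 68, 70]] * 25
stab = neighbor_stability(runs)
```

Low maximum frequencies in a marker's row of this table flag exactly the unstable neighborhoods that Tables 8a-c summarize.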
The simplest way to improve the quality of the solution is to remove the questionable marker (Mott et al. 1993; Liu 1998). In our example, the source of the difficulties in the island 67-70 was marker 69, which inflated the size of the spanning interval 68-70 fivefold. After removing marker 69, the jackknife procedure was applied 1000 times again, with the results shown in Table 8b. Thus, by deleting the problematic marker, one can obtain an unequivocal local ordering.
A more complex approach is based on temporary exclusion of marker scores considered double recombinants (without affecting other markers of the same individuals; see Figure 3). This increases the probability of recovering the true order by excluding (albeit artificially) the local violations of the condition r_{i,i+1} and r_{i+1,i+2} < r_{i,i+2}. After such editing of the data, we again applied the jackknife procedure. In the above example, after 1000 runs we obtained the result shown in Table 8c. Thus, as expected, removing double recombinants resulted in increased stability of the derived ordering: the weakest connection between neighbors in the locality 67-68-69-70 increased from P = 0.737 to P = 0.974. Note that after removing double recombinants, r_{68,69} and r_{69,70} decreased from 0.082 and 0.069 to 0.025 and 0.011, respectively, and r_{68,70} = 0.029. Therefore, the condition r_{i,i+1} and r_{i+1,i+2} < r_{i,i+2} is no longer violated. The same procedure could be applied to test the second local order, namely 67-69-68-70.
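The temporary exclusion of double-recombinant scores can be sketched as follows. This is a deliberately crude simplification: a score change between adjacent codominant markers is treated as a recombination event, which ignores the full diploid F2 genetics, and all names are ours:

```python
def mask_double_recombinants(scores, i):
    """Set to None (missing) the score of marker i+1 for every individual
    that appears recombinant in both flanking intervals (i, i+1) and
    (i+1, i+2), leaving all other scores of that individual intact.
    scores: list of per-individual marker score lists."""
    def recombinant(a, b):
        # crude proxy: a change between two non-missing scores
        return a is not None and b is not None and a != b

    for ind in scores:
        if recombinant(ind[i], ind[i + 1]) and recombinant(ind[i + 1], ind[i + 2]):
            ind[i + 1] = None   # exclude only the suspect score
    return scores
```

Because only the middle score of the suspect triple is masked, the rest of that individual's data still contributes to all other pairwise recombination estimates, as the procedure above requires.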
An example of application to real data: We applied the proposed approach to recently published mapping data on the maize Intermated B73 × Mo17 (IBM) population (Lee et al. 2002). For demonstration, the first chromosome (with 231 markers) was chosen from the Map database (www.maizemap.org/ibm_frameworkmaps.htm, framework_302.xls file). In our treatment of this data set, several questions that could be addressed during map construction and its jackknife-based validation were of interest: (a) to reveal the map segments with stable neighborhoods (P = 1 for each pair of adjacent markers) that fully coincide with the published map (Lee et al. 2002); (b) to reveal the map segments with neighborhood probability higher than some threshold (e.g., P = 0.90 or 0.95) that coincide with the published map; (c) to reveal the map segments with neighborhood probability higher than some threshold (P = 0.90 or 0.95) that do not coincide with the published map; (d) to demonstrate alternative (competing) orders of the same region with unreliable neighborhoods (i.e., with neighborhood probability lower than the threshold) that could be resolved by excluding 1–2 markers so as to fit condition b or c; and (e) to reveal the segments of the map for which exclusion of a larger group of markers (e.g., ≥2) is needed to fit condition b, c, or d for the remaining subgroups.
For simplicity of presentation, the "natural" marker numbers as they appear in the Excel data file (see also Lee et al. 2002) were used as the reference ordering. To address the foregoing questions (a–e), we employed our ordering algorithms for map construction and the jackknife resampling procedure to test the reliability of the resulting orders (using 100 jackknife runs, each sampling 90% of the genotypes). The results were as follows. First, marker 24 showed no close linkage with any of the remaining 230 markers; hence it was excluded from the map. The remaining marker groups were classified with respect to the jackknife test of neighborhood stability:

Regions with stable (P = 1) neighborhoods that fully coincide with the published IBM map: 1–14, 37–48 (without marker 39, which was linked more closely to other markers of chromosome 1; see below), 52–60, 81–84, 89–91, 112–115, 120–125, 128–133, 140–153, 205–208, and 218–221.

Segments with neighborhood probability P > 0.90 that coincide with the published map: 18–25 (but without marker 24), 28–33, 63–75, 160–165, 168–174, and 221–231.

Segments with neighborhood probability P > 0.9 that did not coincide with the published map; the revised orders were: (1) 174–176–175–177; (2) 179–181–180 and 184–185–186–188; (3) 204–202–201–205; and (4) 214–216–215–217.

Islands with simple unresolved alternatives; for resolution (i.e., to reach condition b or c above) it is necessary to exclude 1–2 markers:

15–16–17–18 with P(15–16) ≈ 0.62 vs. 15–17–16–18 with P(15–17) ≈ 0.38: after marker 17 is excluded, we obtain P(15–16) = P(16–18) = 1.

25–27–26–39–28 with P(25–27) ≈ 0.72 vs. 25–26–27–39–28 with P(25–26) ≈ 0.28: after marker 27 is excluded, we obtain P(25–26) = P(26–39) = P(39–28) = 1. Resolving this situation also improved the stability of the foregoing group 28–33, which can now be moved from the set of groups with P = 0.9 to the set of fully stable ones with P = 1.

Stabilization of group 28–33, in its turn, caused an improvement for group 33–36. Instead of the initial dichotomy, 33–34–35–36 with P(33–34) ≈ 0.78 vs. 33–35–34–36 with P(33–35) ≈ 0.22, we obtained P(33–34) = 0.96, P(34–35) = 1, and P(35–36) = 0.96.

48–49–51–50–52 with P(48–49) ≈ 0.84 vs. 48–50–51–49–52 with P(48–50) ≈ 0.16. By excluding marker 51, we obtain P(48–49) = 0.93, P(49–50) = 1, and P(50–52) = 0.93.

60–62–61–63 with P(60–62) ≈ 0.62 vs. 60–61–62–63 with P(60–61) ≈ 0.38. By excluding marker 61, we obtain P(60–62) = P(62–63) = 1.

75–78–77–76–79–80–81 with P(75–78) ≈ 0.6 vs. 75–79–80–76–77–78–81 with P(75–79) ≈ 0.4. After excluding markers 76 and 77, we obtain P(75–78) = 0.95, P(78–79) = 0.99, P(79–80) = 1, and P(80–81) = 0.95.

84–85–86–88–87–89 with P(84–85) ≈ 0.5 vs. 84–86–85–88–87–89 with P(84–86) ≈ 0.5. Excluding markers 86 and 87 results in P(84–85) = P(85–88) = P(88–89) = 1.

115–116–117–118–119 with P(115–116) ≈ 0.55 vs. 115–117–118–116–119 with P(115–117) ≈ 0.45. After marker 116 is excluded, P(115–117) = P(117–118) = P(118–119) = 1.

125–126–127–128 with P(125–126) ≈ 0.56 vs. 125–127–126–128 with P(125–127) ≈ 0.44; exclusion of marker 126 gives P(125–127) = P(127–128) = 1.

133–134–135–136 with P(133–134) ≈ 0.69 vs. 133–135–134–136 with P(133–135) ≈ 0.31; exclusion of marker 134 gives P(133–135) = P(135–136) = 1.

136–138–137–139 with P(136–138) ≈ 0.66 vs. 136–137–138–139 with P(136–137) ≈ 0.34; exclusion of marker 138 gives P(136–137) = P(137–139) = 1.

153–155–154–157–156–158–159–160 with P(153–155) ≈ 0.55 vs. 153–154–155–157–156–158–159–160 with P(153–154) ≈ 0.45; exclusion of markers 154 and 155 gives P(153–157) = 0.94, P(157–156) = 0.97, P(156–158) = 0.97, P(158–159) = 0.97, and P(159–160) = 1.

165–167–166–168 with P(165–167) ≈ 0.55 vs. 165–166–167–168 with P(165–166) ≈ 0.45; exclusion of marker 166 gives P(165–167) = 1 and P(167–168) = 0.95.

180–182–183–184 with P(180–182) ≈ 0.56 vs. 180–183–182–184 with P(180–183) ≈ 0.44; exclusion of marker 183 gives P(180–182) = P(182–184) = 1.

208–210–211–209–212 with P(208–210) ≈ 0.65 vs. 208–209–210–211–212 with P(208–209) ≈ 0.35; exclusion of marker 209 gives P(208–210) = 1, P(210–211) = 0.99, and P(211–212) = 1.


Segments of the map for which exclusion of a larger group of markers (≥2) is needed to fit condition b, c, or d for the remaining subgroups: (i) 91–112 and (ii) 187–200.

In the first group, P(92–93) = 0.7, P(94–95) = 0.54, P(95–96) = 0.3, P(96–97) = 0.72, P(97–98) = 0.45, P(98–99) = 0.39, P(99–100) = 0.83, P(100–101) = 0.14, P(101–102) = 0.38, P(102–103) = 0.99, P(103–104) = 0.17, P(104–105) = 0.91, P(105–106) = 0.5, P(106–107) = 0.00, P(107–108) = 1, P(108–109) = 0.43, P(109–110) = 0.95, P(110–111) = 0.55, and P(111–112) = 0.88. By excluding markers 94, 95, 99, 101, 104, and 109, we obtained P = 1 for the pairs 92–93, 96–97, 97–98, 98–100, 100–102, 102–103, 103–106, 106–105, 105–107, 107–108, 108–110, and 110–111, and P = 0.99 for 93–96.

In the second group, P(187–188) = 0.47, P(188–189) = 0.61, P(189–190) = 0.9, P(190–191) = 0.99, P(191–192) = 0.41, P(192–193) = 1, P(193–194) = 0.66, P(194–195) = 0.5, P(195–196) = 1, P(196–197) = 0.75, P(197–198) = 0.5, P(198–199) = 0.61, and P(199–200) = 0.83. After excluding markers 187 and 198, we obtained P(188–189) = 1, P(189–190) = 0.97, P(190–191) = 1, P(191–192) = 0.96, P(192–193) = 0.98, P(193–194) = 0.97, P(194–195) = 0.96, and P = 1 for the pairs 195–196, 196–197, 197–199, and 199–200.

The results of these manipulations are presented in Figure 5. It is noteworthy that the obtained map differs from the published one (see Lee et al. 2002). In the new version of the map, recently presented at www.maizemap.org/ibm_frameworkmaps.htm, the authors deleted 40 markers, whereas in our version only 28 markers are deleted. The foregoing analysis suggests that our version is of better quality than the revised map presented on the website.
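The neighborhood probabilities quoted throughout this example come from simple bookkeeping over the jackknife runs: for each pair of markers adjacent in the full-data order, count the fraction of resampled orders in which they remain neighbors. A sketch assuming the per-run orders are already available (a real run would recompute each order from 90% of the genotypes):

```python
from collections import Counter

def neighborhood_stability(base_order, run_orders):
    """P(adjacency) for each adjacent pair of base_order across run orders."""
    counts = Counter()
    for order in run_orders:
        pos = {m: k for k, m in enumerate(order)}
        for a, b in zip(base_order, base_order[1:]):
            if abs(pos[a] - pos[b]) == 1:   # still neighbors in this run
                counts[(a, b)] += 1
    n = len(run_orders)
    return {pair: counts[pair] / n for pair in zip(base_order, base_order[1:])}

base = (67, 68, 69, 70)
# Invented run results: three runs recover s1, one recovers s2.
runs = [(67, 68, 69, 70)] * 3 + [(67, 69, 68, 70)]
p = neighborhood_stability(base, runs)
assert p[(67, 68)] == 0.75 and p[(68, 69)] == 1.0 and p[(69, 70)] == 0.75
```

A pair with P well below 1 marks a "weak connection" of the kind resolved above by excluding one or two markers.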
DISCUSSION
This study is devoted to the problem of marker ordering in linkage groups with many dozens or hundreds of markers. We considered situations complicated by missing data, typing errors, a high proportion of dominant markers, and high negative interference. The ordering problem belongs to the field of discrete optimization on the set of all possible orders (amounting to n!/2 for n loci). This formulation is quite similar to the well-known, challenging TSP, and several authors have attempted to employ methods developed for the TSP in genetic mapping (Press et al. 1986; Weeks and Lange 1987; Falk 1992; Schiex and Gaspin 1997). New ES-optimization algorithms developed by Mester (1999, 2000, and unpublished results) significantly improved the quality of solutions in the TSP field (see Table 2). Our simulation experiments showed that the need for the optimization power provided by these ES algorithms usually begins with ordering problems of >20 markers; for smaller problems the Composite algorithm seems sufficient. Composite is built from simple optimization procedures that work faster than ES but produce worse solutions. At all tested sizes of the ordering problem (50 markers and more), the ES algorithm provided the best solution after one to six evolutionary cycles. These results allowed us to define a threshold for the solution time (not more than six cycles) for the ES algorithm at different sizes of the problem. The advantage of ES over other selected optimization algorithms, in particular simulated annealing (SA), as applied to combinatorial problems, can be clearly seen from Table 2. Therefore, it was quite natural to apply this fast and efficient approach to multilocus ordering. Applying the TSP-oriented methodology to mapping problems is especially simple if, instead of multilocus likelihood, a faster criterion based on minimization of the total map length is employed.
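The total-map-length criterion makes the TSP analogy concrete: find the permutation of markers minimizing the sum of adjacent pairwise distances. The ES algorithm itself is not reproduced here; the sketch below uses a plain 2-opt local search on a toy distance matrix, only to illustrate the criterion being optimized:

```python
def tour_length(order, d):
    """Total map length of an order: sum of adjacent pairwise distances."""
    return sum(d[a][b] for a, b in zip(order, order[1:]))

def two_opt(order, d):
    """Repeatedly reverse segments while that shortens the total map length."""
    order = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            for j in range(i + 2, len(order) + 1):
                cand = order[:i] + order[i:j][::-1] + order[j:]
                if tour_length(cand, d) < tour_length(order, d):
                    order, improved = cand, True
    return order

# Toy symmetric distance matrix for 4 markers whose true order is 0-1-2-3.
d = [[0.00, 0.05, 0.10, 0.15],
     [0.05, 0.00, 0.05, 0.10],
     [0.10, 0.05, 0.00, 0.05],
     [0.15, 0.10, 0.05, 0.00]]
best = two_opt([2, 0, 3, 1], d)
assert best in ([0, 1, 2, 3], [3, 2, 1, 0])   # order and its reverse are equivalent
```

Note that an order and its reverse have equal length, which is why there are n!/2, not n!, distinct orders to consider.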
Combined with the high performance of the optimization algorithm, this simplification allows us to treat the problem systematically, with verification of the obtained multilocus order on the basis of computing-intensive bootstrap and jackknife approaches.
To analyze the properties of derived maps under various complications (dominant vs. codominant markers, marker misclassification, alternating negative and positive interference, and missing data), simulated data with ∼50–400 markers were employed, and the quality of the map was evaluated using a "coefficient of restoration" (based on comparison between the simulated and recovered orders). It appeared that the employed optimization criterion enabled us to achieve a very close proximity of the calculated orders to the simulated ones, despite missing data or misclassification. Two types of deviations from the true order were revealed: (i) local "inversion" usually involving adjacent markers and (ii) "excision" of a map fragment and its "insertion" (with or without inversion) into another map region. The first type of error is caused by violation of the condition r_{i,i+1}, r_{i+1,i+2} < r_{i,i+2} due to high negative interference or marker misclassification. The second type of error occurs mainly due to large gaps along the map (caused by low density of markers in some chromosomal regions).
To detect unreliable segments of the map, bootstrap and jackknife techniques can be employed. Unless the optimization procedure is highly efficient, the application of these approaches must be constrained to a relatively small number of markers due to CPU limitations. This is not the case with our ES-optimization algorithm: ordering of 100 markers takes ∼0.2–1.5 sec on a Pentium 2-GHz computer. A further severalfold improvement in performance is expected from our new optimizer based on guided evolution strategies (Mester and Bräysy 2003). The diagnostic approach for detecting unreliable map regions proposed in this article differs in some aspects from other procedures [e.g., from the bootstrap procedure described by Liu (1998)]. We employ an invariant description of marker orders based on the notion of marker neighborhoods rather than marker map positions. Such a representation is closely related to our method of evaluating restoration quality in the simulation experiments, i.e., proximity between the "true" (simulated) and recovered orders, which is independent of the specific coordinate system (e.g., recombination rates or map positions). We believe that marker order is a much more objective indicator for comparing multipoint maps than map positions. Indeed, even with strict constancy of gene order within a species, recombination rates (hence map positions) may fluctuate widely from experiment to experiment due to sampling variation and dependence on ecological conditions, sex, genotype, and age (Korol et al. 1994). Consequently, genetic mapping of any target trait, either qualitative or quantitative, through determining the marker brackets will be less dependent on these fluctuations and more comparable across independent studies than a related effort based on map position in centimorgans. This is especially clear when the results of fine mapping serve as a starting point for map-based cloning.
In such a case, what is really important is the information about close markers rather than precise map position.
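A neighborhood-based proximity between two orders can be made orientation-invariant and free of any coordinate system (no centimorgans involved): score the fraction of adjacent pairs of the reference order that remain adjacent in the recovered order. The function name and normalization below are our illustration, not necessarily the article's exact "coefficient of restoration":

```python
def neighborhood_proximity(ref_order, rec_order):
    """Fraction of adjacent pairs of ref_order also adjacent in rec_order."""
    pos = {m: k for k, m in enumerate(rec_order)}
    pairs = list(zip(ref_order, ref_order[1:]))
    kept = sum(abs(pos[a] - pos[b]) == 1 for a, b in pairs)
    return kept / len(pairs)

true_order = [1, 2, 3, 4, 5, 6]
recovered  = [6, 5, 4, 3, 1, 2]   # full reversal plus one local inversion
assert neighborhood_proximity(true_order, recovered) == 0.8
```

The full reversal costs nothing under this measure (the order and its reverse are the same map); only the local inversion of markers 1 and 2 breaks one of the five neighborhoods.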
There is also a technical advantage of using the marker orders rather than marker map positions (centimorgans) as final mapping results. Our verification procedure based on jackknife and bootstrap techniques reveals neighborhoods of questionable local ordering and enables us to detect the "weak connections" in the marker chain. If such a local "weakness" was caused by low marker density, one can split the map into two linkage groups and/or attempt to add new markers to fill the gap. In the case of an excess of double recombinants, the detected questionable marker scores can be removed from the data (without having to delete the marker entirely) with a subsequent reanalysis of the map, as in similar options available in other mapping tools (e.g., MapMaker). This purifying operation may be sufficient to stabilize the resulting map, and it is reasonable if the questionable score derives from a typing error (which can be tested by repeated typing). However, there is some evidence that an excess of double recombinants may result from negative interference (Peng et al. 2000; Boyko et al. 2002; Esch and Weber 2002). Even then, such a treatment is useful as a diagnostic step, and after getting an idea of what factors caused the local problem, one may continue the analysis. For instance, it makes sense to deal with two versions of the defined region: one (purified) for mapping needs only and the other for further in-depth study of the putative negative interference. Unlike many other procedures that remove double recombinants and conclude the analysis by recalculating the orders, in our case this step is complemented by reanalysis of the probabilities P(OC(s_{1}) > OC(s_{2})) and P(OC(s_{1}) < OC(s_{2})), thereby providing a direct tool for statistically justified decisions.
We should recall another reason to deal with two map versions simultaneously, which is related to the linkage phase of dominant markers, i.e., coupling vs. repulsion. As shown above, the higher precision of ordering coupling-phase dominant markers compared to repulsion-phase data justifies splitting the dominant marker data into two sets, each with the coupling phase only, and generating two maps for each linkage group (Knapp et al. 1995; Peng et al. 2000). Clearly, such a procedure should be followed by integration of the two maps. The availability of shared codominant markers enables mutual control during multilocus ordering (Mester et al. 2003), facilitating the integration that can be conducted by a proper algorithm (e.g., Lalouel 1977; Stam 1993; Mester et al. 2003). Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm.
Acknowledgments
The useful suggestions of G. Churchill and an anonymous reviewer are acknowledged with thanks. This study was supported by the Israeli Ministry of Absorption, the U.S. Agency for International Development Cooperative Development Research Program (grant TAMOU97CA17001), and the German-Israeli Cooperation Project [Deutsch-Israelische Projektkooperation project funded by the Bundesministerium für Bildung und Forschung (BMBF) and supported by BMBF's International Bureau at the Deutsches Zentrum für Luft- und Raumfahrt].
Footnotes

Communicating editor: G. Churchill
 Received July 23, 2003.
 Accepted August 28, 2003.
 Copyright © 2003 by the Genetics Society of America