- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Mester, D.
- Articles by Korol, A.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Mester, D.
- Articles by Korol, A.
Constructing Large-Scale Genetic Maps Using an Evolutionary Strategy Algorithm
D. Mestera, Y. Ronina, D. Minkova, E. Nevoa, and A. Korolaa Institute of Evolution, University of Haifa, Haifa 31905, Israel
Corresponding author: A. Korol, University of Haifa, Haifa 31905, Israel., korol{at}esti.haifa.ac.il (E-mail)
Communicating editor: G. CHURCHILL
| ABSTRACT |
|---|
This article is devoted to the problem of ordering in linkage groups with many dozens or even hundreds of markers. The ordering problem belongs to the field of discrete optimization on a set of all possible orders, amounting to n!/2 for n loci; hence it is considered an NP-hard problem. Several authors attempted to employ the methods developed in the well-known traveling salesman problem (TSP) for multilocus ordering, using the assumption that for a set of linked loci the true order will be the one that minimizes the total length of the linkage group. A novel, fast, and reliable algorithm developed for the TSP and based on evolution-strategy discrete optimization was applied in this study for multilocus ordering on the basis of pairwise recombination frequencies. The quality of derived maps under various complications (dominant vs. codominant markers, marker misclassification, negative and positive interference, and missing data) was analyzed using simulated data with
50400 markers. High performance of the employed algorithm allows systematic treatment of the problem of verification of the obtained multilocus orders on the basis of computing-intensive bootstrap and/or jackknife approaches for detecting and removing questionable marker scores, thereby stabilizing the resulting maps. Parallel calculation technology can easily be adopted for further acceleration of the proposed algorithm. Real data analysis (on maize chromosome 1 with 230 markers) is provided to illustrate the proposed methodology.
AN important step in generating multilocus genetic maps using the results of linkage analysis is the determination of the true marker order. One of the possibilities in addressing this problem is to recover the linear marker order from the known pairwise marker distance matrix dij. A primary difficulty in ordering genetic loci using linkage analysis is the large number of possible orders: for n loci on a chromosome, n!/2 distinct orders should be evaluated. In real problems, n might vary from dozens to 200500 markers and more (e.g., www.maizemap.org/ibm_frameworkmaps.htm; see also ![]()
30, it would not be feasible to evaluate all n!/2 possible orders using two-point linkage data. This is why multilocus ordering is considered as a nonpolynomial (NP)-hard combinatorial problem (![]()
![]()
![]()
![]()
Several methods have been proposed for determination of marker order (![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
We present in this article a new, highly efficient algorithm of multilocus ordering based on two-locus linkage data that employs the evolutionary optimization strategy (ES). ES is a heuristic algorithm mimicking natural population processes. The numerical procedures in such optimization are based on simulation of mutation and reproduction, followed by selection of the fittest "genotypes," representing the obtained values of the optimization criterion. Together with genetic algorithm (![]()
![]()
![]()
![]()
![]()
![]()
![]()
In contrast to ES, genetic algorithms, introduced by ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
| EVOLUTION STRATEGIES AND THE HEURISTICS IN THE DEVELOPED ALGORITHM |
|---|
The employed procedure as a simulated analog of evolutionary processes:
Usually, the optimization process of an objective function f(x) with n real-value variables x = (x1, x2, ... , xn) can be represented as an evolution of the solution vector x
Rn . The main elements of ES algorithms and their correspondence with the elements and processes of an "evolving population" are presented in Table 1.
|
The common ES algorithm steps:
Evolution strategies define the size of a population and the rules for the selection process. Various approaches were proposed for choosing the population size in the ES, including the (1 + 1) strategy (![]()
) strategy (![]()
) strategy works with a population of size
. The selection operator chooses µ best individuals to establish the new generation. Both versions, (1 + 1) and (µ,
), employ the following steps:
- Create
individuals (xk) of initial population P0. - Compute the fitness f(xk), k = 1, ... ,
. - If the optimization process is terminated, then stop.
- Select the µ
best individuals. - Create
/µ offspring xk+1 of each of the µ individuals by small variation. - Return to step 2.
Peculiarities of the combinatorial version of ES:
Clearly the multilocus ordering problem cannot be directly represented in terms of ES with real-value formulation. Combinatorial versions of ES differ from the real-value formulation by specific representation of the solution vector x and mutation mechanisms (![]()
The mutation operator (referred to hereafter as mutator) changes the vector xk, thereby producing a new solution vector xk+1. For this goal, one can use the move-generation and the solution-generation mechanisms (![]()
![]()
![]()
![]()
xkreject into xkremainder. This is the construction phase of the mutator, which builds some new solution vectors xk+1 using the variation of the problem-specified criterion (![]()
![]()
![]()
![]()
At the mutation stage, mutator M(R, I, ß, xk) produces an offspring xk+1 from the parent xk. If the first offspring appears to surpass the parent, the mutator with the same parameters is applied again to the new parent, and so on. If the offspring does not surpass the parent, then to generate the new offspring, the algorithm uses the mutator with other parameters. After mutation, the vector xk+1 "is improved" by standard combinatorial procedures of order O(n2): (1) 2-Opt (![]()
![]()
![]()
This two-phase approach (mutation-improving) reflects the principles of solution diversification and upgrading (![]()
![]()
![]()
![]()
![]()
|
Multipoint marker ordering as a TSP problem:
The proposed algorithm of multipoint ordering employs two-point linkage data (see also ![]()
![]()
![]()
![]()
x and, for each n - 1 marker pairs (xi, xj), a "distance"
ij. As
ij, either pairwise recombination fractions rij or map distances dij (e.g., in Haldane or Kosambi metrics) are employed.
Different criteria can be used to discriminate between competitive orders, for example, total distance measured as a sum of distances between consecutive adjacent markers or the total number of recombination events. These criteria are founded on a biologically reasonable assumption that the true order of a set of linked loci will be the one that minimizes the total length of the chromosomal map (![]()
![]()
![]()
![]()
![]() |
(1) |
where
ij = 0 or
ij = 1 represents in the criterion only u
n - 1 distances out of all n(n - 1)/2 pairwise distances;
The program for simulations was written in Visual Basic 6.0. Monte Carlo testing experiments were conducted on a double-processor Pentium 3 (800 Mhz). To compare different situations, the following coefficient of restoration quality [proximity between the "true" (simulated) and estimated orders] was employed,
![]() |
(2) |
where xi is the digit code of the ith marker in the currently ordered marker sequence. Fig 1 illustrates a typical dependence of Kr on executing time using different heuristics.
|
| SIMULATED DATA SETS |
|---|
The data for analysis were produced using a pseudorandom generator. The simulation algorithm repeatedly generated a single-chromosome mapping population, F2, for a chosen number of markers with:
- Variation of recombination rates between adjacent markers along the chromosome;
- defined proportion of dominant vs. codominant markers;
- chosen proportion of missing data;
- chosen proportion of markers with erroneous classification and level of errors; and
- chosen mode of recombination interference for adjacent markers, Haldane, Kosambi, or arbitrary interference. In the last case, we define a few ranges of coincidence values and the probabilities to sample the coincidence values from these ranges (with even distribution of the coincidence values from the chosen range).
The following are the numerical values (ranges) of the main parameters in the majority of experiments:
- The number of markers per chromosome: m = 80.
- Probability distributions for distances between adjacent markers:
- P(3 < d (cM) < 5) = 0.8, P(5 < d (cM) < 10) = 0.15, P(10 < d (cM) < 20) = 0.05, with even distribution within each of the three ranges.
- Proportions of codominant and dominant markers (in coupling phase, unless noted otherwise): 0.5 and 0.5, respectively.
- Three levels for missing data: 0, 10, and 20%.
- Two levels for the proportion of loci with classification errors, 0 and 40%, and in the last case, two levels of misclassification, 10 and 20%.
- In the case of arbitrary interference, the distributions of coincidence coefficients: P(0 < c < 1) = 0.6 (positive interference), P(1 < c < 2) = 0.2 (slight-to-moderate negative interference), and P(2 < c < Ic) = 0.2 (moderate-to-strong negative interference), where Ic = 5, 10, 20, and 40.
Therefore, the efficiency of the preliminary multilocus ordering was considered upon complications caused by negative interference, erroneous marker scoring, and incomplete mapping information due to dominant markers and missing data, known to affect the quality of multipoint ordering. Motivation to consider such complications derives from the simple fact that in real mapping work no one can guarantee that the data are free of such complications. Moreover, in numerous previous attempts at building efficient multilocus ordering tools, some of these problems were usually ignored.
| RESULTS |
|---|
The considered types of disturbances (see the end of SIMULATED DATA SETS) proved to affect the quality of restoration of the true order of markers. These disturbances mainly caused local distortion of the order, e.g., interchanging of two to three neighboring markers (referred to as "local disturbances"). There could be several inverted islands per linkage group. As expected, the number of these islands increases with the percentage of missing data, classification errors, and the level of negative interference. Less frequent were violations caused by excision of a large segment and its transposition to another place with or without inversion within the segment ("global disturbances"). Clearly, such violations result in an appreciable reduction of the coefficient of restoration quality (Equation 2).
Dominance:
When all dominant markers were in coupling phase, the proportion of dominant and codominant markers had no effect on the quality of marker ordering. For three proportions of dominant markers (50, 66, and 100%) with Kosambi, Haldane, and a slight negative interference, nearly full recovery of marker order was reached (Kr
0.997 ÷ 0.999). A different result was obtained with dominant markers in repulsion phase. It appears that the higher the proportion of repulsion phase, the lower the quality of multilocus ordering (![]()
Kr, the standard deviation of Kr between the Monte Carlo experiments). High precision of ordering in the coupling-phase data and low precision in the repulsion-phase data justify splitting the data into two sets, each with coupling-phase markers only and generating two complementary maps for each linkage group (![]()
![]()
![]()
![]()
![]()
Negative interference:
As expected, negative interference complicates the ordering problem that is manifested in reduction of Kr (Table 3). However, the decline in Kr with an increase in the maximum value Ic of coefficient of coincidence c is unexpectedly slow, pointing to robustness of the employed ordering procedure. A detailed anatomy of misordered situations shows that deviations from the true order in such cases are due mainly to interchanges of adjacent markers. The relatively low effect of negative interference can be explained by a stabilizing role of the neighboring intervals. This can be illustrated by the following example (Fig 2), in which recombination rate between the flanking markers M1 and M3 is smaller than that for the subinterval M2M3 (see Fig 2A).
|
|
Without taking into account the information from neighboring intervals, the criterion "minimum of total distance between markers" will give the local order M2M1M3 (Fig 2B) that differs from the true one. The stabilizing effect of the neighbor M0 allows us to obtain the true order. Indeed, the optimization criterion value for the true order (Fig 2C) is OC = 0.048 + 0.024 + 0.059 = 0.131, whereas the order corresponding to the foregoing inversion between M1 and M2 (Fig 2D) caused by negative interference results in OC = 0.072 + 0.024 + 0.052 = 0.148. Therefore, despite high negative interference on interval M1M2M3 (c = 15.8), which violates the rule that "the entire entity is supposed to be larger than its parts," the algorithm recovers the true order.
Misclassification:
Errors in marker scoring inflate recombination distances and can also violate the principle, the entire entity is supposed to be larger than its parts by imitating "negative interference." This is why some mapping packages allow for error filtration by selecting out double recombinants. Our simulations showed that in the majority of such local violations the true order could be recovered due to the stabilizing effect of the neighboring markers (Table 4).
|
In a typical example (Table 5) with a maximum level of noise (20% of marker scoring errors were simulated for 40% of marker loci), there were 14 pairs of adjacent intervals (out of 49 possible pairs) in which either ri,i+1 or ri+1,i+2 was larger than ri,i+2, but in only 4 of these pairs the true order could not be recovered. We conclude from the obtained results that despite the biases in pairwise estimates of recombination rates and inflation of the map length, the employed criteria of ordering are fairly robust to errors in marker scoring, unless the errors occur on a catastrophic level (say, at half of the loci and with a rate >>20%).
|
Missing data:
The results presented in Table 6 for several levels of missing marker scores (m = 0, 10, and 20%) show the same tendencies as those found for complicating factors considered above. Thus, participation of the ES-phase in the optimization procedure increased the precision of ordering (reducing the deviation of Kr from unity) and stabilized the ordering among Monte Carlo runs (as displayed in reduction of
Kr).
|
Comparison with multilocus algorithms:
The foregoing results illustrate the advantages of the ordering procedure on the basis of minimization of the total length of the map (sum of recombination rates or distances between consecutive pairs of markers). Combined with our novel, highly efficient method of discrete optimization, a unique performance and rather high robustness with respect to various disturbances (like classification errors, negative interference, and missing data) are provided. It is noteworthy that ordering 100, 200, 400, and 800 markers takes
1.3 sec, 14 sec, 2 min, and 9 min on a Pentium-4 2.0-GHz computer in the most complicated of the aforementioned situations. Note that even better performance was found in the first trials of our new optimizer based on guided evolution strategies (GES): on the same computer, map ordering for the foregoing variants proved fivefold (!) faster (![]()
![]()
![]()
![]()
|
To compare the efficiency of the OC criterion (Equation 1) with the multilocus-likelihood method, MapMaker 3.0 software was employed in a simulated data set of 200 markers with high negative interference in several regions. First, we revealed on the simulated map all islands where for three consecutive markers i, i + 1, and i + 2, either ri,i+1 or ri+1,i+2 was larger than ri,i+2. For each such island, three "windows" involving 5, 7, and 9 markers, respectively, were analyzed using MapMaker. Simultaneously, the entire set of 200 markers was ordered with our program. Despite the fact that a local order that one could derive by comparing multilocus likelihoods for all possible candidate orders of such local neighborhoods (of 5, 7, or 9 markers) cannot be considered as a final solution, it makes sense to compare the local properties of the MapMaker solutions and those of the OC-based procedure (with OC defined by Equation 1). This is especially important for situations in which the natural condition ri,i+1 and ri+1,i+2 < ri,i+2 is violated, causing the highest instability of the result under sampling variation, e.g., by using jackknife or bootstrap procedures. It should be noted, however, that application of these last techniques seems impossible with MapMaker for
100 and more markers because of CPU limitations. The following are the details of this comparison. The simulated data of 200 markers included (a) for 95% of intervals L = 0.75 cM, for 2.5% L = 30 cM, and for the remainder 2.5% L = 60 cM; (b) 80% of the markers dominant in coupling phase, and 20% codominant; (c) population size N = 400; and (d) interference, with probability P = 0.6, c
(0, 1), with P = 0.2, c
(1, 2), with P = 0.2, c
(2, 20).
This example included 10 3-marker islands with violation of the condition {ri,i+1 and ri+1,i+2 < ri,i+2}. In addition to negative interference or classification errors, such a violation may derive from sampling fluctuations, especially when two adjacent intervals are of very different lengths. At 8 out of 10 such islands, our algorithm recovered the true order (the entire solution for 200 markers took <1 sec). MapMaker recovered the true order in 5 out of 10 islands on the basis of the 5-marker window; the remaining 5 islands were treated using the 7-marker window and recovered the true order in an additional 3 islands, and the last 2 were treated using the 9-marker window with a 50% success. The last two tasks took 6 hr.
Possibilities to validate the solution:
Clearly, the foregoing comparisons using simulated data are only to illustrate the quality of the solution provided by the simple OC-based procedure. In dealing with real data, one needs some tools to validate the obtained order, and it is hard to choose the solution from several (sometimes dozens) candidate solutions (like those provided by MapMaker). To cope with this problem, some authors proposed computing-intensive procedures based on various combinations of jackknifing and bootstrapping (![]()
![]()
![]()
![]()
|
The first step is ordering of markers using the whole set of data. To validate (or correct) the obtained map, the following analysis is conducted on the basis of a large series of jackknife runs (e.g., 100010,000). In each run based on a subsample of individuals (e.g., 90%), we first order the markers and for each marker determine its two (left and right) neighbors. Then, for each marker, the frequency distribution of its closest left and right neighbors is calculated and the unstable neighborhoods are detected using the entire set of generated jackknife runs. Such cases are classified according to the putative causal factor of instability: (a) double recombinants in adjacent intervals (resulting from negative interference or misclassification) and (b) sampling variation of recombination in large intervals. The first problem can be treated by taking out marker scores qualified as "double recombinants" (without affecting the scores of other markers of the same individual). The revised data set is reanalyzed by a repeated ordering procedure using the same jackknife approach. If the instability was caused by the second problem, one could split the map into two linkage groups until additional markers are available for the revealed gap. The following simulated example illustrates the application of the algorithm. The simulated data of 100 codominant markers included: (i) for 80% of intervals, L
(5, 10) cM, and for 20% L
(10, 20) cM; (ii) population size N = 300; and (iii) interference, with P = 0.6, c
(0, 1); with P = 0.2, c
(1, 2); and with P = 0.2, c
(2, 20).
Each jackknife run employed 275 (92%) individuals at both steps: initial ordering and validation were based on a revised data set. One thousand runs were analyzed. A typical fragment of the matrix characterizing the stability of neighborhoods is shown in Table 8a. It can be easily seen from this fragment that two local orders are possible for this part of the map, s1 = (67, 68, 69, 70) and s2 = (67, 68, 69, 70), with probabilities P(OC(s1) > OC(s2)) = 0.737 and P(OC(s1) < OC(s2)) = 0.263. The recombination rates for the two orders calculated on the initial data set are shown in Fig 4. Thus, the OC values are OC(s1) = 0.099 + 0.082 + 0.069 = 0.250 and OC(s2) = 0.143 + 0.082 + 0.029 = 0.254. Therefore, on the basis of OC values, one will choose the true order OC(s1), but it is clear that for another sample OC(s2) may be selected as well, due to sampling variation of recombination rates. This is why it is important not only to detect such questionable neighborhoods, but also to evaluate the probabilities of the local competitive orders. The same is true for any other ordering criterion, e.g., maximum likelihood, because P(L(s1) > L(s2)) is also a priori unknown. For our numerical example, the probabilities of the compared alternative orders do not differ significantly, so that further steps are needed to obtain a solution with higher confidence.
|
|
The simplest way to improve the quality of the solution is to remove the questionable marker (![]()
![]()
A more complex approach is based on temporal exclusion of marker scores considered as double recombinants (without affecting other markers of the same individuals; see Fig 3). This increases the probability of recovering the true order by excluding (albeit artificially) the local violations of the condition ri,i+1 and ri+1,i+2 < ri,i+2. After such editing of the data, we again applied the jackknife procedure. In the above example, after 1000 runs we obtained the result shown in Table 8c. Thus, as expected, removing double recombinants resulted in an increased stability of the derived ordering: the weakest connection between the neighbors in the locality 67686970 increased from P = 0.737 up to P = 0.974. Note that after removing double recombinants, r6869 and r6970 decreased from 0.082 and 0.069 to 0.025 and 0.011, respectively, and r6870 = 0.029. Therefore, the condition ri,i+1 and ri+1,i+2 < ri,i+2 is not violated anymore. The same procedure could be applied to test the second local order, namely 67696870.
An example of application to real data:
We employed the proposed approach to recently published mapping data on the maize Intermated B73 x Mo17 (IBM) population (![]()
![]()
2) is needed to fit the conditions b, c, or d for the remaining subgroups.
For simplification of presentation of the results, the "natural" marker numbers as they are represented in the Excel data file (see also ![]()
- Regions with stable (P = 1) neighborhoods that fully coincide with the published IBM map were: 114, 3748 (without marker 39 that was linked closer to other markers of chromosome 1; see below), 5260, 8184, 8991, 112115, 120125, 128133, 140153, 205208, and 218221.
- Segments with neighborhood probability >P = 0.90 that coincide with the published map were: 1825 (but without marker 24), 2833, 6375, 160165, 168174, and 221231.
- Segments with neighborhood probability >P = 0.9 that did not coincide with the published map; the revised orders were: (1) 174176175177; (2) 179181180, 184185186188; (3) 204202201205; and (4) 214216215217.
- Islands with simple unresolved alternatives; for resolution (i.e., to reach the foregoing conditions b or c) it is necessary to exclude 12 markers:
- 15161718 with P(1516)
0.62 vs. 15171618 with P(1517)
0.38: after marker 17 is excluded, we obtain P(1516) = P(1618) = 1. - 2527263928 with P(2527)
0.72 vs. 2526273928 with P(2526)
0.28: after marker 27 is excluded, we obtain P(2526) = p(2639) = p(3928) = 1. Resolving this situation has also improved the stability of the foregoing group 28 ÷ 33 that can be moved now from the set of groups with P = 0.9 to the set of fully stable ones with P = 1. - Stabilization of group 28 ÷ 33, in its turn, caused an improvement for group 33 ÷ 36. Instead of the initial dichotomy, 33343536 with P(3334)
0.78 vs. 33353436 with p(3335)
0.22, we obtained P(3334) = 0.96, P(3435) = 1, and P(3536) = 0.96. - 4849515052 with P(4849)
0.84 vs. 4850514952 with P(4850)
0.16. By excluding marker 51, we can get P(4849) = 0.93, P(4950) = 1, and P(5052) = 0.93. - 60626163 with P(6062)
0.62 vs. 60616263 with P(6061)
0.38. By excluding marker 61, we obtain P(6062) = P(6263) = 1. - 75787776798081 with P(7578)
0.6 vs. 75798076777881 with P(7579)
0.4. After excluding markers 76 and 77, we obtain P(7578) = 0.95, P(7879) = 0.99, P(7980) = 1, and P(8081) = 0.95. - 848586888789 with P(8485)
0.5 vs. 848685888789 with P(8486)
0.5. Excluding markers 86 and 87 results in P(8485) = P(8588) = P(8889) = 1. - 115116117118119 with P(115116)
0.55 vs. 115117118116119 with P(115117)
0.45. After 116 is excluded, P(115117) = P(117118) = P(118119) = 1. - 125126127128 with P(125126)
0.56 vs. 125127126128 with P(125127)
0.44; exclusion of marker 126 gives P(125127) = P(127128) = 1. - 133134135136 with P(133134)
0.69 vs. 133135134136 with P(133135)
0.31; exclusion of marker 134 gives P(133135) = P(135136) = 1. - 136138137139 with P(136138)
0.66 vs. 136137138139 with P(136137)
0.34; exclusion of marker 138 gives P(136137) = P(137139) = 1. - 153155154157156158159160 with P(153155)
0.55 vs. 153154155157156158159160 with P(153154)
0.45; exclusion of markers 154 and 155 gives P(153157) = 0.94, P(157156) = 0.97, P(156158) = 0.97, P(158159) = 0.97, and P(159160) = 1. - 165167166168 with P(165167)
0.55 vs. 165166167168 with P(165166)
0.45; exclusion of marker 166 gives P(165167) = 1, P(167168) = 0.95. - 180182183184 with P(180182)
0.56 vs. 180183182184 with P(180183)
0.44; exclusion of marker 183 gives P(180182) = P(182184) = 1. - 208210211209212 with P(208210)
0.65 vs. 208209210211212 with P(208209)
0.35; exclusion of marker 209 gives P (208210) = 1, P(210211) = 0.99, and P(211212) = 1. - Segments of the map for which an exclusion of a larger group of markers (
2 markers) is needed fit conditions b, c, or d for the remaining subgroups: (i) 91112 and (ii) 187200. - In the first group, P(9293) = 0.7, P(9495) = 0.54, P(9596) = 0.3, P(9697) = 0.72, P(9798) = 0.45, P(9899) = 0.39, P(99100) = 0.83, P(100101) = 0.14, P(101102) = 0.38, P(102103) = 0.99, P(103104) = 0.17, P(104105) = 0.91, P(105106) = 0.5, P(106107) = 0.00, P(107108) = 1, P(108109) = 0.43, P(109110) = 0.95, P(110111) = 0.55, and P(111112) = 0.88. By excluding markers 94, 95, 99, 101, 104, and 109, we obtained P = 1 for pairs 9293, 9697, 9798, 98100, 100102, 102103, 103106, 106105, 105107, 107108, 108110, and 110111, and P = 0.99 for 9396.
- In the second group, P(187188) = 0.47, P(188189) = 0.61, P(189190) = 0.9, P(190191) = 0.99, P(191192) = 0.41, P(192193) = 1, P(193194) = 0.66, P(194195) = 0.5, P(195196) = 1, P(196197) = 0.75, P(197198) = 0.5, P(198199) = 0.61, P(199200) = 0.83. After excluding markers 187 and 198 we obtained P(188189) = 1, P(189190) = 0.97, P(190191) = 1, P(191192) = 0.96, P(192193) = 0.98, P(193194) = 0.97, P(194195) = 0.96, and P = 1 for pairs 195196, 196197, 197199, and 199200.
The results of these manipulations are presented in Fig 5. It is noteworthy that the obtained map differs from the published one (see ![]()
|
| DISCUSSION |
|---|
This study is devoted to the problem of marker ordering in linkage groups with many dozens or hundreds of markers. We considered situations complicated by missing data, typing errors, high proportion of dominant markers, and high negative interference. The ordering problem belongs to the field of discrete optimization on a set of all possible orders (amounting to n!/2 for n loci). This formulation is quite similar to the well-known challenging TSP, and several authors attempted to employ the methods developed in the TSP for genetic mapping (![]()
![]()
![]()
![]()
![]()
![]()
To analyze the properties of derived maps under various complications (dominant vs. codominant markers, marker misclassification, alternating negative and positive interference, and missing data), simulated data with
50400 markers were employed, and the quality of the map was evaluated using a "coefficient of restoration" (based on comparison between the simulated and recovered orders). It appeared that the employed optimization criterion enabled us to achieve a very close proximity of the calculated orders to the simulated ones, despite missing data or misclassification. Two types of deviations from the true order were revealed: (i) local "inversion" usually involving adjacent markers and (ii) "excision" of a map fragment and its "insertion" (with or without inversion) to another map region. The first type of error is caused by violation of the condition ri,i+1 and ri+1,i+2 < ri,i+2 due to high negative interference or marker misclassification. The second type of error occurs mainly due to large gaps along the map (caused by low density of markers in some chromosomal regions).
To detect unreliable segments of the map, bootstrap and jackknife techniques could be employed. Unless the optimization procedure is highly efficient, the application of these approaches should be constrained to a relatively small number of markers due to CPU limitations. This is not the case with our ES-optimization algorithm: ordering of 100 markers takes
0.21.5 sec on a Pentium 2-GHz computer. A further severalfold improvement in performance is expected by using our new optimizer based on guided evolution strategies (![]()
![]()
![]()
There is also a technical advantage of using the marker orders rather than marker map positions (centimorgans) as final mapping results. Our verification procedure based on jackknife and bootstrap techniques reveals neighborhoods of questionable local ordering and enables us to detect the "weak connections" in the marker chain. If such a local "weakness" was caused by low marker density, one can split the map into two linkage groups and/or attempt to add new markers to fill the gap. In the case of an excess of double recombinants, the detected questionable marker scores can be removed from the data (without having to delete the marker entirely) with a subsequent reanalysis of the map, as in similar options available in other mapping tools (e.g., MapMaker). This purifying operation may be sufficient to stabilize the resulting map, and it is reasonable if the questionable score derives from typing error (that can be tested by a repeated typing). However, there is some evidence that an excess of double recombinants may result from negative interference (![]()
![]()
![]()
We should recall another reason to deal with two map versions simultaneously, which is related to the linkage phase of dominant markers, i.e., coupling vs. repulsion. As shown above, higher precision of ordering coupling-phase dominant markers compared to repulsion-phase data justifies splitting the dominant marker data into two sets, each with the coupling phase only, and generating two maps for each linkage group (![]()
![]()
![]()
![]()
![]()
![]()
| ACKNOWLEDGMENTS |
|---|
The useful suggestions of G. Churchill and an anonymous reviewer are acknowledged with thanks. This study was supported by the Israeli Ministry of Absorption, the U.S. Agency for International Development Cooperative Development Research Program (grant TA-MOU-97-CA17-001), and the German-Israeli Cooperation Project [Deutsch-Israelische Projektkooperation project funded by the Bundesministerium für Bildung und Forschung (BMBF) and supported by BMBF's International Bureau at the Deutsch Zentrum Luft-und Raumfahrt].
Manuscript received July 23, 2003; Accepted for publication August 28, 2003.
| LITERATURE CITED |
|---|
BOYKO, E., R. KALENDAR, V. KORZUN, J. FELLERS, and A. KOROL et al., 2002 A high-density cytogenetic map of the Aegilops tauschii genome incorporating retrotransposons and defense-related genes: insights into cereal chromosome structure and function. Plant Mol. Biol. 48:767-790.[Medline]
BUETOW, K. N. and A. CHAKRAVARTI, 1987 Multipoint gene mapping using seriation. Am. J. Hum. Genet. 41:189-201.[Medline]
CURTIS, D. and H. GURLING, 1993 A procedure for combining two-point lod scores into a summary multipoint map. Hum. Hered. 43:173-185.[Medline]
EFRON, B., 1979 Bootstrap method: another look at the jackknife. Ann. Stat. 7:1-26.
ELLIS, T., 1997 Neighbour mapping as a method for ordering genetic markers. Genet. Res. 69:35-43.
ESCH, E. and W. E. WEBER, 2002 Investigation of crossover interference in barley (Hordeum vulgare L.) using the coefficient of coincidence. Theor. Appl. Genet. 104:786-796.[Medline]
FALK, C. T., 1992 Preliminary ordering of multiple linked loci using pairwise linkage data. Genet. Epidemiol. 9:367-375.[Medline]
FOGEL, D., 1992 Evolving artificial intelligence. Ph.D. Thesis, University of California, San Diego.
HOLLAND, J., 1975 Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, MA.
HOMBERGER, J. and H. GEHRING, 1999 Two evolutionary metaheuristics for vehicle routing problem with time windows. INFOR 37:297-318.
JANSEN, J., A. C. DE JONG, and J. W. VAN OOIJEN, 2001 Constructing dense genetic linkage maps. Theor. Appl. Genet. 102:1113-1122.
KNAPP, S. J., J. L. HOLLOWAY, W. C. BRIDGES, and B. H. LIU, 1995 Mapping dominant markers using F2 mating. Theor. Appl. Genet. 91:74-81.
KOROL, A. B., I. A. PREYGEL and S. I. PREYGEL, 1994 Recombination Variability and Evolution. Chapman & Hall, London.
LALOUEL, J. M., 1977 Linkage mapping from pair-wise recombination data. Heredity 38:61-77.[Medline]
LANDER, E. S. and P. GREEN, 1987 Construction of multilocus linkage maps in human. Proc. Natl. Acad. Sci. USA 84:2363-2367.





