## Abstract

Although most high-density linkage maps have been constructed from codominant markers such as single-nucleotide polymorphisms (SNPs) and microsatellites because of their high linkage information, dominant markers can be expected to become even more significant as proteomic techniques become widely applicable to generating protein polymorphism data from large samples. However, for dominant markers, two possible linkage phases between a pair of markers complicate the estimation of recombination fractions between markers and consequently the construction of linkage maps. The low linkage information of the repulsion phase and high linkage information of the coupling phase have led geneticists to construct two separate but related linkage maps. To circumvent this problem, we propose a new method for estimating the recombination fraction between markers, which greatly improves the accuracy of estimation by distinguishing between the coupling phase and the repulsion phase of the linked loci. The results obtained from both real and simulated F_{2} dominant marker data indicate that the recombination fractions estimated by the new method contain a large amount of linkage information for constructing a complete linkage map. In addition, the new method is also applicable to data with mixed types of markers (dominant and codominant) with unknown linkage phase.

Most high-density linkage maps have been constructed from codominant markers such as single-nucleotide polymorphisms (SNPs) and microsatellites because of their high linkage information, but linkage maps of dominant markers will become increasingly important because such markers are often related to biological functions and are increasingly available as proteomic techniques mature. Proteomic markers include position-shift locus (PSL), presence/absence spot (PAS), and protein quantitative locus (PQL), of which PAS and PQL are dominant markers (Thiellement *et al*. 1999; Zivy and de Vienne 2000; Consoli *et al*. 2002). An example of a linkage map constructed from mostly dominant markers is the *Escherichia coli* bacteriophage T7 protein linkage map (Bartel *et al*. 1996). High-density linkage maps in the future will more likely be constructed from both dominant and codominant markers since such maps can provide fine genetic locations of functional markers through high-density codominant markers flanking them. Therefore, accurate estimates of recombination fractions between dominant markers and between dominant and codominant markers are important.

Because of dominance, the genotype of an individual at a dominant marker is often ambiguous, which increases the complexity of analysis. An important issue in the estimation of the recombination fraction is how to deal efficiently with different linkage phases between a pair of dominant loci (Mester *et al*. 2003a). Two different linkage phases for a double heterozygote are well recognized. One is known as the repulsion phase, which corresponds to the situation in which the two dominant alleles reside on different chromosomes; otherwise, it is known as the coupling phase. In a two-point analysis that considers two markers at a time, the repulsion phase provides much less information about linkage than the coupling phase (Allard 1956; Knapp *et al*. 1995; Liu 1998; Mester *et al*. 2003a). This is especially true for double heterozygotes from the F_{2} population (Liu 1998). In reality, about half of the markers are in the coupling phase and the remaining markers are in the other coupling phase. The phase between two couplings is repulsion (Liu 1998; Mester *et al*. 2003a). This leads in practice to the construction of two separate partner linkage maps: one is called the paternal map, on which markers are derived from the paternal parent, and the other is called the maternal map, consisting of the maternal markers (Knapp *et al*. 1995; Peng *et al*. 2000; Mester *et al*. 2003a). To date, there is no effective way to integrate the partner maps into a single complete map. Mester *et al*. (2003a) attempted to use pairs of codominant and dominant markers to accomplish this task because such pairs of markers in the repulsion phase have higher linkage information than pairs of dominant markers in the coupling phase. However, this strategy is extremely demanding because it requires that every dominant marker be paired with a codominant marker.

The two-point analysis implemented by the expectation-maximization (EM) algorithm (Dempster *et al*. 1977; Lander and Green 1987; Ott 1991) is a powerful approach for estimating recombination fractions between codominant loci and between dominant loci in the coupling phase, but it has a poor resolution for dominant loci in the repulsion phase (see Liu 1998). This is because the two-point analysis cannot distinguish the coupling phase from the repulsion phase of dominant markers, which have rather different statistical properties. In addition to the need for treating coupling and repulsion phases separately, examining three loci at a time will lead to a better utilization of available linkage information. The problem is that not only the number of combinations of the three loci is large when the total number of loci is large, but also the complexity of the analysis increases due to the need to distinguish several types of double or triple heterozygotes. To circumvent these problems, we propose an alternative approach in this article. The new method considers three loci at a time. It first classifies phenotypes into four pairs of gamete genotypes, followed by estimating their frequencies from the sample that led to the identification of the linkage phase of the loci, then estimates recombination fractions between loci according to their linkage phase, and finally reduces the three-point estimates of the recombination fractions to two-point estimates. A key to this strategy is a fast method for estimating the frequencies of different gamete types because of the need to deal with a large number of loci combinations. We are able to develop very efficient estimators of these frequencies by taking advantage of the simplicity of their expectations. The estimates of recombination fractions obtained by this new method make it possible to integrate two separate partner linkage maps based on the EM estimates of recombination fractions into a single complete linkage map.

## METHODS

#### Estimating the frequencies of three-locus gametes:

Since the novel method to be described for estimating recombination fractions makes use of the frequencies of gametes defined by alleles from three loci, we start by presenting estimators of these frequencies. Two cases need to be considered separately. The first corresponds to the situation in which all three loci are dominant and thus is referred to as “dominant loci.” The second is that only one or two loci out of three are dominant and is referred to as “mixed loci.”

#### Dominant loci:

Consider three dominant loci each having two alleles. Let *A* and *a* be the two alleles for the first locus, *B* and *b* be those for the second, and *C* and *c* be those for the third. Uppercase letters denote dominant alleles and lowercase letters recessive alleles. A meiosis from a triple-heterozygote individual of the F_{1} population can produce eight different types of three-locus gametes: *ABC*, *ABc*, *Abc*, *AbC*, *aBC*, *abC*, *aBc*, and *abc*, where *ABC* and *abc*, *Abc* and *aBC*, *abC* and *ABc*, and *AbC* and *aBc* are, respectively, sister gametes. These sister gametes are expected to have equal frequency under the assumption of no segregation distortion during meiosis. In practice, a chi-square test can be used to remove loci that exhibit significant segregation distortion. These gametes can be grouped into four pairs of nonsister gametes. Define an F_{2} population: It follows that . The individuals of the F_{2} population can be classified into four categories. Category *i* (*i* = 0, . . ., 3) consists of individuals with exactly *i* loci possessing a dominant allele. To estimate gamete frequencies, it is necessary to consider the frequency of each category. Let represent the phenotype in which only locus *c* exhibits a dominant phenotype. Therefore represents the group of individuals from category 1 whose locus *c* has a dominant allele(s). It is obvious that there are three genotypes in category 1 and can be further dissected into . Phenotypes and are also dissected in a similar fashion.

There are also three phenotypes in category 2, each of which can be dissected into five pairs of sister gametes. For instance, the phenotype can be dissected intoNote that the phenotype for category 3 is not very informative since the single phenotype corresponds to too many genotypes. Therefore frequencies for category 3 are not used.

Let , , , , , , and be the expected frequencies of phenotypes , , , , , , and in the F_{2} population, respectively. Then(1)and(2)Letting , Equation 2 may be rewritten as(3)Moment estimates of can be obtained from the above sets of equations by replacing by their moment estimates, which are simply their observed frequencies in the sample. Theoretically Equation 1 is sufficient for deriving solutions for *q*'s. However, Equation 3 can be used to further minimize the stochastic effect in the observed frequencies. Specifically, , , and can be estimated as(4)where (see appendix a). It follows that , , and can alternatively be estimated from the observed frequencies of , , , and . We can combine the two sets of estimates of , , and to obtain a more stable set of estimates as(5)where and are weights of and , respectively, where *k* = 2, 3, 4. is the estimate of . Our simulation study showed that usually gives the best result for the estimation of . When the sample is small, it is possible that or . In such a case, one can set and for , or and for and .
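The moment estimates mentioned above start from the observed phenotype frequencies in the sample. The following Python sketch shows only that counting step for three dominant loci, grouped by category (the number of loci showing the dominant phenotype). The 0/1 encoding and the names `phenotype_frequencies` and `by_category` are illustrative assumptions, not part of the original method description, and the estimators in Equations 1 to 7 are not reproduced here.

```python
from collections import Counter
from itertools import product


def phenotype_frequencies(individuals):
    """Observed frequencies of the eight three-locus dominant phenotypes.

    Each individual is a tuple of three flags, one per locus
    (hypothetical encoding): 1 = dominant phenotype, 0 = recessive.
    """
    n = len(individuals)
    counts = Counter(tuple(ind) for ind in individuals)
    # Report unobserved phenotypes explicitly with frequency 0.
    return {ph: counts.get(ph, 0) / n for ph in product((0, 1), repeat=3)}


def by_category(freqs):
    """Group phenotype frequencies by category i = number of dominant loci."""
    groups = {i: {} for i in range(4)}
    for ph, f in freqs.items():
        groups[sum(ph)][ph] = f
    return groups


# Toy sample of six F2 individuals (made-up data for illustration only).
sample = [(0, 0, 0), (1, 1, 1), (1, 1, 1), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
freqs = phenotype_frequencies(sample)
groups = by_category(freqs)
```

Categories 1 and 2 each contain three phenotypes, and category 3 contains the single all-dominant phenotype that the text treats as uninformative.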

Since , therefore can be expressed as(6a)Similarly we have(6b)(6c) and are estimated by and , so is estimated by(7a)Similarly(7b)(7c)is estimated by(7d)

#### Mixed loci:

Two configurations in the case of the mixed loci need to be considered. The first is two codominant loci and one dominant locus (2C1D), and the second is one codominant locus and two dominant loci (1C2D) (see Figure 1). For a codominant locus, “0” and “1” represent the two parental types of homozygotes and “2” represents the heterozygote, while for the dominant locus, “*A*” and “*a*” represent a dominant phenotype and a recessive phenotype, respectively. Without loss of generality, we assume in the following discussion that the order of loci in the case of 2C1D is DCC. The 12 phenotypes are informative for linkage analysis, which are , , , , , , , , , , , and , while phenotypes *A*20, *A*21, and *A*02 and *A*12, *a*22, and *A*22 are much less informative because they are double (or potentially) and triple (or potentially) heterozygotes. In the population, similar to phenotype in dominant loci, phenotypes , , , and are homozygous and have the expected frequencies , , , and , respectively, and , , , and are similar to in dominant loci and have the expected frequencies , , , and , respectively. The frequencies of , , , and are expected to have , , , and , respectively. Thus, for any nonsister gamete type, there are three ways to estimate these gamete frequencies. For example, can be estimated by the following three equations: (8a) (8b) (8c) A simple single estimate can be obtained by taking the average of the three. The approach is also used for other gametes, resulting in the estimates (9a) (9b) (9c) (9d) where , , , , , , , , , , , and are estimates of , , , , , , , , , , , and , respectively.

Similarly, we can obtain estimates of the frequencies of these four types of nonsister gametes in 1C2D from(10a)(10b)(10c)(10d)where , , , , , and are the estimated frequencies of phenotypes , , , , , , and , respectively.

#### Three-point estimates of recombination fractions between loci:

Recombination fractions between loci can be estimated from *q*'s. Since *q*'s are estimated separately, their sum does not always satisfy the equation . Therefore, before estimating the recombination fraction, we obtain normalized estimates of *q*'s asIt is obvious that three loci are viewed to be independent if the null hypothesis holds at the significance level of 0.05, two loci are believed to be linked with each other, and the rest is independent if two of four types of nonsister gametes have equal estimated frequencies at the 0.05 significance level.

For linked loci, the frequencies of the four pairs of nonsister gametes can be used to distinguish the coupling phase from the repulsion phase between loci and consequently lead to proper estimates of the recombination fraction between loci according to whether they are in the coupling phase or in the repulsion phase. For example, suppose the order of the three loci is *a*–*b*–*c*. Then if is the smallest and is the largest, each pair of the three loci is in the coupling phase, and if is the largest and is the smallest, then loci *a* and *c* are in the coupling phase but loci *a* and *b* and loci *b* and *c* are in the repulsion phase. On the other hand, if is the largest and is the smallest, then loci *a* and *b* are in coupling phase but loci *a* and *c* and loci *b* and *c* are in repulsion phase. Similarly if is the smallest and is the largest, then loci *b* and *c* are in coupling phase but loci *a* and *b* and loci *a* and *c* are in repulsion.
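The phase rules above follow the usual three-point logic: the most frequent gamete pair is the parental type and the rarest is the double-crossover type. The Python sketch below applies that logic under stated assumptions. Each sister-gamete pair is keyed by its member that carries the dominant allele at the first locus, so `(1, 1, 1)` stands for *ABC*/*abc* and `(1, 0, 1)` for *AbC*/*aBc*. The function name `order_and_phase` and the frequencies are hypothetical.

```python
from itertools import combinations


def order_and_phase(q):
    """Infer the middle locus and pairwise linkage phases from four
    gamete-pair frequencies.

    q maps a key such as (1, 0, 1) (= AbC/aBc) to its estimated frequency.
    Keys are assumed to carry 1 at the first locus.
    """
    parental = max(q, key=q.get)    # most frequent pair: parental type
    double_xo = min(q, key=q.get)   # rarest pair: double crossover
    differs = [p != d for p, d in zip(parental, double_xo)]
    # A double crossover flips the middle locus relative to the two flanking
    # loci. Because keys are fixed at the first locus, this shows up either
    # as one differing position (the middle) or as two differing positions
    # (then the unchanged first locus is the middle).
    middle = differs.index(True) if sum(differs) == 1 else differs.index(False)
    phase = {
        (i, j): "coupling" if parental[i] == parental[j] else "repulsion"
        for i, j in combinations(range(3), 2)
    }
    return middle, phase


# Made-up frequencies: AbC/aBc is largest, ABC/abc is smallest.
q_example = {(1, 1, 1): 0.01, (1, 0, 0): 0.05, (1, 1, 0): 0.04, (1, 0, 1): 0.40}
middle, phase = order_and_phase(q_example)
```

With these illustrative values the middle locus is the second one (order *a*–*b*–*c*), loci *a* and *c* are in coupling, and the other two pairs are in repulsion, which matches the second case described in the text.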

In the coupling phase is the frequency of double crossover in the F_{2} progeny. Thus, the recombination fractions between *a* and *b*, between *b* and *c*, and between *a* and *c* can be estimated by Equation 11. Estimates of the recombination fractions between loci in the other orders in the coupling phase are also obtained in a similar manner.

In the repulsion phase, the order (*a–b–c*) leads to due to double crossover, and thus the recombination fractions between *a* and *b*, between *b* and *c*, and between *a* and *c* are estimated by Equation 12. The recombination fractions between three loci in the other orders in the repulsion phase can be estimated in a similar fashion.

#### Reduction of the three-point estimates of recombination fractions to the two-point estimates:

If *n* loci on a chromosome are genotyped in the mapping study, there are combinations of three loci, each of which results in three estimates of the recombination fraction. Therefore a total of recombination fractions are being estimated. When *n* is large, it will be difficult to compare all these combinations for building a linkage map of *n* loci even on a modern computer. Moreover, the recombination fractions contain coupling and repulsion linkage information. To avoid these complex comparisons, it is necessary to reduce the three-point estimates to two-point estimates. Although loci *i* and *j* would be configured with other loci to form three-point combinations, the linkage phase between loci *i* and *j* has already been fixed regardless of the other locus. Estimates of the recombination fraction between loci *i* and *j* may vary slightly with the other loci due to their respective different double-exchange frequencies and sampling error; hence, it needs to be adjusted with other loci. For convenience, let the estimate of recombination fraction between loci *i* and *j* in a three-point combination (*i*, *j*, *k*) be referred to as a three-point estimate and denoted by , where *k* is called a reference locus and . Thus, for *n* loci on a chromosome or a fragment, recombination fractions between loci *i* and *j* have three-point estimates. The order of loci *i*, *j*, and *k* in has been determined previously; that is, contains the order information of these three loci according to Equations 11 and 12. On the other hand, there are estimates of the recombination fraction between loci *i* and *j*. These estimates fluctuate with sampling errors and different double-exchange values, which depends upon the distances of locus *i* or/and locus *j* from locus *k*. 
Three cases for the variation of double-exchange values with respect to the estimate of the recombination fraction between loci *i* and *j* are considered: (1) loci *i* and *j* are adjacent loci, and all reference loci are outside interval *i–j*; (2) loci *i* and *j* are two terminal loci on a chromosome or a fragment, and all reference loci are within interval *i–j*; and (3) loci *i* and *j* are nonadjacent loci and the reference loci are either within or outside interval *i–j*. In the first case, the double exchanges involving all reference loci are detected and measured but differ from one reference locus to another. In the second case, the double exchanges involving reference loci do not contribute to the recombination fraction between loci *i* and *j*. There is only one type in this case: loci *i* and *j* are two terminal loci, but the estimates still differ between reference loci because the double-exchange frequency depends on the reference locus; for example, a reference locus near locus *i* or *j* has a lower double-exchange frequency than a reference locus farther from loci *i* and *j*. In other words, the former loses fewer double exchanges than the latter. Therefore, the former has a larger estimate than the latter. The third case is in between the first and second cases, as is seen in the next section. Thus, the recombination fraction between loci *i* and *j* is estimated by an average estimate over reference loci: (13) It is obvious that this average contains not only information about the linkage phase but also the average double-exchange frequency over all reference loci and, in addition, balances sampling errors. Therefore, the averaged estimate is closer to its true value than that obtained by using an EM algorithm.
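The reduction step can be sketched in Python as follows, assuming that the "average estimate over reference loci" is a plain arithmetic mean (the exact form of Equation 13 is not reproduced here, so any weighting would be an extension). The names `reduce_to_two_point` and `estimate_counts` and the numeric values are hypothetical.

```python
from math import comb
from statistics import mean


def reduce_to_two_point(three_point):
    """Average three-point estimates over reference loci.

    three_point maps (i, j, k) to the estimate of the recombination
    fraction between loci i and j obtained with reference locus k.
    """
    grouped = {}
    for (i, j, _k), r in three_point.items():
        pair = (min(i, j), max(i, j))
        grouped.setdefault(pair, []).append(r)
    return {pair: mean(values) for pair, values in grouped.items()}


def estimate_counts(n):
    """(number of triplets, total three-point estimates, estimates per pair)."""
    return comb(n, 3), 3 * comb(n, 3), n - 2


# Made-up three-point estimates for a four-locus fragment.
three_point = {
    (1, 2, 3): 0.10, (1, 2, 4): 0.12,
    (1, 3, 2): 0.21, (1, 3, 4): 0.19,
}
two_point = reduce_to_two_point(three_point)
```

For *n* = 6 loci, `estimate_counts(6)` gives 20 triplets, 60 three-point estimates, and 4 estimates per pair of loci, which is the bookkeeping described at the start of this section.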

## AN EXAMPLE

As an example to illustrate the construction of linkage maps by MAPMAKER/EXP (version 3.0b), Lander *et al.* (1987) provided an RFLP data set of 333 F_{2} mice. Since RFLP markers are codominant, A, H, and B are used in the data set for each locus to denote homozygotes of type A, heterozygotes (type H), and homozygotes of type B, respectively. To evaluate our new method, we converted these codominant marker data into dominant marker data by changing A to H and applied our new method to the dominant marker data set of the first six markers in the unknown linkage phase. Table 1 provides the estimates of the four pairs of nonsister gametes in the three-point combinations in the sample of 333 F_{2} individuals. It is clear that the frequencies of the four pairs of nonsister gametes containing both loci 4 and 6 all fit the ratios of 1:1:1:1 very well, which indicates that loci 4 and 6 are independent of each other and unlinked to the other four loci. Thus, these two loci are excluded. By using Equations 11 and 12, we obtained estimates of the recombination fractions in three-point combinations (123), (125), (135), and (235). The procedure is as follows: the first step is to determine the linkage order of three loci in a combination; for example, for combination (123), indicates that is the parental type and is the type due to double exchange. Those remaining are recombinants where and , respectively, represent recessive and dominant alleles in locus *i* (*i* = 1, 2, 3) in a combination. These three loci have the linkage order of 1*–*3*–*2. The second step is to determine the linkage phase: since gamete is recessive at all three loci and has the largest frequency among these four types of nonsister gametes, we can determine that loci 1, 2, and 3 are in the coupling phase. 
The third step is to estimate recombination fractions in combination (123) by applying Equation 11 for the case of the coupling phase to the data in Table 1; that is,Similarly, we also obtained estimates of the recombination fractions in combinations (125), (135), and (235) (see Table 2).
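The codominant-to-dominant conversion used at the start of this example (recoding A as H, so that only B remains distinguishable) can be sketched in Python as below. The function name `to_dominant` and the use of `-` for a missing score are illustrative assumptions, not details taken from the original data set.

```python
def to_dominant(scores):
    """Recode codominant scores so that A and H become indistinguishable.

    'A' (homozygote of type A) and 'H' (heterozygote) both become 'H';
    'B' and any other symbol (for example a missing-data code) are kept.
    """
    return ["H" if s in ("A", "H") else s for s in scores]


# One hypothetical locus scored on six individuals.
codominant = ["A", "H", "B", "H", "-", "A"]
dominant = to_dominant(codominant)
```

After recoding, each locus carries only two observable classes, which is what makes the linkage phase ambiguous for double heterozygotes.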

Finally, the three-point estimates of the recombination fractions were incorporated into two-point estimates by applying Equation 13 to the data in Table 2:On the basis of the two-point estimates of recombination fractions, the best linkage map for these four loci under study was found to be 1*–*3*–*2*–*5, using a novel approach called the unidirectional growth method (Tan and Fu 2006), where loci 1, 2, 3, and 5 correspond to markers T175, T93, C35, and C66, respectively, in the original data set. The same linkage map (see Figure 2A) was obtained when only some of the markers were converted to dominant markers and is also the same linkage map that was obtained by MAPMAKER (at LOD = 3.0) in the original data. However, when all markers are converted to the dominant type, MAPMAKER yielded a linkage map 1*–*3*–*2*–*5*–*6*–*4 (at LOD = 3.0) where locus 6 corresponding to marker T209 was linked to locus 5 (C66) at map distance 30.3 cM and locus 4 corresponding to T24 was linked to locus T209 at map distance 14.9 cM (see Figure 2B). These observations indicate that the new method leads to a better estimate of recombination than the maximum-likelihood method between dominant markers in the case of unknown phase in F_{2} progeny.

## SIMULATION STUDY

Since real data are not the best for fully evaluating a method because of unknown recombination fractions between loci, we used a computer simulation to generate data so that estimates of the recombination fraction can be compared to their true values. In addition to the new method, we also implemented the EM algorithm (see Liu 1998 for a detailed description of the process). To avoid potential unknown bias of a map-making method, we implemented the exhaustive search method to make maps (Liu 1998). Since the exhaustive search is extremely time consuming (Mester *et al*. 2003b), we examined only two short linkage maps, composed of 6 and 11 dominant loci, respectively. Five map distances 10, 15, 20, 25, and 30 cM (1 cM = 1%) were randomly assigned to each adjacent interval. This setting makes it more difficult to estimate recombination fractions than in the case of a single fixed distance for all adjacent loci.

We took two cases of linkage phases into account in the simulation: (1) coupling phase (CP), in which allelic status 1 at all loci is assigned to one parental (P_{1}) chromosome and allelic status 0 at all loci to the other parental (P_{2}) chromosome; and (2) unknown phase (UP), in which allelic status 1 or 0 at each locus is allocated at random to each of the two parental chromosomes with equal probability. We used the point process crossover model (Foss *et al*. 1993; McPeek and Speed 1995) to generate recombinants. In each of the F_{1} meioses, recombination events occur at random between two adjacent loci. We considered both crossover independence and complete crossover interference (but in separate simulations). For complete crossover interference, we assumed that a crossover cannot occur within an interval between two nonsister chromatids when there is already a crossover within the adjacent interval between the same two nonsister chromatids, provided that the sum of distances over the two adjacent intervals is ≤40 cM.
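A simplified version of this simulation can be sketched in Python. The sketch covers only the crossover-independence case, with an independent crossover draw in each interval rather than the point process model cited above, and it converts map distance to recombination fraction with 1 cM = 1%. The function name `simulate_f2` and its parameters are hypothetical.

```python
import random


def simulate_f2(distances_cm, n_individuals, unknown_phase=True, seed=None):
    """Simulate dominant-marker phenotypes (1 = dominant, 0 = recessive)
    for F2 individuals, assuming independent crossovers per interval."""
    rng = random.Random(seed)
    n_loci = len(distances_cm) + 1
    r = [d / 100.0 for d in distances_cm]  # 1 cM = 1% recombination
    # P1 carries allele 1 everywhere under CP, or a random 0/1 at each locus
    # under UP; P2 carries the complementary allele, so the F1 is
    # heterozygous at every locus.
    p1 = [rng.randint(0, 1) if unknown_phase else 1 for _ in range(n_loci)]
    p2 = [1 - a for a in p1]
    chromosomes = (p1, p2)

    def gamete():
        strand = rng.randint(0, 1)
        alleles = [chromosomes[strand][0]]
        for k in range(1, n_loci):
            if rng.random() < r[k - 1]:
                strand = 1 - strand  # crossover in interval k-1
            alleles.append(chromosomes[strand][k])
        return alleles

    # An F2 individual shows the dominant phenotype at a locus when at
    # least one of its two gametes carries allele 1 there.
    return [
        [max(a, b) for a, b in zip(gamete(), gamete())]
        for _ in range(n_individuals)
    ]


data = simulate_f2([10, 20, 30], n_individuals=4000, seed=1)
```

Each locus in `data` should show the dominant phenotype in roughly three quarters of individuals, in line with the 3:1 expectation for an F_{2} population.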

The expected ratio of alleles 1 and 0 for each locus is 3:1 among F_{2} individuals. The simulations were carried out with sample sizes *N* = 100, 200, and 300 F_{2} individuals, and loci that exhibited significant segregation distortion as revealed by chi-square test were removed. For each parameter set, 500 replicates were generated. Two criteria were used to evaluate these methods. One is the bias of the estimates of recombination fractions between two adjacent loci, which is defined as the average squared distance of the estimate to its true value, and the other is the accuracy of a method in recovering the true linkage map of given loci.
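The two bookkeeping steps in this paragraph can be sketched in Python: a chi-square goodness-of-fit test of the 3:1 ratio at one locus, and the bias criterion as defined above (average squared distance from the true value). The 0.05 significance level, and hence the critical value 3.841 for 1 d.f., is an assumption, since the level used in the simulations is not stated. All function names are hypothetical.

```python
CHI2_CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 d.f., alpha = 0.05


def segregation_chi2(n_dominant, n_recessive):
    """Chi-square statistic for a 3:1 dominant:recessive ratio."""
    n = n_dominant + n_recessive
    expected_dom, expected_rec = 0.75 * n, 0.25 * n
    return ((n_dominant - expected_dom) ** 2 / expected_dom
            + (n_recessive - expected_rec) ** 2 / expected_rec)


def is_distorted(n_dominant, n_recessive, critical=CHI2_CRITICAL_1DF_05):
    """True if the locus departs significantly from 3:1 (assumed alpha)."""
    return segregation_chi2(n_dominant, n_recessive) > critical


def bias(estimates, true_value):
    """Average squared distance of the estimates from the true value."""
    return sum((e - true_value) ** 2 for e in estimates) / len(estimates)
```

For example, 60 dominant and 40 recessive individuals give a statistic of 12, so that locus would be removed, whereas 75 and 25 give 0.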

Table 3 shows the biases of estimates in the case of UP obtained by the two methods. In all cases, the new method has a much smaller bias than the EM algorithm, which is a good indication that the new method is a better approach. However, the ultimate measure of the usefulness of a method for estimating recombination fractions is whether it leads to more accurate linkage map estimation. Table 4 summarizes the results of linkage map estimation by applying the exhaustive search method to the estimated recombination fraction data obtained by using both the EM algorithm and the new method. It can be seen from Table 4 that both the EM and the new estimators have a very high accuracy in the case of CP even in a relatively small sample of 100 F_{2} individuals. However, the new estimator has a much higher accuracy than the EM estimator in the case of UP, as expected. Furthermore, the new method improves its accuracy rapidly with sample size. It has an accuracy of 50.5% with a sample size of 100 F_{2} individuals and 85.1% with a sample size of 300 F_{2} individuals. The accuracy of both estimators decreases as the number of dominant loci increases. Table 5 shows the results of accuracy under the assumption of crossover interference. As expected, both methods have poorer performance than under the assumption of no crossover interference. Although complete crossover interference in general likely occurs only between two very small adjacent intervals, the results in Table 5 suggest that crossover interference has in general a negative impact on the estimate of the recombination fraction.

## DISCUSSION

We showed in this article, using both real and simulated data, that the widely used EM algorithm for estimating the recombination fraction between a pair of loci performs poorly for dominant markers because it fails to distinguish the coupling phase from the repulsion phase. We also found (results not shown) that, similar to the results shown in Tables 4 and 5, MAPMAKER/EXP performed poorly (<10% accuracy) for dominant markers in the unknown linkage phase, regardless of whether a two-point or a three-point approach was used to estimate recombination fractions. The excellent performance of our new method may be due to several factors: (a) improved accuracy of the estimates of the gamete frequencies, (b) three-point analysis in which coupling and repulsion phases of loci are effectively distinguished, and (c) reduction of three-point estimates to two-point estimates resulting in more stable estimates of the recombination fractions.

Although the new method appears to have a shortcoming in that good accuracy of recovering true linkage maps using its estimates requires a reasonably large sample size, it does provide a promising approach that can lead to a better estimation of linkage maps from either dominant loci or mixed loci when the sample size is ∼300 F_{2} individuals. One likely application of the new method is to supplement the EM method. More specifically, one can apply both methods to the same data set and obtain two sets of estimates of recombination fractions. The EM estimates are used to build two partner linkage maps in which all linked loci are in the coupling phase. The new method's estimates can be used to integrate these two partner linkage maps into a single linkage map.

This study also indicates that examination of three loci at a time does provide additional information for estimating both recombination fractions and linkage maps. Since there are on the order of *n*^{3} combinations of three loci, any approach that analyzes three loci at a time will be demanding computationally, particularly when the number of loci is large. It will be practical only when the speed of analyzing each combination of the three loci is sufficiently fast. The new method is practical even for a large number of loci since the amount of computation for each triplet of loci is minimal.

## APPENDIX A

Since , an alternative expression of is(A1)Similarly, we have(A2)(A3)It follows that(A4)and(A5)Equations A4 and A5 lead to the solution for as(A6)

## Acknowledgments

We thank the High Performance Computer Center of Yunnan University for computational support and Sara Barton for editorial assistance. This research was supported by National Institutes of Health grant R01 GM50428 (to Y.-X. F.) and by funds from Yunnan University and a 973 project (2003CB415102).

## Footnotes

Communicating editor: N. Takahata

- Received July 27, 2006.
- Accepted October 9, 2006.

- Copyright © 2007 by the Genetics Society of America