## Abstract

A new statistical approach for construction of the genetic linkage map and estimation of the parental linkage phase based on allele frequency data from pooled gametic (sperm or egg) samples is introduced. This method can be applied for estimation of recombination fractions (over distances <1 cM) and ordering of large numbers (even hundreds) of closely linked markers. This method should be extremely useful in species with a long generation interval and a large genome size such as in dairy cattle or in forest trees; the conifer species have haploid tissues available in megagametophytes. According to Mendelian expectation, two parental alleles should occur in gametes in 1:1 proportions, if segregation distortion does not occur. However, due to mere sampling variation, the observed proportions may deviate from their expected value in practice. These deviations and their dependence along the chromosome can provide information on the parental linkage phase and on the genetic linkage map. Usefulness of the method is illustrated with simulations. The role of segregation distortion as a source of these deviations is also discussed. The software implementing this method is freely available for research purposes from the authors.

ESTIMATION of recombination fraction over distances <1 cM is important because current genetic maps are very inaccurate in such distances (Kong *et al.* 2002; Weber 2002; Arnheim *et al.* 2003) and because of current interest toward characterization and utilization of haplotype-block structure of the human genome (Arnheim *et al.* 2003; International HapMap Consortium 2003). For an excellent general review of characterization of recombination in small distances, see Arnheim *et al.* (2003).

To estimate parental linkage phase and short map distances in a range <1 cM one would ideally require data from several gametes (haploid tissues). In species with a short generation cycle it may be possible to produce “map expansion” (an excess of recombinants) by applying a number of consecutive intercrosses (Darvasi 1998). Several haplotyping methods for pedigree data (Wijsman 1987; Kruglyak *et al.* 1995,1996; Sobel and Lange 1996; Sobel *et al.* 1996; Lin and Speed 1997; Tapadar *et al.* 2000; Qian and Beckmann 2002; Gao *et al.* 2004) and for general population samples (Clark 1990; Excoffier and Slatkin 1995; Hawley and Kidd 1995; Long *et al.* 1995; Stephens *et al.* 2001; Kitamura *et al.* 2002; Stephens and Scheet 2005) have been introduced during the last two decades. Similarly, there is a rich literature of methods for construction of the genetic linkage map (Lander and Green 1987; Stephens and Smith 1993; George *et al.* 1999; Jansen *et al.* 2001; Butcher *et al.* 2002; Rosa *et al.* 2002; Wu *et al.* 2002; Lu *et al.* 2004; George 2005). In species such as forest trees or dairy cattle, where the generation cycle is long and the genome size is large, all map distances and ordering of markers are generally difficult to estimate. However, certain types of haploid tissues such as sperm, eggs, or megagametophytes may be easily available in such species. Use of individual sperm typing (Lazzeroni *et al.* 1994; Navidi and Arnheim 1994, 1999; Cullen *et al.* 2002; Arnheim *et al.* 2003) and utilization of DNA in megagametophytes (Tulsieram *et al.* 1992; Yazdani *et al.* 1995) for the estimation of the genetic map (order and distances) have been proposed in the literature earlier. However, extensive individual typing of haploid tissues is not cost effective. Note that application of radiation-hybrid mapping has also been proposed for a similar purpose (Boehnke *et al.* 1991; Slonim *et al.* 1997).

Use of the pooled DNA data reduces the cost and time spent on typing by allowing for direct determination of allele frequencies at each locus (Shaw *et al.* 1998; Collins *et al.* 2000; Ritland 2002; Sham *et al.* 2002; Butcher *et al.* 2004; Norton *et al.* 2004). Pooled DNA data on diploid tissues can also be used to estimate haplotype frequencies (Ito *et al.* 2003; Yang *et al.* 2003). We assume here that allele frequency measurements from pooled DNA can be obtained with a high accuracy in the lab (Norton *et al.* 2002; Sham *et al.* 2002; Butcher *et al.* 2004; Yang *et al.* 2005). If haploid tissues from a single parent are used in the pool and the possibility of segregation distortion is excluded, the average proportions of the two parental alleles are even at each locus. However, the observed proportions often deviate from expected proportions and are correlated between the loci (due to close linkage of the loci), which can provide information of the linkage phase (haplotypes) of the parent. The information in these fluctuations is stochastic in nature and can vary somewhat from sample to sample. Note that these fluctuations, when measured from large samples, can be used to detect segregation distortion (McPeek 1999; and see discussion). When there is no significant evidence of segregation distortion, the asymptotic properties of these fluctuations (and their dependence) as a source of information in estimating linkage phase of the parent at short map distances are described in this article.

Hidden Markov models (HMMs) provide a flexible modeling framework for several types of problems (Rabiner 1989), including the construction of the linkage map (Lander and Green 1987). The Viterbi algorithm is a maximization algorithm that can be used in HMMs to determine the optimal path (having the highest likelihood value) through the hidden state structure. The same algorithm can also be used to sample directly from the posterior distribution. In the following, we present our haplotyping method that is based on the Viterbi algorithm. Note that also the well-known Genehunter program uses the Viterbi algorithm for haplotyping (Kruglyak *et al.* 1995, 1996).

## MODEL

We first describe the model for individual gametic observations and then introduce the required modifications for that under pooled gametic observations.

#### Individual gametic observations:

Assume that the sample consists of *n* gametes , which have been collected from a diploid individual (the parent). The parent is assumed to be a full heterozygote in a given set of *L* polymorphic marker loci so that at each locus there are two segregating alleles (denoted as 0 and 1) that can be distinguished in the sample. In practice, we omit the loci in our sample, where only a single allele can be identified (*i.e.*, the parent is a homozygote). Each gamete *i* can then be represented as a vector .

Denote by *X _{l}* the grandparental origin of the 0-allele at locus

*l*, with the convention that when

*X*= 0 (

_{l}*X*= 1) the origin of the 0-allele is grandpaternal (grandmaternal), respectively. Note that the assignment of the vector uniquely determines the parental linkage phase.

_{l}*A priori*, are independent random variables, , with respective Bernoulli(π

_{l}) distributions. We assume that both grandparental origins are

*a priori*equally likely at each locus;

*i.e.*, corresponding to linkage equilibrium assumption for each locus

*l*(for potential problems, see Schaid

*et al.*2002; Huang

*et al.*2004.) However, in the case that we have some prior information about parental linkage phase,

*e.g.*, genotype information from the grandparents, or knowledge about linkage disequilibrium among the loci, we could set or, in the latter case, assume

*a priori*a Markov model for (see McPeek and Strahs 1999).

Denote the vector of recombination fractions by , where an element θ_{l} is the recombination fraction between two consecutive loci (*l* − 1) and *l*. The likelihood function is the following:At the first locus above, gametes , are independent Bernoulli(1/2) variables. Then, given *X _{l}*

_{−1},

*X*, and , gametes are conditionally independent with the following distribution: If

_{l}*X*=

_{l}*X*

_{l}_{−1}(

*i.e.*, the grandparental origin of the 0-allele remains the same) thenand if

*X*≠

_{l}*X*

_{l}_{−1}(

*i.e.*, the grandparental origins of the 0-allele differ) thenThis formulation corresponds to the standard transmission model (

*e.g.*, Sillanpää and Arjas 1999; see their Equations 4 and 5). Here, although the phases (

*X*) are

_{l}*a priori*independent, the vectors , form a Markov process, which gives a HMM with hidden layer (

*X*). Although the parameterization is different, this is the same model assumed in Lander and Green (1987) and in Kruglyak

_{l}*et al.*(1995, 1996), where a bit in their inheritance vector includes information from both the grandparental origin (

*X*) and the allele in the gamete (

_{l}*h*). Given the individual gametic observations and the recombination fractions between the loci, it is possible to apply a variant of the Viterbi algorithm to compute the posterior distribution of the parental linkage phase

_{l}^{i}*p*(

*X*|

*h*

^{1}, … ,

*h*, θ), which is proportional to

^{n}*p*(

*h*

^{1}, … ,

*h*|

^{n}*X*, θ) ×

*p*(

*X*). It is also possible to implement the EM algorithm (Dempster

*et al.*1977) to compute the maximum-likelihood estimator of the recombination fractions (θ).

#### Pooled gametic observations:

Next we consider the case, where instead of observing the individual gametes (*h _{l}^{i}*) we observe their (allele) frequencies from a gametic pool or equivalently , the number of copies of the 1-allele at locus

*l*. The likelihood function for pooled observations is the following:At the first locus above, the frequency

*N*

_{1}is independent from phase vector

*X*and it has the Binomial(

*n*, ) distribution. For loci

*l*> 1, given the phase information

*X*

_{l}_{−1},

*X*, and the frequency

_{l}*N*

_{l}_{−1}at the previous locus,

*N*has the following conditional distribution (given by convolution of two binomials): If

_{l}*X*=

_{l}*X*

_{l}_{−1}(

*i.e.*, the grandparental origin of the 0-allele remains the same) thenand if

*X*≠

_{l}*X*

_{l}_{−1}(

*i.e.*, the grandparental origins of the 0-allele differ) then

Alternatively we can use the following notation:where * denotes convolution. These convolution distributions can be computed efficiently, using discrete Fourier transform (see, *e.g.*, Bremaud 2002).

#### Multiple pools:

We can also consider the case where we observe different frequencies from *n*_{pools} distinct pools, each containing *n*^{(k)} gametes, . Accordingly, the pool-specific frequency vectors , are conditionally independent given the phase vector *X*, starting with initial distribution *N*_{1}^{(k)} ∼ Binomial(*n*^{(k)}, ) and then following independently the Markovian dynamics described above. To maintain simplicity of the notation in the rest of this article, the theory is presented mainly for a single pool but it generalizes to multiple pools analogously.

#### Estimation of linkage phase:

It may be somewhat surprising, but becomes clear from the formula above, that knowledge of the frequencies *N _{l}*

_{−1}and

*N*gives some information about the change |

_{l}*X*−

_{l}*X*

_{l}_{−1}| in the grandparental origin. The amount of such information is quantified by the total variation distance between the two distributions above; clearly it depends on

*N*

_{l}_{−1},

*n*, and θ

_{l}and is random in this sense. This information is zero when θ

_{l}= , and it increases as the recombination fraction proceeds to 0. Also, it is zero when

*N*

_{l}_{−1}=

*n*/2 and increases as

*N*

_{l}_{−1}goes to 0 or to

*n*.

Again, the pairs form a hidden Markov chain (Figure 1), and it is possible to apply HMM techniques and the EM algorithm to estimate parental linkage phase (*X*) and recombination fractions (θ), respectively. This practice is similar to that in the full data case. The Viterbi algorithm is described in detail in appendix a. We show that in the case where the marker spacing is dense and several distinct pools are used, the information content of the pooled data about the parental linkage phase *X* is not much lower than the information content carried by the full data.

#### Normal approximation and information about linkage phase:

In the following, we present a normal approximation for transitions between allele frequency measurements at different loci, which may be used to speed up the numerical computations, when the sizes of the gametic pools are large enough so that the binomial distribution can be approximated with the normal distribution. We also compute the *information in the data* and explain how to calculate the *mutual information* between linkage phase and the observed allele frequency data under the approximate model. Letbe the frequency of the allele inherited from the grandfather at locus *l*, where *n* is the size of the gametic pool. The conditional distribution of given is a convolution of two binomials with mean and variance *n*(1 − θ_{l})θ_{l}. When *n* is large enough, we approximate the scaled process by a stationary Gaussian process with transition densityand invariant distribution .

Analogously the conditional distribution of the frequency *N _{l}* given

*N*

_{l}_{−1}and |Δ

*X*| = |

_{l}*X*−

_{l}*X*

_{l}_{−1}| is approximated by a normal distribution with mean

*m*(θ) =

*N*

_{l}_{−1}+ (

*n*− 2

*N*

_{l}_{−1})

*r*(|Δ

*X*|, θ

_{l}_{l}) and variance

*n*(1 − θ

_{l})θ

_{l}, where we denote

*r*(0, θ) = θ and

*r*(1, θ) = 1 − θ.

After rescaling by *n*^{−1/2}, asymptotically we are in a conditionally Gaussian shift experiment (see van der Vaart 1998) with shift and variance θ_{l}(1 − θ_{l}). Note that under the marginal distribution the random variable has mean and variance 1; therefore the shift is bounded in probability as *n* grows and the information about |Δ*X _{l}*| remains bounded, too. On the other hand the experiment becomes more and more informative as the recombination fraction θ

_{l}decreases, since the shift is a standardized random variable while the variance of the experiment goes to 0.

The observed information (information in the data) is the relative entropy (Kullback and Leibler 1951) of the prior distribution with respect to the posterior distribution, which is decomposable as follows:See Figure 2 for illustration. Note that the parameter vector θ is known here.

The mutual information is obtained by integrating out in each summand the pair (*N _{l}*

_{−1},

*N*) with respect to its joint distribution; that is,

_{l}*N*

_{l}_{−1}is and then

*N*given

_{l}*N*

_{l}_{−1}is mixed normal with mixing parameter |Δ

*X*|.

_{l}##### Remark:

Here the size of the pool is related to the precision of the measurement, since it is implicitly assumed that one has perfect measurements with unit resolution. It would be more realistic to include measurement errors in the model (see discussion).

#### Estimation of recombination fractions:

While in the previous sections the recombination fractions were assumed to be known, we suppose them to be unknown here. By using a Monte Carlo EM algorithm (Penttinen 1984; Geyer and Thompson 1992), we compute the maximum-likelihood estimator of the recombination fractions on the basis of the pooled data.

Recall that conditionally on *N _{l}*

_{−1}and |Δ

*X*|,Therefore we can write

_{l}*N*=

_{l}*y*+ (

_{l}*N*−

_{l}*y*), where, for

_{l}*l*> 1,

*y*∼ Binomial(

_{l}*N*

_{l}_{−1}, 1 −

*r*(|Δ

*X*|, θ

_{l}_{l})) and (

*N*−

_{l}*y*) ∼ Binomial(

_{l}*n*−

*N*

_{l}_{−1},

*r*(|Δ

*X*|, θ

_{l}_{l})) are conditionally independent random variables. We set

*y*

_{1}= 0. Note that are sufficient statistics for the recombination fractions .

To easily solve the maximization step in the EM algorithm, we extend the state space by including the variables *y _{l}* to the hidden states

*X*. The hidden Markov model {(

_{l}*X*,

_{l}*y*,

_{l}*N*)} has the following form: Conditionally on (

_{l}*X*

_{l}_{−1},

*y*

_{l}_{−1},

*N*

_{l}_{−1}),andTo sample conditionally on the pooled data (

*N*), we first sample (

_{l}*X*) conditionally on (

_{l}*N*), using the Viterbi algorithm described in appendix a, and then for we sample the value for

_{l}*Y*conditionally on

_{l}*X*

_{l}_{−1},

*X*,

_{l}*N*

_{l}_{−1},

*N*from a distribution

_{l}Starting with some initial guess , at stage *t*, in the E-step we sample *K* independent identically distributed realizations (*X*^{(j)}, *Y*^{(j)}) conditionally on the pooled data (*N*) and compute a Monte Carlo approximation of the integrated log-likelihood:In the M-step we obtain the next estimate by maximizing this expression over θ, taking withWhen we iterate the procedure,

This extends immediately to the situation where we observe distinct pooled gametic samples produced by unrelated individuals belonging to the same species.

##### Remark:

For a given sample size, one can improve the estimation of the recombination fraction between two loci by measuring the frequencies at a finer map resolution between the two loci.

#### Simulation experiment I—estimation of recombination fractions and parental linkage phase under different pooling strategies:

Simulations were used to test performance of the algorithm under different pooling strategies. We simulated a data set containing 1000 haploid gametes across 200 evenly spaced loci where the recombination fraction between consecutive markers was determined to be 0.01. To illustrate the effect of different pooling strategies we computed the maximum-likelihood estimator of recombination fractions in several situations, namely: (1) the full data case, where we had observations from individual haplotypes (gametes), and (2) pooled data cases, where we split the same data (1000 gametes) into one, two, four, or eight different parts (pools) and observed only the allele frequency of each pool. The Monte Carlo–EM algorithm starts from some arbitrary values for the recombination fractions and it is continued until convergence. In each Monte Carlo step, 5000 independent identically distributed realizations were drawn. The ML estimates of the recombination fraction corresponding to these different situations are shown in Figure 3. One can see in this figure that the use of several smaller pools simultaneously improves the accuracy. In particular, the estimated recombination fraction, corresponding to a single pool (curve a), at locus 80 suddenly jumps toward infinity. This does not correspond to a change of chromosome but it is an artifact since at that locus the frequency is exactly 0.5 and there is no information about the parental linkage phase and the recombination fraction cannot be estimated. We see that when multiple pooling is used, we recover some information about the parental phase at that locus so that the “chromosome jump” disappears. By increasing the number of pools (up to eight) the parental phase was recovered without errors.

#### Simulation experiment I—robustness with respect to errors in data:

From simulation experiment I, data of eight pools (with *n* = 125 gametes in each) were further selected to study the influence of errors on the estimation procedure. Two different kinds of errors were added to the observed allele frequencies in the pooled data.

##### Measurement errors:

We first considered a perturbation in the data where the measured allelic frequencies are masked (at each locus and independently across loci) by adding a Gaussian error with mean 0 and standard deviation with σ = 0.01. This was hoped to mimic an error attributable to the measuring instrument. In the numerical example, with this level of noise, the parental linkage phase was recovered without errors. However, the recombination fraction estimator is clearly not robust against this type of error and provided values that were highly overestimated (see Figure 4). This is because after phase assignment the algorithm counts the remaining fluctuations in the data as recombinations.

##### DNA amplification errors:

We also considered a second type of perturbation, which may occur during the DNA amplification process. In our model, the pooled DNA sample is assumed to contain a single copy of each gametic observation (a haploid DNA) collected from the parent. To simulate DNA amplification errors, we add a simple perturbation to the pooled data sample by randomizing the number of copies of each gametic observation. Namely, a collected gamete with index *i* is copied *C ^{i}* times, where we assume that

*C*are independently and identically distributed for some parameter ε ∈ [0, 1],which is the convolution of a Poisson(ε) and a Bernoulli (1 − ε) distribution. This gives

^{i}*E*(

*C*) = 1 and Var(

^{i}*C*) = (2 − ε)ε. For more sophisticated modeling of DNA amplification, see Lalam

^{i}*et al.*(2004). After DNA amplification, we obtain a pool containing gametes, and we observe at each locus the perturbed counts . Obviously these perturbations are correlated across loci, and this is the reason why the recombination fraction estimator is expected to be robust against DNA amplification errors. In a numerical experiment that has been summarized in Figure 4, we tested several experiments by having different levels (ε = 0.1, 0.2, 0.5, 1.0) of DNA-amplification errors in each. In all these experiments, the recombination fraction estimator seemed to perform reasonably well and the linkage phase was recovered without errors.

#### Joint estimation of recombination fractions, parental linkage phase, and the ordering of markers:

Next we consider the case where the ordering of markers is unknown and it is estimated from the data together with the recombination fractions and the parental linkage phase. We observe the allele frequencies from *n*_{pools} distinct gametic pools obtained from the same parental individual.

The state space is given by a permutation π of the marker loci indexes , together with the recombination fractions θ_{l} between the consecutive markers indexed by π(*l* − 1) and π(*l*), for , and the parental phase vector ().

To speed up the numerical computations we assume that the gametic pools have sizes large enough so that we can use the normal approximation.

##### Marginal likelihood and pseudo-likelihood:

Consider a pair of markers, *j* and *l*, together with the corresponding parental linkage phases *X _{j}*,

*X*∈ {0, 1}. The log-likelihood contribution when marker

_{l}*l*follows marker

*j*with given phases is given bywhere θ is the recombination fraction between the markers

*j*and

*l*, is the total number of gametes in the pools, and we setNote that and the matrix are sufficient statistics, which need to be computed only once for every pair of markers. Their dimension does not depend on the number of pools.

Next, we assume a uniform prior on [0, ] on the recombination fractions, and by integrating the recombination fraction parameter θ with respect to the prior we obtain the marginal-likelihood contribution for the joint choice of the ordering and the linkage phase, and we define the cost function byMaximizing the likelihood corresponds to solving a traveling salesman problem, *i.e.*, finding an ordering π of loci and a phase vector that minimizes the total costwhere the cost for the first locus .

For a small number of markers it is possible to find the optimal ordering in the maximum-likelihood sense by using dynamic programming. However, dynamic programming becomes unfeasible as the number of markers grows. Instead we develop a Markov chain Monte Carlo (MCMC) method (Hastings 1970) to sample from the posterior distribution of the configuration .

Another problem is that the integral in the cost function does not have an analytic expression. For a large number of markers, it becomes numerically expensive to evaluate accurately the cost matrix . We consider instead the pseudo-log-likelihood(1)Intuitively, the optimal ordering should be also close to optimal in the least-squares sense, and this gives a guideline to set the proposal distribution for the ordering π in the MCMC algorithm. For updating steps and the proposal distributions, see appendix b.

#### Simulation experiment II—joint estimation of recombination fractions, parental linkage phase, and the ordering of markers:

We computed a numerical experiment with 400 closely linked marker loci. The markers were not equally spaced; instead the distance between consecutive markers was either 0.001 or 0.01 M, systematically in the proportion of 10 to 1. We simulated data from 500 pools, with 200 gametes in each pool, and then applied a random permutation to the marker indexes and to the parental phases before collecting the pooled data.

The MCMC estimation algorithm found quickly (say, in a few thousand iterations) some good permutations, quite close to the true ordering of the markers. We stopped the experiment after 2 weeks and several million MCMC cycles, because the algorithm was clearly stuck in a local maxima. Since the mixing of the MCMC seems to be very slow, instead of using the empirical distribution as an approximation to the posterior, we look only at the maximum *a posteriori* (MAP), estimated by the sampled configuration with the highest posterior density.

The pseudo-log-likelihood values for initial, true, and estimated MAP configurations were −19,747,072, −75,385, and −121,663, respectively. In Figure 5, we can see that the estimated MAP configuration is not far away from the true ordering, since all the markers are placed extremely close to their true positions. In fact, many markers are placed correctly while some relatively short segments are placed in their right positions but in the reverse direction.

The parental phase was recovered without errors, and in Figure 6 we plot the genetic distances estimated without any prior knowledge on the ordering of the markers, compared with the true map and the map obtained by using the full data and knowing the true ordering of the markers. These estimates are comparable with the corresponding estimates obtained in simulation experiment I where the pooled data were used and ordering of the markers was known *a priori*. It seems that when the number of pools is large enough the marker order can be recovered so well that the two pooling estimators of the recombination fraction (with and without prior knowledge of the ordering) behave very similarly.

## DISCUSSION

We have presented a method to estimate parental linkage phase, sex-specific recombination fractions, and ordering of markers using pooled haploid DNA available from sperm, egg, or megagametophyte samples. This method has a lot of promise because the potential accuracy provided by this method has not been available without individual genotyping before. Use of a single pool seems to result in some information gaps along the chromosome where the linkage phase (see Figure 2) and recombination fractions (see Figure 3) cannot be determined. However, these gaps can be effectively avoided by using multiple (two or more) pools simultaneously because random gap positions are likely to differ between pools.

The parental linkage phase and recombination frequency estimation in this method is based on random fluctuations in transmission ratios and their correlation between the loci under Mendelian inheritance. The information content in the pooled data seems to depend on how much the observed frequency deviates from its expectation. To maximize the information content we propose the following strategy, which requires individual genotyping at the first locus. One could classify individuals into two sets of divergent gametic pools according to the typing of the allele at the first locus. These two pools would then be used to measure pooled frequencies from other loci. This would enhance (locally) the information of the pooled frequencies, without the need of any corrections.

We comment here briefly about segregation distortion. Segregation distortion may follow as result of a selection process on the gametes produced in the meiosis. Since we are considering a short chromosomal segment, we may assume that the selection probability depends only on the alleles at a single locus. Having assumed that the recombination process occurs independently from the haplotypes of the parent, it follows that the distribution of the recombination pattern is not affected by the selection process. In that case we can apply our method without changes. In fact, this kind of segregation distortion would produce data that on average are more informative around the selective locus than in the neutral case. Note, however, that in the case where selection probability depends on several loci, the distribution of the recombination pattern on the selected gametes may be perturbed in a way that depends on the alleles, violating our assumptions. Because this method is based on typing gametic samples (of sperm or egg) rather than living progeny, it can be used to estimate offspring ratios and sex-specific recombination frequencies in a meiotic state under transmission equilibrium. These estimates can then be compared to similar estimates obtained from living organisms (*i.e.*, under a postmeiotic state). A statistically significant departure in estimates between these two states can then be taken as evidence in favor of postmeiotic selection as a source of transmission distortion. For other methods to determine the origin of transmission ratio distortion, see de Villena *et al.* (2000).

We hope that the proposal distributions in the MCMC algorithm could be improved in the future to achieve a faster relaxation. One possibility to be explored would be to use particle filter (say, genetic algorithm) techniques, which can be used successfully for the traveling salesman problem (see Del Moral 2004). The linkage phase estimator is robust to errors and the estimator for genetic distance is robust to DNA amplification errors although the sources of errors in the allele frequency measurements from gametic pools are not taken into account in the current model. We are currently investigating the possible further extension of the method, using errors-in-variable modeling (Fuller 1987; Biemer *et al.* 1991; Carroll *et al.* 1995). In any case, the measurement errors can be controlled in the design of the pooling experiment—by taking several pools with a pool size small enough. Of course, we are looking forward to testing our method with real data when available. The software implementing the method is freely available for research purposes from the authors.

## APPENDIX A: VITERBI ALGORITHM

Here we describe the Viterbi algorithm for the hidden Markov model , where we assume that the recombination fractions are known in advance. The problem is to sample the parental linkage phase conditionally on the observed allele frequencies , which are obtained from the pool of gametes. Because we are on a discrete state space we can use a standard setting; see Durbin *et al.* (1998) and Rabiner (1989).

Nothing is computed for locus *l* = 1, because linkage phase *X*_{1} and observed frequency *N*_{1} are independent there. But for all subsequent loci , the (conditional) transition probabilities *p*(*X _{l}* =

*x*|

*X*

_{l}_{−1}=

*y*,

*N*

_{l}_{−1},

*N*) are recursively computed for all possible values of (

_{l}*x*,

*y*). The term

*X*

_{l}_{−1}is then integrated out from this expression so that we obtain

After these *L* forward steps, linkage phase *X _{L}* is sampled from and the backward algorithm continues for as follows:

Linkage phase *X _{l}* is drawn given the data and

*X*

_{l}_{+1}with probability proportional in

*x*to

After *L* backward steps we obtain the sample from the posterior distribution.

By dynamic programming it is also possible to compute *a posteriori* the most probable path and its posterior probability.

## APPENDIX B: UPDATE STEPS AND PROPOSAL DISTRIBUTIONS FOR THE METROPOLIS ALGORITHM

A MCMC cycle consists of several proposed moves:

We perform a block update for the parental phase vector , which samples from the fully conditional posterior distribution given the data and the current values of π and . If there is no prior information about the phase vector this is very simple since under the uniform prior for

*X*, Δ*X*_{π(l)}will be conditionally independent given π. If the prior for*X*is not uniform, we use the Viterbi algorithm (see appendix a).We update independently the recombination fraction parameters θ

_{l}conditionally on the data, on π, and on the phases*X*_{π(l−1)},*X*_{π(l)}. This is a Metropolis move where as a proposal distribution for θ_{l}we take a histogram approximation over a finite grid*G*⊂ [0, ] of the full conditional density .The two following moves update the ordering. For each of these, we resample simultaneously the parental phases and the recombination fraction parameters around the markers involved.

As a proposal distribution, we apply a random permutation to the current ordering. For each locus involved, we resample jointly the phase and the recombination fractions between the new neighboring markers (that is, by using the proposal distribution described in ii given the new ordering and phase).

As a proposal distribution we randomly select a segment and we cut it off from the chromosome, join the extremes, choose a random location in the chromosome, and insert the segment there, eventually also inverting the direction of the segment. We also have the possibility to flip simultaneously all the parental phases of the loci belonging to the segment. Finally new recombination fraction parameters are sampled for the markers around the two cuts points by using the proposal distribution described in ii.

The random permutation in move iii and the random segment in move iv together with the new phases are not selected uniformly at random. Instead in the proposal distribution we take into account the pseudo-log-likelihood (Equation 1).

For each of these moves, for a given starting configuration (π, *X*) we denote by *F*(π, *X*) the set of configurations that are reachable in one step. By using the pseudo-log-likelihood we define the proposal distribution aswhere *t* ≥ 0 is an inverse temperature parameter *t* = 0 corresponding to the uniform distribution over *F*(π, *X*), and *Z _{t}*(π,

*X*) is the normalizing constant. For high values of

*t*, the proposal distribution among the reachable configurations will choose with high probability those with the smallest energy. Since we want to have also a chance to propose any reachable configuration, we let the inverse temperature parameter

*t*vary cyclically in a finite set.

The Hastings ratio for such moves is given by

In the case that the data contain gametic pools sampled from *J* different parental individuals, it is clear that one should extend the MCMC algorithm by keeping one permutation vector and introducing *J* independent parental phase vectors , .

## Acknowledgments

We are grateful to two anonymous referees for their constructive comments on the manuscript. This work was supported by research grant 202324 from the Academy of Finland and by the Centre of Population Genetic Analyses, University of Oulu, Finland.

## Footnotes

Communicating editor: M. Nordborg

- Received April 12, 2005.
- Accepted November 1, 2005.

- Copyright © 2006 by the Genetics Society of America