Originally published as Genetics Published Articles Ahead of Print on November 19, 2005.

Genetics, Vol. 172, 1325-1335, February 2006, Copyright © 2006
doi:10.1534/genetics.105.044271

Constructing the Parental Linkage Phase and the Genetic Map Over Distances <1 cM Using Pooled Haploid DNA

University of Helsinki, Helsinki FIN-00014, Finland

1 Corresponding author: Department of Mathematics and Statistics, University of Helsinki, P.O. Box 68 (Gustaf Hällströmin katu 2b), Room 316, Helsinki FIN-00014, Finland.
E-mail: dag{at}rolf.helsinki.fi

Manuscript received April 12, 2005. Accepted for publication November 1, 2005.

ABSTRACT

A new statistical approach for construction of the genetic linkage map and estimation of the parental linkage phase based on allele frequency data from pooled gametic (sperm or egg) samples is introduced. This method can be applied for estimation of recombination fractions (over distances <1 cM) and ordering of large numbers (even hundreds) of closely linked markers. This method should be extremely useful in species with a long generation interval and a large genome size such as in dairy cattle or in forest trees; the conifer species have haploid tissues available in megagametophytes. According to Mendelian expectation, two parental alleles should occur in gametes in 1:1 proportions, if segregation distortion does not occur. However, due to mere sampling variation, the observed proportions may deviate from their expected value in practice. These deviations and their dependence along the chromosome can provide information on the parental linkage phase and on the genetic linkage map. Usefulness of the method is illustrated with simulations. The role of segregation distortion as a source of these deviations is also discussed. The software implementing this method is freely available for research purposes from the authors.


ESTIMATION of recombination fraction over distances <1 cM is important because current genetic maps are very inaccurate in such distances (KONG et al. 2002; WEBER 2002; ARNHEIM et al. 2003) and because of current interest toward characterization and utilization of haplotype-block structure of the human genome (ARNHEIM et al. 2003; INTERNATIONAL HAPMAP CONSORTIUM 2003). For an excellent general review of characterization of recombination in small distances, see ARNHEIM et al. (2003).

To estimate parental linkage phase and short map distances in a range <1 cM one would ideally require data from several gametes (haploid tissues). In species with a short generation cycle it may be possible to produce "map expansion" (an excess of recombinants) by applying a number of consecutive intercrosses (DARVASI 1998). Several haplotyping methods for pedigree data (WIJSMAN 1987; KRUGLYAK et al. 1995,1996; SOBEL and LANGE 1996; SOBEL et al. 1996; LIN and SPEED 1997; TAPADAR et al. 2000; QIAN and BECKMANN 2002; GAO et al. 2004) and for general population samples (CLARK 1990; EXCOFFIER and SLATKIN 1995; HAWLEY and KIDD 1995; LONG et al. 1995; STEPHENS et al. 2001; KITAMURA et al. 2002; STEPHENS and SCHEET 2005) have been introduced during the last two decades. Similarly, there is a rich literature of methods for construction of the genetic linkage map (LANDER and GREEN 1987; STEPHENS and SMITH 1993; GEORGE et al. 1999; JANSEN et al. 2001; BUTCHER et al. 2002; ROSA et al. 2002; WU et al. 2002; LU et al. 2004; GEORGE 2005). In species such as forest trees or dairy cattle, where the generation cycle is long and the genome size is large, all map distances and ordering of markers are generally difficult to estimate. However, certain types of haploid tissues such as sperm, eggs, or megagametophytes may be easily available in such species. Use of individual sperm typing (LAZZERONI et al. 1994; NAVIDI and ARNHEIM 1994, 1999; CULLEN et al. 2002; ARNHEIM et al. 2003) and utilization of DNA in megagametophytes (TULSIERAM et al. 1992; YAZDANI et al. 1995) for the estimation of the genetic map (order and distances) have been proposed in the literature earlier. However, extensive individual typing of haploid tissues is not cost effective. Note that application of radiation-hybrid mapping has also been proposed for a similar purpose (BOEHNKE et al. 1991; SLONIM et al. 1997).

Use of the pooled DNA data reduces the cost and time spent on typing by allowing for direct determination of allele frequencies at each locus (SHAW et al. 1998; COLLINS et al. 2000; RITLAND 2002; SHAM et al. 2002; BUTCHER et al. 2004; NORTON et al. 2004). Pooled DNA data on diploid tissues can also be used to estimate haplotype frequencies (ITO et al. 2003; YANG et al. 2003). We assume here that allele frequency measurements from pooled DNA can be obtained with a high accuracy in the lab (NORTON et al. 2002; SHAM et al. 2002; BUTCHER et al. 2004; YANG et al. 2005). If haploid tissues from a single parent are used in the pool and the possibility of segregation distortion is excluded, the average proportions of the two parental alleles are even at each locus. However, the observed proportions often deviate from expected proportions and are correlated between the loci (due to close linkage of the loci), which can provide information of the linkage phase (haplotypes) of the parent. The information in these fluctuations is stochastic in nature and can vary somewhat from sample to sample. Note that these fluctuations, when measured from large samples, can be used to detect segregation distortion (MCPEEK 1999; and see DISCUSSION). When there is no significant evidence of segregation distortion, the asymptotic properties of these fluctuations (and their dependence) as a source of information in estimating linkage phase of the parent at short map distances are described in this article.

Hidden Markov models (HMMs) provide a flexible modeling framework for several types of problems (RABINER 1989), including the construction of the linkage map (LANDER and GREEN 1987). The Viterbi algorithm is a maximization algorithm that can be used in HMMs to determine the optimal path (having the highest likelihood value) through the hidden state structure. The same algorithm can also be used to sample directly from the posterior distribution. In the following, we present our haplotyping method that is based on the Viterbi algorithm. Note that also the well-known Genehunter program uses the Viterbi algorithm for haplotyping (KRUGLYAK et al. 1995, 1996).


MODEL
We first describe the model for individual gametic observations and then introduce the required modifications for that under pooled gametic observations.

Individual gametic observations:

Assume that the sample consists of n gametes Formula, which have been collected from a diploid individual (the parent). The parent is assumed to be a full heterozygote in a given set of L polymorphic marker loci so that at each locus there are two segregating alleles (denoted as 0 and 1) that can be distinguished in the sample. In practice, we omit the loci in our sample, where only a single allele can be identified (i.e., the parent is a homozygote). Each gamete i can then be represented as a vector Formula.

Denote by Xl the grandparental origin of the 0-allele at locus l, with the convention that when Xl = 0 (Xl = 1) the origin of the 0-allele is grandpaternal (grandmaternal), respectively. Note that the assignment of the vector Formula uniquely determines the parental linkage phase. A priori, Formula are independent random variables, Formula, with respective Bernoulli({pi}l) distributions. We assume that both grandparental origins are a priori equally likely at each locus; i.e., Formula corresponding to linkage equilibrium assumption for each locus l (for potential problems, see SCHAID et al. 2002; HUANG et al. 2004.) However, in the case that we have some prior information about parental linkage phase, e.g., genotype information from the grandparents, or knowledge about linkage disequilibrium among the loci, we could set Formula or, in the latter case, assume a priori a Markov model for Formula (see MCPEEK and STRAHS 1999).

Denote the vector of recombination fractions by Formula, where an element {theta}l is the recombination fraction between two consecutive loci (l – 1) and l. The likelihood function is the following:

Formula
At the first locus above, gametes Formula, Formula are independent Bernoulli(1/2) variables. Then, given Xl–1, Xl, and Formula, gametes Formula are conditionally independent with the following distribution: If Xl = Xl–1 (i.e., the grandparental origin of the 0-allele remains the same) then

Formula
and if Xl != Xl–1 (i.e., the grandparental origins of the 0-allele differ) then

Formula
This formulation corresponds to the standard transmission model (e.g., SILLANPÄÄ and ARJAS 1999; see their Equations 4 and 5). Here, although the phases (Xl) are a priori independent, the vectors Formula, form a Markov process, which gives a HMM with hidden layer (Xl). Although the parameterization is different, this is the same model assumed in LANDER and GREEN (1987) and in KRUGLYAK et al. (1995, 1996), where a bit in their inheritance vector includes information from both the grandparental origin (Xl) and the allele in the gamete (hli). Given the individual gametic observations and the recombination fractions between the loci, it is possible to apply a variant of the Viterbi algorithm to compute the posterior distribution of the parental linkage phase p(X|h1, ... , hn, {theta}), which is proportional to p(h1, ... , hn|X, {theta}) x p(X). It is also possible to implement the EM algorithm (DEMPSTER et al. 1977) to compute the maximum-likelihood estimator of the recombination fractions ({theta}).

Pooled gametic observations:

Next we consider the case, where instead of observing the individual gametes (hli) we observe their (allele) frequencies from a gametic pool or equivalently Formula, the number of copies of the 1-allele at locus l. The likelihood function for pooled observations is the following:

Formula
At the first locus above, the frequency N1 is independent from phase vector X and it has the Binomial(n, Formula) distribution. For loci l > 1, given the phase information Xl–1, Xl, and the frequency Nl–1 at the previous locus, Nl has the following conditional distribution (given by convolution of two binomials): If Xl = Xl–1 (i.e., the grandparental origin of the 0-allele remains the same) then

Formula
and if Xl != Xl–1 (i.e., the grandparental origins of the 0-allele differ) then

Formula

Alternatively we can use the following notation:

Formula
where * denotes convolution. These convolution distributions can be computed efficiently, using discrete Fourier transform (see, e.g., BREMAUD 2002).

Multiple pools:

We can also consider the case where we observe different frequencies from npools distinct pools, each containing n(k) gametes, Formula. Accordingly, the pool-specific frequency vectors Formula, are conditionally independent given the phase vector X, starting with initial distribution N1(k) ~ Binomial(n(k), Formula) and then following independently the Markovian dynamics described above. To maintain simplicity of the notation in the rest of this article, the theory is presented mainly for a single pool but it generalizes to multiple pools analogously.

Estimation of linkage phase:

It may be somewhat surprising, but becomes clear from the formula above, that knowledge of the frequencies Nl–1 and Nl gives some information about the change |XlXl–1| in the grandparental origin. The amount of such information is quantified by the total variation distance between the two distributions above; clearly it depends on Nl–1, n, and {theta}l and is random in this sense. This information is zero when {theta}l = Formula, and it increases as the recombination fraction proceeds to 0. Also, it is zero when Nl–1 = n/2 and increases as Nl–1 goes to 0 or to n.

Again, the pairs Formula form a hidden Markov chain (Figure 1), and it is possible to apply HMM techniques and the EM algorithm to estimate parental linkage phase (X) and recombination fractions ({theta}), respectively. This practice is similar to that in the full data case. The Viterbi algorithm is described in detail in APPENDIX A. We show that in the case where the marker spacing is dense and several distinct pools are used, the information content of the pooled data about the parental linkage phase X is not much lower than the information content carried by the full data.


Figure 1
View larger version (6K):
In this window
In a new window
Download PPT slide
 
FIGURE 1.—

Graphical representation of the hidden state structure. At locus L, hidden parental linkage phase is indicated as XL and the number of offspring having the 1-allele as NL.

 

Normal approximation and information about linkage phase:

In the following, we present a normal approximation for transitions between allele frequency measurements at different loci, which may be used to speed up the numerical computations, when the sizes of the gametic pools are large enough so that the binomial distribution can be approximated with the normal distribution. We also compute the information in the data and explain how to calculate the mutual information between linkage phase Formula and the observed allele frequency data Formula under the approximate model. Let

Formula
be the frequency of the allele inherited from the grandfather at locus l, where n is the size of the gametic pool. The conditional distribution of Formula given Formula is a convolution of two binomials with mean Formula and variance n(1 – {theta}l){theta}l. When n is large enough, we approximate the scaled process Formula by a stationary Gaussian process Formula with transition density

Formula
and invariant distribution Formula.

Analogously the conditional distribution of the frequency Nl given Nl–1 and |{Delta}Xl| = |XlXl–1| is approximated by a normal distribution with mean m({theta}) = Nl–1 + (n 2Nl–1)r(|{Delta}Xl|, {theta}l) and variance n(1 – {theta}l){theta}l, where we denote r(0, {theta}) = {theta} and r(1, {theta}) = 1 – {theta}.

After rescaling by n–1/2, asymptotically we are in a conditionally Gaussian shift experiment (see VAN DER VAART 1998) with shift Formula and variance {theta}l(1 – {theta}l). Note that under the marginal distribution the random variable Formula has mean Formula and variance 1; therefore the shift is bounded in probability as n grows and the information about |{Delta}Xl| remains bounded, too. On the other hand the experiment becomes more and more informative as the recombination fraction {theta}l decreases, since the shift is a standardized random variable while the variance of the experiment goes to 0.

The observed information (information in the data) is the relative entropy (KULLBACK and LEIBLER 1951) of the prior distribution with respect to the posterior distribution, which is decomposable as follows:

Formula
See Figure 2 for illustration. Note that the parameter vector {theta} is known here.


Figure 2
View larger version (18K):
In this window
In a new window
Download PPT slide
 
FIGURE 2.—

Observed information about the parental linkage phase in pooled haplotype data. Pooled data consist of a simulated haploid offspring group with 100 haplotypes (gametes). Allele frequency of the allele with lower frequency (top) and the corresponding observed information content (bottom) are shown for 200 heterozygote loci. The limit of maximal information in log(2) is indicated by a dashed line in the bottom half. Genetic map distances (in Morgans) are also shown on the x-axis.

 
The mutual information is obtained by integrating out in each summand the pair (Nl–1, Nl) with respect to its joint distribution; that is, Nl–1 is Formula and then Nl given Nl–1 is mixed normal with mixing parameter |{Delta}Xl|.

Remark:

Here the size of the pool is related to the precision of the measurement, since it is implicitly assumed that one has perfect measurements with unit resolution. It would be more realistic to include measurement errors in the model (see DISCUSSION).

Estimation of recombination fractions:

While in the previous sections the recombination fractions were assumed to be known, we suppose them to be unknown here. By using a Monte Carlo EM algorithm (PENTTINEN 1984; GEYER and THOMPSON 1992), we compute the maximum-likelihood estimator of the recombination fractions Formula on the basis of the pooled data.

Recall that conditionally on Nl–1 and |{Delta}Xl|,

Formula
Therefore we can write Nl = yl + (Nlyl), where, for l > 1, yl ~ Binomial(Nl–1, 1 – r(|{Delta}Xl|, {theta}l)) and (Nlyl) ~ Binomial(n Nl–1, r(|{Delta}Xl|, {theta}l)) are conditionally independent random variables. We set y1 = 0. Note that Formula are sufficient statistics for the recombination fractions Formula.

To easily solve the maximization step in the EM algorithm, we extend the state space by including the variables yl to the hidden states Xl. The hidden Markov model {(Xl, yl, Nl)} has the following form: Conditionally on (Xl–1, yl–1, Nl–1),

Formula
and

Formula
To sample Formula conditionally on the pooled data (Nl), we first sample (Xl) conditionally on (Nl), using the Viterbi algorithm described in APPENDIX A, and then for Formula we sample the value for Yl conditionally on Xl–1, Xl, Nl–1, Nl from a distribution

Formula

Starting with some initial guess Formula, at stage t, in the E-step we sample K independent identically distributed realizations (X(j), Y(j)) conditionally on the pooled data (N) and compute a Monte Carlo approximation of the integrated log-likelihood:

Formula
In the M-step we obtain the next estimate Formula by maximizing this expression over {theta}, taking Formula with

Formula
When we iterate the procedure,

Formula

This extends immediately to the situation where we observe distinct pooled gametic samples produced by unrelated individuals belonging to the same species.

Remark:

For a given sample size, one can improve the estimation of the recombination fraction between two loci by measuring the frequencies at a finer map resolution between the two loci.

Simulation experiment I—estimation of recombination fractions and parental linkage phase under different pooling strategies:

Simulations were used to test performance of the algorithm under different pooling strategies. We simulated a data set containing 1000 haploid gametes across 200 evenly spaced loci where the recombination fraction between consecutive markers was determined to be 0.01. To illustrate the effect of different pooling strategies we computed the maximum-likelihood estimator of recombination fractions in several situations, namely: (1) the full data case, where we had observations from individual haplotypes (gametes), and (2) pooled data cases, where we split the same data (1000 gametes) into one, two, four, or eight different parts (pools) and observed only the allele frequency of each pool. The Monte Carlo–EM algorithm starts from some arbitrary values for the recombination fractions and it is continued until convergence. In each Monte Carlo step, 5000 independent identically distributed realizations were drawn. The ML estimates of the recombination fraction corresponding to these different situations are shown in Figure 3. One can see in this figure that the use of several smaller pools simultaneously improves the accuracy. In particular, the estimated recombination fraction, corresponding to a single pool (curve a), at locus 80 suddenly jumps toward infinity. This does not correspond to a change of chromosome but it is an artifact since at that locus the frequency is exactly 0.5 and there is no information about the parental linkage phase and the recombination fraction cannot be estimated. We see that when multiple pooling is used, we recover some information about the parental phase at that locus so that the "chromosome jump" disappears. By increasing the number of pools (up to eight) the parental phase was recovered without errors.


Figure 3
View larger version (17K):
In this window
In a new window
Download PPT slide
 
FIGURE 3.—

Simulation experiment I. The curves of estimated genetic distances over marker loci (measured cumulatively from the left) in the case of different pooling strategies are shown: (a) one pool of 1000 gametes, (b) two pools of 500 gametes, (c) four pools of 250 gametes, (d) eight pools of 125 gametes, (e) full data of 1000 individual gametes, and (f) true genetic distance. Recombination fractions were converted to genetic distance using Haldane's mapping function.

 

Simulation experiment I—robustness with respect to errors in data:

From simulation experiment I, data of eight pools (with n = 125 gametes in each) were further selected to study the influence of errors on the estimation procedure. Two different kinds of errors were added to the observed allele frequencies in the pooled data.

Measurement errors:

We first considered a perturbation in the data where the measured allelic frequencies are masked (at each locus and independently across loci) by adding a Gaussian error with mean 0 and standard deviation Formula with {sigma} = 0.01. This was hoped to mimic an error attributable to the measuring instrument. In the numerical example, with this level of noise, the parental linkage phase was recovered without errors. However, the recombination fraction estimator is clearly not robust against this type of error and provided values that were highly overestimated (see Figure 4). This is because after phase assignment the algorithm counts the remaining fluctuations in the data as recombinations.


Figure 4
View larger version (27K):
In this window
In a new window
Download PPT slide
 
FIGURE 4.—

Simulation experiment I—erroneous data. The curves of estimated genetic distances over marker loci (measured cumulatively from the left) in the case of different degrees of error in the data are shown. The curves d, e, and f coincide with those in Figure 3: (f) true genetic distances, (e) estimator based on 1000 individual gametes (correct data), and (d) estimator based on eight pools of 125 gametes in each (correct data). In curves g–k, data sets with eight pools of 125 gametes are also used but with errors in the data: (g) Gaussian random error with mean 0 and standard deviation Formula where {sigma} = 0.01, (h) DNA amplification error with {varepsilon} = 0.1, (i) DNA amplification error with {varepsilon} = 0.2, (j) DNA amplification error with {varepsilon} = 0.5, and (k) DNA amplification error with {varepsilon} = 1.0. Recombination fractions were converted to genetic distance using Haldane's mapping function.

 

DNA amplification errors:

We also considered a second type of perturbation, which may occur during the DNA amplification process. In our model, the pooled DNA sample is assumed to contain a single copy of each gametic observation (a haploid DNA) collected from the parent. To simulate DNA amplification errors, we add a simple perturbation to the pooled data sample by randomizing the number of copies of each gametic observation. Namely, a collected gamete with index i is copied Ci times, where we assume that Ci are independently and identically distributed for some parameter {varepsilon} isin [0, 1],

Formula
which is the convolution of a Poisson({varepsilon}) and a Bernoulli (1 – {varepsilon}) distribution. This gives E(Ci) = 1 and Var(Ci) = (2 – {varepsilon}){varepsilon}. For more sophisticated modeling of DNA amplification, see LALAM et al. (2004). After DNA amplification, we obtain a pool containing Formula gametes, and we observe at each locus the perturbed counts Formula. Obviously these perturbations are correlated across loci, and this is the reason why the recombination fraction estimator is expected to be robust against DNA amplification errors. In a numerical experiment that has been summarized in Figure 4, we tested several experiments by having different levels ({varepsilon} = 0.1, 0.2, 0.5, 1.0) of DNA-amplification errors in each. In all these experiments, the recombination fraction estimator seemed to perform reasonably well and the linkage phase was recovered without errors.

Joint estimation of recombination fractions, parental linkage phase, and the ordering of markers:

Next we consider the case where the ordering of markers is unknown and it is estimated from the data together with the recombination fractions and the parental linkage phase. We observe the allele frequencies from npools distinct gametic pools obtained from the same parental individual.

The state space is given by a permutation {pi} of the marker loci indexes Formula, together with the recombination fractions {theta}l between the consecutive markers indexed by {pi}(l – 1) and {pi}(l), for Formula, and the parental phase vector (Formula).

To speed up the numerical computations we assume that the gametic pools have sizes large enough so that we can use the normal approximation.

Marginal likelihood and pseudo-likelihood:

Consider a pair of markers, j and l, together with the corresponding parental linkage phases Xj, Xl isin {0, 1}. The log-likelihood contribution when marker l follows marker j with given phases is given by

Formula
where {theta} is the recombination fraction between the markers j and l, Formula is the total number of gametes in the pools, and we set

Formula
Note that Formula and the matrix Formula are sufficient statistics, which need to be computed only once for every pair of markers. Their dimension does not depend on the number of pools.

Next, we assume a uniform prior on [0, Formula] on the recombination fractions, and by integrating the recombination fraction parameter {theta} with respect to the prior we obtain the marginal-likelihood contribution for the joint choice of the ordering and the linkage phase, and we define the cost function by

Formula
Maximizing the likelihood corresponds to solving a traveling salesman problem, i.e., finding an ordering {pi} of loci Formula and a phase vector Formula that minimizes the total cost

Formula
where the cost for the first locus Formula.

For a small number of markers it is possible to find the optimal ordering in the maximum-likelihood sense by using dynamic programming. However, dynamic programming becomes unfeasible as the number of markers grows. Instead we develop a Markov chain Monte Carlo (MCMC) method (HASTINGS 1970) to sample from the posterior distribution of the configuration Formula.

Another problem is that the integral in the cost function does not have an analytic expression. For a large number of markers, it becomes numerically expensive to evaluate accurately the cost matrix Formula. We consider instead the pseudo-log-likelihood

Formula 1(1)
Intuitively, the optimal ordering should be also close to optimal in the least-squares sense, and this gives a guideline to set the proposal distribution for the ordering {pi} in the MCMC algorithm. For updating steps and the proposal distributions, see APPENDIX B.

Simulation experiment II—joint estimation of recombination fractions, parental linkage phase, and the ordering of markers:

We computed a numerical experiment with 400 closely linked marker loci. The markers were not equally spaced; instead the distance between consecutive markers was either 0.001 or 0.01 M, systematically in the proportion of 10 to 1. We simulated data from 500 pools, with 200 gametes in each pool, and then applied a random permutation to the marker indexes and to the parental phases before collecting the pooled data.

The MCMC estimation algorithm found quickly (say, in a few thousand iterations) some good permutations, quite close to the true ordering of the markers. We stopped the experiment after 2 weeks and several million MCMC cycles, because the algorithm was clearly stuck in a local maxima. Since the mixing of the MCMC seems to be very slow, instead of using the empirical distribution as an approximation to the posterior, we look only at the maximum a posteriori (MAP), estimated by the sampled configuration with the highest posterior density.

The pseudo-log-likelihood values for initial, true, and estimated MAP configurations were –19,747,072, –75,385, and –121,663, respectively. In Figure 5, we can see that the estimated MAP configuration is not far away from the true ordering, since all the markers are placed extremely close to their true positions. In fact, many markers are placed correctly while some relatively short segments are placed in their right positions but in the reverse direction.


Figure 5
View larger version (12K):
In this window
In a new window
Download PPT slide
 
FIGURE 5.—

Simulation experiment II. Estimated order of 400 marker loci is shown, using pooled data from 500 pools, each containing 200 gametes.

 
The parental phase was recovered without errors, and in Figure 6 we plot the genetic distances estimated without any prior knowledge on the ordering of the markers, compared with the true map and the map obtained by using the full data and knowing the true ordering of the markers. These estimates are comparable with the corresponding estimates obtained in simulation experiment I where the pooled data were used and ordering of the markers was known a priori. It seems that when the number of pools is large enough the marker order can be recovered so well that the two pooling estimators of the recombination fraction (with and without prior knowledge of the ordering) behave very similarly.


Figure 6
View larger version (12K):
In this window
In a new window
Download PPT slide
 
FIGURE 6.—

Simulation experiment II. Genetic distance curves over marker loci (measured cumulatively from the left) are shown: (a) true genetic distances, (b) estimated genetic distance using the full data with known marker ordering, and (c) the genetic distance is estimated simultaneously with the ordering of the markers by using pooled data.

 


DISCUSSION
We have presented a method to estimate parental linkage phase, sex-specific recombination fractions, and ordering of markers using pooled haploid DNA available from sperm, egg, or megagametophyte samples. This method has a lot of promise because the potential accuracy provided by this method has not been available without individual genotyping before. Use of a single pool seems to result in some information gaps along the chromosome where the linkage phase (see Figure 2) and recombination fractions (see Figure 3) cannot be determined. However, these gaps can be effectively avoided by using multiple (two or more) pools simultaneously because random gap positions are likely to differ between pools.

The parental linkage phase and recombination frequency estimation in this method is based on random fluctuations in transmission ratios and their correlation between the loci under Mendelian inheritance. The information content in the pooled data seems to depend on how much the observed frequency deviates from its expectation. To maximize the information content we propose the following strategy, which requires individual genotyping at the first locus. One could classify individuals into two sets of divergent gametic pools according to the typing of the allele at the first locus. These two pools would then be used to measure pooled frequencies from other loci. This would enhance (locally) the information of the pooled frequencies, without the need of any corrections.

We comment here briefly about segregation distortion. Segregation distortion may follow as result of a selection process on the gametes produced in the meiosis. Since we are considering a short chromosomal segment, we may assume that the selection probability depends only on the alleles at a single locus. Having assumed that the recombination process occurs independently from the haplotypes of the parent, it follows that the distribution of the recombination pattern is not affected by the selection process. In that case we can apply our method without changes. In fact, this kind of segregation distortion would produce data that on average are more informative around the selective locus than in the neutral case. Note, however, that in the case where selection probability depends on several loci, the distribution of the recombination pattern on the selected gametes may be perturbed in a way that depends on the alleles, violating our assumptions. Because this method is based on typing gametic samples (of sperm or egg) rather than living progeny, it can be used to estimate offspring ratios and sex-specific recombination frequencies in a meiotic state under transmission equilibrium. These estimates can then be compared to similar estimates obtained from living organisms (i.e., under a postmeiotic state). A statistically significant departure in estimates between these two states can then be taken as evidence in favor of postmeiotic selection as a source of transmission distortion. For other methods to determine the origin of transmission ratio distortion, see DE VILLENA et al. (2000).

We hope that the proposal distributions in the MCMC algorithm could be improved in the future to achieve a faster relaxation. One possibility to be explored would be to use particle filter (say, genetic algorithm) techniques, which can be used successfully for the traveling salesman problem (see DEL MORAL 2004). The linkage phase estimator is robust to errors and the estimator for genetic distance is robust to DNA amplification errors although the sources of errors in the allele frequency measurements from gametic pools are not taken into account in the current model. We are currently investigating the possible further extension of the method, using errors-in-variable modeling (FULLER 1987; BIEMER et al. 1991; CARROLL et al. 1995). In any case, the measurement errors can be controlled in the design of the pooling experiment—by taking several pools with a pool size small enough. Of course, we are looking forward to testing our method with real data when available. The software implementing the method is freely available for research purposes from the authors.


APPENDIX A: VITERBI ALGORITHM
Here we describe the Viterbi algorithm for the hidden Markov model Formula 1, where we assume that the recombination fractions Formula 1 are known in advance. The problem is to sample the parental linkage phase Formula 1 conditionally on the observed allele frequencies Formula 1, which are obtained from the pool of gametes. Because we are on a discrete state space we can use a standard setting; see DURBIN et al. (1998) and RABINER (1989).

Nothing is computed for locus l = 1, because linkage phase X1 and observed frequency N1 are independent there. But for all subsequent loci Formula 1, the (conditional) transition probabilities p(Xl = x|Xl–1 = y, Nl–1, Nl) are recursively computed for all possible values of (x, y). The term Xl–1 is then integrated out from this expression so that we obtain

Formula 1

After these L forward steps, linkage phase XL is sampled from Formula 1 and the backward algorithm continues for Formula 1 as follows:

Linkage phase Xl is drawn given the data and Xl+1 with probability proportional in x to

Formula 1

After L backward steps we obtain the sample Formula 1 from the posterior distribution.

By dynamic programming it is also possible to compute a posteriori the most probable path and its posterior probability.


APPENDIX B: UPDATE STEPS AND PROPOSAL DISTRIBUTIONS FOR THE METROPOLIS ALGORITHM
A MCMC cycle consists of several proposed moves:
  1. We perform a block update for the parental phase vector Formula 1, which samples from the fully conditional posterior distribution given the data and the current values of {pi} and Formula 1. If there is no prior information about the phase vector this is very simple since under the uniform prior for X, {Delta}X{pi}(l) will be conditionally independent given {pi}. If the prior for X is not uniform, we use the Viterbi algorithm (see APPENDIX A).
  2. We update independently the recombination fraction parameters {theta}l conditionally on the data, on {pi}, and on the phases X{pi}(l–1), X{pi}(l). This is a Metropolis move where as a proposal distribution for {theta}l we take a histogram approximation over a finite grid G sub [0, Formula 1] of the full conditional density Formula 1.
  3. The two following moves update the ordering. For each of these, we resample simultaneously the parental phases and the recombination fraction parameters around the markers involved.
  4. As a proposal distribution, we apply a random permutation to the current ordering. For each locus involved, we resample jointly the phase and the recombination fractions between the new neighboring markers (that is, by using the proposal distribution described in ii given the new ordering and phase).
  5. As a proposal distribution we randomly select a segment and we cut it off from the chromosome, join the extremes, choose a random location in the chromosome, and insert the segment there, eventually also inverting the direction of the segment. We also have the possibility to flip simultaneously all the parental phases of the loci belonging to the segment. Finally new recombination fraction parameters are sampled for the markers around the two cuts points by using the proposal distribution described in ii.

The random permutation in move iii and the random segment in move iv together with the new phases are not selected uniformly at random. Instead in the proposal distribution we take into account the pseudo-log-likelihood (Equation 1).

For each of these moves, for a given starting configuration ({pi}, X) we denote by F({pi}, X) the set of configurations that are reachable in one step. By using the pseudo-log-likelihood we define the proposal distribution as

Formula 1
where t ≥ 0 is an inverse temperature parameter t = 0 corresponding to the uniform distribution over F({pi}, X), and Zt({pi}, X) is the normalizing constant. For high values of t, the proposal distribution among the reachable configurations will choose with high probability those with the smallest energy. Since we want to have also a chance to propose any reachable configuration, we let the inverse temperature parameter t vary cyclically in a finite set.

The Hastings ratio for such moves is given by

Formula 1

In the case that the data contain gametic pools sampled from J different parental individuals, it is clear that one should extend the MCMC algorithm by keeping one permutation vector Formula 1 and introducing J independent parental phase vectors Formula 1, Formula 1.


ACKNOWLEDGEMENTS
We are grateful to two anonymous referees for their constructive comments on the manuscript. This work was supported by research grant 202324 from the Academy of Finland and by the Centre of Population Genetic Analyses, University of Oulu, Finland.


LITERATURE CITED

ARNHEIM, N., P. CALABRESE and M. NORDBORG, 2003 Hot and cold spots of recombination in the human genome: the reason we should find them and how this can be achieved. Am. J. Hum. Genet. 73: 5–16.[CrossRef][Medline]

BIEMER, P. P., R. M. GROVES, L. E. LYBERG, N. A. MATHIOWETZ and S. SUDMAN, 1991 Measurement Errors in Surveys. John Wiley & Sons, New York.

BOEHNKE, M., K. LANGE and D. COX, 1991 Statistical methods for multipoint radiation hybrid mapping. Am. J. Hum. Genet. 49: 1174–1188.[Medline]

BREMAUD, P., 2002 Mathematical Principles of Signal Processing: Fourier and Wavelet Analysis. Springer-Verlag, New York.

BUTCHER, P. A., E. R. WILLIAMS, D. WHITAKER, S. LING, T. P. SPEED et al., 2002 Improving linkage analysis in outcrossed forest trees—an example from Acacia mangium. Theor. Appl. Genet. 104: 1185–1191.[CrossRef][Medline]

BUTCHER, L. M., E. MEABURN, L. LIU, C. FERNANDEZ, L. HILL et al., 2004 Genotyping pooled DNA on microarrays: a systematic genome screen of thousands of SNPs in large samples to detect QTLs for complex traits. Behav. Genet. 34: 549–555.[CrossRef][Medline]

CARROLL, R. J., D. RUPPERT and L. A. STEFANSKI, 1995 Non-Linear Measurement Error Models. Chapman & Hall, London.

CLARK, A. G., 1990 Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7: 111–122.[Abstract]

COLLINS, H. E., H. LI, S. E. INDA, J. ANDERSON, K. LAIHO et al., 2000 A simple and accurate method for determination of microsatellite total allele content differences between DNA pools. Hum. Genet. 106: 218–226.[CrossRef][Medline]

CULLEN, M., S. P. PERFETTO, W. KLITZ, G. NELSON and M. GARRINGTON, 2002 High-resolution patterns of meiotic recombination across the human major histocompatibility complex. Am. J. Hum. Genet. 71: 759–776.[CrossRef][Medline]

DARVASI, A., 1998 Experimental strategies for the genetic dissection of complex traits in animal models. Nat. Genet. 18: 19–24.[CrossRef][Medline]

DEL MORAL, P., 2004 Feynman-Kac Formulae: Genealogical and Interacting Particle Systems With Applications. Springer, Berlin/Heidelberg, Germany/New York.

DEMPSTER, A. P., N. M. LAIRD and D. B. RUBIN, 1977 Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. B 39: 1–38.

DE VILLENA, F. P-M., E. DE LA CASA-ESPERON, T. L. BRISCOE and C. SAPIENZA, 2000 A general test to determine the origin of maternal transmission ratio distortion: meiotic drive at the mouse Om locus. Genetics 154: 333–342.[Abstract/Free Full Text]

DURBIN, R., S. EDDY, A. KROGH and G. MITCHISON, 1998 Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, Cambridge, UK.

EXCOFFIER, L., and M. SLATKIN, 1995 Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921–927.[Abstract]

FULLER, W. A., 1987 Measurement Error Models. John Wiley & Sons, New York.

GAO, G., I. HOESCHELE, P. SORENSEN and F. DU, 2004 Conditional probability methods for haplotyping in pedigrees. Genetics 167: 2055–2065.[Abstract/Free Full Text]

GEORGE, A. W., 2005 A novel MCMC approach for constructing accurate meiotic maps. Genetics 171: 791–801.[Abstract/Free Full Text]

GEORGE, A. W., K. L. MENGERSEN and G. P. DAVIS, 1999 A Bayesian approach to ordering gene markers. Biometrics 55: 419–429.[CrossRef][Medline]

GEYER, C. J., and E. A. THOMPSON, 1992 Constrained Monte-Carlo maximum-likelihood for dependent data. J. R. Stat. Soc. B 54: 657–699.

HASTINGS, W. K., 1970 Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97–109.[Abstract/Free Full Text]

HAWLEY, M. E., and K. K. KIDD, 1995 HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered. 86: 409–411.[Free Full Text]

HUANG, Q., S. SHETE and C. I. AMOS, 2004 Ignoring linkage disequilibrium among tightly linked markers induces false-positive evidence of linkage for affected sib pair analysis. Am. J. Hum. Genet. 75: 1106–1112.[CrossRef][Medline]

INTERNATIONAL HAPMAP CONSORTIUM, 2003 The international HapMap project. Nature 426: 789–796.[CrossRef][Medline]

ITO, T., S. CHIKU, E. INOUE, M. TOMITA, T. MORISAKI et al., 2003 Estimation of haplotype frequencies, linkage-disequilibrium measures, and combination of haplotype copies in each pool by use of pooled DNA data. Am. J. Hum. Genet. 72: 384–398.[CrossRef][Medline]

JANSEN, J., A. G. DE JONG and J. W. VAN OOIJEN, 2001 Constructing dense genetic linkage maps. Theor. Appl. Genet. 102: 1113–1122.[CrossRef]

KITAMURA, Y., M. MORIGUCHI, H. KANEKO, H. MORISAKI, T. MORISAKI et al., 2002 Determination of probability distribution of diplotype configuration (diplotype distribution) for each subject from genotypic data using the EM algorithm. Ann. Hum. Genet. 66: 183–193.[CrossRef][Medline]

KONG, A., D. F. GUDBJARTSSON, J. SAINZ, G. M. JONSDOTTIR, S. A. GUDJONSSON et al., 2002 A high-resolution recombination map of the human genome. Nat. Genet. 31: 241–247.[CrossRef][Medline]

KRUGLYAK, L., M. J. DALY and E. S. LANDER, 1995 Rapid multipoint linkage analysis of recessive traits in nuclear families, including homozygosity mapping. Am. J. Hum. Genet. 56: 519–527.[Medline]

KRUGLYAK, L., M. J. DALY, M. P. REEVE-DALY and E. S. LANDER, 1996 Parametric and nonparametric linkage analysis: a unified multipoint approach. Am. J. Hum. Genet. 58: 1347–1363.[Medline]

KULLBACK, S., and R. A. LEIBLER, 1951 On information and sufficiency. Ann. Math. Stat. 22: 79–86.

LALAM, N., C. JACOB and P. JAGERS, 2004 Modelling the PCR amplification process by a size-dependent branching process and estimation of the efficiency. Adv. Appl. Probab. 36: 602–615.[CrossRef]

LANDER, E. S., and P. GREEN, 1987 Construction of multilocus genetic maps in humans. Proc. Natl. Acad. Sci. USA 84: 22363–22367.

LAZZERONI, L. C., N. ARNHEIM, K. SCHMITT and K. LANGE, 1994 Multipoint mapping calculations for sperm-typing data. Am. J. Hum. Genet. 55: 431–436.[Medline]

LIN, S., and T. P. SPEED, 1997 An algorithm for haplotype analysis. J. Comput. Biol. 4: 535–546.[Medline]

LONG, J. C., R. C. WILLIAMS and M. URBANEK, 1995 An E-M algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56: 799–810.[Medline]

LU, Q., Y. CUI and R. WU, 2004 A multilocus likelihood approach to joint modeling of linkage, parental diplotype and gene order in a full-sib family. BMC Genet. 5: 20.[CrossRef][Medline]

MCPEEK, M. S., 1999 SPERMSEG: analysis of segregation distortion in single-sperm data. Am. J. Hum. Genet. 65: 1195–1197.[CrossRef][Medline]

MCPEEK, M. S., and A. STRAHS, 1999 Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine scale genetic mapping. Am. J. Hum. Genet. 65: 858–875.[CrossRef][Medline]

NAVIDI, W., and N. ARNHEIM, 1994 Analysis of genetic data from the polymerase chain reaction. Stat. Sci. 9: 320–333.

NAVIDI, W., and N. ARNHEIM, 1999 Combining data from polymerase chain reaction DNA typing experiments: application to sperm typing data. J. Am. Stat. Assoc. 94: 726–733.[CrossRef]

NORTON, N., N. M. WILLIAMS, H. J. WILLIAMS, G. SPURLOCK, G. KIROV et al., 2002 Universal, robust, highly quantitative SNP allele frequency measurement in DNA pools. Hum. Genet. 110: 471–478.[CrossRef][Medline]

NORTON, N., N. M. WILLIAMS, M. C. O'DONOVAN and M. J. OWEN, 2004 DNA pooling as a tool for large-scale association studies in complex traits. Ann. Med. 36: 146–152.[CrossRef][Medline]

PENTTINEN, A., 1984 Modelling interactions in spatial point patterns: parameter estimation by the maximum likelihood method. Ph.D. Thesis, University of Jyväskylä, Jyväskylä, Finland.

QIAN, D., and L. BECKMANN, 2002 Minimum-recombinant haplotyping in pedigrees. Am. J. Hum. Genet. 70: 1434–1445.[CrossRef][Medline]

RABINER, L., 1989 A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77: 257–286.[CrossRef]

RITLAND, K., 2002 Estimation of gene frequency and heterozygosity from pooled samples. Mol. Ecol. Notes 2: 370–372.[CrossRef]

ROSA, G. J. M., B. S. YANDELL and D. GIANOLA, 2002 A Bayesian approach for constructing genetic maps when markers are miscoded. Genet. Sel. Evol. 34: 353–369.[CrossRef][Medline]

SCHAID, D. J., S. K. MCDONNELL, L. WANG, J. M. CUNNINGHAM and S. N. THIBODEAU, 2002 Caution on pedigree haplotype inference with software that assumes linkage equilibrium. Am. J. Hum. Genet. 71: 992–995.[CrossRef][Medline]

SHAM, P., J. S. BADER, I. CRAIG, M. O'DONOVAN and M. OWEN, 2002 DNA pooling: a tool for large-scale association studies. Nat. Rev. Genet. 3: 862–871.[CrossRef][Medline]

SHAW, S. H., M. M. CARRASQUILLO, C. KASHUK, E. G. PUFFENBERGER and A. CHAKRAVARTI, 1998 Allele frequency distribution in pooled DNA samples: applications to mapping complex disease genes. Genome Res. 8: 111–123.[Abstract/Free Full Text]

SLONIM, D., L. KRUGLYAK, L. STEIN and E. LANDER, 1997 Building human genome maps with radiation hybrids. J. Comput. Biol. 4: 487–504.[Medline]

SOBEL, E., and K. LANGE, 1996 Descent graphs in pedigree analysis: applications to haplotyping, location scores, and marker-sharing statistics. Am. J. Hum. Genet. 58: 1323–1337.[Medline]

SOBEL, E., K. LANGE, J. R. O'CONNELL and D. E. WEEKS, 1996 Haplotyping algorithms, pp. 89–110 in Genetic Mapping and DNA Sequencing (IMA Vol. 81 in Mathematics and Its Applications), edited by T. P. SPEED and M. S. WATERMAN. Springer-Verlag, New York.

STEPHENS, D. A., and A. F. M. SMITH, 1993 Bayesian inference in multipoint gene mapping. Ann. Hum. Genet. 57: 65–82.[Medline]

STEPHENS, M., and P. SCHEET, 2005 Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 76: 449–462.[CrossRef][Medline]

STEPHENS, M., N. J. SMITH and P. DONNELLY, 2001 A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978–989.[CrossRef][Medline]

SILLANPÄÄ, M. J., and E. ARJAS, 1999 Bayesian mapping of multiple quantitative trait loci from incomplete outbred offspring data. Genetics 151: 1605–1619.[Abstract/Free Full Text]

TAPADAR, P., S. GHOSH and P. P. MAJUMDER, 2000 Haplotyping in pedigrees via a genetic algorithm. Hum. Hered. 50: 43–56.[Medline]

TULSIERAM, L. K., J. C. GLAUBITZ, G. KISS and J. E. CARLSON, 1992 Single tree genetic linkage mapping in conifers using haploid DNA from megagametophytes. Bio/Technology 10: 686–690.[CrossRef][Medline]

VAN DER VAART, A. W., 1998 Asymptotic Statistics. Cambridge University Press, Cambridge, UK.

WEBER, J. L., 2002 The Iceland map. Nat. Genet. 31: 225–226.[Medline]

WIJSMAN, E. M., 1987 A deductive method of haplotype analysis in pedigrees. Am. J. Hum. Genet. 41: 356–373.[Medline]

WU, R., C-X. MA, S. S. WU and Z-B. ZENG, 2002 Linkage mapping of sex-specific differences. Genet. Res. 79: 85–96.[CrossRef][Medline]

YANG, H.-C., C.-C. PAN, R. C. Y. LU and C. S. J. FANN, 2005 New adjustment factors and sample size calculation in DNA-pooling experiment with preferential amplification. Genetics 169: 399–410.[Abstract/Free Full Text]

YANG, Y., H. ZHANG, J. HOH, F. MATSUDA, P. XU et al., 2003 Efficiency of single-nucleotide polymorphism haplotype estimation from pooled DNA. Proc. Natl. Acad. Sci. USA 100: 7225–7230.[Abstract/Free Full Text]

YAZDANI, R. F., C. YEH and J. RIMSHA, 1995 Genomic mapping of Pinus sylvestris (L.) using random amplified polymorphic DNA markers. For. Genet. 4: 209–215.

Communicating editor: M. NORDBORG