## Abstract

Parent-offspring trios are widely collected for disease gene-mapping studies and are being extensively genotyped as part of the International HapMap Project. With dense maps of markers on trios, the effects of LD and linkage can be separated, allowing estimation of recombination rates in a model-free setting. Here we define a model-free multipoint method on the basis of dense sequence polymorphism data from parent-offspring trios to estimate intermarker recombination rates. We use simulations to show that this method has up to 92% power to detect recombination hotspots of intensity 25 times background over a region of size 10 kb typed at density 1 marker per 2.5 kb and almost 100% power to detect large hotspots of intensity >125 times background over regions of size 10 kb typed with just 1 marker per 5 kb (α = 0.05). We found strong agreement at megabase scales between estimates from our method applied to HapMap trio data and estimates from the genetic map. At finer scales, using Centre d'Etude du Polymorphisme Humain (CEPH) pedigree data across a 10-Mb region of chromosome 20, a comparison of population recombination rate estimates obtained from our method with estimates obtained using a coalescent-based approximate-likelihood method implemented in PHASE 2.0 shows detection of the same coldspots and most hotspots: The Spearman rank correlation between the estimates from our method and those from PHASE is 0.58 (*p* < 2.2 × 10^{−16}).

Linkage disequilibrium (LD) refers to the presence of nonrandom association among two or more alleles at distinct loci. The degree of LD between alleles is determined by an evolutionary process combining the effects of natural selection, mutation, genetic drift, and population admixture. Appreciable LD is likely to be present only between two loci that are tightly linked. For this reason, attempts have been made to use the magnitude of LD between two loci to obtain estimates of recombination fractions between loci, *e.g.*, Serre *et al.* (1990). Transmission of haplotypes from parents to offspring is dependent on both LD and linkage (Spielman *et al.* 1993). Specifically, the extent of LD between two distinct loci decays exponentially at a rate proportional to the recombination fraction *c* between the loci: If the disequilibrium between the two loci is *D*_{0} at time 0, then after *k* generations the disequilibrium will be

$$D_{k} = D_{0}(1 - c)^{k}. \tag{1}$$

Hence, if either the time span or the recombination rate is known, then, under simplifying assumptions that ignore the ongoing stochastic effect of the evolutionary process, the other parameter can be estimated from the coefficients of LD (Lander and Botstein 1986). In practice, such estimates are problematic because the sampling variance of LD tends to be very large due to the evolutionary sampling (Hill and Weir 1994), resulting in large standard errors in the estimate of *c*. Motivated by the extensive genotyping on familial trios (mother, father, and offspring) in the International HapMap Project (International HapMap Consortium 2003), as well as by the availability of large trio samples in several disease cohorts, we construct a simple, robust approach to disentangling the effects of LD and linkage in the transmission of data from parent to offspring.
Our method relates the genetic variation in a population sample over one generation to the underlying recombination rate and uses a multipoint approach to reduce the standard error normally associated with LD-based methods. We make only the usual assumptions of genetic equilibrium required for establishment of Hardy-Weinberg equilibrium. Specifically, we make the following explicit assumptions: infinite population size, discrete generations, random mating, no selection, no migration, no mutation, and no difference across sexes in parental genotype frequencies. In this sense, our method can be described as model-free. The transmission data between multiple pairs of overlapping densely spaced markers are combined to reduce variance and produce accurate estimates of recombination rates between adjacent markers. Our method generates short-range smoothness in recombination rates over the region of the chromosome spanned by overlapping markers and allows for rate estimation on various scales limited only by the density of the markers used.
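The decay relation *D*_{k} = *D*_{0}(1 − *c*)^{k} can be illustrated with a minimal Python sketch. The function names `ld_after` and `recombination_from_decay` are hypothetical, and the calculation is purely deterministic: it ignores drift and sampling variance, which is the reason such direct inversions are unreliable in practice.

```python
def ld_after(d0: float, c: float, k: int) -> float:
    """Expected LD after k generations: D_k = D_0 * (1 - c)**k."""
    return d0 * (1.0 - c) ** k


def recombination_from_decay(d0: float, dk: float, k: int) -> float:
    """Invert the decay relation for c; assumes d0 and dk share a sign and k > 0."""
    return 1.0 - (dk / d0) ** (1.0 / k)


# Illustrative values only: initial LD of 0.25, c = 0.01, 100 generations.
d0 = 0.25
dk = ld_after(d0, c=0.01, k=100)
c_back = recombination_from_decay(d0, dk, k=100)
```

With real data, *D*_{k} is itself a noisy estimate, so the recovered value of *c* inherits that noise.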

More sophisticated methods applying models to account for stochastic effects have been developed and used successfully for the fine mapping of disease genes using allelic associations in isolated populations, for example, those in Hastbacka *et al.* (1992), Kaplan and Weir (1995), Kaplan *et al.* (1995), and Devlin *et al.* (1996). Also, many coalescent model-based methods exist for the estimation of recombination rates from patterns of sequence variation in population genetic data. Methods include moment-based estimators (Wall 2000) and full-likelihood (Griffiths and Marjoram 1996; Kuhner *et al.* 2000; Nielsen 2000; Fearnhead and Donnelly 2001) and approximate-likelihood approaches (Hudson 2001; Fearnhead and Donnelly 2002; McVean *et al.* 2002; Li and Stephens 2003). Methods are further described in Stumpf and McVean (2003).

We define a hotspot of intensity *s* to be a region of length 10 kb where crossovers occur at a frequency of more than *s* times that in the surrounding region. We use simulations to show that, using 200 trio families, our method can reliably detect recombination hotspots of intensity 10 across a region typed with 20 markers at a density of 1 marker per 2.5 kb (*P* < 0.025). Hotspots of an intensity >25 can be detected by typing as few as 5 markers at a density of 1 marker per 2.5 kb (*P* < 0.001). We show that the number of markers required to achieve a given accuracy decreases with marker density. Results suggest that this method is able to detect fine-scale variation in recombination rate according to the density of typed markers: That is, the more dense the marker spacing is, the more fine scale the variation that can be detected. To validate our approach, we estimated recombination rates using HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trio data from various chromosomes and compared averages over a megabase scale to estimates from the genetic map (Kong *et al.* 2002). At a finer scale, we compared recombination rate estimates in European (CEPH) populations between adjacent SNPs on a 10-Mb region of chromosome 20 made using our model-free method to independent estimates made from a coalescent model-based approximate-likelihood approach implemented in the program PHASE 2.0 (Stephens *et al.* 2001; Stephens and Donnelly 2003). These independent estimates are described in detail in Evans and Cardon (2005).

## METHODS

We assume that we have phase-known genotype data from *N* familial trios at *L* densely spaced, ordered, biallelic markers. See discussion for comments on how to handle phase-unknown genotype data. Markers are labeled 1, … , *L*. We summarize the data at all pairs of markers up to a given distance apart (*e.g.*, all pairs of markers up to *w* markers apart, *w* < *L*) as counts of observed haplotypes in both parental and offspring generations. The method can be summarized in three stages as follows:

1. For all pairs of markers, we first estimate the probability of observing each haplotype in the parental population. This provides an estimate of LD in the parental generation.

2. For each pair of markers, we write down the likelihood of transmitting the observed offspring haplotype counts conditional on the LD in the parental generation, that is, assuming the LD in the parental generation is a known parameter. The likelihood of transmitted haplotype counts at each pair of markers is a function of the recombination fraction between the markers. The recombination fraction is defined as the probability that a transmitted haplotype constitutes a new combination of alleles different from that of either parental haplotype. Maximum-likelihood estimates (MLEs) of the recombination fraction are then obtained, independently, at all pairs of markers.

3. Finally, for any given pair of adjacent markers, all estimates from marker pairs that overlap the adjacent pair, up to a given distance apart, are scaled according to the interval spanned by the pair of markers relative to the interval spanned by the adjacent markers under consideration. Each scaled estimate represents a new estimate of the recombination fraction between the adjacent pair of markers. The original estimate between each pair of markers and all new estimates are then averaged (weighted by variance) to form a final, new estimate of the recombination fraction between each pair of adjacent markers.

These steps are now described in greater detail.

#### Step 1—determine LD in the parental generation:

Consider any two biallelic markers with alleles arbitrarily labeled $A$ and $a$ at the first locus and $B$ and $b$ at the second locus. Let $c$ be the recombination fraction between the markers. We assume that haplotypes are available or have been estimated from a sample of $N$ parent-offspring trios. The four possible haplotypes are $ab$, $aB$, $Ab$, and $AB$ and the observed number of these haplotypes in the parental generation is denoted by $\mathbf{m} = (m_{ab}, m_{aB}, m_{Ab}, m_{AB})$, with $m_{ab} + m_{aB} + m_{Ab} + m_{AB} = 4N$. The likelihood of a specific sample configuration is

$$L(\mathbf{p} \mid \mathbf{m}) = \frac{(4N)!}{m_{ab}!\, m_{aB}!\, m_{Ab}!\, m_{AB}!}\; p_{ab}^{m_{ab}}\, p_{aB}^{m_{aB}}\, p_{Ab}^{m_{Ab}}\, p_{AB}^{m_{AB}}, \tag{2}$$

where $p_{ab}$, $p_{aB}$, $p_{Ab}$, and $p_{AB}$ are the frequencies of haplotypes $ab$, $aB$, $Ab$, and $AB$, respectively, in the parental generation. The MLE of $p_{ab}$ is $m_{ab}/4N$ and is denoted by $\hat{p}_{ab}$, and likewise for $p_{aB}$, $p_{Ab}$, and $p_{AB}$. Let estimates of major allele frequencies at each locus be given by $\hat{p}_{A} = \hat{p}_{Ab} + \hat{p}_{AB}$ and $\hat{p}_{B} = \hat{p}_{aB} + \hat{p}_{AB}$. Using the standard notation for an LD coefficient, we let $D = p_{AB} - p_{A}p_{B}$ denote the LD in the parental generation. Let $\hat{D} = \hat{p}_{AB} - \hat{p}_{A}\hat{p}_{B}$ denote the MLE of LD in the parental generation.
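Step 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation (which was written in R): the function name `parental_ld` and the toy counts are hypothetical, haplotypes are assumed phase-known and encoded as two-character strings, and LD is computed as the standard coefficient, the frequency of haplotype *AB* minus the product of the allele frequencies of *A* and *B*.

```python
from collections import Counter

HAPLOTYPES = ("ab", "aB", "Ab", "AB")


def parental_ld(parental_haplotypes):
    """Return (MLE haplotype frequencies, D-hat) for one pair of markers."""
    counts = Counter(parental_haplotypes)
    total = sum(counts[h] for h in HAPLOTYPES)  # equals 4N for N trios
    p_hat = {h: counts[h] / total for h in HAPLOTYPES}
    p_A = p_hat["Ab"] + p_hat["AB"]  # frequency of allele A at the first locus
    p_B = p_hat["aB"] + p_hat["AB"]  # frequency of allele B at the second locus
    d_hat = p_hat["AB"] - p_A * p_B
    return p_hat, d_hat


# Made-up data: N = 5 trios, hence 4N = 20 parental haplotypes.
sample = ["AB"] * 8 + ["ab"] * 8 + ["Ab"] * 2 + ["aB"] * 2
p_hat, d_hat = parental_ld(sample)
```

For these toy counts the haplotype *AB* has frequency 0.4 and both allele frequencies are 0.5, so the LD estimate is 0.4 − 0.25 = 0.15.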

#### Step 2—infer recombination fraction *c* between markers:

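Let $p'_{ab}$ be the frequency of haplotype $ab$ in the offspring generation. Then, by conditioning on all possible parental haplotype combinations from which an $ab$ haplotype can be transmitted, it is easy to see that

$$p'_{ab} = p_{ab} - cD, \tag{3}$$

with $D$ the LD coefficient in the parental generation. Similarly, it is easy to show that the frequencies of haplotypes $aB$, $Ab$, and $AB$ in the offspring generation are, respectively,

$$p'_{aB} = p_{aB} + cD, \qquad p'_{Ab} = p_{Ab} + cD, \qquad \text{and} \qquad p'_{AB} = p_{AB} - cD.$$

Let the observed number of haplotypes in the offspring generation be denoted by $\mathbf{n} = (n_{ab}, n_{aB}, n_{Ab}, n_{AB})$, where $n_{ab} + n_{aB} + n_{Ab} + n_{AB} = 2N$. Using the MLEs $\hat{p}$ and $\hat{D}$ of the parental haplotype probabilities and LD, the likelihood at $c$ for a specific sample configuration of offspring haplotypes is therefore given by

$$L(c \mid \mathbf{n}) = \frac{(2N)!}{n_{ab}!\, n_{aB}!\, n_{Ab}!\, n_{AB}!}\; (\hat{p}_{ab} - c\hat{D})^{n_{ab}}\, (\hat{p}_{aB} + c\hat{D})^{n_{aB}}\, (\hat{p}_{Ab} + c\hat{D})^{n_{Ab}}\, (\hat{p}_{AB} - c\hat{D})^{n_{AB}}. \tag{4}$$

MLEs of $c$, denoted $\hat{c}$, are obtained by selecting the value of $c$ that maximizes Equation 4. The information $I(c)$ for $c$ is defined as minus the second derivative of the logarithm of Equation 4:

$$I(c) = \hat{D}^{2}\left[\frac{n_{ab}}{(\hat{p}_{ab} - c\hat{D})^{2}} + \frac{n_{aB}}{(\hat{p}_{aB} + c\hat{D})^{2}} + \frac{n_{Ab}}{(\hat{p}_{Ab} + c\hat{D})^{2}} + \frac{n_{AB}}{(\hat{p}_{AB} - c\hat{D})^{2}}\right]. \tag{5}$$

$I(c)$ evaluated at $\hat{c}$ is the observed information and is inversely proportional to the asymptotic variance of $\hat{c}$. Let $\operatorname{Var}(\hat{c})$ denote the variance of estimate $\hat{c}$. Observe that when $|\hat{D}|$ is small, as would be the case in recombination hotspots, the likelihood has less dependence on the exact value of $c$ and MLEs will have large variance. Conversely, when $|\hat{D}|$ is large, estimates of $c$ will be more precise with small variance.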
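A minimal Python sketch of step 2 follows. It assumes the offspring haplotype frequencies take the form of the parental frequency plus or minus *c* times the parental LD (minus for *ab* and *AB*, plus for *aB* and *Ab*), drops the constant multinomial coefficient, and uses a plain grid search on [0, 0.5] as a stand-in for a proper optimizer. All function names and the toy counts are hypothetical.

```python
import math

# Sign of the c * D term in each offspring haplotype frequency.
SIGNS = {"ab": -1.0, "aB": 1.0, "Ab": 1.0, "AB": -1.0}


def log_likelihood(c, n, p_hat, d_hat):
    """Log-likelihood of offspring counts n at c, up to an additive constant."""
    total = 0.0
    for hap, count in n.items():
        if count == 0:
            continue
        prob = p_hat[hap] + SIGNS[hap] * c * d_hat
        if prob <= 0.0:
            return -math.inf  # this c is incompatible with the observed counts
        total += count * math.log(prob)
    return total


def observed_information(c, n, p_hat, d_hat):
    """Minus the second derivative of the log-likelihood (assumes all probs > 0)."""
    return d_hat ** 2 * sum(
        count / (p_hat[hap] + SIGNS[hap] * c * d_hat) ** 2
        for hap, count in n.items()
    )


def mle_c(n, p_hat, d_hat, steps=5000):
    """Grid-search MLE of c on [0, 0.5]."""
    grid = [0.5 * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda c: log_likelihood(c, n, p_hat, d_hat))


# Made-up example: parental frequencies with D-hat = 0.15, 2N = 200 offspring haplotypes.
p_hat = {"ab": 0.4, "aB": 0.1, "Ab": 0.1, "AB": 0.4}
d_hat = 0.15
n = {"ab": 77, "aB": 23, "Ab": 23, "AB": 77}
c_hat = mle_c(n, p_hat, d_hat)
info = observed_information(c_hat, n, p_hat, d_hat)
```

The toy offspring counts were chosen so that their proportions equal the expected frequencies at *c* = 0.1, so the grid search should land on or next to that value. Because the information scales with the square of the LD estimate, halving the LD at fixed counts roughly quarters the information.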

#### Step 3—use multiple closely linked flanking markers to refine estimate of *c*:

Let $d_{i,j}$ be the distance, in base pairs, between any two markers $i$ and $j$. Let $c_{i,j}$ represent the recombination fraction between markers $i$ and $j$ and let $\hat{c}_{i,j}$ be the MLE of $c_{i,j}$ obtained using the method described above. Individual estimates $\hat{c}_{i,j}$ can be highly variable. To refine our estimate between adjacent markers $i$ and $i + 1$, we combine estimates obtained at multiple pairs of markers $j$ and $k$ up to $w$ markers away from markers $i$ and $i + 1$, *i.e.*, pairs of markers in $S := \{(j, k) : \max(0, i + 1 - w) \le j < k \le \min(i + w, L)\}$. These represent all the markers that overlap markers $i$ and $i + 1$ and are within $w$ markers of them and are referred to as the *flanking markers*. We assume that the rate at which recombination occurs remains constant over the region of the chromosome spanned by the flanking markers. Since recombination fractions are approximately additive over short distances and our markers are densely spaced, we assume that the intermarker recombination fractions estimated between pairs of flanking markers are scaled estimates of the intermarker recombination fraction between markers $i$ and $i + 1$. Specifically, let

$$\hat{c}^{(j,k)}_{i,i+1} = \frac{d_{i,i+1}}{d_{j,k}}\, \hat{c}_{j,k}$$

be an estimate of $c_{i,i+1}$ on the basis of data from the $\{j, k\}$ interval. Then the variance of $\hat{c}^{(j,k)}_{i,i+1}$ is

$$\operatorname{Var}\!\left(\hat{c}^{(j,k)}_{i,i+1}\right) = \left(\frac{d_{i,i+1}}{d_{j,k}}\right)^{2} \operatorname{Var}(\hat{c}_{j,k}).$$

We form a new estimate of $c_{i,i+1}$ by taking the weighted average of all estimates $\hat{c}^{(j,k)}_{i,i+1}$, $(j, k) \in S$. Specifically, let

$$\tilde{c}_{i,i+1} = \sum_{(j,k) \in S} \omega_{j,k}\, \hat{c}^{(j,k)}_{i,i+1}, \tag{6}$$

where

$$\omega_{j,k} = \frac{1 / \operatorname{Var}\!\left(\hat{c}^{(j,k)}_{i,i+1}\right)}{\sum_{(j',k') \in S} 1 / \operatorname{Var}\!\left(\hat{c}^{(j',k')}_{i,i+1}\right)}$$

is the weight given to the estimated intermarker recombination fraction between markers $i$ and $i + 1$ made on the basis of information from markers $j$ and $k$. Figure 1 illustrates these steps in more detail.

If estimates are independent and $N$ is large, then $\tilde{c}_{i,i+1}$ is approximately distributed as a normal random variable with mean $c_{i,i+1}$ and variance $\operatorname{Var}(\tilde{c}_{i,i+1})$ (Edwards 1992), where

$$\operatorname{Var}(\tilde{c}_{i,i+1}) = \left[\sum_{(j,k) \in S} \frac{1}{\operatorname{Var}\!\left(\hat{c}^{(j,k)}_{i,i+1}\right)}\right]^{-1}. \tag{7}$$

However, estimates $\hat{c}^{(j,k)}_{i,i+1}$, $(j, k) \in S$, are, in fact, dependent. For the analysis presented here, we violate the assumption of independence and use the formula given in Equation 7 to get rough approximations of normal confidence intervals. Refer to the discussion for alternative possibilities for computing the variance.
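The combination in step 3 can be sketched as follows. This is an illustration, not the authors' code. It assumes inverse-variance weights, 0-based marker indices, strictly positive variances, and one particular reading of the flanking set: pairs that span the adjacent interval and are at most `w` marker intervals wide. The function name `refine_adjacent` and the toy inputs are hypothetical.

```python
def refine_adjacent(i, c_hat, var_hat, pos, w):
    """Refined estimate of c between markers i and i+1, with its variance.

    c_hat[(j, k)] and var_hat[(j, k)] hold pairwise MLEs and their variances
    for j < k; pos holds marker positions in base pairs.
    """
    d_adj = pos[i + 1] - pos[i]
    weighted_sum = 0.0
    inv_var_sum = 0.0
    for j in range(max(0, i + 1 - w), i + 1):
        for k in range(i + 1, min(j + w, len(pos) - 1) + 1):
            scale = d_adj / (pos[k] - pos[j])      # rescale to the adjacent interval
            scaled_c = scale * c_hat[(j, k)]
            scaled_var = scale ** 2 * var_hat[(j, k)]
            weighted_sum += scaled_c / scaled_var  # inverse-variance weighting
            inv_var_sum += 1.0 / scaled_var
    return weighted_sum / inv_var_sum, 1.0 / inv_var_sum


# Toy input: six equally spaced markers, uniform rate, equal pairwise variances.
pos = [0, 2500, 5000, 7500, 10000, 12500]
c_hat = {(j, k): 0.001 * (k - j) for j in range(6) for k in range(j + 1, 6)}
var_hat = {pair: 1e-6 for pair in c_hat}
est, var = refine_adjacent(2, c_hat, var_hat, pos, w=2)
```

In this toy case three pairs contribute (one of span 1 and two of span 2), the refined estimate equals the common per-interval value 0.001, and the nominal variance falls to one-ninth of the pairwise variance. That nominal figure relies on the independence assumption that the text notes is violated.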

#### Simulations:

To test the accuracy of our method for estimating recombination fractions between loci we used the program fin (software available on request from G. A. T. McVean, mcvean{at}stats.ox.ac.uk) to generate SNP genotype data simulated from a standard coalescent neutral model of genetic variation. The basic simulation involved 400 sequences of 400 SNPs with an overall population recombination rate, ρ = 4*N*_{e}*r* = 400, where *N*_{e} is the effective population size and *r* is the per generation recombination rate, that is, the probability of a crossover per generation across the 1-Mb region. For an effective population size of *N*_{e} = 10,000, *r* = 0.01 over the 1-Mb region, equivalent to a genetic distance of 1 cM/Mb, which is approximately the rate seen in human data derived from observed crossovers in pedigrees (Kong *et al*. 2002). Average SNP spacing is 1 SNP per *d* = 2.5 kb. The population recombination rate was set at a constant background value over the simulated region but increased to *s* = 10 times the background rate at three evenly spaced intervals of length 10 kb. Hence, for this simulation, there were 385 markers with intermarker recombination rate value 0.748; the remaining 15 markers were simulated to be in three hotspots of 5 markers with intermarker recombination rate value 7.48. Thus, in total, the overall population recombination rate is ρ = 15 × 7.48 + 385 × 0.748 = 400, as required, and crossovers between markers in the hotspot were simulated to occur 10 times more frequently than those in the surrounding sequence. Samples were then randomly mated using the program tdtsim (software available on request from A. Morris, amorris{at}well.ox.ac.uk) to produce 200 offspring samples with zero recombination. We show below that recombination in current-generation meioses does not have a significant impact on estimated recombination rates using our method. Simulations were repeated as described for hotspot intensities *s* = 1, 25, 125, and 625.
Obviously, simulations with *s* = 1 have uniform recombination rate across the region, *i.e*., no hotspots, and are used for comparison.
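The arithmetic behind these simulation settings can be checked with a short Python sketch. The helper name `background_rate` is hypothetical; the numbers are those quoted in the text.

```python
def background_rate(total_rho, n_background, n_hotspot, intensity):
    """Per-interval background rho such that the regional total equals total_rho."""
    return total_rho / (n_background + intensity * n_hotspot)


bg = background_rate(400, 385, 15, 10)  # close to the quoted 0.748
hot = 10 * bg                           # close to the quoted 7.48
total = 385 * bg + 15 * hot             # recovers rho = 400
r = 400 / (4 * 10_000)                  # r = rho / (4 * N_e) = 0.01 across 1 Mb
```

The same helper gives the background value for the other intensities by changing the last argument.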

Intermarker recombination fractions were estimated for each simulation and converted to genetic distances in centimorgans per megabase as follows. Each estimate of *c* is multiplied by 100, to convert to centimorgans, and then divided by the intermarker interval length in megabases, to convert to centimorgans per megabase. To ensure a mean value of 1 cM/Mb across the region, estimates are scaled by dividing by the estimated total genetic distance across the 1-Mb region. Since recombination fractions are <0.01, conversion to genetic distances requires no mapping function.
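The conversion can be written as a small Python function. This is a sketch with a hypothetical name, `to_cm_per_mb`. It divides by the regional mean rate in cM/Mb, which coincides with dividing by the total genetic distance in centimorgans when the region is exactly 1 Mb long, as in the simulations; for other region lengths this generalization is an assumption.

```python
def to_cm_per_mb(c, interval_bp):
    """Convert recombination fractions to cM/Mb, rescaled to a regional mean of 1."""
    lengths_mb = [bp / 1e6 for bp in interval_bp]
    cm = [100.0 * ci for ci in c]  # small c, so no mapping function is applied
    regional_mean = sum(cm) / sum(lengths_mb)  # cM/Mb over the whole region
    return [(x / length) / regional_mean for x, length in zip(cm, lengths_mb)]


# Toy input: three intervals with made-up recombination fractions.
rates = to_cm_per_mb([0.0001, 0.0002, 0.0001], [2500, 2500, 5000])
```

For the toy input the unscaled rates are 4, 8, and 2 cM/Mb with a regional mean of 4 cM/Mb, so the scaled rates are 1, 2, and 0.5.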

## RESULTS

Figure 2 shows intermarker recombination fraction estimates for randomly selected markers according to the number of pairs of overlapping markers used to refine each estimate. Observe that estimates have large confidence intervals when using information from just the adjacent pair (equivalent to 1 overlap or 1 flanking marker) but the variance reduces, and estimates settle to specific values, as the number of flanking markers used to refine the estimate increases. Clearly, a single pair of markers does not provide enough information to accurately estimate intermarker recombination fractions in this way. The strength of our method comes from combining estimates at pairs of overlapping markers. The number of flanking markers to use will depend on the marker density, the scale of recombination to be detected, and the number of available trios. The choice of *w* is addressed in the discussion.

Figure 3 shows intermarker recombination rates, in centimorgans per megabase, from simulations described for Figure 2. Each row corresponds to a different true hotspot intensity, *s* = 10, 25, 125, and 625, and illustrates results at 10, 20, 40, and 60 flanking markers. Estimates plotted in centimorgans per megabase appear smoothed because: (1) The intermarker recombination fraction estimates depend on the intermarker distances and (2) the method assumes constancy of recombination rate over the distance covered by the flanking markers that are used to refine each intermarker estimate. Plots indicate that hotspots start to appear at as few as 10 flanking markers; large hotspots of 625 times background are well defined at this point (Figure 3D). Clearly, the larger the hotspot intensity is, the easier it is to detect and using more markers increases the ability to detect hotspots. The last column of Figure 3 indicates that when 60 flanking markers are used to refine the intermarker recombination rate estimates, the method can successfully detect hotspots of each intensity. Note that the presence of peaks at locations other than model hotspots may still be valid since our data are randomly generated.

Figure 4 shows results similar to those given in Figure 3 but where the current generation of offspring was created, allowing an expected value of four recombinations in their parental meioses per 1 Mb of simulated sequence. The solid crosses indicate the site of a recombination that occurred in the current generation. Clearly, recombination in the current generation does not have a significant impact on our estimates of population recombination rates. Our method appears to capture historical recombination rate information.

#### Consistency:

To assess more formally whether our method was able to consistently detect hotspots and to analyze the impact of marker density, simulations were repeated 1000 times for each hotspot intensity, *s* = 1, 10, 25, 125, 625 times background and for new marker densities 1 marker per 2 and 5 kb. We refer to a set of 1000 simulations as a simulation group. For each simulation we used our method to estimate intermarker recombination fractions for various numbers of flanking markers. For each set of estimated intermarker recombination fractions, we estimated hotspot intensity. Estimated hotspot intensity is calculated as the estimated mean genetic distance (centimorgans per megabase) at the simulated hotspot sites divided by that at the simulated nonhotspot sites.

Figure 5 shows the estimated mean hotspot intensity by flanking markers for each simulation group at marker density 1 marker per 2.5 kb. Dotted lines indicate 95% bootstrap percentile confidence intervals (Efron and Tibshirani 1993). For any value of flanking markers used, Figure 5 shows, as indicated in Figure 2, that estimates of hotspot intensity are correctly ranked, although they are not proportional to the true intensities; *i.e.*, the line for 625 times background is still higher than the line for 125 times background but not 5 times higher. Clearly the method can detect hotspots of different intensity but problems of scale are related to the way that the method utilizes nearby markers to refine estimates. The effect of refining estimates over multiple pairs of markers also smooths them.

To formally test whether a hotspot has been detected in simulations from groups where *s* > 1, we consider a null hypothesis that the simulation true hotspot intensity is 1. For a type I error rate of 0.05, we use the 95th percentile of estimated hotspot intensity value from simulation groups where *s* = 1 as critical values and accept the alternative hypothesis that a simulation true hotspot intensity is >1 if the estimated hotspot intensity from that simulation is greater than the corresponding critical value. We define the power of our test in a given simulation group to be the proportion of simulations from that group with estimated hotspot intensities greater than the corresponding critical value.
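The intensity statistic and the simulation-based power calculation can be sketched in Python as follows. The function names are hypothetical, and the nearest-rank percentile is one of several possible percentile conventions; the text does not say which was used.

```python
def estimated_intensity(rates, is_hotspot):
    """Mean rate at simulated hotspot sites divided by mean rate elsewhere."""
    hot = [r for r, h in zip(rates, is_hotspot) if h]
    cold = [r for r, h in zip(rates, is_hotspot) if not h]
    return (sum(hot) / len(hot)) / (sum(cold) / len(cold))


def percentile(values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    ordered = sorted(values)
    rank = (q * len(ordered) + 99) // 100  # ceiling of q * n / 100
    return ordered[rank - 1]


def power(null_intensities, alt_intensities, q=95):
    """Share of alternative simulations exceeding the null critical value."""
    critical = percentile(null_intensities, q)
    hits = sum(1 for x in alt_intensities if x > critical)
    return hits / len(alt_intensities)


# Toy numbers only: intensities from an s = 1 group and from an s > 1 group.
intensity = estimated_intensity([1.0, 1.0, 10.0, 10.0, 1.0],
                                [False, False, True, True, False])
null_group = [float(v) for v in range(1, 101)]
alt_group = [90.0, 96.0, 100.0, 95.0]
pw = power(null_group, alt_group)
```

In the toy example the critical value is 95, two of the four alternative values exceed it, and the estimated power is 0.5.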

Figure 6 shows the power of our test to detect a hotspot for each simulated true hotspot intensity, *s* = 10, 25, 125, and 625 times background, and marker density 1 marker per 2, 2.5, and 5 kb, according to the number of flanking markers. Power appears to increase with number of flanking markers used, up to a point: Graphs for hotspot intensities 10 and 25 times background indicate that using too many flanking markers can result in a decline of power. In regions of low hotspot intensity, using too many markers will over-smooth estimates to the extent that hotspots cannot be distinguished. Figure 6 also suggests that the more dense the marker spacing is, the greater the power to detect a hotspot for any given number of flanking markers. Clearly, the more dense the markers are, the more information in any given interval and so the more accurate the estimates over that interval.

Results suggest that this method can be used to detect variation in recombination rates. The resolution at which estimates can be accurately made will depend on the marker density in conjunction with the number of trios studied (results varying the number of trios are not shown here). Greater density of markers means accuracy at a greater resolution. Increasing the flanking markers will tend to increase the accuracy at the expense of resolution.

#### Real data at high resolution:

Existing genetic maps estimate recombination rates over megabase scales. We can obtain estimates at megabase scales using our model-free method by using a sliding-windows approach to average estimates over a given interval. We used HapMap (March 2005 release, www.hapmap.org) SNP genotypes from 30 CEPH trios of European ancestry (Dausset *et al.* 1990) on chromosomes 3 and 22 to compare estimates made in this way to those in the genetic map for a European population obtained by a pedigree-based method (Kong *et al.* 2002). Figure 7 shows results of these comparisons and indicates strong agreement between the methods. Further comparisons on all other chromosomes indicate similar results.

#### Real data at low resolution:

We compared European population recombination rates, across a 10-Mb section of chromosome 20, estimated from our method to those from independent estimates made using the approximate-likelihood-based method implemented in PHASE 2.0. Estimates were based on SNP genotype data from 12 CEPH pedigrees. The raw genotype data and samples have been described by Ke *et al.* (2004). The black line in Figure 8 shows estimates from our method, in centimorgans per megabase, between 5355 adjacent SNP markers genotyped from 21 trios from 12 CEPH pedigrees. The red line shows PHASE recombination rate estimates, in centimorgans per megabase, between 4513 of the 5355 adjacent SNP markers, genotyped from 46 founders from the same 12 CEPH pedigrees (Evans and Cardon 2005). The original estimates from PHASE are population recombination rates, ρ, and are dependent on both effective population size and per generation recombination rate. To scale PHASE estimates to centimorgans per megabase, we assumed a constant effective population size of *N*_{e} = 10,000. The Spearman rank correlation of recombination rate estimates made using each method between the 4513 common SNPs is 0.58 (*p* < 2.2 × 10^{−16}), indicating strong positive correlation between estimates from the two methods.
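The rescaling and the rank correlation can be sketched in Python. This assumes that each PHASE value is the population recombination rate for a whole interval (ρ = 4*N*_{e}*r*) and that recombination fractions are small enough to be read directly as map distances; the function names are hypothetical, and the tie-aware Spearman coefficient is a plain standard-library implementation rather than the routine the authors used.

```python
def rho_to_cm_per_mb(rho, interval_bp, n_e=10_000):
    """Convert a per-interval population rate rho = 4 * N_e * r to cM/Mb."""
    r = rho / (4.0 * n_e)
    return 100.0 * r / (interval_bp / 1e6)


def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for idx in order[start:end + 1]:
            out[idx] = (start + end) / 2.0 + 1.0
        start = end + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


rate = rho_to_cm_per_mb(400.0, 1_000_000)            # 1 cM/Mb
rs = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 25.0, 100.0])
```

Because the Spearman coefficient depends only on ranks, a different constant choice of *N*_{e} would rescale the PHASE estimates without changing the correlation.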

## DISCUSSION

We have developed a simple model-free method that attempts to disentangle the confounded effects of LD and linkage using nuclear family data. We have shown that the method is able to accurately detect variation across a region to a scale that depends on marker density. Our method can be used to take advantage of widespread trio data, requires no population genetic model, and is very easy to program and fast to compute: We programmed our method in R (www.r-project.org) and computation of estimates for 5355 SNPs genotyped in 21 trio families using 60 flanking markers takes <30 min on a cluster of nine 3.2-GHz processors with 1 GB random access memory.

To understand why the method works even when there are no observed recombinants, we consider the relationship between the possible distribution of offspring haplotypes and the size of LD between any two loci in the parental generation. If there are no recombinants, then the offspring haplotypes are simply a sample of the parental haplotypes. When LD is high in the parental generation, random sampling of haplotypes without recombination will tend to produce a sample with high LD too, in which case Equation 1 suggests that our method will estimate a low value of recombination fraction *c*. As the LD decreases in the parental generation, random sampling of haplotypes without recombination will tend to produce samples with an increasing range of possible LD values, in which case Equation 1 suggests that estimates of *c* will be, in general, higher. The strength of our method comes from combining estimates from neighboring markers.

We note that our method ignores dependence between markers when combining estimates. Further work is required to understand possible biases that may result or to devise alternative ways of combining estimates. We have shown that the method is able to capture variation in recombination rate but not absolute value. Our estimates are dependent upon a variety of interdependent parameters, namely, the original hotspot intensity, the number of flanking markers used, the marker density, and the number of parent-offspring trios available.

The Spearman rank correlation coefficient between our recombination rate estimates and those made from PHASE across a 10-Mb section of chromosome 20 was 0.58. We do not expect perfect correlation with PHASE for at least three reasons:

1. Our method ignores the effects of evolutionary sampling, which we regard as a distinguishing feature since it removes the need for accompanying assumptions about population demography.

2. PHASE estimates population recombination rates, which include the effective population size; our method estimates recombination fractions. To compare the two, we have assumed a fixed effective population size. Variability in the effective population size would account for some of the differences between recombination rate estimates made from each method.

3. Our method uses multiple overlapping markers to refine estimates. We therefore expect that our approach will produce estimates at a lower resolution than estimates from PHASE for the same number of markers. Our method requires more markers than PHASE to achieve results at an equivalent resolution to that of PHASE.

The aim of our method is not to outperform PHASE. PHASE is model based and can allow for patterns of population-level association, or LD, that can arise by chance as well as simply as a result of recombination rate variation. The trade-off in allowing for this complexity is extended computation times. The time taken to estimate rates using PHASE is dependent on the amount of LD: Chromosomal segments with low levels of LD took a minimum of 3–4 days to estimate recombination rates between just 70 markers, and some chromosomal segments with regions of high LD took over a month to estimate rates for that many markers.

To address the question of how to select the number of flanking markers *w*, we consider the case where all markers are an equal distance apart and let the variance for each original recombination fraction estimated from step 2 of the method be *V*. Then, the variance $V_{w}$ of the refined estimate can be approximated:

$$V_{w} \approx \frac{V}{\sum_{k=1}^{w} k^{3}} = \frac{4V}{w^{2}(w + 1)^{2}}.$$

Hence, the variance declines very rapidly as *w* first increases from 1. Selecting *w* = 2 reduces the variance of our estimate by ∼89%; *w* = 20 reduces it by ∼99.998%. Power curves in Figure 6 suggest that the power of our method to detect hotspots of intensity <100 using markers of any density begins to decline at ∼*w* = 20 markers. Obviously, for hotspots of small intensity, smoothing over so many markers attenuates the estimate to the extent that it cannot be distinguished from nearby estimates. For hotspots of intensity >100, power is maintained for larger values of *w*. Hence, we suggest *w* = 20 as a minimum for this method. Whether to increase *w* beyond 20 will depend on the aim of the project. Clearly, as *w* increases, estimates are smoothed and detection of smaller hotspots becomes problematic. However, as *w* increases, there will be greater accuracy of recombination rate variation, albeit at a lower resolution. Recall that the method assumes that the recombination rate is constant over the interval spanned by the flanking markers. Hence, given a specific marker density, the chosen value of *w* will define the scale at which recombination rate variation can be detected. If density is 1 marker per 2.5 kb and *w* = 20, for example, then rates can be accurate to a scale of 50 kb. If finer scales of recombination rate variation are required, then a greater density of markers will be required; *e.g*., markers at density 1 per 1 kb and *w* = 20 can give an accuracy to a scale of 20 kb.
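The quoted variance reductions can be reproduced with a short Python check. It rests on assumptions that go beyond the text: equally spaced markers, equal pairwise variance *V*, independent estimates, inverse-variance weighting, and exactly *k* overlapping pairs of span *k* for each *k* up to *w*, each contributing a scaled variance of *V*/*k*². Under those assumptions the refined variance is *V* divided by the sum of *k*³ for *k* = 1, …, *w*. The function names are hypothetical.

```python
def variance_reduction(w):
    """Fractional reduction in variance relative to a single adjacent-pair estimate."""
    total = sum(k ** 3 for k in range(1, w + 1))  # equals (w * (w + 1) / 2) ** 2
    return 1.0 - 1.0 / total


def scale_kb(w, spacing_kb):
    """Approximate scale, in kb, over which the rate is assumed constant."""
    return w * spacing_kb


red_2 = variance_reduction(2)     # about 0.889
red_20 = variance_reduction(20)   # about 0.99998
scale = scale_kb(20, 2.5)         # 50 kb
```

Most of the reduction is achieved at small *w*, so larger values mainly trade resolution for smoothness.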

We have assumed that haplotypes can be accurately measured in parental and offspring generations. However, for data collected on nuclear families, haplotypes in parents may not be uniquely resolved. Typing of additional siblings may help to resolve parental haplotypes but since our method uses the parental haplotypes only to gain a measure of LD in the parental generation, an alternative approach would be to use composite LD measures to estimate LD using parental genotypes instead (Schaid 2004). A necessary condition for ambiguity in offspring haplotype estimation between two loci is that one parent and the offspring are doubly heterozygous and the other parent is heterozygous at at least one of the loci (Dudbridge *et al.* 2000). The probability that an offspring is heterozygous at a locus given his parental genotypes is . If the probability of being heterozygous at a locus is *H*, then the probability of ambiguity in offspring haplotype isif haplotype loci are dependent andif haplotype loci are independent. Since the maximum value of *H* is , probabilities of ambiguity in offspring haplotype could range from to . On average the heterozygosity at any given locus in HapMap is ∼ (www.hapmap.org). Hence ambiguity in offspring haplotype estimation in HapMap reduces in range from to . The impact is further reduced by the averaging of estimates over multiple markers. Methods to incorporate haplotype uncertainty into the likelihood of transmitted haplotypes are possibilities for further work.

To treat recombination fractions between pairs of markers flanking an adjacent pair as a scaled version of that between the adjacent pair we have assumed that recombination fractions are additive. This is approximately true over short distances (<2 cM). To allow for the effects of interference, the method could be modified to convert each estimate of a recombination fraction made in step 2 to a map distance via an appropriate mapping function, *e.g*., Kosambi. Map distances between markers flanking an adjacent pair are then scaled versions of the map distance between the adjacent pair and the method could proceed as described in step 3, but by replacing recombination fractions with map distances.
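As an illustration of the suggested modification, the Kosambi mapping function and its inverse can be written in Python as follows. The function names are hypothetical; the formula is the standard Kosambi map, with distance in Morgans equal to one-quarter of the natural logarithm of (1 + 2*c*)/(1 − 2*c*).

```python
import math


def kosambi_cm(c):
    """Kosambi map distance in cM for a recombination fraction 0 <= c < 0.5."""
    return 25.0 * math.log((1.0 + 2.0 * c) / (1.0 - 2.0 * c))


def kosambi_inverse(d_cm):
    """Recombination fraction corresponding to a Kosambi distance in cM."""
    return 0.5 * math.tanh(2.0 * d_cm / 100.0)


d_small = kosambi_cm(0.01)                 # close to 1 cM
c_round = kosambi_inverse(kosambi_cm(0.1))
```

For a recombination fraction of 0.01 the map distance is within about 0.02% of 1 cM, which illustrates why additivity is a reasonable approximation at the marker spacings considered here.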

To determine normal confidence intervals presented in this article we have assumed independence between individual estimates made for each pair of markers. Further work might include more accurate estimation of the distribution of our estimator, using bootstrap or jackknife techniques (Efron and Tibshirani 1993).
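One way such a bootstrap might look is sketched below in Python. It is a generic percentile bootstrap and not a procedure described in this article: it assumes trios are the independent resampling units, that `estimator` is any function mapping a list of units to a single number (for example, a refined recombination fraction), and that a fixed seed is acceptable. The name `bootstrap_ci` and the toy data are hypothetical.

```python
import random


def bootstrap_ci(units, estimator, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for estimator(units)."""
    rng = random.Random(seed)
    stats = sorted(
        estimator([rng.choice(units) for _ in units])  # resample with replacement
        for _ in range(n_boot)
    )
    tail = (100 - level) / 2.0
    lo_idx = int(n_boot * tail / 100.0)
    hi_idx = int(n_boot * (100.0 - tail) / 100.0) - 1
    return stats[lo_idx], stats[hi_idx]


def mean(values):
    return sum(values) / len(values)


# Toy data standing in for per-trio contributions to an estimate.
data = [float(v) for v in range(10)]
low, high = bootstrap_ci(data, mean)
```

Resampling whole trios keeps the dependence between overlapping marker pairs within each replicate, which is the feature that the normal approximation ignores.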

## Acknowledgments

We thank P. Visscher and two anonymous reviewers for comments on the manuscript. We are also grateful to A. Morris for helpful critical discussions and to D. Evans for supplying PHASE estimates of population recombination rates. This work was supported by the Wellcome Trust and by a HapMap grant from the National Institutes of Health.

## Footnotes

Communicating editor: P. J. Oefner

- Received June 29, 2005.
- Accepted August 2, 2005.

- Copyright © 2005 by the Genetics Society of America