Originally published as Genetics Published Articles Ahead of Print on August 22, 2005.

Genetics, Vol. 171, 2085-2095, December 2005, Copyright © 2005
doi:10.1534/genetics.105.047431

Disentangling Linkage Disequilibrium and Linkage From Dense Single-Nucleotide Polymorphism Trio Data

Wellcome Trust Centre for Human Genetics, Oxford University, Oxford OX3 7BN, United Kingdom

1 Corresponding author: Wellcome Trust Centre for Human Genetics, Oxford University, Roosevelt Dr., Oxford OX3 7BN, United Kingdom.
E-mail: gclarke{at}well.ox.ac.uk

Manuscript received June 29, 2005. Accepted for publication August 2, 2005.

ABSTRACT

Parent-offspring trios are widely collected for disease gene-mapping studies and are being extensively genotyped as part of the International HapMap Project. With dense maps of markers on trios, the effects of LD and linkage can be separated, allowing estimation of recombination rates in a model-free setting. Here we define a model-free multipoint method on the basis of dense sequence polymorphism data from parent-offspring trios to estimate intermarker recombination rates. We use simulations to show that this method has up to 92% power to detect recombination hotspots of intensity 25 times background over a region of size 10 kb typed at density 1 marker per 2.5 kb and almost 100% power to detect large hotspots of intensity >125 times background over regions of size 10 kb typed with just 1 marker per 5 kb ({alpha} = 0.05). We found strong agreement at megabase scales between estimates from our method applied to HapMap trio data and estimates from the genetic map. At finer scales, using Centre d'Etude du Polymorphisme Humain (CEPH) pedigree data across a 10-Mb region of chromosome 20, a comparison of population recombination rate estimates obtained from our method with estimates obtained using a coalescent-based approximate-likelihood method implemented in PHASE 2.0 shows detection of the same coldspots and most hotspots: The Spearman rank correlation between the estimates from our method and those from PHASE is 0.58 (p < 2.2–16).


LINKAGE disequilibrium (LD) refers to the presence of nonrandom association among two or more alleles at distinct loci. The degree of LD between alleles is determined by an evolutionary process combining the effects of natural selection, mutation, genetic drift, and population admixture. Appreciable LD is likely to be present only between two loci that are tightly linked. For this reason, attempts have been made to use the magnitude of LD between two loci to obtain estimates of recombination fractions between loci, e.g., SERRE et al. (1990). Transmission of haplotypes from parents to offspring is dependent on both LD and linkage (SPIELMAN et al. 1993). Specifically, the extent of LD between two distinct loci decays exponentially at a rate proportional to the recombination fraction c between the loci: If the disequilibrium between the two loci is D0 at time 0, then after k generations the disequilibrium will be

Formula 1(1)
Hence, if either the time span or the recombination rate is known, then, under simplifying assumptions that ignore the ongoing stochastic effect of the evolutionary process, the other parameter can be estimated from the coefficients of LD (LANDER and BOTSTEIN 1986). In practice, such estimates are problematic because the sampling variance of LD tends to be very large due to the evolutionary sampling (HILL and WEIR 1994), resulting in large standard errors in the estimate of c. Motivated by the extensive genotyping on familial trios (mother, father, and offspring) in the International HapMap Project (INTERNATIONAL HAPMAP CONSORTIUM 2003), as well as by the availability of large trio samples in several disease cohorts, we construct a simple, robust, approach to disentangling the effects of LD and linkage in the transmission of data from parent to offspring. Our method relates the genetic variation in a population sample over one generation to the underlying recombination rate and uses a multipoint approach to reduce the standard error normally associated with LD-based methods. We make only the usual assumptions of genetic equilibrium required for establishment of Hardy-Weinberg equilibrium. Specifically, we make the following explicit assumptions: infinite population size, discrete generations, random mating, no selection, no migration, no mutation, and no difference across sexes in parental genotype frequencies. In this sense, our method can be described as model-free. The transmission data between multiple pairs of overlapping densely spaced markers is combined to reduce variance and produce accurate estimates of recombination rates between adjacent markers. Our method generates short-range smoothness in recombination rates over the region of the chromosome spanned by overlapping markers and allows for rate estimation on various scales limited only by the density of the markers used.

More sophisticated methods applying models to account for stochastic effects have been developed and used successfully for the fine mapping of disease genes using allelic associations in isolated populations, for example, those in HASTBACKA et al. (1992), KAPLAN and WEIR (1995), KAPLAN et al. (1995), and DEVLIN et al. (1996). Also, many coalescent model-based methods exist for the estimation of recombination rates from patterns of sequence variation in population genetic data. Methods include moment-based estimators (WALL 2000) and full-likelihood (GRIFFITHS and MARJORAM 1996; KUHNER et al. 2000; NIELSON 2000; FEARNHEAD and DONNELLY 2001) and approximate-likelihood approaches (HUDSON 2001; FEARNHEAD and DONNELLY 2002; MCVEAN et al. 2002; LI and STEPHENS 2003). Methods are further described in STUMPH and MCVEAN (2003).

We define a hotspot of intensity s to be a region of length 10 kb where crossovers occur a frequency of more than s times that in the surrounding region. We use simulations to show that, using 200 trio families, our method can reliably detect recombination hotspots of intensity 10 across a region typed with 20 markers at a density of 1 marker per 2.5 kb (P < 0.025). Hotspots of an intensity >25 can be detected by typing as few as 5 markers at a density of 1 marker per 2.5 kb (P < 0.001). We show that the number of markers required to achieve a given accuracy decreases with marker density. Results suggest that this method is able to detect fine-scale variation in recombination rate according to the density of typed markers: That is, the more dense the marker spacing is, the more fine scale the variation that can be detected. To validate our approach, we estimated recombination rates using HapMap Centre d'Etude du Polymorphisme Humain (CEPH) trio data from various chromosomes and compared averages over a megabase scale to estimates from the genetic map (KONG et al. 2002). At a finer scale, we compared recombination rate estimates in European (CEPH) populations between adjacent SNPs on a 10-Mb region of chromosome 20 made using our model-free method to independent estimates made from a coalescent model-based approximate-likelihood approach implemented in the program PHASE 2.0 (STEPHENS et al. 2001; STEPHENS and DONNELLY 2003). These independent estimates are described in detail in EVANS and CARDON (2005).


METHODS
We assume that we have phase-known genotype data from N familial trios at L densely spaced, ordered, biallelic markers. See DISCUSSION for comments on how to handle phase-unknown genotype data. Markers are labeled 1, ... , L. We summarize the data at all pairs of markers up to a given distance apart (e.g., all pairs of markers up to w markers apart, w < L) as counts of observed haplotypes in both parental and offspring generations. The method can be summarized in three stages as follows:
  1. For all pairs of markers, we first estimate the probability of observing each haplotype in the parental population. This provides an estimate of LD in the parental generation.
  2. For each pair of markers, we write down the likelihood of transmitting the observed offspring haplotype counts conditional on the LD in the parental generation, that is, assuming the LD in the parental generation is a known parameter. The likelihood of transmitted haplotype counts at each pair of markers is a function of the recombination fraction between the markers. The recombination fraction is defined as the probability that a transmitted haplotype constitutes a new combination of alleles different from that of either parental haplotype. Maximum-likelihood estimates (MLEs) of the recombination fraction are then obtained, independently, at all pairs of markers.
  3. Finally, for any given pair of adjacent markers, all estimates from marker pairs that overlap the adjacent pair, up to a given distance apart, are scaled according to the interval spanned by the pair of markers relative to the interval spanned by the adjacent markers under consideration. Each scaled estimate represents a new estimate of the recombination fraction between the adjacent pair of markers. The original estimate between each pair of markers and all new estimates are then averaged (weighted by variance) to form a final, new estimate of the recombination fraction between each pair of adjacent markers. These steps are now described in greater detail.

Step 1—determine LD in the parental generation:

Consider any two biallelic markers with alleles arbitrarily labeled A and a at the first locus and B and b at the second locus. Let c be the recombination fraction between the markers. We assume that haplotypes are available or have been estimated from a sample of N parent-offspring trios. The four possible haplotypes are ab, aB, Ab, and AB and the observed number of these haplotypes in the parental generation is denoted by m = (mab, maB, mAb, mAB), with mab + maB + mAb + mAB = 4N. The likelihood of a specific sample configuration is

Formula 2(2)
where pab, paB, pAb, and pAB are the frequencies of haplotypes ab, aB, Ab, and AB, respectively, in the parental generation. The MLE of pab is Formula 2 and is denoted by Formula 2 and likewise for paB, pAb, and pAB. Let estimates of major allele frequencies at each locus be given by Formula 2 and Formula 2. Using the standard notation for an LD coefficient, we let

Formula 2
denote the LD in the parental generation. Let

Formula 2
denote the MLE of LD in the parental generation.

Step 2—infer recombination fraction c between markers:

Let p'ab be the frequency of haplotype ab in the offspring generation. Then, by conditioning on all possible parental haplotype combinations from which an ab haplotype can be transmitted, it is easy to see that

Formula 3(3)
Similarly, it is easy to show that the frequencies of haplotypes aB, Ab, and AB in the offspring generation are, respectively,

Formula 3

Formula 3
and

Formula 3

Let the observed number of haplotypes in the offspring generation be denoted by

Formula 3
where nab + naB + nAb + nAB = 2N. Using MLEs of parental haplotype probabilities, the likelihood at c for a specific sample configuration of offspring haplotypes is therefore given by

Formula 4(4)
MLEs Formula 4 of c are obtained by selecting the value of c that maximizes Equation 4. The information I(c) for c is defined as minus the second derivative of Equation 4:

Formula 5(5)
I(c) evaluated at Formula 5 is the observed information and is inversely proportional to the asymptotic variance of Formula 5. Let Formula 5 denote the variance of estimate Formula 5. Observe that when Formula 5 is small, as would be the case in recombination hotspots, the likelihood has less dependence on the exact value of c and MLEs will have large variance. Conversely, when Formula 5 is large, estimates of c will be more precise with small variance.

Step 3—use multiple closely linked flanking markers to refine estimate of c:

Let di,j be the distance, in base pairs, between any two markers i and j. Let ci,j represent the recombination fraction between markers i and j and let Formula 5 be the MLE of ci,j obtained using the method described above. Individual estimates of Formula 5 can be highly variable. To refine our estimate Formula 5 between adjacent markers i and i + 1, we combine estimates Formula 5 obtained at multiple pairs of markers j and k up to w markers away from markers i and i + 1, i.e., pairs of markers in S := {(j, k) : max(0, i + 1 – w) ≤ j < k ≤ min(i + w, L)}. These represent all the markers that overlap markers i and i + 1 and are within w markers of them and are referred to as the flanking markers. We assume that the rate at which recombination occurs remains constant over the region of the chromosome spanned by the flanking markers. Since recombination fractions are approximately additive over short distances and our markers are densely spaced, we assume that the intermarker recombination fractions estimated between pairs of flanking markers are scaled estimates of the intermarker recombination fraction between markers i and i + 1. Specifically, let

Formula 5
be an estimate of ci,i+1 on the basis of data from the {j, k} interval. Then the variance of Formula 5 is

Formula 5

We form a new estimate Formula 5 of ci,i+1, by taking the weighted average of all estimates Formula 5, (j, k) isin S. Specifically, let

Formula 6(6)
where

Formula 6
is the weight given to the estimated intermarker recombination fraction between markers i and i + 1 made on the basis of information from markers j and k. Figure 1 illustrates these steps in more detail.


Figure 1
View larger version (10K):
In this window
In a new window
Download PPT slide
 
FIGURE 1.—

Consider four markers with intermarker distances noted. MLEs Formula 6, Formula 6, Formula 6, Formula 6, Formula 6, and Formula 6 of the intermarker recombination fractions are made for each pair of markers. New estimates for the interval between markers 2 and 3 are made from markers 1 and 3, markers 2 and 4, and markers 1 and 4 as Formula 6, Formula 6, and Formula 6, respectively. A weighted average of these new estimates provides a final refined estimate, Formula 6, of the intermarker recombination fraction between markers 2 and 3.

 
If estimates are independent and N large then Formula 6 is approximately distributed as a normal random variable with mean ci,i+1 and variance Formula 6 (EDWARDS 1992), where

Formula 7(7)

However, estimates Formula 7, (j, k) isin S, are, in fact, dependent. For the analysis presented here, we violate the assumption of independence and use the formula given in Equation 7 to get rough approximations of normal confidence intervals. Refer to the DISCUSSION for alternative possibilities for computing the variance.

Simulations:

To test the accuracy of our method for estimating recombination fractions between loci we used the program fin (software available on request from G. A. T. Mcvean, mcvean{at}stats.ox.ac.uk) to generate SNP genotype data simulated from a standard coalescent neutral model of genetic variation. The basic simulation involved 400 sequences of 400 SNPs with an overall population recombination rate, {rho} = 4Ner = 400, where Ne is the effective population size and r is the per generation recombination rate, that is, the probability of a crossover per generation across the 1-Mb region. For an effective population size of Ne = 10,000, then r = 0.01 over the 1-Mb region, equivalent to a genetic distance of 1 cM/Mb, which is approximately the rate seen in human data derived from observed crossovers in pedigrees (KONG et al. 2002). Average SNP spacing is 1 SNP per d = 2.5 kb. The population recombination rate was set at a constant background value over the simulated region but increased to size s = 10 times the background rate at three evenly spaced intervals of length 10 kb. Hence, for this simulation, there were 385 markers with intermarker recombination rate value 0.748; the remaining 15 markers were simulated to be in three hotspots of 5 markers with intermarker recombination rate value 7.48. Thus, in total, the overall population recombination rate is {rho} = 15 x 7.48 + 385 x 0.748 = 400, as required, and crossovers between markers in the hotspot were simulated to occur 10 times more frequently than those in the surrounding sequence. Samples were then randomly mated using the program tdtsim (software available on request from A. Morris, amorris{at}well.ox.ac.uk) to produce 200 offspring samples with zero recombination. We show that recombination in current generation meioses does not have a significant impact on estimated recombination rates using our method. Simulations were repeated as described for hotspot intensities s = 1, 25, 125, and 625. Obviously, simulations with s = 1 have uniform recombination rate across the region, i.e., no hotspots, and are used for comparison.

Intermarker recombination fraction estimates were estimated for each simulation and converted to genetic distances in centimorgans per megabase as follows. c is multiplied by 100, to convert to centimorgans, and then divided by the intermarker interval length in megabases, to convert to centimorgans per megabase. To ensure a mean value of 1 cM/Mb across the region, estimates are scaled by dividing by the estimated total genetic distance across the 1-Mb region. Since recombination fractions are <0.01, conversion to genetic distances requires no mapping function.


RESULTS
Figure 2 shows intermarker recombination fraction estimates for randomly selected markers according to the number of pairs of overlapping markers used to refine each estimate. Observe that estimates have large confidence intervals when using information from just the adjacent pair (equivalent to 1 overlap or 1 flanking marker) but the variance reduces, and estimates settle to specific values, as the number of flanking markers used to refine the estimate increases. Clearly, information from a single pair of markers does not provide enough information to accurately estimate intermarker recombination fractions in this way. The strength of our method comes from combining estimates at pairs of overlapping markers. The number of flanking markers to use will depend on the marker density, the scale of recombination to be detected, and the number of available trios. Appropriate selection of w is discussed in the DISCUSSION.


Figure 2
View larger version (49K):
In this window
In a new window
Download PPT slide
 
FIGURE 2.—

Estimates of intermarker recombination fractions for randomly selected markers according to the number of overlapping markers used to refine the estimate. Axes at the top of each plot show how many flanking markers have been used to refine the estimate. Axes at the bottom of each plot show the equivalent number of overlapping pairs of markers that have been used to refine the estimate. For example, allowing up to 5 flanking markers creates 5 + 4 + 3 + 2 + 1 = 15 overlapping marker pairs used to refine the intermarker estimate. Each row shows estimates for randomly selected adjacent markers from simulations with true hotspot intensities (A) 10, (B) 25, (C) 125, and (D) 625. The value HS in under each plot indicates how many markers away from a hotspot the represented marker pair are located: Plots i and ii show results for intermarker estimates located within a hotspot (HS = 0); plots iii–v show results for intermarker estimates located outside a hotspot. The dotted lines indicate 95% normal confidence intervals. Results are for simulations in 200 trios of 400 SNP markers at a mean spacing of 2.5 kb over a 1-Mb region with three evenly spaced hotspots.

 
Figure 3 shows intermarker recombination rates, in centimorgans per megabase, from simulations described for Figure 2. Each row corresponds to a different true hotspot intensity, s = 10, 25, 125, and 625, and illustrates results at 10, 20, 40, and 60 flanking markers. Estimates plotted in centimorgans per megabase appear smoothed because: (1) The intermarker recombination fraction estimates depend on the intermarker distances and (2) the method assumes constancy of recombination rate over the distance covered by the flanking markers that are used to refine each intermarker estimate. Plots indicate that hotspots start to appear at as few as 10 flanking markers; large hotspots of 625 times background are well defined at this point (Figure 3D). Clearly, the larger the hotspot intensity is, the easier it is to detect and using more markers increases the ability to detect hotspots. The last column of Figure 3 indicates that when 60 flanking markers are used to refine the intermarker recombination rate estimates, the method can successfully detect hotspots of each intensity. Note that the presence of peaks at locations other than model hotspots may still be valid since our data are randomly generated.


Figure 3
View larger version (34K):
In this window
In a new window
Download PPT slide
 
FIGURE 3.—

Estimates of intermarker recombination rates, in centimorgans per megabase, plotted at marker midpoint location (in megabases) for simulations with true hotspot intensities (A) 10, (B) 25, (C) 125, and (D) 625 times background. Each row shows results for the given true hotspot intensity at w = 10, 20, 40, and 60 flanking markers: w flanking markers means that intermarker recombination fraction estimates from all marker pairs within w markers of a given pair of markers were used to refine the estimate at that given marker pair. Results are for simulations in 200 trios of 400 SNP makers at a mean spacing of 2.5 kb over a 1-Mb region with three evenly spaced hotspots. The solid vertical dashed lines indicate the true locations of the hotspots. Refer to text for discussion of how intermarker recombination fraction estimates Formula 7 are converted to genetic distances in centimorgans per megabase.

 
Figure 4 shows results similar to those given in Figure 3 but where the current generation of offspring was created, allowing an expected value of four recombinations in their parental meioses per 1 Mb of simulated sequence. The solid crosses indicate the site of a recombination that occurred in the current generation. Clearly, recombination in the current generation does not have a significant impact on our estimates of population recombination rates. Our method appears to capture historical recombination rate information.


Figure 4
View larger version (22K):
In this window
In a new window
Download PPT slide
 
FIGURE 4.—

Estimates of intermarker recombination rates, in centimorgans per megabase, plotted at marker midpoint location (in megabases) for simulations with true hotspot intensities (A) 10, (B) 25, (C) 125, and (D) 625 times background and where offspring have been generated with a mean population recombination rate of four recombinations per megabase per generation over the simulated sequence. Results are for w = 60 flanking markers: w flanking markers means that intermarker recombination fraction estimates from all marker pairs within w markers of a given pair of markers were used to refine the estimate at that given marker pair. Results are for simulations in 200 trios of 400 SNP markers at a mean spacing of 2.5 kb over a 1-Mb region with three evenly spaced hotspots. The solid vertical dashed lines indicate the true locations of these hotspots. The solid crosses indicate the true locations of recombinations occurring in offspring meioses. Refer to the text for a discussion of how intermarker recombination fraction estimates Formula 7 are converted to genetic distances in centimorgans per megabase.

 

Consistency:

To assess more formally whether our method was able to consistently detect hotspots and to analyze the impact of marker density, simulations were repeated 1000 times for each hotspot intensity, s = 1, 10, 25, 125, 625 times background and for new marker densities 1 marker per 2 and 5 kb. We refer to a set of 1000 simulations as a simulation group. For each simulation we used our method to estimate intermarker recombination fractions for various numbers of flanking markers. For each set of estimated intermarker recombination fractions, we estimated hotspot intensity. Estimated hotspot intensity is calculated as the estimated mean genetic distance (centimorgans per megabase) at the simulated hotspot sites divided by that at the simulated nonhotspot sites.

Figure 5 shows the estimated mean hotspot intensity by flanking markers for each simulation group at marker density 1 marker per 2.5 kb. Dotted lines indicate 95% bootstrap percentile confidence intervals (EFRON and TIBSHIRANI 1993). For any value of flanking markers used, Figure 5 shows, as indicated in Figure 2, that estimates of hotspot intensity are relatively correct although they are not linearly related; i.e., the line for 625 times background is still higher than the line for 125 times background but not 5 times greater. Clearly the method can detect hotspots of different intensity but problems of scale are related to the way that the method utilizes nearby markers to refine estimates. The effect of refining estimates over multiple pairs of markers also smoothes them.


Figure 5
View larger version (20K):
In this window
In a new window
Download PPT slide
 
FIGURE 5.—

Estimated mean hotspot intensity coefficient over 1000 simulations with true hotspot intensities (A) 10, (B) 25, (C) 125, and (D) 625 times background according to the number of flanking markers used to refine estimates. For each simulation, intermarker recombination rates are estimated for flanking markers, w = 1, ... , 10, 20, 30, 40, and 60. w flanking markers means that intermarker recombination fraction estimates from all marker pairs within w markers of a given pair of markers were used to refine the estimate at that given marker pair. For each value of w, the hotspot intensity coefficient is then calculated as the estimated mean recombination rate (centimorgans per megabase) at the simulated hotspot sites divided by that at the simulated nonhotspot sites. The estimated mean hotspot intensity coefficient for a simulation group is then given by averaging values over all simulations. The dotted lines are 95% bootstrap confidence intervals. Results are for simulations in 200 trios of 400 SNP markers at a mean spacing of 2.5 kb over a 1-Mb region with three evenly spaced hotspots.

 
To formally test whether a hotspot has been detected in simulations from groups where s > 1, we consider a null hypothesis that the simulation true hotspot intensity is 1. For a type I error rate of 0.05, we use the 95th percentile of estimated hotspot intensity value from simulation groups where s = 1 as critical values and accept the alternative hypothesis that a simulation true hotspot intensity is >1 if the estimated hotspot intensity from that simulation is greater than the corresponding critical value. We define the power of our test in a given simulation group to be the proportion of simulations from that group with estimated hotspot intensities greater than the corresponding critical value.

Figure 6 shows the power of our test to detect a hotspot for each simulated true hotspot intensity, s = 10, 25, 125, and 625 times background, and marker density 1 marker per 2, 2.5, and 5 kb, according to the number of flanking markers. Power appears to increase with number of flanking markers used, up to a point: Graphs for hotspot intensities 10 and 25 times background indicate that using too many flanking markers can result in a decline of power. In regions of low hotspot intensity, using too many markers will over-smooth estimates to the extent that hotspots cannot be distinguished. Figure 6 also suggests that the more dense the marker spacing is, the greater the power to detect a hotspot for any given number of flanking markers. Clearly, the more dense the markers are, the more information in any given interval and so the more accurate the estimates over that interval.


Figure 6
View larger version (20K):
In this window
In a new window
Download PPT slide
 
FIGURE 6.—

Power to detect a hotspot of true intensity (A) 10, (B) 25, (C) 125, and (D) 625 times background according to the number w of flanking markers used to refine estimates: w flanking markers means that intermarker recombination fraction estimates from all marker pairs within w markers of a given pair of markers were used to refine the estimate at that given marker pair. Results are for simulations in 200 trios of 400 SNP markers at mean spacings of 2, 2.5, and 5 kb: Each line corresponds to a different marker density, as indicated in the inset. Type I error rate is 0.05.

 
Results suggest that this method can be used to detect variation in recombination rates. The resolution at which estimates can be accurately made will depend on the marker density in conjunction with the number of trios studied (results varying the number of trios are not shown here). Greater density of markers means accuracy at a greater resolution. Increasing the flanking markers will tend to increase the accuracy at the expense of resolution.

Real data at high resolution:

Existing genetic maps estimate recombination rates over megabase scales. We can obtain estimates at megabase scales using our model-free method by using a sliding-windows approach to average estimates over a given interval. We used HapMap (March 2005 release, www.hapmap.org) SNP genotypes from 30 CEPH trios of European ancestry (DAUSSET et al. 1990) on chromosomes 3 and 22 to compare estimates made in this way to those in the genetic map for a European population obtained by a pedigree-based method (KONG et al. 2002). Figure 7 shows results of these comparisons and indicates strong agreement between the methods. Further comparisons on all other chromosomes indicate similar results.


Figure 7
View larger version (17K):
In this window
In a new window
Download PPT slide
 
FIGURE 7.—

Estimated mean recombination rates at a 2-Mb scale on chromosome 3 (top) and on the long arm of chromosome 22 (bottom). The black lines show estimates from our method based on 70,470 (chromosome 3) and 19,017 (chromosome 22) SNP markers from HapMap CEPH trio data (release March 2005) averaged over 2-Mb windows at 1-Mb intervals. The red lines show sex-averaged recombination rates estimated from pedigree data (KONG et al. 2002).

 

Read data at low resolution:

We compared European population recombination rates, across a 10-Mb section of chromosome 20, estimated from our method to those from independent estimates made using the approximate-likelihood-based method implemented in PHASE 2.0. Estimates were based on SNP genotype data from 12 CEPH pedigrees. The raw genotype data and samples have been described by KE et al. (2004). The black line in Figure 8 shows estimates from our method, in centimorgans per megabase, between 5355 adjacent SNP markers genotyped from 21 trios from 12 CEPH pedigrees. The red line shows PHASE recombination rate estimates, in centimorgans per megabase, between 4513 of the 5355 adjacent SNP markers, genotyped from 46 founders from the same 12 CEPH pedigrees (EVANS and CARDON 2005). The original estimates from PHASE are population recombination rates, {rho}, and are dependent on both effective population size and per generation recombination rate. To scale PHASE estimates to centimorgans per megabase, we assumed a constant effective population size of Ne = 10,000. The Spearman rank correlation of recombination rate estimates made using each method between the 4513 common SNPs is 0.58 (p < 2.2 x 10–16), indicating strong positive correlation between estimates from the two methods.


Figure 8
View larger version (37K):
In this window
In a new window
Download PPT slide
 
FIGURE 8.—

Estimates of intermarker recombination rates (centimorgans per megabase) plotted at marker midpoint location (in megabases) across a 10-Mb region of chromosome 20: 20q12–20q13.13. The red line shows results from PHASE based on 4513 SNP markers genotyped on 46 CEPH founders. The black line shows results from our method based on 5355 SNP markers genotyped on 21 CEPH founder-offspring trios. We used 60 flanking markers to refine our intermarker estimates: w flanking markers, for example, means that intermarker recombination fraction estimates from all marker pairs within w markers of a given pair of markers were used to refine the estimate at that given marker pair. Estimates made using our method are scaled to ensure that the average of centimorgans per megabase across the region matches that from pedigree data (KONG et al. 2002).

 


DISCUSSION
We have developed a simple model-free method that attempts to disentangle the confounded effects of LD and linkage using nuclear family data. We have shown that the method is able to accurately detect variation across a region to a scale that depends on marker density. Our method can be used to take advantage of widespread trio data, requires no population genetic model, and is very easy to program and fast to compute: We programmed our method in R (www.r-project.org) and computation of estimates for 5355 SNPs genotyped in 21 trio families using 60 flanking markers takes <30 min on a cluster of nine 3.2-GHz processors with 1 GB random access memory.

To understand why the method works even when there are no observed recombinants, we consider the relationship between the possible distribution of offspring haplotypes and the size of LD between any two loci in the parental generation. If there are no recombinants, then the offspring haplotypes are simply a sample of the parental haplotypes. When LD is high in the parental generation, random sampling of haplotypes without recombination will tend to produce a sample with high LD too, in which case Equation 1 suggests that our method will estimate a low value of recombination fraction c. As the LD decreases in the parental generation, random sampling of haplotypes without recombination will tend to produce samples with an increasing range of possible LD values, in which case Equation 1 suggests that estimates of c will be, in general, higher. The strength of our method comes from combining estimates from neighboring markers.

We note that our method ignores dependence between markers when combining estimates. Further work is required to understand possible biases that may result or to devise alternative ways of combining estimates. We have shown that the method is able to capture variation in recombination rate but not absolute value. Our estimates are dependent upon a variety of interdependent parameters, namely, the original hotspot intensity, the number of flanking markers used, the marker density, and the number of parent-offspring trios available.

The Spearman rank correlation coefficient between our recombination rate estimates and those made from PHASE across a 10-Mb section of chromosome 22 was 0.58. We do not expect perfect correlation with PHASE for at least three reasons:

Our method ignores the effects of evolutionary sampling, which we regard as a distinguishing feature since it removes the need for accompanying assumptions about population demography.
PHASE estimates population recombination rates, which include the effective population size; our method estimates recombination fractions. To compare the two, we have assumed a fixed effective population size. Variability in the effective population size would account for some of the differences between recombination rate estimates made from each method.
Our method uses multiple overlapping markers to refine estimates. We therefore expect that our approach will produce estimates at a lower resolution than estimates from PHASE for the same number of markers. Our method requires more markers than PHASE to achieve results at an equivalent resolution to that of PHASE.

The aim of our method is not to outperform PHASE. PHASE is model based and can allow for patterns of population-level association, or LD, that can arise by chance as well as simply as a result of recombination rate variation. The trade-off in allowing for this complexity is extended computation times. The time taken to estimates rates using PHASE is dependent on the amount of LD: Chromosomal segments with low levels of LD took a minimum of 3–4 days to estimate recombination rates between just 70 markers, and some chromosomal segments with regions of high LD took over a month to estimate rates for that many markers.

To address the question of how to select the number of flanking markers w, we consider the case where all markers are an equal distance apart and let the variance for each original recombination fraction estimated from step 2 of the method be V. Then, the variance of the refined estimate can be approximated:

Formula 7
Hence, the variance declines very rapidly as w first increases from 1. Selecting w = 2 reduces the variance of our estimate by ~89%; w = 20 reduces it by ~99.998%. Power curves in Figure 6 suggest that the power of our method to detect hotspots of intensity <100 using markers of any density begins to decline at ~w = 20 markers. Obviously, for hotspots of small intensity, smoothing over so many markers causes depletion of the estimate to the extent that it cannot be detected from nearby estimates. For hotspots of intensity >100, power is maintained for larger values of w. Hence, we suggest w = 20 as a minimum for this method. Increasing w > 20 will depend on the aim of the project. Clearly, as w increases, estimates are smoothed and detection of smaller hotspots becomes problematic. However, as w increases, there will be greater accuracy of recombination rate variation, albeit at a lower resolution. Recall that the method assumes that the recombination rate is constant over the interval spanned by the flanking markers. Hence, given a specific marker density, the chosen value of w will define the scale at which recombination rate variation can be detected. If density is 1 marker per 2.5 kb and w = 20, for example, then rates can be accurate to a scale of 50 kb. If finer scales of recombination rate variation are required, then a greater density of markers will be required; e.g., markers at density 1 per 1 kb and w = 20 can give an accuracy to a scale of 20 kb.

We have assumed that haplotypes can be accurately measured in parental and offspring generations. However, for data collected on nuclear families, haplotypes in parents may not be uniquely resolved. Typing of additional siblings may help to resolve parental haplotypes but since our method uses the parental haplotypes only to gain a measure of LD in the parental generation, an alternative approach would be to use composite LD measures to estimate LD using parental genotypes instead (SCHAID 2004). A necessary condition for ambiguity in offspring haplotype estimation between two loci is that one parent and the offspring are doubly heterozygous and the other parent is heterozygous at at least one of the loci (DUDBRIDGE et al. 2000). The probability that an offspring is heterozygous at a locus given his parental genotypes is Formula 7. If the probability of being heterozygous at a locus is H, then the probability of ambiguity in offspring haplotype is

Formula 7
if haplotype loci are dependent and

Formula 7
if haplotype loci are independent. Since the maximum value of H is Formula 7, probabilities of ambiguity in offspring haplotype could range from Formula 7 to Formula 7. On average the heterozygosity at any given locus in HapMap is ~Formula 7 (www.hapmap.org). Hence ambiguity in offspring haplotype estimation in HapMap reduces in range from Formula 7 to Formula 7. The impact is further reduced by the averaging of estimates over multiple markers. Methods to incorporate haplotype uncertainty into the likelihood of transmitted haplotypes are possibilities for further work.

To treat recombination fractions between pairs of markers flanking an adjacent pair as a scaled version of that between the adjacent pair we have assumed that recombination fractions are additive. This is approximately true over short distances (<2 cM). To allow for the effects of interference, the method could be modified to convert each estimate of a recombination fraction made in step 2 to a map distance via an appropriate mapping function, e.g., Kosambi. Map distances between markers flanking an adjacent pair are then scaled versions of the map distance between the adjacent pair and the method could proceed as described in step 3, but by replacing recombination fractions with map distances.

To determine normal confidence intervals presented in this article we have assumed independence between individual estimates made for each pair of markers. Further work might include the more accurate estimation of distribution of our estimator, using bootstrap or jackknife techniques (EFRON and TIBSHIRANI 1993).


ACKNOWLEDGEMENTS
We thank P. Visscher and two anonymous reviewers for comments on the manuscript. We are also grateful to A. Morris for helpful critical discussions and to D. Evans for supplying PHASE estimates of population recombination rates. This work was supported by the Wellcome Trust and a by HapMap grant from the National Institutes of Health.


LITERATURE CITED

DAUSSET, J., H. CANN, D. COHEN, M. LAHTROP, J. M. LALOUEL et al., 1990 Centre d'Etude du Polymorphisme Humain (CEPH): collaborative genetic mapping of the human genome. Genomics 6: 575–577.[CrossRef][Medline]

DEVLIN, B., N. RISCH and S. ROEDER, 1996 Disequilibrium mapping: composite likelihood for pairwise disequilibrium. Genomics 36: 1–16.[CrossRef][Medline]

DUDBRIDGE F., B. P. C. KOELEMAN, J. A. TODD and D. G. CLAYTON, 2000 Unbiased application of the transmission/disequilibrium test to multilocus haplotypes. Am. J. Hum. Genet. 66: 2009–2012.[CrossRef][Medline]

EDWARDS, A., 1992 Likelihood. John Hopkins University Press, Baltimore.

EFRON, B., and R. TIBSHIRANI, 1993 An Introduction to the Bootstrap. Chapman & Hall, London/New York.

EVANS, D. M., and L. R. CARDON, 2005 A comparison of linkage equilibrium patterns and estimated population recombination rates across multiple populations. Am. J. Hum. Genet. 76: 681–687.[CrossRef][Medline]

FEARNHEAD, P., and P. DONNELLY, 2001 Estimating recombination rates from population genetic data. Genetics 159: 1299–1318.[Abstract/Free Full Text]

FEARNHEAD, P., and P. DONNELLY, 2002 Approximate likelihood methods for estimating local recombination rates. J. R. Stat. Soc. Ser. B 64: 657–680.[CrossRef]

GRIFFITHS, R. C., and P. MARJORAM, 1996 Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3: 479–502.[Medline]

HASTBACKA, J., A. DE LA CHAPELLE, I. KAITILA, P. SISTONEN, A. WEAVER et al., 1992 Linkage disequilibrium mapping in isolated founder populations: diastrophic dysplasia in Finland. Nat. Genet. 2: 204–211.[CrossRef][Medline]

HILL, W. G., and B. S. WEIR, 1994 Maximum likelihood estimation of gene location by linkage disequilibrium. Am. J. Hum. Genet. 54: 705–714.[Medline]

HUDSON R. R., 2001 Two locus sampling distributions and their application. Genetics 159: 1805–1817.[Abstract/Free Full Text]

INTERNATIONAL HAPMAP CONSORTIUM, 2003 The international HapMap project. Nature 426: 789–796.[CrossRef][Medline]

KAPLAN, N. J., and B. S. WEIR, 1995 Are moment bounds on the recombination fraction between a marker and a disease locus too good to be true? Allelic association mapping revisited for simple genetic diseases in the Finnish population. Am. J. Hum. Genet. 57: 1486–1498.[Medline]

KAPLAN, N. N., W. G. HILL and B. S. WEIR, 1995 Likelihood methods for locating disease genes in nonequilibrium populations. Genetics 56: 18–32.

KE, X., S. HUNT, W. TAPPER, R. LAWRENCE, G. STAVRIDES et al., 2004 The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet. 13: 577–588.[Abstract/Free Full Text]

KONG, A., D. F. GUDBJARTSSON, J. SAINZ, G. M. JONSDOTTIR, S. A. GUDJONSSON et al., 2002 A high-resolution recombination map of the human genome. Nat. Genet. 31: 241–247.[CrossRef][Medline]

KUHNER, M. K., J. YAMATO and J. FELSENSTEIN, 2000 Maximum likelihood of recombination rates from population data. Genetics 156: 1393–1401.[Abstract/Free Full Text]

LANDER, E. S., and D. BOTSTEIN, 1986 Mapping complex genetic traits in humans: new methods using a complete RFLP linkage map. Cold Spring Harbor Symp. Quant. Biol. 51: 49–62.

LI, N., and M. STEPHENS, 2003 Modeling linkage disequilibrium and identifying recombination hotspots using single nucleotide polymorphism data. Genetics 165: 2213–2233.[Abstract/Free Full Text]

MCVEAN, G. A. T., P. ADWALLA and P. FEARNHEAD, 2002 A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231–1241.[Abstract/Free Full Text]

NIELSON, R., 2000 Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154: 931–942.[Abstract/Free Full Text]

SCHAID, D. J., 2004 Linkage disequilibrium testing when linkage phase is unknown. Genetics 166: 505–512.[Abstract/Free Full Text]

SERRE, J. L., B. SIMON-BOUY, E. MORNET, B. JAUME-ROIG, A. BALASSOPOULOU et al., 1990 Studies of RFLP closely linked to the cystic fibrosis locus throughout Europe lead to new considerations in populations genetics. Hum. Genet. 84: 449–454.[Medline]

SPIELMAN, R. S., R. E. MCGINNIS and W. J. EWENS, 1993 Transmission test for linkage disequilibrium: the insulin gene region and insulin-dependent diabetes mellitus (IDDM). Am. J. Hum. Genet. 52: 506–516.[Medline]

STEPHENS, M., and P. DONNELLY, 2003 A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73: 1162–1169.[CrossRef][Medline]

STEPHENS, M., N. SMITH and P. DONNELLY, 2001 A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978–989.[CrossRef][Medline]

STUMPH, M. P. H., and G. A. T. MCVEAN, 2003 Estimating recombination rates from population-genetic data. Nat. Genet. Rev. 4: 959–968.

WALL, J., 2000 A comparison of estimators of the population recombination rate. Mol. Biol. Evol. 17: 839–850.[Abstract/Free Full Text]

Communicating editor: P. J. OEFNER




This article has been cited by other articles:


Home page
Genome ResHome page
A. Tenesa, P. Navarro, B. J. Hayes, D. L. Duffy, G. M. Clarke, M. E. Goddard, and P. M. Visscher
Recent human effective population size estimated from linkage disequilibrium
Genome Res., April 1, 2007; 17(4): 520 - 526.
[Abstract] [Full Text] [PDF]


Home page
GeneticsHome page
P. M. Visscher and W. G. Hill
Estimation of Recombination Rate and Detection of Recombination Hotspots From Dense Single-Nucleotide Polymorphism Trio Data
Genetics, August 1, 2006; 173(4): 2415 - 2417.
[Full Text] [PDF]