Corresponding author: Molly Przeworski, Brown University, 80 Waterman St., Box G-W, Providence, RI 02912. E-mail: molly_przeworski{at}brown.edu
Communicating editor: J. J. HEIN
| ABSTRACT |
|---|
An ability to predict levels of linkage disequilibrium (LD) between linked markers would facilitate the design of association studies and help to distinguish between evolutionary models. Unfortunately, levels of LD depend crucially on the rate of recombination, a parameter that is difficult to measure. In humans, rates of genetic exchange between markers megabases apart can be estimated from a comparison of genetic and physical maps; these large-scale estimates can then be interpolated to predict LD at smaller ("local") scales. However, if there is extensive small-scale heterogeneity, as has been recently proposed, local rates of recombination could differ substantially from those averaged over much larger distances. We test this hypothesis by estimating local recombination rates indirectly from patterns of LD in 84 genomic regions surveyed by the SeattleSNPs project in a sample of individuals of European descent and of African-Americans. We find that LD-based estimates are significantly positively correlated with map-based estimates. This implies that large-scale, average rates are informative about local rates of recombination. Conversely, although LD-based estimates are based on a number of simplifying assumptions, it appears that they capture considerable information about the underlying recombination rate or at least about the ordering of regions by recombination rate. Using LD-based estimators, we also find evidence for homologous gene conversion in patterns of polymorphism. However, as we demonstrate by simulation, inferences about gene conversion are unreliable, even with extensive data from homogeneous regions of the genome, and are confounded by genotyping error.
THE extent to which alleles assort randomly on chromosomes depends on the recombination rate, as well as on the demographic history of the species and on the selective pressures exerted on the genomic region. All else being equal, alleles at a given physical distance apart will tend to be more strongly associated (in greater linkage disequilibrium) when the recombination rate is lower. Thus, levels of linkage disequilibrium (LD) can be predicted from estimates of the recombination rate, if an adequate model for the history of the species exists (![]()
In humans, there is increasing interest in an accurate characterization of LD, prompted in part by the question of what marker density will be needed to attain reasonable power in genome-wide association studies (![]()
To date, there is direct evidence for small-scale (<100 kb) variation in recombination rates for <10 regions of the genome (e.g., ![]()
Recent studies of human patterns of LD have also highlighted the importance of a second feature of recombination: homologous gene conversion (![]()
We tested how well cmap predicts local levels of LD in extensive polymorphism data collected by the SeattleSNPs project (http://pga.mbt.washington.edu/). To do so, we characterized levels of LD by an estimate of the population rate of crossing over, ρ = 4Nec (Ne is the diploid effective population size and c the rate of crossing over per base pair per generation). This approach summarizes the LD in a region by a single number, so that levels of LD in different regions can be compared (![]()
ρ/4Ne provides an estimate of the crossing-over rate that can be compared to cmap (![]()
| METHODS |
|---|
SeattleSNP data:
We used publicly available polymorphism data for autosomal genes implicated in inflammatory diseases (see ![]()
4%; M. RIEDER, personal communication). Since diploids were sequenced, the data consist of genotypes where the phase of multiple heterozygotes is unknown.
We analyzed the two groups separately because, in previous studies, allele frequencies and levels of LD differed between Sub-Saharan African and European populations (e.g., ![]()
48 in African-Americans and 33 in European-Americans. In our usage, segregating sites include all biallelic markers, whether nucleotide substitutions or indels, but exclude a small fraction of sites with more than two alleles (18 across all regions). The regions always include a gene, but coding sequence represents a small proportion (approximately 10% on average) of the total. The segments surveyed for variation in a region are not always contiguous: approximately 4–70 kb were sequenced (median of 11.2 kb) out of a total length of 4–180 kb (median of 11.8 kb).
Estimating the effective population size from diversity and divergence:
Under the standard neutral model of a random-mating population of constant size, the diploid effective population size of humans, Ne, is the same for all regions of the genome (see Table 1 for a brief definition of symbols). It can be estimated as θW/(4µdiv), where µdiv is an estimate of the mutation rate per base pair per generation, µ, and θW is Watterson's estimate of the population mutation rate θ = 4Neµ, based on the number of segregating sites in the sample (![]()
θW = 1 × 10⁻³/bp for African-Americans and θW = 7 × 10⁻⁴/bp for the CEPH. The estimate of Ne obtained in this way, Ndiv, is approximately 15,000 for African-Americans and approximately 10,000 for the CEPH. Similar estimates are obtained by considering the mean or median Ne estimate across loci (results not shown).
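The arithmetic behind this estimate can be sketched as follows. This is an illustration only: the function names are ours, and because the value of µdiv is not given in the surviving text, the sketch backs out the mutation rate implied by the reported θW and Ndiv rather than assuming one.

```python
def watterson_theta(num_segregating_sites, num_chromosomes, num_sites):
    """Watterson's estimate of theta = 4*Ne*mu, per base pair."""
    # a_n = sum_{i=1}^{n-1} 1/i scales the expected number of segregating sites
    a_n = sum(1.0 / i for i in range(1, num_chromosomes))
    return num_segregating_sites / (a_n * num_sites)


def effective_size(theta_per_bp, mu_per_bp):
    """Diploid effective population size, from theta = 4*Ne*mu."""
    return theta_per_bp / (4.0 * mu_per_bp)


def implied_mutation_rate(theta_per_bp, n_e):
    """Mutation rate per bp per generation implied by theta and Ne."""
    return theta_per_bp / (4.0 * n_e)


# Reported values: theta_W = 1e-3/bp with Ndiv ~ 15,000 (African-Americans)
# and theta_W = 7e-4/bp with Ndiv ~ 10,000 (CEPH).
mu_african_american = implied_mutation_rate(1e-3, 15_000)
mu_ceph = implied_mutation_rate(7e-4, 10_000)
print(f"implied mu: {mu_african_american:.3g} and {mu_ceph:.3g} per bp per generation")
```

Both samples imply a mutation rate just under 2 × 10⁻⁸ per base pair per generation, so the two Ndiv values are mutually consistent given the rounding of the reported numbers.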
The model of recombination:
We considered a simple model of recombination to make inferences about rates of gene conversion (![]()
Estimating the crossover rate:
Rough estimates of the recombination rate can be obtained by a comparison of genetic and physical maps. We used sex-averaged recombination rate estimates based on the genetic map of ![]()
Estimating the population rate of crossing over when f = 0:
We estimated the population rate of crossing over, ρ = 4Nec, from polymorphism data using the method of ![]()
ρ and θ. Further, as θ approaches 0, and conditioning on two sites being polymorphic, the probability of a particular allelic configuration at a pair of sites is only a function of ρd, where d is the physical distance between the two sites (in base pairs).
Let n be the vector whose four entries are the counts of each haplotype in the sample. For a single pair, the maximum-likelihood estimate of ρ can be obtained by maximizing Pr(n; ρ, d). To estimate ρ for polymorphism data from a given region of the genome, ![]() proposed maximizing the product ∏i Pr(ni; ρ, di) over all pairs i of sites in the region. This procedure ignores the dependence between pairs and hence the likelihood obtained is not a true likelihood, but instead is referred to as "composite likelihood." Note that because the allelic configurations depend on ρ, not c, we cannot estimate c and Ne separately.
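The structure of the composite-likelihood calculation can be sketched in a few lines. The pairwise probability used here (`toy_pair_prob`) is a deliberately crude stand-in of our own, not the two-locus sampling probability used in the actual method; it only illustrates how per-pair terms are multiplied (summed on the log scale) as if independent and then maximized over a grid of ρ values.

```python
import math


def composite_log_likelihood(rho, pairs, pair_prob):
    """Sum of log Pr(n_i; rho, d_i) over all pairs, treated as independent."""
    return sum(math.log(pair_prob(n, rho, d)) for n, d in pairs)


def maximize_over_grid(pairs, pair_prob, rho_grid):
    """Grid value of rho with the highest composite log-likelihood."""
    return max(rho_grid,
               key=lambda rho: composite_log_likelihood(rho, pairs, pair_prob))


def toy_pair_prob(n, rho, d):
    """HYPOTHETICAL stand-in: n is 1 if the pair looks 'associated', else 0.

    The probability of association decays with rho * d. A real analysis
    would use the coalescent sampling probability of the four haplotype
    counts instead.
    """
    p_associated = 1.0 / (1.0 + rho * d)
    return p_associated if n == 1 else 1.0 - p_associated


# Ten pairs of sites 100 bp apart, half of them "associated".
pairs = [(1, 100), (0, 100)] * 5
rho_grid = [i * 0.001 for i in range(1, 101)]   # avoids rho = 0, where log(0) arises
rho_hat = maximize_over_grid(pairs, toy_pair_prob, rho_grid)
```

Because only the product ρd enters each term, the same data are equally compatible with any combination of Ne and c that gives the same ρ, which is why the two cannot be separated.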
We used diploid data, where the phase of double heterozygotes is unknown and in which data are missing; in addition, the ancestral state is unknown. Assuming random mating, the approach of ![]()
∏i Pr(nd,i; ρ, di), one can estimate ρ from data where the phase of double heterozygotes is unknown. The estimate of ρ obtained in this way is denoted ρLD. The program to estimate ρ from diploid data was kindly provided by R. Hudson. Simulations suggest that estimates of ρ based on n diploids are as accurate as, or slightly more accurate than, those based on n haploids but less accurate than estimates based on 2n haploids (S. E. PTAK, M. PRZEWORSKI and R. HUDSON, unpublished results).
Likewise, the probabilities of allelic configurations where the ancestral state is unknown can be obtained from the probabilities when it is known, by summing over the appropriate sample configurations. Simulations suggest that there is not much loss of information when ancestral types are unknown (![]()
Estimating the population crossing-over rate when f > 0:
We also considered a model where there is both gene conversion and crossing over. The probability of a configuration at a pair of sites, nd, then depends on f, ρ, and d. Specifically, it depends on the rate at which a recombinant is produced between two sites, which is r = c[d + 2fL(1 − exp(−d/L))] under this model of recombination (![]()
ρ can be estimated by maximizing the product ∏i Pr(nd,i; f, ρ, di). The program to do so was kindly provided by R. Hudson.
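To make the distance dependence concrete, the recombinant rate r = c[d + 2fL(1 − exp(−d/L))] can be evaluated directly. The sketch below is ours; the parameter values (c = 10⁻⁸, f = 1, L = 500) are chosen only for illustration.

```python
import math


def recombinant_rate(c, d, f, tract_length):
    """r = c * [d + 2 f L (1 - exp(-d / L))].

    c: crossover rate per bp per generation
    d: distance between the two sites (bp)
    f: ratio of gene conversion to crossing-over rates
    tract_length: mean conversion tract length L (bp)
    """
    conversion_term = 2.0 * f * tract_length * (1.0 - math.exp(-d / tract_length))
    return c * (d + conversion_term)


def inflation_over_crossing_over(d, f, tract_length):
    """Ratio of r to the crossover-only rate c * d (independent of c)."""
    return recombinant_rate(1.0, d, f, tract_length) / d


# Illustrative values only: gene conversion matters most for nearby sites.
for distance in (100, 1_000, 100_000):
    ratio = inflation_over_crossing_over(distance, f=1.0, tract_length=500)
    print(distance, round(ratio, 3))
```

With f = 0 the expression reduces to r = cd, and for d much larger than L the conversion term saturates at 2fL, so the relative contribution of gene conversion shrinks as distance grows.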
We estimated f and ρ using all the regions at once. Specifically, we estimated the likelihood of ρ and f values on a two-way grid, where ρ varies from 0 to 0.01 (in increments of 0.0001) and f from 0 to 10 (in increments of 0.25). We tried L = 60, 250, 500, and 1000, values consistent with the little that is known about gene conversion tract lengths in yeast, fruit flies, and humans (![]()
ρ are the same across all loci and maximized CLik(f, ρ) = ∏j CLikj(f, ρ), where CLikj is the composite likelihood for genic region j. We denote the composite maximum-likelihood estimates of ρ and f obtained in this way by ρ+ and f+, respectively. ![]()
For the SeattleSNPs data, cmap suggests that recombination rates vary by an order of magnitude across regions. We therefore developed a second approach, referred to as the profile method, by analogy to profile likelihood. This method assumes that f is fixed across loci, but allows ρ to vary. Let CLikPj(f) = CLikj(f, ρf) be the profile composite likelihood of f for locus j, where ρf is the maximum composite-likelihood estimate of ρ for a given value of f. To obtain a maximum composite-likelihood estimate of f, f*, we maximized the product ∏j CLikPj(f). For each locus, the estimate of ρ is given by ρf*.
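The difference between the joint and profile approaches is easiest to see on a precomputed grid. The sketch below assumes (our convention, not the authors' program) that `loglik[j][a][b]` holds the composite log-likelihood of locus j at the a-th grid value of f and the b-th grid value of ρ; products of likelihoods become sums of logs.

```python
def joint_estimate(loglik):
    """One f and one rho shared by all loci: maximize the summed surface."""
    n_f, n_rho = len(loglik[0]), len(loglik[0][0])
    cells = ((a, b) for a in range(n_f) for b in range(n_rho))
    return max(cells, key=lambda ab: sum(locus[ab[0]][ab[1]] for locus in loglik))


def profile_estimate(loglik):
    """One f for all loci, but a separate rho for each locus."""
    n_f = len(loglik[0])

    def profile_loglik(a):
        # For each locus, maximize over rho first, then sum over loci.
        return sum(max(locus[a]) for locus in loglik)

    a_star = max(range(n_f), key=profile_loglik)
    rho_indices = [max(range(len(locus[a_star])), key=lambda b: locus[a_star][b])
                   for locus in loglik]
    return a_star, rho_indices


# Two made-up loci, two f values, three rho values (log scale).
example = [
    [[-5.0, -1.0, -4.0], [-3.0, -2.0, -6.0]],
    [[-4.0, -6.0, -1.5], [-7.0, -3.0, -0.2]],
]
joint = joint_estimate(example)        # indices (f, rho) shared across loci
profile = profile_estimate(example)    # index of f*, then one rho index per locus
```

In this toy grid both approaches pick the same f, but the joint fit is forced to compromise on a single ρ index, while the profile fit lets each locus keep its own best ρ.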
Performance of estimators of f:
To gain a rough sense of how well the two estimators of f are expected to perform on these data, we simulated 200 sets of loci that matched the actual data for African-Americans, with f = 0, ..., 4. For each locus, we generated data for 24 diploid individuals by coalescent simulations (![]()
ρ = ρmap and θ = θW. We then estimated ρ and f from each of the 200 sets of data using the joint and the profile method and tabulated how often we obtained particular estimates of f (on an integer grid from 0 to 8). In these simulations and all others described below, we matched the length of the sequence as well as the gaps for each region, but not the missing data, and created the observed number of genotypes by randomly pairing haplotypes (using modifications of software available from http://home.uchicago.edu/~rhudson1/source/mksamples.html).
Rejecting no variation in ρ across loci:
We used coalescent simulations to test the null hypothesis of a fixed ρ across loci. Specifically, we considered the ratio of the maximum composite likelihood obtained under the null hypothesis and under the alternative model where f is fixed but ρ can vary [i.e., CLik(f+, ρ+)/CLik(f*, ρf*)]. (Note that when ρ is fixed across loci, the profile approach reduces to the joint one.) A small ratio suggests that the data are more probable under the alternative hypothesis that ρ is not fixed across loci. To assess significance, we compared the observed ratio to a distribution generated from 100 simulations with θ = θW, ρ = ρ+, and f = f+ for each locus.
Rejecting no gene conversion (i.e., f = 0):
We tested the null hypothesis that f = 0 in the CEPH and in African-American data. To do so, we compared the observed value of f* to a distribution of f* values obtained from 100 simulations with θ = θW, ρ = ρ0, and f = 0 (i.e., ρ0 is the estimate obtained from the profile method constraining f to be 0).
Effect of genotyping error on inferences about f:
To examine the effect of genotyping error on estimates of f, we generated 100 sets of 84 simulated loci that matched the actual data for African-Americans, conditional on ρ = ρmap. We set f = 0 but introduced "genotyping errors," then tabulated the proportion of sets where we obtained an estimate of f greater than 0 by either the joint or the profile estimation procedure. It is unclear how best to model genotyping error, since the process depends on the SNP detection technology and its use in specific cases. In addition, while rates of false positives can be estimated, rates of false negatives are much harder to assess. In our simulations, we chose an extremely simple model of genotyping error, meant to be illustrative rather than descriptive. To mimic a genotyping error rate of ε, we switched each allele with probability ε/2 at every segregating site. As a result of these errors, some polymorphisms appear to be monomorphic, while others change frequency. Note that no additional, fake polymorphisms are created by this procedure, so that an implicit assumption is that the rate of false positives for low-frequency variants is 0; this may not be grossly unrealistic since sites with few heterozygous genotypes are often confirmed manually. This model also makes the sensible assumption that a heterozygote and a homozygote genotype are much more likely to be confused than are the two homozygote genotypes. We ran simulations for ε = 0.005 and 0.01.
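The error model described above (each allele at a segregating site switched independently with probability equal to half the genotyping error rate) can be sketched as follows. The 0/1 coding of haplotypes and the function names are our assumptions for illustration, not the software used in the study.

```python
import random


def add_genotyping_error(haplotypes, error_rate, rng):
    """Switch each 0/1 allele independently with probability error_rate / 2.

    `haplotypes` is a list of equal-length lists holding only the
    segregating sites, so no new polymorphic sites can be created.
    """
    flip_prob = error_rate / 2.0
    return [[1 - allele if rng.random() < flip_prob else allele for allele in hap]
            for hap in haplotypes]


def segregating_columns(haplotypes):
    """Indices of sites that are still polymorphic in the sample."""
    n_sites = len(haplotypes[0])
    return [j for j in range(n_sites) if len({hap[j] for hap in haplotypes}) > 1]


rng = random.Random(2004)
sample = [[0, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 0]]
noisy = add_genotyping_error(sample, error_rate=0.01, rng=rng)
still_polymorphic = segregating_columns(noisy)
```

A singleton such as the first site in `sample` becomes monomorphic whenever its single derived allele is switched, which is how this model removes polymorphisms without ever adding new ones.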
To test whether we can reject the hypothesis of no gene conversion in the presence of genotyping error, we compared the observed f* to the distribution obtained from 100 simulations where ρ = ρ0, f = 0, and ε = 0.005.
Relationship between ρLD and cmap:
To examine the relationship between ρLD and cmap, we estimated ρLD for each locus (for f = 0, ..., 10 in increments of 0.25). We then computed the rank correlation of these estimates and cmap estimates (for a given f), using Kendall's coefficient. To test whether this relationship would still be significant (at the 5% level) after correction for differences in Ne across regions, we considered the partial rank correlation of ρLD and cmap after correction for θW values. We estimated the significance of the observed partial correlation by a permutation test (with 100 permutations).
To gain a sense of our power to detect a relationship between ρLD and cmap using the SeattleSNPs data, we used coalescent simulations of the standard neutral model to generate 100 sets of 84 regions that mimicked the actual data. In each region, θ = θW and f = 0. To take into account the uncertainty associated with estimates of cmap, ρ for each region was chosen from a gamma distribution with mean set to ρmap = 4Ndivcmap; the choice of gamma parameters was guided by the rough confidence intervals reported for the KONG et al. (2001) genetic map of humans (![]()
ρLD from each set of simulated data, using the same approach as for the actual data. Using the 100 sets of simulated data, we then asked: (1) In what proportion of data sets is the correlation of ρLD and cmap (measured using Kendall's rank coefficient) as strong as or stronger than what we observed in the actual data?, (2) how often is it significant at the 5% level?, and (3) how often is the median ρLD as large as or larger than that observed? We also asked these questions using 100 sets of data generated with f = 1.
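Two ingredients of this simulation design lend themselves to a short sketch: drawing a per-region ρ from a gamma distribution whose mean is 4Ndivcmap, and tabulating how often a simulated statistic is at least as large as the observed one. The shape parameter below is hypothetical (the gamma parameters actually used are not given in the surviving text), and the helper names are ours.

```python
import random


def sample_rho(rho_map, shape, rng):
    """Draw rho from a gamma distribution with mean rho_map."""
    # random.gammavariate(alpha, beta) has mean alpha * beta
    return rng.gammavariate(shape, rho_map / shape)


def proportion_at_least(simulated_values, observed):
    """Fraction of simulated data sets whose statistic is >= the observed one."""
    hits = sum(value >= observed for value in simulated_values)
    return hits / len(simulated_values)


N_DIV = 15_000      # Ndiv for African-Americans, from the text
C_MAP = 1.08e-8     # 1.08 cM/Mb expressed per bp per generation
RHO_MAP = 4 * N_DIV * C_MAP
SHAPE = 4.0         # HYPOTHETICAL shape parameter, for illustration only

rng = random.Random(1)
draws = [sample_rho(RHO_MAP, SHAPE, rng) for _ in range(20_000)]
mean_draw = sum(draws) / len(draws)
```

A smaller shape parameter spreads the draws more widely around the same mean, which is how greater uncertainty in cmap would be represented in this scheme.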
| RESULTS |
|---|
Levels of linkage disequilibrium in the African-American and CEPH samples:
We analyzed 84 regions of the genome in a sample of 24 African-Americans and 77 regions in a sample of 23 CEPH individuals (see METHODS). For each region, we used θW (![]()) to estimate θ = 4Neµ, and ρLD (![]()) to estimate ρ = 4Nec, assuming no gene conversion (for the estimates of θ and ρ for each region, see Table 1 of the supplementary materials available at http://email.eva.mpg.de/~przewors). θW values in the two samples are highly correlated (τ = 0.527, p = 10⁻⁶, n = 77), as are ρLD values (τ = 0.541, p = 10⁻⁶, n = 77).
These estimators of θ and ρ assume the standard neutral model and it is unclear how to interpret the estimates under alternative models where Ne is not well defined. To compare samples, we therefore considered the ratio ρLD/θW, an estimate of the number of crossover events per mutation for each locus, c/µ (![]()
More or less LD than expected?
These LD-based estimates of c can be compared to large-scale estimates. The median cmap values are 1.08 cM/Mb and 1.15 cM/Mb for the set of 84 loci in African-Americans and 77 loci in the CEPH sample, respectively. For the African-American sample, the median estimate of ρLD is 0.0010/bp. (Throughout, we consider the median ρLD and not the mean because with small probability the ρ estimator returns a very large value, so the mean is not well defined.) Assuming a diploid effective population size of Ndiv = 15,000 for African-Americans (see METHODS), this yields a median crossover rate of ρLD/4Ndiv = 1.71 cM/Mb. This value is approximately 58% larger than the median cmap estimate. In contrast, the median value of ρLD for the CEPH sample is 0.0003/bp. Assuming Ndiv = 10,000 for Europeans (see METHODS), this yields a median crossover rate of 0.75 cM/Mb, which is approximately 35% smaller than the median cmap.
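The unit conversion behind these figures can be checked with a few lines. The sketch below uses the rounded values quoted in the text; the function names are ours. One cM/Mb corresponds to 0.01 crossovers per 10⁶ bp, that is, 10⁻⁸ per base pair per generation.

```python
def crossover_rate_cm_per_mb(rho_per_bp, n_e):
    """Convert rho = 4*Ne*c (per bp) into a crossover rate in cM/Mb."""
    c_per_bp = rho_per_bp / (4.0 * n_e)   # crossovers per bp per generation
    return c_per_bp * 1e8                 # 1 cM/Mb = 1e-8 per bp


def relative_difference(estimate, reference):
    """Fractional difference of an estimate from a reference value."""
    return (estimate - reference) / reference


ceph_rate = crossover_rate_cm_per_mb(0.0003, 10_000)
african_american_rate = crossover_rate_cm_per_mb(0.0010, 15_000)
print(round(ceph_rate, 2), round(african_american_rate, 2))
print(round(relative_difference(ceph_rate, 1.15), 2))
```

With these rounded inputs the CEPH figure reproduces the 0.75 cM/Mb quoted above, while the African-American figure comes out slightly below the quoted 1.71 cM/Mb, presumably because the reported median ρLD and Ndiv are themselves rounded.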
To assess whether the discrepancies between cmap and LD-based estimates of the recombination rate are larger than expected by chance, we generated 100 sets of simulated data under the standard neutral model that mimicked the actual data. For each region, we set θ = θW and chose ρ from a gamma distribution with mean 4Ndivcmap (see METHODS). For the African-American sample, the median ρLD was equal to or larger than that observed in 0 of 100 cases, suggesting that there is significantly more recombination than expected from cmap. In other words, levels of linkage disequilibrium are unexpectedly low for the standard neutral model, when f = 0. The finding for the CEPH sample, however, is the opposite: Only in 1 of 100 cases was the median ρLD equal to or smaller than that observed. Thus, in the CEPH sample, levels of linkage disequilibrium are unexpectedly high for the standard neutral model, when f = 0.
These conclusions are predicated on an estimate of Ne, which in turn depends on the generation time and the time to a common ancestor, two parameters about which little is known (![]()
Relationship of cmap and ρLD:
To assess the extent to which average crossover rates between markers far apart predict local levels of LD, we considered the rank correlation of cmap and ρLD in both population samples. (We did not use parametric analyses, as many assumptions are grossly violated in this context.) When all loci with at least 10 segregating sites are considered, the two are significantly positively correlated at the 5% level in the CEPH sample (τ = 0.220, p = 0.007, n = 77) but not in the African-American sample (τ = 0.116, p = 0.129, n = 84).
Since estimates of ρ are highly inaccurate when there are few variable sites (![]()
ρLD estimates are within twofold of the true value of ρ approximately 90% of the time under the standard neutral model [HUDSON's (2001) Figure 8, assuming ρ = θ]. With this restricted data set, cmap and ρLD are significantly positively correlated in both population samples (see Fig 1; τ = 0.258, p = 0.025, n = 40 for the CEPH sample and τ = 0.256, p = 0.004, n = 62 for the African-Americans).
As reported previously, estimates of θ (= 4Neµ) are also positively correlated with cmap (![]()
ρ (= 4Nec) could increase with cmap because of a relationship between Ne and cmap, as expected under models of variation-reducing selection (cf. ![]()
θ and cmap appears to be due to an association of mutation and recombination rates, not to a relationship between Ne and cmap (![]()
ρLD is still significantly correlated with cmap after correction for θW values (τ = 0.255, p = 0.03, n = 40 for the CEPH sample and τ = 0.238, p = 0.01, n = 62 for the African-Americans; see METHODS). Thus, it appears that large-scale estimates of the crossing-over rate, cmap, are informative about local levels of linkage disequilibrium, as measured by ρLD.
The strength of the underlying correlation between cmap and ρLD is unclear, however, as error in cmap and ρLD estimates will introduce noise and thus decrease the observed association. A related question is how often the observed correlation is expected under the standard neutral model, if f = 0 and in the absence of recombination rate heterogeneity. We examined this by tabulating how often a correlation is observed in simulated data that mimic the actual data (see METHODS). When all loci with at least 30 segregating sites were considered, there was a significant correlation between cmap and ρLD in 100 of 100 simulated data sets, for both population samples. Further, the rank correlation coefficient was always larger than that observed. When the cutoff was 10 segregating sites, the correlation was again always significant and only once was a rank correlation coefficient smaller than that observed (for the CEPH sample). [As expected, however, the correlation coefficients tended to be larger when we restricted our attention to only data sets with >30 segregating sites compared to when we considered all data sets with >10 (results not shown).] In summary, if the assumptions of the model were met, a much stronger relationship would be expected between cmap and ρLD in spite of errors in both sets of estimates. This finding suggests that there are salient departures from model assumptions that reduce the correlation between large-scale and local recombination rates.
Performance of joint and profile estimators of f:
As discussed above, one feature of recombination missing from the model of recombination is homologous gene conversion. To assess the evidence for gene conversion in the SeattleSNPs data, we used an extension of the method of ![]()
ρ and f (for a given mean conversion tract length, L; see METHODS).
Unfortunately, when both ρ and f are estimated on a single locus, even a large one (e.g., with >200 segregating sites), the power to reject f = 0 is low and estimates of the parameters are highly inaccurate (results not shown). We therefore assumed a fixed f value across loci and combined information from all loci to estimate this parameter. As described in METHODS, we did so under two sets of assumptions about ρ. In the so-called "joint" approach, we assumed that ρ is the same for all loci; in the second, "profile" method, we allowed ρ to vary.
We tested the performance of both estimators of f by generating simulated data sets that mimicked the African-American data (see METHODS). For each region, we set θ = θW and ρ = ρmap (for all values of f); thus, in our simulations, the population crossover rate varied across loci, but f was fixed. As can be seen in Fig 3, the joint estimator tends to overestimate the true value of f under these conditions. In contrast, the profile approach tends to underestimate it, but to a lesser extent. Similarly, the joint estimate of ρ is an underestimate, and the median profile estimate of ρ a slight overestimate, of the true median ρ (see Table 2 in supplementary materials at http://email.eva.mpg.de/~przewors). Overall, the profile estimator of f performs better than the joint method.
Which method is preferable in general depends on the extent to which ρ varies across regions. If it does not vary much, then the joint estimator should be more accurate, as it combines information from all regions to estimate ρ. In practice, researchers will rarely have precise estimates of the local ρ. At best, they will have a set of regions that are thought to experience similar crossing-over rates based on large-scale c estimates. To quantify the performance of the two estimators for these types of data, we generated 100 sets of 50 loci, where each locus was similar in information content to the data sets collected by ![]()
ρ values for each region independently from the same gamma distribution. As expected, in this situation, the joint method leads to more accurate estimates for f > 0 than does the profile approach. However, the point estimates of f are often >0 in the absence of gene conversion (approximately 26% of the time; see Table 3 in supplementary materials at http://email.eva.mpg.de/~przewors).
Estimates of gene conversion rates:
For the SeattleSNPs data, the cmap estimates point to substantial variation in ρ across regions. We therefore tested the null hypothesis that ρ and f are fixed across loci against an alternative where f is fixed but ρ can vary (see METHODS). We found that the polymorphism data are significantly more likely under the model where ρ is allowed to vary (the observed ratio of composite likelihoods was smaller than those for all 100 simulations). Consequently, we present estimates of ρ and f obtained using the profile approach.
and f obtained using the profile approach. To facilitate comparison with previous studies, notably ![]()
For the African-American sample, the profile method yielded f* = 1 and median ρLD = 0.0008, while for the CEPH sample we obtained f* = 0.25 and median ρLD = 0.0003. Furthermore, simulations (see METHODS) suggest that the null hypothesis of no gene conversion can be rejected for both population samples (the observed f* is larger than that obtained in all 100 simulations for the African-American sample and in 96 out of 100 simulations for the European-American sample).
Effects of mutation rate variation on estimates of ρ and f:
Inferences about ρ and f are potentially confounded by factors that affect allelic associations similarly. The estimator of ρ and f (![]()
ρ and f for the African-American sample (with L = 500) after excluding all CpG sites from our polymorphism data (a CpG site was conservatively defined as any dinucleotide where at least one chromosome in the sample could have a CpG). The estimate of f and the median ρLD (0.0009) were essentially unchanged by the exclusion of CpG sites.
Effect of crossover rate variation on inferences about f:
An additional concern might be that the presence of short segments with elevated crossing-over rates ("recombination hotspots") can mimic the effects of gene conversion by decreasing LD between closely linked pairs. However, for the method of ![]()
Effect of genotyping error on inferences about f:
A more serious problem is that genotyping errors have more effect on LD at short than at long distances and thereby mimic the effect of gene conversion (see Fig 2). To examine the effect of genotyping error on inferences about f, we ran simulations with no gene conversion but with genotyping error and estimated f (see METHODS). As can be seen in Fig 4, a 1% genotyping error rate led to a point estimate of f > 0 in all 100 simulated data sets, using either the profile or the joint approach. Even with a seemingly small genotyping error rate (0.5%), the estimate of f was >0 in 78 and 99% of simulated runs, respectively. For the SeattleSNPs data, an error rate was assessed by genotyping a subset of the sites using two other genotyping technologies; approximately 0.5% of genotypes checked in this way differed from the original call (![]()
| DISCUSSION |
|---|
Too little LD in African-Americans and too much in the CEPH?
In the SeattleSNPs data, levels of LD in the CEPH are higher than those in the African-Americans. This finding is consistent with reports of higher levels of LD in European populations relative to Sub-Saharan African ones (e.g., ![]()
θ = 4Neµ is approximately 1.6-fold higher in African-Americans, while the median estimate of ρ = 4Nec is >3-fold higher. Thus, levels of LD differ more between populations than do diversity levels, as previously reported for a comparison of Hausa and Italians (![]()
Our simulations further suggest that the levels of LD observed in African-Americans are lower than what would be expected under the standard neutral model (assuming f = 0). As discussed above, a trivial explanation is that we underestimated Ne. However, lower than expected levels of LD have also been reported in a study of 10 intergenic regions in a Hausa sample from Africa (![]()
ρLD). As an illustration, when f = 1, levels of LD in African-Americans are closer to what is expected from a comparison of genetic and physical maps: The median crossing-over rate estimated from polymorphism data using the profile method, ρLD/4Ndiv, is 1.33 cM/Mb, while the median cmap is 1.08 cM/Mb. Moreover, the median ρLD is no longer significantly larger than expected from simulations where f = 1 and ρ = ρmap (results not shown).
Of course, a departure from model assumptions other than gene conversion could also have decreased levels of LD below the expectations of the standard neutral model, for example, population growth (cf. ![]()
In contrast to the African-American sample, the CEPH population shows unexpectedly high levels of linkage disequilibrium relative to the expectations of the standard neutral model when f = 0. When f > 0, the apparent excess of LD in the CEPH sample becomes more extreme (results not shown). This observation could be partially explained if Ne is overestimated. However, other features of polymorphism data in individuals of European descent suggest a demographic explanation may be more likely. In particular, the observation of reduced levels of diversity relative to Sub-Saharan African samples and an excess of intermediate frequency alleles in European samples have led to the suggestion of a recent reduction in population size in Europeans (![]()
Inferences about gene conversion:
Theoretical investigations have highlighted the potential importance of gene conversion in shaping local levels of linkage disequilibrium (![]()
approximately 1 in the African-Americans and 0.25 in the CEPH. Simulations suggest that both estimates are significantly >0, assuming no or very little genotyping error. These represent the second estimates of f from polymorphism data. In the first, the estimate was based on a smaller data set sequenced in a Sub-Saharan African sample and obtained under the assumption that crossing-over rates are fixed across loci (![]()
overestimates the true f when rates vary substantially across loci (Fig 3).
These inferences about f rely on a number of assumptions. Most obviously, they are based on a simplified model of recombination, for which there is support in humans from single-sperm typing (e.g., ![]()
Contrasting local and large-scale estimates of the recombination rate:
Local recombination rates estimated from polymorphism data (ρLD) increase with large-scale estimates of the crossing-over rate (cmap) in both African-Americans and the CEPH. This association suggests that LD-based estimates of the recombination rate such as ρLD capture considerable information about underlying recombination rates or at least about the ordering of different regions by levels of LD, in spite of their reliance on a number of unrealistic assumptions, such as a constant population size and random mating. Consistent with our conclusion, recent studies found close agreement between LD-based estimates of recombination rates (using a different approach) and single-sperm typing estimates at the TAP2 and β-globin loci in humans (![]()
While large-scale estimates of the crossing-over rate are predictive of local levels of LD, simulations suggest that the association of cmap and ρLD is weaker than would be expected if there were no heterogeneity in recombination rate or variation in f across loci (see RESULTS). One explanation is that f varies substantially across loci, reducing the strength of the relationship. To test this possibility, we ran 100 simulations where f was not fixed across loci, but was instead an integer "chosen uniformly" between 0 and 9. The association did tend to be weaker in this situation compared to one where f was fixed and nonzero across regions (results not shown). However, the observed correlation coefficient between cmap and ρLD (estimated for any fixed f ≤ 9) was smaller than that seen in all 100 simulations (results not shown). Thus, variation in f alone is unlikely to explain the weakness of the correlation. A plausible alternative is that occasional recombination hotspots are reflected in the SeattleSNPs data. Candidates for hotspot regions are the loci with unusually high ρLD estimates given cmap (see Fig 1).
Outlook:
We find that average recombination rates over large distances are informative about local rates of genetic exchange and, hence, about local patterns of linkage disequilibrium. This finding is somewhat surprising: In light of increasing evidence for extensive local variation in recombination rates, why do regions of 4–180 kb so often reflect the rates obtained from averaging over megabases? One possibility is that large-scale rates reflect the background rate of recombination (i.e., the rate outside of hotspots). To address this and related questions, further work is needed to quantify how much recombination occurs in hotspots and the variation in intensity among hotspots, as well as to assess the extent to which global features of chromosome structure influence the location of recombination events (![]()
![]()
| ACKNOWLEDGMENTS |
|---|
We thank D. Cutler, R. Hudson, M. Stephens, and J. Wall for useful discussions and P. Andolfatto, Y. Gilad, J. Wall, and three anonymous reviewers for comments on the manuscript. We are also grateful to I. Hellmann and N. Matasci for bioinformatics help; to M. Rieder for providing additional information about the SeattleSNPs data; and especially to D. Nickerson, M. Rieder, and the SeattleSNPs project for making their data publicly available. This work was supported by Deutsche Forschungsgemeinschaft grant BIZ6-1/1.
Manuscript received June 3, 2003; Accepted for publication January 19, 2004.
| LITERATURE CITED |
|---|
ANDOLFATTO, P., 2001 Adaptive hitchhiking effects on genome variability. Curr. Opin. Genet. Dev. 11:635-641.
ANDOLFATTO, P. and M. NORDBORG, 1998 The effect of gene conversion on intralocus associations. Genetics 148:1397-1399.
ANDOLFATTO, P. and M. PRZEWORSKI, 2000 A genome-wide departure from the standard neutral model in natural populations of Drosophila. Genetics 156:257-268.
CARLSON, C. S., M. A. EBERLE, M. J. RIEDER, J. D. SMITH, and L. KRUGLYAK et al., 2003 Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat. Genet. 33:518-521.
CHAKRABORTY, R. and K. M. WEISS, 1988 Admixture as a tool for finding linked genes and detecting that difference from allelic association between loci. Proc. Natl. Acad. Sci. USA 85:9119-9123.
COOPER, D. N. and M. KRAWCZAK, 1989 Cytosine methylation and the fate of CpG dinucleotides in vertebrate genomes. Hum. Genet. 83:181-188.
DALY, M. J., J. D. RIOUX, S. F. SCHAFFNER, T. J. HUDSON, and E. S. LANDER, 2001 High-resolution haplotype structure in the human genome. Nat. Genet. 29:229-232.
DE MASSY, B., 2003 Distribution of meiotic recombination sites. Trends Genet. 19:514-522.
FEARNHEAD, P. and P. DONNELLY, 2001 Estimating recombination rates from population genetic data. Genetics 159:1299-1318.
FRISSE, L., R. R. HUDSON, A. BARTOSZEWICZ, J. D. WALL, and J. DONFACK et al., 2001 Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69:831-843.
GABRIEL, S. B., S. F. SCHAFFNER, H. NGUYEN, J. M. MOORE, and J. ROY et al., 2002 The structure of haplotype blocks in the human genome. Science 296:2225-2229.
GRIFFITHS, A. J. F., J. H. MILLER, D. T. SUZUKI, R. C. LEWONTIN and W. M. GELBART, 1996 An Introduction to Genetic Analysis. W. H. Freeman, New York.
HELGASON, A., B. HRAFNKELSSON, J. R. GULCHER, R. WARD, and K. STEFANSSON, 2003 A populationwide coalescent analysis of Icelandic matrilineal and patrilineal genealogies: evidence for a faster evolutionary rate of mtDNA lineages than Y chromosomes. Am. J. Hum. Genet. 72:1370-1388.
HELLMANN, I., I. EBERSBERGER, S. PTAK, S. PAABO, and M. PRZEWORSKI, 2003 A neutral explanation for the correlation of diversity with recombination in humans. Am. J. Hum. Genet. 72:1527-1535.
HUDSON, R. R., 1987 Estimating the recombination parameter of a finite population model without selection. Genet. Res. 50:245-250.
HUDSON, R. R., 1990 Gene genealogies and the coalescent process, pp. 1-44 in Oxford Surveys in Evolutionary Biology. Oxford University Press, Oxford.
HUDSON, R. R., 2001 Two-locus sampling distributions and their application. Genetics 159:1805-1817.
JEFFREYS, A. J. and R. NEUMANN, 2002 Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat. Genet. 31:267-271.
JEFFREYS, A. J., A. RITCHIE, and R. NEUMANN, 2000 High resolution analysis of haplotype diversity and meiotic crossover in the human TAP2 recombination hotspot. Hum. Mol. Genet. 9:725-733.
JEFFREYS, A. J., L. KAUPPI, and R. NEUMANN, 2001 Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29:217-222.
KONG, A., D. F. GUDBJARTSSON, J. SAINZ, G. M. JONSDOTTIR, and S. A. GUDJONSSON et al., 2002 A high-resolution recombination map of the human genome. Nat. Genet. 31:241-247.
KRUGLYAK, L., 1999 Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22:139-144.
LI, N. and M. STEPHENS, 2003 Modelling linkage disequilibrium and identifying recombination hotspots using SNP data. Genetics 165:2213-2233.
MAY, C. A., A. C. SHONE, L. KALAYDJIEVA, A. SAJANTILA, and A. J. JEFFREYS, 2002 Crossover clustering and rapid decay of linkage disequilibrium in the Xp/Yp pseudoautosomal gene SHOX. Nat. Genet. 31:272-275.
MCVEAN, G. A., 2002 A genealogical interpretation of linkage disequilibrium. Genetics 162:987-991.
NACHMAN, M. W., 2001 Single nucleotide polymorphisms and recombination rate in humans. Trends Genet. 17:481-485.
OHTA, T. and M. KIMURA, 1971 Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics 68:571-580.
PETES, T. D., 2001 Meiotic recombination hot spots and cold spots. Nat. Rev. Genet. 2:360-369.
PHILLIPS, M. S., R. LAWRENCE, R. SACHIDANANDAM, A. P. MORRIS, and D. J. BALDING et al., 2003 Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33:382-387.
PITTMANN, D. L. and J. C. SCHIMENTI, 1998 Recombination in the mammalian germ line. Curr. Top. Dev. Biol. 7:1-35.