Genetics, Vol. 174, 1517-1528, November 2006, Copyright © 2006
doi:10.1534/genetics.106.060723

* Molecular and Computational Biology and
Biostatistics Division, Department of Preventive Medicine, Keck School of Medicine, University of Southern California, Los Angeles, California 90089
1 Corresponding author: Biotechnology Building, Room 169, Cornell University, Ithaca, NY 14853.
E-mail: pkbadri{at}yahoo.com
| ABSTRACT |
|---|
Recombination rates can be estimated using a variety of techniques, such as sperm typing (e.g., JEFFREYS et al. 2001), pedigree studies (e.g., KONG et al. 2002), and population genetic methods. Although sperm typing can provide estimates at the finest possible resolution, it is currently not practical for whole-genome studies. Existing pedigree studies, on the other hand, can offer whole-genome coverage but cannot provide the required resolution (i.e., at the kilobase scale). Therefore, population genetic approaches have proved to be valuable. These are faster and easier to implement than sperm-typing techniques and can offer much higher resolution than is possible from current pedigree studies.
Population genetic methods use polymorphism data from DNA sequences sampled from a population. They infer the population-scaled recombination rate ρ (= 4Nr), where N denotes the effective population size and r denotes the recombination fraction, on the basis of simplified models of population evolution. Estimation of ρ is a well-studied problem in the field of population genetics and many different estimators are currently available. The simplest methods are based on ad hoc moment estimators. These are quick and easy to compute but are inaccurate since they do not use the available information efficiently (e.g., HUDSON 1987; HEY and WAKELEY 1997; WAKELEY 1997). In contrast, full-likelihood methods are elegant (GRIFFITHS and MARJORAM 1996; KUHNER et al. 2000; NIELSEN 2000; FEARNHEAD and DONNELLY 2001) and make full use of the available haplotype information, but prove to be computationally infeasible for larger data sets (e.g., >15-kb regions in humans). To overcome both these limitations, several practically useful compromise approaches have been proposed. These approaches try to avoid the computational expense of calculating exact likelihoods for the observed data while maintaining some likelihood-based framework (e.g., WALL 2000; HUDSON 2001; FEARNHEAD and DONNELLY 2002; LI and STEPHENS 2003; WALL 2004).
The method used in WALL (2000) involves describing data sets with one or more summary statistics and then performing maximum-likelihood inference using the reduced data. The success of this approach depends on finding summaries that efficiently collect information from the data. The combination of the number of distinct haplotypes and the minimum number of inferred recombination events (HUDSON and KAPLAN 1985) has been found to work reasonably well (WALL 2000; HUDSON 2001).
The methods described by HUDSON (2001), FEARNHEAD and DONNELLY (2002), and WALL (2004) utilize composite likelihoods. In this approach, we break a data set into smaller subsets, calculate full likelihoods for these subsets, and then multiply these likelihoods together to get "composite" likelihoods. For example, Hudson's method calculates the likelihoods of the haplotype configurations for all possible SNP pairs and multiplies these likelihoods together. Theoretical results show that a simple modification to this method, where SNP pairs are given weights that decay with the distance between them, can give a consistent estimator of the recombination rate (FEARNHEAD 2003). The composite-likelihood curve can provide a good point estimate of ρ. Because there is dependency between the subsets, standard asymptotic maximum-likelihood assumptions do not apply and therefore the uncertainty in estimates has to be calculated from simulations. The method of WALL (2004) is similar to the method of HUDSON (2001) but considers all triplets of sites instead of pairs. The FEARNHEAD and DONNELLY (2002) method is slightly different from the other two methods and calculates full likelihoods for small nonoverlapping windows along the sequence.
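An alternate approach, proposed by LI and STEPHENS (2003), considers the likelihood of ρ for a given data set as a product of the conditional distributions of observing a haplotype, given a subset of the other haplotypes. If H1, H2, ..., Hn denotes a sample of n haplotypes, then
$$P(H_1, \ldots, H_n \mid \rho) = P(H_1 \mid \rho)\, P(H_2 \mid H_1, \rho) \cdots P(H_n \mid H_1, \ldots, H_{n-1}, \rho).$$
The authors approximate these conditional distributions and estimate ρ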
by maximizing their product. Since this method is sensitive to the order in which the haplotypes are considered, the authors estimated their likelihoods by averaging over several possible orders. There are no theoretical results available for this method. Many of the estimation methods mentioned previously assume that recombination happens only in the form of crossing-over events. However, this model is not biologically realistic. Current meiotic recombination models allow for two different kinds of events (e.g., SZOSTAK et al. 1983). We call these two forms of recombination "crossing over" and "gene conversion," respectively. Crossing over refers to the reciprocal exchange of large chromosomal fragments whereas gene conversion refers to short exchanges between chromosomes that are not accompanied by crossing over. Theoretical results that incorporate both these mechanisms have been developed before (e.g., ANDOLFATTO and NORDBORG 1998; WIUF and HEIN 2000). Using these models, it is possible to generalize the composite-likelihood approach of HUDSON (2001) for estimating both crossing-over and gene-conversion rates (e.g., FRISSE et al. 2001; PTAK et al. 2004). To do so, it is only necessary to specify the effective recombination rate between a pair of sites (from both crossing over and conversion) as a function of distance (e.g., ANDOLFATTO and NORDBORG 1998 or LANGLEY et al. 2000). The method of WALL (2004) can also be used for jointly estimating both crossing-over and gene-conversion rates and has been shown to give more accurate estimates than the method of HUDSON (2001).
In this article, we introduce a novel method for jointly estimating both crossing-over and gene-conversion rates from single-nucleotide polymorphisms (SNPs) using summary statistics. We first tested the performance of this method on simulated data sets and compared it with that of the composite-likelihood approach (HUDSON 2001). For this comparison, we simulated both phased and unphased data with uniform and nonuniform recombination rates along the sequence. We then applied our method to a human data set recently genotyped by Perlegen Sciences (HINDS et al. 2005).
| MATERIALS AND METHODS |
|---|
In the summary statistics method, we first define patterns for SNPs on the basis of the absolute value of pairwise D' (D' denotes the normalized measure of linkage disequilibrium, LD). For example, for a pair of SNPs A and B, D'(AB) < 1.0, D'(AB) < 0.5, D'(AB) < 0.1, etc., denote patterns. Similarly, for three SNPs A, B, and C, D'(AB) < 1.0 and D'(BC) < 1.0, D'(AB) < 0.5 and D'(BC) < 0.5, etc., denote patterns. Informally, we try to summarize the distribution of LD levels for all triplets or pairs of SNPs within a data set by calculating the fraction that show any particular pattern. Since our summary statistics are based on all triplets or pairs, our method uses approximately full sequence information. Note that the expectation of pairwise D' and its distribution depend on the underlying recombination rate. So, the probability of observing a given pattern increases monotonically with the recombination rate.
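The pair-based summaries described above can be sketched in a few lines. This is only an illustration, assuming haplotypes coded as 0/1 sequences and SNP positions sorted in base pairs; the function names `d_prime` and `pair_pattern_fraction` are hypothetical and are not taken from the authors' software. Exact fractions are used so that the comparison |D'| < 1.0 is not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations

def d_prime(haps, i, j):
    """Return |D'| between sites i and j, or None if either site is monomorphic.

    haps: list of equal-length 0/1 sequences, one per sampled chromosome.
    """
    n = len(haps)
    p_a = Fraction(sum(h[i] for h in haps), n)   # frequency of allele 1 at site i
    p_b = Fraction(sum(h[j] for h in haps), n)   # frequency of allele 1 at site j
    p_ab = Fraction(sum(1 for h in haps if h[i] == 1 and h[j] == 1), n)
    d = p_ab - p_a * p_b                         # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else None

def pair_pattern_fraction(haps, positions, max_dist, cutoff):
    """Fraction of SNP pairs no more than max_dist apart whose |D'| is below cutoff."""
    hits = total = 0
    for i, j in combinations(range(len(positions)), 2):
        if positions[j] - positions[i] > max_dist:
            continue
        value = d_prime(haps, i, j)
        if value is None:
            continue
        total += 1
        hits += value < cutoff
    return hits / total if total else 0.0
```

For example, `pair_pattern_fraction(haps, positions, 50_000, 1.0)` would give the fraction of SNP pairs within 50 kb with D' < 1.0. Triplet patterns could be built the same way by iterating over triples and testing D'(AB) and D'(BC) together.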
Coestimating crossing-over and gene-conversion rates:
Although both mechanisms of recombination lead to the decay of LD, the effects of crossing over and gene conversion are qualitatively different. While the rate of decay of LD by crossing over increases as the distance between the markers increases, with gene conversion it is independent of distance for markers that are sufficiently far apart (WIEHE et al. 2000). Therefore, the effects of gene conversion are significant only for short-range markers whereas the effects of crossing over dominate for long-range markers (ANDOLFATTO and NORDBORG 1998; WIEHE et al. 2000). To jointly estimate both these parameters, we collect summary statistics from both long-range and short-range data. This allows us to distinguish models with gene conversion from those with crossing over alone.
For all the data sets considered here, we estimated rates using the following patterns:
Let P5(I) and P5(II) denote the fraction of all triplets with the outer SNPs within 5 kb of each other that show patterns I and II, respectively. Let P10(II) denote the fraction of all triplets with outer SNP pairs within 10 kb of each other that show pattern II. These denote our short-range summary statistics. Patterns I and II are indicative of gene-conversion events in short-range data and can potentially arise from a single gene-conversion event including the middle SNP in a triplet (WIEHE et al. 2000; PADHUKASAHASRAM et al. 2004). Let P50(III) denote the fraction of all triplets with outer SNPs within 50 kb of each other that show pattern III and P50(IV) denote the fraction of all SNP pairs within 50 kb of each other with D' < 1.0. These denote our long-range summary statistics.
Choice of patterns:
The choice of summary statistics that capture key features of full sequence information is important for our method to work efficiently. To find such informative summaries, we first tested the performance of many different patterns (listed in APPENDIX A) for simulated data sets. We found that patterns that are too rare are not suitable estimators for low recombination values because they are almost never observed. Similarly, patterns that are too common are not suitable estimators for high recombination values because summaries based on them become almost insensitive to recombination in that range. Using multiple patterns in both long-range and short-range data worked better than any individual summaries. In general, it appears that a few (two or three) different patterns are sufficient to describe the distribution of recombination levels in a data set accurately and can roughly approximate full sequence information for a wide range of recombination rates (e.g., as in Table 1). We selected a combination of patterns that performed well for the values considered in Table 1. Adding more summary statistics to this combination did not bring any significant improvements in performance. Therefore, we decided to use this set of patterns for comparing our method with the composite-likelihood approach.
To estimate the rates of crossing over (ρ) and gene conversion (γ) from a test data set, we calculate both short-range and long-range summaries and use all of them in a simple rejection-sampling scheme. In this scheme, we first simulate a large number of data sets for a finite grid of parameter values and compute summary statistics for each. Then, we accept a simulated data set if each one of its summaries lies within 30% of the corresponding value observed in the test data set and reject it otherwise. We chose a high acceptance rate so that we accept a reasonably large number of the simulated data sets given the summary combination chosen and the total number of data sets simulated (see APPENDIX B for performance for a few other choices). The likelihood for a parameter value is approximated as the fraction of data sets (simulated at that value) that are accepted (for more details about rejection methods see WEISS and VON HAESELER 1998 and MARJORAM et al. 2003).
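The rejection scheme can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes that "within 30%" means a relative tolerance on each summary, and that the summaries for every grid point have already been simulated and stored in a dictionary keyed by the pair of rate values. All names are hypothetical.

```python
def within_tolerance(simulated, observed, tol=0.30):
    """True if every simulated summary lies within tol (relative) of the observed one."""
    return all(abs(s - o) <= tol * abs(o) for s, o in zip(simulated, observed))

def rejection_likelihoods(observed, simulated_by_param, tol=0.30):
    """Approximate the likelihood at each grid point as the accepted fraction.

    simulated_by_param maps a grid point, e.g. (crossing_over, conversion),
    to a list of summary-statistic vectors simulated at that point.
    """
    return {
        param: sum(within_tolerance(s, observed, tol) for s in sims) / len(sims)
        for param, sims in simulated_by_param.items()
    }

def max_likelihood_point(likelihoods):
    """Grid point with the highest approximate likelihood (no smoothing)."""
    return max(likelihoods, key=likelihoods.get)
```

Because the simulated summaries do not depend on the test data set, they can be generated once and reused for every data set that is analyzed on the same grid.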
Extension to genotype data:
To extend our summaries to genotype data, we simply omit double heterozygotes (phase unknown) when determining D' between any pair of SNPs.
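One way to read this rule is sketched below, assuming genotypes coded as the number of copies (0, 1, or 2) of one allele at each SNP. For a pair of SNPs, every individual that is not heterozygous at both sites contributes two haplotypes whose phase is unambiguous, and double heterozygotes are skipped. The function names are hypothetical.

```python
from fractions import Fraction

def resolved_haplotype_counts(genotypes, i, j):
    """Count two-site haplotypes at sites i and j, omitting double heterozygotes."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    alleles = {0: (0, 0), 1: (0, 1), 2: (1, 1)}   # genotype -> its two alleles
    for g in genotypes:
        if g[i] == 1 and g[j] == 1:
            continue                               # phase unknown: omitted
        for a, b in zip(alleles[g[i]], alleles[g[j]]):
            counts[(a, b)] += 1
    return counts

def d_prime_from_counts(counts):
    """|D'| from two-site haplotype counts, or None if undefined."""
    n = sum(counts.values())
    if n == 0:
        return None
    p_a = Fraction(counts[(1, 0)] + counts[(1, 1)], n)
    p_b = Fraction(counts[(0, 1)] + counts[(1, 1)], n)
    d = Fraction(counts[(1, 1)], n) - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else None
```

Dropping double heterozygotes discards some information, which is why this treatment of genotype data is approximate rather than exact.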
Simulations:
DNA sequences were simulated under the coalescent, assuming no population structure, a large constant population size (N), no selection, and the infinite-sites model for mutations. The population mutation rate θ (= 4Nu) was assumed to be uniform along the sequence. Here, u denotes the per-generation, per-sequence probability of a mutation event.
For modeling gene conversion, we used the coalescent with both crossing over and gene conversion, as described by WIUF and HEIN (2000). Gene-conversion tract lengths are assumed to be geometrically distributed with a mean length L. The population crossing-over rate ρ (= 4Nr) and population gene-conversion rate γ (= 4Nc) are assumed to be uniform along the sequence. Here, r denotes the per-generation, per-sequence probability of a crossing-over event, and c denotes the per-generation, per-sequence probability of a gene-conversion event. Note that this model is equivalent to the assumption that events occur at a total rate of ρ + γ and that each recombination event results in crossing over with probability ρ/(ρ + γ) and in gene conversion otherwise. The ratio of gene conversion to crossing over is denoted by f (= γ/ρ).
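The equivalence stated above (one pooled event process, with each event classified afterward) can be illustrated with a small sampler. This is a sketch under stated assumptions: the function name is hypothetical, the arguments `rho` and `gamma` stand for the population crossing-over and gene-conversion rates, and the geometric tract length is taken to have support 1, 2, ... bp so that a success probability of 1/L gives mean L.

```python
import math
import random

def sample_recombination_event(rho, gamma, mean_tract, rng):
    """Classify one recombination event and, for a conversion, draw a tract length."""
    # Crossing over with probability rho / (rho + gamma), gene conversion otherwise.
    if rng.random() < rho / (rho + gamma):
        return ("crossover", None)
    p = 1.0 / mean_tract                  # geometric success probability
    if p >= 1.0:
        return ("conversion", 1)
    u = rng.random()                      # uniform on [0, 1)
    # Inverse-CDF draw from a geometric distribution on 1, 2, ...
    length = 1 + int(math.log(1.0 - u) / math.log(1.0 - p))
    return ("conversion", length)
```

With `rho = 20`, `gamma = 30`, and `mean_tract = 500`, about 40% of sampled events should be crossovers and the conversion tracts should average about 500 bp.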
In addition to the standard model of gene conversion, we also simulated data under some alternate models where either crossing over alone or both conversion and crossing over were nonuniform along the sequence. For modeling nonuniform recombination, we assumed that recombination rates are elevated for some 1-kb regions (called hotspots) that occur at certain fixed locations along the sequence. A significant fraction of events occur within these hotspots, whereas the rest of the events occur in the intervening regions. Recombination within hotspots as well as within non-hotspot regions was assumed to be uniform. All hotspots have identical (higher) levels of recombination. Similarly, all non-hotspot regions also have identical (lower) levels of recombination.
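As a worked example of this hotspot model, the per-base-pair rates inside and outside hotspots follow directly from the total rate, the fraction of events assigned to hotspots, and the amount of sequence covered by hotspots. The sketch below uses a hypothetical helper, `hotspot_rates`; the numbers in the usage note (two 1-kb hotspots in a 50-kb sequence carrying 50% of events) correspond to one hotspot every 25 kb.

```python
def hotspot_rates(total_rate, seq_len, n_hotspots, hotspot_len=1000,
                  frac_in_hotspots=0.5):
    """Return (rate per bp inside hotspots, rate per bp in background regions).

    total_rate is the population rate for the whole sequence; rates are
    uniform within hotspots and uniform within the background.
    """
    hot_bp = n_hotspots * hotspot_len
    if not 0 < hot_bp < seq_len:
        raise ValueError("hotspots must cover part, but not all, of the sequence")
    background_bp = seq_len - hot_bp
    hot_rate = frac_in_hotspots * total_rate / hot_bp
    background_rate = (1.0 - frac_in_hotspots) * total_rate / background_bp
    return hot_rate, background_rate
```

For instance, `hotspot_rates(100.0, 50_000, 2)` places half of the total rate in 2 kb of hotspot sequence and half in the remaining 48 kb, so the per-base-pair rate inside a hotspot is 24 times the background rate.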
To compare estimation methods, we tested them on 50-kb DNA sequences simulated with θ set to 0.8/kb (estimates for human data from INNAN et al. 2003) and sample size of n = 18 for haplotype data and 2n = 36 for genotype data. It is usually difficult to estimate both gene-conversion rates and tract lengths from SNPs (PADHUKASAHASRAM et al. 2004). Therefore, conversion rates were always estimated with the tract length (L) fixed at 500 bp and this also facilitates comparisons with previous studies (such as FRISSE et al. 2001; PADHUKASAHASRAM et al. 2004; PTAK et al. 2004). For smaller tract lengths, the estimated conversion rates are expected to be much higher. To summarize the performance of methods, we used the following criteria:
Unphased data sets were generated by first simulating haplotypes and then grouping random pairs of chromosomes together into individuals. We then assumed that within any individual, phase is unknown for double heterozygotes. For estimating rates from the human genotype data set, we simulated 50-kb DNA sequences with θ set to 0.8/kb and sample size 2n = 142.
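The pairing step can be sketched as below, assuming haplotypes coded as 0/1 sequences; the function name is hypothetical. Summing the two alleles at each site yields a genotype of 0, 1, or 2, which automatically hides the phase of double heterozygotes.

```python
import random

def haplotypes_to_genotypes(haps, rng):
    """Pair chromosomes at random into diploid individuals and return genotypes."""
    if len(haps) % 2:
        raise ValueError("an even number of chromosomes is required")
    order = list(range(len(haps)))
    rng.shuffle(order)                               # random pairing
    return [
        [a + b for a, b in zip(haps[order[k]], haps[order[k + 1]])]
        for k in range(0, len(order), 2)
    ]
```

Passing a seeded generator such as `random.Random(0)` makes the pairing reproducible across runs.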
The Perlegen data set:
Perlegen genotyped ~1.6 million SNPs across the human genome that are likely to be common in individuals of diverse ancestry (HINDS et al. 2005). These SNPs were identified by performing array-based resequencing of 24 diverse human DNA samples. Seventy-one unrelated individuals from three populations were genotyped: 24 European Americans, 23 African Americans, and 24 Han Chinese from the Los Angeles area. These 71 individuals were not related to the individuals previously used for SNP discovery.
Ascertainment and missing data:
We omit haplotypes with missing data while calculating our summaries for real data. The proportion of missing data in the Perlegen genotypes is extremely small (<2%) and thus our estimates are not significantly affected by ignoring these. In addition, when estimating rates for human data, SNPs with low minor allele frequency (<9%) were removed from both real and simulated data sets. To simulate the effects of ascertainment, we retained only those SNPs that were polymorphic in a randomly chosen sample of 24 chromosomes of the simulated data (the same 24 chromosomes for all SNPs).
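The two filters applied to simulated data (ascertainment in a fixed discovery panel, then a minor-allele-frequency cutoff) might be sketched as follows. The function name and argument names are hypothetical, haplotypes are assumed to be 0/1 sequences, and the defaults mirror the values quoted above (a panel of 24 chromosomes and a 9% cutoff).

```python
import random

def ascertain_and_filter(haps, rng, panel_size=24, min_maf=0.09):
    """Return indices of sites that pass both ascertainment and the MAF filter."""
    n = len(haps)
    panel = rng.sample(range(n), panel_size)     # same panel for every SNP
    kept = []
    for site in range(len(haps[0])):
        panel_count = sum(haps[c][site] for c in panel)
        if panel_count in (0, panel_size):
            continue                             # not polymorphic in the panel
        freq = sum(h[site] for h in haps) / n
        if min(freq, 1.0 - freq) < min_maf:
            continue                             # minor allele too rare
        kept.append(site)
    return kept
```

The summary statistics would then be computed only on the retained sites, for both the real and the simulated data sets.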
Maxhap and Maxdip:
Maxhap and Maxdip are programs for estimating recombination rates from haplotype and genotype data, respectively, on the basis of the composite-likelihood approach of HUDSON (2001). We used them for estimating recombination rates from simulated data sets and compared their performance with our summary statistics method.
| RESULTS |
|---|
θ set to 0.8/kb, n = 18, and L fixed at 500 bp.
For each simulated data set, we estimated gene-conversion and crossing-over rates using our rejection method (described in MATERIALS AND METHODS) as well as using Maxhap. Estimates for both the methods were obtained by calculating likelihoods for a finite grid of ρ- and γ-values (identical grids were used for both methods). The grids of ρ and γ used for the first three rows and next three rows in Table 1 were (0.1, 0.50, 1.00, 2.50, 5.00, 7.50, 10.0, 15.0, 20.0, 30.0, 40.0, 60.0, 80.0, 100.0, 120.0, 140.0) and (0.0001, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200, 220, 240), respectively. Likelihoods for the summary statistics method were approximated using 40,000 simulated data sets (for each ρ, γ combination) for the first grid and using 10,000 data sets for the second. We did not smooth likelihood surfaces for obtaining the maximum-likelihood estimates.
Likelihoods for the two-locus configurations in Maxhap were based on 2 × 10^6 replicates. Estimating rates for 500 data sets using our method took ~868 sec once the data for approximating the likelihoods had been simulated (which took ~111 hr for the first and 245 hr for the second grid) on a 2-GHz Xeon processor machine. Estimating rates using Maxhap took ~2.25 and 5.2 sec/50-kb data set for the first grid (with 256 values) and the second grid (with 529 values), respectively.
Table 1 summarizes accuracy (g), nature of bias (B), and error (V) for both these methods. We found that for estimating gene-conversion rates, the summary statistics method has higher accuracy than Maxhap, which tends to underestimate the conversion rate more often. For estimating crossing-over rates, both the methods have similar accuracy and nature of bias. The root mean square relative error was also roughly similar for both the methods.
Confidence intervals and their coverage properties:
We constructed ~90% confidence intervals for the gene-conversion and crossing-over rate estimates obtained using our method and examined their coverage properties. For doing this, we first simulated 5000 data sets each for five different parameter combinations and estimated gene-conversion and crossing-over rates on the basis of the first grid used for Table 1. Because this grid is coarse and we did not use smoothing, it is difficult to obtain precise confidence intervals for our method. We increased the interval width around the estimated conversion or crossing-over rate until it included at least 90% of the total sum of the likelihoods. Then, we chose a subset (~1000) of these simulated data sets for which the confidence intervals contained 90–91% of the total and calculated the fraction of such data sets for which these intervals included the true value (Table 2). This crude coverage study suggests that the actual coverage probabilities obtained from the rejection method used here may differ slightly from the nominal probabilities. Note that we assume a uniform prior distribution for the parameters along the grid values used and confidence intervals were calculated jointly for both the recombination parameters.
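The interval-widening step can be illustrated in one dimension. This sketch is a simplification and not the authors' procedure: the intervals described above were calculated jointly for both parameters, whereas the hypothetical function below works on a single parameter, takes the number of accepted data sets at each grid value as the (unnormalized) likelihood, and assumes the interval grows by one grid point on each side per step.

```python
def likelihood_interval(grid, accepted, level=0.90):
    """Widen an interval around the best grid value until it holds `level` of the mass.

    grid: sorted parameter values; accepted: accepted counts at each value
    (proportional to likelihood if every grid value has the same number of
    simulated data sets). Returns (lower, upper, fraction of mass covered).
    """
    total = sum(accepted)
    if total == 0:
        raise ValueError("no simulated data sets were accepted")
    best = max(range(len(grid)), key=lambda k: accepted[k])
    lo = hi = best
    mass = accepted[best]
    while mass < level * total and (lo > 0 or hi < len(grid) - 1):
        if lo > 0:
            lo -= 1
            mass += accepted[lo]
        if hi < len(grid) - 1:
            hi += 1
            mass += accepted[hi]
    return grid[lo], grid[hi], mass / total
```

On a coarse grid the covered mass usually overshoots the nominal level, which is why the coverage study above restricts attention to data sets whose intervals contain 90–91% of the total.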
(0, 10, 20, 30, 40, ... , 90) and
(0, 10, 20, 30, 40, 60, ... , 200) values and likelihoods in our method were approximated using 10,000 simulated data sets (for each combination of ρ and γ in this grid). Likelihoods for the two-locus configurations for Maxdip were based on 2 × 10^6 replicates. Table 3 summarizes the accuracy (g), nature of bias (B), and error (V) for unphased data. We find that for estimating conversion rates our method has similar accuracy and higher error compared to Maxdip, which tends to underestimate more often, whereas for estimating crossing over it has lower accuracy, higher error, and similar nature of bias.
Comparison between haplotype and genotype data:
We also simulated 500 phased data sets (Table 4) for a sample size of 18 chromosomes for comparison with Table 3 and estimated rates using our method and Maxhap. From this comparison, we find that the accuracy of estimates using Maxdip and 18 genotypes was higher than the accuracy obtained with Maxhap and 18 haplotypes. For the summary statistics method, the accuracy with 18 genotypes was slightly lower than the accuracy obtained from 18 haplotypes. The nature of bias was roughly similar for both unphased and phased data sets for both the methods (Tables 3 and 4).
Models with nonuniform crossing over and nonuniform conversion:
Sperm-typing experiments of JEFFREYS and MAY (2004) have revealed the presence of highly localized gene-conversion activity in some crossing-over hotspots in humans. Thus, both conversion and crossing over may be elevated for some regions in the human genome. We also tested the performance of our method and Maxhap for phased data simulated under models where both crossing over and gene conversion are nonuniform along the sequence. This model assumes that 50% of both conversion and crossing-over events happen in 1-kb hotspots that occur once every 25 kb along the sequence. We simulated 500 data sets for this nonstandard model of recombination for the same parameters as in Tables 3 and 4 and estimated rates similarly. Table 7 shows results for these data sets.
Recombination in human data:
To illustrate our method in real data, we applied it to the Perlegen genotype data set and estimated gene conversion and crossing over along human chromosome 1 (Figure 1). Likelihoods were calculated by simulating 10,000 data sets each for a grid of ρ (0, 5, 10, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300, 320) and γ (0, 5, 10, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 280, 300, 320) values. Gene-conversion and crossing-over rate estimates averaged over all 50-kb windows (with
30 SNPs) in chromosome 1 are 0.00066 and 0.00038/bp, respectively (f = 1.736), assuming that L = 500 bp. These are similar to estimates for chromosome 21 haplotypes in PADHUKASAHASRAM et al. (2004). Note that conversion estimates are highly sensitive to the assumed conversion tract length and estimates are expected to be much higher for smaller tract lengths (e.g., compare estimates for different tract lengths in PADHUKASAHASRAM et al. 2004).
70 SNPs) from chromosome 1–22 and estimated recombination rates for a tract length of L = 500 bp. Then, we calculated the correlation coefficient between the estimated rates of gene conversion and crossing over. We found that these estimates were not strongly correlated (Pearson's coefficient R = 0.102, P = 0.12). Because the level of uncertainty associated with our estimates is high, it is not clear how high a correlation should be expected even if these two parameters happen to be perfectly correlated across the genome. To get an idea of this, we first simulated 100 data sets of 232 independent 50-kb windows each, with crossing-over rates set to corresponding estimates in real data and f set to the ratio of the average conversion rate to average crossing-over rate estimated from the 232 windows in humans. Recombination rates were assumed to be uniform within windows in these simulations. For each simulated data set, we estimated rates as we did for the human data set and computed the correlation coefficient between conversion and crossing over. The lowest value of R observed in these simulations was 0.50.
Fine-scale recombination rate variation within windows can greatly increase the levels of uncertainty associated with our estimated rates. To see if R is expected to be much lower for some plausible models with nonuniform recombination, we simulated another set of 100 data sets where the overall recombination rates were set to the same values as before. In these simulations, we allowed both crossing-over and gene-conversion rates to vary within windows, assuming that a significant fraction (x) of events happen in hotspots that occur at certain fixed locations along the sequence. x was given values of 0.25 or 0.5 or 0.75 with equal frequency among the 232 windows. Note that in this model conversion and crossing over covary in an identical pattern, so that f remains uniform along the sequence. We then looked at the distribution of the correlation coefficient between the estimated conversion and crossing-over rates for these data sets. The lowest value of R observed in these simulations was 0.245 and values <0.3 were observed in only 3 of the 100 simulated data sets. These results seem to suggest that our data set deviates significantly from models where crossing-over and gene-conversion rates are perfectly correlated with one another and therefore that either the parameter f or the conversion tract length (L) may vary along the human genome.
Relationship between GC content and recombination rates:
We also calculated GC percentage for 50-kb windows with high SNP density (
70 SNPs) and looked at the correlation with the estimated crossing-over and gene-conversion rates. At this scale, crossing-over rates are positively correlated with the GC content (Pearson's coefficient R = 0.3138, P = 9.224 × 10^-7) whereas gene-conversion rates for L = 500 bp are less strongly associated (Pearson's coefficient R = 0.1269, P = 0.05195). However, note that gene-conversion estimates may be highly unreliable because they are sensitive to assumptions about tract lengths.
| DISCUSSION |
|---|
In contrast to other approximate-likelihood methods that also utilize full sequence information (such as HUDSON 2001; FEARNHEAD and DONNELLY 2002; LI and STEPHENS 2003), the uncertainty in estimates in the summary statistics method can be evaluated directly. Another important advantage of our approach could be its flexibility. It is relatively easy to extend our method to any complex demographic scenario provided that data can be simulated under that scenario within the coalescent framework. Demography can affect the performance of some of the other currently available methods (e.g., see SMITH and FEARNHEAD 2005). Our method can be made more robust to such effects if we first estimate demographic parameters from the data and then infer recombination rates under a suitable model (or alternately estimate both recombination rates and demography jointly).
We have used a simple rejection-sampling scheme for estimating the recombination parameters in this study. The main limitation of rejection-sampling methods is that only a small number of summary statistics can usually be handled. Otherwise, acceptance rates become prohibitively low or tolerance levels must be increased, which can distort the approximation of likelihoods. The efficiency of rejection methods such as ours can be improved by using techniques like smooth weighting and regression adjustment (e.g., by using local linear regression) described in BEAUMONT et al. (2002). The key benefit of these techniques is that they use approximations that are insensitive to tolerance and this can permit us to increase the number of summary statistics used and also widen the tolerance levels.
The population mutation parameter was assumed to be uniform along the sequence for our simulations. A better way to use our method might be by simulating data sets conditional on the observed number of segregating sites (S) in the same positions as in real data. This approach was first proposed in HUDSON (1993) and can be useful for surveys of regions with intervening gaps. Although using a fixed S scheme will result in a null model that is slightly different from the standard coalescent model, simulation studies (WALL 2000) suggest that the performances of estimation methods do not change much.
When estimating rates from unphased data sets (Table 3), we expected that the relative performance of our method would drop because we simply ignored double heterozygotes in such data sets. Maxdip, on the other hand, considers all genotypes exactly for estimating rates. In agreement with this expectation, we found that the accuracy of gene-conversion estimates using our method was similar to that of Maxdip, whereas estimates of crossing over were less accurate. However, note that reconstructing the phase first by using some phase-estimating program (such as PHASE) and then using our method or Maxhap on the resulting data set may yield more accurate estimates than Maxdip for unphased data sets (see SMITH and FEARNHEAD 2005).
S. PTAK, M. PRZEWORSKI and R. R. HUDSON (unpublished results cited in PTAK et al. 2004) have reported that using k genotypes should work better than k haplotypes for estimating recombination rates using Hudson's composite-likelihood method. Our simulation results support this conclusion. We found that the accuracy of estimates for Maxdip using 18 genotypes was higher than the accuracy obtained from Maxhap using 18 haplotypes. In contrast, since the summary statistics method is inexact for genotype data, we found that the accuracy with 18 genotypes was slightly lower than the accuracy obtained from 18 haplotypes.
Given that recombination rates vary substantially along the genome on a fine scale, we also tested the performance of methods for data simulated with recombination hotspots. For models with nonuniform crossing-over and uniform gene-conversion rates, the performances of Maxhap and Maxdip seem to be slightly sensitive to variation in the crossing-over rates. In particular, the accuracy of estimating conversion rates and the tendency to underestimate gene conversion became a little worse compared to data simulated with uniform recombination. In contrast, the performance of our method appears to be relatively unaffected in these regards. We note that in the summary statistics approach, we estimate conversion rates on the basis of the difference between long-range and short-range summary statistics. The robustness of our method to nonuniform crossing over suggests that this difference (between long-range and short-range data) depends mainly on the gene-conversion rate and may be insensitive to moderate deviations from the uniform crossing-over model. For models with nonuniform gene conversion and nonuniform crossing over, the accuracy of estimating gene-conversion rates decreased substantially for both methods and there is considerable bias toward underestimating the gene-conversion rate. This may be because gene-conversion hotspots may sometimes not contain any SNP and in these cases a majority of conversion events do not leave a trace in the sample. On the other hand, both methods generally appear to be more robust to nonuniform crossing over and the accuracy of estimating crossing-over rates did not change much for the nonstandard models considered here.
Although both gene conversion and crossing over are thought to arise from common intermediates (i.e., Holliday junctions), the relationship between these two processes has not been clear so far. Some recent results have challenged the original Holliday model that was proposed for the mechanisms underlying conversion and crossing over (ALLERS and LICHTEN 2001). While meiotic crossing over is believed to be essential for the precise disjunction of homologous chromosomes (because it maintains physical connections between homologous DNA) and creates genetic diversity, the significance of meiotic gene conversion is not well understood. Because gene-conversion estimates are highly sensitive to the assumed tract length and human data on the distribution of tract lengths are limited, it is difficult to draw any strong conclusions about the relationship between these two different recombination mechanisms from our data set. If conversion and crossing-over rates are indeed not strongly correlated across the human genome, this could be because the biological pathways leading to these mechanisms might be different (e.g., see results in yeast in ALLERS and LICHTEN 2001).
| APPENDIX A |
|---|
|
|
|---|
For short range, we tested the following patterns for both the 5-kb and the 10-kb range for outer SNPs in triplets:
| APPENDIX B: PERFORMANCE FOR DIFFERENT ACCEPTANCE RATES |
|---|
|
|
|---|
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| LITERATURE CITED |
|---|
|
|
|---|
ALLERS, T., and M. LICHTEN, 2001 Differential timing and control of noncrossover and crossover recombination during meiosis. Cell 106: 47–57.
ANDOLFATTO, P., and M. NORDBORG, 1998 The effect of gene-conversion on intralocus associations. Genetics 148: 1397–1399.
ANDOLFATTO, P., and J. D. WALL, 2003 Linkage disequilibrium patterns across a recombination gradient in African Drosophila melanogaster. Genetics 165: 1289–1305.
BEAUMONT, M. A., W. ZHANG and D. J. BALDING, 2002 Approximate Bayesian computation in population genetics. Genetics 162: 2025–2035.
CRAWFORD, D. C., T. BHANGALE, N. LI, G. HELLENTHAL, M. J. RIEDER et al., 2004 Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 36: 700–706.
DUNHAM, I., N. SHIMIZU, B. A. ROE, S. CHISSOE, A. R. HUNT et al., 1999 The DNA sequence of human chromosome 22. Nature 402: 489–495.
FEARNHEAD, P., 2003 Consistency of estimators of the population-scaled recombination rate. Theor. Popul. Biol. 64: 67–79.
FEARNHEAD, P., and P. DONNELLY, 2001 Estimating recombination rates from population genetic data. Genetics 159: 1299–1318.
FEARNHEAD, P., and P. DONNELLY, 2002 Approximate likelihood methods for estimating local recombination rates (with discussion). J. R. Stat. Soc. Ser. B 64: 657–680.
FEARNHEAD, P., and N. G. SMITH, 2005 A novel method with improved power to detect recombination hotspots from polymorphism data reveals multiple hotspots in human genes. Am. J. Hum. Genet. 77: 781–794.
FRISSE, L., R. R. HUDSON, A. BARTOSZEWICZ, J. D. WALL, J. DONFACK et al., 2001 Gene-conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69: 831–843.
FULLERTON, S. M., R. M. HARDING, A. J. BOYCE and J. B. CLEGG, 1994 Molecular and population genetic analysis of allelic sequence diversity at the human β-globin locus. Proc. Natl. Acad. Sci. USA 91: 1805–1809.
GRIFFITHS, R. C., and P. MARJORAM, 1996 Ancestral inference from samples of DNA sequences with recombination. J. Comput. Biol. 3: 479–502.
HEY, J., and J. WAKELEY, 1997 A coalescent estimator of the population recombination rate. Genetics 145: 833–846.
HINDS, D. A., L. L. STUVE, G. B. NILSEN, E. HALPERIN, E. ESKIN et al., 2005 Whole genome patterns of common DNA variation in three human populations. Science 307: 1072–1079.
HUDSON, R. R., 1987 Estimating the recombination parameter of a finite population model without selection. Genet. Res. 50: 245–250.
HUDSON, R. R., 1993 The how and why of generating gene genealogies, pp. 23–36 in Mechanisms of Molecular Evolution, edited by N. TAKAHATA and A. G. CLARK. Sinauer Associates, Sunderland, MA.
HUDSON, R. R., 2001 Two-locus sampling distributions and their application. Genetics 159: 1805–1817.
HUDSON, R. R., and N. L. KAPLAN, 1985 Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics 111: 147–164.
INNAN, H., B. PADHUKASAHASRAM and M. NORDBORG, 2003 The pattern of polymorphism on human chromosome 21. Genome Res. 13: 1158–1168.
JEFFREYS, A. J., and C. A. MAY, 2004 Intense and highly localized gene-conversion activity in human meiotic crossover hotspots. Nat. Genet. 36: 151–156.
JEFFREYS, A. J., L. KAUPPI and R. NEUMANN, 2001 Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29: 217–222.
JEFFREYS, A. J., R. NEUMANN, M. PANAYI, S. MYERS and P. DONNELLY, 2005 Human recombination hotspots hidden in regions of strong marker associations. Nat. Genet. 37: 601–606.
KONG, A., D. F. GUDBJARTSSON, J. SAINZ, G. M. JONSDOTTIR, S. A. GUDJONSSON et al., 2002 A high-resolution recombination map of the human genome. Nat. Genet. 31: 241–247.
KUHNER, M. K., J. YAMATO and J. FELSENSTEIN, 2000 Maximum likelihood estimation of recombination rates from population data. Genetics 156: 1393–1401.
LANGLEY, C. H., B. P. LAZZARO, W. PHILLIPS, E. HEIKKINEN and J. M. BRAVERMAN, 2000 Linkage disequilibrium and the site frequency spectra in the su(s) and su(wa) regions of the Drosophila melanogaster X chromosome. Genetics 156: 1837–1852.
LI, N., and M. STEPHENS, 2003 Modeling linkage disequilibrium, and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165: 2213–2233.
MARJORAM, P., J. MOLITOR, V. PLAGNOL and S. TAVARE, 2003 Markov chain Monte Carlo without likelihoods. Proc. Natl. Acad. Sci. USA 100: 15324–15328.
MCVEAN, G. A. T., S. R. MYERS, S. HUNT, P. DELOUKAS, D. R. BENTLEY et al., 2004 The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.
MYERS, S., L. BOTTOLO, C. FREEMAN, G. A. T. MCVEAN and P. DONNELLY, 2005 A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321–324.
NIELSEN, R., 2000 Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154: 931–942.
PADHUKASAHASRAM, B., P. MARJORAM and M. NORDBORG, 2004 Estimating the rate of gene-conversion on human chromosome 21. Am. J. Hum. Genet. 75: 386–397.
PTAK, S. E., K. VOELPEL and M. PRZEWORSKI, 2004 Insights into recombination from patterns of linkage disequilibrium in humans. Genetics 167: 387–397.
SMITH, N. G. C., and P. FEARNHEAD, 2005 A comparison of three estimators of the population-scaled recombination rate: accuracy and robustness. Genetics 171: 2051–2062.
SZOSTAK, J. W., T. L. ORR-WEAVER, R. J. ROTHSTEIN and F. W. STAHL, 1983 The double-strand-break repair model for recombination. Cell 33: 25–35.
WAKELEY, J., 1997 Using the variance of pairwise differences to estimate the recombination rate. Genet. Res. 69: 45–48.
WALL, J. D., 2000 A comparison of estimators of the population recombination rate. Mol. Biol. Evol. 17: 156–163.
WALL, J. D., 2004 Estimating recombination rates using three site likelihoods. Genetics 167: 1461–1473.
WEISS, G., and A. VON HAESELER, 1998 Inference of population history using a likelihood approach. Genetics 149: 1539–1546.
WIEHE, T., J. MOUNTAIN, P. PARHAM and M. SLATKIN, 2000 Distinguishing recombination and intragenic gene-conversion by linkage disequilibrium patterns. Genet. Res. 75: 61–73.
WIUF, C., and J. HEIN, 2000 The coalescent with gene-conversion. Genetics 155: 451–462.