| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Genetics, Vol. 177, 881-894, October 2007, Copyright © 2007
doi:10.1534/genetics.107.078907
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

* Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom and
The Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02139
1 Corresponding author: Department of Statistics, University of Oxford, 1 S. Parks Rd., Oxford OX1 3TG, United Kingdom.
E-mail: mcvean{at}stats.ox.ac.uk
| ABSTRACT |
|---|
|
|
|---|
400 times more often than crossover events. We also extend the method to estimating variable crossover and gene conversion rates and estimate the rate of gene conversion to be
1.5 times higher than the crossover rate in a region of human chromosome 1 with known recombination hotspots.
In humans, gene conversion is thought to occur
4–15 times as frequently as crossover (JEFFREYS and MAY 2004), but is more difficult to detect due to the short lengths of DNA transferred. Estimates of the tract length vary between studies and between organisms/regions but tend to lie between 50 and 2000 bp (e.g., BORTS and HABER 1989; HILLIKER et al. 1994; JEFFREYS and MAY 2004). For a full description of the gene conversion process see STAHL (1994) and references therein.
Patterns of linkage disequilibrium in humans can be satisfactorily explained only by models including gene conversion (FRISSE et al. 2001). Simulations show that in genomic regions that have been subject to gene conversion, estimates of the crossover rate are inflated when gene conversion is ignored (SMITH and FEARNHEAD 2005). PRZEWORSKI and WALL (2001) showed, using human population genetic data, that gene conversion is likely to be an important factor in explaining a marked difference between estimates of the population recombination rate obtained through comparing genetic and physical maps and those found through analysis of nucleotide sequence polymorphism data. These factors have made gene conversion the subject of much investigation in recent years.
Although highly localized, the effects of gene conversion may also have a significant impact on association studies, which seek a genotyped marker that is in strong LD with an untyped allele responsible for the phenotype of interest. If gene conversion is ignored, the extent of LD over short distances is likely to be overestimated, while LD at longer distances will be underestimated due to the inflated rate of crossover needed to explain the short-range LD (FRISSE et al. 2001). These two influences on LD may affect the choice of the number of markers to genotype for a study (SCHORK 2002).
Gene conversion also affects our ability to detect the effects of natural selection on a population (ANDOLFATTO and NORDBORG 1998). Tests for deviation from the null model typically rely on an estimate of the recombination rate in a region, and ignoring the effects of gene conversion will reduce the power of tests for selection and can also increase the false positive rate of such tests.
Finally, learning about gene conversion could help us to gain biological and mechanistic insights into recombination.
It is therefore desirable to be able to estimate the frequency at which gene conversion events occur, at a fine scale, over genomic regions many megabases in length and to detect variation in gene conversion rates within such a genomic region.
Rates of both crossover and gene conversion can be estimated directly using sperm-typing experiments such as those of JEFFREYS and MAY (2004) that give highly accurate fine-scale rates, but cannot be performed on a genomic scale or on the X chromosome or in females.
Pedigree studies (e.g., KONG et al. 2002) can give further information such as sex-specific differences in crossover rates, but because of the infrequency of events, cannot give accurate fine-scale maps.
For genomewide fine-scale characterization of recombination rates a practical solution is statistical modeling of genetic data, based on simplified assumptions about the historical processes that resulted in the population genetic data seen today. Methods of inference can be performed in many different ways:
In this article we describe a statistical model of population genetic data that includes both crossover and gene conversion, where a gene conversion tract can include any number of markers. The model can be used to estimate the rates of crossover and/or gene conversion in a given region using maximum-likelihood techniques or could be implemented in a Bayesian framework. The model does not require that either rate be constant across the region of interest and could, for example, be used to obtain an estimate of the gene conversion rate in a region known to include a crossover hotspot. As well as performing tests on simulated data, we examine single-nucleotide polymorphism (SNP) data from a genomic region thought to be free from crossover hotspots and then consider a region of the human genome that contains several crossover hotspots (JEFFREYS et al. 2005).
Our results on simulated data show that gene conversion rates can be estimated fairly accurately from population genetic data, and the inclusion of gene conversion in our model results in improved estimates of the crossover rate, particularly when gene conversion is present at high levels. In a region near the telomere of the X chromosome of Drosophila melanogaster we find that gene conversion events occur >400 times as frequently as crossovers, while in a region of human chromosome 1, there is only 1.5 times as much gene conversion activity as crossover.
| MODEL |
|---|
|
|
|---|
and crossover
= 4Nec, where Ne is the effective population size and c is the per-generation probability of crossover between adjacent base pairs. We use the terms haplotype and chromosome interchangeably to refer to a chromosomal segment and assume the method will be applied to resequenced or densely genotyped SNP data, although it could also be applied to microsatellite data with a suitably adjusted emission probability. Our approach has the following properties, some of which are novel:
We chose the LS model because it does not rely on summary statistics, but attempts to use all the available information, albeit under an approximation to the likelihood, making it an ideal candidate for extension to the gene conversion model. We expect the trace of gene conversion to be difficult to detect and therefore wish to use the maximum information that can be extracted from the data.
We first introduce briefly the LS model for crossover alone and then describe the addition of gene conversion to this model. We validate our method using tests on data simulated with a range of parameter values and evaluate its robustness to deviations from our assumptions about population demographics. Finally we generalize the model to allow for variation in the rate of gene conversion.
Modeling crossover using a likelihood-based approach:
The objective of maximum-likelihood methods is to maximize the function L(
) = Pr(H |
), i.e., the likelihood of a set of model parameters
given the sampled data (haplotypes) H = h1, h2,..., hn.
If we knew the underlying genealogy of the sampled individuals, this could be calculated directly. However, this information, in a population genetic sample of unrelated individuals, is not available. In the presence of recombination, the individuals sampled may be related by a different (correlated) phylogenetic tree at each polymorphic site along the sequence [which, together, form the ancestral recombination graph (ARG) of GRIFFITHS and MARJORAM 1997], and phylogenetic methods are unreliable under these circumstances (SCHIERUP and HEIN 2000). It is therefore useful to develop an approximation to L that is not conditional on the ARG G relating the sampled individuals, using
![]() | (1) |
LI and STEPHENS (2003) noted that
![]() | (2) |
= 4Nec is the population crossover rate. By approximating each of the terms on the right-hand side in turn, they arrived at an approximation to the likelihood known as a product of approximate conditionals (PAC) model.
Their approximation
to the conditional probability Pr(hk+1 | h1,..., hk,
) is a modification of the imperfect mosaic model of FEARNHEAD and DONNELLY (2001). Haplotype k + 1 is considered to be made up of segments copied from any or all of the preceding k haplotypes, and at marker l the haplotype being copied from is known as the "nearest neighbor." The copying process can also be imperfect, giving rise to a difference between the new haplotype and its nearest neighbor; this is considered to be a mutation. When the nearest neighbor changes between marker i and marker i + 1 this is considered to be a crossover. The sequence of nearest neighbors taken when traversing the (k + 1)th haplotype from one end to the other can be modeled as a Markov chain where the nearest neighbor at a given marker is dependent only on that at the previous marker and on the crossover probability. The likelihood given a particular value of the parameter
(which may vary across the region) is then calculated by summing over all possible mosaic structures and a maximum-likelihood estimate
can therefore be found.
It is worth noting that this approximation to the likelihood is dependent on the order in which the haplotypes are observed. This unwelcome influence can be greatly reduced by averaging the likelihood over a number of different random orders. We find 20 orders sufficient to ensure that our estimates were consistent between different runs of the program and >20 to be cumbersome in terms of computational time. All results shown in this article are based on 20 orders chosen uniformly at random except where stated otherwise.
The LS model was previously extended by HELLENTHAL (2006) to include gene conversion, assuming each gene conversion tract includes only one SNP. In essence, the emission probability for the Markov chain is modified to mimic a gene conversion. This has the benefit of keeping the computational cost the same (O(N2)), but suffers from a difficulty in distinguishing gene conversion and genotyping error. Our adaptation of the LS model is much more computationally intensive but can be applied to more densely typed SNP data.
Modeling gene conversion:
We now consider our sample to have been affected both by crossover and by gene conversion events throughout its history. This scenario is well modeled by the coalescent model with gene conversion developed by WIUF and HEIN (2000). The imperfect mosaic model can be easily adapted to allow for gene conversion, by allowing a second process to alter our nearest neighbor from x to x', with the proviso that we must eventually return to copying from x. See Figure 1 for an illustration of this. The distribution of lengths of gene conversion tracts can be approximated by a geometric distribution (HILLIKER et al. 1994), giving a constant probability of ending the tract at any particular position and returning to copy from x, irrespective of the length of the tract so far. This lack of memory property allows us to use a Markov chain implementation of the model as above.
|
The first of these assumptions allows us to separate the gene conversion and crossover processes in our model, which simplifies the calculation of the transition probabilities in our Markov chain. It is also biologically reasonable in that we would not expect that the fact that a crossover had once occurred in a particular region to influence the probability that a gene conversion occurs in any given meiosis in that region, except in that a higher rate of crossover might point to a potentially higher rate of gene conversion. Our assumption does not disallow dependence between rates of gene conversion and crossover, only between events.
In our second assumption, we specify the conditions on entering and exiting a gene conversion event. We may begin a new gene conversion tract only when we are not already in a tract, and we may end a tract only by returning to copy from the haplotype we were copying from before the tract began. Allowing tracts to be nested and/or to overlap would violate the Markovian property of our model or necessitate the addition of one or more further dimensions to the model. There is no clear biological interpretation of this assumption. It is certainly reasonable to state that any gene conversion event taking place during a particular meiosis cannot overlap with or be nested within another tract occurring in the same meiosis. The trace left in population genetic data by many independent gene conversion events over a long period of time, perhaps occurring in hotspots and likely to overlap with previous events, is less obvious. When tract lengths are short compared to SNP spacing, we do not expect this assumption to have any effect (two or more SNPs must be in a gene conversion tract for overlapping or nesting to be detectable). When tract lengths are long, we might expect to miss some overlapping gene conversion tracts or see them as crossovers, thus giving a slight underestimate of the gene conversion rate and overestimate of the crossover rate.
Our final assumption is also one of convenience. We have no information about any variation in the rates of gene conversion and crossover in the gap between any pair of adjacent typed SNPs, and we therefore assume the rate is constant. In this article we are mainly considering regions where the rates of crossover and gene conversion are considered to be uniform, but the method also allows for rate variation. When rate variation exists, in this model the rate is permitted to change instantaneously only at a typed marker, and the rate in an interval will correspond to the average rate over the gap between the SNPs.
Details of our implementation of this model are in the APPENDIX. In the next section we describe the results of applying this model to simulated data with the aim of jointly estimating
= 4Neg and
= 4Nec, where g denotes the gene conversion rate per meiosis per unit distance.
| RESULTS |
|---|
|
|
|---|
= 0.5, 1, or 2.5; crossover rate
= 0, 0.5, 1 or 2.5; and gene conversion rate
= 0, 1, or 10 (per kilobase). We focus on the data sets with
= 1 as this corresponds to the human population-scaled mutation rate of
0.7–1.0/kb (PTAK et al. 2004). The mean gene conversion tract length,
, was fixed at 500 bp (cf. FRISSE et al. 2001) for the simulations and during estimation of parameters. The number of SNPs in each data set varied with the mutation rate and, when
= 1, averaged 89 SNPs per simulated data set.
Our estimates
and
are shown in Figures 2a and 3 and summarized in Table 1. In each case we fixed the gene conversion tract length parameter at the value used to simulate data.
|
|
|
:
for data sets simulated with no gene conversion, equal rates of gene conversion and crossover (f = 1), and more gene conversion than crossover (f = 10) are shown in Figure 2a. The presence of gene conversion does not seem to have a detrimental effect on our ability to estimate
, and estimates of
are within a factor of 2 of the truth >90% of the time for each value of f. All the simulations shown used
= 1/kb, and simulations with
= 0.5 and
= 2.5/kb had similar results with slightly reduced accuracy. For comparison, estimates for
obtained using the LS model (without bias correction) on the same data sets are shown in Figure 2b. In the case where f = 10, these estimates are highly inflated, not surprisingly since this method is not intended for data sets where gene conversion is present. However, the fact that for all 1000 data sets with f = 10 this method gives an estimate for
that is more than twice the true value serves as a reminder of the effect that undetected gene conversion can have on estimates of the crossover rate.
Estimation of
:
The distributions of our estimates
for the same data sets as those above are shown in Figure 3. In the case where f = 10, our estimated
was within a factor of 2 of the value used to simulate data, for 999 of the 1000 simulated data sets. Results are summarized in Table 1.
Estimation of f:
We used our estimates of
and
for each data set to obtain an estimate
. These estimates were also generally close to the truth but suffered slightly from being a ratio of two other estimates with uncertainty in both. For data sets simulated with
= 1 our median estimates of f were 0.55, 1.45, and 9.54 for data sets simulated with true f 0, 1, and 10, respectively.
Robustness to deviation from assumed demography:
Here we use additional simulated data to evaluate the robustness of our model to deviations from the assumed neutral model. We consider our three major demographic assumptions: constant population size, panmixia (random mating), and neutral evolution. In each case, we simulated 100 data sets with 50 haplotypes, 20 kb in length, with mutation and crossover rates of 1/kb, and with various gene conversion rates.
Variation in population size:
To test our model in the presence of population size variation, we simulated 100 data sets with a bottleneck 0.15 Ne generations ago that reduced diversity to 85% of that expected without the bottleneck and a further 100 data sets with scaled exponential population growth parameter 1 (cf. MCVEAN et al. 2004). Results under these demographic variations (summarized in Table 1) do not deviate far from those obtained on data simulated under the standard model, although we see a slight increase in our underestimation of the gene conversion rate when f is high.
Population structure:
To test our model in a nonrandom mating scenario, we simulated 100 data sets, where 25 of the 50 chromosomes were sampled from each of two subpopulations corresponding to a level of population differentiation of FST
0.2 (cf. PRITCHARD and PRZEWORSKI 2001). Properties of the resulting estimates for
are shown in Table 1. Although in each simulation, the variance (not shown) of
was higher than that for a single-population data set, this did not have a big effect on the median estimate or the proportion of results within a factor of 2 of the truth. The estimates for
were similarly affected.
Selection:
As a final test we simulated 100 data sets where a positive selective sweep had just finished. These data were generated using the program SelSim (SPENCER and COOP 2004). The strength of selection was chosen to be
= 2Nes = 50, where s is the selective coefficient between homozygotes (cf. SMITH and FEARNHEAD 2005), and was applied to a single site in the center of the 20-kb region. Again we saw no major difference in our results, implying that this method is robust to low to moderate levels of selection (see Table 1).
Genotype data:
To use our method on genotype data it is necessary to first phase the data. Currently available programs to phase genotype data do not take gene conversion into account, so we investigated the effect of performing this preprocessing of the data. For 100 of the above data sets simulated with f = 10, we randomly paired the 50 haplotypes into 25 individuals and then used the program PHASE (STEPHENS et al. 2001; STEPHENS and SCHEET 2005) to rephase the data. We then obtained estimates for the gene conversion and crossover rates on these data sets, which are summarized in Table 1. We found that our method overestimates the crossover rate under these circumstances, as well as underestimating the gene conversion rate, which leads to an underestimate of f.
Comparison with other methods:
We now compare our results with two other methods: Hudson's pairwise composite-likelihood method (HUDSON 2001) and a method based on the summary statistics method of PADHUKASAHASRAM et al. (2006). For the former, we used the program maxHap, freely available from the author's website. For the latter, we adapted a program (also available on the author's website). Our implementation differs from that described in PADHUKASAHASRAM et al. (2006) only in that we do not fix the positions of the segregating sites in our simulations. Results are shown in Table 2. For these calculations, maxHap took
1.7 sec per data set, summStat between 5 and 24 hr (depending on f), and GenCo just under 1 hr on a standard desktop computer.
|
4 and 2.5 kb long, respectively, and are separated by a region of 400 kb in which no SNPs were typed. Like chromosome 4 in Drosophila (HOCHMAN 1976), the region near the X chromosome telomere is subject to a severely reduced level of crossover per physical length (AGUADE and LANGLEY 1994) compared to the genomewide average rate of 1.5 cM/Mb (NACHMAN 2002), perhaps due to regulation of double-strand-break repair mechanisms (MCKIM et al. 2002).
In addition to obtaining maximum-likelihood estimates for
and
for this data set we also constructed a likelihood surface over a grid of values of
and
, shown in Figure 4. We fixed the mean gene conversion tract length at 352 bp (HILLIKER et al. 1994) and obtained
and
/kb (f = 432). Such a strong signal is unlikely to be explained by repeat mutation or genotyping error. These estimates support the conclusion of LANGLEY et al. (2000) that while crossover is suppressed in the region, gene conversion is not. This could indicate that gene conversion and crossover are completely separate processes or, if both are initiated by the same process, that in this region of D. melanogaster there is a strong tendency for recombination events to be resolved as gene conversion rather than crossover. Whether this is the cause of or a consequence of the suppression of crossover is as yet unknown.
|
Our model allows for a different rate of gene conversion between each pair of adjacent SNPs, so it was possible to implement an expectation-maximization algorithm to determine
for each interval i. It should be noted that to reflect biological reality and maintain symmetry we would prefer to model the rate at which gene conversion initiation sites are encountered (i.e., somewhere around the middle of the tract), but due to the way the model is implemented we are in fact modeling the rate at which the left-hand sides of gene conversion tracts are encountered. It would be preferable to model a gene conversion tract extending in both directions from an initiation point (HELLENTHAL and STEPHENS 2007), but this would greatly increase the complexity and hence the computation time of our method.
Instead, we map our estimates to the gene conversion rate by assuming the initiation is in the exact center of the gene conversion tract, according to the equation
![]() | (3) |
is our maximum-likelihood estimate (MLE) of the gene conversion rate at distance x from the beginning of the observed region, and a constant rate
0 is assumed for all x < 0. When the distances between markers are long compared to the length of a gene conversion tract, or when rates change only gradually between intervals, the difference between modeling the center and modeling the end of a gene conversion tract will be negligible. However, if very narrow hotspots of gene conversion are found, it may be necessary to convert the rate of encountering the left-hand side into the rate of gene conversion tract initiation to provide a useful gene conversion rate estimation.
To examine the power and reliability of our method when recombination rates vary, we used the program msHOT (HELLENTHAL and STEPHENS 2007) to simulate 100 data sets containing a hotspot for both gene conversion and crossover. Each data set consisted of 50 haplotypes, 20 kb in length, with
= 1/kb, mean tract length 500 bp, and
= 0.5 and
= 0.05/kb (f = 10), except in a "hotspot" 2 kb wide in the center of the region, where
= 50 and
= 5/kb (f = 10).
Assuming that f was constant across the region, we obtained maximum-likelihood estimates for
and f for each simulated data set. The estimates for
and their median (sampled every 100 bp) are shown in Figure 5.
|
show high levels of variance, but on average, the position, width, and heat of the estimated hotspot are close to the values used to simulate data, and there is little bias in our estimates of
. However, estimates of
are downwardly biased, resulting in an overestimate of f (median 25.1). More work is needed to develop a method that can produce a less biased estimate of f in this variable-rate scenario, even under the restriction that f is constant. To obtain a more reliable gene conversion rate estimate on a single data set, a hotspot model would be needed, such as the reversible-jump MCMC crossover model of AUTON and MCVEAN (2007). However, accurate estimation of the strength of a crossover hotspot is problematic in general, because the regions on opposite sides of any hotspot above a certain size are in near-complete linkage equilibrium with each other (AUTON and MCVEAN 2007). Despite some evidence that f may vary between regions in humans (HELLENTHAL and STEPHENS 2006; PADHUKASAHASRAM et al. 2006), we do not consider this scenario, mainly because population genetic data are unlikely to contain sufficient information to obtain an accurate fine-scale map of gene conversion rates independently of crossover rates, but also because the existence of gene conversion hotspots within crossover hotspots implies some correlation between the two rates, and finally because the processing time needed to maximize such a likelihood would be immense.
Human chromosome 1:
Many crossover hotspots have been identified in the human genome, but of particular interest is the MS32 hotspot on chromosome 1, the existence of which is supported by strong experimental evidence, but has not left a significant imprint on LD (JEFFREYS et al. 2005). We applied our method to a 206-kb region, including this hotspot and several others. The SNP data (JEFFREYS et al. 2005) consist of 214 SNPs on 80 genotypes; we used PHASE v2.1.1 (STEPHENS et al. 2001; STEPHENS and SCHEET 2005) to infer the haplotypes and missing data and averaged our results over 20 independent random subsamples of 50 haplotypes, each taken in 10 random orders. For comparison, the crossover rate for this region, estimated using LDhat (MCVEAN et al. 2002), is shown below. LDhat was run for 108 iterations with block penalty 5, results were sampled every 104 iterations, and the first 100 samples discarded. To reflect the idea that crossover and gene conversion hotspots tend to coincide (JEFFREYS and MAY 2004), we allowed gene conversion rates to vary independently in each interval between SNPs while keeping f constant everywhere in the region. Our median estimated gene conversion rate is shown in Figure 6. Our median estimate of f for this region was 1.5. This estimate is strongly influenced by the gene conversion tract length parameter. In this study we assumed the mean tract length was 100 bp, but note that a longer mean tract length would lead to a lower
and a correspondingly lower
.
|
9. This lies centrally within the range of 4–15 suggested by JEFFREYS and MAY (2004) although as above, is dependent on the gene conversion tract length. The difference between this estimate and the one for the MS32 region could reflect variation in f between different regions of the human genome, also reported by HELLENTHAL and STEPHENS (2006). | DISCUSSION |
|---|
|
|
|---|
Our model allows multiple SNPs to be included in a gene conversion tract. SNP density varies widely between data sets, but also within data sets: for example, in the MS32 data set analyzed above there are 214 SNPs in a 206-kb region, giving an average interval of 967 bp between adjacent markers. However, 45 intervals (21%) are <100 bp long and 133 (62%) are <500 bp long. Our simulations show that when the mean tract length is 100 bp, 9% of gene conversion tracts that initiate within this region will encompass the positions of two or more markers. With 500-bp tracts, this rises to 38%.
When applied to data with little or no history of gene conversion, our model tends to overestimate the gene conversion rate. This is mainly due to the true value lying on or near the boundary of the parameter space. Simulation results (Figure 2, first column) demonstrate that including gene conversion in the model results in improved estimation of the recombination rate, suggesting that modeling errors are preferentially interpreted as gene conversion events. This implies that the use of our model could result in improvements to the estimation of the underlying crossover rate, even in the case where gene conversion is not occurring.
Estimates of uncertainty cannot be obtained directly from this method, due to the approximate likelihood used. To obtain confidence intervals it is necessary to perform a simulation study tailored to the specifics of a given data set (such as the number of haplotypes and rate of mutation).
We analyzed a region of the X chromosome in D. melanogaster and found that, under the assumption that the rates of crossover and gene conversion are constant across the region, gene conversion events occur >400 times as often as crossovers. Application of this model to additional regions of the Drosophila genome could enhance our understanding of the unusual patterns of recombination in this species.
We also analyzed a region of human chromosome 1, around the MS32 gene, a region containing several known crossover hotspots. Our analysis, allowing the rates of crossover and gene conversion to vary across the region while keeping their ratio f constant, shows that the MS32 hotspot, previously difficult to detect using population genetic methods, appears highly active under a gene conversion model. This could indicate that this hotspot is more active in gene conversion than in crossover, but it seems to contradict conclusions that the hotspot has only recently emerged (JEFFREYS et al. 2005).
The maximum-likelihood estimate
for this region should be treated with caution for two reasons: simulations show that this is a biased estimator (see above), but also
is highly dependent on the accuracy of our tract length estimate (here 100 bp). As with the method of PTAK et al. (2004), experiments with our method have shown that there is a strong correlation between the estimated values of
and
(data not shown).
For this reason, we do not attempt to estimate the gene conversion tract length from the data. By misspecifying the mean tract length parameter for data sets simulated with known tract length t0, we find that using a mean tract length estimate t* that is double that simulated (t* = 2t0) results in a slightly lowered gene conversion rate estimate, while using an estimate
causes us to overestimate the gene conversion rate by approximately a factor of 2 (data not shown). These results will vary depending on the SNP spacing and the actual gene conversion rate. Fortunately, estimates of the gene conversion tract length are available for several organisms (e.g., PALMER et al. 2003; JEFFREYS and MAY 2004; NISHANT et al. 2004) although little is yet known about whether there is heterogeneity in this length between different genomic regions.
Our implementation of this model is of order
N3 and is linear in both the number of markers and the number of orders used. It takes
30 min of processing time on a standard desktop machine to jointly calculate the constant MLEs for
and
when N = 50, L = 100,
is fixed, and 10 random orderings are used. However, when
and
are allowed to vary, with their ratio f constant, using the same data set it would take
9 hr to obtain
and
(and therefore
). This time could be reduced by improvements to the implementation and by running on a faster computer. For very large data sets the method is impracticable with present computers, but results can still be obtained by taking several subsamples of the data and taking the median result. Subsamples should be as large as possible as smaller subsamples tend to produce underestimates of the gene conversion rate and have higher variance (see supplemental Table 1 at http://www.genetics.org/supplemental/), but when averaging over several subsamples, one order is sufficient. For reasonable-sized data sets we believe the accuracy obtained by this model's usage of all the information contained in the data makes its slow speed a worthwhile penalty.
In this article we do not consider the effects of genotyping error on our results. Genotyping error can have a similar effect on the patterns of genetic diversity to that of gene conversion (PTAK et al. 2004). In densely typed SNP data, allowing for multi-SNP gene conversion tracts should reduce the impact of genotyping error.
We anticipate that this model could be adapted to detect the signature of nonallelic gene conversion in population genetic data, which would be particularly useful when considering the evolution of multigene families such as the histones (see NEI and ROONEY 2005).
As well as finding fine-scale variations in
, it would be straightforward to adapt this model to compare estimates of f for different regions or genes within a given organism, allowing production of a broad-scale map of f. This could be used to discover whether large-scale genomic features such as proximity to centromeres affect this ratio.
Whether f is constant at a fine scale remains an open question.
| APPENDIX |
|---|
|
|
|---|
,
) is approximated using its own HMM with k(k + 1) states and its own emission and transition probabilities that depend on k.
Our HMM has k(k + 1) distinct states (X = x, G = g), where 1 < x
k denotes our nearest neighbor, unless g
0, in which case we are in a gene conversion, and the current nearest neighbor is g. Starting at the leftmost marker, we calculate the likelihood of the data for each possible state (Xj, Gj) at each marker j, on the basis of the k(k + 1) distinct state probabilities at the previous marker, and the transition (t) and emission (e) probabilites (see below).
Transition probabilities:
We model the process of crossover and that of gene conversion as separate processes, happening independently of each other. The variable Xj can be modified only by crossover, while Gj is affected only by gene conversion:
![]() | (A1) |
Initial state probabilities:
Since we have no data outside our region,
for all x. The probability that we start our Markov chain inside a gene conversion tract depends on how the rate of starting a gene conversion tract compares to the rate of ending one:
![]() | (A2) |
X transition probabilities:
The first term on the right-hand side of Equation A1 is given by LI and STEPHENS (2003) as
![]() | (A3) |
x', we must have had at least one crossover between the two sites, and the probability that the new nearest neighbor was x' is 1/k. When x = x', we may have had no crossover event, or we may have crossed over one or more times but in the last event chose the same nearest neighbor.
G transition probabilities:
To calculate Pr(Gj+1 | Gj) we must consider not only the probability of beginning a gene conversion event within the current interval but also that of ending one. The rate of terminating a gene conversion tract is fixed at
, regardless of the length the tract has so far covered. This geometric model allows us to consider the ending of a gene conversion event as a process in its own right, which goes on all the time, independently of our current state, and resets the state of the system to the base (non-gene-conversion) state. This is reasonable because although in biological terms there is no event corresponding to the end of a gene conversion that can occur outside of a gene conversion tract, any such event occurring when we are already in the base state has no effect and thus has no effect on our model.
For each type of transition between gene conversion states, we describe the sequence of events that could cause that transition to occur and give the probability of undergoing this transition. In each case, any events occurring before the last reset event within the interval will have no effect on the state at the right-hand side of the interval. We therefore integrate back from the right-hand side of the interval, over possible positions of the last reset event, so that we do not have to explicitly consider how many gene conversion events may have taken place prior to the final reset event.
There are five distinct types of transition between gene conversion states:
dj)] and also no gene conversion event [probability
] or there was a reset event and no gene conversion event has taken place since then. This last term can be written as the integral (over all possible places at which the last reset event might have happened) of the probability that no further reset event occurred multiplied by the probability that no gene conversion occurred:
![]() | (A4) |
![]() | (A5) |
![]() | (A6) |
![]() | (A7) |
g':
![]() | (A8) |
These gene conversion state transition probabilities, together with the crossover state transition probabilities above, make up the state transition probabilities for our Markov chain. We write tG(g' | g, j) = Pr(Gj+1 = g' | Gj = g) and tX(x' | x, j) = Pr(Xj+1 = g' | Xj = x).
Emission probabilities:
When it is not known, we (as do LI and STEPHENS 2003) use Watterson's estimator (WATTERSON 1975) to approximate the per-site rate of mutation,
/L:
![]() | (A9) |
![]() | (A10) |
Likelihood:
Let px,g(j) be the relative probability of being in the state (x, g) at the marker j, given the data up to that marker. Then
![]() | (A11) |
![]() | (A12) |
![]() | (A13) |
10 times and take the average likelihood.
Optimization:
The basic algorithm described above is of approximate order N4, but we were able to reduce this to N3 by calculating sums of several subgroups of the px,g(j) and using these sums to facilitate the calculation of px',g'(j + 1). First note that tX(x | x', j) can take only two possible values, depending on whether x = x'. Similarly, tG can take five values as described above. So, for example, tX(x | x', j) is the same for all values of x' except x' = x, and the sum
gives the total probability of all the states where Gj = g'. To calculate px,g'(j + 1), subtracting px,g'(j) from this sum leaves a group of states at marker j for which the transition probability is the same.
These sums, calculated once for each marker, can be used many times in different combinations.
Estimating parameters:
To find maximum-likelihood estimates we use a direct search algorithm (HOOKE and JEEVES 1961) to find the
and
that maximize the average likelihood. When variable rate estimates are required, we use expectation maximization to find these rates independently in each interval.
The C++ code and windows/linux executables for our implementation of this model are available from http://www.stats.ox.ac.uk/
gay/.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| LITERATURE CITED |
|---|
|
|
|---|
AGUADE, M., and C. H. LANGLEY, 1994 Non-Neutral Evolution: Theories and Molecular Data, pp. 67–76. Chapman & Hall, London/New York.
ANDOLFATTO, P., and M. NORDBORG, 1998 The effect of gene conversion on intralocus associations. Genetics 148: 1397–1399.
AUTON, A., and G. MCVEAN, 2007 Recombination rate estimation in the presence of hotspots. Genome Res. 17: 1219–1227.
BORTS, R. H., and J. E. HABER, 1989 Length and distribution of meiotic gene conversion tracts and crossovers in Saccharomyces cerevisiae. Genetics 123: 69–80.
FEARNHEAD, P., and P. DONNELLY, 2001 Estimating recombination rates from population genetic data. Genetics 159: 1299–1318.
FEARNHEAD, P., N. G. C. SMITH, M. BARRIGAS, A. FOX and N. FRENCH, 2005 Analysis of recombination in campylobacter jejuni from MLST populaton data. J. Mol. Evol 61: 333–340.[CrossRef][Medline]
FRIDELL, R. A., and L. L. SEARLES, 1994 Evidence for a role of the Drosophila melanogaster suppressor of sable gene in the pre-mRNA splicing pathway. Mol. Cell. Biol. 14: 859–867.
FRISSE, L., R. R. HUDSON, A. BARTOSZEWICZ, J. D. WALL, J. DONFACK et al., 2001 Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium. Am. J. Hum. Genet. 69: 831–843.[CrossRef][Medline]
GRIFFITHS, R., and P. MARJORAM, 1997 An Ancestral Recombination Graph (IMA Vol. 87: Progress in Population Genetics and Human Evolution), pp. 257–270. Springer-Verlag, Berlin/Heidelberg, Germany/New York.
HELLENTHAL, G., 2006 Exploring rates and patterns of variability in gene conversion and crossover in the human genome. Ph.D. Thesis, University of Washington, Seattle.
HELLENTHAL, G., and M. STEPHENS, 2006 Insights into recombination from population genetic variation. Curr. Opin. Genet. Dev. 16: 565–572.[CrossRef][Medline]
HELLENTHAL, G., and M. STEPHENS, 2007 msHOT: modifying Hudson's ms simulator to incorporate crossover and gene conversion hotspots. Bioinformatics 23: 520–521.
HILLIKER, A. J., G. HARAUZ, A. G. REAUME, M. GRAY, S. H. CLARK et al., 1994 Meiotic gene conversion tract length distribution within the rosy locus of Drosophila melanogaster. Genetics 137: 1019–1026.[Abstract]
HOCHMAN, B., 1976 The fourth chromosome of Drosophila melanogaster, pp. 903–928 in The Genetics and Biology of Drosophila, Vol. 1b, edited by M. ASHBURNER and E. NOVITSKI. Academic Press, New York.
HOOKE, R., and T. A. JEEVES, 1961 Direct search solution of numerical and statistical problems. J. Appl. Comp. Meth. 8: 212–229.
HUDSON, R., 2002 Generating samples under a Wright-Fisher neutral model. Bioinformatics 18: 337–338.
HUDSON, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 2: 183–201.
HUDSON, R. R., 2001 Two-locus sampling distributions and their application. Genetics 159: 1805–1817.
JEFFREYS, A. J., and C. A. MAY, 2004 Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. 36: 151–156.[CrossRef][Medline]
JEFFREYS, A. J., R. NEUMANN, M. PANAYI, S. MYERS and P. DONNELLY, 2005 Human recombination hot spots hidden in regions of strong marker association. Nat. Genet. 37: 601–606.[CrossRef][Medline]
KINGMAN, J., 1982 On the genealogy of large populations. J. Appl. Probab. 19A: 27–43.[CrossRef]
KONG, A., D. F. GUDBJARTSSON, J. SAINZ, G. M. JONSDOTTIR, S. A. GUDJONSSON et al., 2002 A high-resolution recombination map of the human genome. Nat. Genet. 31: 241–247.[CrossRef][Medline]
LANGLEY, C. H., B. P. LAZZARO, W. PHILLIPS, E. HEIKKINEN and J. M. BRAVERMAN, 2000 Linkage disequilibria and the site frequency spectra in the su(s) and su(wa) regions of the Drosophila melanogaster X chromosome. Genetics 156: 1837–1852.
LI, N., and M. STEPHENS, 2003 Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics 165: 2213–2233.
MCKIM, K. S., J. K. JANG and E. A. MANHEIM, 2002 Meiotic recombination and chromosome segregation in Drosophila females. Annu. Rev. Genet. 36: 205–232.[CrossRef][Medline]
MCVEAN, G., P. AWADALLA and P. FEARNHEAD, 2002 A coalescent-based method for detecting and estimating recombination from gene sequences. Genetics 160: 1231–1241.
MCVEAN, G. A. T., S. R. MYERS, S. HUNT, P. DELOUKAS, D. R. BENTLEY et al., 2004 The fine-scale structure of recombination rate variation in the human genome. Science 304: 581–584.
NACHMAN, M. W., 2002 Variation in recombination rate across the genome: evidence and implications. Curr. Opin. Genet. Dev. 12: 657–663.[CrossRef][Medline]
NEI, M., and A. ROONEY, 2005 Concerted and birth-and-death evolution of multigene families. Annu. Rev. Genet. 39: 121–152.[CrossRef][Medline]
NISHANT, K. T., H. RAVISHANKAR and M. R. S. RAO, 2004 Characterization of a mouse recombination hot spot locus encoding a novel non-protein-coding RNA. Mol. Cell. Biol. 24: 5620–5634.
PADHUKASAHASRAM, B., J. D. WALL, P. MARJORAM and M. NORDBORG, 2006 Estimating recombination rates from single-nucleotide polymorphisms using summary statistics. Genetics 174: 1517–1528.
PALMER, S., E. SCHILDKRAUT, R. LAZARIN, J. NGUYEN and J. A. NICKOLOFF, 2003 Gene conversion tract lengths in Saccharomyces cerevisiae can be extremely short and highly directional. Nucleic Acids Res. 31: 1164–1173.