The Coalescent With Gene Conversion
 ^{*}Department of Statistics, University of Oxford, Oxford OX1 3TG, England
 ^{†}Institute of Biological Sciences, University of Aarhus, 8000 Aarhus C, Denmark
 Corresponding author: Carsten Wiuf, Department of Statistics, University of Oxford, 1 S. Parks Rd., Oxford, OX1 3TG, England. Email: wiuf{at}stats.ox.ac.uk
Abstract
In this article we develop a coalescent model with intralocus gene conversion. The distribution of the tract length is geometric in concordance with results published in the literature. We derive a simulation scheme and deduce a number of analytical results for this coalescent with gene conversion. We compare patterns of variability in samples simulated according to the coalescent with recombination with similar patterns simulated according to the coalescent with gene conversion alone. Further, an expression for the expected number of topology shifts in a sample of presentday sequences caused by gene conversion events is derived.
UNDERSTANDING patterns of variation in DNA sequences requires insight into at least three processes. The first is the genealogical process describing the ancestral history of a sample of alleles from a single locus. Kingman's coalescent process (Kingman 1982) is one such process. The second is the substitution process that describes how alleles mutate over time. For neutral loci the genealogical process and the substitution process can be separated such that the substitution process is superimposed on the genealogical process (Tavaré 1984), whereas for loci subject to selection the two processes cannot be separated. Finally, the last process describes the linkage relations between adjacent loci. It governs how and when adjacent loci are linked or split on different ancestral chromosomes and requires essentially an understanding of mechanisms that operate at the sequence level.
Central to many models describing the mechanisms at the sequence level is the Holliday junction (Holliday 1964). When the Holliday junction is resolved the result can be of two types: (1) a gene conversion with accompanying exchange of flanking regions (gene conversion with recombination); or (2) a gene conversion alone (Carpenter 1984; Stahl 1994). In this case a short tract of nucleotides is exchanged between two strands. We refer to the latter case as a gene conversion and the former as a recombination.
It is the aim of this article to develop a coalescent model with intralocus gene conversion alone and investigate the effects of gene conversion on various statistics of interest in the analysis of DNA sequences. We compare patterns of variability in samples simulated according to the coalescent with recombination with similar patterns simulated according to the coalescent with gene conversion alone. In Hilliker et al. (1994) the distribution of the gene conversion tract in Drosophila melanogaster is estimated. They find that a geometric distribution is a good approximation of the empirical distribution and accordingly the model is built on this observation.
In a subsequent article (Wiuf 2000) it is shown that to some extent the results given in this article apply generally without assuming any specific form of the distribution of the transferred chunk. The benefit of the geometric distribution is that more analytical results can be proven and that a specific choice of distribution (here, geometric) allows for investigation by simulation.
Two nucleotides separated by a large distance produce recombinants due to recombination events only. The (short) finite length of a gene conversion tract makes the contribution from gene conversion events insignificant. Thus, the patterns observed in a sample of sequences are the result of recombination events in the history of the sample and of the substitution process imposed. However, over small distances recombinants can be produced by recombination events as well as by gene conversion events. The observed patterns in samples of sequence data are thus the results of three different processes: the gene conversion process and the recombination process, and as well as the substitution process.
A recombination process was incorporated into Kingman's coalescent model (Kingman 1982; describing the genealogical process of a sample of sequences taken from a population) by Hudson (1983) and has subsequently been investigated by a number of authors (Hudson and Kaplan 1985; Griffiths and Marjoram 1997; and Wiuf and Hein 1997 among others).
We expect that the patterns of intralocus variability in a model with gene conversion are different from those in a model with recombinants produced by recombination (as defined above) only. The effect of recombination in the coalescent model is to break up the material ancestral to a sequence in two parts and distribute the parts on two different ancestors, one carrying the ancestral material to the left of the recombination break point, S, the other carrying the material to the right of S (Figure 1). In contrast, gene conversion, as defined here, breaks the material ancestral to a sequence in two points, S and T, and distributes the material to the left of S and that to the right of T on one ancestor, and the part in between S and T is on another ancestor (Figure 1).
The effect of a recombination event can easily be obtained in a model of intralocus gene conversion. If one end point falls outside the observed sequence the effect will be similar to that of a recombination event, though the probabilities of the two events might be different from each other. These probabilities depend on the rates of gene conversion and of recombination within the observed sequence. Further, they depend on the number of nucleotides observed; the higher the number the less is the chance of a gene conversion with only one end point within the observed sequence.
Similarly, the effect of a gene conversion event can be obtained by two recombination events and one coalescent event (Figure 2). Again, the probabilities of obtaining the events might be very different. Especially, the latter series of events (given that the first recombination event occurs) will depend strongly on the current sample size; the higher the sample size, the lower the chance that the two recombined sequences will coalesce before coalescing with any other sequence in the sample.
Finally, the length distribution of the transferred chunk in the gene conversion events determines the distributions of the end points. Thus, the end points will only in rare cases be uniformly distributed along the observed sequence. Again, this is in striking discrepancy with the coalescent model with recombination where the breakpoint is uniformly distributed along the entire sequence (standard assumption; see, e.g., Griffiths and Marjoram 1997).
Figure 3 shows an example of a genealogy of a sample of size 2 with intralocus gene conversion.
This article is organized into sections. The first two sections describe the coalescent model with gene conversion. In the third section a simulation scheme is developed similar to that of Griffiths and Marjoram (1997) to simulate from the coalescent with recombination; in the fourth section a number of analytical and simulated results are collected. Results obtained in the coalescent with recombination apply on several occasions to the gene conversion model. Finally, the last section is a discussion of extensions to more realistic models of gene conversion and recombination. When proofs are omitted they can be derived from the general results in Wiuf (2000).
THE MODEL
Consider a diploid population model with effective constant (diploid) size 2N. A new generation is obtained from the present generation by sampling 2N sequences with replacement, forming random pairs of sequences, and letting a short tract of nucleotides be transferred between the sequences. The mode of transfer is described below.
Sequences are of length L + 1 nucleotides so that there are L gaps between nucleotides. Consider one such sequence. Assume that in any generation the probability of a gene conversion initiating between any two positions in the sequence is g, independently of whether gene conversions initiate elsewhere along or outside the observed sequence. Where along the sequence a gene conversion initiates is uniformly distributed among all sites. If gL is small the chance of more than one gene conversion event in one generation is negligible.
The transferred chunk of nucleotides originates from a randomly chosen sequence in the population. In Wiuf (2000), the length, Z, of the converted chunk can have an arbitrary distribution. Here we consider the specific case where Z follows a geometric distribution with parameter q, i.e.,
Let Q = qL. If q is small and L is large the distribution of z = Z/L can be approximated with an exponential distribution with parameter Q, i.e., z ~ Exp(Q) (where ~ means “is distributed like”). Further, let G = 4gNL be the expected number of gene conversion events per sequence per 4N generations and assume g is small and N large. The definition of G is analogous to the parametrization adopted for the coalescence with recombination (Hudson 1983).
The genealogical process of a sample of n presentday sequences is studied: time starts at the present and increases, going backward in time. Under the above assumptions we find that the waiting time, W_{C} (in units of 2N generations), until a sequence has been created by a gene conversion event that initiates within the sequence is approximately exponentially distributed with parameter G/2, i.e., W_{C} ~ Exp(G/2), if N is large and gL is small. The rate of gene conversions initiating outside the sequence but ending within the observed sequence must also be taken into account. This rate depends on the distribution of Z and is discussed in the subsequent section. The rate of coalescence is n(n − 1)/2 if there are n sequences in a sample (Kingman 1982) assuming a constant population size.
We note that Q as well as G scale linearly in L, the observed number of gaps between nucleotides. That is, both Q and G are doubled if the observed number of gaps, L, is doubled. The expected length of the transferred chunk in a gene conversion is 1/q (measured in number of nucleotides). Therefore, the parameter Q is interpreted as the sequence length measured in units of expected length of the transferred chunk. Similarly, the parameter G is sequence length measured in expected number of gene conversion events per sequence per 4N generations. We note that Q/G = q/(4gN) is independent of L.
EFFECTS OF GENE CONVERSION
In this section we discuss the effects of a single gene conversion event on a sequence and find the waiting time until the sequence has been created by a gene conversion event.
Denote by ζ, σ, and τ the variables Z/L, S/L, and T/L, respectively, where Z, S, and T are as defined in the previous section. The variables σ and τ take values in {1/L, 2/L, …, 1 − 1/L} ⊆ (0, 1) and as L becomes large the distributions of σ and τ converge to continuous distributions on (0, 1). The representation of a sequence as the continuous interval (0, 1) is commonly used (see, e.g., Griffiths and Marjoram 1997).
Let C denote the event that a gene conversion happens in a given sequence in a given generation. Further, let C_{1} denote the event that the gene conversion falls partly outside the L + 1 observed nucleotides, that is, S + Z > L. Similarly, let C_{2} denote that both end points are within the sequence length (lower index i indicates that i of the two end points of the converted chunk are within the L nucleotides). We find
The end points of the tract are affected by the type, C_{1} or C_{2}, of the event. The density, f_{σ,τ}(s, tC_{2}), of σ and τ conditional on C_{2} is given by
The density of (σ, τ) conditional on C_{2} relates to that of ~ conditional on C_{2} because ζ = τ − σ. We find
Finally, the density of σ conditional on C_{1} (that is, σ + ζ falls outside the observed nucleotides, σ + ζ > 1) is
In Figure 4 we illustrate the above different distributions with q obtained from D. melanogaster data.
Gene conversions initiating outside the L + 1 nucleotides will have a chance of terminating within the L + 1 observed nucleotides. Assume that the entire chromosome potentially consists of an infinite array of sequences of length L plus the observed one of length L + 1 and that the gene conversion model described above is valid for the entire chromosome.
For most organisms, there is an upper bound to the length of a gene conversion tract. This means that only a minor (finite) extension of the observed sequence should be taken into consideration and not the entire chromosome. In the model discussed here, tract lengths can have an arbitrary size, but it can be shown that the probability of a tract initiating at least nL nucleotides away from the observed sequence and ending within the sequence, given a gene conversion event happens that ends in the observed sequence, is of order exp(−nQ). Basically only tracts initiating close to the observed sequence affect the sequence history in concordance with what is expected.
For each gene conversion tract initiating within the observed sequence and ending outside the observed sequence there is a similar tract initiating outside (to the left of the observed sequence), ending within the observed sequence. Because the chance of gene conversion events is the same along the entire chromosome, the two events have the same probability of happening.
Let C_{o}(o for outside) denote the event that a gene conversion initiates outside the observed sequence, and terminates within. The above informal argument shows that
Further, conditional on C_{o}, the termination point t = T/L has density
The events and C_{o} and C_{1} are independent. Thus, the waiting time W_{C}_{1∪Co} = min(W_{C}_{1}, W_{C}_{o}) until an event of either type C_{1} or type C_{o} is approximately exponentially distributed with parameter G(1 − K(Q)), and where the initiation/termination point, δ, happens, is distributed with density
Summing up, we find that the waiting time, W, until a sequence is created by a gene conversion event is exponentially distributed
Given a gene conversion occurs, the probability that both end points of the inserted chunk are within the observed sequence is
The distribution in (10) is interesting for several reasons. First, the distribution of the tract length, ~, depending on Q, affects the number of events and the times between events. The parameter G(2 − K(Q))/2 varies from G/2 when Q = ∞ to G when Q = 0. Second, for all Q > 0 the number of events is higher than the number of events in a recombination model with a similar parameter (i.e., r = g).
The two extreme values, Q = 0 and Q = ∞, are of interest. If Q = 0 we find that W ~ Exp(G), and a gene conversion will leave just one breakpoint within the sequence. Where the break occurs will be uniformly distributed along the sequence. Thus, the coalescent with gene conversion resembles the coalescent with recombination, but the rate of gene conversions is twice the rate of recombinations (assuming the probability of a recombination between any two nucleotides is g).
If Q = ∞ we find that W ~ Exp(G/2) and that both end points of a gene conversion will be within the sequence. As Q goes toward infinity, σ will be uniformly distributed along the sequence, and τ ≈ σ, that is, ζ ≈ 0. In the limit Q = ∞, a gene conversion will just leave a spot on the sequence, leaving the sequence the way it is. Thus, the coalescent with gene conversion will be indistinguishable from the pure coalescent process. Each time a conversion happens, it cannot be seen, and only coalescent events affect the history of a sample.
MODEL SIMULATIONS
Assume n sequences are sampled from a presentday population. The genealogy of the sample subject to gene conversion as described in the previous section can be simulated according to the scheme in Figure 5.
The algorithm is formulated as a birth and death process with constant death rate, G(2 − K((Q))/2, and terminates when there is only one ancestral sequence to the sample. Griffiths and Marjoram (1997) developed a similar birth and death process to handle to coalescent with recombination. Both algorithms return a genealogy in which the “true” genealogy of the sample is embedded; we keep track only of the number of sequences and the end points of gene conversions, not which parts of the sequences that are ancestral to the sample. Gene conversions can happen outside the ancestral material to a sequence, thereby creating an “empty” sequence, i.e., a sequence with no material ancestral to the sample (cf. Figure 2).
Consider a fixed point, χ, along the sequences. The tree, T(χ), that describes the sequences at χ can be found by starting at the present sequences and going back in time. When a gene conversion node is encountered, the branch that describes the segment containing χ is followed. Assume that the gene conversion leaves only one end point, s, in the sequence. If s < ω, then the branch describing the fate of the left part of the sequences is to be followed and vice versa and similarly, if the gene conversion leaves two end points. The tree T(χ) is distributed like the coalescent process since one point cannot be subject to gene conversion.
Mutations can be superimposed on the genealogy if all alleles are selectively neutral. Assuming mutations arrive according to a Poisson process, the number of mutations on the total genealogy is Poisson with parameter θB/2, where B is the total branch length of the entire genealogy, θ = 4Nu is the scaled mutation rate, and u is the probability of a mutation in a sequence per generation.
The coalescent with gene conversion can be combined with the coalescent with recombination. The genealogy of a sample of sequences is constructed similarly to the scheme above, waiting for three different events to occur: coalescence, recombinations, and gene conversions.
RESULTS
As noted previously, Q_{0} = Q/G = q/(4gN) is independent of L. The parameters (Q_{0}, G) are more natural parameters than (Q, G); G represents the length of the sequence in units of gene conversion events per sequence per 4N generations and Q_{0} is a parameter dependent on the effective size of the population and parameters intrinsic to the biological system. In contrast (Q, G) are both parameters representing sequence length, but in different units.
In the following, results are given in terms of (Q_{0}, G) to facilitate comparisons between samples of different lengths but with the same Q_{0}. If we consider Q_{0} fixed and G a variable, results (and plots) can be converted between models with different g's but equal Q_{0} through a linear scaling of sequence length.
If sequence length is measured in units of expected
number of events per sequence per 4N generations, we find that the scaled length ζ′ = ζG of a tract is
In the following, we consider a sample of size n taken from the population at the present time. First, we focus on correlations between trees. Consider two positions, χ_{1} and χ_{2}, in a sample of n sequences at distance G. We are interested in the trees, T_{n}(χ_{1}) and T_{n}(χ_{2}), that describe the sequences at χ_{1} and χ_{2}. Only events of type C_{1} and C_{o} in between positions χ_{1} and χ_{2} affect the relation between the two trees. Events of type C_{2} in between the positions cannot be traced. The rate, r_{G}/2, by which events of type C_{1} and C_{o} happen is, per sequence,
For small values of G we find r_{G} ≈ 2G. This is of particular interest because the rate at which recombinants are produced by recombination is only G (assuming r = g), and this is half the rate at which recombinants are produced by gene conversion events. Thus, gene conversion events might contribute significantly to linkage disequilibrium over small distances (see also Wiuf 1999). For example, according to Andolfatto and Nordborg (1997), g = 3r in D. melanogaster and the rate of gene conversions is thus sixfold that of recombination over small distances.
Applying results in Griffiths (1991), we find for n = 2,
The probability P(T_{2}(χ_{1}) = T_{2}(χ_{2})) for n = 2 is plotted in Figure 6. The tree height, T_{n} = max{T_{n}(χ)χ}, until all positions in the sample have found a most recent common ancestor is plotted in Figure 7 for n = 2. For Q_{0} = 0.05 and G = 5 we find Q = 0.25, which corresponds to ~100 nucleotides in D. melanogaster (see Figures 4 and 6). In this case the chance that two positions separated at distance 5 share a most recent common ancestor (MRCA) is ~0.10 if sample size is 2.
Also of interest are the numbers of different types of gene conversion events. We discuss two such numbers. The first is the number of gene conversion events affecting the history of the sample. Denote this number by G_{n}. The second is the number of gene conversion end points that make the topology change. Call this number S_{n}. This number relates to the possibilities of detecting gene conversion events in the sample history. Under an infinitesites mutation model (Watterson 1975) with a mutation rate very high compared to that of gene conversions, G, all shifts in topologies can be detected from sequence polymorphisms (Hudson and Kaplan 1985). Also, these are the only events that can be detected (under the infinitesites model).
It is not possible to find the expectation of G_{n} analytically. The reason for this is the following. The expectation of G_{n} is not linear in G because a gene conversion with both end points in ancestral material is only counted once. Knowing that a gene conversion has one end point within ancestral material in a small sequence interval does not reveal if the other end point also is within ancestral material. Thus, information about gene conversion end points in a small sequence interval does not provide information whether the points count in G_{n} (Figure 8). In the coalescent with recombination, the expected number of recombination events can easily be found because the number is linear in R (Hudson and Kaplan 1985). The expected value of G_{n} is simulated for different values of G and Q_{0} and plotted in Figure 9. Asymptotically, the expectation of G_{n} is linear.
The expected number of events causing a topology shift is given by
Consider position 0 in the sequences and let B_{n} be the total length of the coalescent tree in position 0. Wiuf and Hein (1999) consider the coalescent with recombination and discuss the sequence length, Λ_{n}, conditional on B_{n}, from position 0 until the first recombination point affecting the history of the sample. They find that
Here, we give the analogous result for the coalescent model with gene conversion. The situation differs in the sense that we wait for either a gene conversion initiating to the right of position 0 [analogous to (18)] or a gene conversion initiating to the left of position 0 but ending to the right of 0. The proof is given in the appendix. The distribution of the length, Λ_{n}, until the first point affecting the history of the sample, conditional on B_{n}, is
DISCUSSION
We developed a coalescent model with gene conversion assuming a diploid population of constant size, N. It takes two parameters as input (along with the sample size). The first is the product G = 4NLg, where g is the probability that a gene conversion tract initiates in a fixed position, and the second parameter is Q = qL, where q is the probability that the next nucleotide is within the tract given that the former is within the tract. An easy simulation scheme was developed. This scheme could be modified, in accordance with Griffiths and Tavaré (1994), to allow for fluctuation in the population size with time.
We derived a number of results related to the correlation of trees in different positions and to the probability that two given positions share a common ancestor. These tend to be highly different from similar quantities obtained in the coalescent with recombination. The covariance between trees relating two distinct positions does not tend to zero with increasing distance between the positions, but is bounded by a positive constant. Thus, it is expected that the trees in the two distinct positions are more likely to share the same topology only in a model of gene conversion than in a model of recombination with similar rate. Figure 10 confirms this. As Q_{0} = Q/G increases, the probability, π, of common topology in two distinct positions decreases and for fixed value of Q_{0}, π converges (as function of distance) to a level distinct from 0 and 1. Low values of Q_{0} have a similar effect to that of recombination.
We should note in passing that given a gene conversion end point is in position χ, the probability, p, of detecting that the gene conversion has occurred from incompatibilities in the sequence alignment is very low and slowly increases with sample size. Hudson and Kaplan (1985) and C. Wiuf, T. Christensen and J. Hein (unpublished results) found
It is of high interest to be able to distinguish mechanisms operating on the molecular level in patterns of variability obtained in a sample of sequences. An indicator of gene conversion could be the following. Define an informative site to be a site where at least two different nucleotides are present in at least two sequences each. Further, say a pair of sites is compatible if there exists a topology explaining the sites with the minimum possible number of substitution events. Now, assume three informative sites are given. If the first and last sites are compatible, and the two remaining pairs involving the middle site are both incompatible, we would take this as evidence of a gene conversion event. In a pure recombination model, given the middle site is incompatible with both other sites, we would expect that the first and the last sites are also incompatible. We have simulated how often under ideal circumstances this is likely to happen. It is assumed that the topology can be accurately constructed for each column in the sequence alignment. This is, for example, the case in an infinitesites model with very high mutation rate. Table 1 shows simulation results for various values of Q_{0} and G.
Denote the positions 1, 2, and 3. Positions 1 and 3 are fixed whereas 3 is uniformly distributed between 1 and 3. The findings in the table can be explained as follows. Assume Q_{0} = 0. If 1 and 2 do not share topology and similarly for 2 and 3, this requires at least two recombination events. As the distance between 1 and 3 increases, the chance of more recombination events before the MRCAs to positions 1 and 3 increases; thus the chance that 1 and 3 share the same topology decreases.
If Q_{0} > 0 the pattern changes. When the distance is small compared to 1/Q_{0} (expected tract length), most events will be of type C_{1} and C_{o}. A similar pattern to that for Q_{0} = 0 is thus expected. As the distance increases, the rate by which 1 and 3 are separated becomes almost constant (r_{G} = 2{1 − exp(−Q_{0}G)}/Q_{0}) and if (1, 2) and (2, 3) do not share topologies it is most likely caused by a gene conversion event with both end points within 1 and 3. Thus we will accept more cases where 1 and 3 share topology.
Now fix the distance between the positions. As Q_{0} increases, the number of C_{2} events increases relative to C_{1} and C_{o} events (Equations 11 and 12) and again if (1, 2) and (2, 3) do not share topologies it is most likely caused by a gene conversion event that removes a region around position 2. Again, we expect more cases where 1 and 3 share topology.
The pattern observed in Table 1 is obtained under ideal circumstances. In practice, the topology relating a set of sequences in a given position can be reconstructed from the nucleotide pattern in that position in rare cases only. Under an infinitesites model with low mutation rate, the chance that 1 and 3 can be explained by the same topology is higher than the value given in Table 1. Similarly, recurrent substitutions have the same effect, but in both cases the overall pattern in Table 1 is retained. These results suggest that in a pure recombination model we will see many more pairs of incompatible sites than in a model with pure gene conversion (assuming the rate of recombination is similar to the rate of gene conversion). On the other hand we expect many more topology shifts under a gene conversion model, so that we must return, moving along the sequences, to previously visited topologies more often than under a recombination model. Also, the results in Table 1 imply that recombination can be distinguished from gene conversion by the spatial arrangement of incompatible pairs, simply by estimating the probability in Table 1 from data. If recombination is not present we will expect this probability to be significantly different from 0.
It would be advantageous to build a model of gene conversions and recombination more directly on a molecular model of these phenomena. This will be pursued in greater depth elsewhere, but the basic issues involved are shortly sketched here.
Central to most models of gene conversion/recombination is the Holliday junction (Holliday 1964). Although the original Holliday model has since been superceded by more complicated models, such as the single strand invasion model (Meselson and Radding 1975) and the doublechainbreak repair model (Resnick 1976), the Holliday model would be still be the natural starting point for modeling gene conversion from a population genetical viewpoint. Issues in the Holliday model would also be central in the more complicated models. In the Holliday model the following three problems would be of central importance.
The movement of the Holliday junction and its consequences. If the movement is controlled by a random walk that is stopped at a fixed time, the length of the gene conversion would be close to a symmetric binominal distribution moved so its mean is zero. Since the movement of the Holliday junction along nonidentical sequences will create heteroduplexes that are energetically unfavorable to homoduplexes, it could be expected that there would be a sequencedependent centralizing drift in this random walk. Since the energetic costs of mismatches are approximately known, this distribution could be calculated. The length of gene conversions would then be sequence dependent. This would be a complication when incorporating it into the coalescent that takes a view from the present toward the past.
How is the heteroduplex corrected? Two extremes can be imagined. The heteroduplex is resolved when the DNA is duplicated or one single strand is made the master copy and the other single strand is corrected so it matches to this master strand. From the point of view of redistributing ancestral material, this is simple as the region within the gene conversion will choose one ancestor. The other possibility, that mismatches are corrected independently one by one, has more drastic consequences for the distribution of ancestral material as each correction will be independent choices of parents. Within the region of the gene conversion, different positions can have different ancestors. This would also have drastic consequences for the phylogenetic relationships among the sequences as one moves along the sequences, as even arbitrarily close positions could be related by different phylogenies.
The Holliday structure can be resolved in two ways, one leading to a gene conversion, but no recombination, and the other leading to a gene conversion and a recombination. What are the probabilities of gene conversion and recombination? Few, if any, biological systems have gene conversion but no recombination. In modeling the consequences, it would be natural to make this probability a single parameter in the model.
Later models, especially Meselson and Radding (1975) and the double exchange model (Resnick 1976), are more complicated and will slightly change the picture. From a population genetical viewpoint, all that matters in these different models is how the ancestral material is redistributed when viewed from the present toward the past. The MeselsonRadding model would have a very short region with a onedirectional gene conversion added relative to the pure Holliday model, due to the strand invasion. Since this region is expected to be short relative to the length of the migration of the Holliday structure, the MeselsonRadding model is expected to be very similar to the pure Holliday model. The doublechainbreak repair model would have two strand invasions and also two Holliday junctions that could undergo a random walk relative to each other. The doublechainbreak repair model probably has greater consequences for the distribution of ancestral material than the MeselsonRadding model, but that remains to be explored.
Since it is not well known how different recombination mechanisms are phylogenetically distributed, it would be interesting to know if it was ever possible to distinguish the underlying mechanism from pure population data. It would here be natural to formulate what was expected in terms of different sets of incompatibilities. For instance, if three informative sites were given, and the first and last sites were compatible, but the two remaining pairs involving the middle site were incompatible, this would be taken as an indicator of a gene conversion. If more informative sites were found close to the middle position, these would be expected to have different kinds of compatibilities dependent on which kind of repair mechanism was operating on the heteroduplex.
Whether or not different recombination/gene conversion models are distinguishable by population sequence data depends on the frequency of configurations of segregating nucleotides in the population that would result in different products when recombined under different models. A functional geneticist would design the necessary configurations of nucleotides and set up the adequate crosses. In population genetics this would have to occur by chance. It is unlikely that population sequence data can compete with designed experiments in this respect, but this does not imply that different mechanisms of recombination will not have consequences for the expected patterns of variation in population sequence data; such data are becoming available from many organisms, where molecular genetical experiments have not yet been performed.
Acknowledgments
J. P. Hjorth is thanked for helpful discussions on what gene conversion is, M. Nordborg for commenting on an early version of the manuscript, and M. Schierup for various suggestions that improved the manuscript considerably. We thank A. Mikkelsen for implementing the model and performing the simulations used in the article. C.W. was supported by grant Biotechnology and Biological Sciences Research Council 43/MMI09788 and by the Carlsberg Foundation, Denmark. Part of this work was carried out while the authors visited the Isaac Newton Institute, University of Cambridge.
APPENDIX: PROOF OF FORMULAS (19) AND (20)
Wiuf and Hein (1999) consider the coalescent with recombination and discuss the sequence length, Λ_{n}, conditional on the total branch length, B_{n}, from position 0 until the first recombination point affecting the history of the sample. They find that
Remember [see (10), (11), and (12)]
The rate, G(1 − K(Q_{0}G))/2, of type C_{o} events converges to 1/(2Q_{0}) for G → ∞. Conditional on B_{n} = b, the number, G_{o}, of C_{o} events in the sequences ancestral to position 0 is thus Poisson distributed with parameter b/(2Q_{0}),
Finally, combining (A2), (A3), and (A5) we deduce that Λ_{n} = min (Λ_{o}, Λ_{1,2}) has distribution
Footnotes

Communicating editor: A. G. Clark
 Received August 4, 1999.
 Accepted January 18, 2000.
 Copyright © 2000 by the Genetics Society of America