Abstract
In this article we develop a coalescent model with intralocus gene conversion. The distribution of the tract length is geometric in concordance with results published in the literature. We derive a simulation scheme and deduce a number of analytical results for this coalescent with gene conversion. We compare patterns of variability in samples simulated according to the coalescent with recombination with similar patterns simulated according to the coalescent with gene conversion alone. Further, an expression for the expected number of topology shifts in a sample of present-day sequences caused by gene conversion events is derived.
UNDERSTANDING patterns of variation in DNA sequences requires insight into at least three processes. The first is the genealogical process describing the ancestral history of a sample of alleles from a single locus. Kingman's coalescent process (Kingman 1982) is one such process. The second is the substitution process that describes how alleles mutate over time. For neutral loci the genealogical process and the substitution process can be separated such that the substitution process is superimposed on the genealogical process (Tavaré 1984), whereas for loci subject to selection the two processes cannot be separated. Finally, the last process describes the linkage relations between adjacent loci. It governs how and when adjacent loci are linked or split on different ancestral chromosomes and requires essentially an understanding of mechanisms that operate at the sequence level.
Central to many models describing the mechanisms at the sequence level is the Holliday junction (Holliday 1964). When the Holliday junction is resolved the result can be of two types: (1) a gene conversion with accompanying exchange of flanking regions (gene conversion with recombination); or (2) a gene conversion alone (Carpenter 1984; Stahl 1994). In this case a short tract of nucleotides is exchanged between two strands. We refer to the latter case as a gene conversion and the former as a recombination.
It is the aim of this article to develop a coalescent model with intralocus gene conversion alone and investigate the effects of gene conversion on various statistics of interest in the analysis of DNA sequences. We compare patterns of variability in samples simulated according to the coalescent with recombination with similar patterns simulated according to the coalescent with gene conversion alone. In Hilliker et al. (1994) the distribution of the gene conversion tract in Drosophila melanogaster is estimated. They find that a geometric distribution is a good approximation of the empirical distribution and accordingly the model is built on this observation.
In a subsequent article (Wiuf 2000) it is shown that to some extent the results given in this article apply generally without assuming any specific form of the distribution of the transferred chunk. The benefit of the geometric distribution is that more analytical results can be proven and that a specific choice of distribution (here, geometric) allows for investigation by simulation.
Two nucleotides separated by a large distance produce recombinants due to recombination events only. The (short) finite length of a gene conversion tract makes the contribution from gene conversion events insignificant. Thus, the patterns observed in a sample of sequences are the result of recombination events in the history of the sample and of the substitution process imposed. However, over small distances recombinants can be produced by recombination events as well as by gene conversion events. The observed patterns in samples of sequence data are thus the results of three different processes: the gene conversion process and the recombination process, and as well as the substitution process.
A recombination process was incorporated into Kingman's coalescent model (Kingman 1982; describing the genealogical process of a sample of sequences taken from a population) by Hudson (1983) and has subsequently been investigated by a number of authors (Hudson and Kaplan 1985; Griffiths and Marjoram 1997; and Wiuf and Hein 1997 among others).
We expect that the patterns of intralocus variability in a model with gene conversion are different from those in a model with recombinants produced by recombination (as defined above) only. The effect of recombination in the coalescent model is to break up the material ancestral to a sequence in two parts and distribute the parts on two different ancestors, one carrying the ancestral material to the left of the recombination break point, S, the other carrying the material to the right of S (Figure 1). In contrast, gene conversion, as defined here, breaks the material ancestral to a sequence in two points, S and T, and distributes the material to the left of S and that to the right of T on one ancestor, and the part in between S and T is on another ancestor (Figure 1).
(A) Recombination and (B) gene conversion. If the resolution of the Holliday junction results in an exchange of flanking regions we have (what here is called) a recombination event. In the top, two of the four strands involved in the Holliday junction are shown. In the bottom, time starts at present (the second generation) and goes backward. Starting with either of the two strands/sequences in the second generation, the effect of the recombination event is to create two ancestors to the sequence; the positions to the left of position S share one ancestor and the positions to the right of S share another ancestor. If the Holliday junction is resolved without an exchange of flanking regions we have a gene conversion event (without recombination). In the top, two of the four strands involved in the Holliday junction are shown. In the bottom, time starts at present (the second generation) and goes backward. Starting with either of the two strands/sequences in the second generation, the effect of the gene conversion event is to create two ancestors to the sequence; the positions to the right of T and to the left of S share one ancestor and the position between S and T shares another ancestor.
The effect of a recombination event can easily be obtained in a model of intralocus gene conversion. If one end point falls outside the observed sequence the effect will be similar to that of a recombination event, though the probabilities of the two events might be different from each other. These probabilities depend on the rates of gene conversion and of recombination within the observed sequence. Further, they depend on the number of nucleotides observed; the higher the number the less is the chance of a gene conversion with only one end point within the observed sequence.
Similarly, the effect of a gene conversion event can be obtained by two recombination events and one coalescent event (Figure 2). Again, the probabilities of obtaining the events might be very different. Especially, the latter series of events (given that the first recombination event occurs) will depend strongly on the current sample size; the higher the sample size, the lower the chance that the two recombined sequences will coalesce before coalescing with any other sequence in the sample.
Finally, the length distribution of the transferred chunk in the gene conversion events determines the distributions of the end points. Thus, the end points will only in rare cases be uniformly distributed along the observed sequence. Again, this is in striking discrepancy with the coalescent model with recombination where the breakpoint is uniformly distributed along the entire sequence (standard assumption; see, e.g., Griffiths and Marjoram 1997).
Figure 3 shows an example of a genealogy of a sample of size 2 with intralocus gene conversion.
This article is organized into sections. The first two sections describe the coalescent model with gene conversion. In the third section a simulation scheme is developed similar to that of Griffiths and Marjoram (1997) to simulate from the coalescent with recombination; in the fourth section a number of analytical and simulated results are collected. Results obtained in the coalescent with recombination apply on several occasions to the gene conversion model. Finally, the last section is a discussion of extensions to more realistic models of gene conversion and recombination. When proofs are omitted they can be derived from the general results in Wiuf (2000).
Gene conversion vs. recombination. The topological effect of one gene conversion event (left) can be obtained in a model with recombination only by two recombination events accompanied by a coalescence event. Assuming the rate of gene conversion events is the same as the rate of recombination events, we find that the chance of two recombination events followed by a coalescence event is far smaller than that of a gene conversion event. The former also strongly depends on the number ancestral sequences present. Here, for example, given that a coalescence event occurs it will result only in the desired distribution of ancestral material in one out of three possible mergings.
Example of a sample history with intralocus gene conversion. We trace the history of a sample of size 2 back in time until all positions on the two sequences have found a most recent common ancestor (MRCA). We adopt the following graphical notation: thin lines represent material nonancestral to the present-day sample; thick lines, material ancestral to the sample; and crosses, material that has found a MRCA. The first three events, 1, 2, and 3, affecting the history of the sample are all gene conversion events spreading the material ancestral to the sample on five sequences. From the point of the topological structure of the tree, we cannot distinguish events 2 and 3 from recombination events. In the first event both end points of the transferred chunk fall within the observed sequence, whereas we cannot say whether the other end points in events 2 and 3 fall within the observed sequence length or fall outside. All that can be said is that the other end points do not fall within material ancestral to the present-day sample. The fourth event is a coalsecence event whereby positions (1/2, 3/4) find a MRCA. At event 5, positions (3/4, 1) find a MRCA, and at event 6, positions (0, 1/2) find a MRCA. Consider the ancestral sequence A. A gene conversion event with end points in the material not ancestral to the sample [that is outside (1/2, 3/4)] cannot be traced; it does not affect the history. Similarly, a gene conversion event in the ancestral sequence B with one or two end points in (1/2, 3/4) and none in ancestral material does not affect the history. Only the region spanned by ancestral material need be taken into account; e.g., the span of ancestral material at sequence B is 1 − 0 = 1, and the span at sequence A is 0.75 − 0.5 = 0.25. Note that the region might include nonancestral material as in sequence B.
THE MODEL
Consider a diploid population model with effective constant (diploid) size 2N. A new generation is obtained from the present generation by sampling 2N sequences with replacement, forming random pairs of sequences, and letting a short tract of nucleotides be transferred between the sequences. The mode of transfer is described below.
Sequences are of length L + 1 nucleotides so that there are L gaps between nucleotides. Consider one such sequence. Assume that in any generation the probability of a gene conversion initiating between any two positions in the sequence is g, independently of whether gene conversions initiate elsewhere along or outside the observed sequence. Where along the sequence a gene conversion initiates is uniformly distributed among all sites. If gL is small the chance of more than one gene conversion event in one generation is negligible.
The transferred chunk of nucleotides originates from a randomly chosen sequence in the population. In Wiuf (2000), the length, Z, of the converted chunk can have an arbitrary distribution. Here we consider the specific case where Z follows a geometric distribution with parameter q, i.e.,
Let Q = qL. If q is small and L is large the distribution of z = Z/L can be approximated with an exponential distribution with parameter Q, i.e., z ~ Exp(Q) (where ~ means “is distributed like”). Further, let G = 4gNL be the expected number of gene conversion events per sequence per 4N generations and assume g is small and N large. The definition of G is analogous to the parametrization adopted for the coalescence with recombination (Hudson 1983).
The genealogical process of a sample of n presentday sequences is studied: time starts at the present and increases, going backward in time. Under the above assumptions we find that the waiting time, WC (in units of 2N generations), until a sequence has been created by a gene conversion event that initiates within the sequence is approximately exponentially distributed with parameter G/2, i.e., WC ~ Exp(G/2), if N is large and gL is small. The rate of gene conversions initiating outside the sequence but ending within the observed sequence must also be taken into account. This rate depends on the distribution of Z and is discussed in the subsequent section. The rate of coalescence is n(n − 1)/2 if there are n sequences in a sample (Kingman 1982) assuming a constant population size.
We note that Q as well as G scale linearly in L, the observed number of gaps between nucleotides. That is, both Q and G are doubled if the observed number of gaps, L, is doubled. The expected length of the transferred chunk in a gene conversion is 1/q (measured in number of nucleotides). Therefore, the parameter Q is interpreted as the sequence length measured in units of expected length of the transferred chunk. Similarly, the parameter G is sequence length measured in expected number of gene conversion events per sequence per 4N generations. We note that Q/G = q/(4gN) is independent of L.
EFFECTS OF GENE CONVERSION
In this section we discuss the effects of a single gene conversion event on a sequence and find the waiting time until the sequence has been created by a gene conversion event.
Denote by ζ, σ, and τ the variables Z/L, S/L, and T/L, respectively, where Z, S, and T are as defined in the previous section. The variables σ and τ take values in {1/L, 2/L, …, 1 − 1/L} ⊆ (0, 1) and as L becomes large the distributions of σ and τ converge to continuous distributions on (0, 1). The representation of a sequence as the continuous interval (0, 1) is commonly used (see, e.g., Griffiths and Marjoram 1997).
Let C denote the event that a gene conversion happens in a given sequence in a given generation. Further, let C1 denote the event that the gene conversion falls partly outside the L + 1 observed nucleotides, that is, S + Z > L. Similarly, let C2 denote that both end points are within the sequence length (lower index i indicates that i of the two end points of the converted chunk are within the L nucleotides). We find
The end points of the tract are affected by the type, C1 or C2, of the event. The density, fσ,τ(s, t|C2), of σ and τ conditional on C2 is given by
The density of (σ, τ) conditional on C2 relates to that of ~ conditional on C2 because ζ = τ − σ. We find
Finally, the density of σ conditional on C1 (that is, σ + ζ falls outside the observed nucleotides, σ + ζ > 1) is
In Figure 4 we illustrate the above different distributions with q obtained from D. melanogaster data.
Gene conversions initiating outside the L + 1 nucleotides will have a chance of terminating within the L + 1 observed nucleotides. Assume that the entire chromosome potentially consists of an infinite array of sequences of length L plus the observed one of length L + 1 and that the gene conversion model described above is valid for the entire chromosome.
For most organisms, there is an upper bound to the length of a gene conversion tract. This means that only a minor (finite) extension of the observed sequence should be taken into consideration and not the entire chromosome. In the model discussed here, tract lengths can have an arbitrary size, but it can be shown that the probability of a tract initiating at least nL nucleotides away from the observed sequence and ending within the sequence, given a gene conversion event happens that ends in the observed sequence, is of order exp(−nQ). Basically only tracts initiating close to the observed sequence affect the sequence history in concordance with what is expected.
Densities of the initiation point and the length of the gene conversion. Shown are the densities of σ conditional on C1 and C2, respectively, as well as the density of ~ conditional on C2. The density of σ given C2 is found from (4); fσ(s|C2) = {1 − exp(−Q(1 − s))}K−1(Q), 0 < s < 1. The parameter Q is in all graphs 5. Hilliker et al. (1994) estimates q in D. melanogaster to be ~1/350 and this corresponds in this case to an observed sequence length five times the expected length of a gene conversion tract or 1750 nucleotides, L = Q/q.
For each gene conversion tract initiating within the observed sequence and ending outside the observed sequence there is a similar tract initiating outside (to the left of the observed sequence), ending within the observed sequence. Because the chance of gene conversion events is the same along the entire chromosome, the two events have the same probability of happening.
Let Co(o for outside) denote the event that a gene conversion initiates outside the observed sequence, and terminates within. The above informal argument shows that
Further, conditional on Co, the termination point t = T/L has density
The events and Co and C1 are independent. Thus, the waiting time WC1∪Co = min(WC1, WCo) until an event of either type C1 or type Co is approximately exponentially distributed with parameter G(1 − K(Q)), and where the initiation/termination point, δ, happens, is distributed with density
Summing up, we find that the waiting time, W, until a sequence is created by a gene conversion event is exponentially distributed
Given a gene conversion occurs, the probability that both end points of the inserted chunk are within the observed sequence is
The distribution in (10) is interesting for several reasons. First, the distribution of the tract length, ~, depending on Q, affects the number of events and the times between events. The parameter G(2 − K(Q))/2 varies from G/2 when Q = ∞ to G when Q = 0. Second, for all Q > 0 the number of events is higher than the number of events in a recombination model with a similar parameter (i.e., r = g).
The two extreme values, Q = 0 and Q = ∞, are of interest. If Q = 0 we find that W ~ Exp(G), and a gene conversion will leave just one breakpoint within the sequence. Where the break occurs will be uniformly distributed along the sequence. Thus, the coalescent with gene conversion resembles the coalescent with recombination, but the rate of gene conversions is twice the rate of recombinations (assuming the probability of a recombination between any two nucleotides is g).
The simulation scheme. Define
= G(1 + (1 − exp(−Q))/Q) = G(2 − K(Q)), and let k denote the number of sequences ancestral to the sample of size n at a given time in the past. Further, let Ui, i ≥ 1, and Vj, j ≥ 1, be series of uniform variables on (0, 1). The exponential variable Wi is the time between event i − 1 and i in the samples history. Events are either coalescence or gene conversions, and if a gene conversion occurs we distinguish between two types according to the number of end points within the observed sequence. With probability p1 there is one end point only, otherwise two. The index i counts the number of events and j the number of gene conversions. That is, i − j is the number of coalescence events. The algorithm stops the first time there is one ancestor to the sample.
If Q = ∞ we find that W ~ Exp(G/2) and that both end points of a gene conversion will be within the sequence. As Q goes toward infinity, σ will be uniformly distributed along the sequence, and τ ≈ σ, that is, ζ ≈ 0. In the limit Q = ∞, a gene conversion will just leave a spot on the sequence, leaving the sequence the way it is. Thus, the coalescent with gene conversion will be indistinguishable from the pure coalescent process. Each time a conversion happens, it cannot be seen, and only coalescent events affect the history of a sample.
MODEL SIMULATIONS
Assume n sequences are sampled from a present-day population. The genealogy of the sample subject to gene conversion as described in the previous section can be simulated according to the scheme in Figure 5.
The algorithm is formulated as a birth and death process with constant death rate, G(2 − K((Q))/2, and terminates when there is only one ancestral sequence to the sample. Griffiths and Marjoram (1997) developed a similar birth and death process to handle to coalescent with recombination. Both algorithms return a genealogy in which the “true” genealogy of the sample is embedded; we keep track only of the number of sequences and the end points of gene conversions, not which parts of the sequences that are ancestral to the sample. Gene conversions can happen outside the ancestral material to a sequence, thereby creating an “empty” sequence, i.e., a sequence with no material ancestral to the sample (cf. Figure 2).
Consider a fixed point, χ, along the sequences. The tree, T(χ), that describes the sequences at χ can be found by starting at the present sequences and going back in time. When a gene conversion node is encountered, the branch that describes the segment containing χ is followed. Assume that the gene conversion leaves only one end point, s, in the sequence. If s < ω, then the branch describing the fate of the left part of the sequences is to be followed and vice versa and similarly, if the gene conversion leaves two end points. The tree T(χ) is distributed like the coalescent process since one point cannot be subject to gene conversion.
Mutations can be superimposed on the genealogy if all alleles are selectively neutral. Assuming mutations arrive according to a Poisson process, the number of mutations on the total genealogy is Poisson with parameter θB/2, where B is the total branch length of the entire genealogy, θ = 4Nu is the scaled mutation rate, and u is the probability of a mutation in a sequence per generation.
The coalescent with gene conversion can be combined with the coalescent with recombination. The genealogy of a sample of sequences is constructed similarly to the scheme above, waiting for three different events to occur: coalescence, recombinations, and gene conversions.
RESULTS
As noted previously, Q0 = Q/G = q/(4gN) is independent of L. The parameters (Q0, G) are more natural parameters than (Q, G); G represents the length of the sequence in units of gene conversion events per sequence per 4N generations and Q0 is a parameter dependent on the effective size of the population and parameters intrinsic to the biological system. In contrast (Q, G) are both parameters representing sequence length, but in different units.
In the following, results are given in terms of (Q0, G) to facilitate comparisons between samples of different lengths but with the same Q0. If we consider Q0 fixed and G a variable, results (and plots) can be converted between models with different g's but equal Q0 through a linear scaling of sequence length.
If sequence length is measured in units of expected
number of events per sequence per 4N generations, we find that the scaled length ζ′ = ζG of a tract is
Probability that two positions share an ancestor, n = 2. Shown is equation 15 for different values of Q0. Sequence length is in expected number of gene conversions per sequence per 4N generations. Using estimates in Hilliker et al. (1994), we find Q0 = 0.05 in D. melanogaster (g = 3 × 10−8, q = 3 × 10−3, and 2N = 106). The curve for Q0 = 0.05 flattens out approximately at 0.027 for G > 40 in which case Q > 2, or twice the expected length of a gene conversion tract. The case Q0 = 0 corresponds to a model with recombination only at rate 2G.
In the following, we consider a sample of size n taken from the population at the present time. First, we focus on correlations between trees. Consider two positions, χ1 and χ2, in a sample of n sequences at distance G. We are interested in the trees, Tn(χ1) and Tn(χ2), that describe the sequences at χ1 and χ2. Only events of type C1 and Co in between positions χ1 and χ2 affect the relation between the two trees. Events of type C2 in between the positions cannot be traced. The rate, rG/2, by which events of type C1 and Co happen is, per sequence,
For small values of G we find rG ≈ 2G. This is of particular interest because the rate at which recombinants are produced by recombination is only G (assuming r = g), and this is half the rate at which recombinants are produced by gene conversion events. Thus, gene conversion events might contribute significantly to linkage disequilibrium over small distances (see also Wiuf 1999). For example, according to Andolfatto and Nordborg (1997), g = 3r in D. melanogaster and the rate of gene conversions is thus sixfold that of recombination over small distances.
Simulated values of the time until all positions have found a MRCA in a sample of size n = 2. The expected value 2(1 − 1/n) of the time to a MRCA in a single position is subtracted. The curves increase with increasing values of Q0. This is in agreement with the fact that the expected number of gene conversions also increases with increasing Q0 (see Figure 9). The D. melanogaster curve, Q0 = 0.05, follows Q0 = 0 very closely (not shown). A total of 10,000 simulations were performed for each value of Q0.
Applying results in Griffiths (1991), we find for n = 2,
The probability P(T2(χ1) = T2(χ2)) for n = 2 is plotted in Figure 6. The tree height, Tn = max{Tn(χ)|χ}, until all positions in the sample have found a most recent common ancestor is plotted in Figure 7 for n = 2. For Q0 = 0.05 and G = 5 we find Q = 0.25, which corresponds to ~100 nucleotides in D. melanogaster (see Figures 4 and 6). In this case the chance that two positions separated at distance 5 share a most recent common ancestor (MRCA) is ~0.10 if sample size is 2.
Number of gene conversion events. The number of gene conversion events affecting the history cannot be counted by chopping up the sequence into small sections, di, i = 1, 2, …, counting the number, ci, in each section and then summing over i, ∑ici. Knowing that one end point of a gene conversion is in section di does not provide information whether or not the other end point is within ancestral material. In the example, c1 = 2, c2 = 1, and c1 + c2 = 3, but only two events affect the history. Each breakpoint occurs twice in the drawing.
Also of interest are the numbers of different types of gene conversion events. We discuss two such numbers. The first is the number of gene conversion events affecting the history of the sample. Denote this number by Gn. The second is the number of gene conversion end points that make the topology change. Call this number Sn. This number relates to the possibilities of detecting gene conversion events in the sample history. Under an infinite-sites mutation model (Watterson 1975) with a mutation rate very high compared to that of gene conversions, G, all shifts in topologies can be detected from sequence polymorphisms (Hudson and Kaplan 1985). Also, these are the only events that can be detected (under the infinite-sites model).
It is not possible to find the expectation of Gn analytically. The reason for this is the following. The expectation of Gn is not linear in G because a gene conversion with both end points in ancestral material is only counted once. Knowing that a gene conversion has one end point within ancestral material in a small sequence interval does not reveal if the other end point also is within ancestral material. Thus, information about gene conversion end points in a small sequence interval does not provide information whether the points count in Gn (Figure 8). In the coalescent with recombination, the expected number of recombination events can easily be found because the number is linear in R (Hudson and Kaplan 1985). The expected value of Gn is simulated for different values of G and Q0 and plotted in Figure 9. Asymptotically, the expectation of Gn is linear.
The expected number of events causing a topology shift is given by
Expected value of Gn. Sequence length is in expected number of gene conversions per sequence per 4N generations. As sequence length increases the expected number of gene conversions becomes effectively linear. The sample size is 10 and 10,000 simulations were performed for each value of Q0.
Consider position 0 in the sequences and let Bn be the total length of the coalescent tree in position 0. Wiuf and Hein (1999) consider the coalescent with recombination and discuss the sequence length, Λn, conditional on Bn, from position 0 until the first recombination point affecting the history of the sample. They find that
Here, we give the analogous result for the coalescent model with gene conversion. The situation differs in the sense that we wait for either a gene conversion initiating to the right of position 0 [analogous to (18)] or a gene conversion initiating to the left of position 0 but ending to the right of 0. The proof is given in the appendix. The distribution of the length, Λn, until the first point affecting the history of the sample, conditional on Bn, is
DISCUSSION
We developed a coalescent model with gene conversion assuming a diploid population of constant size, N. It takes two parameters as input (along with the sample size). The first is the product G = 4NLg, where g is the probability that a gene conversion tract initiates in a fixed position, and the second parameter is Q = qL, where q is the probability that the next nucleotide is within the tract given that the former is within the tract. An easy simulation scheme was developed. This scheme could be modified, in accordance with Griffiths and Tavaré (1994), to allow for fluctuation in the population size with time.
We derived a number of results related to the correlation of trees in different positions and to the probability that two given positions share a common ancestor. These tend to be highly different from similar quantities obtained in the coalescent with recombination. The covariance between trees relating two distinct positions does not tend to zero with increasing distance between the positions, but is bounded by a positive constant. Thus, it is expected that the trees in the two distinct positions are more likely to share the same topology only in a model of gene conversion than in a model of recombination with similar rate. Figure 10 confirms this. As Q0 = Q/G increases, the probability, π, of common topology in two distinct positions decreases and for fixed value of Q0, π converges (as function of distance) to a level distinct from 0 and 1. Low values of Q0 have a similar effect to that of recombination.
We should note in passing that given a gene conversion end point is in position χ, the probability, p, of detecting that the gene conversion has occurred from incompatibilities in the sequence alignment is very low and slowly increases with sample size. Hudson and Kaplan (1985) and C. Wiuf, T. Christensen and J. Hein (unpublished results) found
It is of high interest to be able to distinguish mechanisms operating on the molecular level in patterns of variability obtained in a sample of sequences. An indicator of gene conversion could be the following. Define an informative site to be a site where at least two different nucleotides are present in at least two sequences each. Further, say a pair of sites is compatible if there exists a topology explaining the sites with the minimum possible number of substitution events. Now, assume three informative sites are given. If the first and last sites are compatible, and the two remaining pairs involving the middle site are both incompatible, we would take this as evidence of a gene conversion event. In a pure recombination model, given the middle site is incompatible with both other sites, we would expect that the first and the last sites are also incompatible. We have simulated how often under ideal circumstances this is likely to happen. It is assumed that the topology can be accurately constructed for each column in the sequence alignment. This is, for example, the case in an infinite-sites model with very high mutation rate. Table 1 shows simulation results for various values of Q0 and G.
Simulated probabilities that two positions do not share the same topology for various values of Q0. Sequence length is in expected number of gene conversions per sequence per 4N generations. Consider again D. melanogaster. The curve Q0 = 0.95 follows Q0 = 0 very closely and the chance that two points at distance 1 have different topologies is >70%. This corresponds to a distance of 5% the length of the expected tract length or 0.05 × 350 ≈ 20 nucleotides. Sample size is 10 and 20,000 simulations were performed for each value of Q0.
Denote the positions 1, 2, and 3. Positions 1 and 3 are fixed whereas 3 is uniformly distributed between 1 and 3. The findings in the table can be explained as follows. Assume Q0 = 0. If 1 and 2 do not share topology and similarly for 2 and 3, this requires at least two recombination events. As the distance between 1 and 3 increases, the chance of more recombination events before the MRCAs to positions 1 and 3 increases; thus the chance that 1 and 3 share the same topology decreases.
If Q0 > 0 the pattern changes. When the distance is small compared to 1/Q0 (expected tract length), most events will be of type C1 and Co. A similar pattern to that for Q0 = 0 is thus expected. As the distance increases, the rate by which 1 and 3 are separated becomes almost constant (rG = 2{1 − exp(−Q0G)}/Q0) and if (1, 2) and (2, 3) do not share topologies it is most likely caused by a gene conversion event with both end points within 1 and 3. Thus we will accept more cases where 1 and 3 share topology.
Distance between 1 and 3
Now fix the distance between the positions. As Q0 increases, the number of C2 events increases relative to C1 and Co events (Equations 11 and 12) and again if (1, 2) and (2, 3) do not share topologies it is most likely caused by a gene conversion event that removes a region around position 2. Again, we expect more cases where 1 and 3 share topology.
The pattern observed in Table 1 is obtained under ideal circumstances. In practice, the topology relating a set of sequences in a given position can be reconstructed from the nucleotide pattern in that position in rare cases only. Under an infinite-sites model with low mutation rate, the chance that 1 and 3 can be explained by the same topology is higher than the value given in Table 1. Similarly, recurrent substitutions have the same effect, but in both cases the overall pattern in Table 1 is retained. These results suggest that in a pure recombination model we will see many more pairs of incompatible sites than in a model with pure gene conversion (assuming the rate of recombination is similar to the rate of gene conversion). On the other hand we expect many more topology shifts under a gene conversion model, so that we must return, moving along the sequences, to previously visited topologies more often than under a recombination model. Also, the results in Table 1 imply that recombination can be distinguished from gene conversion by the spatial arrangement of incompatible pairs, simply by estimating the probability in Table 1 from data. If recombination is not present we will expect this probability to be significantly different from 0.
It would be advantageous to build a model of gene conversions and recombination more directly on a molecular model of these phenomena. This will be pursued in greater depth elsewhere, but the basic issues involved are shortly sketched here.
Central to most models of gene conversion/recombination is the Holliday junction (Holliday 1964). Although the original Holliday model has since been superceded by more complicated models, such as the single strand invasion model (Meselson and Radding 1975) and the double-chain-break repair model (Resnick 1976), the Holliday model would be still be the natural starting point for modeling gene conversion from a population genetical viewpoint. Issues in the Holliday model would also be central in the more complicated models. In the Holliday model the following three problems would be of central importance.
The movement of the Holliday junction and its consequences. If the movement is controlled by a random walk that is stopped at a fixed time, the length of the gene conversion would be close to a symmetric binominal distribution moved so its mean is zero. Since the movement of the Holliday junction along nonidentical sequences will create heteroduplexes that are energetically unfavorable to homoduplexes, it could be expected that there would be a sequence-dependent centralizing drift in this random walk. Since the energetic costs of mismatches are approximately known, this distribution could be calculated. The length of gene conversions would then be sequence dependent. This would be a complication when incorporating it into the coalescent that takes a view from the present toward the past.
How is the heteroduplex corrected? Two extremes can be imagined. The heteroduplex is resolved when the DNA is duplicated or one single strand is made the master copy and the other single strand is corrected so it matches to this master strand. From the point of view of redistributing ancestral material, this is simple as the region within the gene conversion will choose one ancestor. The other possibility, that mismatches are corrected independently one by one, has more drastic consequences for the distribution of ancestral material as each correction will be independent choices of parents. Within the region of the gene conversion, different positions can have different ancestors. This would also have drastic consequences for the phylogenetic relationships among the sequences as one moves along the sequences, as even arbitrarily close positions could be related by different phylogenies.
The Holliday structure can be resolved in two ways, one leading to a gene conversion, but no recombination, and the other leading to a gene conversion and a recombination. What are the probabilities of gene conversion and recombination? Few, if any, biological systems have gene conversion but no recombination. In modeling the consequences, it would be natural to make this probability a single parameter in the model.
Later models, especially Meselson and Radding (1975) and the double exchange model (Resnick 1976), are more complicated and will slightly change the picture. From a population genetical viewpoint, all that matters in these different models is how the ancestral material is redistributed when viewed from the present toward the past. The Meselson-Radding model would have a very short region with a one-directional gene conversion added relative to the pure Holliday model, due to the strand invasion. Since this region is expected to be short relative to the length of the migration of the Holliday structure, the Meselson-Radding model is expected to be very similar to the pure Holliday model. The double-chain-break repair model would have two strand invasions and also two Holliday junctions that could undergo a random walk relative to each other. The double-chain-break repair model probably has greater consequences for the distribution of ancestral material than the Meselson-Radding model, but that remains to be explored.
Since it is not well known how different recombination mechanisms are phylogenetically distributed, it would be interesting to know if it was ever possible to distinguish the underlying mechanism from pure population data. It would here be natural to formulate what was expected in terms of different sets of incompatibilities. For instance, if three informative sites were given, and the first and last sites were compatible, but the two remaining pairs involving the middle site were incompatible, this would be taken as an indicator of a gene conversion. If more informative sites were found close to the middle position, these would be expected to have different kinds of compatibilities dependent on which kind of repair mechanism was operating on the heteroduplex.
Whether or not different recombination/gene conversion models are distinguishable by population sequence data depends on the frequency of configurations of segregating nucleotides in the population that would result in different products when recombined under different models. A functional geneticist would design the necessary configurations of nucleotides and set up the adequate crosses. In population genetics this would have to occur by chance. It is unlikely that population sequence data can compete with designed experiments in this respect, but this does not imply that different mechanisms of recombination will not have consequences for the expected patterns of variation in population sequence data; such data are becoming available from many organisms, where molecular genetical experiments have not yet been performed.
Acknowledgments
J. P. Hjorth is thanked for helpful discussions on what gene conversion is, M. Nordborg for commenting on an early version of the manuscript, and M. Schierup for various suggestions that improved the manuscript considerably. We thank A. Mikkelsen for implementing the model and performing the simulations used in the article. C.W. was supported by grant Biotechnology and Biological Sciences Research Council 43/MMI09788 and by the Carlsberg Foundation, Denmark. Part of this work was carried out while the authors visited the Isaac Newton Institute, University of Cambridge.
APPENDIX: PROOF OF FORMULAS (19) AND (20)
Wiuf and Hein (1999) consider the coalescent with recombination and discuss the sequence length, Λn, conditional on the total branch length, Bn, from position 0 until the first recombination point affecting the history of the sample. They find that
Remember [see (10), (11), and (12)]
The rate, G(1 − K(Q0G))/2, of type Co events converges to 1/(2Q0) for G → ∞. Conditional on Bn = b, the number, Go, of Co events in the sequences ancestral to position 0 is thus Poisson distributed with parameter b/(2Q0),
Finally, combining (A2), (A3), and (A5) we deduce that Λn = min (Λo, Λ1,2) has distribution
Footnotes
-
Communicating editor: A. G. Clark
- Received August 4, 1999.
- Accepted January 18, 2000.
- Copyright © 2000 by the Genetics Society of America