Abstract
There are two types of recombination that we may wish to detect: rare recombinants between members of different populations or species and repeated recombination within a population. Methods appropriate in the former context are inappropriate in the latter because they depend on recognizing the existence of runs of nucleotides with similar ancestry. If recombination is sufficiently frequent, no such runs will be present. Several methods, including the homoplasy test and the incompatibility test, are described that are appropriate for detecting repeated recombination and for measuring its importance, relative to mutation, in causing genetic change. The sensitivity of these tests is investigated by simulating populations with varying frequencies of mutation and recombination and calculating the various statistics on samples.
THIS article is concerned with the use of DNA sequence data to detect and measure the role of homologous recombination in natural populations. Its most obvious relevance is to bacteria, which vary from the clonal to the effectively panmictic (Maynard Smith and Smith 1998; Suerbaum et al. 1998) and in which recombination is decoupled from reproduction. However, the methods are also relevant to eukaryotes in cases in which the role of recombination is uncertain, for example, between mitochondrial genomes (Eyre-Walker et al. 1999; Hagelberg et al. 1999) and in organisms whose reproduction is apparently parthenogenetic.
The main point to be made is that there are two very different contexts in which we may wish to detect recombination. Methods appropriate in one context may be ineffective in the other. We may be seeking either (i) unique or rare recombination between members of genetically different populations—for example, between different species or between different parts of the genome; or (ii) repeated recombination between homologous sites in members of a single population.
Unique or rare events lead to linked runs of nucleotides within a sequence whose ancestry is different from other nucleotides in the same sequence. Several methods exist for detecting such runs: they are described only briefly.
With repeated recombination, such runs may not exist or may be hard to detect. If recombination is frequent enough, the sites within a gene will be in linkage equilibrium: that is, there will be no association between neighboring nucleotides. Even with rates of recombination far too low to generate linkage equilibrium, it is shown that methods aimed at detecting runs of nucleotides with similar ancestry are ineffective. Methods of detecting repeated recombination are described. Ultimately, they depend on the fact that, if there has been recombination, the pattern of variation in a set of sequences is incompatible with the hypothesis of clonal descent: in this article, “clonal” is taken to mean reproduction without genetic recombination. In principle, these methods can be used not only to detect recombination but also to measure its importance, relative to mutation, in generating change. The effectiveness of these tests is investigated by applying them to simulated populations of varying size, mutation rate, and recombination rate.
RARE OR UNIQUE RECOMBINATION
Several methods exist for detecting runs of nucleotides with similar ancestry. Stephens (1985) examined sites that are associated with particular phylogenetic partitions of the set of sequences into two groups: thus a site is associated with a partition if it has one allele in one subset and a different allele in the other. He describes statistical tests to determine whether a set of sites associated with a particular partition are clustered. Sawyer (1989) gives a more general method for deciding whether the differences, or similarities, between a pair of sequences occur in runs: the method can also be applied to a set of sequences.
If a visual inspection of the polymorphic sites in a set of sequences suggests that one or more recombinational events have occurred, the maximum chi-square method (Maynard Smith 1992) will locate the most likely positions of the crossovers and test their statistical significance. The method was used to identify the role of horizontal gene transfer in the spread of antibiotic resistance in Streptococcus (Dowson et al. 1989) and Neisseria (Spratt et al. 1992). Hein (1990) showed how maximum-likelihood methods can be used to locate the position of crossovers. Holmes et al. (1999), in an analysis of recombination in Dengue virus, described a maximum-likelihood method that can be used if a putative recombinant and two parental sequences are suggested by the data.
Sneath et al. (1975) proposed a method that depends on identifying incompatible pairs of sites. They considered protein sequences, but the method is equally applicable to nucleotide sequences and is discussed here in that context. Two sites are incompatible if they cannot be incorporated into a phylogenetic tree without assuming that one of the sites has changed twice. If each site is present in the data set in only two states, as is typical for nucleotide data, the pair are incompatible if all four genotypes, 00, 01, 10, and 11, are present. An incompatibility matrix can be plotted for a set of sequences, for all phylogenetically informative sites (that is, sites at which both alleles are present in at least two strains). If there are n informative sites, an n × n matrix is plotted, with black squares for incompatible sites. Jakobsen and Easteal (1996) describe a program to construct incompatibility matrices and to test whether incompatible sites cluster. Jakobsen et al. (1997) describe a program that plots partition matrices: in effect, this is a visualization of Stephens’ method described above. It is not clear that this offers any advantage over eyeballing a simple listing of polymorphic sites, combined with using the maximum chi-square method to test the significance of possible recombinants.
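The pairwise incompatibility criterion for two-state sites can be sketched in a few lines. The following Python fragment is only an illustration, not the program of Jakobsen and Easteal (1996): the function names and the toy sequences are assumptions, and sequences are taken to be equal-length strings of 0 and 1.

```python
from itertools import combinations

def is_informative(column):
    """True if exactly two alleles occur and each is present in at least two strains."""
    counts = {allele: column.count(allele) for allele in set(column)}
    return len(counts) == 2 and min(counts.values()) >= 2

def incompatible(col_a, col_b):
    """Two two-state sites are incompatible if all four genotypes are present."""
    return len(set(zip(col_a, col_b))) == 4

def incompatibility_matrix(sequences):
    """Return the informative site indices and an n x n boolean matrix."""
    columns = [list(col) for col in zip(*sequences)]
    sites = [i for i, col in enumerate(columns) if is_informative(col)]
    n = len(sites)
    matrix = [[False] * n for _ in range(n)]
    for a, b in combinations(range(n), 2):
        if incompatible(columns[sites[a]], columns[sites[b]]):
            matrix[a][b] = matrix[b][a] = True
    return sites, matrix

# Hypothetical toy data: six strains, four sites (the last site is a singleton).
seqs = ["0000", "0100", "1010", "1110", "1111", "0010"]
sites, matrix = incompatibility_matrix(seqs)
```

In this toy set the fourth site is excluded because one allele occurs in a single strain; among the three informative sites, the second site is incompatible with each of the other two, while the first and third are compatible with one another.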
All the methods described in this section depend on the existence of runs of polymorphic sites with a similar ancestry. They are therefore unsuitable for detecting repeated recombination within a population.
REPEATED RECOMBINATION WITHIN A POPULATION
There are several methods for detecting repeated recombination. First, the logic of these tests is described, and then their sensitivity is investigated by applying them to simulated populations.
The homoplasy test: This test was described by Maynard Smith and Smith (1998). It can be applied to a set of sequences of a single gene or of several unlinked genes from the same strains. The logic is as follows. Construct a maximum-parsimony tree for a set of sequences. If there are v polymorphic sites, and t steps in the tree, then h = t − v is the number of homoplasies, or double events. If, as is usual, there are several equally parsimonious trees, this does not matter because it is only the total number of steps that matters. If descent is clonal and the number of sites infinite, then h = 0. If there has been recombination, however, in general h > 0. This fact was used by Hudson and Kaplan (1985) to estimate recombination rate.
Difficulties arise when mutation is so common that there can be repeated mutations at the same site. If so, there may be homoplasies even in a clonal population. Maynard Smith and Smith discuss how this difficulty can be overcome. It is usually best to confine attention to synonymous third sites: little information is lost. If there are S synonymous sites, all equally likely to change, it is easy to calculate exph, the expected number of homoplasies with clonal reproduction, given v polymorphic sites. If sites are not equally likely to change, an estimate of exph requires an estimate of S_{e}, the effective site number, defined as follows. Consider two identical copies of a gene, obeying the same evolutionary rules, and let each undergo a random substitution. Let p_{s} be the probability that the two substitutions are identical. Then S_{e} = 1/p_{s}. Clearly, if all sites are equally likely to change, S_{e} = S; otherwise, S_{e} < S.
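The definition S_{e} = 1/p_{s} can be made concrete with a small numerical sketch. It assumes, for illustration only, that each site has a single alternative state, so that two substitutions are identical exactly when they fall at the same site; the relative rates passed in are hypothetical values, not estimates from any data set.

```python
def effective_site_number(relative_rates):
    """Return S_e = 1 / p_s for sites with the given relative substitution rates.

    Assumes one possible substitution per site, so p_s is the probability
    that two independent substitutions hit the same site.
    """
    total = sum(relative_rates)
    site_probs = [rate / total for rate in relative_rates]
    p_s = sum(p * p for p in site_probs)
    return 1.0 / p_s

# Ten equally mutable sites: S_e equals the number of sites.
equal = effective_site_number([1.0] * 10)

# Five sites, one of which is four times as likely to change: S_e < S.
unequal = effective_site_number([1.0, 1.0, 1.0, 1.0, 4.0])
```

With equal rates the function returns S; with one site changing four times as readily as the other four, it returns 3.2 for S = 5, illustrating why S_{e} < S whenever sites differ in their probability of change.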
For synonymous sites, codon bias is the most likely reason why the probability of change should vary between sites. Given a known pattern of codon usage, Maynard Smith and Smith describe a method for calculating S_{e}: applied to Escherichia coli, this gives S_{e} = 0.73 S for genes with a very high codon adaptation index and S_{e} = 0.83 S for genes with a medium high index (data from Bulmer 1988). Alternatively, S_{e} can be estimated using an outgroup. Further difficulties arise if some sites are hypermutable. Methods for detecting hypermutability, if it exists, are described by EyreWalker et al. (1999): in general, such methods require an outgroup.
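If the observed number of homoplasies, obsh, is significantly greater than exph, the plausible explanation is recombination. The extent of recombination is measured by the “homoplasy ratio,” H = (obsh − exph) ÷ (number of homoplasies in a shuffled matrix − exph), whose expectation varies from 0 (clonality) to 1 (complete linkage equilibrium).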
The incompatibility ratio: If more than a very few recombinational events have taken place in the ancestry of a set of sequences, patterns in the incompatibility matrix become hard to interpret. However, for a given degree of polymorphism, recombination increases the proportion of incompatible sites. This suggests the use of the incompatibility ratio (IR) as a statistic, where IR = (number of pairs of sites that are incompatible) ÷ (number incompatible in a shuffled matrix). The data set is the matrix of phylogenetically informative sites, and, as for the homoplasy ratio, a shuffled matrix is one in which, at each site, the alleles have been randomly shuffled between strains.
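A minimal sketch of the IR calculation follows. It assumes the input has already been restricted to phylogenetically informative two-state sites, supplied as one list of alleles per site. Averaging the denominator over many shuffled matrices is an assumption made here to reduce sampling noise; the text specifies only "a shuffled matrix". All names and the toy data are hypothetical.

```python
import random
from itertools import combinations

def count_incompatible_pairs(columns):
    """Count site pairs in which all four genotypes (00, 01, 10, 11) occur."""
    return sum(len(set(zip(a, b))) == 4 for a, b in combinations(columns, 2))

def shuffle_within_sites(columns, rng):
    """Randomly reassign the alleles at each site among strains."""
    return [rng.sample(col, len(col)) for col in columns]

def incompatibility_ratio(columns, n_shuffles=200, seed=0):
    """IR = observed incompatible pairs / mean incompatible pairs after shuffling."""
    rng = random.Random(seed)
    observed = count_incompatible_pairs(columns)
    shuffled_counts = [
        count_incompatible_pairs(shuffle_within_sites(columns, rng))
        for _ in range(n_shuffles)
    ]
    mean_shuffled = sum(shuffled_counts) / n_shuffles
    if mean_shuffled == 0:
        raise ValueError("no incompatible pairs in any shuffled matrix")
    return observed / mean_shuffled

# Hypothetical tree-like data: eight strains, three mutually compatible sites.
tree_like = [
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
]
# Hypothetical data with all four genotypes at one pair of sites.
mixed = [[0, 0, 1, 1], [0, 1, 0, 1]]

ir_tree_like = incompatibility_ratio(tree_like)
ir_mixed = incompatibility_ratio(mixed)
```

The tree-like set has no incompatible pairs, so its ratio is zero; the second set contains one incompatible pair and gives a positive ratio.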
IR has one advantage and one disadvantage compared to H. The advantage is that it is easier to calculate the proportion of incompatible sites than to find a maximumparsimony tree, particularly for a large data set. The disadvantage is that its expected value for a clonal population is not known and is certainly not zero so that, in most cases, it cannot be used to test departure from clonality. However, if the effective site number S_{e} is very large compared to the number of polymorphic sites, then the expected number of incompatible pairs in the absence of recombination is close to zero, and this difficulty does not arise.
The index of association: A third possible measure is related to the index of association (I_{A}), which has been used to analyze multiple-locus enzyme polymorphism (Brown et al. 1980; Maynard Smith et al. 1993). This is based on V_{obs}, the variance of the genetic distance between pairs of strains, compared to V_{exp}, the corresponding variance in a shuffled matrix. The expected value of the ratio V_{obs}/V_{exp} is 1.0 for complete linkage equilibrium and >1.0 if recombination is absent or infrequent. It is most useful as a measure of departure from linkage equilibrium because its expected value is then known, and an expression for its error variance has recently been published (Haubold et al. 1998). It is less useful for detecting departure from clonality because its expected value in clonal populations is unknown unless S_{e} is very large relative to the number of polymorphic sites. A second difficulty with I_{A} is that its expected value for a clonal population increases with the number of loci analyzed. Burt et al. (1999) suggest a related statistic, which increases monotonically with I_{A}, but whose expectation is independent of the number of loci analyzed.
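The ratio V_{obs}/V_{exp} can be sketched as follows. Genetic distance is taken here to be the number of sites at which two strains differ, and V_{exp} is estimated as the mean variance over repeated shuffles; both choices, and all names, are assumptions made for illustration rather than the published procedure.

```python
import random
from itertools import combinations
from statistics import pvariance

def pairwise_distance_variance(strains):
    """Variance of the number of differing sites over all pairs of strains."""
    distances = [
        sum(a != b for a, b in zip(s, t)) for s, t in combinations(strains, 2)
    ]
    return pvariance(distances)

def shuffle_sites(strains, rng):
    """Shuffle alleles among strains independently at each site."""
    columns = [rng.sample(col, len(col)) for col in zip(*strains)]
    return [list(row) for row in zip(*columns)]

def variance_ratio(strains, n_shuffles=200, seed=0):
    """Return V_obs / V_exp, with V_exp averaged over shuffled matrices."""
    rng = random.Random(seed)
    v_obs = pairwise_distance_variance(strains)
    v_exp = sum(
        pairwise_distance_variance(shuffle_sites(strains, rng))
        for _ in range(n_shuffles)
    ) / n_shuffles
    return v_obs / v_exp

# Hypothetical strongly clonal sample: two clones differing at all six sites.
clonal = [[0] * 6 for _ in range(5)] + [[1] * 6 for _ in range(5)]
ratio = variance_ratio(clonal)
```

For this extreme two-clone sample the pairwise distances are either 0 or 6, so the observed variance greatly exceeds that of the shuffled matrices and the ratio is well above 1.0.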
The coefficient of linkage disequilibrium: Lewontin (1964) suggested the coefficient D = (P_{AB} · P_{ab} − P_{Ab} · P_{aB})/(P_{AB} · P_{ab} + P_{Ab} · P_{aB}), where P_{AB} is the frequency of AB haplotypes, and similarly for Ab, aB, and ab, as a measure of departure from linkage equilibrium. Because the choice of the symbols A and B is arbitrary, it is customary to take the absolute value of D, a number whose expectation varies from 0 (linkage equilibrium) to 1 (complete association). Conway et al. (1999) have recently used D to demonstrate recombination from population data in Plasmodium falciparum. They show that values of D significantly different from zero are frequent for bases <1 kb apart (demonstrating the power of the test to detect disequilibrium) but absent for sites further apart. The method is appropriate provided that none of the frequencies of the four gametic types are too low: with rare alleles or small samples, values of D = 1 will occur by chance, even with frequent recombination. Conway et al. analyzed samples varying from 66 to 124 isolates from single geographic regions and included only loci at which the frequency of the common allele did not exceed 0.9. Provided that data of this type are available, the method is an effective one, but it lacks sensitivity applied to more restricted data sets. For the samples of 20 or 30 individuals from the simulated populations analyzed below, it failed to distinguish between linkage equilibrium and clonality (data not shown).
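The coefficient as written above is simple to compute from a list of two-locus haplotypes. The sketch below uses hypothetical haplotype labels and samples; it also illustrates the caution about rare alleles, since a sample in which one gametic type happens to be absent gives a value of 1.

```python
def association_coefficient(haplotypes):
    """Absolute value of D as defined in the text, from labels AB, Ab, aB, ab."""
    n = len(haplotypes)
    p = {g: haplotypes.count(g) / n for g in ("AB", "Ab", "aB", "ab")}
    coupling = p["AB"] * p["ab"]
    repulsion = p["Ab"] * p["aB"]
    if coupling + repulsion == 0:
        raise ValueError("D is undefined: at least one locus is monomorphic")
    return abs((coupling - repulsion) / (coupling + repulsion))

# All four gametic types equally frequent: no association.
d_equilibrium = association_coefficient(["AB", "Ab", "aB", "ab"])

# Only AB and ab present: complete association.
d_complete = association_coefficient(["AB"] * 5 + ["ab"] * 5)

# Rare allele a in a sample of 20: the aB type is missing, so D = 1.
d_rare = association_coefficient(["AB"] * 18 + ["Ab"] + ["ab"])
```

The third sample shows why small samples with rare alleles are uninformative: a single missing gametic type forces the coefficient to 1 regardless of the underlying recombination rate.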
TESTING FOR RECOMBINATION IN SIMULATED POPULATIONS
Values of the three statistics, H, IR, and Sawyer’s ratio, were calculated for samples drawn from simulated populations. Simulations were carried out as follows:
Each population was haploid, of N individuals (varying from 50 to 1000), each with 100 sites equally likely to mutate between two alleles, 0 and 1 (thus the simulations are of single-nucleotide polymorphisms for which only two nucleotides are usually found at a site). Each new generation was formed by sampling with replacement from the previous one.
In each generation, m mutations occurred, each at a random site in a random individual.
In each generation, r recombinations occurred. A random donor and recipient were chosen, and all sites beyond a random crossover point were exchanged.
For each set of parameter values, starting from a population with only 0 alleles, a foundation population was formed by iterating 3N generations. Starting from this foundation population, five simulations were made, each of 3N further generations.
From each final population, two random samples (usually of 20 or 30 individuals) were drawn, and statistics calculated.
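The simulation steps above can be sketched as follows. This is a simplified illustration, not the original program: it runs a single simulation from the foundation population rather than five, draws one sample rather than two, treats m and r as whole numbers of events per generation, and applies resampling before mutation and recombination within a generation. The order of operations and all function and parameter names are assumptions.

```python
import random

def next_generation(pop, m, r, rng):
    """One generation: resample, then apply m mutations and r crossovers."""
    n, n_sites = len(pop), len(pop[0])
    # Form the new generation by sampling with replacement from the old one.
    new = [list(rng.choice(pop)) for _ in range(n)]
    # Each mutation flips a random site in a random individual.
    for _ in range(m):
        new[rng.randrange(n)][rng.randrange(n_sites)] ^= 1
    # Each recombination exchanges all sites beyond a random crossover point
    # between a randomly chosen donor and recipient (reciprocal exchange).
    for _ in range(r):
        a, b = rng.sample(range(n), 2)
        x = rng.randrange(1, n_sites)
        new[a][x:], new[b][x:] = new[b][x:], new[a][x:]
    return new

def simulate(n, m, r, n_sites=100, sample_size=20, seed=0):
    """Build a foundation population, run 3N further generations, and sample."""
    rng = random.Random(seed)
    pop = [[0] * n_sites for _ in range(n)]  # start with only 0 alleles
    for _ in range(3 * n):  # foundation population
        pop = next_generation(pop, m, r, rng)
    for _ in range(3 * n):  # one further simulation of 3N generations
        pop = next_generation(pop, m, r, rng)
    return rng.sample(pop, sample_size)

# Small hypothetical runs, far below the population sizes used in the article.
no_mutation = simulate(n=10, m=0, r=2, n_sites=8, sample_size=5, seed=1)
with_mutation = simulate(n=20, m=1, r=1, n_sites=8, sample_size=5, seed=1)
```

Without mutation the population never leaves the all-0 state, whatever the recombination rate, which gives a quick sanity check on the recombination step.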
Different statistics were calculated on different simulated populations. This was not necessary but arose because the investigation of H was completed before the investigation of IR started.
In Figure 1, the statistics are plotted against R/M, where R is the probability that a particular site in a gene is altered by recombination and M the probability that the site is altered by mutation. The use of this measure of recombination is discussed further below.
Figure 1A shows Sawyer’s ratio, a measure based on Sawyer (1989). This test depends on the sum of squares of the lengths of runs in the data; Figure 1A shows the ratio of this sum calculated for the real data and for a randomized matrix with the same allele frequencies at each site. A value of 1.0 indicates that there is no tendency for differences to occur in runs. As expected for frequent recombination, a test based on the occurrence of runs is unable to distinguish between clonality and complete linkage equilibrium, although for low values of R/M the ratio is usually >1.0.
Figure 1B plots the homoplasy ratio, H. The value rises continuously with R/M. Although the range of values for a given r and m is rather large, no overlap occurred between values for R/M = 0, 20, and 80. At the least, one can use H to distinguish between no recombination, some recombination, and linkage equilibrium. H = 0.5, a value not atypical for bacteria, implies that R/M ∼ 20. The value is very approximate, but it does confirm the conclusion of Guttman and Dykhuizen (1994) that, in E. coli, recombination is more important than mutation in generating genetic change in the short term.
Finally, Figure 1C plots IR. Like H, this statistic rises continuously with R/M, but, as expected, its value in clonal populations is not zero. Table 1 shows how, in a clonal population, the value varies with different levels of genetic variability. Without an estimate of its expected value in clonal populations, the test cannot be used to distinguish between clonality and low levels of repeated recombination unless an “infinite sites” assumption is justified, in which case any incompatible pairs are evidence for recombination.
Note that recombination was reciprocal as for chromosomal genes of eukaryotes. Recombination in prokaryotes differs in two respects: it is nonreciprocal and involves the insertion of relatively short pieces of DNA. It was not practicable to simulate nonreciprocal recombination because it would have a large effect in reducing genetic variability, given the small population sizes: this effect would be negligible in the large populations characteristic of bacteria. However, the results for H and IR should hold for bacterial populations. Populations were simulated in which short regions of 50 sites were reciprocally exchanged. H and IR were then plotted against R/M, with results (not shown) very similar to those in Figure 1. There is, however, one context in which the prokaryotic type of recombination is likely to have results different from the eukaryotic type. This is for the rate of decline in linkage disequilibrium with distance, which will depend on the size of the pieces transferred: this problem is worth further investigation.
The effect of population size on the variance of these estimates is of some interest. Table 2 shows the effect on IR of varying N in clonal populations and in populations with some recombination; similar results (not shown) were obtained for the effect on H of varying N. The variance between populations does not decrease with N. For clonal populations, this is not unexpected. Statistics such as H and IR depend on the form (topology and branch lengths) of the phylogenetic tree. This does not become uniform as the population gets larger. If one imagines looking at the (true) phylogenetic tree of a large population in successive generations, its form would not remain constant. Consider, for example, the coalescence time of all members of a clonal population. This will usually increase by one in each generation but occasionally decrease by some large number. That is, the tree changes discontinuously. It is therefore not surprising that statistics that depend on the tree (as do both H and IR) vary between large populations with the same parameters. It was, at least to me, more surprising that the variance of IR does not decline with N in populations with recombination.
What do the statistics H and IR measure? In Figure 1, H and IR are plotted against R/M, where R is the probability, per generation, that a nucleotide will change because of recombination and M is the probability that it will change by mutation. M is simply the per base mutation rate, but R is more complicated because it depends not only on the recombination rate but also on the genetic distance between recombinants. In a bacterial context, with one-way insertion of DNA fragments, it is easy to interpret R. Imagine an insertion of length d into a region of length s. If d < s/2, the new gene will, at least approximately, occupy the same position in a phylogenetic tree as it did before insertion, and the d inserted nucleotides may contribute homoplasies, depending on the number of changed sites introduced: that is, the number of homoplasies depends on Rdh, where h is the heterozygosity. Note that if d > s/2, the new gene is more similar to the gene in the donor, and the number of homoplasies depends on R(s − d)h. In the limit, when a whole gene is transferred, d = s, and no homoplasies are caused.
A similar approach can be adopted for reciprocal recombination with a single crossover point. The recombination event separates the gene into a shorter piece, length d, and a longer piece, length s − d. Again, H will depend on Rdh, but, because the event is reciprocal, each crossover contributes 2dh to R. In plotting Figure 1, R was taken as the number of site changes caused by recombination in the shorter gene region, summed over all chromosomes, and divided by Ns, where N is the population size and s the number of sites per gene.
The expected value of R/M can be calculated as follows. Let the per generation mutation rate = u; the probability that a site in two random individuals is occupied by a different allele = h; and the per gene recombination rate = sc. Then M = u and R = P (a gene undergoes a recombination event) × P (a particular site is included in the inserted region) × h = sch/4. For a haploid population with only two alternative alleles per site, h = 2Nu/(1 + 4Nu), and the expected value of R/M is Nsc/[2(1 + 4Nu)].
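The arithmetic in this paragraph can be checked numerically. The short sketch below computes R/M step by step from the stated expressions for h, R, and M, and compares it with the closed form Nsc/[2(1 + 4Nu)]; the parameter values are hypothetical and chosen only for illustration.

```python
import math

def expected_heterozygosity(n, u):
    """h = 2Nu / (1 + 4Nu) for a haploid population with two alleles per site."""
    return 2 * n * u / (1 + 4 * n * u)

def expected_r_over_m(n, u, s, c):
    """Expected R/M, with R = s*c*h/4 and M = u."""
    h = expected_heterozygosity(n, u)
    r = s * c * h / 4
    return r / u

def closed_form_r_over_m(n, u, s, c):
    """The same quantity written as N*s*c / (2 * (1 + 4*N*u))."""
    return n * s * c / (2 * (1 + 4 * n * u))

# Hypothetical parameter values: N = 1000, u = 1e-4, s = 100 sites, c = 1e-3.
stepwise = expected_r_over_m(1000, 1e-4, 100, 1e-3)
closed = closed_form_r_over_m(1000, 1e-4, 100, 1e-3)
```

Both routes give the same value, about 35.7 for these parameters, because substituting h into sch/(4u) cancels u from the numerator and leaves Nsc/[2(1 + 4Nu)].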
Table 3 shows expected and observed values of H and R/M for simulated populations with varying rates of mutation and reciprocal recombination. With one minor exception, H increases monotonically with R/M. Simulations (not shown) of populations with reciprocal exchange of short regions confirmed that H is a function of R/M. Thus the answer to the question at the head of this section is that H is a measure of R/M.
CONCLUSIONS
Several methods (Stephens 1985; Sawyer 1989) exist for detecting unique or rare recombinations between genetically different populations. They depend on recognizing the presence of runs of linked nucleotides with distinct ancestries. Incompatibility and partition matrices can be used to give a visual impression that such runs are present before statistical testing (Jakobsen et al. 1997). However, it is not clear that such matrices offer any advantage over the more obvious procedure of inspecting a printout of all informative sites in a set of sequences. The statistical significance of any runs suggested by such an inspection can be tested by the maximum chi-square method (Maynard Smith 1992) or in other ways.
Such procedures are already familiar. The main point of this article is to point out that they are ineffective in detecting repeated recombination between the members of a population because repeated recombination breaks up the runs of linked nucleotides on which they depend. In the limit, in a population in linkage equilibrium, there is no association between neighboring nucleotides and hence no runs.
Several methods of detecting repeated recombination are described here, and their effectiveness is compared on simulated populations. The homoplasy test (Maynard Smith and Smith 1998) compares the observed number of homoplasies in a maximum-parsimony tree of the sequences with the number expected in the absence of recombination. It has the advantage that the number of homoplasies expected in an equally variable clonal population can be estimated so that the evidence for recombination can be tested. It has the drawback, however, that it depends on finding a maximum-parsimony tree for the data, which is time consuming and inaccurate for large data sets. This difficulty can be met by analyzing a subset of, say, 30 sequences.
An alternative is the incompatibility ratio, which compares the number of pairs of sites that are phylogenetically incompatible with the number expected in a panmictic population. The statistic is easy to compute but has the drawback that the expected value in a clonal population is unknown unless an infinite sites model is appropriate for the data being analyzed.
The homoplasy ratio, H, is a number whose expectation varies from 0 (clonality) to 1 (complete linkage equilibrium). It can therefore be used as a measure of the rate of recombination. But just what rate is being measured? In simulated populations with a range of values of recombination and mutation rates, H is a function of R/M, where M is the probability that, in a short time interval, a nucleotide will alter as a result of mutation, and R is the probability that it will be altered by a recombination event. In a bacterial population, R is easy to interpret because recombination events usually consist of the insertion of a short region of DNA. The value of R will depend on the frequency of such events, the length of the inserted regions, and the genetic distance between donor and recipient. In eukaryotes, interpretation is less obvious. Recombination gives rise to two new sequences, each consisting of a region from each parent. Which parent, then, is to be regarded as the “donor” of novel nucleotides? The answer is that the shorter of the two regions is to be treated as “donated” DNA. Although an intuitive justification for this procedure can be given, the real justification is that, if R is calculated in this way, H proves to be a monotonic function of R/M.
Footnotes

Communicating editor: P. L. Foster
 Received March 22, 1999.
 Accepted June 7, 1999.
 Copyright © 1999 by the Genetics Society of America