Abstract
The distribution of the number of pairwise differences calculated from comparisons between n haploid genomes has frequently been used as a starting point for testing the hypothesis of linkage equilibrium. For this purpose the variance of the pairwise differences, V_{D}, is used as a test statistic to evaluate the null hypothesis that all loci are in linkage equilibrium. The problem is to determine the critical value of the distribution of V_{D}. This critical value can be estimated either by Monte Carlo simulation or by assuming that V_{D} is distributed normally and calculating a onetailed 95% critical value for V_{D}, L, L = E(V_{D}) + 1.645
BACTERIA might be called “facultative sexuals” because they can exchange genetic material through conjugation, transformation, and transduction, but genetic exchange is not a part of their reproductive mode. Just how frequently recombination takes place in bacteria has been a topic of debate since the first major study of bacterial population genetics, in which Escherichia coli genomes were assumed to recombine frequently leading to linkage equilibrium (Milkman 1973). Selander and Levin (1980) showed that this assumption was incorrect and that E. coli populations consisted of many asexual clones evolving in genetic isolation from all other clones comprising the species (cf. Maruyama and Kimura 1980, but see Guttman and Dykhuizen 1994). During the 1980s this clonal model was thought to hold for all bacterial populations until Istock et al. (1992) reported that a local population of Bacillus subtilis was in linkage equilibrium and argued that this resulted from frequent mixis. In addition to B. subtilis, linkage equilibrium has been reported for Neisseria gonorrhoeae (O’Rourke and Stevens 1993), subpopulations of Rhizobium (Souzaet al. 1992; Maynard Smithet al. 1993; Bottomleyet al. 1994; Strainet al. 1995), Burkholderia cepacia (Wiseet al. 1995), Helicobacter pylori (Goet al. 1996), and fluorescent Pseudomonas (Haubold and Rainey 1996).
The conclusion of linkage equilibrium reached in these studies is based on the variance of the distribution of the number of pairwise differences (V_{D}) among bacterial isolates that have been subjected to genetic analysis at multiple loci. V_{D} can be compared to a critical value obtained under the null hypothesis that all loci are in linkage equilibrium. This approach was first developed by Brown et al. (1980), who applied it to allozyme data from wild barley, Hordeum spontaneum. Whittam et al. (1983) pioneered its use in bacterial population genetics, and more recently this method served as the basis for an extensive comparative study of bacterial population structure (Maynard Smithet al. 1993).
There are two methods of calculating a critical value for V_{D}. (1) The null distribution of V_{D} can be simulated on a computer, and (2) assuming the null distribution of V_{D} is normal, a critical value can be calculated by the wellknown method of adding x standard deviations to E(V_{D}). But, as it is not known whether the null distribution of V_{D} is normal, Monte Carlo simulation has recently emerged as the preferred way for testing linkage equilibrium in bacterial populations (Souzaet al. 1992; Wiseet al. 1995; Haubold and Rainey 1996). However, this approach is computationally intensive and many workers have preferred to use the simplifying assumption of normality for hypothesis testing. In this case the correct test depends above all on an accurate estimator of the variance of V_{D}.
THE TRADITIONAL METHOD OF COMPUTING THE VARIANCE OF V_{D}
Suppose we have n sampled haploid individuals, arbitrarily numbered from 1 to n, that have been genetically assayed at q loci. Let d_{ij} denote the number of loci at which individuals i and j differ. Then the variance of pairwise differences is by definition equal to
Brown et al. (1980) suggested that the onetailed 95% critical value for V_{D} could be calculated assuming that the distribution of V_{D} is normal. Thus they estimated this critical value by
In the next section we derive a formula for the variance of V_{D} under the randomization scheme of Souza et al. (1992) and show that (6) is inappropriate for calculating the variance of V_{D} under these circumstances.
COMPUTING THE VARIANCE OF V_{D}
In this section we obtain an exact expression for the variance of V_{D} under the shuffling of alleles across individuals (the sampling without replacement method; see also Hudson 1994). In the following, d_{ij} denotes the random number of loci at which individual i and j differ in a shuffled sample. First we write V_{D} in terms of s_{ij}, the number of loci at which individuals i and j are identical. Noting that s_{ij} = q  d_{ij}, it follows that
We now proceed to derive expressions for each of the terms on the righthand side of the last line of Equation 10. Let x_{k} be an indicator variable, equal to one if individual 1 and individual 2 are identical at locus k, and zero otherwise. Then
Similarly, to calculate the other terms in (10) we define z_{k} to be one if individuals 3 and 4 are identical at locus k and zero otherwise, and we define y_{k} to be one if individuals 1 and 3 are identical at locus k. It follows that
We can write the results in a way that does not require double, triple, or quadruple sums. For example, note that
RESULTS AND DISCUSSION
To convince ourselves of the correctness of the above algebra and to demonstrate the inadequacy of Var(V_{D})_{old} we used Monte Carlo simulations. Eleven artificial samples were constructed in the following way: The first data set containing 100 strains and 10 loci with five alleles at each locus was constructed from 96 strains of genotype
The second data set was made up of 88 strains of the major genotype and 3 strains of each of the minor genotypes and so on until a data set of maximum genetic diversity was reached consisting of 20 strains of each genotype. In this way we obtained artificial data sets with genetic diversities ranging from 0.078 to 0.8, which represent the range of genetic diversities to which the test developed by Brown et al. (1980) has been applied. The completely linked artificial data sets were then unlinked by one round of resampling without replacement.
For each sample, Var(V_{D})_{old} and Var(V_{D}) were computed (using Equations 6 and 10, respectively). In addition, the randomization method suggested by Souza et al. (1992) was applied to each sample. That is, the alleles at each locus were shuffled randomly (resampling without replacement) and V_{D} calculated for each of 10,000 such shuffled samples. This allowed the calculation of the simulated sampling variance of V_{D}, Var(V_{D})_{MC}.
When Var(V_{D})_{old} was compared with Var(V_{D})_{MC}, it was found that the two values diverged dramatically for input matrices of high genetic diversity (Figure 1). This causes similar divergence between true and estimated critical values (data not shown) and has implications for testing linkage equilibrium in bacterial populations that will be discussed later. Clearly, Equation 6 should not be used. No discrepancies were found between Var(V_{D})_{MC} and the variance calculated with Equation 10 (see Figure 1).
The usefulness of (19) for hypothesis testing depends on whether the distribution of V_{D} is approximately normal under our null hypothesis of linkage equilibrium with replicates being produced by shuffling of alleles on haplotypes. For multilocus data sets there are three variables that may influence the shape of the distribution of V_{D}, the number of loci, the degree of diversity at each locus, and the number of strains. We investigated the effect of these three variables on the skewness of the distribution of V_{D} through Monte Carlo simulation by calculating g1 as a measure of skewness from sets of resampled V_{D} values,
Given that the distribution of V_{D} has positive skewness even for large samples, we investigated the effect of this deviation from normality on hypothesis testing. Data sets consisting of between 15 and 480 strains and 10 loci, each with genetic diversity of 0.444, were resampled to calculate the frequency with which V_{D} exceeded the critical values that would be obtained if the distribution of V_{D} was normal. Even for small data sets the discrepancy was slight. For instance, with 15 strains 6.69% of the resampled V_{D} values exceeded the 5% normal critical value (Table 1). For a sample of 480 strains the discrepancy between 5.13% and 5.0% was negligible. Note that the probabilities of exceeding the normal critical values were always slightly too large, as would be expected from the positive skewness of the distribution of V_{D}. For real data this means that whenever a sample has been diagnosed as being in linkage equilibrium, the same conclusion would be reached by Monte Carlo simulation. Further, the more time consuming it becomes to test the hypothesis of linkage equilibrium due to large sample size, the more useful our formula becomes. This is because the sampling distribution of V_{D} approaches normality for large samples.
Several recent reports of panmixis in bacteria have used the observed variance of pairwise differences (V_{D}) as a test statistic. Panmixis was concluded if the critical value of V_{D} was greater than the observed value of V_{D} (Maynard Smithet al. 1993; Bottomleyet al. 1994; Duncanet al. 1994; Strainet al. 1995; Goet al. 1996). The original method to calculate the critical value was devised for plant populations, which are only moderately diverse [e.g., h (H. spontaneum) = 0.145 (Brownet al. 1980)], compared to bacterial populations (cf. Table 1). In this study we showed by Monte Carlo simulation that high genetic diversity leads to an artificial inflation of Var(V_{D})_{old} (Figure 1). This problem was overcome by rederiving Var(V_{D}) (Equation 10; Figure 1).
Bacterial populations: To test the usefulness of this derivation in the study of bacterial population genetics, we investigated published allozyme data for the ECOR collection of E. coli (Ochman and Selander 1984), which is a wellknown example of a clonal population (Miller and Hartl 1986). In addition, data sets from Bradyrhizobium japonicum, B. subtilis, and Rhizobium leguminosarum were included in the analysis, because for these populations claims of linkage equilibrium have been based on incorrect formulas for the variance of V_{D}. Finally, an allozyme data set from N. gonorrhoeae was reexamined, as this taxon is considered a prime example of a sexual bacterial population (Maynard Smithet al. 1993; O’Rourke and Stevens 1993).
Generally we observed that bacterial populations are highly diverse (h = 0.311 to 0.691; Table 2) and that the genetic diversity varies strongly between loci (standard deviation = 0.178 to 0.304; Table 2). Further, the distribution of V_{D} displayed positive skewness in all cases, as observed in the simulations (Table 2).
E. coli: As expected from previous work (Miller and Hartl 1986), the electrophoretic types of the ECOR collection of E. coli are in linkage disequilibrium when the critical value obtained through the Monte Carlo process, L_{MC}, is compared to V_{D}(L_{MC} < V_{D}; Table 2). Further, L_{new} (= 2.592) is a good estimator of L_{MC} (= 2.608), while L_{old} (= 2.985) not only overestimates the critical value of V_{D}, but would also lead to the spurious conclusion that E. coli is in linkage equilibrium as L_{old} > V_{D} (Table 2).
B. japonicum: Bottomley et al. (1994) reported linkage equilibrium for a B. japonicum population represented by 17 electrophoretic types. This claim is clearly rejected by Monte Carlo simulation, which shows significant linkage for this population (L_{MC} = 2.593 < V_{D} = 3.985; Table 2). The same conclusion is reached by comparing L_{new} (= 2.557) with V_{D}. Surprisingly, V_{D} also exceeds L_{old}, on which the original claim of linkage equilibrium had been based. This discrepancy is resolved if L_{old} is calculated on the basis of the biased estimator
R. leguminosarum: Strain et al. (1995) obtained evidence of linkage disequilibrium in their U.K. population of R. leguminosarum by using Monte Carlo simulation, but H_{0} was not rejected on the basis of L_{old}. We obtained the same result, reinforcing the inappropriateness of L_{old} for hypothesis testing. We further found that L_{new} (= 2.911) was again a good alternative to the lengthy calculations necessary for obtaining L_{MC} (= 2.967; Table 2) through simulation. Strain et al. (1995) also analyzed groups I + III + IX and I + III of their R. leguminosarum U.K. population and reported linkage equilibrium for both subpopulations. We found that H_{0} is rejected for groups I + III + IX and I + III on the basis of L_{MC} and L_{new} (Table 2).
B. subtilis: Duncan et al. (1994) reported linkage equilibrium for the 50 electrophoretic types of B. subtilis contained in the B and D subdivisions of their sample. In contrast, we found that the combined electrophoretic types of groups B and D display strong linkage (Table 2) with V_{D} (= 4.128) far exceeding L_{MC} (= 2.422) and L_{new} (= 2.397). Group D on its own is also not in linkage equilibrium with L_{MC} and L_{new} < V_{D}, but note that as for E. coli, R. leguminosarum, and B. japonicum, application of L_{old} would lead to an inappropriate conclusion of linkage equilibrium. We further concluded on the basis of L_{MC} (= 2.664) and L_{new} (= 2.605) that group B is indeed in linkage equilibrium (Table 2).
N. gonorrhoeae: This group of bacteria is the best established example of a bacterial population in linkage equilibrium. An extensive allozyme data set comprising 228 isolates has been published and reported to be in linkage equilibrium (Maynard Smithet al. 1993; O’Rourke and Stevens 1993). Moreover, N. gonorrhoeae is naturally competent and frequently encounters different genotypes of the taxon due to the sexual habits of its host. As expected, we found that this population is in linkage equilibrium according to L_{MC} (= 1.837); L_{new} (= 1.831) gave the same result, further confirming the usefulness of this algebraic confidence limit (Table 2).
For all the bacterial populations tested, L_{MC} and L_{new} agreed well. This contrasted with the strong divergence of L_{old} from L_{MC}, which led to conflicting conclusions about the genetic structure of E. coli, B. japonicum, R. leguminosarum, and B. subtilis. Using computer simulations, Maynard Smith (1994) showed that a recombination rate only 20 times the rate of mutation was sufficient to unlink bacterial genomes. The detection of linkage disequilibrium in the soildwelling populations of B. japonicum, R. leguminosarum, and B. subtilis presented in this article indicates that the recombination rates in these groups are probably very low. This has also been found experimentally for B. subtilis (Roberts and Cohan 1995).
We conclude that past attempts to detect linkage disequilibrium in haploid multilocus data sets through the computation of a critical value for V_{D} were based on an erroneous formula for the variance of V_{D}. The correct formula for Var(V_{D}) communicated in this article forms the basis of a simple test of linkage. Furthermore, we find that V_{D} is approximately normally distributed (especially for large samples). Hence the algebraic test proposed here is a useful alternative to Monte Carlo simulation in cases where simulation is deemed too expensive or time consuming. A computer program written in FORTRAN77, which implements both the algebraic as well as the Monte Carlo test, can be obtained from B.H. upon request.
Acknowledgments
We thank J. Maynard Smith for first drawing our attention to the problem of testing linkage equilibrium from mismatch data and for helpful discussion. Thanks are also due to P. J. Bottomley for providing the Rhizobium allozyme data, and to T. S. Whittam and two anonymous reviewers for comments on the manuscript. This work was supported by grants from the Royal Society, Oxford University and the Biotechnology and Biological Sciences Research Council (United Kingdom).
Footnotes

Communicating editor: P. L. Foster
 Received April 2, 1998.
 Accepted August 21, 1998.
 Copyright © 1998 by the Genetics Society of America