Abstract
A study of DNA polymorphism and divergence was conducted for the cytosolic phosphoglucose isomerase (PGI:E.C.5.3.1.9) gene of five species of the mustard genus Leavenworthia: Leavenworthia stylosa, L. alabamica, L. crassa, L. uniflora, and L. torulosa. Sequences of an internal 2.3-kb PgiC gene region spanning exons 6–16 were obtained from 14 L. stylosa plants from two natural populations and from one to several plants for each of the other species. The level of nucleotide polymorphism in L. stylosa PgiC gene was quite high (π = 0.051, θ = 0.052). Although recombination is estimated to be high in this locus, extensive haplotype structure was observed for the entire 2.3-kb region. The L. stylosa sequences fall into at least two groups, distinguished by the presence of several indels and nucleotide substitutions, and one of the three charge change nucleotide replacements within the region sequenced correlates with the haplotypes. The differences between the haplotypes are older than between the species, and the haplotypes are still segregating in at least two of five species studied. There is no evidence of recent or ancient population subdivision that could maintain distinct haplotypes. The age of the haplotypes and the results of Kelly's ZnS and Wall's B and Q tests with recombination suggest that the haplotypes are maintained due to balancing selection at or near this locus.
IN a previous study of the single cytosolic phosphoglucose isomerase (PgiC) gene of Leavenworthia stylosa, Liu et al. (1999) found unexpectedly high levels of DNA sequence polymorphism. In the part of the gene studied, around intron 12, this was found to be accompanied by distinct haplotype structure and a high level of linkage disequilibrium between haplotypes. On the basis of the sequence data, the recombination rate was estimated to be high in the region studied, and data from other loci (Liuet al. 1998) did not support the hypothesis of population subdivision. It therefore seemed likely that the high nucleotide polymorphism and haplotype structure could be the results of the long-term action of balancing selection in or near intron 12. On the other hand, the data exhibited no deviations from the neutral model. Thus the cause of the high polymorphism remained unclear.
A neutral polymorphic site is expected to be maintained in a population for less than 4N generations. Advantageous or deleterious alleles will be fixed or eliminated much faster than this (Kimura and Ohta 1969). However, polymorphisms can be maintained for very long times if balancing selection acts in or near the locus. In the presence of a polymorphic variant, such as an amino acid involved in an allozyme polymorphism, diversity is expected to be elevated at very closely linked sites (Strobeck 1983; Hudson and Kaplan 1988). Only a few definitive cases of balancing selection have been described at the DNA level, including the major histocompatibility complex (Mhc) polymorphism (Klein 1986), studied mostly in humans (reviewed in Kleinet al. 1990) and self-incompatibility loci in plants (reviewed in Sims 1993 and Charlesworth and Awadalla 1998). A peak of polymorphism has been attributed to the effect of balancing selection in the Drosophila melanogaster Adh locus (Hudsonet al. 1987), although as this peak is not entirely attributable to the F/S amino acid difference, but also occurs in the S alleles, the situation at this locus is not entirely clear. Studies of other loci with allozyme polymorphisms have failed to find such peaks (Takanoet al. 1993). Here we present further DNA polymorphism data from the cytosolic phosphoglucose isomerase (PgiC) locus of L. stylosa, which strengthen the evidence that this gene is another example of a gene under balancing selection.
The goal of the present work is to examine the level of nucleotide polymorphism and haplotype structure in a much larger PgiC region than previously sequenced, to determine the extent of the region of high polymorphism in L. stylosa, and to obtain a larger number of allele sequences in order to test for deviations from the neutral model in this locus. For these purposes we obtained DNA sequences of PgiC alleles for a region spanning exons 6–16. Very surprisingly, this demonstrated that the observed haplotype structure extends much further than the intron 12 region, despite evidence of multiple recombination events. Interestingly, one of the amino acid replacements associated with a charge change correlates with the haplotypes. Comparison of the sequences from different Leavenworthia species shows that the split between haplotypes preceded the split between the species in the genus. This is even clearer now that we have included sequences from a further species in the genus that also has high allozyme diversity (L. alabamica populations; see Charlesworth and Yang 1998). Despite the ancient origin of the haplotypes, they are thus still segregating in at least two Leavenworthia species. Furthermore, several tests potentially able to detect deviations from the neutral model, particularly tests including the possibility of recombination, Kelly's ZnS (Kelly 1997) and Wall's B and Q statistics (Wall 1999), detect significant deviation from neutral expectations.
MATERIALS AND METHODS
Species and populations: A detailed description of the genus Leavenworthia, including most of the populations studied here, is in Liu et al. (1999). In this study we used 14 plants of an outcrossing species L. stylosa from populations 95007 (6 plants) and Hem1 (8 plants). Four alleles were obtained from 3 plants from two L. alabamica populations (see Charlesworth and Yang 1998), two alleles from a partially self-incompatible population (95006) and two from a highly self-fertile population (95009). One allele was sequenced from a L. crassa plant from the partially self-incompatible population 95005, and one was sequenced from each of two highly inbreeding species, L. torulosa (population 95008) and L. uniflora (population 95011). Since the genus Leavenworthia is thought to be closely related to Cardamine (Rollins 1963), we also isolated DNA from one Cardamine hirsuta plant collected at the University of Edinburgh King's Buildings campus for use as an outgroup.
Molecular methods: Genomic DNA was isolated from Leavenworthia leaves of individual plants by a standard hexadecyltrimethylammonium bromide (CTAB) plant miniprep method with several modifications. Leaves (~100 mg) were thoroughly ground in liquid nitrogen and then in 1 ml of extraction buffer (0.35 m sorbitol, 5 mm EDTA, 0.1 m Tris-HCl pH 7.4, 30 mm sodium bisulfite). Nuclei were collected by centrifugation at 3000 × g for 5 min. The nuclei were resuspended in 300 ml of extraction buffer and 300 ml of lysis buffer (0.2 m Tris, pH 7.5, 50 mm EDTA, 2 m NaCl, 2% CTAB, 5% N-lauroyl sarcosine) and incubated with RNAse for 10 min at 65°. After phenol-chloroform purification, DNA was precipitated with 0.6 volume of isopropanol and dissolved in 100–200 ml Tris-EDTA pH 8.
We used sequences of the Arabidopsis thaliana PgiC gene and L. crassa PgiC cDNA (GenBank accession nos. X69195 and AF054455) to design five “plus” and four “minus” primers for PCR and sequencing of the central 2.3-kb region of Leavenworthia PgiC gene (plus primers: +8, CCACTGTTTGTTCATACGGCTC; +10, AAATATTGATCCTGTTGATGTTG; +12, TGCTGTSAGCACTAATCTTGCG; +3, TTTGCATTTTGGGACTGGG; +14, AAGGGAGCTTCAAGCATTGAT; minus primers: −11, GCGTTCAGCATTGTTTCAGC; −13, TTGTTCGGGTCAATACCAAACT; −15, GCTGATCAATGCTTGAAGCTCC; −4, TCGAACGGGAGAGGTAGACCA). The +8 and −13 primers were used to amplify a region of 1.2-kb PgiC from L. stylosa, referred to below as region A (Figure 1). The +12 and −4 primers were used to amplify a 1.3-kb region referred to below as region B (Figure 1). Primers +8 and −4 were used to amplify the entire 2.3-kb PgiC region of the four other Leavenworthia species and of C. hirsuta. The amplification products were cloned into the pCR2.1 vector using the TA cloning kit (Invitrogen, San Diego) and both strands were sequenced on an ABI Prism 377 automatic sequencer (Perkin Elmer, Norwalk, CT).
Sequence alignment and analysis: The sequences obtained were aligned using ClustalX v.1.64 software (Thomsonet al. 1997) followed by additional hand alignment using the PROSEQ v.2.3, multiwindow sequence processor for Windows 95 developed by D. Filatov (unpublished results). Sequence data analyses (estimators of DNA diversity, the estimators of recombination rate C and γ, the estimator of silent site divergence Dxy, and Tajima's, Fu and Li's, McDonald and Kreitman (MK), and Hudson, Kreitman, and Aguadé (HKA) tests of neutrality) were performed using DNAsp v.2.93 (Rozas and Rozas 1997), SITES v.1.1 (Hey and Wakeley 1997), PROSEQ v.2.3, and an unpublished Fortran program written by D. Charlesworth. Wall's B and Q tests (Wall 1999), Kelly's test (Kelly 1997), sliding windows for linkage disequilibrium analyses, permutation tests for geographic subdivision (Hudsonet al. 1992), and coalescent simulations with recombination were performed by PROSEQ v.2.3. For all phylogenetic analyses we used MEGA v.1.01 (Kumaret al. 1993).
A permutation approach (Hudsonet al. 1992) was used to estimate the significance of sequence differences between the two L. stylosa populations studied and between the L and S haplotypes of L. stylosa. The value of the Kst* statistic was calculated for two groups of sequences (either for the two geographic populations or for L and S haplotypes). The critical value for this statistic was obtained by 1000 random permutations of the sequences between the two groups in the sample.
Coalescent simulations: For the coalescent simulations (see below) we used program routines in Pascal code kindly provided by J. Hey. These routines implement the standard algorithm of the coalescent process with recombination (Hudson 1983, 1990, 1993). The routines were rewritten in Object Pascal and built into the PROSEQ v.2.3 (D. Filatov, unpublished data). The program generates random samples of a given size with a given number of segregating sites and specified recombination rates. These simulated samples were used to estimate critical values of Kelly's (1997) and Wall's (1999) tests with recombination. In our simulations (see below), we used several values of the recombination rate, chosen to be close to the values estimated from the L. stylosa PgiC data (see Table 3).
Kelly's and Wall's tests with recombination: To calculate the probability of the observed values of test statistics arising by chance (P value) we simulated random samples of a given size, number of polymorphic sites, and recombination rate using the coalescent process. For each such sample generated, Kelly's ZnS statistic (Kelly 1997) and Wall's B and Q statistics (Wall 1999) were calculated and stored. After the statistics had been calculated for 10,000 simulated samples, the P value for the observed value of the statistic was obtained as a proportion of cases when the simulated statistic value was greater than or equal to the observed one.
RESULTS
DNA polymorphism: Nineteen alleles were sequenced from the two L. stylosa populations (see materials and methods) for region B of the PgiC gene. From these data, it was apparent that high diversity extends throughout this region, so 16 alleles were sequenced for a further region (A) 5′ to region B (see Figure 1). The sequence of the entire 2.3-kb region was obtained for 11 chromosomes for this species, using the few plants for which both alleles were sequenced for both the A and B regions. In L. stylosa, the level of DNA polymorphism is remarkably high: 180 of 1020 sites of region A and 214 of 1147 sites of region B are segregating in our samples. In the 11 sequences covering the whole 2.3-kb region, 263 of 2045 sites are polymorphic. The distribution of nucleotide polymorphisms (πtotal) along the sequence, based on these 11 alleles, is shown in Figure 2. Due to the large number of polymorphic sites, it is impossible to show them all in a figure; however, the list of all polymorphic sites, as well as the alignments, is available from the authors on request.
The PgiC region sequenced from exon 6 to exon 16. Thick and thin horizontal lines represent exons and introns, respectively. Small arrows show positions, directions, and names of the primers. For L. stylosa the whole 2.3-kb region was cloned and sequenced as two overlapping regions (region A and region B) of ~1.3 kb each.
No insertions or deletions (indels) were found in the coding regions, but extensive intron length polymorphism was observed in the introns. Nine of the 10 introns sequenced vary in size due to indels up to 100 bp long. In total, we found 51 indels in the introns of the whole 2.3-kb region. All indel polymorphism regions are excluded from the analysis below. Since indels represent about 10% of the region sequenced, indel regions were also analyzed separately, to check that they do not differ greatly in diversity from other intron regions. The nucleotide variation within indel regions was approximately the same as that elsewhere in L. stylosa PgiC introns (per-nucleotide π ≈ 5%).
The distribution of nucleotide polymorphisms along the 11 L. stylosa PgiC alleles 2.3 kb long detected in sliding windows of 100 bp with increment 25 bp. Values of nucleotide diversity (π) are shown for 11 alleles pooled (πtotal) and for S (πS) and L (πL) haplotypes separately. The arrow shows the position of the Asp/Lys polymorphism.
The nucleotide variation (excluding indel regions) found in L. stylosa is summarized, in terms of the standard measures of sequence polymorphism, π and θ, in Table 1. The two PgiC regions sequenced have similar average levels of DNA polymorphism. Most of the segregating sites are in the introns: 223 of 1437 intron sites (15%), compared with 40 of 598 exon sites (7%). The level of nucleotide polymorphism at nonsynonymous sites is about an order of magnitude lower than at synonymous sites (Table 1). However, for the entire 2.3-kb region we observed 12 amino acid polymorphisms, 3 of which were replacements with charge changes. No correlations of the allozyme mobility classes (see Charlesworth and Yang 1998) with the observed charge change amino acid replacements were found (data not shown).
Based on the sequences, the two different populations show no significant evidence of isolation. For the largest set of data, the 19 B region sequences, the Fst value estimated from the π values for the two populations was low, 0.019; the value of the test statistic for detecting geographic subdivision (Hudsonet al. 1992) was Kst* = 0.0098, which is lower than the 95th percentile (Kst*0.95 = 0.0168) calculated from 1000 permutations of the 19 sequences. For region A and for the entire 2.3-kb region, tests for subdivision (Hudsonet al. 1992) were also nonsignificant. Since there is no evidence of isolation of these two populations, they will be combined for further analysis.
Recombination: The minimum number of recombination events, estimated by Hudson and Kaplan's (1985) method, is 21 for the sample of 11 alleles 2.3 kb long. For the A and B regions this method detected 16 and 28 recombination events, respectively (Figure 3). Thus, the PgiC region is not a cold spot of recombination in L. stylosa. Hudson's (1987) C estimator of recombination (C = 4Nec, where Ne is the effective population size of the species and c is the recombination frequency per nucleotide site) gave a value of 0.083/bp for the B region, for which the set of sequences is largest; both A and B regions are consistent in suggesting frequent recombination. For the same data set, the γ estimator of recombination (Hey and Wakeley 1997) gave a value of 0.044/bp. The per-nucleotide recombination rate in all parts of the PgiC gene sequenced therefore appears to be ~5% and the ratio of C/θ appears to be ~1. Both γ and C estimates are biased but the biases are in opposite directions (Hudson 1987; Hey and Wakeley 1997). The two estimators have large variance; however, according to the simulations conducted by Hey and Wakeley (1997), for 12 sequences 2 kb long the variance of γ is only about twice that for Watterson's estimator of θ. Thus, for our samples, the variance of γ should be ~0.01 and the true per-nucleotide recombination rate (C = 4Nec) should probably not be <0.03.
DNA polymorphism in the PgiC gene of L. stylosa
Linkage disequilibrium and haplotype structure: Linkage disequilibria (significant by χ2 tests at P < 0.05 without correction for multiple tests) were detected for >25% of pairs of sites, many of which were >1 kb one from another. Although no disequilibria were significant after correction for multiple tests, the significance of Kelly's ZnS test with recombination (see below) suggests that linkage disequilibrium is significant in an evolutionary sense, i.e., that it exceeds that expected under neutrality. We do not show linkage disequilibrium data in the form of the commonly used linkage disequilibria grid because of the large number of sites. Instead, the distribution of linkage disequilibrium along the region, measured as ZnS and average D in a sliding window of 15 polymorphic sites, is shown in Figure 4. There is a clear peak of linkage disequilibrium in the middle of the sequence, centered close to intron 10 and exon 11, and two smaller peaks centered in introns 7 and 12. Interestingly, the peaks of linkage disequilibrium coincide approximately with the peaks of nucleotide polymorphisms (Figure 2).
Three distinct groups of alleles were previously defined by indel polymorphisms in intron 12 (Liuet al. 1999). From length differences in this intron the following names were assigned: short (S), long 1 (L1), and long 2 (L2). The groups are also distinguished by a number of nucleotide substitutions. With the longer sequence now available, the L1 and L2 groups are no longer distinct from one another, but sequences of the S group still cluster together and separately from L1 and L2 (Figure 4). Since the trees for L1 and L2 alleles are not well resolved, we will here combine L1 and L2 into a single L group. This is convenient for the analysis, since the frequency of S type alleles is approximately equal to the combined frequency of L1 and L2 alleles.
The distribution of linkage disequilibrium (measured by D and ZnS) among the 11 L. stylosa alleles of the 2.3-kb PgiC region, detected in sliding windows of size 15 polymorphic sites with an increment of 5 polymorphic sites. For each window the value of ZnS and an average value of D are plotted. The thin horizontal line shows the critical (P < 0.05) value of ZnS without recombination for 11 sequences with 15 segregating sites.
Pairwise nucleotide divergences between the 11 sequences covering the whole 2.3-kb region are shown in Table 2. Within groups (boxed), divergence is lower than between the groups. The diversity among sequences within the combined L1 + L2 group is, however, nearly as high as the between-group divergence. This suggests that these two groups are distinct from one another, despite their not being well resolved in the trees. The level of DNA polymorphism within the S group (π = 0.025 ± 0.003) is significantly lower than in the whole sample (π = 0.052 ± 0.0036), or within the L1 (π = 0.040 ± 0.007) or L2 (π = 0.050 ± 0.010) groups, or within the combined L1 + L2 group of sequences (π = 0.056 ± 0.004). The L2 group has the highest within-group polymorphism and could probably be further divided into smaller subgroups if a bigger sample were studied.
The distribution of DNA polymorphisms along the 2.3-kb PgiC region for 11 pooled sequences and separately for S (6 alleles) and L (5 alleles) haplotypes is shown in Figure 2. Most polymorphisms are within the L type but the peak around intron 7 is mostly due to the differences between the S- and the L-type alleles.
To test the significance of differences between the L and S haplotypes, we applied a permutation approach (Hudsonet al. 1992), treating the two groups as geographic populations. This tests for evidence of significant isolation between the two sequence groups, in our case potentially attributable to their being associated with different alleles maintained in the populations by balancing selection, rather than isolation due to geographic separation, for which the test was originally designed. The test results were highly significant throughout the gene. For the full 2.3-kb region, Kst* = 0.055 (Kst*0.999 = 0.044, P < 0.001); for the A region, Kst* = 0.067 (Kst*0.999 = 0.062, P < 0.001); and for the B region, Kst* = 0.076 (Kst*0.999 = 0.047, P < 0.001).
Neighbor-joining tree for the entire 2.3-kb PgiC region of five Leavenworthia species and of C. hirsuta. Jukes-Cantor distances are shown for each branch. Bootstrap values (in brackets) are shown only for basic branching between S and L haplotypes. Haplotype assignments are shown as follows: S, short; L1, long 1; L2, long 2. L comprises L1 and L2. Species are indicated by the first two letters in each label as follows: ST, L. stylosa; AL, L. alabamica; CR, L. crassa; TO, L. torulosa; UN, L. uniflora; CD, C. hirsuta.
Pairwise nucleotide divergence between 11 L. stylosa alleles sequenced for the entire 2.3-kb PgiC region
Tests for selection assuming no recombination: To test for the operation of natural selection in the region under study, we applied several tests (HKA test, Hudsonet al. 1987; MK test, McDonald and Kreitman 1991; Tajima's D, Tajima 1989; Fu and Li's D*, Fu and Li 1993) potentially able to detect deviations of the observed data from the expectations under the neutral model. None of the tests was significant, probably because these tests assume no recombination and, since there is evidence of recombination in the L. stylosa PgiC region, the power of the tests is reduced (Hudsonet al. 1987).
Tests for selection allowing for recombination: Kelly's test (Kelly 1997) examines regions for excess of linkage disequilibrium compared with that expected under neutrality. The test statistic ZnS averages the values of linkage disequilibrium (dij, the squared correlation of allelic identity between loci i and j, see Hartl and Clark 1989, pp. 53–54) across all polymorphic sites in the region. It thus summarizes linkage disequilibrium between all pairs of polymorphic sites. Since the critical values of ZnS given in the original article (Kelly 1997) are for up to 50 polymorphic sites in the sample, we applied the test in a sliding window of 15 or 50 polymorphic sites. The results of the tests were significant only for the intron 10 region using a window size of 15 polymorphic sites (see Figure 4). To apply the test to the data for the whole sequence, we calculated critical ZnS values using a program kindly provided by John Kelly. Again, the results for the A and B regions and the whole 2.3-kb region were nonsignificant. However, the critical values for Kelly's statistics used in these tests are calculated from coalescent simulations without recombination (Kelly 1997) so the test is extremely conservative. We therefore calculated P values for the observed ZnS (Table 3) by coalescent simulations with recombination (see materials and methods). With recombination, the results of Kelly's test are nonsignificant (P > 0.05) for both A and B regions when C < 0.04–0.05 and for the entire 2.3-kb region when C < 0.07.
We also applied J. Wall's (1999) B and Q tests (Table 3) for deviations of the results from the neutral model. The B statistic is the proportion of pairs of adjacent segregating sites that are congruent, i.e., that have consistent genealogies. The Q statistic is also based on the number of adjacent congruent sites, but takes into account the length of the regions where all sites are congruent. The P values for the observed B and Q were calculated in exactly the same way as for Kelly's ZnS statistic (see materials and methods). For the A and B regions the B and Q statistics are significant when C ≥ 0.03–0.04 (Table 3). For the 2.3-kb region both B and Q statistics detected significant deviation from the neutral model, with recombination rate C ≥ 0.01 (see Table 3), which is four times less than the value of γ estimator (note that this estimator tends to underestimate the amount of recombination).
Between-species comparisons: We sequenced the 2.3-kb PgiC region for four alleles of L. alabamica and one allele for each of the species L. crassa, L. uniflora, and L. torulosa. The neighbor-joining tree for the five Leavenworthia species and C. hirsuta (outgroup) is shown in Figure 4. L. torulosa is very close to L. stylosa, as is also the case in phylogenies based on morphological characters (Christiansen 1993), and the L. torulosa sequences cluster together with the S haplotype of L. stylosa as was previously found for the intron 12 region (Liuet al. 1999). Sequences from L. crassa and three of the four sequences from L. alabamica (closely related species that are more distant from L. stylosa and have a different chromosome number; see Rollins 1963; Christiansen 1993) also cluster together with L. stylosa S-haplotype sequences. However, one of the two L. alabamica allele sequences from population 95006 clusters with the L. stylosa L1-haplotype sequences. Thus, the differences between haplotypes are apparently older than the differences between species in the genus Leavenworthia. It is interesting that haplotypes are still segregating in at least two of five Leavenworthia species studied.
Results of Kelly's ZnS and Wall's B and Q tests with recombination
Correlation of haplotypes with amino acid replacements: Overall we detected 12 amino acid replacements in the entire 2.3-kb region sequenced. For the 2.3-kb region, 3 of the 12 replacements involve charge changes. Most of the amino acid polymorphisms were singletons or variants found only twice in our sample of alleles. Only in one case (Asn/Lys in exon 8) do the two alleles have similar frequencies. This Asn/Lys polymorphism is due to a T/A mutation in the third position of PgiC codon 200. Interestingly, the Asn/Lys polymorphism is strongly associated with the haplotypes (Table 4). All alleles of S type have Lys (positively charged) and all except one L-type allele have Asn (uncharged) in this site. One allele of L1 type has Lys; this allele may be a recombinant since its 5′ part is more similar to S-type sequences, and only its 3′ part (the part used in the haplotype assignments) is of L1 type.
The comparison with the other species (Table 4) demonstrates that Asn is the ancestral state since C. hirsuta also has Asn in that position (based on a single sequence from this species, which appears to be highly selfing, and which we therefore assume will have little sequence variation). The PgiC sequences of the two other Leavenworthia species, L. crassa and L. torulosa, are generally very similar to the L. stylosa S type, and both also have Lys at position 200. The L. alabamica alleles also corroborate the association between haplotypes and the Asn/Lys polymorphism. Three of the four L. alabamica alleles sequenced are of the S type and have Lys, while the fourth allele is of L1 type and has Asn in the position 200 of the PgiC protein. Such an association suggests that the S(Lys)/L(Asn) haplotype structure arose due to a single mutation at position 200 of the PgiC protein of an ancestral Leavenworthia species.
DISCUSSION
High level of DNA polymorphism in L. stylosa: The level of DNA polymorphism observed in PgiC of L. stylosa is strikingly high. It is much higher than for most animal genes (Moriyama and Powell 1996) and for other plant genes in which sequence diversity has been quantified (except self-incompatibility loci; see, e.g., Richmanet al. 1996). The high level of DNA polymorphism in the L. stylosa PgiC gene could be due to the action of balancing selection in or near the locus. Theory (Strobeck 1983; Hudson and Kaplan 1988; Kaplanet al. 1988) suggests that there should be a peak of polymorphism and linkage disequilibrium near a site that is under balancing selection. The level of nucleotide diversity in L. stylosa PgiC between the peaks of polymorphism (around introns 7, 10, 12; see Figure 2) is approximately the same as in the other genes studied in this species (π ~ 3–4%; see Table 5). We must therefore consider the possibility that high DNA sequence polymorphism may be typical for the whole genome of this species. In the peaks of polymorphism, π reaches much higher values (6–10%). Thus, it is possible that π of 3–4% is typical for the genes of L. stylosa, but that the higher peaks of polymorphism are due to the maintenance of polymorphic sites for a long time by balancing selection.
Amino acid replacements in the entire 2.3-kb PgiC region of six species
Comparison of DNA polymorphism in several genes of L. stylosa
Plant sequence diversity: Most data available on DNA polymorphism within species currently come from animals, especially Drosophila (reviewed by Moriyama and Powell 1996). DNA polymorphism has been studied for only a few plant species, many of which are domesticated, and thus diversity may be underestimated (Doebley 1989, 1992). Data are currently available from maize (Shattuck-Eidenset al. 1990; Gaut and Clegg 1993a; Henry and Damerval 1997), melon (Shattuck-Eidenset al. 1990), and millet (Gaut and Clegg 1993b), and for natural populations of wild barley (Cummings and Clegg 1998), wild yam (Terauchiet al. 1997), morning glory (Huttleyet al. 1997), and several species of a mustard genus Leavenworthia (Charlesworthet al. 1998; Liu 1998; Liu et al. 1998, 1999). Polymorphism levels vary greatly in different plant species; estimates of θ = 4Nm range from ~0.001 for melon, millet, wild yam, and selfing species of Leavenworthia, L. uniflora and L. crassa, to much higher (up to 0.05) values for maize and L. stylosa. These high values exceed those in Drosophila populations (Moriyama and Powell 1996). Levels of intraspecies polymorphism depend on mutation rates and on aspects of population history, including the long-term population size and the occurrence of bottlenecks, which directly affect effective population sizes (Kimura 1983). In addition, genetic variability is affected by the mating system, since inbreeding increases the effects of selective sweeps and of selection against deleterious mutations (Hedrick 1980; Charlesworthet al. 1993; Liu et al. 1998, 1999). Some of the variability in levels of DNA polymorphism in plants could thus be due to the fact that some of the data are from selfing species or populations (wild barley, Cummings and Clegg 1998; L. crassa, L. uniflora, and L. torulosa, Liuet al. 1999), but for the outcrossing (dioecious) species Dioscorea tokoro (Terauchiet al. 1997) one would have to invoke lower mutation rates or population bottlenecks.
One possible cause of the high level of DNA polymorphism seen in L. stylosa PgiC and in maize loci (see Shattuck-Eidens et al. 1990; Henry and Damerval 1997) may be high mutation rates. Since plants do not have a germ line, germinal tissues are formed from somatic cells, so the number of cell divisions needed to form a progeny gamete from a parent seed, and hence per-generation mutation rates, could often be higher in plants than animals (Klekowski and Godfrey 1989). Mutation rates for chlorophyll deficiency mutations in long-lived mangrove species have been estimated to be high, 2–5 × 10−3 per haploid genome per generation (Klekowski 1988). Assuming 200 genes (Wettsteinet al. 1971) and a very high value of 50–100 replacement sites in each gene that could affect chlorophyll biosynthesis, we obtain a per-base per-generation mutation rate of at least 1–2 × 10−7, about an order of magnitude higher than estimates for animals (Kondrashov 1998; Drakeet al. 1998). Even assuming mutation rates in the annuals L. stylosa and maize as high as 2–5 × 10−7 per site per generation, to account for the observed π values the effective population sizes would need to be at least 2–5 × 105. Such a large effective population size (half that estimated for D. melanogaster, see Kreitman and Hudson 1991) seems implausible for L. stylosa, given the current fragmented state of Leavenworthia populations (Rollins 1963) and the likelihood of past population bottlenecks during glaciations. Thus, even if mutation rates are high, some other factor leading to high diversity appears necessary.
Haplotype structure: Apart from the high DNA polymorphism, another interesting feature of the L. stylosa PgiC data is the strong haplotype structure, which our new studies show spans at least the whole 2.3-kb region of the gene sequenced, despite the clear findings showing that this gene is not a cold spot of recombination. The significant results of both haplotype (Hudsonet al. 1994; Kirby and Stephan 1995) and permutation (Hudsonet al. 1992) tests show that the observed haplotype structure is highly improbable by chance alone under a neutral model.
An obvious potential explanation of the haplotype structure is recent or ancient population subdivision of L. stylosa. However, we could rule this out, since the two populations show no evidence for significant differentiation. Furthermore, sequence data from six other loci in L. stylosa show no signs of haplotype structure or isolation (Liu 1998). Another potential explanation of the observed trans-species polymorphisms is gene flow between the species of the genus. However, this does not seem likely, since L. stylosa does not give viable progeny with any other species and chromosome numbers differ between L. stylosa, L. uniflora, and L. torulosa on the one hand, and L. crassa and L. alabamica on the other hand (Rollins 1963). Even if some gene flow occurred in the past, there must be a force maintaining the haplotypes since that time.
The other possible explanation of the haplotype structure is balancing selection in or near the PgiC locus. Despite many tests of selection being nonsignificant, several lines of evidence suggest that balancing selection acts in this region. First, the results of Kelly's (1997) and Wall's (1999) tests with recombination demonstrate significant deviations from neutral expectations. Kelly's ZnS test statistic is a stringent test that is sensitive to the lengths of the internal branches of gene trees. The value of the test statistic is strongly affected by linkage disequilibrium between the sites where mutations occurred on the most ancient branches of the gene tree, which go directly to the common ancestor of the entire sample. Thus, the test has good power to detect balancing selection, based on its effect of stretching the internal branches of the genealogy. Wall's B and Q tests are also quite sensitive to the length of internal branches of the sample genealogy. The critical bounds for the ZnS, B, and Q test statistics were derived by coalescent simulations of random samples for a range of recombination rates close to that estimated for the L. stylosa PgiC gene, and the tests are mostly significant unless we employ recombination rates much lower than those estimated (Table 3). According to our results, B and Q tests appear to be more sensitive to detect balancing selection than the ZnS test.
Second, comparison with other Leavenworthia species reveals that the age of the haplotypes is higher than the age of species and even the karyotype differences in the genus, since the same haplotypes segregate in at least two of the five Leavenworthia species studied. Unfortunately we cannot precisely date the age of the haplotypes and species in the genus, since neither reliable estimates of mutation rate in dicotyledonous plants nor fossil data for the Leavenworthia genus are available. Estimates of molecular clock parameters are available for monocotyledons; the substitution rate per synonymous site per year between rice and maize for nine nuclear genes is estimated to be ~6 × 10−9 (Gaut 1998). Assuming the same silent site substitution rate for Leavenworthia, the divergence time between L. stylosa and C. hirsuta estimated from the mean sequence divergence per silent site, Dxy = 0.2, is about 17 million years. This is unreasonably high since these two species are thought to be close relatives, and the age of the whole Brassicaceae family is estimated to be about 15 million years (Muller 1984). Five million years seems a more realistic upper value of the time since the common ancestor of Cardamine and Leavenworthia (Rollins 1963; Priceet al. 1994). In that case, the silent site substitution rate in this group would be ~2 × 10−8 per year. This value seems reasonable, but is hard to reconcile with the higher value discussed above. If the silent mutation rate is ~10−8, the effective population size would have to be even larger than the already implausibly high value discussed above. The divergence times between species within the genus Leavenworthia would then be between 0.1 and 1 million years, and the divergence time of the S and L haplotypes, based on their mean divergence, Dxy = 0.065, would be about 2 million years. It seems unlikely that the haplotype polymorphism could be maintained for such a long time by drift if the variants were neutral, but this cannot be definitely excluded in the absence of good information about the effective population size of this species.
Finally, we found a possible target of balancing selection, the Asn/Lys polymorphism that correlates with the L type vs. S type of the PgiC alleles. Intriguingly, the site of this polymorphism is in exon 8, i.e., within one of the peaks of polymorphism and linkage disequilibrium (Figures 2 and 3). Moreover, it is the only high peak in the region studied that is mostly due to divergence between the S and L haplotypes rather than to polymorphism within the L haplotype (Figure 2). Multiple peaks of polymorphism suggest that this region of the PgiC gene contains several targets for selection. This is consistent with the fact that there are multiple allozyme variants in this locus (four in L. stylosa, data not shown). We did not find any correlation between the amino acid replacements and the other peaks of polymorphism; however, this may be due to the small sample size for the L-type alleles. Multiple allozyme variants in PgiC were also reported for Colias butterflies (Watt 1992) and for field crickets Gryllus veletis and G. pennsylvanicus (Katz and Harrison 1997). For both these insects there is some evidence that the maintenance of polymorphisms is due to balancing selection. The polymorphism in Colias has not been studied at the DNA level. The data on DNA polymorphism in Gryllus PgiC demonstrates that DNA sequence polymorphisms are not shared between the species and are thus short lived. This is consistent with a conclusion of Hasson et al. (1998) that allozyme polymorphisms in Drosophila are short lived. However, our observations suggest that allozyme polymorphisms may be maintained for a long time.
Comparison with the outgroup species, C. hirsuta, demonstrates that Asn is the ancestral amino acid at this site, and all but one of the L-type alleles have Asn, while all S-type alleles have Lys at this polymorphic site. Moreover, in L. alabamica segregating L and S haplotypes also have Asn and Lys at the polymorphic site, respectively. The change of Asn to Lys changes the charge of the whole protein and it is known that such changes could be selectively important (Riddoch 1993). The observation that polymorphism in S-type alleles is about half of that in L-type alleles suggests that the S type may be younger than the L type. Thus, the following scenario seems to explain the observed haplotype structure in the Leavenworthia PgiC gene. An ancestral species had predominantly L-type and a few S-type alleles with Asn at position 200 of a protein. Due to a mutation of T to A in the third position of this codon, the Asn residue S type mutated to Lys, and the change was advantageous. The frequency of mutant S+Lys-type alleles increased but did not go to fixation, due to either frequency-dependent or overdominant selection. During the subsequent speciation events, L. alabamica inherited the ancestral L(Asn)/S(Lys) polymorphism. L. crassa probably became fixed due to either small population size or a high rate of selfing, while fixation of different haplotypes in L. uniflora and L. torulosa is most likely attributable to their high rates of selfing, since diversity is expected to be low in highly inbreeding populations (Charlesworthet al. 1993).
Acknowledgments
We thank Jody Hey and Brian Charlesworth for discussions and advice on analyses, Jody Hey, John Kelly, and Jeff Wall for providing computer programs, and the University of Edinburgh greenhouse staff for plant care. D. Charlesworth was supported by the Natural Environment Research Council of Great Britain, and D. A. Filatov was supported by a grant to D. Charlesworth from the Leverhulme Trust.
Footnotes
-
Communicating editor: W. Stephan
- Received April 2, 1999.
- Accepted July 6, 1999.
- Copyright © 1999 by the Genetics Society of America