Abstract
Fifteen lines each of Drosophila melanogaster, D. simulans, and D. sechellia were scored for 19 microsatellite loci. One to four alleles of each locus in each species were sequenced, and microsatellite variability was compared with sequence structure. Only 7 loci had their size variation among species consistent with the occurrence of strictly stepwise mutations in the repeat array, the others showing extensive variability in the flanking region compared to that within the microsatellite itself. Polymorphisms apparently resulting from complex nonstepwise mutations involving the microsatellite were also observed, both within and between species. Maximum number of perfect repeats and variance of repeat count were found to be strongly correlated in microsatellites showing an apparently stepwise mutation pattern. These data indicate that many microsatellite mutation events are more complex than represented even by generalized stepwise mutation models. Care should therefore be taken in inferring population or phylogenetic relationships from microsatellite size data alone. The analysis also indicates, however, that evaluation of sequence structure may allow selection of microsatellites that more closely match the assumptions of stepwise models.
MICROSATELLITES are short stretches of DNA in which a motif of 1-6 bases is repeated up to about 60 times. Their high variability and ease of analysis have made them an extremely popular marker for genetic mapping, with maps available for a number of species, in particular mammals (Dib et al. 1996), but also protozoa (Su and Willems 1996) and plants (Causse et al. 1994).
Microsatellites are also increasingly used for population inference. Because of their high level of polymorphism, they have become a very powerful tool in determining kinship (Queller et al. 1993; Blouin et al. 1996) and assessing intrapopulation variability and substructure (Bowcock et al. 1994; England et al. 1996; Favre et al. 1997). The use of microsatellites in the study of interpopulation (or interspecific) genetic differentiation is much less developed. Analyses of population divergence require specification of a distance measure that may or may not explicitly specify a mutation process. When divergence is low, it is appropriate to use distances based on differences in allele frequencies between populations (Nei et al. 1983; Takezaki and Nei 1996). Recently, a number of distances have been developed specifically for microsatellite data. These specify a mutation model and include information on allelic size, e.g., (δμ)² and RST (Goldstein et al. 1995; Slatkin 1995). These distances are based on the assumption that microsatellites roughly follow the stepwise mutation model (Ohta and Kimura 1973; Valdes et al. 1993). Under this model, microsatellite mutations are caused by the gain or loss of one perfect repeat or, in generalized models, more than one repeat (Zhivotovsky and Feldman 1995). Direct and indirect observations have tended to support stepwise mutation at microsatellite loci (Schlötterer and Tautz 1992; Weber and Wong 1993).
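As a minimal illustration of a size-based distance, the sketch below computes the squared difference between the mean repeat counts of two samples at one locus, which is the single-locus quantity underlying the distance of Goldstein et al. (1995). The function name and the repeat counts are hypothetical and are not data from this study.

```python
from statistics import mean


def delta_mu_squared(repeats_a, repeats_b):
    """Squared difference between the mean repeat counts of two samples."""
    return (mean(repeats_a) - mean(repeats_b)) ** 2


# Hypothetical repeat counts at one locus in two populations.
pop_a = [10, 11, 11, 12, 13]
pop_b = [8, 8, 9, 9, 10]

print(delta_mu_squared(pop_a, pop_b))
```

In practice the single-locus values would be averaged over loci. Because the measure depends only on allele size, it is meaningful only where size differences reflect repeat-count differences.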
The strict stepwise mutation models, however, are not compatible with numerous observations, and a number of variations on the theme have been proposed. Experiments in yeast suggest that microsatellite variability depends on the repeat array length (Wierdl et al. 1997) and the presence of variant repeats (Petes et al. 1997), as do population analyses (Weber 1990; Goldstein and Clark 1995; Jin et al. 1996). Other observations suggest an asymmetry in the distribution of mutations, with a higher frequency of insertions than deletions (Rubinsztein et al. 1995; Ellegren et al. 1997a). Furthermore, it is clear that some form of size constraint prevents most microsatellite loci from reaching a very high number of repeats (Bowcock et al. 1994; Zhivotovsky et al. 1997; Schlötterer 1998). Some of these complexities are now being taken into account in measures of genetic distance (Nauta and Weissing 1996; Feldman et al. 1997), but parameterizing such models remains beyond reach (Goldstein and Pollock 1997; Estoup and Cornuet 1999).
Recently, even more complex patterns of microsatellite evolution have emerged, involving mutations in the flanking region, which violate the assumptions of even the most general models. This problem has been reported previously (Angers and Bernatchez 1997; Orti et al. 1997). Earlier studies have included relatively few loci, however, and the generality of this phenomenon is not known.
In this work, we aimed to assess the correlation between microsatellite structure and variability, as well as to investigate the relative importance of complex mutations in the microsatellites and their flanking regions. A major goal of this work is to determine whether moderate sequencing efforts will allow identification of relatively well-behaved microsatellites suitable for making population and phylogenetic inferences.
Table 1. Markers included in the study
Table 2. Size ranges
MATERIALS AND METHODS
Drosophila lines: Fifteen lines each of Drosophila melanogaster, D. simulans, and D. sechellia were typed for assessment of variability at 19 microsatellite loci. The D. melanogaster lines were collected in 1997 in Mount Carmel, Israel, and kindly provided by Prof. Eviatar Nevo. The D. simulans lines comprise nine North American, two Israeli, and two Kenyan lines, one line from St. Antioco, Sardinia, and one line of unknown origin. Lines sim132, sim133, sim134, and sim148 were obtained from Umeå Drosophila Stock Center. The lines of D. sechellia were obtained from the National Drosophila Species Resource Center, Bowling Green State University, OH, except for secS9 and secS32, provided by John Roote, Cambridge, UK, and JDsec, provided by Dr. Jean David.
Microsatellite loci: The loci and their primers are listed in Table 1. The 17 amplicons account for 19 microsatellites, all located on the third chromosome. They were originally characterized for D. melanogaster (Goldstein and Clark 1995, Schug et al. 1997, 1998). From a total of 24 primer pairs, originally selected to achieve dense coverage of the third chromosome with polymorphic markers, we chose 17 that readily amplify in all three species.
DNA analysis: DNA was extracted from single flies as described in Gloor et al. (1993). One microliter of the extract was used for PCR with the forward primer fluorescently labeled. Some reactions were conducted with two or three primer pairs. The PCR reactions were performed in prealiquoted plates (Advanced Biotechnologies, Epsom, UK) containing 0.2 mM each dNTP, 2.5 mM MgCl2, 7 pmol of each primer, and 0.31 units of Taq polymerase in 25-μl reaction volumes. Thirty-five cycles of PCR were performed using an annealing temperature of 55° for all markers. After PCR, the products were diluted and pooled such that 1-2 ng of each of between four and six markers could be loaded on a 4% acrylamide gel and analyzed on an ABI 377 automated sequencer. Scoring was executed with Genotyper 2.0 software (Perkin-Elmer, Norwalk, CT).
Figure 1. Sequences of microsatellite loci. Only parts of the sequence presenting size variation and microsatellites are shown. The entire sequences can be found in GenBank, accession nos. AF067866-AF067935. Microsatellite sequence is in boldface; imperfections in the repeat array are underlined. Nonadjacent parts of the sequences are separated by //. The number after the species refers to the allele size in base pairs. (A) Loci showing size variation in accordance with the stepwise mutation model; (B) loci showing size variation in the flanking region; (C) loci showing size variation incompatible with stepwise mutation models.
The same primers were used for sequencing, except that the forward primers were not fluorescent. For three markers (ABDB, TRH, PROS), new forward primers were designed to amplify a longer DNA fragment in order to facilitate sequencing. The sequencing forward primers are ggcatcctatacagcaac for TRH, ccacacagacgctgtata for ABDB, and tgatctaccgaaagttga for PROS. The PCR products were purified with the Qiaquick PCR purification kit (QIAGEN, Venlo, NL) and sequenced with dRhodamine dye-deoxy terminator sequencing kit (Perkin-Elmer). Sequencing was performed on an ABI 377 automated sequencer. Both strands of between one and four alleles of each species were sequenced from homozygous individuals. In species with more than one allele, the most common allele was chosen, along with one to three others depending on the extent of size variation. When present, alleles of the same size were sequenced in different species to assess the extent of size homoplasy. Some primers (TRH, PROS) showed extensive misannealing and repeatedly failed to yield readable sequences for some individuals. In these cases, the desired band was isolated from a Metaphor agarose gel and used as a template for a PCR of 25 cycles before sequencing.
RESULTS
Typing: Table 2 shows the allele size ranges for the 45 lines typed. The isofemale lines used showed different levels of polymorphism, due to their breeding histories. To avoid introducing a bias in the estimation of diversity, one allele was chosen at random in heterozygous individuals. For estimation of the size ranges, all the alleles observed were taken into account. EHAB failed to amplify some lines of D. simulans, possibly indicating the presence of a null allele (Pemberton et al. 1995). Most loci showed allele sizes congruent with strictly microsatellite insertions or deletions (indels). However, some of them (DSRC, CPDR) gave very ambiguous results, in which size variation did not correspond to the pattern expected from repeated indels. These ambiguities were resolved after sequencing. In both cases, size differences were unrelated to the microsatellite.
Types of polymorphism: Sequences of the 17 amplicons are shown in Figure 1. Only 7 amplicons show a direct relationship between size variation (both intra- and interspecific) and number of repeats within the microsatellite (PROS actually contains two microsatellites with repeat number variation; Figure 1A). Three loci show interspecific size variation outside the microsatellite (POINT, ABDB, DSRC); three others present intraspecific size variation exclusively outside the microsatellite, sometimes mimicking repeat loss or gain (HSP82, SIDNA, CPDR; Figure 1B). Two loci, U1951 and LAMB2A, have alleles characterized by large deletions, sometimes overlapping between the microsatellite and the flanking region. In the U1951 allele of size 157 in D. melanogaster, the whole microsatellite has disappeared, along with 5 bases of the flanking region, the other alleles in D. melanogaster showing evidence for stepwise mutations in the long AT repeat. In the LAMB2A allele of size 149, two repeats of the microsatellite are absent along with 15 bases of the flanking region. In both cases, the presence of the missing segments in the sister species clearly points toward a deletion as the cause for the observed polymorphism. The AT-rich flanking region of LAMB2A shows extreme intra- and interspecific variability. Interestingly, the interspecific variability is mainly due to base substitutions, whereas deletions account for most of the intraspecific variability, especially in D. simulans, where five distinct deletions are observed in four alleles sequenced, with lengths ranging from 2 to 58 bases. Finally, SIMA and CATHPO show very degraded repeat arrays, with deletions and insertions of repeats occurring in the perfect as well as in the degraded repeat array. In SIMA, variability in the degraded repeat is interspecific, with D. melanogaster having one repeat less than the other two species. In CATHPO, however, the variability is intra- as well as interspecific (Figure 1C).
Variability data: For all microsatellites compared, the average number of repeats is higher in D. melanogaster than in D. simulans or D. sechellia (Table 3). Variability is extremely low in D. sechellia compared to the two other species, probably due at least in part to the very small geographical range of the species and consequent smaller population size. D. melanogaster and D. simulans share similar widespread distributions and larger population sizes. We should be cautious, however, in comparing their variabilities, as the geographic ranges of the areas sampled are very different for the two species: the D. melanogaster sample comes from a single valley in Israel, whereas the D. simulans lines come from various places, mainly the United States but also Europe and Africa. However, our D. simulans samples do not show significant geographic variation.
In contrast with the observation of Harr et al. (1998), the extent of heterozygosity does not appear to be strongly locus specific across the three species. The correlation coefficient of the log variances in the five microsatellites polymorphic in both D. melanogaster and D. simulans is 0.12 (P = 0.84), compared with 0.60 in Harr et al. (1998). It should be noted, however, that Harr et al. averaged variances across multiple populations, whereas our sample of D. melanogaster was drawn from a single population. Averaging across populations will tend to make the variances at a locus in different species more similar.
The average heterozygosities calculated for the loci in which size variation is consistent with stepwise mutations in the microsatellite region (SGS, NANOS, PROS, TRH, EHAB, ABDB, HOX, RHOb) are 0.31 and 0.40 for D. melanogaster and D. simulans, respectively. Previous observations of heterozygosities in D. melanogaster have been made by Schug et al. (1997; 0.31) and Goldstein and Clark (1995; 0.80) for populations at local and continental ranges, respectively. The heterozygosities in Goldstein and Clark (1995) are substantially higher than others reported for D. melanogaster. This may reflect in part the wide geographic distribution of lines used in that study, but may also be due to a methodological artefact: the typing was carried out using Metaphor agarose gels, and variation among the gels may have inflated heterozygosity by misclassifying alleles by one or two repeat counts.
We calculated the correlations between variability and number of repeats for eight microsatellites, the seven “well-behaved” loci plus ABDB, for which the size variation outside the microsatellite was only interspecific and negligible compared with the variation within the microsatellite. In these cases, therefore, the repeat count could be confidently estimated from the allele length. The most significant correlation was found between the variance in repeat number and the maximum repeat count in all species (Figure 2), as was originally noted by Goldstein and Clark (1995). Correlations involving the heterozygosity were lower than when using the variance, especially in D. simulans (Table 4). This substantial difference is partially due to one locus, EHAB, for which diversity data were computed from only eight lines, due to its failure to amplify some lines. When that marker was removed from the analysis, the correlations between heterozygosity and repeat counts increased from 0.44 and 0.39 to 0.64 and 0.52, respectively, for mean and maximum repeat count. Correlation between the variance and the minimum repeat count is not significant (data not shown).
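The variability statistics discussed here can be sketched as follows. The text does not state which heterozygosity estimator was used, so this example assumes the common unbiased gene diversity estimator, n/(n-1) × (1 - Σp²). The locus names and repeat counts are hypothetical, and the helper functions are illustrative rather than the authors' own code.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def gene_diversity(alleles):
    """Unbiased expected heterozygosity: n/(n-1) * (1 - sum of p_i^2)."""
    n = len(alleles)
    freqs = [count / n for count in Counter(alleles).values()]
    return n / (n - 1) * (1 - sum(f * f for f in freqs))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical repeat counts, one allele per line, at three loci.
loci = {
    "locus_A": [5, 5, 5, 5, 6, 6],
    "locus_B": [8, 9, 9, 10, 11, 12],
    "locus_C": [12, 14, 15, 17, 18, 20],
}

variances = [variance(alleles) for alleles in loci.values()]
max_counts = [max(alleles) for alleles in loci.values()]
heterozygosities = [gene_diversity(alleles) for alleles in loci.values()]

print("r(variance, max repeat count) =", pearson_r(variances, max_counts))
print("r(heterozygosity, max repeat count) =", pearson_r(heterozygosities, max_counts))
```

Heterozygosity saturates as the number of alleles grows whereas the variance does not, which is one plausible reason for the two statistics to correlate differently with repeat count.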
At all loci, when imperfections are present, they are usually situated at one extremity of the repeat array, and therefore do not dramatically reduce the number of repeats, except in SIMA and CATHPO where they are widely distributed throughout the microsatellite. Surprisingly, the polymorphism of these two loci does not seem to be lower than in microsatellites with perfect repeat arrays (they show heterozygosities of 0.626 and 0.604 for D. simulans, compared with an average heterozygosity of 0.620 for the six perfect polymorphic microsatellites in D. simulans).
DISCUSSION
Two striking features of our data are the low heterozygosities of our samples and the small number of repeats of several of the microsatellites. The average numbers of repeats among 20 microsatellites are 7.5, 5.6, and 5.8, respectively, for D. melanogaster, D. simulans, and D. sechellia. The difference in average size between species is consistent with what would be expected from ascertainment bias (Ellegren et al. 1995, 1997b; Crawford et al. 1998), with D. melanogaster as the focal species. Indeed, assuming a selection threshold (the minimum acceptable size in the focal species) of 5 repeats, the expected difference of length is 2.5 repeats (Goldstein and Pollock 1997), compared to 1.9 and 1.7, in our data, for D. simulans and D. sechellia, respectively. The relatively small number of repeats observed in Drosophila microsatellites has been suggested by Schug et al. (1997) as an explanation for their observation of a low mutation rate. Out of 24 loci studied (with 6,570 allele generations per locus giving 157,680 allele generations in total), Schug et al. observed only one mutation, at locus U1951. This locus is present in our study, where it ranks first for heterozygosity and second for variance in D. melanogaster, but is monomorphic in the two other species, with an average number of repeats falling from 14.4 in D. melanogaster (the highest in our sample) to 6 in both D. simulans and D. sechellia.
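The arithmetic in this paragraph can be checked with a short sketch. All inputs are values quoted in the text; the variable names are hypothetical, and the final rate is only a naive point estimate (one observed mutation divided by the total number of allele generations), not a figure stated in this section.

```python
# Mean repeat counts quoted in the text (focal species: D. melanogaster).
mean_repeats = {"D. melanogaster": 7.5, "D. simulans": 5.6, "D. sechellia": 5.8}
focal = mean_repeats["D. melanogaster"]

# Observed shortfall of each non-focal species relative to the focal species.
observed_difference = {
    species: round(focal - value, 1)
    for species, value in mean_repeats.items()
    if species != "D. melanogaster"
}

# Total allele generations screened by Schug et al. (1997), as quoted.
loci_studied = 24
allele_generations_per_locus = 6570
total_allele_generations = loci_studied * allele_generations_per_locus

# Naive per-generation rate implied by a single observed mutation.
naive_rate = 1 / total_allele_generations

print(observed_difference)
print(total_allele_generations, naive_rate)
```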
Our data support a role for the mutation process in constraining microsatellite size (Schlötterer 1998). They show a few examples of differences in allele length consistent with very large deletions having occurred, in particular at locus LAMB2A, but also at U1951 and CATHPO. In LAMB2A, deletions of up to 58 bp appear to have occurred in D. simulans and D. sechellia, some involving part of the repeat region. U1951 shows one allele in which the whole microsatellite is absent, along with 5 bp flanking the microsatellite, possibly the result of a single large deletion. Interestingly, although this locus is the most polymorphic in D. melanogaster, it is monomorphic in D. simulans and D. sechellia, and identical in these two species. This could be due to the smaller number of repeats in D. simulans (6) compared to D. melanogaster (0-19), consistent with a general positive correlation between repeat count and polymorphism (Goldstein and Clark 1995). In the case of SIMA and CATHPO, the introduction of numerous imperfections has considerably reduced the number of consecutive repeats. Surprisingly, this does not seem to have affected the variability of these loci, indels being observed along the entire length of the degraded repeat arrays. However, sequencing of more alleles would be necessary to investigate this question in more detail. These two phenomena, accumulation of imperfections and the occurrence of large deletions, together with stepwise drift to small sizes, apparently contribute to the degradation of microsatellite loci over time. This is illustrated in our data by four loci (NANOS, U1951, PROS, TRH) that show smaller sizes in D. simulans compared to D. melanogaster, together with an obvious loss of variance (two loci becoming monomorphic).
Table 3. Microsatellite variability
Figure 2. Variance of repeat count vs. maximum repeat count in D. melanogaster (solid diamonds) and D. simulans (open diamonds). Note that the diamond at (10, 0) represents two points (one in each species).
The long-lived microsatellites do not show any obvious trends in terms of sequence structure, although they tend to be of larger average size than nonpreserved ones (P = 0.03). Long trinucleotide repeats seem to degrade rapidly by the accumulation of imperfections: the longest undegraded trinucleotide repeat count is 11, whereas in SIMA we counted a maximum number of 6 perfect repeats in a stretch of 12; in CATHPO we counted a maximum of 8 perfect repeats in a stretch of 24. The two loci showing long deletions are also characterized by a low GC content (<30%). This can tentatively be related to the occurrence of large deletions in neutral portions of the Drosophila genome (Glenn et al. 1996; Petrov and Hartl 1998).
Eight out of our 19 microsatellites were used to study the correlation between structure and variability (Table 4). ABDB was added to the seven “well-behaved” loci because its extramicrosatellite length variation was simple, species specific, and resolved by our limited sequencing of four alleles. The best correlation was observed between variance in repeat number and maximum repeat count, as has been reported by Goldstein and Clark (1995). Five dinucleotide, two trinucleotide, and one tetranucleotide microsatellites were included in the analysis. The effect of the dinucleotides is prominent in the significance of the correlations, and similar results were obtained from the analysis of these loci alone. While the number of tri- and tetranucleotides is very low and does not allow separate statistical analysis, they seem to follow the same general trend as the dinucleotides.
Table 4. Correlations (r) between size and variability parameters
To assess the meaning of these correlations, we ran computer simulations of stepwise mutation models assuming eight loci and a haploid population size of 100. We first considered a strict stepwise model, with no length dependence of the mutation rate. The constant mutation rate was set to 0.05, and the range set to 20, producing an expected variance only slightly lower than the 5.0 expected in the absence of range constraints (Feldman et al. 1997). Out of 1000 independent simulations, the 95% confidence intervals on the correlation coefficient excluded a significant outcome for the correlation between the variance and all three repeat count measures (assuming a one-tailed test for positive r). We conclude, therefore, that the significant correlation between variance and both the maximum and mean size indicates a departure from this simple mutation model. Simulation of various different mutation models demonstrates that a wide variety of departures from the strict stepwise model could lead to significant correlations between the mean and the maximum repeat count and the variance, while not producing a significant correlation between the minimum and the variance. To distinguish among possible nonstepwise models, it would be important to uncover the form of the dependence between the variance and the mean and maximum repeat counts. The current set of eight observations is insufficient for this task, however, and we therefore consider only a single obvious departure to provide an example of how a significant correlation might be produced. The parameters of the simulation were the same as above, except that the mutation rate changed linearly with the number of repeats with a slope of 0.006. In this case, the upper limits of the 95% confidence intervals for the correlation coefficient are 0.5, 0.7, and 0.75, for the minimum, mean, and maximum repeat sizes, with the latter two being significant at the 0.05 level.
Thus, a simple model of length dependency could be responsible for the significant correlations observed, as could a variety of more complex models (data not shown).
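A simulation of the kind described above might be sketched as follows. This is not the authors' program: the number of generations, the starting repeat count, the clipping of repeat counts to 1..20, the intercept of the linear mutation rate, and the use of the 95th percentile as the upper limit are all assumptions made for illustration. Population size (100), number of loci (8), base rate (0.05), range (20), and slope (0.006) are taken from the text.

```python
import random
from statistics import mean, pvariance


def simulate_locus(n=100, mu=0.05, slope=0.0, max_repeats=20,
                   generations=500, start=10, rng=random):
    """Haploid Wright-Fisher population with one-step mutations.

    An allele with k repeats mutates with probability
    mu + slope * (k - start), clipped to [0, 1] (assumed form of the
    length dependence); slope=0 gives the strict stepwise model.
    Repeat counts are confined to 1..max_repeats.
    """
    pop = [start] * n
    for _ in range(generations):
        # Drift: resample the next generation with replacement.
        pop = [rng.choice(pop) for _ in range(n)]
        for i, k in enumerate(pop):
            rate = min(1.0, max(0.0, mu + slope * (k - start)))
            if rng.random() < rate:
                step = rng.choice((-1, 1))
                pop[i] = min(max_repeats, max(1, k + step))
    return pop


def pearson_r(xs, ys):
    """Pearson correlation; returns 0.0 if either input has no variation."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def one_replicate(n_loci=8, **kwargs):
    """Correlate variance with min, mean, and max repeat count across loci."""
    pops = [simulate_locus(**kwargs) for _ in range(n_loci)]
    variances = [pvariance(p) for p in pops]
    return {
        "min": pearson_r(variances, [min(p) for p in pops]),
        "mean": pearson_r(variances, [mean(p) for p in pops]),
        "max": pearson_r(variances, [max(p) for p in pops]),
    }


def upper_95(replicates, key):
    """95th percentile of the simulated correlation coefficients."""
    values = sorted(r[key] for r in replicates)
    return values[int(0.95 * (len(values) - 1))]


if __name__ == "__main__":
    # Small run for speed; the text reports 1000 independent simulations.
    rng = random.Random(0)
    for slope in (0.0, 0.006):
        reps = [one_replicate(slope=slope, generations=200, rng=rng)
                for _ in range(10)]
        print(slope, {key: upper_95(reps, key) for key in ("min", "mean", "max")})
```

With so few replicates the percentiles are very noisy, so the printed values should not be expected to reproduce the limits reported in the text. The sketch only shows how an observed correlation could be compared against a simulated null distribution.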
Many authors have emphasized that flanking regions carry information about allele relationships (Zardoya et al. 1996), which is confirmed here. The CPDR flanking region is particularly informative, whereas the three repeat arrays at this locus do not show variation between the three species. The most likely explanation for this observation is a slow mutation rate, although selection against longer repeats cannot be rejected. Finally, variations in the flanking region can lead to size homoplasy (Grimaldi et al. 1997), as in CATHPO.
Our data confirm the frequent occurrence of polymorphisms in microsatellite loci apparently resulting from complex mutations. Other published results indicate that the same phenomenon can be observed in vertebrates, particularly humans (Grimaldi and Crouau-Roy 1997). Out of 17 loci studied here, only 7 (41%) show size variation compatible with the stepwise mutation model. Great care should therefore be taken when using microsatellite loci for population and phylogenetic inference. The identification of a subset of markers in this study, in which polymorphism is consistent with the occurrence of stepwise mutations, also indicates that a limited sequencing effort may allow selection of loci that show sufficiently consistent mutation properties across species that they may be useful in conjunction with generalized stepwise genetic distances (Zhivotovsky and Feldman 1995; Feldman et al. 1997).
Acknowledgments
We thank Prof. Eviatar Nevo (University of Haifa), Jong S. Yoon (National Drosophila Resource Center), Karin Ekström and Stephan Escher (Umeå), and John Roote (University of Cambridge) for providing Drosophila lines and Christian Schlötterer for comments on the manuscript. This work was funded by a Biotechnology and Biological Sciences Research Council grant to D. B. Goldstein.
Footnotes
- Communicating editor: M. W. Feldman
- Received June 19, 1998.
- Accepted March 5, 1999.
- Copyright © 1999 by the Genetics Society of America