The patterns of mutation and evolution at 13 microsatellite loci were studied in the filamentous fungal genus Neurospora. First, a detailed investigation was performed on five microsatellite loci by sequencing each microsatellite, together with its nonrepetitive flanking regions, from a set of 147 individuals from eight species of Neurospora. To elucidate the genealogical relationships among microsatellite alleles, repeat number was mapped onto trees constructed from flanking-sequence data. This approach allowed the potentially convergent microsatellite mutations to be placed in the evolutionary context of the less rapidly evolving flanking regions, revealing the complexities of the mutational processes that have generated the allelic diversity conventionally assessed in population genetic studies. In addition to changes in repeat number, frequent substitution mutations within the microsatellites were detected, as were substitutions and insertion/deletions within the flanking regions. By comparing microsatellite and flanking-sequence divergence, clear evidence of interspecific allele length homoplasy and microsatellite mutational saturation was observed, suggesting that these loci are not appropriate for inferring phylogenetic relationships among species. In contrast, little evidence of intraspecific mutational saturation was observed, confirming the utility of these loci for population-level analyses. Frequency distributions of alleles within species were generally consistent with the stepwise mutational model. On the basis of within-species variation at the microsatellites and their flanking sequences, estimated microsatellite mutation rates were ∼2500 times greater than mutation rates of flanking DNA and were consistent with estimates from yeast and fruit flies.
A positive relationship between repeat number and variance in repeat number was significant across three genealogical depths, suggesting that longer microsatellite alleles are more mutable than shorter alleles. To test if the observed patterns of microsatellite variation and mutation could be generalized, an additional eight microsatellite loci were characterized and sequenced from a subset of the same Neurospora individuals.
MICROSATELLITES are composed of tandemly repeated, simple DNA sequence motifs of as many as six nucleotides in length. These loci are commonly found throughout both prokaryotic and eukaryotic genomes and typically are highly polymorphic within species and populations. In addition, these codominant genetic markers are relatively easy to score and have high reproducibility and specificity. As such, microsatellites have become one of the most popular classes of molecular markers and are commonly employed to investigate the population genetics of a diverse range of organisms (Bruford and Wayne 1993; Goldstein and Schlötterer 1999). Although fewer studies have addressed the evolutionary dynamics and mutational processes of microsatellite loci, an understanding of these topics is beginning to develop (see Estoup and Cornuet 1999; Schlötterer 2000).
This study of microsatellite evolution in the model microbial eukaryote Neurospora had three main objectives:
(1) To describe the patterns of mutation and evolution at multiple microsatellite loci: Are these patterns similar across loci and across species? Are these patterns consistent with accepted theories of microsatellite mutation and evolution?
(2) To determine if microsatellites are appropriate genetic markers for inferring the phylogenetic relationships among species of Neurospora: At what level of phylogenetic divergence does microsatellite mutational saturation occur? Are the amounts of allelic homoplasy large enough to render interspecific comparisons meaningless?
(3) To determine if microsatellites would be useful for answering population genetic questions: What are the relative rates of microsatellite mutation, and how variable are these genetic markers within a single species? What factors can be used to predict microsatellite variability?
Slipped-strand mispairing during DNA replication (Levinson and Gutman 1987) is considered the predominant mutational mechanism for microsatellites (Schlötterer and Tautz 1992; Strand et al. 1993). During replication, the nascent strand may dissociate from the template strand and, due to the repetitive nature of the microsatellite DNA, the strands may reanneal incorrectly, or “out-of-register.” The pairing of misaligned repeats causes one to several repeats to loop out from either the nascent or the template strand. This slipped-strand mispairing results in the nascent strand having a different number of repeats from the template strand once DNA replication is complete. This mutation process allows the same microsatellite allele to arise multiple times, thus generating size homoplasy. A number of mutational models that account for this homoplasy have been proposed, with the stepwise mutational model (SMM; Kimura and Ohta 1978) and the two-phase model (TPM; Di Rienzo et al. 1994) being the most common. The SMM assumes all mutational events involve a change in a single repeat only, while the TPM allows a proportion of mutations to involve changes greater than single repeats. On the other hand, the infinite-allele model (IAM; Kimura and Crow 1964) does not allow for homoplasy and assumes that every mutation results in the creation of a new allele. Determining which mutational model is most appropriate is important because microsatellite-specific genetic distances rely on the underlying assumptions of the chosen model (Takezaki and Nei 1996).
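The contrast between the SMM and the TPM can be illustrated with a minimal simulation sketch. The function name `slippage_event`, the uniform draw of multistep sizes, the `max_step` cap, and the floor of one repeat are illustrative assumptions and are not taken from Kimura and Ohta (1978) or Di Rienzo et al. (1994).

```python
import random


def slippage_event(repeats, rng, multistep_prob=0.0, max_step=5):
    """Apply one simulated slippage mutation to an allele with `repeats` repeats.

    multistep_prob = 0.0 gives SMM-like behavior (every change is one repeat).
    multistep_prob > 0.0 gives TPM-like behavior: that proportion of events
    changes the allele by 2..max_step repeats (uniform draw, an assumption).
    """
    if rng.random() < multistep_prob:
        step = rng.randint(2, max_step)   # multistep event
    else:
        step = 1                          # single-step event
    direction = rng.choice((-1, 1))       # contraction or expansion
    # Assumed floor: an allele cannot drop below one repeat.
    return max(1, repeats + direction * step)


# Size homoplasy: an expansion followed by a contraction restores the
# original repeat number, so identical lengths need not be identical by descent.
rng = random.Random(42)
allele = 10
history = [allele]
for _ in range(20):
    allele = slippage_event(allele, rng)
    history.append(allele)
print(history)
```

Under these assumptions, any repeat number that appears more than once in `history` is an example of the same allele size arising repeatedly along a single lineage.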
Several outbreeding species of Neurospora have been recognized both phylogenetically (Dettman et al. 2003a) and in terms of reproductive isolation (Turner et al. 2001; Dettman et al. 2003b); however, many of the basal relationships among species remain unresolved. A number of these species are included in ongoing studies of comparative genomics (Gilbert et al. 2003) and the genetics of reproductive isolation (Turner and Taylor 2003), underscoring the need for a reliable phylogeny of the genus. Molecular data suggest that the group underwent a rapid species radiation (Natvig and May 1996; Skupski et al. 1997; Dettman et al. 2003a). Protein-coding genes, and even noncoding genomic regions, appear to have evolved too slowly to record the full speciation history (i.e., few mutations arose and became fixed between successive speciation events). Because microsatellites mutate at such high rates, they may possess enough variation to clarify interspecific relationships. However, the consequence of hypermutability combined with homoplasy is the complete loss of phylogenetic signal over long periods of evolution. Thus, microsatellites that are conserved across diverse taxonomic groups (e.g., Schlötterer et al. 1991; FitzSimmons et al. 1995; Zardoya et al. 1996) may not be appropriate for inferring the phylogenetic relationships among the groups. Before constructing a phylogeny of Neurospora species using microsatellites, we assessed whether or not we could place confidence in the resulting tree.
A common approach to the study of the evolutionary dynamics and mutational processes of microsatellites has been to infer past mutational events from contemporary allele frequency distributions within populations (e.g., Shriver et al. 1993; Valdes et al. 1993; Di Rienzo et al. 1994; Estoup et al. 1995a,b; Garza et al. 1995; Nielsen and Palsbøll 1999). This method typically requires electromorph typing of a large number of loci from hundreds of individuals. A different approach is to directly sequence the microsatellite and flanking regions to reveal mutations that simple electromorph typing cannot detect (e.g., Estoup et al. 1995b; Jin et al. 1996; Angers and Bernatchez 1997; Grimaldi and Crouau-Roy 1997; Ortí et al. 1997; Primmer and Ellegren 1998; Colson and Goldstein 1999; Fisher et al. 2000; Harr et al. 2000; Makova et al. 2000; Zhu et al. 2000; Blankenship et al. 2002). This approach is more laborious than the former so sequencing usually has been limited to a smaller number of individuals and/or loci.
We studied the evolutionary dynamics of 13 microsatellite loci in the filamentous fungal genus Neurospora by taking an approach that was a combination of those mentioned above. First, a detailed investigation was performed on five microsatellite loci by sequencing the repeat arrays, along with ∼500 nucleotides of their flanking regions, from a set of 147 individuals from eight phylogenetic species of Neurospora. To elucidate the genealogical relationships among microsatellite alleles, repeat number was mapped onto trees constructed from flanking-sequence data. This approach allowed the microsatellite mutations to be placed in the evolutionary context of the less rapidly evolving, nonrepetitive, flanking DNA. In addition to assessing variation and testing microsatellite mutation models, three types of novel analyses were performed. By comparing microsatellite and flanking-sequence genetic distances, the phylogenetic depth at which microsatellite mutational saturation occurred could be estimated. Variation within species at the microsatellites and the linked flanking regions was used to estimate the relative rates of mutation for the two DNA types. The hierarchical sampling scheme of this study provided a unique opportunity to investigate the evolution of microsatellites across multiple genealogical depths. To further investigate the evolution of microsatellite repeat arrays, an additional eight microsatellite loci were characterized and sequenced from a subset of 29 individuals containing representatives from each of the same eight Neurospora species. A total of 921 microsatellite alleles were directly sequenced, which represents the largest effort put toward microsatellite sequencing to date.
MATERIALS AND METHODS
Selection of individuals:
The 147 individuals used in this study were described by Dettman et al. (2003a) and are listed in the supplemental data. The entire sample included individuals from eight phylogenetic species of Neurospora: N. crassa (n = 33, 3, and 9 from NcA, NcB, and NcC subgroups, respectively); N. intermedia (n = 52 and 10 from NiA and NiB subgroups, respectively, and 6 from basal lineages); N. sitophila (n = 7); N. tetrasperma (n = 4); phylogenetic species 1 (n = 3); phylogenetic species 2 (n = 7); phylogenetic species 3 (n = 5); and N. discreta (n = 8). Some species are referred to simply as “phylogenetic species” because they have not yet been formally named (see Dettman et al. 2003a,b).
Sequences of four nonrepetitive DNA regions, within which five microsatellite loci were located, were reported in Dettman et al. (2003a) (GenBank accessions AY225899–AY225949; TREEBASE accessions S950 and M1574). Here we report the sequences of the repeat arrays for these five microsatellite loci (DMG-CA, TMI-AGC, TML-ATC, QMA-CTTTT, and QMA-TGG) from 147 individuals (see supplemental data at http://www.genetics.org/supplemental/).
New primer sets were designed for an additional eight microsatellite loci from five genomic regions that were unlinked to the previous four genomic regions. Various microsatellite repeat motifs (tandemly repeated 15 times) were used to search the Broad Institute N. crassa Database (http://www.broad.mit.edu/annotation/fungi/neurospora/). Iterative rounds of primer design were performed until a 400- to 600-bp fragment (microsatellite plus flanking regions) could be amplified from the majority of species. These eight new microsatellites were sequenced from a subset of 29 individuals from eight species (see supplemental data).
On the basis of open reading frame predictions provided by the genome database and the form and distribution of variation among the sequences from different individuals, three of the microsatellites (TMI-AGC, T6L-AGC, and T7L-ACA) may be located within coding DNA. However, the existence of these predicted genes and hypothetical proteins has not been verified.
PCR amplification conditions and sequencing protocols were the same as described in Dettman et al. (2003a) except for locus-specific annealing temperatures (Table 1). For all loci, microsatellite repeat number was determined directly from the sequence data and included both perfect and imperfect repeats. When substitutions or noninteger length mutations made counting repeats difficult, repeat numbers were calculated by dividing array length by motif length and then rounding to nearest integer.
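The length-based fallback for counting repeats could be sketched as follows. The function name and the round-half-up treatment of exact ties are assumptions; the text specifies only rounding to the nearest integer.

```python
def repeat_number(array_length_bp, motif_length_bp):
    """Estimate repeat number as array length / motif length, rounded to nearest.

    Integer arithmetic is used so that exact halves round up (an assumed
    tie rule) rather than following Python's round-half-to-even behavior.
    """
    if motif_length_bp <= 0 or array_length_bp < 0:
        raise ValueError("lengths must be positive")
    # floor(a/m + 1/2) written without floating point
    return (2 * array_length_bp + motif_length_bp) // (2 * motif_length_bp)


# A hypothetical 32-bp array of a trinucleotide motif (10.67 repeats).
print(repeat_number(32, 3))
```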
The program Microsatellite Analyser (MSA, version 3.0; Dieringer and Schlötterer 2003) was used to calculate descriptive statistics and microsatellite genetic distances among groups. Neighbor-joining (Saitou and Nei 1987) trees were constructed from genetic distances by importing the distance matrices into PAUP (version 4b10; Swofford 2001). Branch support was assessed by performing 100 bootstrap replications using MSA. The topologies of the microsatellite trees were compared to that of the flanking-sequence tree from Dettman et al. (2003a) using Shimodaira-Hasegawa tests (Shimodaira and Hasegawa 1999; resampling estimated log likelihoods from 1000 bootstraps). For these tests, the reference data set contained combined sequence data from the DMG, TMI, TML, and QMA genomic regions for a single individual from each of the eight species, and maximum-likelihood parameter values were estimated using ModelTest (version 3.06; Posada and Crandall 1998).
The program JMPin (version 3.2.6; SAS Institute 1999) was used for analysis of covariance (ANCOVA), correlation, and regression analyses of both nontransformed and log-transformed data. Unless otherwise noted, the results and conclusions of analyses did not differ between the nontransformed and log-transformed data. All flanking-sequence genetic distances reported in this study are Kimura two-parameter distances, as calculated by PAUP. The program SITES (Hey and Wakeley 1997) was used to calculate numbers of segregating sites and to estimate the population mutation parameter θ from sequence data. For haploid organisms such as Neurospora, θ equals two times the effective population size times the mutation rate (θ = 2Nμ), assuming neutrality and mutation-drift equilibrium.
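For readers unfamiliar with the Kimura two-parameter distance, the following is a minimal sketch of the standard formula, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of transitional and transversional differences. It is not PAUP's implementation; the function name and the choice to skip any site that is not A, C, G, or T in both sequences are assumptions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # assumed handling: skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1    # purine<->pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)


# Hypothetical 10-bp flanking sequences differing by one transition (A<->G).
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))
```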
Mutational saturation of a microsatellite was assessed as follows: For each possible pairwise comparison of individuals, we calculated the number of substitution differences between flanking sequences and the commonly employed microsatellite genetic distance (δμ)² (Goldstein et al. 1995a) between microsatellite alleles. When comparing individuals, (δμ)² is functionally equivalent to the measure D1, which is the squared difference in repeat number (Goldstein et al. 1995b). For each microsatellite, we then plotted the mean (δμ)² distances against the number of substitutions in flanking sequence. Because the expectation of (δμ)² is linear with respect to time (Goldstein et al. 1995a), mutational saturation was estimated to occur at the point where mean (δμ)² no longer increased with the increasing number of flanking-sequence substitutions. To account for differences in flanking-sequence length, these comparisons were repeated using genetic distances between the flanking sequences (which were binned), and the same results were obtained. Using difference in repeat number between microsatellite alleles rather than (δμ)² also did not affect the results. These calculations assumed that the microsatellites and their physically adjacent flanking regions were completely linked. Correlations between microsatellite distances and flanking-sequence genetic distances for each microsatellite, and correlations among flanking-sequence genetic distances among genomic regions, were calculated from all possible pairwise comparisons of individuals.
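The pairwise tabulation behind the saturation plots can be sketched as below. The function name and input layout are assumptions, the squared difference in repeat number stands in for the distance between two individuals as described above, and the sketch stops at the table of means: the criterion used to read the plateau off the plot is not specified in enough detail to code here.

```python
from collections import defaultdict
from itertools import combinations


def mean_sq_diff_by_substitutions(individuals):
    """Mean squared repeat-number difference per flanking substitution count.

    individuals: list of (aligned_flanking_sequence, repeat_number) tuples,
    one per individual, for a single microsatellite locus.
    Returns {number_of_substitutions: mean squared difference in repeat number}.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (seq1, rep1), (seq2, rep2) in combinations(individuals, 2):
        # Number of substitution differences between the flanking sequences
        subs = sum(1 for a, b in zip(seq1, seq2) if a != b)
        totals[subs] += (rep1 - rep2) ** 2
        counts[subs] += 1
    return {subs: totals[subs] / counts[subs] for subs in sorted(totals)}


# Three hypothetical individuals with 4-bp flanking sequences.
sample = [("AAAA", 10), ("AAAT", 12), ("AATT", 11)]
print(mean_sq_diff_by_substitutions(sample))
```

Plotting the returned means against the substitution counts would give a curve of the kind described in the text, on which a plateau indicates saturation.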
Evolution under the SMM commonly results in a unimodal, symmetrical distribution of repeat alleles around the mean, whereas evolution under the TPM is more likely to result in irregular, multimodal distributions. However, multimodal distributions alone cannot justify the rejection of the SMM because evolution under SMM can produce these distributions by chance (Valdes et al. 1993). Therefore, the fit of microsatellite mutation models to N. crassa and N. intermedia allele-frequency distributions was tested using a likelihood approach implemented by MISAT (Nielsen 1997). The fitting of models and θ estimation were restricted to intraspecific subgroups (NcA and NiA) to avoid pooling genetically differentiated groups. Markov chains were run for 100,000 and 1,000,000 generations for the SMM and TPM, respectively. For the TPM, the proportion of multistep mutations was allowed to vary between 0.00001 and 0.5. Microsatellite θ value estimation and likelihood-ratio tests were performed as described by Nielsen (1997). These calculations, along with θ comparisons between microsatellites and flanking regions, rest on the assumption that the populations have reached mutation-drift equilibrium.
The evolution of 13 microsatellite loci from Neurospora was investigated in this study. These microsatellites are embedded in nine different genomic regions, each from a different chromosome arm or linkage group (Table 1). To distinguish the microsatellite from its genomic region, the name of the microsatellite proper includes a dash followed by a repeat motif. For example, TMI-AGC is a microsatellite with an AGC repeat motif and is embedded within the ∼500-bp stretch of PCR-amplified sequence called the TMI genomic region. Similarly, QMA-CTTTT and QMA-TGG are two linked microsatellites that are both embedded within the QMA genomic region.
Five microsatellite loci, located on four different chromosome arms, were sequenced from 147 individuals from eight Neurospora species (Table 2). The repeat numbers of the microsatellite alleles are listed in the supplemental data, along with the full sequences of alleles. The patterns of variation were quite different among these microsatellite loci. Each locus possessed between 11 and 21 different repeat alleles, and there was an eightfold difference among loci in variance in repeat number (Table 2). The dinucleotide microsatellite DMG-CA was the most variable locus with respect to the number of repeat alleles and variance in repeat number. The frequency distributions of repeat alleles also differed drastically among loci (see supplemental data at http://www.genetics.org/supplemental/). For example, DMG-CA had a wide range of intermediate-frequency repeat alleles, while QMA-TGG had one main high-frequency repeat allele.
A significant amount of variation was found in the nonrepetitive regions flanking the microsatellites, even within a single species (Table 2, Figure 1). Previous authors (Brohede and Ellegren 1999) have reported that the variability of nonrepetitive DNA increases with proximity to microsatellites. However, polymorphism plots across the length of the Neurospora sequence alignments provided no evidence for such a trend (data not shown). The number of unique flanking-sequence variants increased with the length of included flanking sequence per region (Table 2). Interestingly, the relative levels of variation per microsatellite did not correspond with the relative levels of variation in the linked, nonrepetitive flanking regions. The QMA flanking region was the most variable, but QMA-TGG had the lowest variance in repeat number, whereas the DMG flanking region was the least variable, but DMG-CA had the highest variance in repeat number (Table 2). Thus, highly variable microsatellite loci were not found only in highly variable genomic regions.
Eight additional microsatellite loci, located on five different chromosome arms, were sequenced from the 29 individuals that were representatives of the eight Neurospora species (see supplemental data). Each locus possessed between 6 and 11 different repeat alleles, and there was a 35-fold difference among loci in variance in repeat number (Table 3). Even for two microsatellites located within the same genomic region, there could be a 13-fold difference in variability (e.g., T7L-ACA and T7L-CGA).
The following sections present analyses that were performed primarily on data from the five intensively sampled microsatellite loci. Most of the same analyses were performed on data from the additional eight loci, and results typically were similar. Analyses of the five-locus data set had more power because the sample size per locus was significantly larger. For the sake of brevity, analyses of the eight-locus data set are discussed only if they provide novel information.
Mutational saturation and homoplasy:
For each of the four genomic regions investigated in detail, the genealogies presented in Dettman et al. (2003a) were pruned to retain only a single representative of each flanking-sequence variant (Figure 1). The microsatellite repeat alleles that were associated with each flanking-sequence variant then were mapped onto the trees. As predicted by the fact that microsatellites are hypermutable, many flanking-sequence variants were associated with multiple repeat alleles. However, many repeat alleles also were shared by highly divergent flanking-sequence variants. Although identical in state, it was very unlikely that all copies of the same repeat allele were identical by descent, especially when shared among well-differentiated genealogical lineages (Figure 1).
Microsatellite mutation following the SMM tends to create derived alleles that have a similar number of repeats as the ancestral allele. Difference in repeat number therefore holds some information on the recency of common ancestry, providing the basis for SMM-based genetic distances (Goldstein et al. 1995a,b). Because the number of substitution differences among nonrepetitive DNA also is proportional to recency of common ancestry, a correlation between microsatellite distance and flanking-sequence genetic distance is expected, at least until the divergence becomes so great that mutational saturation begins to degrade the phylogenetic signal. Despite the physical linkage of the microsatellites and their respective flanking regions, there was a poor correlation between flanking-sequence genetic distance and (δμ)², a microsatellite-specific distance (mean R² for five microsatellite-by-flanking region comparisons = 0.02). On the other hand, flanking-sequence genetic distances were well correlated among the four genomic regions (mean R² for six flanking region-by-flanking region comparisons = 0.64), suggesting that the different nonrepetitive regions have been accumulating mutations in a more or less linear fashion without significant substitutional saturation. The poor correlation between variation in the flanking regions and linked microsatellites was presumably due to mutational saturation and homoplasy of the microsatellites. If this hypothesis is true, the correlation between variation levels of the two DNA types would be expected to increase as comparisons are made over shallower genealogical depths.
At what level of divergence does microsatellite homoplasy begin to erase the history of the alleles? By plotting (δμ)² against the number of substitutions between the flanking sequences or binned genetic distances to account for differences in flanking-sequence length, mutational saturation was estimated to begin when flanking-sequence genetic distances reached 0.006, 0.012, 0.048, 0.017, and 0.017 for DMG-CA, TMI-AGC, TML-ATC, QMA-CTTTT, and QMA-TGG, respectively (Figure 2, Table 4). Results were similar when flanking-sequence genetic distance or difference in microsatellite repeat number were used. As expected, the most variable of the five microsatellite loci, DMG-CA, was mutationally saturated at the smallest flanking-sequence genetic distance. On average, the ability of these microsatellite loci to represent evolutionary history begins to degrade at 2.0% divergence of flanking regions.
Comparison among species:
Four of the five microsatellite loci reached mutational saturation at a flanking-sequence genetic distance that was less than the mean genetic distance among the seven ingroup species, i.e., excluding N. discreta (Table 4). Averaged across loci, the flanking-sequence divergence among ingroup species was 3.3%, far beyond the 2.0% divergence level at which the microsatellites were shown to saturate. In addition, the range of microsatellite repeat number commonly overlapped among species. The percentage of repeat alleles shared by at least two species was 66.7, 75.0, 63.6, 25.0, and 30.8% for DMG-CA, TMI-AGC, TML-ATC, QMA-CTTTT, and QMA-TGG, respectively. This saturation and excessive homoplasy suggested that interspecific phylogenies based on microsatellite data may be misleading.
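The percentage of repeat alleles shared by at least two species can be computed as in the sketch below. The function name, the input layout, and the example allele sets are hypothetical and are not data from this study.

```python
def percent_shared_alleles(alleles_by_species):
    """Percentage of distinct repeat alleles found in two or more species.

    alleles_by_species: dict mapping species name -> set of repeat numbers
    observed in that species at one locus.
    """
    all_alleles = set()
    for alleles in alleles_by_species.values():
        all_alleles |= set(alleles)
    if not all_alleles:
        raise ValueError("no alleles supplied")
    shared = [
        allele for allele in all_alleles
        if sum(allele in alleles for alleles in alleles_by_species.values()) >= 2
    ]
    return 100.0 * len(shared) / len(all_alleles)


# Hypothetical locus: alleles 6 and 7 occur in two species each.
example = {"species A": {5, 6, 7}, "species B": {6, 7, 8}, "species C": {9}}
print(percent_shared_alleles(example))
```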
Microsatellite genetic distances among the eight Neurospora species were calculated. The tree constructed from (δμ)², an SMM-based genetic distance, was topologically incongruent with the species phylogeny inferred from the flanking-sequence data (Figure 3; Shimodaira-Hasegawa test, P = 0.009). Even when using an IAM-based genetic distance (proportion of shared alleles (POSA); Bowcock et al. 1994), which ignores difference in repeat number, the tree was topologically incongruent with the flanking-sequence tree (Figure 3, P = 0.019). Neither of the two sister-group relationships supported by the flanking-sequence data (PS3-N. crassa and PS2-N. intermedia) appeared in the microsatellite-based trees. When data from the eight additional microsatellite loci were combined with these five loci, the resulting trees (not shown) were even less resolved and still topologically incongruent with the sequence-based tree.
Variance in repeat number:
Mean repeat number did not differ significantly among species, but species did differ significantly in terms of maximum repeat number and variance in repeat number (Mann-Whitney U-tests, P < 0.05). These values tended to be greatest in N. crassa and N. intermedia (Table 5); however, these two species also had the largest sample sizes. To determine which variables were the best predictors of variance in repeat number, we modeled locus, species, sample size, maximum, and mean repeat number using ANCOVA. When all of these variables were accounted for, the only ones that were significant predictors of variance in repeat number were maximum and mean repeat number (both P < 0.0001).
Some genealogical lineages were fixed for one repeat allele or possessed only a few, whereas other lineages possessed a wide range of repeat alleles (Figure 1). Although shallower clades tended to be less variable, similar-depth clades could differ greatly in variability, and clades could be fixed for a single repeat allele, regardless of clade depth. When a clade was fixed for a single allele, that allele tended to have a low number of repeats, and when a clade was highly variable, the alleles tended to have higher numbers of repeats (Figure 1), consistent with ANCOVA results.
To investigate the robustness of the positive relationship between variance and repeat number, we plotted the variance in repeat number against maximum and mean repeat number for three genealogical depths: (1) flanking-sequence variants that were sampled four or more times (n = 32 variants): these comparisons were restricted to the shallowest genealogical depth possible to decrease the likelihood of homoplasy; (2) intraspecific values for the five intensively sampled loci (n = 37 species-by-locus combinations); and (3) the entire sample of 29 individuals from eight species (n = 13 loci): by using the variance of the entire sample, possible species-specific effects would be subsumed. At all three depths, there was a significant positive correlation between variance in repeat number and both maximum and mean repeat number (all P < 0.008). Variance in repeat number was best correlated with maximum repeat number (Figure 4). To estimate the lower threshold for slippage, variance in repeat number was linearly fit to maximum and mean repeat number for the 37 species-by-locus combinations, and the x-intercepts were determined (where variance in repeat number = 0). Using these calculations, only microsatellites with a maximum of ≥7.70 repeats or a mean of ≥5.08 repeats are expected to be variable within a species. The high prevalence of alleles with five to seven repeats in clades that were fixed for a single allele (Figure 1) was consistent with this threshold.
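The slippage-threshold calculation amounts to fitting a line and solving for the repeat number at which the fitted variance is zero. A minimal ordinary least-squares sketch follows; the function name and the example points are hypothetical, and the actual fits in this study were performed in the statistics package named in the methods.

```python
def x_intercept_of_fit(x, y):
    """Fit y = slope * x + intercept by least squares; return x where y = 0."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Variance in repeat number is predicted to be zero at this repeat number.
    return -intercept / slope


# Hypothetical (maximum repeat number, variance in repeat number) pairs
# lying on the line variance = 2 * repeats - 10.
max_repeats = [6, 8, 10, 12]
variances = [2, 6, 10, 14]
print(x_intercept_of_fit(max_repeats, variances))
```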
Substitution mutations within microsatellite repeat arrays:
The nine genomic regions were successfully amplified from all eight species of Neurospora, with the exception of T7L from N. sitophila. The basic repeat motifs of the 13 microsatellite loci generally were well conserved across species with the exception of QMA-CTTTT and QMA-TGG from N. discreta, the outgroup. However, sequence data revealed that substitution mutations within the repeat arrays were common (see supplemental data). For the five loci investigated in detail, the mean number of repeat alleles, or alleles with different numbers of tandem repeats, was 15.4/locus. When repeat number and sequence variation were considered together, the mean number of unique microsatellite alleles increased to 34.8/locus (Table 2). For the eight additional loci, the mean number of repeat alleles was 8.9/locus, but the mean number of unique microsatellite alleles was 13.3/locus (Table 3). Thus, between 33.1% (eight loci) and 55.7% (five loci) of the total allelic variation at microsatellites would have been missed if sequence information had not been obtained.
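The two percentages follow directly from the per-locus means reported above: the share of allelic variation missed by length typing is one minus the ratio of repeat alleles to unique sequence alleles. The short check below uses only those four reported means; the function name is an assumption.

```python
def percent_missed(mean_repeat_alleles, mean_unique_alleles):
    """Percentage of unique alleles not distinguishable by repeat number alone."""
    return 100.0 * (1.0 - mean_repeat_alleles / mean_unique_alleles)


# Five intensively sampled loci: 15.4 repeat alleles vs. 34.8 unique alleles.
print(round(percent_missed(15.4, 34.8), 1))
# Eight additional loci: 8.9 repeat alleles vs. 13.3 unique alleles.
print(round(percent_missed(8.9, 13.3), 1))
```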
Several substitution mutations within the repeat arrays were unique and diagnostic for single species. In addition, a number of substitutions were shared among species, many of which were consistent with phylogenetic inferences drawn from the flanking regions (Dettman et al. 2003a; Figure 3). For example, the PS2-N. intermedia sister-group relationship was supported by substitutions in TML-ATC and T7L-ACA microsatellites, while the PS3-N. crassa sister-group relationship was supported by substitutions in T1R-TAG and T7L-CGA microsatellites.
The duplication of repeats with substitutions (imperfect repeats) within the same microsatellite allele was observed in DMG-CA, TML-ATC, QMA-CTTTT, QMA-TGG, T1R-TAG, T1R-AGC, 3L-AGC, T7L-ACA, and T7L-CGA (examples in Figure 5). Such examples of substitutions that had been multiplied tandemly or along with other repeats without substitutions (perfect repeats) provided evidence that the microsatellites had mutated by slipped-strand mispairing. This explanation was more parsimonious than multiple identical substitutions occurring independently in multiple repeats. Within the same allele, duplicated imperfections that were not adjacent to each other were observed at TML-ATC, QMA-CTTTT, QMA-TGG, T1R-TAG, T1R-AGC, T7L-ACA, and T7L-CGA (examples in Figure 5). Such nonadjacent imperfections were evidence for multistep mutational events, because duplication of imperfect repeats via single-step mutations would create only imperfections that were immediately adjacent.
Intraspecific analyses of N. crassa and N. intermedia:
The following intraspecific analyses were restricted to the two species with sufficient sample sizes, N. crassa and N. intermedia, and to the largest subgroup within each species, NcA (n = 33) and NiA (n = 52), respectively. For each of the five microsatellites investigated in detail, the mean flanking-sequence genetic distance within species was less than the flanking-sequence genetic distance at which microsatellite mutational saturation occurred (Table 4, Figure 2). Because mutational saturation was not a significant problem at intraspecific levels of phylogenetic divergence, these microsatellites are appropriate for use in population genetic studies.
To assess the applicability of the SMM within species, frequency distributions of repeat alleles were fit to SMM predictions. The frequency distributions of repeat alleles for NcA and NiA differed among the five loci (Figure 6). At DMG-CA and TMI-AGC, both NcA and NiA were highly variable and had overlapping repeat allele ranges in contrast to the nonoverlapping distributions at TML-ATC. At both QMA-TGG and QMA-CTTTT, NiA was fixed for a single repeat allele, whereas NcA was much more variable. To determine whether these distributions were more consistent with evolution under the SMM or TPM, we applied the likelihood method of Nielsen (1997) to the five loci in each species. The SMM fit the data better than the TPM for 8 of the 10 locus-by-species combinations (Table 6). For the two comparisons in which the TPM was a better fit, the proportion of multistep mutations was substantial (0.11) only for the DMG-CA locus in NcA. However, the gap between the smaller and larger alleles in the DMG-CA allele distributions in NcA (Figure 6) may not have been caused by multistep mutations. Rather, this gap may reflect the presence of two genealogical lineages of DMG-CA within NcA (see Figure 1), with different allele distributions for each genealogical lineage.
The population mutation parameter θ (θ = 2Nμ) can be estimated from data from each microsatellite locus. Because our loci were sampled from the same set of individuals, effective population size (N) was constant across loci within species, allowing us to directly compare relative mutation rates (μ) of loci. The θ-per-microsatellite locus in NcA and NiA was estimated using a likelihood approach (Nielsen 1997; Tables 6 and 7). QMA-CTTTT and QMA-TGG were omitted from NiA calculations because they were not variable.
Microsatellite θ values, and therefore mutation rates, differed by an order of magnitude across loci within the same species (Table 7). In both NcA and NiA, θ for DMG-CA > TMI-AGC > TML-ATC. On average, microsatellite θ's for NcA were 77% greater than those for NiA, but this difference could be due to other factors, such as differences in effective population sizes of the two species.
The flanking regions were sampled from the same set of individuals as the microsatellites were, allowing for the comparison of the relative mutation rates of the two different, yet physically linked, DNA types. The θ per base pair of flanking DNA (Table 7) was estimated from the numbers of segregating sites (S; Watterson 1975). The ratio of microsatellite to flanking-sequence θ was quite variable among loci and between NcA and NiA. Interestingly, microsatellite θ's were on average ∼2500 times greater than flanking-sequence θ's for both NcA and NiA (2514 and 2498 times, respectively). Therefore, slippage rates of the microsatellites were ∼2500 times greater than the substitution rates of the nonrepetitive flanking DNA.
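The comparison of flanking-sequence and microsatellite θ described above can be sketched in a few lines. This is a minimal illustration of the standard Watterson (1975) estimator, θ = S / a_n with a_n = 1 + 1/2 + ... + 1/(n − 1), scaled to a per-base-pair value; it is not the code used in this study. The function names and the input values in the comment are hypothetical, and the microsatellite θ is assumed to be supplied from a separate likelihood analysis.

```python
def watterson_theta_per_site(num_segregating_sites, num_sequences, num_sites):
    """Watterson's estimate of theta per base pair of flanking DNA.

    num_segregating_sites: S, the number of polymorphic sites in the alignment
    num_sequences: n, the number of sampled sequences
    num_sites: length of the flanking alignment in base pairs
    """
    if num_sequences < 2:
        raise ValueError("at least two sequences are required")
    if num_sites <= 0:
        raise ValueError("alignment length must be positive")
    # a_n is the (n - 1)th harmonic number
    a_n = sum(1.0 / i for i in range(1, num_sequences))
    return num_segregating_sites / a_n / num_sites


def theta_ratio(microsatellite_theta, flanking_theta_per_site):
    """Ratio of microsatellite theta to per-site flanking theta.

    With N shared by both marker types (same individuals), this ratio
    approximates the ratio of their mutation rates.
    """
    if flanking_theta_per_site <= 0:
        raise ValueError("flanking theta must be positive")
    return microsatellite_theta / flanking_theta_per_site


# Hypothetical input: 11 segregating sites among 4 sequences of 600 bp
# watterson_theta_per_site(11, 4, 600)
```

Because effective population size cancels in the ratio, only the relative mutation rates of the two linked DNA types remain, which is the property the comparison relies on.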
An advantage to a sequence-based study of microsatellite evolution is that the historical relationships among microsatellite alleles can be determined independently of their repeat number by the phylogenetic analysis of molecular variation observed in the flanking regions. The microsatellites and their associated flanking regions share the same evolutionary history and lines of descent due to their physical linkage. Much is known about the evolution of regular nonrepetitive DNA, allowing confident reconstructions of the genealogical relationships among the different variants of flanking sequence. Without flanking-sequence information, the phylogenetic affinities of microsatellite alleles can be deduced only from number of repeats, which may be a poor indicator of true relationships among long-diverged alleles. Placing the microsatellite mutations in the phylogenetic context of the more slowly evolving flanking regions reveals the complexities of the evolutionary dynamics of microsatellites. Intensive sequencing of the flanking regions provided novel approaches to the estimation of microsatellite mutational saturation and relative rates of mutation.
The sampling scheme of this study provided another novel approach: investigating microsatellite evolution at multiple genealogical depths and levels of divergence. By sampling multiple copies of the same flanking-sequence variant, the recent history of microsatellite evolution could be investigated. By sampling a moderate number of individuals from two species, N. crassa and N. intermedia, some species-level questions could be addressed. Including representatives from six other Neurospora species permitted comparisons of microsatellite evolution at a much deeper scale. The sampling scheme also allowed the conclusions drawn from the five intensively sampled microsatellite loci to be challenged by analysis of eight modestly sampled microsatellite loci.
Nearly every study that has compared multiple microsatellite loci has reported differences in the levels of variability among loci (e.g., Di Rienzo et al. 1994; Estoup et al. 1995a; Garza et al. 1995; Innan et al. 1997; Harr et al. 1998b; Schug et al. 1998). This study of microsatellites in Neurospora was no exception. The question is, Why are some loci more variable than others? The usefulness of microsatellite loci, regardless of their application, is dependent upon their variability. As such, it was this characteristic of microsatellites that we were most interested in predicting.
It is possible that microsatellites in genomic regions with higher mutation rates may show higher variability; however, the microsatellites and flanking regions examined here clearly evolve with different mutational dynamics. Microsatellite variability was not correlated with flanking region variability, demonstrating that microsatellite mutation rates are affected by different factors than are substitution rates in flanking regions. This study reports the first detailed investigation of the evolution of multiple, physically linked microsatellites. The fact that tightly linked microsatellites embedded within the same genomic region could differ drastically in variability (e.g., T7L-ACA and T7L-CGA) provided further evidence that genomic placement does not determine the variability of a particular microsatellite.
Microsatellite variability also differed among species, suggesting a possible species effect due to differences in effective population size or genome-wide mutation rates. However, no consistent species effects were detected, as exemplified by the fact that N. intermedia could be highly variable at one microsatellite, yet nearly monomorphic at another.
Of all the factors examined, the best predictors of microsatellite variability were mean and maximum repeat number. The association of increased variance with increased repeat number was quite robust and significant across all three genealogical depths (flanking-sequence variants, intraspecific, and entire sample; Figure 4). These findings are consistent with other studies that have found a significant positive relationship between variance and repeat number at single genealogical depths (Goldstein and Clark 1995; Innan et al. 1997; Hutter et al. 1998; Neff and Gross 2001). Similar to results reported by Goldstein and Clark (1995), variance in repeat number was best correlated with maximum repeat number. Several studies examining individual mutational events have also found that alleles with larger numbers of repeats are more likely to mutate than alleles with fewer numbers of repeats (e.g., Weber and Wong 1993; Wierdl et al. 1997; Primmer et al. 1998; Schug et al. 1998; Ellegren 2000a; Vigouroux et al. 2002; Shinde et al. 2003).
This association is believed to exist for two reasons. During DNA replication, longer stretches of repeated units pose more of a problem to polymerase than do shorter stretches, making longer alleles more prone to slipped-strand mispairing. In addition, larger numbers of repeats provide more opportunities for misalignment during the reannealing of the nascent strand (see Eisen 1999). Extending the above arguments, there is expected to be a rough threshold of minimum repeat number below which a microsatellite is not likely to mutate or be variable. On the basis of the data presented here, microsatellites are expected to be variable within a Neurospora species if they have a maximum ≥7.70 and/or a mean ≥5.08 repeats. Other studies also have concluded that such a threshold exists (e.g., Messier et al. 1996; Rose and Falush 1998; Lai and Sun 2003; Shinde et al. 2003). For example, by examining slippage during PCR, Shinde et al. (2003) estimated that the threshold was between four and eight repeats.
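The threshold rule stated above can be expressed as a simple check on the repeat numbers observed at a locus within a species. The sketch below is illustrative only: the two cutoffs are the values reported in the text, whereas the function names and the example allele lists are hypothetical, and the use of the sample variance is an assumption.

```python
from statistics import mean, variance

# Cutoffs reported in the text for Neurospora species
MAX_REPEAT_THRESHOLD = 7.70
MEAN_REPEAT_THRESHOLD = 5.08


def summarize_repeats(repeat_numbers):
    """Return the mean, maximum, and sample variance of repeat number."""
    if not repeat_numbers:
        raise ValueError("at least one allele is required")
    var = variance(repeat_numbers) if len(repeat_numbers) > 1 else 0.0
    return {
        "mean": mean(repeat_numbers),
        "max": max(repeat_numbers),
        "variance": var,
    }


def expected_to_be_variable(repeat_numbers):
    """Apply the 'maximum >= 7.70 and/or mean >= 5.08' rule to one locus."""
    summary = summarize_repeats(repeat_numbers)
    return (summary["max"] >= MAX_REPEAT_THRESHOLD
            or summary["mean"] >= MEAN_REPEAT_THRESHOLD)


# Hypothetical loci: a short array and a longer array
short_locus = [4, 4, 5]
long_locus = [6, 7, 8]
```

Such a rule is a rough screen rather than a guarantee, because the thresholds were derived from the loci and species sampled here.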
Phylogenetic utility of microsatellites:
The ultimate consequence of extended periods of evolution via slipped-strand mispairing is deterioration of phylogenetic signal due to excessive homoplasy. Although the resolving power of microsatellites decreases with increasing divergence among groups, some studies have used these markers to estimate phylogenetic relationships among species (Harr et al. 1998a; Petren et al. 1999; Fisher et al. 2000; Alvarez et al. 2001; Irion et al. 2003). These studies have had mixed results, in that microsatellite-based phylogenies do not necessarily agree with phylogenies based on other types of data.
Comparing the microsatellite-based trees with the previously reported interspecific phylogeny based on flanking sequences (Dettman et al. 2003a) allowed us to evaluate the phylogenetic utility of the microsatellites. The microsatellite loci we examined typically reached mutational saturation at flanking-sequence genetic distances that were less than genetic distances among species, and overlapping allele distributions among species were common. The high levels of interspecific homoplasy cast serious doubt on the efficacy of these microsatellites to reconstruct the phylogenetic relationships among Neurospora species, and trees constructed from the microsatellite data were not congruent with trees constructed from flanking-sequence data. These species have been evolving independently long enough for mutational saturation to have degraded the phylogenetic signal of microsatellites.
Given that microsatellite variability is dependent upon length, relatively short microsatellites may provide better resolution for the construction of interspecific phylogenies. However, it is unlikely that researchers could specifically choose microsatellites of the ideal length because loci are typically isolated from a single focal species, which may not be representative of the other species in the genus.
Mapping repeat alleles onto the genealogies of the flanking regions demonstrated that different flanking-sequence lineages within the same species can independently mutate to the same repeat allele, confirming microsatellite allelic homoplasy. Considering that microsatellites mutate at least three orders of magnitude faster than nonrepetitive DNA, two of the same repeat alleles that are embedded within different flanking-sequence variants are likely to be homoplastic. In fact, even identical repeat alleles associated with the same flanking-sequence variant may not be identical by descent because back mutations erase the evidence of previous mutations. When the sequences of the repeat arrays were considered, homoplasy was reduced because imperfect repeats can be used to distinguish among alleles with the same number of repeats. The substitution acts as a landmark for alignment and reveals if the same or different stretch of perfect repeats underwent slippage. The abundance of substitution mutations within the repeat arrays also demonstrated that microsatellite descriptors such as “perfect” and “imperfect” must be considered relative terms in reference to a particular group. Even though intraspecific variation is known to be high, species-wide conclusions are commonly drawn from the sequence information from a single individual. This study underscores the importance of sampling multiple individuals from the same species.
Microsatellites have been shown to be particularly useful for the investigation of population genetics (Bruford and Wayne 1993; Goldstein and Schlötterer 1999) and continue to increase in popularity among most biological subdisciplines, including mycology (e.g., Bucheli et al. 2001; Fisher et al. 2001; Zhou et al. 2001; Bergemann and Miller 2002; Dunham et al. 2003). With this report, the first to characterize microsatellite loci in Neurospora, microsatellites can now be typed from large population samples (Turner et al. 2001; Jacobson et al. 2004) and used to investigate the population biology and ecology of this model filamentous fungus. We wanted to verify that these loci mutate in a manner consistent with the presently accepted models before initiating a large-scale population genetic study. Within species, the data presented here are consistent with microsatellite evolution under the SMM with most mutational events involving changes of a single repeat unit. For the two species examined, the estimated proportions of multistep mutations were large enough to reject a strict SMM for only one of the five loci investigated in detail. In contrast to interspecific analyses, no evidence for significant microsatellite mutational saturation within species was found, suggesting that these microsatellites will be useful for population genetic analyses. Although some intraspecific allelic size homoplasy was detected, it is not predicted to interfere with genetic distance or gene flow calculations. The efficacy of a microsatellite genetic distance to elucidate the true phylogenetic relationships among populations is dependent upon realistic assumptions of the mutation model. Because evidence suggests that intraspecific allelic diversity has been generated mainly under the SMM, SMM-based genetic distances such as (δμ)² are appropriate for use in future population genetic studies of Neurospora.
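As an illustration of an SMM-based distance, the (δμ)² statistic is commonly computed as the squared difference in mean repeat number between two populations, averaged over loci. The sketch below follows that common definition and should be read as an assumption about how the statistic would be applied, not as an analysis performed in this study; the locus names and repeat numbers are hypothetical.

```python
def delta_mu_squared(population_a, population_b):
    """Average over shared loci of the squared difference in mean repeat number.

    population_a, population_b: dicts mapping a locus name to the list of
    repeat numbers observed at that locus in the population.
    """
    shared_loci = sorted(set(population_a) & set(population_b))
    if not shared_loci:
        raise ValueError("the two populations share no loci")
    total = 0.0
    for locus in shared_loci:
        alleles_a = population_a[locus]
        alleles_b = population_b[locus]
        if not alleles_a or not alleles_b:
            raise ValueError(f"no alleles recorded for locus {locus}")
        mean_a = sum(alleles_a) / len(alleles_a)
        mean_b = sum(alleles_b) / len(alleles_b)
        total += (mean_a - mean_b) ** 2
    return total / len(shared_loci)


# Hypothetical repeat numbers at two hypothetical loci
pop_a = {"locus1": [10, 12], "locus2": [5, 5]}
pop_b = {"locus1": [8, 8], "locus2": [6, 8]}
distance = delta_mu_squared(pop_a, pop_b)
```

The distance depends only on mean repeat number per population, which is why its interpretation rests on the stepwise assumption holding within species.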
The key feature of microsatellites as molecular markers is their hypermutability and, hence, their hypervariability in species and populations. Microsatellite mutation rates are estimated at 10⁻²–10⁻⁶/generation (Ellegren 2000b), which are several orders of magnitude greater than those of regular nonrepetitive DNA (∼10⁻⁹, Li 1997). We found that microsatellite mutation rates were an average of ∼2500 times greater than the mutation rates of the flanking DNA. The best estimates of mutation rates in filamentous Ascomycetes were made by Kasuga et al. (2002). They estimated that the mutation rate of noncoding DNA (introns) is between 1.12 × 10⁻⁹ and 1.00 × 10⁻⁸ substitutions/site/year, consistent with general estimates from other kingdoms. By simply multiplying the values of Kasuga et al. (2002) by 2500, the average mutation rate of these microsatellites is estimated to be between 2.80 × 10⁻⁶ and 2.50 × 10⁻⁵. Although these are rough estimates, they are very similar to the microsatellite mutation rates estimated for Saccharomyces cerevisiae (1.2 × 10⁻⁵, Kruglyak et al. 1998) and Drosophila melanogaster (2.8 × 10⁻⁶, Kruglyak et al. 1998; 6.3 × 10⁻⁶, Schlötterer et al. 1998; 9.3 × 10⁻⁶, Schug et al. 1998).
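The scaling used to obtain the microsatellite rate range is simple arithmetic and can be reproduced as follows. The constants are the values quoted in the text; the function and variable names are hypothetical, and the units of the flanking-sequence rates are carried through unchanged, which is an assumption of this rough calculation.

```python
# Values quoted in the text
RATE_RATIO = 2500            # microsatellite rate / flanking-sequence rate
FLANK_RATE_LOW = 1.12e-9     # lower intron rate from Kasuga et al. (2002)
FLANK_RATE_HIGH = 1.00e-8    # upper intron rate from Kasuga et al. (2002)


def scaled_microsatellite_rate(flanking_rate, ratio=RATE_RATIO):
    """Scale a flanking-sequence substitution rate by the observed rate ratio."""
    return flanking_rate * ratio


low_estimate = scaled_microsatellite_rate(FLANK_RATE_LOW)
high_estimate = scaled_microsatellite_rate(FLANK_RATE_HIGH)
print(f"{low_estimate:.2e} to {high_estimate:.2e}")
```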
Sequence data and assessment of microsatellite polymorphism:
Microsatellite polymorphism in populations typically is assessed by electromorph length alone, but substitution mutations within the repeat array cannot be detected from electromorph length. Without sequencing these microsatellites in Neurospora, 33–56% of the total allelic variation would have been missed. A significant amount of variation in the nonrepetitive regions flanking the microsatellites was observed, even within a single species of Neurospora. Variation in flanking regions within species was commonly found in other systems (e.g., Jin et al. 1996; Grimaldi and Crouau-Roy 1997; Macaubas et al. 1997; Ortí et al. 1997; Colson and Goldstein 1999; Blankenship et al. 2002), but not all (Estoup et al. 1995b; Angers and Bernatchez 1997; Viard et al. 1998; Culver et al. 2001). These substitutions in the flanking regions can reduce the efficiency of standard techniques for assessing microsatellite variation. For example, mutations in primer sites can cause amplification failure and null alleles. Even more troublesome are insertions and deletions because they can cause differences in electromorph length that are not attributable to differences in repeat number at the microsatellite. For intraspecific comparisons within NcA and NiA, the DMG, TMI, TML, and QMA flanking sequences contained an average of 1.75 multibase insertion/deletions that could obscure the detection of differences in repeat number between alleles (assuming the targeted electromorph length is up to 300 bp). If comparisons were made across species, even more multibase insertion/deletions would cause misinterpretation of electromorph length. Therefore, predicting repeat number from electromorph length alone can be problematic due to undetected homoplasy. While there are clear benefits to determining repeat number directly from sequence data, doing so requires an unreasonably large amount of time and effort, even for population genetic studies of moderate size. 
A practical approach would be to sequence a small number of representative individuals from the groups under study to roughly assess levels of variation in the flanking regions and likelihood of allelic size homoplasy caused by insertion/deletions. This approach would allow for the exclusion of microsatellite loci that violate model assumptions or have complex mutational histories. Overall, it takes only a modest amount of sequencing to elucidate the general patterns of mutation and evolution at microsatellite loci. Before starting a large-scale population study, we recommend performing preliminary analyses to determine if the microsatellite loci are appropriate for addressing the particular questions of interest.
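The way in which flanking insertion/deletions can confound electromorph length, as discussed above, can be made concrete with a toy calculation. The sketch assumes that electromorph length is the sum of the flanking length, any flanking insertion/deletion, and the repeat array; all names and numbers are hypothetical and chosen only for illustration.

```python
def electromorph_length(flank_length, motif_length, repeat_number, flank_indel=0):
    """Amplicon length in bp: flanks, plus any flanking indel, plus the repeat array."""
    return flank_length + flank_indel + motif_length * repeat_number


def inferred_repeat_number(observed_length, reference_flank_length, motif_length):
    """Repeat number inferred from length alone, assuming indel-free flanks."""
    return (observed_length - reference_flank_length) / motif_length


# Hypothetical trinucleotide locus with 200 bp of flanking sequence
allele_a = electromorph_length(200, 3, 10)                 # 10 repeats, no indel
allele_b = electromorph_length(200, 3, 9, flank_indel=3)   # 9 repeats, 3-bp insertion

# Both alleles produce the same electromorph, so length alone assigns
# both the same repeat number even though the arrays differ by one repeat.
same_size = allele_a == allele_b
called_repeats = inferred_repeat_number(allele_b, 200, 3)
```

Sequencing a few representative individuals per group would reveal such cases before large samples are scored by length alone.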
We thank Hanna Johannesson, Takao Kasuga, Elizabeth Turner, and anonymous reviewers for critical review of the manuscript. We also thank David Jacobson for advice and suggestions during the course of this project. Funding for this research was provided by a grant to J.W.T. from the National Science Foundation. This article is based in part on the Ph.D. dissertation of J.R.D. at the University of California, Berkeley.
Communicating editor: D. Begun
- Received March 26, 2004.
- Accepted July 14, 2004.
- Genetics Society of America