Abstract
We investigate the interplay between genetic diversity and recombination in maize (Zea mays ssp. mays). Genetic diversity was measured in three types of markers: single-nucleotide polymorphisms, indels, and microsatellites. All three were examined in a sample of previously published DNA sequences from 21 loci on maize chromosome 1. Small indels (1-5 bp) were numerous and far more common than large indels. Furthermore, large indels (>100 bp) were infrequent in the population sample, suggesting they are slightly deleterious. The 21 loci also contained 47 microsatellites, of which 33 were polymorphic. Diversity in SNPs, indels, and microsatellites was compared to two measures of recombination: C (=4Nc) estimated from DNA sequence data and R based on a quantitative recombination nodule map of maize synaptonemal complex 1. SNP diversity was correlated with C (r = 0.65; P = 0.007) but not with R (r = -0.10; P = 0.69). Given the lack of correlation between R and SNP diversity, the correlation between SNP diversity and C may be driven by demography. In contrast to SNP diversity, microsatellite diversity was correlated with R (r = 0.45; P = 0.004) but not C (r = -0.025; P = 0.55). The correlation could arise if recombination is mutagenic for microsatellites, or it may be consistent with background selection that is apparent only in this class of rapidly evolving markers.
The interplay between recombination and selection shapes the degree and distribution of genetic variation in a genome. Two theoretical models have been developed to explain the interaction between these two processes. Under the background selection model, deleterious alleles are continuously eliminated from a population, a process that decreases linked neutral genetic variation (Charlesworth et al. 1993; Charlesworth 1994; Hudson and Kaplan 1995). In contrast, the hitchhiking model posits that selectively advantageous alleles sweep through a population, thereby reducing genetic variation at sites linked to the advantageous allele (Maynard-Smith and Haigh 1974; Kaplan et al. 1989).
The common thread for both models is the strong influence of recombination. Both models predict that selection (either positive or negative) reduces polymorphism at linked neutral sites, and both predict that loss of polymorphism is greatest in regions of low recombination. The predicted positive correlation between genetic diversity and recombination has been demonstrated empirically in Drosophila (Drosophila melanogaster; Kaplan et al. 1989; Begun and Aquadro 1992), humans (Nachman et al. 1998; Przeworski et al. 2000), mouse (Mus domesticus; Nachman 1997), tomato (Lycopersicon esculentum; Stephan and Langley 1998), sea beet (Beta vulgaris; Kraft et al. 1998), and goatgrass (Aegilops; Dvorak et al. 1998).
One difference between the two models is that background selection is an equilibrium process, with continuous removal of deleterious alleles from populations (Wiehe 1998). As a result, genetic variation at sites linked to deleterious sites is expected to remain low. Because of the equilibrium dynamics of this process, a positive correlation should be observed between recombination, c, and levels of genetic variation, irrespective of the mutation rate μ. In contrast, the recovery of linked genetic variation under the hitchhiking model depends on both c and μ. If the rate of selective sweeps is low and μ is high (as in microsatellite markers, for example), theory predicts that neutral genetic variation can be restored between rounds of selection, thus masking the effect of a selective sweep (Wiehe 1998). As a result, the positive correlation between diversity and recombination under the hitchhiking model may be obscured when μ is high (Wiehe 1998; Payseur and Nachman 2000). Thus, one way to contrast the background and hitchhiking models is to study molecular markers that have different mutation rates.
Demography can also play a large role in the maintenance and distribution of genetic diversity. Population subdivision and population bottlenecks, as well as other demographic factors, can obscure the relationship between recombination and diversity. For example, Baudry et al. (2001) compared recombination to diversity in regions of differing recombination rate in five Lycopersicon (tomato) species, two of which inbreed at high to intermediate levels. All five species demonstrated a positive correlation between recombination and diversity, but the type of mating system (and presumably the demographic factors associated with the mating system) had a stronger influence on genetic variation than recombination. Thus, for any one system, the distribution of genetic variation within the genome is a complex function of mutation, recombination, selection, and demography.
In maize (Zea mays ssp. mays L.), genetic diversity has been studied for 21 loci distributed along the genetic map of chromosome 1. Tenaillon et al. (2001) found a positive correlation between nucleotide diversity, as measured by single-nucleotide polymorphisms (SNPs), and estimates of the population-recombination parameter C = 4Nc, where N is the effective population size. It was presumed that the positive correlation reflects interplay between recombination and selection, but it is important to note that C, which is inversely related to linkage disequilibrium (LD), is also affected by demographic factors. For example, population subdivision increases LD and thus decreases C; similarly, LD is low (and C high) in expanding populations (reviewed in Pritchard and Przeworski 2001). Demography also may not affect all loci equally; for example, historical levels of gene flow can differ among loci (Wang et al. 1997). Given the effect of demography on C and the possibility that demographic effects differ among loci, demography could have contributed to the positive correlation observed in maize. Thus, the cause of the positive correlation (i.e., background selection, hitchhiking selection, or demography) is unclear. In the previous study, C was estimated from sequence data, and there were no independent estimates of recombination based on a measure that is related to physical distance along the chromosome. The lack of a physical measure of recombination limits our understanding of the forces that shape maize diversity.
Here we investigate further the relationship between recombination and genetic diversity in maize by studying markers that evolve with different μ than SNPs and also by using a physical measure of recombination along maize chromosome 1. To estimate measures of diversity, we reanalyzed data from the 21 genetic loci examined in the previous study (Tenaillon et al. 2001). The DNA sequence of each of these loci was determined for ∼25 individuals representing much of the geographic range of cultivated maize. All of the 21 loci contained SNPs, and most of the loci contained both microsatellite and insertion-deletion (indel) variation. Only SNP variation was analyzed previously, but microsatellite and indel variation is extensive, representing 24% of the total aligned length of the 21 loci. Thus, the data of Tenaillon et al. (2001) offer a unique opportunity to compare diversity among marker types. Moreover, for the microsatellites there was no ascertainment bias for highly polymorphic microsatellites.
In addition to examining measures of diversity, we report a physical measure of recombination based on a quantitative cytogenetic map of the distribution of recombination nodules (RNs) along synaptonemal complex 1 (SC1). During prophase I of meiosis, an SC forms between each homologous pair of chromosomes, and RNs are found associated with SCs at pachytene (Zickler and Kleckner 1999). RNs mark the physical locations of crossovers along SCs (Herickhoff et al. 1993; Sherman and Stack 1995). SC1 can be identified on the basis of its relative length and arm ratio (Anderson and Stack 2001), and the frequency and location of RNs along the length of SC1 have been determined. The frequency distribution of RNs along SC1 provides an estimate of recombination along the physical length of the chromosome. Here we report the first RN map for maize chromosome 1, and we use this map to predict the density of crossovers per physical length for each of the 21 loci from the previous study (Tenaillon et al. 2001).
Altogether, this study has three objectives. First, we estimate different measures of genetic variation based on SNP, microsatellite, and indel variation. Second, we report physical estimates of recombination (R), based on a quantitative cytogenetic map, for each of the 21 loci. Third, we investigate the influence of recombination on genetic diversity by comparing estimates of both R and C to estimates of diversity. By taking this approach, we intend to provide a better understanding of the interplay between recombination and diversity in maize and begin to provide insight into the relative importance of hitchhiking selection, background selection, and demographic effects.
MATERIALS AND METHODS
DNA polymorphism
Sequence data and analyses: DNA sequence data for 21 loci were obtained from Tenaillon et al. (2001; GenBank nos. AF377345-AF377864). The 21 loci represent seven known genes, six cDNA clones, and eight anonymous restriction fragment length polymorphism (RFLP) clones. All 21 loci were located on the UMC98 genetic map (Davis et al. 1999), and hence their relative locations have been identified on maize chromosome 1. The length of the loci varied from 248 to 2740 bp, with an average length of 648 bp. All 21 loci were originally targeted for sequencing from a common set of 25 individuals of cultivated maize (Z. mays L. ssp. mays). However, some loci were difficult to amplify in some individuals, and thus the data set was not complete. Nonetheless, ≥22 individuals were sequenced from each locus, and all 25 individuals were sequenced for 11 of 21 loci. A full description of the plant material and the sequencing protocols was published in Tenaillon et al. (2001). For clarity, we hereafter refer to the 21 loci as genetic loci, to differentiate them from microsatellite loci.
Tenaillon et al. (2001) reported SNP diversity in genetic loci by the sequence statistic θ (Watterson 1975). Each estimate of θ
Microsatellite analyses: To locate microsatellites in DNA sequence data, we performed searches with RepeatMasker (http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker) and Ephemeris version 1.0 (http://www.uga.edu/srel/DNA_Lab/ephemeris_readme.htm), as well as manual searches.
Given the lack of consensus regarding the definition of microsatellites in the literature, we based our definition on the expected frequency of occurrence of a microsatellite. Assuming that all nucleotides are present at equal frequencies, the probability of occurrence of a microsatellite is p_m = 0.25^{x(m-1)}, with x the length of the motif (i.e., x = 2 for a dinucleotide repeat) and m the number of repeats. We studied microsatellites for which the expected frequency is less than five microsatellites in 10 kb. This frequency corresponds to microsatellites of length m = 7 for mononucleotide repeats; m = 4 for dinucleotide repeats; m = 3 for tri-, tetra-, and pentanucleotide repeats; and m = 2 for hexanucleotide repeats.
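The repeat-number thresholds listed above can be checked directly from the probability p_m = 0.25^{x(m-1)}. The following is a minimal sketch, not the authors' procedure: it assumes that the expected count in 10 kb is simply 10,000 × p_m (ignoring edge effects) and that a repeat needs at least two copies. The function names are hypothetical.

```python
def expected_per_10kb(motif_len: int, repeats: int) -> float:
    """Expected number of microsatellites in 10 kb, assuming equal base
    frequencies: 10000 * 0.25 ** (x * (m - 1))."""
    return 10_000 * 0.25 ** (motif_len * (repeats - 1))


def min_repeats(motif_len: int, max_per_10kb: float = 5.0) -> int:
    """Smallest repeat number m (starting at 2) whose expected frequency
    falls below the cutoff of five microsatellites per 10 kb."""
    m = 2
    while expected_per_10kb(motif_len, m) >= max_per_10kb:
        m += 1
    return m


# Motif lengths 1 (mono-) through 6 (hexanucleotide)
thresholds = {x: min_repeats(x) for x in range(1, 7)}
print(thresholds)
```

Under these assumptions the computed thresholds agree with the values stated in the text (m = 7, 4, 3, 3, 3, and 2 for motif lengths 1 through 6).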
For each microsatellite locus, we calculated the number of alleles in our sample (A), the sample variance in allele size (V), and expected heterozygosity (Hmicrosat), on the basis of Nei’s unbiased estimate (Nei 1973),
Indel analyses: Nonmicrosatellite indels were also identified and characterized. SITES (Hey and Wakeley 1997) was used to determine the number, length, and position of indels in the data set. BLAST searches and RepeatMasker, in conjunction with a maize transposable element database (provided by S. Wessler, University of Georgia), were used to determine if large indels corresponded to known mobile elements.
To determine levels of diversity, all identified indels were scored as present (1) or absent (0). This binary data matrix was then transformed into frequencies, and diversity values were calculated using Nei’s measure of heterozygosity (Hindel), as previously described. To test the neutral mutation hypothesis on large indels, we calculated Tajima’s D separately on all indels <100 bp and indels >100 bp, as suggested by Charlesworth and Langley (1989). We used DnaSp ver. 3 (Rozas and Rozas 1999) to calculate Tajima’s D.
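As an illustration of the heterozygosity calculation applied to both microsatellites and indels, the sketch below uses a sample-size-corrected form, H = n/(n-1) × (1 - Σ p_i²). This form is an assumption: the equation itself is not reproduced in this text, so the sketch should not be read as the exact estimator used in the study. The function name and the example indel are hypothetical.

```python
from collections import Counter


def unbiased_heterozygosity(alleles):
    """Sample-size-corrected heterozygosity, n/(n-1) * (1 - sum(p_i^2)),
    computed from a list of allele states (one entry per sequence)."""
    n = len(alleles)
    if n < 2:
        raise ValueError("need at least two sampled sequences")
    counts = Counter(alleles)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


# Hypothetical indel scored present (1) in 3 of 23 sequences, absent (0) in 20
h_indel = unbiased_heterozygosity([1] * 3 + [0] * 20)
print(round(h_indel, 3))
```

The same function accepts microsatellite allele sizes (e.g., repeat counts) in place of the binary presence/absence scores.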
Recombination rate based on RN map
Maize SC karyotype and distribution of RNs: Maize cultivar Kansas Yellow Saline (KYS) was used for the two-dimensional spreads of SCs. Plants were grown to maturity and anthers containing microsporocytes at pachytene were collected. Spreads of SCs were produced using a modification of the procedure described by Peterson et al. (1999) and examined with an AE 801 electron microscope. The positions of kinetochores and RNs were determined for each SC in a set, and the SCs were measured using the computer program Micro-Measure version 3.2 (Reeves 2001). Based on relative lengths and arm ratios, the 10 maize SCs were assigned to the 10 maize pachytene bivalents (2n = 20). Although the absolute lengths of SCs vary in different sets, the relative lengths and arm ratios remain constant (e.g., Sherman and Stack 1995). To compare RN positions on SC1 from different sets of SCs, the position of each RN was measured as a fractional length of either the long or short arm from the kinetochore. Then, using an average length of 45.4 μm and average arm ratio of 1.26 for SC1, the position of each RN on an average SC1 was calculated by multiplying the fractional distance of the RN from the kinetochore by the appropriate arm length (see Sherman and Stack 1995 for a similar procedure). In total, the positions of 277 RNs on 110 SC1s were mapped.
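The placement of an RN on the average SC1 can be sketched as follows. The sketch assumes that the arm ratio is defined as long arm/short arm and that positions are reported from the tip of the short arm; both conventions are assumptions chosen for illustration, as are the function and constant names.

```python
SC1_LENGTH_UM = 45.4   # average SC1 length given in the text (micrometers)
ARM_RATIO = 1.26       # assumed to mean long arm / short arm

SHORT_ARM_UM = SC1_LENGTH_UM / (1.0 + ARM_RATIO)
LONG_ARM_UM = SC1_LENGTH_UM - SHORT_ARM_UM


def rn_position_um(arm: str, fraction_from_kinetochore: float) -> float:
    """Position of an RN on the average SC1, measured from the short-arm tip.

    `fraction_from_kinetochore` is the RN's distance from the kinetochore
    expressed as a fraction (0 to 1) of the arm on which it lies.
    """
    if not 0.0 <= fraction_from_kinetochore <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    if arm == "short":
        return SHORT_ARM_UM * (1.0 - fraction_from_kinetochore)
    if arm == "long":
        return SHORT_ARM_UM + LONG_ARM_UM * fraction_from_kinetochore
    raise ValueError("arm must be 'short' or 'long'")


print(round(SHORT_ARM_UM, 2), round(LONG_ARM_UM, 2))
```

With these assumptions the kinetochore falls at roughly 20 μm from the short-arm tip.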
Recombination rate along chromosome 1: To construct a frequency map of RNs along the physical length of SC1, the total number of RNs observed in each 0.4-μm segment of SC length was determined. The 0.4-μm segment length was chosen to maximize the total number of segments while minimizing the number of segments that had no observed RNs. The Lowess procedure (Cleveland 1981) was applied to these data to smooth local variation, as suggested by Stephan and Langley (1998). The Lowess procedure smoothes the recombination function by applying weighted least-squares regression to sliding windows that are defined by the number of data points. We applied four different sliding window sizes, ranging from 5 to 11 data points, to examine the influence of window size on recombination rate estimates. As is detailed below, the size of sliding windows made little qualitative difference to the results. The Lowess procedure was applied in the R statistical package, version 1.2.0 (http://www.r-project.org/).
Determining recombination rate in the 21 loci: The 21 genetic loci were localized on the RN map using an approach similar to that of Stephan and Langley (1998). First, the RN distribution was converted into centimorgan map units, following Sherman and Stack (1995). Regions with more observed RNs represented regions with greater centimorgan distances. Second, the RN map and the genetic map of maize chromosome 1 (UMC98; Davis et al. 1999) were aligned in a linear fashion, such that each arm of the SC corresponded to the appropriate arm of the genetic map. Finally, the ratio between the total length of the RN map in centimorgans (125.9 cM) and the total length of the UMC98 map of chromosome 1 (249.2 cM) was used to determine the positions of the 21 loci on the RN map. Once positioned along the RN map, we estimated the recombination rate R for each locus as the predicted frequency of occurrence of RNs per micrometer.
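The binning step that precedes smoothing can be sketched as below. This is an illustrative sketch only: it computes the raw frequency of RNs per SC per micrometer in each 0.4-μm segment and does not reproduce the Lowess smoothing itself. The function name and the toy positions are hypothetical.

```python
import math


def rn_frequency_per_um(positions_um, n_sc, sc_length_um=45.4, seg_um=0.4):
    """Raw RN frequency (RNs per SC per micrometer) in consecutive segments.

    positions_um: RN positions along the average SC (micrometers).
    n_sc: number of SCs on which the RNs were observed.
    """
    n_seg = math.ceil(sc_length_um / seg_um - 1e-9)
    counts = [0] * n_seg
    for pos in positions_um:
        if not 0.0 <= pos <= sc_length_um:
            raise ValueError("RN position lies outside the SC")
        # An RN exactly at the distal tip is assigned to the last segment
        idx = min(int(pos / seg_um), n_seg - 1)
        counts[idx] += 1
    return [c / n_sc / seg_um for c in counts]


# Toy data: four RNs observed on two SCs (hypothetical values)
freq = rn_frequency_per_um([0.1, 0.3, 0.5, 45.3], n_sc=2)
print(len(freq), freq[:2])
```

The resulting per-segment values would then be passed to a smoother (Lowess in the study) with the chosen window size.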
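The map lengths quoted above can be related by a short calculation. The sketch assumes that each RN marks one crossover and that one crossover per bivalent corresponds to 50 cM; under that assumption, 277 RNs on 110 SC1s give a map length that matches the 125.9 cM reported in the text. The uniform rescaling of UMC98 positions is a simplification for illustration (the study aligned the maps arm by arm), and all names are hypothetical.

```python
CM_PER_CROSSOVER = 50.0  # assumption: one crossover per bivalent = 50 cM


def rn_map_length_cm(n_rn: int, n_sc: int) -> float:
    """Genetic length implied by the mean number of RNs per SC."""
    return CM_PER_CROSSOVER * n_rn / n_sc


RN_MAP_CM = rn_map_length_cm(277, 110)  # RNs and SC1s reported in the text
UMC98_CM = 249.2                        # UMC98 length of chromosome 1
SCALE = RN_MAP_CM / UMC98_CM


def umc98_to_rn_map(pos_cm: float) -> float:
    """Rescale a UMC98 map position (cM) onto the RN map (cM)."""
    return pos_cm * SCALE


print(round(RN_MAP_CM, 1), round(SCALE, 3))
```

A locus positioned this way can then be assigned the smoothed RN frequency at the corresponding physical position on SC1.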
The population-recombination parameter, recombination, and diversity
The population-recombination parameter C was estimated from DNA sequence data by three different methods: (i) Hudson’s (1987) method, with estimates taken from Tenaillon et al. (2001); (ii) Wall’s (2000) method, in which the estimate maximizes the joint probability of obtaining the observed number of minimum recombination events and haplotypes (program provided by J. Wall); and (iii) the program LDhat (http://www.stats.ox.ac.uk/~mcvean/LDhat/LDhat.html), which employs Hudson’s (2001) method with importance sampling (Fearnhead and Donnelly 2001). Estimates based on the three methods were denoted ĈHud87, ĈWall00, and ĈHud01, respectively. All reported C estimates were per site values. We did not use full-likelihood methods to estimate C because they are computationally infeasible with high levels of recombination (Wall 2000).
We contrasted estimates of R and C with the diversity measures
RESULTS
Microsatellite diversity: A total of 47 microsatellites were identified in 18 of the 21 genetic loci. A description of the microsatellite loci including their genetic diversity (H), number of alleles (A), and variance in allele size (V) is presented in Table 1. The 14 monomorphic and 33 polymorphic microsatellites were further characterized as being located in either coding or noncoding sequence (Table 1). The majority occurred in noncoding regions, with 7 noncoding among 14 monomorphic microsatellites and 30 noncoding among 33 polymorphic microsatellites. All polymorphic markers in coding regions were trinucleotide or hexanucleotide repeats and did not induce frameshifts.
We investigated the relationships between three different measures of microsatellite diversity (Hmicrosat, V, and A) calculated for the 33 polymorphic microsatellites. Hmicrosat ranged from 0.17 to 0.88, and the average variance in allele size (V) ranged from 0.08 to 60.69 for the longest microsatellite (Table 1). These ranges are comparable to previous estimates from maize (Smith et al. 1997; Senior et al. 1998; Provan et al. 1999; Matsuoka et al. 2002a) and other organisms (Innan et al. 1997; Schug et al. 1998). The correlation between H and V was significantly positive (r = 0.37; P = 0.019), but this result depends on a single data point corresponding to the longest microsatellite [(CT)11-26] in the tb1 locus (Table 1). The correlation was not significant without this data point (r = 0.11; P = 0.72), suggesting that H and V are weakly related, at best. Additional results involving V were often dependent on the single tb1-based data point, an observation that we reiterate. In contrast to the weak correlation between H and V, there was a strong positive correlation between H and A (r = 0.80; P < 0.001); for the remainder of this article, we ignore A because results based on H were similar (data not shown).
We examined pairwise LD among microsatellite markers. LD was significant at the 5% level in 33 of 528 pairwise comparisons. However, only five associations remained significant after sequential Bonferroni correction, and only one of these five included a pair of microsatellites located in the same locus, umc67. Altogether, these analyses are consistent with previous observations that LD in maize breaks down very rapidly over distance (Remington et al. 2001; Tenaillon et al. 2001).
We calculated the average number of repeats (ANR) for the 38 perfect microsatellites in Table 1 and compared ANR to measures of microsatellite diversity. There was no significant difference (t-test, P = 0.15) between ANR within the 13 monomorphic perfect microsatellites (3.8 repeats) and ANR within the 25 polymorphic perfect ones (4.9 repeats). However, there was a significant positive correlation between ANR and Hmicrosat (r = 0.55; P < 0.001), and the correlation remained significant when only polymorphic microsatellites (P < 0.001) were considered. Finally, ANR and V were positively correlated (r = 0.51; P < 0.001) for polymorphic microsatellite loci. This correlation relied, however, on the single tb1 data point, and the correlation was not significantly positive without that data point (r = -0.48; P = 0.99).
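The multiple-test correction used for the pairwise LD comparisons can be illustrated with a short sketch. It assumes that "sequential Bonferroni" refers to Holm's step-down procedure, which is the usual meaning but is not stated explicitly in the text; the function name and the example P values are hypothetical. The count of 528 comparisons corresponds to all pairs among the 33 polymorphic microsatellites.

```python
from math import comb


def holm_significant(p_values, alpha=0.05):
    """Holm's step-down (sequential Bonferroni) procedure.

    Returns a list of booleans, in the input order, marking the tests that
    remain significant at the family-wise level alpha.
    """
    k = len(p_values)
    order = sorted(range(k), key=lambda i: p_values[i])
    significant = [False] * k
    for rank, i in enumerate(order):
        # Compare the (rank+1)-th smallest P value to alpha / (k - rank)
        if p_values[i] <= alpha / (k - rank):
            significant[i] = True
        else:
            break  # all larger P values are also declared nonsignificant
    return significant


n_pairs = comb(33, 2)  # pairwise comparisons among 33 polymorphic loci
flags = holm_significant([0.01, 0.04, 0.03, 0.005])
print(n_pairs, flags)
```

In this toy example only the two smallest P values survive correction, which mirrors how a handful of nominally significant associations can remain after adjustment.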
Indel variation: A total of 263 nonmicrosatellite indels were scored in 17 of 21 loci. Indel size ranged from 1 to 640 bp, and the number of indels per genetic locus ranged from 2 to 59 (Table 2). A total of 56% of the indels were 1-2 bp in length, and 92% were <20 bp in length (Figure 1). Of the 21 indels longer than 20 bp, 5 were found to have sequence similarity to previously identified transposable elements, including miniature inverted repeat elements (MITEs). Two families of MITEs were found: a Tourist element in 1 individual for each of asg75, umc230, and csu381 and a Stowaway element in 3 of 23 individuals of umc67. In addition, BLAST searches revealed the presence of a Ds element in 4 of 23 individuals for umc128. Hindel values ranged from 0.08 to 0.52, with an average of 0.25 among the polymorphic indels.
A previous study in D. melanogaster suggested that the frequency distribution of large indels deviated from the neutral equilibrium model, consistent with selection against large indels (Tajima 1989). To determine whether maize indels had a nonneutral frequency spectrum, we first studied the relationship between length and indel diversity. If long indels are deleterious, we expect a negative correlation between Hindel and indel length, because large deleterious indels have a low probability of reaching an appreciable population frequency and should therefore have low Hindel values. A significant negative correlation was found between Hindel and indel length (r = -0.11; P = 0.02), consistent with this expectation. We note that the longer length variant was present in low frequency (i.e., <15%) for 9 of the 10 large indels. We also studied the relationship between Tajima’s D and indel length. Tajima’s D was calculated for three different data sets. The first set included all 253 indels <100 bp. However, because the initial data set had missing data entries, all indels could be identified in a common sample of only 13 individuals. Tajima’s D for this data set of 253 indels and 13 individuals was D253-13 = -0.43, which was not a significant deviation from the neutral expectation of 0.0, assuming no recombination.
We also calculated Tajima’s D in a data set that included all 25 individuals and a common sample of 55 indels that were <100 bp; D55-25 was -0.66 for this data set and again did not deviate from the neutral model, assuming no recombination. Finally, we calculated D for 10 indels >100 bp that were scored in a common sample of 22 individuals; D10-22 was -1.413, which was not significant under the conservative assumption of no recombination, but was substantially lower than D values calculated on short indels. However, the large indels are physically distant from one another, and it is therefore reasonable to apply Tajima’s D test assuming free recombination. With free recombination, D10-22 = -1.413 represents a highly significant departure (P = 0.008) from the neutral equilibrium model. Thus, the significant and comparatively low D value based on large indels (>100 bp) is consistent with the hypothesis that the large indels in this sample are selectively deleterious.
Table 1. Characterization and variability in microsatellites found in 18 of 21 loci
Table 2. Number of nonmicrosatellite indels found in 17 of the 21 loci
Map of the density of RNs per micrometer: Figure 2 plots the frequency of occurrence of RNs per micrometer relative to physical position along SC1. Overlaid on the RN frequency distribution is a smoothed line for the rate of recombination (R = number of RNs per micrometer) derived using the Lowess procedure and a sliding window size of 11 data points. Initially four different sliding window sizes were used. We localized the 21 loci on the RN map and obtained four different estimates (corresponding to the four sliding window sizes) of recombination rate, R, for each locus. Estimates of R based on these four different window sizes were highly and significantly correlated; among the six pairwise comparisons, the lowest correlation was r = 0.88 (P < 0.0001), which corresponded to the correlation between the most extreme window sizes (i.e., 5 vs. 11 data points). Because estimates of R were similar among window sizes, we report results on the basis of a single sliding window size, which we have chosen to be 11 data points.
Figure 1. Length distribution of indels. For ease of illustration, indels >20 bp in length are grouped.
The values of R, measured in RNs per micrometer, ranged from 0.0099 for umc67 to 0.1297 for fus 6 (Table 3). This range is comparable to the range of ĈHud87, which varied from 0.0001 to 0.1337 per base pair (Table 3), and it is also similar to R values reported for tomato (Stephan and Langley 1998). The pattern of R along SC1 was characterized by a marked reduction of recombination rate near the centromere and an increase toward telomeres (Figure 2). This pattern, low recombination in the centromere with higher recombination toward telomeres, was confirmed for the 21 genetic loci, because the distance of the loci from the centromere was correlated with their R values (r = 0.88; P < 0.0001).
Comparing estimates of R, C, and genetic diversity: Previous studies have demonstrated a positive correlation between recombination rate and genetic diversity, and one purpose of this study was to characterize this correlation in maize. We tested correlations among four different measures of diversity (
Figure 2. A map of the distribution of recombination nodules R (RN/μm) along SC1. The short arm of SC1 is to the left, and the centromere is located approximately at position 20. The data points are the frequency of occurrence of RN (no. of RN/no. of SC observed) per micrometer in each 0.4-μm segment along the SC in the abscissa. The line is the result of the Lowess smoothing procedure with a sliding window containing 11 data points. After alignment to the genetic map (UMC98) of chromosome 1, we determined the positions (indicated by the solid arrows) of the 21 loci along SC1 in the abscissa. The corresponding R values in the ordinate were determined for each of the 21 loci.
A significant positive correlation between ĈHud87 and
However, Hudson’s (1987) estimator of C can be unreliable, particularly when C values are small per gene (Hudson 1987; Wall 2000). We therefore utilized two additional estimates of C. We found a significant positive correlation between ĈWall00 and
We also compared
Of the eight correlations between recombination and diversity shown in Figure 3, two were both positive and significant at the 5% significance level after multiple-test correction. The first was the correlation between
Table 3. Per site estimates of the population-recombination parameter C and the physical recombination rate R
DISCUSSION
In theory, the relative contributions of background and hitchhiking selection can be determined by comparing recombination rates to genetic diversity on the basis of markers that evolve with different rates (Slatkin 1995). However, demography can obscure the relationship between recombination and diversity. Here we have measured genetic diversity for three types of molecular markers in the hope that comparisons among markers would provide insight into the forces shaping maize genetic diversity. Sampling was identical for the three marker types, and hence information among marker types is directly comparable. Recombination was measured both by the population-recombination parameter C and by a quantitative cytogenetic map of chromosome 1.
The quantitative cytogenetic map provides estimates of recombination rate (R) that are related to physical distance along SC1; R is the first quantitative measure of recombination in maize on a chromosomal, rather than a genic, scale. The distribution of R along SC1 indicates that the frequency of exchange per physical unit is reduced in centromeric regions relative to distal chromosomal regions (Figure 2), similar to centromeric suppression observed in other organisms (see Jones 1984 and Resnick 1987 for reviews), including Drosophila (Hudson and Kaplan 1995) and cultivated tomato (Sherman and Stack 1995). The pattern observed in maize chromosome 1 is similar to that reported for large grass genomes such as wheat and barley, in which recombination primarily occurs along the distal half of the chromosomal arm (Gill et al. 1996; Kunzel et al. 2000). In contrast, the region of centromeric repression is relatively small in rice (Cheng et al. 2001).
The distribution of R also suggests substantial heterogeneity in recombination along chromosomal arms (Figure 2). Although the magnitude and scale of recombination need to be characterized further, heterogeneity in R is consistent with the observation in barley that recombination is mainly confined to a few small areas spaced by large segments in which recombination is severely suppressed (Kunzel et al. 2000). Similar recombination hotspots have been previously described in maize (Dooner 1986; Civardi et al. 1994; Okagaki and Weil 1997; Fu et al. 2002).
The correlation between SNP diversity and recombination estimates: One striking result is that Ĉ correlates with
Given these results, it is important first to consider differences between C and R. One obvious difference is that the two parameters differ in spatial scale. R is estimated on a chromosomal scale and therefore reflects an “average” recombination rate over large chromosomal regions. In contrast, C is estimated for a particular genetic locus. Maize contains recombination hotspots, particularly in genic regions (Civardi et al. 1994; Fu et al. 2002), and it is therefore possible that C more accurately incorporates information about recombination on the “local” spatial scale at which θ is measured.
More importantly, R and C measure different quantities. Both R and C describe recombination to some extent, but R measures only the recombination rate per physical distance; it is unaffected by population history, selection, and demography. In contrast, C is scaled by population size N, and it is inversely related to LD. Like LD, C is affected by population admixture, population subdivision, fluctuations in population size, and selection, in addition to recombination (reviewed in Pritchard and Przeworski 2001). The lack of correlation between R and Ĉ may indicate that selection or demographic factors contribute to an uncoupling between LD and recombination.
Figure 3. Correlations between two estimates of recombination (R and ĈHud87) and between estimates of recombination and diversity. b-e are based on 18 genetic loci, but f-i contain data from all genetic loci (see RESULTS). Regression lines, Pearson correlation coefficients (r), and P values are given.
What could be the evolutionary forces causing a correlation between Ĉ and
Both C and θ (= 4Nμ) contain historical information about population size. Both are also estimated from SNPs that evolve at an estimated rate of ∼10⁻⁹ substitutions per site per year (Gaut et al. 1996) and therefore encompass relatively long time frames. Some SNPs have been retained in Zea populations for 1 million years or more (Gaut and Clegg 1993). As a result, the time frame encompassed by both C and θ exceeds maize domestication ∼7500 (Iltis 1983) to ∼9000 years ago (Matsuoka et al. 2002b). Domestication was associated with a bottleneck that decreased SNP diversity in maize ∼20% on average relative to its wild ancestor (Zhang et al. 2002). There have also been substantial demographic events since domestication, such as the geographic patterning of extant maize races (e.g., Matsuoka et al. 2002b).
It is not yet clear, however, how demographic events, like a domestication bottleneck or geographic subdivision, affect C and θ jointly. One possibility is that population size N varies among loci because gene flow and other demographic factors vary from locus to locus, both within maize and among its wild relatives (as in the D. pseudoobscura complex; Wang et al. 1997; Machado et al. 2002). If demographic effects vary among loci, they could contribute substantially to a correlation between C and θ through N. Variation in N among loci can establish strong positive correlations between C and θ, even in the absence of correlations between θ and c. Simulations with 21 loci suggest that N can vary <10-fold and establish a correlation between C and θ (data not shown). To explore the effect of demography more fully, it will be helpful to have some knowledge of diversity in maize prior to domestication and also of divergence population genetics (Kliman et al. 2000) in the genus Zea. We are in the process of gathering empirical data from wild relatives and will address the effect of demography on C and θ more thoroughly in future work.
Microsatellite diversity: We identified 47 microsatellite loci in our data, and 33 of these loci were polymorphic. Levels of diversity in these loci, as measured by Hmicrosat, are positively correlated with R. There are at least three possible explanations for this correlation.
The first explanation is based on sampling: our sample may contain rapidly evolving loci in regions of high recombination by chance alone. This scenario is particularly plausible because the microsatellites in this study likely evolve with different mutation rates. Microsatellite mutation rates (μ) vary considerably by repeat motif (Chakraborty et al. 1997; Schug et al. 1997, 1998), length (Schlotterer et al. 1998; Schug et al. 1998b; Udupa and Baum 2001), and base composition (Schlotterer and Tautz 1992; Glenn et al. 1996; Bachtrog et al. 2000). For example, μ is estimated to be ∼7.7 × 10⁻⁴ mutations per generation for dinucleotide repeats in maize but <5 × 10⁻⁵ for longer repeat motifs (Vigouroux et al. 2002).
To examine whether any particular class of microsatellite is driving the correlation between Hmicrosat and R, we partitioned microsatellite loci into different classes by repeat type, including perfect mono-, di-, tri-, and hexanucleotide repeats, as well as compound + imperfect repeats (Figure 4). The only class exhibiting a positive and significant correlation between Hmicrosat and R was the compound + imperfect class (Figure 4), but this correlation was not significant after multiple test correction. However, four of the five classes exhibited a positive correlation between Hmicrosat and R (Figure 4), suggesting that a positive correlation with R may be a general property of the microsatellite loci in our sample. The mononucleotide repeat class is particularly interesting, both because these loci may evolve rapidly and because they are primarily located in regions with high R (Tables 1 and 3). The mononucleotide class is positively but not significantly correlated with R (r = 0.46; P = 0.14), but the overall correlation between Hmicrosat and R remains when this class is removed from analysis (r = 0.42; P = 0.02). Thus, it does not appear that the overall correlation between Hmicrosat and R is driven either by one particular class of microsatellite or by the chance location of rapidly evolving microsatellites (like mononucleotide repeats) in high R regions.
Figure 4. Correlations between microsatellite diversity (Hmicrosat) and R for each microsatellite repeat class. Only two polymorphic loci were available from the tetranucleotide class and none from the pentanucleotide class. Compound and imperfect loci were combined because there is little a priori information as to their mutation rates. The regression line, Pearson correlation coefficient (r), and P value are given for each microsatellite class.
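The per-class comparison described above, together with the check that drops one class and recomputes the overall correlation, can be sketched as follows. This is only an illustration of the bookkeeping, not the analysis code used for this study. The function names, the tuple layout (repeat class, Hmicrosat, R), and the minimum of three loci per class are assumptions made for the sketch. No P values or multiple test correction are computed here.

```python
from collections import defaultdict
import statistics


def pearson_r(xs, ys):
    """Pearson r, or None when either variable has zero variance."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / (sxx * syy) ** 0.5


def r_by_class(loci, min_loci=3):
    """Correlation between diversity and R within each repeat class.

    `loci` is an iterable of (repeat_class, h_microsat, r_value) tuples.
    Classes with fewer than `min_loci` loci are skipped (assumed cutoff),
    because r is uninformative with only two points.
    """
    groups = defaultdict(list)
    for repeat_class, h, r_value in loci:
        groups[repeat_class].append((h, r_value))
    result = {}
    for repeat_class, pairs in groups.items():
        if len(pairs) >= min_loci:
            hs, r_values = zip(*pairs)
            result[repeat_class] = pearson_r(hs, r_values)
    return result


def r_without_class(loci, dropped_class):
    """Overall correlation after removing every locus of one repeat class."""
    kept = [(h, r_value) for cls, h, r_value in loci if cls != dropped_class]
    hs, r_values = zip(*kept)
    return pearson_r(hs, r_values)
```

A call such as `r_without_class(loci, "mono")` corresponds to asking whether the overall correlation persists once the mononucleotide loci are set aside; the class label `"mono"` is a hypothetical identifier.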
A second possibility for the correlation between Hmicrosat and R is that recombination is itself mutagenic, thereby causing microsatellite polymorphisms. For example, human data suggest that recombination can lead to the contraction and expansion of trinucleotide repeats (Richard and Paques 2000). The effect of recombination on microsatellite diversity needs to be investigated further, but mutagenic effects of recombination could underlie the correlation between Hmicrosat and R.
A third possibility is that the correlation is a property of the relationship between recombination and selection. To discuss this possibility, it is first important to note that microsatellite mutation rates have been measured in many organisms, including humans (Weber and Wong 1993; Xu et al. 2000), Drosophila (Vazquez et al. 2000), chickpea (Udupa and Baum 2001), and maize (Vigouroux et al. 2002). In all of these organisms, microsatellites mutate at a rate μ > 10⁻⁶ mutations per generation. Thus, the microsatellites in this study probably mutate at least three orders of magnitude more rapidly than SNPs. The consequence of high mutation rates is profound. Microsatellites are expected to quickly approach an equilibrium between mutation and drift (Slatkin 1995), and they recover rapidly from demographic and selective events. For example, microsatellites in Drosophila, which mutate relatively slowly at μ = 5.1 × 10⁻⁶ (Vazquez et al. 2000) compared to plant microsatellites, are estimated to recover from selective sweeps in <1000 years (Nurminsky 2001). Although the number of Drosophila generations in 1000 years likely exceeds the number of maize generations since domestication, it is possible that some maize microsatellites may have recovered, at least partially, from the effect of the domestication bottleneck ∼7500 (Iltis 1983) to ∼9000 years ago (Matsuoka et al. 2002a). If this is true, it is possible that the signature of ongoing hitchhiking or background selection is no longer dominated by a past demographic event (i.e., a domestication bottleneck) in microsatellite loci as it may be in SNPs.
Finally, we note that comparisons between V and R do not yield a positive correlation (Figure 3). However, when V is based on repeat number, rather than allele size, results with Hmicrosat and V are more comparable. It is desirable to use repeat number, as opposed to allele size, because V based on repeat number is not biased by repeat length. However, V based on repeat number cannot be calculated for several microsatellite loci in our sample because the repeats were imperfect, were compound, or did not evolve in stepwise fashion. For the 21 polymorphic perfect loci that evolve in stepwise fashion (Table 1), V based on repeat number is positively, but not significantly, correlated with R (r = 0.19; P = 0.23). When tb1 is dropped from consideration, the correlation is significantly positive (r = 0.47; P = 0.024), and this result is comparable to that obtained with Hmicrosat. All of our analyses with V, whether based on allele size or repeat number, were heavily influenced by the outlying tb1 microsatellite locus. Altogether, the reliance on tb1, the bias due to repeat length for V based on allele size, and the dependence on stepwise mutations for V based on repeat number diminish the value of V as a measure of microsatellite diversity for these data.
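The bias due to repeat length follows from a general property of variance: if allele size equals a constant flank plus motif length times repeat number, then the variance in allele size is the variance in repeat number multiplied by the square of the motif length. A minimal sketch is given below. It assumes V is a sample variance (the estimator is an assumption of this sketch), and the function names, the `flank_bp` parameter, and the example allele sizes are hypothetical.

```python
import statistics


def v_allele_size(allele_sizes_bp):
    """Sample variance of allele size, in bp squared."""
    return statistics.variance(allele_sizes_bp)


def v_repeat_number(allele_sizes_bp, motif_length_bp, flank_bp=0):
    """Sample variance of repeat number for a perfect, stepwise locus.

    Raises ValueError when an allele is not a whole number of repeats,
    which is why this measure is unavailable for imperfect, compound,
    or non-stepwise loci.
    """
    repeat_counts = []
    for size in allele_sizes_bp:
        count, remainder = divmod(size - flank_bp, motif_length_bp)
        if remainder != 0:
            raise ValueError(f"allele of {size} bp is not a whole repeat count")
        repeat_counts.append(count)
    return statistics.variance(repeat_counts)


if __name__ == "__main__":
    # Two made-up loci with identical repeat counts (10, 11, 12, 12, 13)
    # and a 100-bp flank, differing only in motif length.
    dinucleotide = [120, 122, 124, 124, 126]
    trinucleotide = [130, 133, 136, 136, 139]
    for name, sizes, motif in (("di", dinucleotide, 2), ("tri", trinucleotide, 3)):
        print(name, v_allele_size(sizes), v_repeat_number(sizes, motif, flank_bp=100))
```

In this made-up example both loci have the same variance in repeat number, but the trinucleotide locus shows a larger variance in allele size solely because its motif is longer (a factor of 9 versus 4).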
Indel diversity: Indel diversity in maize is marked by a size distribution that is heavily skewed toward small indels (1-5 bp), with a few large (>100 bp) indels marking the extreme tail of the distribution (Figure 1). Similar distributions have been reported for mammalian and Drosophila nuclear DNA (Gu and Li 1995; Bergman and Kreitman 2001), and hence maize is not unique in having a preponderance of small indels. Indel polymorphism was not correlated with Ĉ or R (Figure 3). Because little is known about indel mutation rates and how μ varies among different indel sizes, it is difficult to interpret the lack of correlation.
It is perhaps more interesting that the population frequency of indels is skewed by size. In our sample, large indels are on average less frequent in the population sample than small indels, suggesting that large indels are slightly deleterious. The 10 large indels also have a lower Tajima’s D value than the small indels. Of these 10, only 2 are clearly associated with coding DNA (adh1 and csu381; Table 2); the rest are located in anonymous RFLP marker regions. These results raise an interesting paradox. More than 50% of the maize genome consists of retrotransposons (SanMiguel et al. 1996). Given the preponderance of transposable elements in the maize genome, it seems unlikely that large indels are usually strongly deleterious, yet these population data suggest they are measurably deleterious. The resolution to this problem consists of two components. First, the vast majority of the maize genome consists of retrotransposons that insert into one another (SanMiguel et al. 1996); presumably the targeting of retrotransposons into nonessential genic regions is evolutionarily favorable for element proliferation. Because of this targeting, retrotransposons may be under different evolutionary dynamics from the indels in our sample, none of which are retrotransposons. Second, MITEs and Ds elements, the only identifiable elements in our study, preferentially insert into transcribed regions (Bennetzen 2000), suggesting some of our genetic loci are near coding regions where insertions are more likely to be deleterious.
The forces affecting genetic diversity in maize: This study offers several insights into the forces contributing to genetic diversity in maize. First, there is no evidence that R and
Acknowledgments
The authors thank E. Buckler, P. Tiffin, Y. Vigouroux, T. Johnson, and T. Long for discussion. J. Wall and P. Fearnhead made programs available and answered questions. Two anonymous reviewers made comments that greatly improved the manuscript. This study was supported by National Science Foundation grants DBI-0096033 to B.S.G. and J.F.D. and MCB-9728673 to S.S.
Footnotes
- Communicating editor: D. Charlesworth
- Received May 14, 2002.
- Accepted August 1, 2002.
- Copyright © 2002 by the Genetics Society of America