Abstract
Patterns of linkage disequilibrium (LD) are of interest because they provide evidence of both equilibrium (e.g., mating system or long-term population structure) and nonequilibrium (e.g., demographic or selective) processes, as well as because of their importance in strategies for identifying the genetic basis of complex phenotypes. We report patterns of short and medium range (up to 100 kb) LD in six unlinked genomic regions in the partially selfing domesticated grass, Sorghum bicolor. The extent of allelic associations in S. bicolor, as assessed by pairwise measures of LD, is higher than in maize but lower than in Arabidopsis, in qualitative agreement with expectations based on mating system. Quantitative analyses of the population recombination parameter, ρ, however, based on empirical estimates of rates of recombination, mutation, and self-pollination, show that LD is more extensive than expected under a neutral equilibrium model. The disparity between ρ and the population mutation parameter, θ, is similar to that observed in other species whose population history appears to be complex. From a practical standpoint, these results suggest that S. bicolor is well suited for association studies using reasonable numbers of markers, since LD typically extends at least several kilobases but has largely decayed by 15 kb.
The extent of allelic associations, commonly called linkage disequilibrium (LD), is of great interest in many species because of its implications for the design and feasibility of association studies and genome-wide scans to identify the genetic basis of complex traits. Genome-wide patterns of LD are fundamentally the product of two processes: (1) a new mutation occurs and is necessarily associated with the variants on the chromosome on which it arises, and (2) recombination places that mutation on a different genetic background, breaking the association. Thus the rate of recombination (r) is a key parameter in the process of LD decay. The relationship between r and LD is also affected by demographic factors. More specifically, the extent of LD is a reflection of the population recombination parameter, 4Ner, or ρ, where Ne (effective population size) is a function of the long-term historical size of the population, population structure, and mating system. Similarly, the mutational process that generates associations is summarized in the population mutation parameter 4Neμ, or θ, where μ is the mutation rate. At equilibrium in a randomly mating population without selection, the extent of allelic associations is simply a function of the relative rates of mutation and recombination, since 4Ne cancels out in the ratio ρ/θ. Nevertheless, even in this simplest of scenarios, LD decay varies widely in unlinked regions due to the substantial stochasticity of the evolutionary process (Nordborg and Tavare 2002).
When a population departs from the panmictic equilibrium model, we can no longer accurately estimate ρ and θ from empirical data, and observed levels of LD may be inconsistent with empirically determined rates of recombination under simple demographic models. In the case of a partially self-pollinating species, the resulting reduction in heterozygous genotypes means that the effective rate of recombination is lower for a given rate of crossing over, so LD is, on average, expected to be more extensive. This is an equilibrium effect and can be accounted for by appropriate scaling (Nordborg and Donnelly 1997; Nordborg 2000). In other cases, however, mating is random but nonequilibrium population history leads to inconsistencies that are less easily explained. For example, LD in Drosophila is more extensive than expected on the basis of genetic distances and estimates of Ne from θ (Andolfatto and Przeworski 2000). In humans, LD can be either more or less extensive than expected, depending on the range of interlocus comparisons (Pritchard and Przeworski 2001). Selection can also generate LD, although the locus-specific effects of selection can be hard to distinguish from the noise generated by neutral processes (Huttley et al. 1999; Kim and Nielsen 2004).
Our knowledge of the extent of LD in plants is limited (Flint-Garcia et al. 2003). Only in Arabidopsis, with a fully sequenced genome and high-density genetic map, has it been possible to conduct analyses comparable to those in Drosophila and humans, namely to evaluate LD over large, defined physical distances in the context of local rates of recombination. Very extensive LD on the order of 250 kb has been observed in this highly self-pollinating species (Hagenblad and Nordborg 2002; Nordborg et al. 2002), although a recent genome-wide study has shown that LD at most loci decays within 25–50 kb (Nordborg et al. 2005). In contrast, LD in maize, an outcrosser, decays within a few hundred base pairs in diverse samples (Tenaillon et al. 2001), although the extent of LD increases when narrower samples of germplasm (Remington et al. 2001; Ching et al. 2002; Jung et al. 2004) or targets of selection (Clark et al. 2004; Palaisa et al. 2004) are analyzed. Studies of gene-sized regions in both Populus (Ingvarsson 2004) and loblolly pine (Brown et al. 2004) indicate low levels of LD in these highly outcrossing trees. While these contrasts are frequently explained in terms of mating system, wild barley, with a selfing rate similar to that of Arabidopsis, has patterns of LD more similar to those of maize (Lin et al. 2002). Very extensive LD is observed in rice (Garris et al. 2003; Semon et al. 2004) where the effects of mating system are likely confounded with population structure.
The extent of LD is a key issue in the design and feasibility of association mapping methods (Flint-Garcia et al. 2003). Association studies (also called LD mapping) have been successful in maize (e.g., Thornsberry et al. 2001), where they may be limited to candidate genes because of the small extent of LD. Sorghum bicolor, a largely (∼70%) self-pollinating domesticated grass (Rooney and Smith 2000) that is important for human nutrition in semiarid regions of sub-Saharan Africa (FAO 1996), is closely related to maize but has a smaller and less complex genome (Draye et al. 2001). In a previous study (Hamblin et al. 2004), we reported that LD over very short distances in sorghum was more extensive than in maize, suggesting that sorghum may be suitable for LD mapping of genes underlying complex, agronomically important traits common to both species. However, our limited data did not reveal anything about the scale over which LD dissipates, and we had no information about local relationships between genetic and physical distance for the regions analyzed. A physical map of the sorghum genome is being assembled and integrated with the genetic map (Draye et al. 2001; Mullet et al. 2002; Bowers et al. 2003), which ultimately will allow for estimation of rates of recombination, as well as LD, over fairly large distances, on the scale of centimorgans and megabases.
Meanwhile, genomic regions represented by BAC clones containing genetic markers can be used to sample patterns of LD and the relationship between physical and genetic distance on a scale of tens of kilobases. Six fully sequenced BAC clones, representing five different chromosomes, were used for this purpose. The goals of this study were to examine the pattern of pairwise associations among a large number of single-nucleotide polymorphisms (SNPs) in six large (40–100 kb) unlinked regions and to estimate ρ (the population recombination parameter, 4Ner), θ (the population mutation parameter, 4Neμ), and r (the rate of recombination per base pair per generation) for those same regions. Analysis of this data set allowed us to assess the contribution of mating system and recombination rate to patterns of LD in sorghum and provides a general picture of those patterns that may prove useful in LD-based methods of genetic analysis.
MATERIALS AND METHODS
Sorghum accessions:
Our panel of 32 S. bicolor accessions was a subset of 104 diverse accessions that had been characterized at 76 SSR loci (Casa et al. 2005) and were chosen to represent all of the population clusters identified by phylogenetic analysis. These include two U.S. inbred lines: BTx623 and RTx430; 22 land races: PI510985 and PI510906 from Botswana, NSL83707 from Cameroon, NSL50875 from Chad, PI22913 from China, PI267525 and PI267523 from Egypt, PI257595 from Ethiopia, PI221607 from Ghana, NSL51365, NSL87088 and NSL51836 from India, PI213900 from Kenya, NSL51032 from Mali, PI221655, PI221540, and NSL50744 from Nigeria, NSL51397 from South Africa, PI152702 from Sudan, NSL55751 and NSL77034 from Uganda, PI287624 from Zimbabwe; two subspecies of drummondii from Sudan: L-WA12 and L-WA71; six subspecies of verticilliflorum: L-WA13 from Sudan, L-WA22 and L-WA28 from Angola, L-WA42 from South Africa, L-WA55 from Benin, and L-WA88 from Egypt. The verticilliflorum subspecies is believed to be the ancestor of cultivated sorghum; all the subspecies are fully interfertile. One S. propinquum accession was used as an outgroup for estimates of ρ (see below). DNA was prepared from young leaves of individual plants according to the method of Doyle and Doyle (1987).
Choice of regions for resequencing:
Annotation of predicted genes was used to identify putative intronic regions between 500 and 1700 bp in length that could be amplified from PCR primers on the basis of flanking exon sequence. Note that this is a technical consideration only and that the functional status of the sequence has no bearing on the analysis.
PCR products ranged from 700 to 1700 bp in size (Table 1) and were prepared for sequencing by treatment with shrimp alkaline phosphatase and exonuclease I digestion. PCR primers (and internal primers as necessary) were used for cycle sequencing with ABI Big Dye V. 3.1, and sequencing reactions were analyzed on a 3730 or 3700 sequencer at the Bioresources Center at Cornell University.
Table 1. Regions surveyed and informative SNPs observed
Analysis:
Chromatograms were trimmed and edited manually using Sequencher 4.2 (Gene Codes, Ann Arbor, MI). When necessary, text files were exported and aligned using Multalin (http://prodes.toulouse.inra.fr/multalin/multalin.html) or Se-Al (http://evolve.zoo.ox.ac.uk). Although sorghum is usually homozygous at most loci, some heterozygous individuals were observed (Table S7 at http://www.genetics.org/supplemental/). In these cases, the heterozygous individual was considered to have two chromosomes at that region only. Except for TASSEL (see below), none of our analyses required that phase be known. Summary statistics of DNA sequence polymorphism were estimated by DnaSP (Rozas and Sanchez-DelBarrio 2003). Multilocus tests of polymorphism and divergence (Hudson et al. 1987) and the variance of Tajima's D (Tajima 1989) were performed with Jody Hey's program HKA (http://lifesci.rutgers.edu/heylab). Coalescent simulations of a population bottleneck were performed with Hudson's program ms (Hudson 2002).
Files of variable sites generated by MEGA (Kumar et al. 2001) were formatted for LD estimation after removal of sites for which the minor allele had a frequency of <10% (Remington et al. 2001). In a few regions we observed an individual that appeared to result from introgression from a divergent wild relative, as it differed from all other alleles in the sample at many sites. For example, in region 2, of a total of 91 segregating sites, accession L-WA22 contributed 51. Such individuals were removed from the sample for those regions only. Note that, because these individuals contribute only singletons, they have no effect on estimates of LD.
In a small number of cases, some accessions produced sequence of insufficient quality for calling all bases, but most of the polymorphic SNPs could nonetheless be reliably scored and were added to the LD analysis of the full sequence data. These sequences were not included in estimates of sequence diversity. These exceptions to the strategy of full resequencing are noted in Table 1. The programs dipdat (kindly provided by Dick Hudson) and maxdip (http://genapps.uchicago.edu/maxdip/index.html) were used to estimate r² (Hill 1974) and ρ (i.e., 4Ner), respectively, from genotypic (i.e., unphased diploid) data (Hudson 2001). LD triangle plots were constructed using TASSEL (http://www.maizegenetics.net/bioinformatics/tasselindex.htm) after removal of genotypes containing more than one heterozygous site, since TASSEL requires that phase be known.
We also used the program PHASE (Li and Stephens 2003) to obtain estimates of ρ and of the variation in recombination rates across the surveyed regions. Default parameters were used, except that the −X10 option was used to increase the number of iterations in the final run, as suggested in the documentation. Point estimates of ρ and λ (the factor by which the recombination rate in an interval between two loci exceeds the background rate) were obtained by taking the median value from 1000 iterations, as suggested in the documentation. An interval was considered a potential hotspot if the fifth percentile of λ was >1.0.
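The summary rule described above (median as point estimate, fifth percentile of λ >1.0 as the hotspot criterion) can be sketched as follows. This is a minimal illustration, not the PHASE implementation: the function name `summarize_lambda`, the nearest-rank percentile convention, and the input format (a plain list of sampled λ-values for one interval) are assumptions made here for clarity.

```python
import statistics

def summarize_lambda(samples, threshold=1.0):
    """Summarize sampled lambda values for one inter-SNP interval.

    Returns (median, fifth_percentile, is_potential_hotspot).
    """
    ordered = sorted(samples)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one sampled value")
    # Nearest-rank 5th percentile (an assumed convention):
    # rank = ceil(0.05 * n), 1-indexed, computed with integer arithmetic.
    rank = (5 * n + 99) // 100
    fifth = ordered[rank - 1]
    # Flag the interval when at least ~95% of sampled values exceed the threshold.
    return statistics.median(ordered), fifth, fifth > threshold
```

With 1000 sampled iterations, the fifth percentile under this convention is the 50th smallest value, so an interval is flagged only when at least 951 of the 1000 values exceed 1.0.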
Estimation of rates of recombination:
We used a recombinant inbred line (RIL) population, DNAs of which were kindly provided by Tom Hash of the International Crop Research Institute for the Semiarid Tropics. There were 244 lines in the population, which had been self-pollinated and advanced to the F6 generation by single-seed descent. Within each BAC sequence, we identified at least two simple sequence repeats that showed length variation between the parents of the RIL population (BTx623 and IS18551). We designed primers for fragment analysis and scored the markers in the 244 DNAs. Recombination per generation (r) was estimated using the formula: observed recombination fraction = 2r/(1 + 2r) (Burr et al. 1988) and divided by the number of base pairs between markers.
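Solving the formula above for r gives r = R/[2(1 − R)], where R is the observed recombination fraction among the RILs. A minimal sketch of the calculation follows; the function names and the example counts are hypothetical and are not taken from Table 3.

```python
def per_generation_r(recombinant_lines, total_lines):
    """Invert R = 2r / (1 + 2r) to recover the per-meiosis rate r.

    R is the fraction of recombinant inbred lines that are recombinant
    between the two markers.
    """
    R = recombinant_lines / total_lines
    if not 0 <= R < 0.5:
        raise ValueError("observed recombination fraction must be in [0, 0.5)")
    return R / (2 * (1 - R))

def r_per_bp(recombinant_lines, total_lines, distance_bp):
    """Per-generation recombination rate per base pair between two markers."""
    return per_generation_r(recombinant_lines, total_lines) / distance_bp

# Hypothetical example: 2 recombinant lines out of 244, markers 50,000 bp apart.
example_rate = r_per_bp(2, 244, 50_000)
```

Because only a handful of recombinants are expected over tens of kilobases, estimates obtained this way carry wide sampling error, which is consistent with the lack of significant differences among regions reported below.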
RESULTS
We characterized the decay of linkage disequilibrium in six unlinked regions of the S. bicolor genome represented by six fully sequenced BAC clones (Table 1). LD was estimated on the basis of SNP variation in several amplicons of ∼700–1700 bp spaced irregularly across each region (Figure 1, Table 1), spanning a total distance of between 38 and 103 kb/BAC. Our panel of 32 S. bicolor accessions (see materials and methods), which includes both cultivated and wild representatives, was chosen to maximize diversity and to capture as much of the evolutionary history of S. bicolor as possible. All 32 lines were fully sequenced for most regions (for exceptions, see materials and methods). The numbers of SNPs observed in each subregion (i.e., amplicon) vary considerably (Table 1). Only SNPs for which the minor allele was present three or more times in the sample were included in the analyses of LD; all subsequent references to SNPs refer to this subset of 249 of 427 total SNPs observed.
Figure 1. Gene content (diagonally striped boxes) and sequenced subregions (horizontally striped boxes) for the six regions. See Table 1 for more information.
Pairwise estimates of LD:
We calculated r² for all pairwise comparisons among SNPs within the same region and plotted those values as a function of physical distance. Figure 2, which shows these plots for each region separately, reveals a great deal of heterogeneity in the decay of LD among these regions. While the pooled data indicate that, on average, r² falls below 0.1 by 15–20 kb, in region 3 many values of r² remain >0.2 even at distances >35 kb, and associations in region 5 are essentially absent at distances of 5–10 kb. To assess “background LD,” we calculated r² between pairs of sites in regions 1 and 2 (different chromosomes) and between regions 2 and 3 (both chromosome 1); the mean values of r² are 0.035 and 0.024, respectively.
Figure 2. Plots of r² (y-axis) vs. distance (in kilobases; x-axis) for individual regions and pooled data. The curves are logarithmic trend lines fit to the data.
Figure 2 also shows that the relationship between physical distance and r² is not strong, particularly over shorter distances. When logarithmic trend lines are fit to the data, the coefficients of determination vary from 0.03 for region 2 to 0.61 for region 4. When only distances <4 kb are included, the relationship between r² and distance almost entirely disappears: only region 4 has a coefficient of determination >0.03.
There is considerable interest in whether LD is “block-like,” as strong haplotype structure, if real, may simplify LD-mapping studies (Zhang et al. 2002). The plots in Figure 2 obscure patterns of LD among blocks of sites, so we have also made triangle plots of the data (Figure S3 at http://www.genetics.org/supplemental/). Again, the patterns are very different among the regions. In region 4, for example, there are only a few isolated associations of any significance (P < 0.01) between sites >1 kb apart. Region 5 has associations of P < 0.0001 that extend >15 kb, but those associations are patchy, not block-like. In contrast, region 1 has an almost solid block of associations of P < 0.001 extending >13 kb. These patterns are not easy to compare because the spacing of subregions and the numbers of SNPs observed in each are not uniform; nonetheless it is apparent that some regions of the sorghum genome have extended haplotypes while others do not.
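For readers unfamiliar with the statistic, the pairwise calculation can be illustrated with a simplified sketch. It assumes phased (haplotype) data coded 0/1 per chromosome, which is a simplification: the estimates reported here were obtained from unphased diploid genotypes with dipdat. The function names `r_squared` and `pairwise_ld` are hypothetical.

```python
from itertools import combinations

def r_squared(site_a, site_b):
    """Squared allelic correlation between two biallelic sites.

    Each site is a list of 0/1 alleles, one entry per sampled chromosome,
    in the same chromosome order for both sites.
    """
    n = len(site_a)
    p_a = sum(site_a) / n
    p_b = sum(site_b) / n
    p_ab = sum(a * b for a, b in zip(site_a, site_b)) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom

def pairwise_ld(positions, sites):
    """Return (distance_bp, r_squared) for every pair of sites in a region."""
    return [
        (abs(positions[j] - positions[i]), r_squared(sites[i], sites[j]))
        for i, j in combinations(range(len(sites)), 2)
    ]
```

The list returned by `pairwise_ld` has the form needed for a plot of r² against physical distance.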
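A logarithmic trend line of the kind described above has the form r² = a + b ln(distance), and its coefficient of determination can be obtained by ordinary least squares on the log-transformed distances. The sketch below is one way to do this; it is an assumption about the fitting procedure (the software used for the trend lines is not specified here), and the function name `fit_log_trend` is hypothetical.

```python
import math

def fit_log_trend(distances_kb, r2_values):
    """Least-squares fit of y = intercept + slope * ln(x).

    Returns (intercept, slope, coefficient_of_determination).
    Distances must be positive.
    """
    xs = [math.log(d) for d in distances_kb]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(r2_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, r2_values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_tot = sum((y - mean_y) ** 2 for y in r2_values)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, r2_values))
    return intercept, slope, 1 - ss_res / ss_tot
```

A coefficient of determination near 0.03, as seen for most regions at distances <4 kb, means that distance accounts for almost none of the variance in r² under this model.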
Multilocus estimates of LD:
As is evident in Figure 2, pairwise estimates of LD are noisy and often difficult to interpret (Nordborg and Tavare 2002). A statistic that summarizes LD over an entire region is ρ, the population recombination parameter (see Introduction). A number of methods have been proposed for estimating ρ, but many have been found to perform poorly, particularly when sample sizes are small (Wall 2000). Due to the nature of our data set (namely, fairly large, discontinuous regions), we chose to use the composite likelihood (CL) method of Hudson (2001) as well as the product of approximate conditionals (PAC) method of Li and Stephens (2003). Consistent with simulation studies that showed that these estimators perform well, the two estimates for each region do not differ by more than about twofold (Table 2). ρPAC was lower than ρCL in five of six cases, but the relative order of the values, by region, was similar for the two methods.
Table 2. Estimates of the population recombination parameter ρ
The CL method allows for gene conversion as well as crossing over and estimates the ratio of the two processes, given a certain mean gene conversion tract length (l) (see materials and methods). Allowing for gene conversion can result in lower estimates of ρ; however, the impact on this data set is modest. Assuming l = 300 or 500, a model without gene conversion (i.e., f = 0) fit the data best for all regions except regions 2 and 5 (Table 2). For region 2, the likelihood curve was extremely shallow, so that a model without gene conversion fit the data almost as well as several other models that included high rates of gene conversion. Support for a model including gene conversion was strong only for region 5 but only a modest amount of gene conversion (f = 1) was inferred; ρCL for this region was 7.55 under a model with f = 0.
For both methods, the estimates of ρ (Table 2) vary ∼12-fold across the six genomic regions, and the 90% C.I.'s for regions 3 and 4 do not overlap. Variation in recombination rate among regions of the genome is a possible factor underlying these differences. To test this hypothesis, we experimentally measured rates of recombination for five of our regions. (Recombination was not measured for region 1 because the entire BAC clone is only 40 kb in size and recombination events were not likely to be observed.) These studies, shown in Table 3, indicate that rates of recombination vary ∼4-fold from region to region and may explain a portion of the difference in patterns of LD that we observe, in particular the difference between regions 3 and 4. None of the estimates of r, however, are significantly different. Variation in r explains only ∼25% of the variation in ρCL, but explains almost 60% of the variation in ρPAC.
Table 3. Recombination frequency scored in RIL population
While rates of recombination based on crossover frequencies can be estimated over fairly large regions, there is interest in whether variation in recombination rates may occur on a much finer scale; except in the case of major hotspots (e.g., Yauk et al. 2003), such variation can be detected only by inference from population genetic data (Fearnhead and Donnelly 2001). The method of Li and Stephens (2003), which we used to estimate ρ, allows rates of recombination to vary along the region and reports a value of λ for each interval between adjacent SNPs (see materials and methods). The vast majority (98%) of λ-values were between 0.5 and 2.0, indicating a fairly uniform rate of recombination. Two exceptions, noted in Table 2, were found. In both cases, point estimates of λ were ∼6, and 95% of the iterations produced a λ-value >1.0. While this does not constitute a test of significance, and does not account for multiple tests (252 intervals were tested), these two intervals are clearly outliers in the data set.
Cultivated sorghum has experienced a domestication bottleneck (Aldrich and Doebley 1992), which is expected to affect patterns of LD (Pritchard and Przeworski 2001). Furthermore, admixture of accessions from the wild and cultivated populations could result in elevated LD (Pritchard and Przeworski 2001). However, estimates of ρ for the cultivated accessions only are almost all (five of six) lower than that for the total sample (Table S5 at http://www.genetics.org/supplemental/), suggesting that the effect of the bottleneck is more important than that of admixture. This is not surprising, given that the wild accessions are only moderately differentiated from the cultivated ones and that there is little structure among the cultivated accessions (Casa et al. 2005). The exception is region 1, for which ρ is considerably higher in the cultivated sample. This is because the cultivated sample had 47 SNPs as compared to 78 in the total sample, and a large number of alleles in strong LD were present only in the wild accessions (most of the SNPs in subregions 1A–1D). When wild accessions were eliminated, the number of SNPs in region 5 dropped from 70 to 62, and for the four other regions there was no difference. In general, most of the polymorphisms that were observed only in the wild accessions were in low frequency and had not been included in the analysis. The greater LD in the cultivated sample is therefore presumably due to the loss of haplotypes that provide evidence of recombination among the same pairs of alleles.
Because the ability to detect evidence of recombination depends on the presence of informative polymorphisms, it is useful to look at the ratio of ρ to θ in comparing ρ across regions. Table 4 shows the ratio of ρ and θ for the total sample and the cultivated accessions only. (In this analysis, we use the higher estimate of ρ, usually ρCL, which is conservative for testing the hypothesis that LD is higher than expected.) This ratio, which ranges from 0.040 to 0.375 in the total sample and from 0.014 to 0.249 in the cultivated sample, indicates that recombination is infrequent relative to mutation.
Table 4. Comparison of ρ and θ for total and cultivated samples
DISCUSSION
In this study of linkage disequilibrium in S. bicolor, we present estimates of the population recombination parameter, ρ, based on six large, unlinked, fully resequenced regions for which we have also estimated the local rate of recombination per base pair. We find that the extent of allelic associations in sorghum, as assessed by pairwise measures of LD, is higher than in maize but lower than in rice and Arabidopsis, in qualitative agreement with expectations based on differences in mating system. Multilocus estimates of the population recombination parameter ρ, however, are among the lowest observed in any species, including Arabidopsis (Kuittinen and Aguade 2000; Hagenblad and Nordborg 2002; Nordborg et al. 2005). In attempting to account for these observations, several factors should be considered.
Estimation of ρ:
A number of different methods have been proposed for estimating ρ, many of which do not perform well when tested against simulated data with known values of ρ (Wall 2000; Hudson 2001): point estimates are often far from the known value, and/or confidence intervals are very large. Likewise, different estimators of ρ may produce very different results for the same empirical data (e.g., Tenaillon et al. 2002). Therefore it is reasonable to ask how much confidence we have in the estimates that we report. True confidence intervals for estimates of ρ are not trivial to obtain, particularly when the data collection scheme is not simple (as in our case). For ρPAC, however, it is easy to obtain an approximation of the uncertainty from the distribution of sampled ρ-values; each of these ∼90% credible intervals (Table 2) contains the corresponding ρCL. In fact, the two estimates are in all cases within about twofold of each other, suggesting that they are not very far from the true value.
Testing an equilibrium model:
Both mating system and rates of recombination affect LD in predictable ways that can be accounted for in an equilibrium model. If accounting for these factors does not explain the data, then we must invoke nonequilibrium phenomena such as population structure and history and selection.
The effects of mating system, recombination rate, and mutation rate on ρ in a partially selfing organism can be described by the equation ρ/θ = (r/μ)(1 − F), where F is the inbreeding coefficient (Hagenblad and Nordborg 2002). The rate of self-pollination in sorghum is ∼70% (Rooney and Smith 2000), so F = 0.7/(2 − 0.7) = 0.54. On the basis of synonymous substitutions between maize and sorghum at multiple loci (Swigonova et al. 2004), we use an estimate of μ = 1 × 10⁻⁸/bp/generation. We therefore expect that ρ/θ should be on the order of (r/1 × 10⁻⁸) × 0.46, a value that is >1 for all the regions in our study. Actual values of ρ/θ are ∼5–33 times lower (Table 4), using the higher of our two estimates of ρ. While mutation rates may vary among regions, as reflected by the differences in θ, it appears that the effects of mating system and recombination rate cannot account for the low values of ρ observed in this sample. Much higher rates of self-pollination (>90%) would be required to fit these data with an equilibrium model. While some cultivated sorghum accessions do have rates in this range, sorghum spent the vast majority of its evolutionary history as a wild species with an outcrossing rate believed to be ≥30% (Doggett 1988). Rates of recombination much lower (∼10⁻⁹/bp/generation), or mutation much higher (∼2 × 10⁻⁷/bp/generation), could also resolve the discrepancy, but such values are not plausible.
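The equilibrium expectation can be reproduced with a few lines of arithmetic. The sketch below uses the selfing rate and mutation rate given in the text; the recombination rate passed in the example (5 × 10⁻⁸/bp/generation) is a hypothetical value chosen for illustration and is not one of the estimates in Table 3. Function names are likewise hypothetical.

```python
def inbreeding_coefficient(selfing_rate):
    """Equilibrium inbreeding coefficient F = s / (2 - s) under partial selfing."""
    return selfing_rate / (2 - selfing_rate)

def expected_rho_over_theta(r_per_bp, mu_per_bp, selfing_rate):
    """Expected rho/theta = (r / mu) * (1 - F) under the equilibrium model."""
    return (r_per_bp / mu_per_bp) * (1 - inbreeding_coefficient(selfing_rate))

F = inbreeding_coefficient(0.7)                      # about 0.54
ratio = expected_rho_over_theta(5e-8, 1e-8, 0.7)     # hypothetical r
```

Raising the selfing rate to 0.9 gives F ≈ 0.82, so the scaling factor (1 − F) drops from ∼0.46 to ∼0.18, which shows how strongly the expectation depends on the assumed mating system.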
Population structure and history:
Population structure can contribute to elevated LD (Pritchard and Przeworski 2001), so our wide sampling could bias estimates of LD upward if population structure were strong. However, we know from studies of SSR diversity that there is little structure in sorghum populations (Casa et al. 2005). Furthermore, results of Wakeley and Lessard (2003) suggest that our sampling strategy may minimize the impact of any population structure on estimates of LD. In samples drawn from just two demes, they show that correlations in histories of alleles are increased. The properties of “scattered” samples like ours, however, where each individual comes from a different deme, approach those of samples from panmictic populations. In any case, estimates of ρ in the cultivated sample are lower than those in the total (i.e., admixed) sample, including wild accessions. These results are consistent with the findings for maize, where samples that capture greater genetic diversity (i.e., more ancestral recombination events) show increasingly less LD (Flint-Garcia et al. 2003). Thus our sampling strategy could be considered conservative for testing the hypothesis that LD in sorghum is more extensive than expected.
Nonequilibrium population history can result in discrepancies between ρ and θ, as has been observed in humans and Drosophila (Frisse et al. 2001; Wall et al. 2002), and this factor is likely to be important in sorghum, which has experienced a domestication bottleneck. In our data, the cultivated subsample has 58% of the segregating sites of the total sample (Table S6 at http://www.genetics.org/supplemental/). Consistent with the effects of a bottleneck, the frequency spectrum of variation in the cultivated data set shows a strong departure from the neutral expectation: coalescent simulations show that both the mean (0.45) and the variance of Tajima's D (1.42) are significantly too large. We have explored a small range of recent simple bottleneck models to attempt to find one that is consistent with the domestication history of sorghum and with our empirical data. Using θ = 0.0056 based on variation in wild S. bicolor (unpublished data from our lab), and μ = 1 × 10⁻⁸/bp/generation, we estimate ancestral Ne to be 1.4 × 10⁵. A domestication event 5000–6000 years ago (Kimber 2000) would thus correspond to ∼0.01(4Ne) generations. Using coalescent simulations, we were unable to find bottleneck parameters for which both the average D and the variance of D were close to what we observe. We are currently performing more exhaustive analyses with a larger data set that will be published elsewhere. Nonetheless, while the details remain to be determined, it seems reasonable to conclude that a nonequilibrium history has perturbed the frequency spectrum and also elevated linkage disequilibrium in this sample.
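The conversion from θ to ancestral Ne, and from years to coalescent time units, follows directly from θ = 4Neμ. The sketch below makes that arithmetic explicit. It assumes one generation per year, which is an assumption made here for an annual crop and is not stated explicitly in the text; the function names are hypothetical.

```python
def effective_size(theta_per_bp, mu_per_bp):
    """Solve theta = 4 * Ne * mu for Ne."""
    return theta_per_bp / (4 * mu_per_bp)

def time_in_4ne_units(years, ne, years_per_generation=1.0):
    """Express a time in years as a multiple of 4Ne generations.

    years_per_generation=1.0 is an assumption (annual generations).
    """
    generations = years / years_per_generation
    return generations / (4 * ne)

ne = effective_size(0.0056, 1e-8)          # about 1.4e5
t_recent = time_in_4ne_units(5000, ne)     # domestication 5000 years ago
t_older = time_in_4ne_units(6000, ne)      # domestication 6000 years ago
```

Both bounds fall close to 0.01 in units of 4Ne generations, which is the time scale used by coalescent simulators such as ms.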
The weak relationship between association and distance for alleles that are <4 kb apart (Figure 2) is an interesting observation that may also be due to population history. If a bottleneck generated excess LD, that excess LD will have decayed over time more quickly for alleles that are farther apart (i.e., for which r is larger), while closely linked alleles may still retain the signature of that demographic event. A similar pattern is observed at some loci in Arabidopsis; e.g., see Figure 4 in Shepard and Purugganan (2003).
Selection:
While the data may be consistent with a strictly neutral, nonequilibrium model, this of course does not preclude that selection may also have influenced the observed patterns of LD. In particular, we might expect that directional selection associated with domestication has played a role in the cultivated subsample. A multilocus HKA test of polymorphism and divergence (Hudson et al. 1987) in that subsample showed that the data were highly unlikely under a neutral model (P = 0.00001), but there was no convincing evidence of selection at any particular loci. It is possible that the departure is due to demography rather than to selection. Interestingly, subregions A–D of region 1 in the cultivated sample have only 10–15% of the diversity of the total sample, possibly due to selection at the shrunken2 locus near subregion 1A, but this has not resulted in higher estimates of LD for this region in the cultivated sample (see results). Region 3, with the highest LD, contains the phytochrome A locus, for which a previous study showed no evidence of selection in sorghum (White et al. 2004). Thus, while we cannot rule out that selection may have played a role in shaping observed patterns of LD, there is no clear relationship in the data between the extent of LD and any evidence of a history of selection.
Features of recombination in sorghum:
In maize, where LD is much less extensive, it appears that most recombination occurs within genes rather than in intergenic regions, perhaps because the substantial variation in transposable element complements from chromosome to chromosome disrupts the homology necessary for recombination to occur (Fu et al. 2002). Our sequencing strategy was not designed to address this question, and most evidence for recombination occurs between the sequenced subregions in genomic areas that consist of both coding and noncoding sequence. There is evidence of only 12 recombination events [using the four-gamete test of Hudson and Kaplan (1985)] within the 27 fully resequenced subregions, almost all of which correspond to introns. However, the small number of bases surveyed represents only ∼7% of the total length of the genomic regions analyzed (24 of 330 kb). Interestingly, the most gene-rich region, region 3, has the highest LD, while regions 4 and 5, relatively sparse in genes, have the least. Furthermore, rates of recombination, as assessed by the PAC method of Li and Stephens (2003), appeared to be quite uniform across each region. These observations, although anecdotal, do not suggest that recombination in sorghum is concentrated primarily within genes.
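The four-gamete test mentioned above rests on a simple observation: under an infinite-sites model without recurrent mutation, two biallelic sites can show all four allelic combinations only if recombination has occurred between them. The sketch below flags such incompatible site pairs from phased 0/1 haplotype data. It is a simplified illustration: it does not implement the interval-pruning step that Hudson and Kaplan (1985) use to obtain the minimum number of recombination events, and the function names are hypothetical.

```python
from itertools import combinations

def four_gamete_violation(site_a, site_b):
    """True if all four gametes (00, 01, 10, 11) are observed at two sites.

    Each site is a list of 0/1 alleles, one entry per sampled chromosome.
    """
    return len(set(zip(site_a, site_b))) == 4

def incompatible_pairs(sites):
    """Indices of all site pairs that fail the four-gamete test."""
    return [
        (i, j)
        for i, j in combinations(range(len(sites)), 2)
        if four_gamete_violation(sites[i], sites[j])
    ]
```

The number of incompatible pairs generally exceeds the minimum number of recombination events, because a single event can render many site pairs incompatible.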
Haubold et al. (2002) and Nordborg et al. (2005) concluded that, in Arabidopsis, most “recombination” is in fact caused by gene conversion. High rates of gene conversion have also been implicated in the discrepancy between short- and long-range LD in humans (Frisse et al. 2001; Przeworski and Wall 2001; Andolfatto and Wall 2003) and a deficit of LD in short regions of low recombination in Drosophila (Andolfatto and Wall 2003). In sorghum, if there is any discrepancy between short- and long-range LD, it goes in the other direction: short-range LD is more extensive than expected, relative to long-range LD. Consistent with this observation, we saw little evidence for gene conversion, and there is little evidence of recombination within subregions, the scale on which short gene conversion tracts would be observed.
To understand and make use of information about LD, we made empirical estimates of r per base pair. Rates of recombination within and between species vary tremendously, such that a genetic interval defined by two markers 1 cM apart may correspond to 50 kb or 5 Mb of DNA, depending on the organism and the local rate of recombination. In wheat, for example, rates as low as 0.04 cM/Mb and as high as 8.5 cM/Mb have been measured (Gill et al. 1996a,b). In maize, measurements of recombination nodules per cytological distance vary >12-fold across chromosome 1 (Tenaillon et al. 2002). In sorghum, our estimates of r varied only ∼4-fold in five unlinked regions, in the range of 2–8 cM/Mb, and the differences were not significant. All measured rates were higher than the estimate of 1.5 cM/Mb on the basis of total genome size and genetic map distance. This suggests that other genomic regions, not sampled in this study, may be recombinationally relatively inert.
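Because rates are quoted here in cM/Mb while r is defined per base pair, a short conversion sketch may help. It assumes that 1 cM corresponds to a recombination fraction of 0.01 per generation, which is reasonable only over short distances where multiple crossovers are negligible. The function names are hypothetical.

```python
def cm_per_mb_to_per_bp(cm_per_mb):
    """Convert a rate in cM/Mb to recombination events per bp per generation.

    Assumes 1 cM = 0.01 recombination fraction and 1 Mb = 1e6 bp,
    so 1 cM/Mb corresponds to 1e-8 per bp.
    """
    return cm_per_mb * 0.01 / 1e6


def mb_per_cm_to_per_bp(mb_per_cm):
    """Convert a physical-to-genetic ratio in Mb/cM to a per-bp rate."""
    return cm_per_mb_to_per_bp(1.0 / mb_per_cm)


# Rates quoted in the text: the genome-wide average and the regional range.
for rate in (1.5, 2.0, 8.0):
    print(f"{rate} cM/Mb -> {cm_per_mb_to_per_bp(rate):.2e} per bp")
```

Under these assumptions the regional range of 2–8 cM/Mb corresponds to roughly 2 × 10⁻⁸ to 8 × 10⁻⁸ per bp, and the second function handles estimates reported in the reciprocal form of Mb per cM.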
Conclusions:
The extent of LD in sorghum is greater than that in maize, where it is generally low (Tenaillon et al. 2001), and less than that in Arabidopsis (Nordborg et al. 2005). This qualitative observation is consistent with the mixed mating system of sorghum producing intermediate levels of effective recombination. Quantitative analyses, however, based on estimated rates of recombination, mutation, and selfing, show that both ρ and the ratio of ρ to θ are lower than expected under an equilibrium model. These analyses, as well as the frequency spectrum of polymorphism, suggest that a genome-wide departure from equilibrium underlies this phenomenon.
The greater extent of LD in sorghum than in maize makes it amenable to association studies using a limited number of markers. In many cases, genotyping a few SNPs per gene can capture most of the haplotypic variation.
Acknowledgments
We thank Tom Hash for providing the DNAs for the RILs; David Witonsky for help with Maxdip; Don Viands for advice in estimating rates of recombination; Matthew Stephens for help with interpreting the results of PHASE; Chip Aquadro, Ed Buckler, Magnus Nordborg, and anonymous reviewers for comments on the manuscript; and Joy Bergelson for editorial assistance. Support for this project came from grant DBI0115903 from the National Science Foundation to A.H.P. and S.K.
Note added in proof: J.-S. Kim, M. N. Islam-Faridi, P. E. Klein, D. M. Stelly, H. J. Price, R. R. Klein and J. E. Mullet (2005, Comprehensive molecular cytogenetic analysis of sorghum genome architecture: distribution of euchromatin, heterochromatin, genes and recombination in comparison to rice. Genetics 171 (in press)) have recently estimated the average genome-wide rate of recombination for euchromatic regions in S. bicolor. Their estimate is very similar to ours: 0.254 Mbp/cM or 4 × 10⁻⁸/bp.
Footnotes
- Received February 10, 2005.
- Accepted June 2, 2005.
- Copyright © 2005 by the Genetics Society of America