Multilocus surveys of sequence variation can be used to identify targets of directional selection, which are expected to have reduced levels of variation. Following a population bottleneck, the signal of directional selection may be hard to detect because many loci may have low variation by chance and the frequency spectrum of variation may be perturbed in ways that resemble the effects of selection. Cultivated Sorghum bicolor contains a subset of the genetic diversity found in its wild ancestor(s) due to the combined effects of a domestication bottleneck and human selection on traits associated with agriculture. As a framework for distinguishing between the effects of demography and selection, we sequenced 204 loci in a diverse panel of 17 cultivated S. bicolor accessions. Genomewide patterns of diversity depart strongly from equilibrium expectations with regard to the variance of the number of segregating sites, the site frequency spectrum, and haplotype configuration. Furthermore, gene genealogies of most loci with an excess of low frequency variants and/or an excess of segregating sites do not show the characteristic signatures of directional and diversifying selection, respectively. A simple bottleneck model provides an improved but inadequate fit to the data, suggesting the action of other population-level factors, such as population structure and migration. Despite a known history of recent selection, we find little evidence for directional selection, likely due to low statistical power and lack of an appropriate null model.
MULTILOCUS surveys of sequence variation can, in principle, be used to identify targets of selection, since neutral loci are all consistent with a common set of population parameters, while recently selected loci are not (reviewed in Schlotterer 2003 and Storz 2005). A frequent goal of such studies is the identification of targets of adaptive evolution in derived populations that have recently experienced a change of environment and are typically not at equilibrium. Unfortunately, selection in the context of a genomewide departure from equilibrium may be hard to detect because many loci may have low variation by chance and the frequency spectrum of variation may be perturbed in ways that resemble the effects of selection. In these cases, it may be possible to define an alternative nonequilibrium model that describes patterns of variation at neutral loci; outliers in this model are inferred to have experienced selection (Ometto et al. 2005; Stajich and Hahn 2005; Thornton and Andolfatto 2006). Other approaches to this problem are to compare variation in the derived population with that of the putatively ancestral one (Schlotterer 2002; Glinka et al. 2003) or to consider outliers in the empirical distribution as candidate targets of selection (Schmid et al. 2005).
Domesticated species are a special case of derived populations: their population genetic characteristics are a complicated product of the characteristics of the ancestral population modified by demographic events, such as bottlenecks, migration, and nonrandom mating, and by varying degrees of selection on genes underlying traits important to farmers and breeders. Selection during domestication may target alleles that are neutral in the wild ancestor and are segregating at moderate frequencies; in such cases, predictions of simple models of selection on new mutations will not be met (Orr and Betancourt 2001; Innan and Kim 2004; Przeworski et al. 2005). Domesticated plants may also experience introgression to/from wild relatives due to the lack of reproductive isolation between wild ancestors and their very recently derived (and abundant) cultivated descendants, resulting in cultivated individuals that carry wild alleles and outgroup taxa that carry cultivated alleles. Successful use of genomewide scans to identify domestication genes has been largely limited to maize, where patterns of variation at both SSRs and SNPs have revealed loci that appear to have lost more variation than can be accounted for by the domestication bottleneck (Vigouroux et al. 2002; Tenaillon et al. 2004; Wright et al. 2005; Yamasaki et al. 2005).
We are studying the population genetics of the cultivated tropical grass Sorghum bicolor, which was most probably domesticated in eastern Africa 3000–6000 years ago, subsequently spread to the entire African continent, and reached Asia during the first millennium (Kimber 2000). Much more recently (mid-nineteenth century), sorghum was brought to the United States. There is a great deal of morphological diversity in cultivated sorghum, in characters such as seed size and color, panicle architecture, and height. Furthermore, because sorghum is grown in environments that differ dramatically from one another (e.g., in rainfall, temperature, soil type, and elevation), there must be physiological diversity as well (Doggett 1988). We are interested in characterizing the patterns of genetic variation in worldwide samples of grain sorghum, with the goal of identifying loci that are important in domestication, local adaptation, and agronomic performance. In a previous study of genomewide patterns of sequence variation (Hamblin et al. 2004), 95 loci of average length 275 bp were surveyed. In this study, we resequenced 204 additional loci of average length 671 bp distributed throughout the genome in a sample of 16 cultivated grain sorghum landraces (i.e., not elite, modern cultivars) as well as the reference elite cultivar BTx623 and an outgroup, S. propinquum. We characterize the overall patterns of sequence variation, compare those patterns to the predictions of some simple bottleneck models, and test for evidence of both directional and diversifying selection.
MATERIALS AND METHODS
Genomewide diversity was assessed in 17 S. bicolor accessions (16 landraces and BTx623) and one individual of a wild relative, S. propinquum (Table 1). These accessions composed all racial types of cultivated grain sorghum and represented a wide geographic sampling from the species' center of diversity (Africa). Seeds of the cultivated material (landraces) were obtained from either the National Center for Genetic Resources Preservation [United States Department of Agriculture (USDA)/Agricultural Research Service (ARS), Fort Collins, CO] or the Plant Genetic Resources Conservation Unit (USDA/ARS, Griffin, GA) and information on geographical origin and racial classification was gathered primarily from the System-wide Information Network for Genetic Resources database (http://singer.cgiar.org/Search/SINGER/search.htm). BTx623 seeds were kindly provided by William Rooney (Texas A&M University).
Loci and primer design:
We tested a total of 293 genetically mapped loci. These comprised sequenced genomic RFLP probes and cDNA clones from several species including sorghum (n = 160), johnsongrass (68), sugarcane (14), maize (21), buffelgrass (8), barley (2), oat (3), wheat (1), rice (13), and Arabidopsis (3) (locus name, map positions, GenBank accession numbers, and locus origin can be found at http://igd.tc.cornell.edu). Loci derived from species other than S. bicolor were analyzed only if sorghum homologs could be identified through database searches.
When possible, we extended the length of the original DNA sequence of each locus through iterative searches against the GenBank GSS and EST databases. Primers were designed from these contigs so that in some cases primers amplified regions flanking the original locus. Loci that could be successfully amplified from BTx623 and S. propinquum DNA were then amplified from the panel of 16 sorghum accessions.
Sequencing and analysis:
Total genomic DNA was isolated from individual seedlings following a standard CTAB extraction protocol (Doyle and Doyle 1987) and used as template in PCRs following previously established protocols (Casa et al. 2005). PCR products were prepared for direct sequencing by treatment with exonuclease I (New England Biolabs) and shrimp alkaline phosphatase (Promega) following the manufacturers' instructions. Single-pass sequencing was performed at the BioResource Center (Cornell University) using a single PCR primer, as most individuals were homozygous. Double-pass sequences were obtained only when putative heterozygotes were observed. Chromatograms were assembled into contigs using Sequencher (Gene Codes, Ann Arbor, MI) software. Alignments were then visually inspected and manually edited. Each set of sequence chromatograms was inspected independently by two or three people.
Summary statistics of diversity and divergence were obtained with DnaSP version 4.0 (Rozas et al. 2003). Blocks of three or more contiguous SNPs were disregarded. Insertion/deletion polymorphisms were not considered in either the diversity or divergence analyses. Loci were tested for departure from neutrality using the method of Hudson et al. (1987) implemented by the multilocus HKA software (Jody Hey, available at http://lifesci.rutgers.edu/∼heylab/index.html). One locus, pSB1812, was not tested with all 203 other loci because it caused some sort of (unknown) incompatibility in the simulations. When tested with a subset of the data, this locus showed no unusual pattern of polymorphism and divergence. Significance of individual loci was assessed by removing the most significant locus and testing the remaining n − 1 loci iteratively until no significant locus was detected. The HKA program was also used to test for significance of average D (Tajima 1989) under the standard neutral model and to estimate θ (4Neμ) on the basis of both polymorphism and divergence.
Assignment of coding regions:
To determine whether sequenced regions corresponded to protein-coding loci, consensus sequences obtained from aligned loci were used in database searches (blastn and blastx) using GenBank default parameters. For classifying a region as an open reading frame, we used the criteria described by Hamblin et al. (2004).
Models of population history were simulated using the program ms (Hudson 2002) with the following assumptions: (1) ancestral θ (4Neμ) is the same as θ in the wild population today and is estimated to be 0.0057/bp on the basis of variation at 24 loci in 4–26 accessions of S. bicolor ssp. verticilliflorum (A. Zamora and our unpublished data); (2) θ at individual loci as estimated by the program HKA (see above) was scaled relative to the ancestral θ; (3) ancestral Ne is calculated as follows: 4Neμ = 0.0057; therefore Ne = 0.0057/4/1 × 10−8 = 1.43 × 105, where a neutral mutation rate of 1 × 10−8/bp/generation is based on sequence divergence at 11 loci between maize and sorghum (Swigonova et al. 2004); (4) the time of domestication is assumed to be in the range of 3000–6000 years ago, on the basis of archeological evidence, roughly equivalent to 0.005–0.01 when scaled in 4Ne generations, at 1 generation/year; and (5) 4Ner is estimated as 4 × (1.43 × 105) × (4 × 10−8 crossovers/bp/generation) × locus length × 0.46, where 0.46 = (1 − F) to account for a self-pollination rate of 0.7 (Hamblin et al. 2005). A recent analysis of genetic vs. physical distance on a chromosomal scale came to an identical estimate of average r (0.25 Mb/cM) for euchromatic regions (Kim et al. 2005). We simulated a small set of models where all parameters were fixed except for the size of the bottlenecked population, which was fit to the observed value of average S, and tested whether other summary statistics produced by the model were consistent with the data.
Tests of haplotype number:
The probability of observing K haplotypes given an observed number of segregating sites, S, was obtained using the program haploconfig (Innan et al. 2005). This program generates gene genealogies on the basis of an input value (or range) of θ and accepts only those that have the observed number of segregating sites. Using the overall average value of θ for this data set (1.5), large numbers of segregating sites are very unlikely to be observed and the acceptance rate becomes unreasonably low. We therefore performed simulations where θ was chosen to maximize the acceptance rate. For S < 18, the acceptance rate was ≤5%, but for large values of θ, the probability of observing any particular value of S becomes quite small even when θ is optimized. These simulations were performed (1) without recombination, which is very conservative, and (2) with 4Ner = 5, which is slightly conservative on the basis of empirical estimates of crossing over (see Simulations).
The goal of this study was to characterize genomewide patterns of sequence diversity for cultivated sorghum as a framework for distinguishing between demographic and selective factors as causes of patterns observed at individual loci. PCR primers were designed for 293 genetically mapped loci, distributed throughout the genome (Bowers et al. 2003), and tested in BTx623 and S. propinquum. Of the 293 primer pairs tested, 89 were discarded due to either failed amplification or amplification of multiple products. Our final data set, therefore, consisted of 204 loci, for an average spacing of 5.2 cM (range 0–30 cM) between loci. These loci, of average size 671 bp, were sequenced in a panel of 17 cultivated accessions (Table 1), selected to represent a morphologically, geographically, and genetically diverse subset of 73 lines previously assessed with simple sequence repeats (SSRs) (Casa et al. 2005). In addition, all loci were sequenced in one accession of S. propinquum. While most individuals were homozygous at most loci, as expected for a species with a high rate of self-pollination (Doggett 1988), a total of 56 heterozygous genotypes were identified at 46 loci. A majority of these instances of heterozygosity were observed in one individual, accession NSL87902 (S. bicolor race durra from Cameroon), which exhibited two alleles at 39 loci; in three instances (on chromosomes 5, 6, and 7) blocks of 3 or more consecutive loci were heterozygous, spanning 5.4–13.8 cM, suggesting a fairly recent history of outcrossing in this accession.
Patterns of sequence diversity across the genome:
Approximately 138 kb of DNA sequence (alignment gaps excluded) were surveyed per individual. Forty-three loci (∼21%) were invariant within cultivated S. bicolor. One of these loci was also invariant across species. Levels of polymorphism and divergence at individual loci are presented as supplemental data (Table S1 at http://www.genetics.org/supplemental/). Table 2 presents summary statistics from this study as well as those from our previous study (Hamblin et al. 2004). Levels of nucleotide diversity and divergence are consistent with the previous estimates based on a different set of loci in a different species-wide sample that was also chosen to capture racial and geographic diversity, but without prior knowledge of genetic relationships.
A total of 324 indels were identified in 96 of the 204 loci evaluated within S. bicolor and ranged from 1 to 15 per locus. In most cases (306/324, ∼95%), length polymorphisms were short (<20 bases). This proportion is similar to that observed in maize, where 92% of nonmicrosatellite indels were <20 bp in length (Tenaillon et al. 2002). DNA secondary structure prediction programs and BLAST searches revealed that 4 of the 13 indels >20 bp contained insertions similar to miniature inverted repeat transposable elements (MITEs). When queried against the EST database, these putative transposable elements had matches to sorghum EST sequences derived mostly from stress-induced libraries (e.g., wound, salt, and heat shock).
Consensus DNA sequence from each locus was used in BLAST searches (see materials and methods) to identify exons and introns for analysis by functional category. Results from this partitioning are presented in Table 3. Across loci, the numbers of synonymous and nonsynonymous changes within vs. between species showed no departure from the neutral equilibrium expectation (McDonald and Kreitman 1991) (Table 4). This result contrasts with that of our earlier, smaller study, which found a significant excess of replacement polymorphism (Hamblin et al. 2004). Diversity levels in introns were lower (π = 0.24%) than at synonymous sites (π = 0.38%), consistent with observations in other species.
Genomewide departures from equilibrium:
The frequency spectrum of variation was measured by two statistics: D (Tajima 1989), which summarizes the folded frequency spectrum (i.e., sites at frequency 1 are equivalent to sites at frequency n − 1), and H (Fay and Wu 2000), which becomes more negative when derived alleles are in high frequency relative to other alleles. Observed averages and variances of these statistics were compared to their expectations under the standard neutral model (first row of Table 5). While the average value of D is close to its neutral expectation of zero, the average value of H is quite negative, and the variances of both D and H are very large. Likewise, the variance of S is very large (42): in 100 coalescent simulations of our 204-locus data set under the standard neutral model (SNM) with average S = 5, the variance of S ranged from 13.3 to 26.4, with a median value of 18.7.
Haplotype number should be an increasing function of S, but this relationship is weak in our data set: the maximum number of haplotypes observed was eight, even though loci with 20 or more segregating sites were observed. The expected number of haplotypes, given S, depends on θ as well as the rate of recombination, so it is most easily estimated by simulations (Depaulis and Veuille 1998). For all loci with S ≥ 5, we tested whether the number of haplotypes was unusual, using simulations based on θ and conditioned on S (see materials and methods). These simulations showed that a large fraction of loci had significantly fewer haplotypes than expected (Table 6). Even under a very conservative (and unrealistic) assumption of no recombination, 30% of tested loci had a significantly low haplotype number. These results are consistent with a previous observation that linkage disequilibrium (LD) in sorghum is more extensive than expected under assumptions of equilibrium (Hamblin et al. 2005).
These genomewide departures from equilibrium have important consequences for the interpretation of test statistics at individual loci. Of the 160 individual values of D that we observed, 28 (17%) would be considered significant under an equilibrium model with recombination; given the large variance in D, however, this number clearly includes many false positives. Similarly, most of the 32 significantly negative values of H (20%) can likely be attributed to the sixfold greater variance of H relative to that of the SNM. In the case of D, it is not only the large variance that raises concerns in interpretation, but also the haplotype structure. For the 10 loci with the most strongly negative D (all < −1.9), a feature that is associated with selective sweeps, the distribution of variation among haplotypes is not that expected following simple directional selection, namely a star-shaped genealogy. Rather, these negative D values are all a consequence of a large number of singletons falling on one to three lineages (Table 7). The genealogy that we observe at these loci could result from recombination during a selective sweep, which may produce an excess of high-frequency derived variants detectable by the H statistic (see Figure 2 of Fay and Wu 2000). While the power of the H statistic drops off quickly after the fixation of a favorable allele (Przeworski 2002), selection during the domestication of sorghum is recent enough (<0.08 Ne generations ago) that power should be ≥30%, so some of the very low H values in our data set may be evidence of selection. However, as Przeworski (2002) has shown, a significant H statistic is “not a unique signature of positive selection.” Population structure can give rise to an excess of significant H values, as could introgression from a divergent population. Therefore, in the absence of independent evidence, these results must be interpreted with caution. Neutral processes likely explain most of the extreme values of H observed in this data set.
Tests of selection based on levels of polymorphism and divergence:
We used the HKA method of Hudson et al. (1987) to test whether there is a consistent relationship between polymorphism and divergence at unlinked loci, as would be expected if all loci have the same effective population size and time of divergence to the outgroup. A multilocus HKA test of our data showed that the data set as a whole is highly unlikely under a neutral model, with an HKA statistic of 458.8 (202 d.f., P < 0.0005). It is not clear how robust the HKA test is to departures from equilibrium; the large variance in S that we observe is among, not within, loci, and the test assumes that θ varies among loci. However, it is likely that a bottleneck will inflate the variance in ratios of polymorphism to divergence, making the test anticonservative (e.g., Hammer et al. 2004; Haddrill et al. 2005). These results, therefore, should only be used to identify outliers, rather than as a test of significance. Of the 5% of loci that contributed the most to the HKA statistic, only one (PRC0378) showed a deficiency of polymorphism relative to divergence (Table 8), the expected signature of directional selection.
One possible reason for the lack of evidence of directional selection is that the low variation in sorghum, as well as the relatively low divergence to the outgroup, leads to poor statistical power. When 21% of loci have no variation, it is difficult to conclude that any particular instance of low variation is unusual unless divergence is extremely high. There were seven loci with a deficiency of variation in the 10% that contributed most to the HKA statistic. We collected additional data for four of these loci; if these regions actually have experienced directional selection, increasing the number of sites surveyed should increase the significance of the departure. In three of these cases, polymorphism and divergence were less unusual when the additional data were included, suggesting that the apparent deficiency of polymorphism was due to chance. In the fourth case, however (PRC0378, Table 8), the additional data resulted in a greater departure from the neutral expectation, suggesting that this region may be a good candidate for having experienced directional selection.
The sequence for PRC0378 was obtained from an EST library of rhizome cDNAs from S. halepense and maps to a QTL for rhizome traits in a cross between S. bicolor and S. propinquum (Hu et al. 2003). The QTL extends over a large region of the sorghum map, however, so this could easily be coincidence. Since neither wild nor cultivated S. bicolor has rhizomes, underground stems involved in clonal propagation and associated with perenniality, a selective sweep associated with a rhizome trait would likely have occurred in the ancestral population rather than during domestication.
As for the six loci that appear to have an excess of variation, three are among the loci that are highlighted in Table 7 because of their excess of singletons on a few lineages, and the fourth, pHER-1E07, shares this pattern. It is the large number of singletons, rather than a signature characteristic of diversifying selection, that has produced the significant test statistic at those loci. The two remaining loci, pSB0643 and pSB1804, have positive D values and gene genealogies that are more consistent with diversifying selection. Locus pSB0643 is closely linked to Dw2, a locus associated with variation in plant height that is expected to show local adaptation. Variation at pSB0643 is distributed in two clades of 7 and 10 individuals with 12 fixed differences between them (Figure 1a); there is no obvious geographic or phenotypic structure to the groups (see Table 1). Variation at locus pSB1804, which has homology to a transport-like protein in rice (XP_466724), is also distributed in two clades (Figure 1b). There are 16 fixed differences, including 9 amino acid differences, between them; 2 unique haplotypes do not fall into either clade. Again, there is no obvious geographic or phenotypic structure to the variation.
Given that multiple features of the data—the variance of the frequency spectrum, variance of S, and haplotype structure—depart strongly from equilibrium expectations, and given that sorghum has a history of domestication, it is reasonable to ask to what extent a recent bottleneck model of population history can explain these patterns. We examined a small number of simple bottleneck models in which most of the parameters were estimated on the basis of independent data (explained in detail in materials and methods). The average ancestral population mutation parameter (4Neμ) was fixed at 3.8 (with 4Neμ at individual loci scaled accordingly), the population recombination parameter (4Ner) was fixed at 0.01/bp, and it was assumed that the size of the current population and the ancestral population are the same (i.e., N0 = 1). Using our estimate of ancestral 4Neμ, we inferred Ne, which then allowed us to calculate a plausible time of the bottleneck (T0) on the basis of archeological data: a domestication time 3000–6000 years ago would correspond to 0.005–0.010 × (4Ne) generations. The intensity of the bottleneck (i.e., length and size reduction) was adjusted to produce the observed value of S, and other resulting summary statistics were recorded.
We used a T0 of 0.005 as a starting point and increased it incrementally to assess the fit of the resulting summary statistics (Table 5). For all models tested, the variances of the summary statistics were larger and H was more negative. However, models that were consistent with the estimated time of the bottleneck, i.e., T0 = 0.005–0.010, had average D values much larger than we observed (BN1 and BN2 in Table 5). Only when the time since the bottleneck was increased to 0.025, arguably equivalent to 14,000 years ago, did D approach the observed value (BN5). Conversely, BN1 and BN2 produced average values of H that were closer to the observed data, but the fit to H was worse in BN5.
The median values produced by these models (Table 5) show that none of these models is a good fit to the data, but they do not tell us whether the models exclude the data. Because the variances are large, any given iteration of a particular model might, by chance, produce a result that is much closer to the observed data. Indeed, Figure 2 shows that all five bottleneck models could produce the observed variance in S, variance in D, average H, and variance of H, although these values are in some cases in the tails of the distributions. However, for the recent bottlenecks, the range of average D (in 100 simulations) does not include the observed value.
Genomewide sequence variation in a species-wide sample of cultivated S. bicolor is strongly perturbed from equilibrium expectations with regard to the variance of the number of segregating sites, the variance of the site frequency spectrum, and haplotype configuration. This presents a serious challenge for identification of loci that may have experienced selection, as many tests for selection are based on these same properties of the data. This problem has been recognized and extensively explored in the model organisms of population genetics—Drosophila, humans, and Arabidopsis—all of which have at least some populations that clearly are not at equilibrium (e.g., Andolfatto and Przeworski 2000; Pluzhnikov et al. 2002; Glinka et al. 2003; Marth et al. 2004; Nordborg et al. 2005; Schmid et al. 2005). Recently, there has been considerable effort to explore the effects of demographic changes and population structure on summary statistics and to find alternative models that reproduce some or many features of the data. While it has been possible for both maize and Drosophila to find fairly simple alternative models that capture many features of the data (e.g., Haddrill et al. 2005; Wright et al. 2005), this has not been possible in Arabidopsis (Nordborg et al. 2005; Schmid et al. 2005). Our results suggest that the evolutionary history of sorghum, like that of Arabidopsis, has been too complex to allow use of a simple alternative model.
The time of a domestication bottleneck:
The statistic that varied the most among models was Tajima's D (Figure 2). A bottleneck 0.005 or 0.010 4Ne generations ago, our best estimate based on independent data (see materials and methods), produces average D values that are strongly positive due to the loss of rare alleles. Models that place the bottleneck further in the past produced a better fit to the data (implying that our estimate of ancestral Ne is too large), in which D is very slightly negative. This is because additional time since the bottleneck allows for the accumulation of new, rare variants at loci where variation had been eliminated (a star-shaped genealogy). Thus there are more strongly negative and more strongly positive D's than expected under neutrality, producing a large variance in D, but an average near zero, as we observe. The strongly negative D's in our data set, however, are not produced by star-shaped genealogies (see results and Table 7), suggesting that this scenario is not appropriate and that fitting a model to summary statistics can be misleading. Furthermore, the fit to H, the variance of H, and the variance of S are all marginal (see Figure 2).
To reconcile an older bottleneck (i.e., T0 = 0.025) with a domestication time of 6000 years ago would require either that our estimate of μ be twofold larger or that our estimate of ancestral 4Neμ/bp (θA) be twofold smaller. Thus some combination of higher μ, smaller θA, and older time of domestication may be able to explain the data. Our estimate of θA may, in fact, be somewhat inflated, if our sample of wild accessions includes individuals from subpopulations that did not contribute to the cultivated population. However, a twofold smaller θA would be very similar to 4Neμ in cultivated sorghum, inconsistent with a bottleneck.
Another piece of evidence to be considered is that a different species-wide sorghum sample produced a sequence data set with a positive average value of D (0.29; see Table 2). This positive value is consistent with a more recent bottleneck and with the other independent data. For this reason, we suggest that a more recent bottleneck is more likely correct for cultivated sorghum, but that the true evolutionary model includes other factors (e.g., population structure and migration) that generate a sufficiently large variance that both positive and near-zero average D's can be observed when different loci, or different individuals, are sampled. Some models that include population structure and migration do have sufficiently large variances, although they are a poor fit to other aspects of the data (not shown).
The discrepancy in average Tajima's D, as well as a difference in the ratio of nonsynonymous to synonymous polymorphism, in different samples raises some issues about sampling (see Tables 2 and 4). Both of these samples were intended to be representative of species-wide diversity in S. bicolor, but were chosen using slightly different criteria. The Hamblin et al. (2004) sample (n = 22 accessions) was chosen to maximize geographic distribution and morphological variation, as no genetic data were available at that time. Criteria for the sample in the current study (n = 17 accessions) also included maximization of genetic diversity, as assessed by variation at 74 SSR loci (Casa et al. 2005); it contains 8 accessions in common with the 2004 study.
Independent samples chosen randomly from a true single population should have similar properties, i.e., the lineages should be exchangeable; the fact that they are not suggests that all members of the population do not share the same history. In this case, the sample was not drawn from a geographically restricted locality but instead is scattered, a sampling technique that is appropriate when species-wide variation is of interest and when natural populations do not exist. When each individual comes from a different deme, so that only the collecting phase of the genealogy (coalescence and migration among demes) is captured by the sample, the properties of the sample should be similar to those of a panmictic population (Wakeley 2004). In our data, it appears that the chance sampling of more divergent haplotypes has contributed enough low frequency variants to bring the average D close to zero even though another sample showed the expected effect of a recent bottleneck, namely a positive average D. These divergent haplotypes could be due to population structure in the ancestral and/or current population, in a scenario that violates the assumptions of Wakeley's analysis. We may have inadvertently increased the probability of sampling divergent lineages in this study, since we maximized genetic distance in selecting the sample. Given the star-shaped sorghum genealogy based on variation at SSRs (see Figure 1 in Casa et al. 2005), we did not expect this effect; however, the Casa et al. study may have been too small to detect structure. Early studies based on small numbers of individuals and markers concluded that Arabidopsis has no population structure, but later studies have come to quite different conclusions (Nordborg et al. 2005; Schmid et al. 2006). Unlike Nordborg et al. (2005), we are unable to attribute the divergent haplotypes to one or two lineages; many accessions contributed divergent haplotypes at different loci.
With regard to the discrepancy in the ratio of nonsynonymous to synonymous polymorphism, the difference is due to the number of replacement variants detected, as the level of synonymous polymorphism is very similar in the two studies. This could be a consequence of sampling but could also be due to sample size if amino acid polymorphism is slightly deleterious, since some fraction of low frequency amino acid variants that would be observed in a sample of 22 would be missed in a sample of 16.
Lack of evidence of directional selection:
Because of sorghum's history of domestication, which was very recent in evolutionary terms, we expected to see evidence of directional selection at some loci but found none when we used the multilocus HKA test (Table 8). Strong evidence of directional selection was also lacking in two previous studies of genomewide variation in S. bicolor (Casa et al. 2005; Hamblin et al. 2004). Considering the three studies together, a total of 445 loci (371 sequenced loci totalling167 kb, plus 74 SSR loci) have been surveyed. While the loci surveyed had been genetically mapped, they were mapped in a cross between S. bicolor and S. propinquum and thus were not biased to be variable within S. bicolor.
If many loci in the data set have experienced selection, the multilocus HKA test may be overly conservative because the overall distribution is non-neutral. To assess the effect that this might have had on our results, we also performed HKA tests using pooled data from a putatively neutral reference subset of 22 loci chosen by the following criteria: the loci had (1) a small (<0.5) deviation in the HKA test of all loci and (2) a |D| statistic of <0.5. In tests of each locus vs. the pooled data, 42 were significant at the 0.05 level: 23 loci with a deficiency of variation and 19 loci with an excess of variation. This far exceeds the 10 loci expected by chance, suggesting that some of these loci are true outliers. Conversely, none of the loci met the Bonferroni-corrected significance criterion of P < 0.05/204 = 0.0002, so this procedure does not change our conclusions.
Comparisons with maize:
In cultivated maize, Wright et al. (2005) estimated that a minimum of 2–4% of the genome has experienced directional selection. If the proportion in sorghum were similar, and the probability that any random locus has experienced directional selection were 0.02, then the chance that 445 randomly chosen loci include zero selected loci would be quite small (0.00012). Furthermore, because of more extensive LD in sorghum (Hamblin et al. 2005), we expect the footprint of selection to be more extensive, increasing the percentage of the genome affected. There are a number of possible factors that might explain the difference between our results and those obtained for maize. Some of those factors have to do with the biology and history of the organism, while others have to do with the methodological approaches used to study them.
The simplest and most obvious explanation for our negative results is that low variation in sorghum provides poor power for detecting directional selection and that much larger regions need to be sequenced to obtain sufficient power. While the average size of regions sequenced in this study is twice the average size surveyed in Wright et al. (2005), the average number of segregating sites per locus is roughly one-fourth. This implies that genomewide scans in organisms with low variation may require at least several kilobases of sequence at each locus. Sorghum and maize also have a major difference in mating system: sorghum's high frequency of self-pollination reduces the effective rate of recombination, which increases linkage disequilibrium and likely contributes to the very large variances associated with summary statistics. Finally, the domestication process in sorghum and maize may have been quite different. While maize has undergone a very dramatic change in plant growth habit and ear morphology compared to the progenitor teosinte (Doebley and Stec 1993), the morphological changes in domesticated sorghum are more subtle and fewer major genes may be involved. Selection in sorghum may have acted more frequently on standing variation than on new or rare mutations; in such cases, the signal of directional selection may be weak (Orr and Betancourt 2001; Innan and Kim 2004; Przeworski et al. 2005). Clear signals of selection may also have been obscured if domestication spread through a structured population, producing patterns of neutral variation very different from those expected following selection in a panmictic population (Slatkin and Wiehe 1998; Santiago and Caballero 2005).
Tests for directional selection in maize have largely been based on simulations that model the bottleneck from teosinte to maize (Vigouroux et al. 2002; Tenaillon et al. 2004; Wright et al. 2005), while our approach in this study was to look for a deficiency of variation relative to divergence to the only available sequence of an outgroup, S. propinquum, which is not the direct ancestor of cultivated sorghum. In cases where maize genes have been tested by both methods, it has been suggested that the comparison between maize and teosinte produces more significant results. Tenaillon et al. (2004), for example, concluded that “the addition of parviglumis [teosinte] data has improved significantly the ability to infer selection,” when they found strong evidence of selection on ts2 and d8, two genes for which the HKA test had produced equivocal results. In a recent article (Yamasaki et al. 2005), both methods were used to test for evidence of selection during domestication, and the difference in results was small; using the simulations, six loci showed significant evidence for selection, while five were significant by the HKA test. The difference was greater for loci showing evidence of improvement (i.e., selection in inbreds only). Whether this difference represents increased power or a higher false-positive rate is not known.
Comparisons with Drosophila:
Detecting evidence of adaptation in the context of a population bottleneck is a problem also faced in population genetic studies of Drosophila, as non-African (i.e., derived) populations show genomewide departures from equilibrium and reduced variation relative to African populations. In some cases, a bottleneck model can explain most of the reduction in variation in non-African populations (Kauer et al. 2003; Haddrill et al. 2005; Thornton and Andolfatto 2006). While multilocus patterns are often suggestive of adaptive evolution in derived populations (Orengo and Aguade 2004; Schofl and Schlotterer 2004), it has been hard to demonstrate that selection has acted on a particular locus (but see Bauer Dumont and Aquadro 2005; Catania and Schlotterer 2005 for exceptions). Furthermore, apparent signals of selection in non-African populations often appear, upon further study, to be amplified signals of selection in the ancestral African populations (Li and Stephan 2005; Beisswanger et al. 2006; Pool et al. 2006), rather than evidence of adaptation to temperate environments. Interestingly, Ometto et al. (2005) found that a large proportion (>60%) of unusual loci in a non-African sample of Drosophila had an excess, rather than a deficiency, of polymorphism. As in our case, this may reflect increased power for detecting diversifying selection in a population with overall low levels of variation. Thus the difficulty of detecting directional selection in a bottlenecked population is not unique to domesticated species and presents important challenges for the field of population genetics.
Detecting a locus-specific reduction in variation, diagnostic of an episode of directional selection, is difficult when levels of variation at neutral loci are already low. When low levels of variation are caused by a bottleneck, they are accompanied by perturbations of the frequency spectrum and increased LD that further decrease the signal-to-noise ratio. In the case of S. bicolor it appears that, in addition to a bottleneck, other population-level phenomena have contributed to the observed departures from equilibrium. We base this conclusion on the observation that simple bottleneck models are inconsistent with the data and that, while several loci had extreme negative values of Tajima's D statistic, in not one case was this due to a star-like genealogy. Other phenomena that might underlie this observation could be, for example, ancestral population structure, multiple domestications, or introgression from wild conspecifics or congeners. More sophisticated models (incorporating data from wild populations) may, in principle, allow disentanglement of these factors. Phenotypic analyses in experimental populations, however, complementing population genetic analyses, may prove to be a more fruitful strategy for elucidating the genetic basis of the cultivated phenotype (Wright and Gaut 2005).
We thank Sharon Mitchell for designing some of the PCR primers; Joey Bedell for providing access to sequence data prior to publication; Alejandro Zamora for providing estimates of diversity in wild sorghum; Jeff Jensen, Kevin Thornton, and John Pool for advice about modeling; Kangyu Zhang for providing source code for haploconfig; Baohua Wang for help with data formatting and submission to GenBank; Tessa Dumont, Amanda Garris, Gael Pressoir, and two anonymous reviewers for comments on the manuscript. This project was supported by grant DBI0115903 from the National Science Foundation to A.H.P., C.F.A., and S.K. A.M.C., C.F.A., S.K., and A.H.P. designed the study. A.M.C., H.S., and S.C.M. collected the data. M.T.H., A.M.C., and S.C.M. analyzed the data. M.T.H. wrote the paper.
- Received December 5, 2005.
- Accepted March 13, 2006.
- Copyright © 2006 by the Genetics Society of America