Sorghum has shown the adaptability necessary to sustain its improvement over time and across its geographical expansion, despite a genetic foundation constricted by domestication bottlenecks. Initially domesticated in the northeastern part of sub-Saharan Africa several millennia ago, sorghum quickly spread throughout Africa and to Asia. We performed phylogeographic analysis of sequence diversity for six candidate genes for grain quality (Shrunken2, Brittle2, Soluble starch synthaseI, Waxy, Amylose extender1, and Opaque2) in a representative sample of sorghum cultivars. Haplotypes along 1-kb segments appeared little affected by recombination. Sequence similarity enabled clustering of closely related alleles and discrimination of two or three distantly related groups depending on the gene. This scheme indicated that sorghum domestication involved structured founder populations, while confirming a specific status for the guinea margaritiferum subrace. Allele rooted genealogy revealed derivation relationships by mutation or, less frequently, by recombination. Comparison of germplasm compartments revealed contrasts between genes. Sh2, Bt2, and SssI displayed a loss of diversity outside the area of origin of sorghum, whereas O2 and, to some extent, Wx and Ae1 displayed novel variation, derived from postdomestication mutations. These are likely to have been conserved under the effect of human selection, thus releasing valuable neodiversity whose extent will influence germplasm management strategies.
What is the genetic basis of crop success? Domestication, that is, the outcome of a selection process leading to increased adaptation of plants to cultivation and utilization by humans, can be viewed as a long-term selection experiment (Gepts 2004). It is generally considered as (i) driven by the selection of the most favorable alleles at genes involved in important and visible traits, and (ii) likely accompanied by a significant loss of diversity in the rest of the genome, due to genetic drift by random sampling among preexisting diversity. The genetic architecture of those traits that are part of the “domestication syndrome” is essential in determining whether crop plant selection is efficient and crop domestication possible. Yet crop success is also determined by the potential for steady and diversified progress in terms of adaptation to new environments, making the crop able to accompany man in his early migrations and agricultural colonization of new regions. In many cases, continual spontaneous, then breeder-induced, introgression from wild relatives represents a major source of diversity among modern cultivars. Some crops, such as rice, have probably been domesticated several times (Second 1985), thus providing a basis for cultivar diversification through introgression among contrasting early domesticates. Other crop species, however, do not seem to have the same opportunities for a broadening of their genetic basis; recent allopolyploids, such as bread wheat or groundnut, are in this category. Therefore the process of crop evolution encompasses spontaneous variation that could expand the initial genetic basis and lead to further adaptability.
The rate and magnitude of mutations and their role in the course of domestication is still a matter of conjecture (Gepts 2004). The recent progress of physiological trait understanding and of molecular investigation methods makes it now possible to focus diversity surveys on individual genes involved in traits of interest in agriculture and to resolve fine sequence variation. Allele sequence polymorphism generally enables the deciphering of allele genealogies, which opens the way to studying processes in time, such as domestication. Olsen and Purugganan (2002) presented a case of very fruitful phylogeographic analysis of genealogies among the various alleles found at the Wx locus, where the “glutinous” phenotype in rice is encoded.
We used diverse molecular approaches to understand diversity in sorghum. Sorghum (Sorghum bicolor L. Moench) is an annual, predominantly inbreeding, cereal of African origin with five recognized races within cultivated forms (ssp. bicolor), namely bicolor, caudatum, durra, guinea, and kafir, as well as 10 intermediate types. This species has been studied with various types of molecular markers (Ollitrault et al. 1989; Aldrich et al. 1992; Deu et al. 1994, 1995; Cui et al. 1995; De Oliveira et al. 1996; Menkir et al. 1997; Djè et al. 2000; Grenier et al. 2000; Casa et al. 2005) and has recently undergone detailed sequence diversity analysis using small representative panels of diverse accessions (Hamblin et al. 2004, 2006), which enabled investigation of linkage disequilibrium (LD) and patterns of selection among several hundred loci. This resulted in a fine understanding of ecogeographic patterns of variation in sorghum, in particular with reference to its geographic spreading out of the area of origin in the northeastern part of sub-Saharan Africa. For integrated characterization, we have decided to focus on a core sample of 210 accessions of diverse geographic origin from International Crops Research Institute for the Semi-Arid Tropics (ICRISAT) and Centre de Coopération Internationale en Recherche Agronomique pour le Développement (CIRAD) germplasm banks, established to represent cultivated sorghum landraces from around the world, with sampling on the basis of race, as per the scheme of Harlan and de Wet (1972), geographical origin, response to day length, and production system. As per Deu et al. (2006), RFLP diversity among those accessions led to identification of 10 clusters that appeared to feature combinations of race and geographical origin. 
They distinguished: guinea accessions from western Africa (cluster 1), guinea margaritiferum subrace from western Africa (cluster 2), durra accessions from central and eastern Africa and from Asia (cluster 3), bicolor and caudatum accessions from China (cluster 4), caudatum accessions from Africa (cluster 5), a group of transplanted caudatum and durra accessions from Lake Chad region (cluster 6), accessions of the kafir race from southern Africa (cluster 7), guinea accessions from southern Africa (cluster 8) and from Asia (cluster 9), and accessions of the caudatum race from the African Great Lakes region (cluster 10). The accessions that did not fall into one of those clusters were most frequent in central and eastern Africa and usually classified as intermediate races (e.g., durra-caudatum) or bicolor. The latter race bicolor is recognized as a diverse set of primitive forms that are closer to wild sorghum than the other four races (Harlan 1995) and relates to the early domesticates prior to other race differentiation. Most of the bicolor accessions were intermediate; they did not fall into clear molecular-marker-based clusters, and they commonly displayed rare alleles (Deu et al. 1994, 2006). Cluster 2 was more differentiated from the others. It corresponded to the guinea margaritiferum types, which were also differentiated from the rest of the S. bicolor ssp. bicolor races for cytoplasmic markers (Deu et al. 1995). The differences between the other clusters were based on contrasted frequencies of shared alleles, rather than on diagnostic alleles that discriminated one group from all the others.
It is commonly agreed that early domestication of sorghum took place in the northeastern part of the distribution in Africa between Lake Chad and Ethiopia, giving rise to early bicolor types (de Wet et al. 1976; Harlan 1995). Then migrations, starting >3 millennia ago, led to the emergence of the guinea types westward and to the kafir types southward. The caudatum types emerged in the center of origin and later spread in both directions. The durra types are predominant in South Asia and northeastern Africa; it is unclear whether they appeared first in Africa (de Wet and Huckabay 1967; Doggett 1988) or in Asia (Harlan 1995). Exceptions to the global race geographical pattern, such as the guinea types in South Africa, might be due to more recent germplasm movement or to a more complex origin of diverse types classified under that race (Ollitrault 1987; Ollitrault et al. 1989; Dégremont 1992; Deu et al. 1994; Folkertsma et al. 2005). In this scheme, varietal groups localized in the most external regions could be considered of secondary origin, such as the guinea types of western Africa (clusters 1 and 2), the kafir and the guinea types of southern Africa (clusters 7 and 8), as well as all the forms found in Asia (clusters 4 and 9).
Besides the survey of presumably neutral polymorphisms (Deu et al. 2006), we analyzed sequence diversity in portions of six candidate genes involved in starch biosynthesis and protein content regulation. Grain characteristics are likely to be subject to natural selection, since grain reserves, such as carbohydrates or proteins, play a key role in life cycles, as well as to human selection during domestication, given their impact on “grain quality” in human uses. The patterns of diversity displayed by such candidate genes are thus expected to exhibit traces of selection. In this article we describe sequence diversity at those loci and we apply the unique properties of that type of data to further investigate sorghum evolution.
MATERIALS AND METHODS
We analyzed 194 accessions representative of the diversity of cultivated sorghum that were part of the core sample of 210 varieties described above. The accession list with indication of the race, country of origin, and RFLP-based cluster can be found in supplemental Table S1.
The distribution of our material is shown in Table 1 and is schematized in Figure 1, together with current hypotheses on sorghum domestication. The regions are considered from west to east and north to south and include extreme West Africa (A), northcentral Africa (B), northeast Africa (C), southeast Africa (D), southern Africa (E), south Asia (SA), East Asia (EA), and others (O). We emphasized the comparison between the margins represented by clusters 1 and 2 in West Africa, clusters 7 and 8 in southern Africa, clusters 4 and 9 in Asia, which we qualified as secondary units in the crop history, and the center of the distribution represented by clusters 6, 3, 5, and 10, which we qualified as primary units.
Genes and primer design:
Six genes known to be important in the genetic control of grain quality in cereals, notably in maize, were selected. They either were directly involved in starch synthesis [Shrunken2 (Sh2), Brittle2 (Bt2), Soluble starch synthaseI (SssI), Amylose extender1 (Ae1), and Waxy (Wx)] or acted as transcriptional activators controlling endosperm protein storage genes [Opaque2 (O2); Pirovano et al. 1994]. Sh2 and Bt2 encode, respectively, the large and small subunits of a major enzyme involved in endosperm starch biosynthesis, ADP-glucose pyrophosphorylase (AGPase) (Schultz and Juvik 2004). AGPase catalyzes the first step of starch synthesis in plants, i.e., the production of ADP-glucose (ADPG). ADPG is further used to build amylose, a linear glucosyl chain, and amylopectin, a highly branched glucan, which together constitute starch. Amylose synthesis is catalyzed by granule-bound starch synthase, encoded by the Waxy (Wx) locus. Amylopectin synthesis is catalyzed by starch branching enzymes (SBEI, SBEIIa, and SBEIIb, the latter encoded by Sbe2b, now called the Ae1 gene) and debranching enzymes (DBE) (Wilson et al. 2004 and references therein).
The sequence accessions from sorghum used for primer design were AF488412 (Wx), AF010283 (Sh2), CD426561 (Bt2), AY304540 (Ae1), AF168786 (SssI), and X71636 (O2). For Bt2, Ae1, and SssI, for which only the cDNA sequences from sorghum were available, we also used sequences from maize (AF334959 for Bt2 and AF072725 for Ae1) and rice (AB026295 for SssI). Primers were designed using the PRIMER3 program (http://fokker.wi.mit.edu/primer3/input.htm). The primers are listed in supplemental Table S2. Several segments were targeted per gene. For Wx and O2, primers were designed to cover the major part of the gene and, for O2, of the promoter. For the other genes, primers were selected to amplify two segments of at least 0.5 kb in size and separated by at least 1.5 kb (Figure 2). The segment positions were chosen preferentially within coding regions, in the protein motif when possible, with maximum intragene distance between segments to enable analysis of linkage disequilibrium.
PCR and DNA sequencing:
Genomic DNA was extracted from fresh leaves harvested from a single 3-week-old seedling per accession following a cetyltrimethylammonium bromide (CTAB) protocol previously described (Deu et al. 1995). SP6 (5′-GATTTAGGTGACACTATAG-3′) and T7 (5′-TAATACGACTCACTATAGGGC-3′) tails were added to the 5′ ends of primers to facilitate direct sequencing of the PCR products. DNA amplifications were performed in 50 μl containing 25 ng of genomic DNA, 0.2 μM of each primer, 2 mM MgCl2, 0.2 mM dNTP, and 1 unit Taq DNA polymerase. Cycling conditions were as follows: 94° for 4 min; 10 touchdown cycles [30 sec at 94°, 60 sec at melting temperature (TM) + 5° with the annealing temperature reduced by 0.5°/cycle, 60 sec at 72°]; 25 cycles (30 sec at 94°, 60 sec at TM, 60 sec at 72°); and a final extension step of 8 min at 72°. The annealing temperature was 55° except for Wx segment 1 (60°) and Wx segment 2 (65°).
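The touchdown profile described above can be written out cycle by cycle. The following is a minimal sketch, assuming that the first touchdown cycle anneals at exactly TM + 5° and that each subsequent touchdown cycle is 0.5° lower; the function name `touchdown_program` and its parameters are illustrative and do not come from the original protocol.

```python
def touchdown_program(tm, touchdown_cycles=10, start_offset=5.0,
                      step=0.5, plateau_cycles=25):
    """Return (cycle_number, annealing_temperature) pairs for the whole run.

    Assumption: cycle 1 anneals at tm + start_offset and every later
    touchdown cycle is `step` degrees lower; the plateau cycles anneal at tm.
    """
    program = []
    for i in range(touchdown_cycles):
        program.append((i + 1, tm + start_offset - step * i))
    for j in range(plateau_cycles):
        program.append((touchdown_cycles + j + 1, tm))
    return program


# Default annealing temperature reported in the text (55 degrees).
schedule = touchdown_program(55.0)
first_cycle = schedule[0]      # annealing at TM + 5
last_touchdown = schedule[9]   # annealing at TM + 0.5
first_plateau = schedule[10]   # annealing at TM
total_cycles = len(schedule)   # 10 + 25 cycles
```

Under these assumptions the annealing step falls from 60° to 55.5° over the first 10 cycles and then stays at 55° for the remaining 25; the same function can be called with 60.0 or 65.0 for the two Wx segments.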
Sequencing from PCR products was performed using an Applied Biosystems Prism 3100 DNA analyzer (Applied Biosystems, Foster City, CA) in only one direction, by Centre National de Genotypage, Evry, France (http://www.cng.fr) for O2, and by GATC Biotech (http://www.gatc-biotech.com) and Genome Express (http://www.cogenics.com/) for the other genes. Sequence data have been deposited with the EMBL/GenBank Data Libraries under accession nos. EU388245–EU388607 for Sh2, EU388985–EU389363 for Bt2, EU388608–EU388984 for SssI, EU387881–EU388244 for Ae1, EU387138–EU387880 for Wx, and EU389364–EU390699 for O2.
Sequence quality control and clipping were performed using Sequencher 4.0 (Gene Codes, Ann Arbor, MI) with the minimum Phred score set to 20. All sequences, including the reference sequence, were aligned using Sequencher, and base substitutions among sequences [single nucleotide polymorphisms (SNPs)] and insertion or deletion polymorphisms (IDPs) were detected. Artemis version 7 (Rutherford et al. 2000) was used to verify the position of splicing sites and determine the correct reading frame. Conservative and nonconservative amino acid substitutions were defined by the Blosum matrix (Henikoff and Henikoff 1992) and by calculating hydrophobicity (Kyte and Doolittle 1982).
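To make the quality threshold concrete: a Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so Q = 20 means roughly one error in 100 bases. The sketch below illustrates a simple end-clipping rule based on that threshold. It is an illustrative assumption only and is not the algorithm implemented in Sequencher; the function names and the toy read are hypothetical.

```python
def phred_to_error_probability(q):
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def clip_ends(bases, quals, min_q=20):
    """Trim both ends of a read until a base with quality >= min_q is reached.

    Illustrative rule only; commercial software may use windowed criteria.
    """
    start, end = 0, len(bases)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


# Hypothetical read with low-quality ends.
read = "ACGTACGT"
scores = [5, 12, 30, 40, 35, 22, 18, 9]
clipped_bases, clipped_scores = clip_ends(read, scores)
```

Here the two leading and two trailing bases fall below Q = 20 and are removed, leaving the four central bases.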
Only accessions with sequences for all segments of a given gene were kept for further data analyses. Thus, global SNP and IDP were assessed on slightly different samples for each gene: 184 accessions for Bt2, 166 for SssI, 153 for Sh2, 154 for Ae1, and 129 and 146 for Wx and O2, respectively, the two genes with the largest number of segments. Further use was made of a core set of 53 accessions (supplemental Table S1) with no missing data across all genes and well distributed within the pattern of diversity revealed by RFLPs (Deu et al. 2006).
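The filtering rule above amounts to set intersections: per gene, keep the accessions sequenced for every segment; across genes, keep those retained for every gene. The following sketch shows this under the assumption that sequencing success is recorded as sets of accession identifiers. The accession names and the helper names are hypothetical, and the actual core set of 53 accessions was additionally chosen to be well distributed across RFLP clusters, which this sketch does not model.

```python
def complete_for_gene(segments):
    """Accessions sequenced for every segment of one gene.

    segments: dict mapping segment name -> set of accession identifiers.
    """
    sets = list(segments.values())
    return set.intersection(*sets) if sets else set()


def complete_for_all_genes(genes):
    """Accessions with no missing segment in any gene.

    genes: dict mapping gene name -> dict of segment name -> set of accessions.
    """
    per_gene = [complete_for_gene(segs) for segs in genes.values()]
    return set.intersection(*per_gene) if per_gene else set()


# Hypothetical toy data (identifiers are not real accession codes).
toy = {
    "Sh2": {"seg1": {"acc1", "acc2", "acc3"}, "seg2": {"acc1", "acc3"}},
    "Bt2": {"seg1": {"acc1", "acc2", "acc3"}, "seg2": {"acc1", "acc2", "acc3"}},
}
sh2_complete = complete_for_gene(toy["Sh2"])
shared_complete = complete_for_all_genes(toy)
```

Because each gene has its own pattern of missing segments, the per-gene sample sizes differ, while the shared set can only be as large as the smallest of them.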
Nucleotide diversity was estimated using θ, Watterson's estimator of 4Neμ per base pair, on the basis of the number of segregating sites (where Ne is the effective population size and μ the mutation rate) (Watterson 1975), and π, the average number of pairwise differences per nucleotide between sequences (Tajima 1983). Tajima's D test (Tajima 1989) was used to test for deviations from neutral mutation-drift equilibrium. All those estimators were calculated using DnaSP 4.10 (Rozas et al. 2003). Analyses were conducted separately for coding and noncoding regions and for the entire concatenated sequence of all segments of the same gene.
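For readers who want to see how these three statistics relate, the sketch below computes Watterson's θ and π per site and Tajima's D from a gap-free alignment, using the standard formulas of Watterson (1975) and Tajima (1983, 1989). It is not DnaSP's implementation: handling of gaps, missing data, and sites with more than two states is simplified, and the function name and toy alignment are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def diversity_stats(seqs):
    """Return S, Watterson's theta and pi (both per site), and Tajima's D.

    seqs: list of equal-length aligned sequences without gaps (assumption).
    """
    n = len(seqs)
    if n < 4:
        raise ValueError("Tajima's D needs at least four sequences")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")

    # Number of segregating sites.
    seg = sum(1 for col in zip(*seqs) if len(set(col)) > 1)
    # Average number of pairwise differences.
    pairs = list(combinations(seqs, 2))
    k = sum(sum(x != y for x, y in zip(s, t)) for s, t in pairs) / len(pairs)

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    stats = {"S": seg, "theta_w": seg / a1 / length, "pi": k / length}
    if seg == 0:
        stats["tajima_d"] = None  # undefined without polymorphism
        return stats

    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    stats["tajima_d"] = (k - seg / a1) / sqrt(e1 * seg + e2 * seg * (seg - 1))
    return stats


# Toy alignment in which every variant is a singleton.
toy_alignment = ["AAAAAAAAAA", "AAAAAAAAAT", "CAAAAAAAAA", "AAAAGAAAAA"]
toy_stats = diversity_stats(toy_alignment)
```

In the toy alignment all three segregating sites are singletons, so π falls below θ and D is negative; an excess of intermediate-frequency variants would push D in the positive direction.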
SNPs and IDPs were extracted from the sequences and further analyzed using Tassel version 1.9.5 (http://www.maizegenetics.net/). Both SNPs and IDPs were considered in haplotype analysis; however, singletons were excluded to prevent excessive impact of sequencing errors. Haplotype diversity was analyzed for each segment separately and for each gene separately (concatenating all segments). Relationships between haplotypes were analyzed according to the median-joining network method described by Bandelt et al. (1995, 1999) with Network 4.2 software (http://www.fluxus-technology.com/sharenet.htm), using the median-joining option, an equal weight for all sites, and an ε-parameter of 0, which retained only frequent haplotypes for describing cycles in the network, i.e., a representation of alternate paths between two haplotypes due to homoplasies (recurrent mutations, recombination, but also sequence errors).
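The singleton exclusion and haplotype definition can be sketched as follows. This is a simplified illustration, not the Tassel workflow: it assumes a gap-free alignment, treats a site as informative when its second most common state occurs in at least two accessions, and groups accessions by their states at the informative sites. All names and the toy sequences are hypothetical.

```python
from collections import Counter, defaultdict


def informative_sites(seqs):
    """Indices of polymorphic sites that are not singletons.

    seqs: dict mapping accession name -> aligned sequence.
    A site is kept when its second most common state occurs at least twice.
    """
    keep = []
    for idx, column in enumerate(zip(*seqs.values())):
        counts = Counter(column).most_common()
        if len(counts) >= 2 and counts[1][1] >= 2:
            keep.append(idx)
    return keep


def haplotypes(seqs):
    """Group accessions by their states at the informative sites."""
    sites = informative_sites(seqs)
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups["".join(seq[i] for i in sites)].append(name)
    return sites, dict(groups)


# Hypothetical toy alignment: site 1 is shared by two accessions,
# site 4 carries a singleton that is ignored.
toy_seqs = {
    "acc1": "ACGTA",
    "acc2": "ACGTA",
    "acc3": "ATGTA",
    "acc4": "ATGTC",
    "acc5": "ACGTA",
}
kept_sites, hap_groups = haplotypes(toy_seqs)
```

Dropping the singleton at the last position merges acc3 and acc4 into one haplotype, which is the intended effect of the filter: a single sequencing error cannot create a haplotype of its own.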
The polymorphism revealed among all the accessions is presented in detail for each gene in a file entitled sorghum_SNP.pdf downloadable from http://tropgenedb.cirad.fr/sorghum/. Its global features are described in Table 2. A total of 170 polymorphisms, including 141 SNPs and 29 IDPs, were recorded within a total of 11.3 kbp scored. That resulted in an average of one SNP every 80 bp. The IDPs (14 involved 1 bp only) were confined to introns, promoter or 3′-UTR regions and accounted for 17% of all polymorphism. Only 36 of the SNPs were located in coding regions; among them 9 were nonconservative and 1 was a nonsense mutation. O2 was the gene with the highest proportion of nonsynonymous SNPs.
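The summary figures in this paragraph follow directly from the counts given; the short check below reproduces them from the numbers stated in the text (variable names are illustrative).

```python
# Counts reported in the text.
snps = 141
idps = 29
scored_bp = 11_300  # 11.3 kbp

total_polymorphisms = snps + idps
bp_per_snp = scored_bp / snps                   # average spacing between SNPs
idp_share = idps / total_polymorphisms          # fraction of polymorphisms that are IDPs

print(total_polymorphisms, round(bp_per_snp), round(100 * idp_share))
```

The spacing of about 80 bp counts SNPs only; including the IDPs would give a denser figure of about one polymorphism every 66 bp.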
For the sake of comparison among genes and with other studies, we computed global parameters for a subset of 53 diverse accessions (supplemental Table S1) that had complete data for all genes. The frequency of polymorphic sites per gene, considering each IDP as a unique event, varied between 0.3% (Bt2) and 1.4% (Sh2). It was generally twice as high for noncoding regions as for coding regions, except for Bt2 for which all polymorphisms were located in noncoding regions. The amount of diversity, measured by π, ranged from 0.34/kbp for Bt2 to 4.26/kbp for Ae1. The frequency of polymorphic sites, measured by θ, was highest for Sh2 (3.31/kbp), due to the large number of singletons for that gene, intermediate for Ae1, Wx, and O2, and low for Bt2 and SssI (0.66 and 1.34/kbp, respectively). That resulted in contrasting Tajima's D estimates, with Sh2 showing a significant negative D value and Ae1 and Wx showing significant positive D values.
Haplotype structure and distribution:
Shrunken2 (Sh2):
A total of six haplotypes were observed for Sh2 (Table 3; supplemental Tables S3 and S4 for details). Segment 1 displayed two groups of haplotypes: one comprised a predominant type (a, in H1 + H3) and two minor types unrelated to each other (b in H2, differing from a at one site, and c in H4, differing from a at two sites); the other comprised a single type (d in H5 + H6). Segment 2 also displayed two groups of haplotypes, differentiated by at least 10 of the 11 polymorphic sites: a predominant group (a in H1 + H2 + H4 + H5, and b in H3, differing from a at one site) and a minority group (c in H6, 8%). For both segments, it was possible to represent the haplotype relationships in a simple manner (Figure 3). The pattern of association between the two segments was very strong, highlighting two groups of full-length haplotypes, of uneven frequency, and a minor recombinant-like type (H5) found in only four accessions (2.6%). On that basis, it can be concluded that there is strong LD along the whole gene, and a simple interpretation can be attempted to describe the genealogical relationships among the various alleles. The minority group, and to some extent the recombinant haplotype, were typical of cluster 2 of the guinea margaritiferum subrace in West Africa, but were occasionally found in other backgrounds (supplemental Table S4).
Brittle2 (Bt2):
A total of 12 haplotypes were recorded for Bt2 (Table 3; supplemental Table S5 and S6 for details), of which six were very rare (<2%). Both segments displayed little differentiation, with almost continuous variation and random combination of polymorphisms. There was only a weak LD along the gene and no simple inter-haplotype genealogical hypothesis was possible (Figure 3). Here again, the guinea margaritiferum subrace cluster was highlighted; it encompassed all the accessions that displayed haplotype H9 (supplemental Table S6).
Soluble starch synthaseI (SssI):
A total of six haplotypes were observed for SssI (Table 3; supplemental Tables S7 and S8 for details). Segment 1 displayed two groups of haplotypes differentiated by at least four of the seven polymorphic sites, one made up of the predominant type and one minor type (a and b in H1 and H2), the other made up of three types (c, d, and e in H3–H6). Segment 2 also displayed two groups of haplotypes, differentiated by at least three of the five polymorphic sites, one made up of the predominant type (a), the other made up of three types (b, c, and d in H3–H6). The pattern of association between the two segments was very strong, highlighting strong LD along the gene. A simple interpretation can be attempted to describe the genealogical relationships among the various alleles (Figure 3). One minority haplotype (H3) was typical of cluster 2 of the guinea margaritiferum subrace in West Africa but was occasionally found in other backgrounds (supplemental Table S8). The predominant group was found in all other varietal clusters and regions.
Amylose extender1 (Ae1):
A total of 12 haplotypes were observed for Ae1 (Table 3; supplemental Tables S9 and S10 for details). Segment 1 displayed three groups of haplotypes, one made up of a predominant type (a in H1 + H2), another made up of one haplotype (f in H12), the third one made up of four related types differentiated at one to three polymorphic sites (b in H3 + H4 + H5, e in H11, c in H6 + H7 + H8, and d in H9 + H10). Within the latter group, simple genealogical relationships can be inferred between the various haplotypes (Figure 3). Segment 2 displayed six haplotypes distributed in two groups differentiated by at least eight of the 13 polymorphic sites, contrasting H1 and H12 (a and f) on one side to H2–H11 (b–e) on the other; it was also possible to consider H12 as a recombinant type between a in H1 and b in (H2 + H3 + H10). Among H2–H11, four types existed (b, c, d, and e), among which simple genealogical hypotheses can be formulated. Segments 1 and 2 displayed strong associations; only H2 (13 accessions, 8.4%) was an exception to association between segment-specific groups, and may thus be interpreted as the result of recombination. Considering the rarer variants within each group, H12 was the result of markedly differentiated haplotypes on both segments. It is noteworthy that the most recent haplotypes based on our simple genealogical hypotheses, i.e., H9 + H10 for segment 1 and H6 for segment 2, were most frequent in secondary clusters (supplemental Table S10).
Waxy (Wx):
A total of 22 haplotypes were recorded for Wx (Table 3; supplemental Tables S11 and S12 for details). The main feature of Wx gene diversity was the existence of two gene portions that displayed total or near total internal LD and little or no LD between portions. The first portion consisted of segment 1; it featured one predominant haplotype (a) and two other haplotypes (b and c) that differed from it by 13 or 14 polymorphic sites. The second portion consisted of segment 2, which displayed two groups of haplotypes that were differentiated from one another by at least 5 of the 8 recorded polymorphic sites, as well as segment 3 and segment 4, which displayed a series of infrequent (<10%) haplotypes differentiated by 1–4 polymorphic sites from the predominant haplotype (Figure 3). Many of those minor haplotypes seemed confined to secondary varietal groups or secondary regions, suggesting that they might have been derived from recent mutations (supplemental Table S12). Interestingly, there was no LD between that portion and the first portion. This suggested that intragenic recombination had occurred at the same pace as point mutations and had had a significant impact on allele diversification in the course of domestication.
Opaque2 (O2):
The O2 gene displayed a total of 21 haplotypes (Table 3; supplemental Tables S13 and S14 for details). They fell into two groups of haplotypes characterized by a uniformly distributed contrast over the full 4 kb length: H1–H19 on one side and H20 + H21 on the other differed by at least 31 of the 55 polymorphic sites. Within a smaller magnitude of variation, the first group, which was also markedly predominant, displayed a remarkable tendency toward multidirectional radiation (Figure 3). H1, the haplotype that corresponded to the root, had a global frequency of 20%; the various branches displayed between one and up to seven mutations. Interestingly, most of the branches expanded in a specific geographical direction and the most recent alleles tended to be specific to secondary varietal clusters and regions: haplotypes H11, H12, and H13 in cluster 2 (guinea margaritiferum) and H10 in cluster 1 in western Africa; H14–H17 in cluster 8 and H2 and H3 in cluster 7 in southern Africa (supplemental Table S14); H4 in two unclustered accessions from southern Africa.
Sequence diversity in sorghum:
Our study was based on 1.7 Mbp sequence data enabling a comparison between 129 to 184 accessions of sorghum for 11.3 kbp covering six genes and 1.0–3.8 kbp per gene. The best reference for comparison in sorghum is provided by Hamblin et al. (2004), who described 27 sorghum accessions, including three wild ones, through 95 loci and a total of 29.2 kbp. Our sample of 53 accessions used to quantify variation across the genes was close to double in size, and gave more weight to western Africa, southern Africa, central Africa, and Asia, and less to eastern Africa. Whereas our study revealed a higher frequency of polymorphisms (one SNP every 80 bp, compared to one SNP every 123 bp), probably largely due to sample size, the average diversity (π) was remarkably similar in both studies (2.05 vs. 2.25/kbp).
Five of the genes we studied in sorghum (Sh2, Bt2, Ae1, Wx, and O2) have been studied in maize by Henry and Damerval (1997), Whitt et al. (2002), Henry et al. (2005), and Manicacci et al. (2007). Compared to that crop, the diversity appeared slightly higher in sorghum for Ae1, but lower for Wx, Bt2, Sh2 and O2 (Table 2). The most accurate comparison was possible for Sh2 and O2 where most of the coding sequence was studied (Henry et al. 2005; Manicacci et al. 2007); the diversity in sorghum was found to be three and seven times lower than in cultivated maize. The Wx gene enabled a comparison with cultivated barley and with Asian domesticated rice (Kilian et al. 2006; Olsen et al. 2006); the diversity in sorghum (4.21/kbp) was close to that in barley (3.32/kbp) and in rice (5.6/kbp), despite their well-documented high intraspecific diversity, particularly the indica-japonica contrast in rice (Table 2).
A remarkable feature in sorghum was the existence of strongly differentiated haplotypes at most genes (Table 3). The diversity level for a particular gene was very much related to the relative frequency of the most contrasting haplotypes. For three of the six loci (Sh2, Bt2, and SssI), the three most common haplotypes bore over 80% of allelic diversity.
On the basis of Tajima's D test, three loci (Sh2, Ae1, and Wx) exhibited significant deviations from neutral mutation-drift equilibrium (Table 2). That might be explained by population size variations, including population expansions or bottlenecks, as well as by selection. A negative D value (Sh2) revealed an excess of very rare alleles, which might indicate a purifying selection pattern, while positive values (Ae1 and Wx) might be explained by balancing selection. Pushing this interpretation further, the fact that coding regions appeared less affected than noncoding regions for Sh2 and Wx (Table 2) might be related to background selection or hitchhiking (Otto 2000). Those three genes have been shown to be targets of selection in other cereals, in particular Sh2 in maize (Whitt et al. 2002; Manicacci et al. 2007), Ae1 in maize (Whitt et al. 2002; Wilson et al. 2004), and Wx in rice (Yamanaka et al. 2004; Olsen et al. 2006). Explanations other than selection might yet be proposed, especially in a cultivated species such as sorghum, which displays clusters of contrasting haplotypes. Ancestral population structure, multiple domestications, or introgression from wild relatives have recently been highlighted by Hamblin et al. (2006), also in sorghum, as potentially confusing phenomena when the objective is to detect an episode of selection.
The level of within-gene LD was remarkably variable between genes. There was near-complete LD over >4 kb of the whole-gene length in O2, whereas Wx displayed LD breakage between segments 1 and 2 (Table 4). If some haplotypes in Wx portion 2 (segments 2 + 3 + 4) derived from mutations that occurred after the start of domestication, this break in LD highlights intensive recombination within the 244 bp that separate sites within intron 6 from those located in intron 7 (supplemental Table S11). There was also strong LD across Ae1 (which represents over 16 kb in maize), induced by the existence of markedly differentiated haplotypes that apparently seldom recombined (Tables 3 and 4). However, for the same gene, there were also minor variants, which probably appeared more recently and were not in LD. This illustrates that recombination might occur frequently, but that it affects only some combinations of alleles. Many other allele combinations may be lacking among the heterozygous forms, as a result of a departure from random mating. Such a bias can lead to conservation of major haplotype groups and maintain strong LD among those SNPs that discriminate them. In that case LD relates to population structure. The heterogeneity of LD among genes is well documented for several species (Gupta et al. 2005); our study highlighted Wx as a spot with high recombination activity, as has been observed in maize and rice (Okagaki and Weil 1997; Inukai et al. 2000), but it also highlighted complete LD in O2 whereas that gene displays traces of intensive recombination activity in maize (Henry and Damerval 1997).
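A common way to quantify LD between two polymorphic sites is the squared correlation r², which for haplotype data is D² / [pA(1 − pA) pB(1 − pB)] with D = pAB − pA pB. The sketch below computes it for two biallelic sites scored in inbred accessions. It is given for illustration only: the LD measure reported in Table 4 is not specified in this section, and the function name and toy data are assumptions.

```python
def r_squared(site_a, site_b):
    """Squared correlation between two biallelic sites.

    site_a, site_b: sequences of allele states, one entry per accession,
    in the same accession order (each accession treated as one haplotype).
    """
    if len(site_a) != len(site_b) or not site_a:
        raise ValueError("sites must be non-empty and of equal length")
    n = len(site_a)
    ref_a, ref_b = site_a[0], site_b[0]  # arbitrary reference alleles
    p_a = sum(x == ref_a for x in site_a) / n
    p_b = sum(y == ref_b for y in site_b) / n
    p_ab = sum(x == ref_a and y == ref_b for x, y in zip(site_a, site_b)) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    d = p_ab - p_a * p_b
    return d * d / denom


# Two sites in complete association, and two sites combined at random.
complete_ld = r_squared("AAAATT", "CCCCGG")
no_ld = r_squared("AATT", "CGCG")
```

Complete association gives a value of 1 and independent assortment gives 0; two deeply divergent haplotype groups that seldom recombine, as described for Ae1 and O2, would yield values near 1 for all sites that discriminate the groups.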
Insights into sorghum domestication:
LD patterns and predominant haplotypes can help trace crop history and crop domestication. Across the six genes, there were five instances of occurrence of two groups of haplotypes exhibiting moderate (Bt2) to high (Sh2, SssI, Wx, and O2) mutual contrast. In three cases, one of those groups was in the minority and clearly related to cluster 2 of guinea margaritiferum, although it was not exclusive to it (supplemental Table S9–S14). In the other two cases, however, the main haplotype differentiation did not relate clearly to any known structure or geographical distribution. In the sixth case (Ae1), there were three strongly differentiated groups of haplotypes. It is noteworthy that two or more of those groups were observed in most varietal clusters. At the foundation of a crop, the diversity among the early domesticates depends both on the initial wild progenitor diversity structure and the distribution of the domestication process. If the size of the wild populations is large enough, it is expected that, for most loci, multiple alleles coexist, which may display complex derivation relationships and reveal no particular genealogical pattern. Any heterogeneity may result in a multiple foundation with discrete groups of early cultivated forms. Our study in sorghum revealed a marked contrast between a relatively loose structure when whole-genome markers were used and the existence of at least two, occasionally three, strongly differentiated haplotype groups at the individual gene level. One interpretation is that the foundation was from a set of discrete lineages and that present variation was the result of profuse recombination after that foundation. The possible confusion between initial differentiation among founders, which impacted on differentiation among the early domesticates, and the later introgression from local wild forms, which might induce a localized appearance of highly differentiated alleles, cannot be settled without accurate geographical analysis. 
An extension of this type of work to more genes and gene segments, while including representatives of the wild progenitor across its geographical distribution, will provide firm evidence of the domestication process.
The most striking feature of our results is probably the observation of what seemed to be novel recent diversity. For several genes, primary varietal groups displayed more allele diversity, whereas secondary varietal groups in western Africa and southern Africa usually displayed the most frequent allele, which was generally the most ancient one (supplemental Tables S9–S14). Such patterns are typical of founder effects where secondary types only kept the predominant allele, as expected through drift at neutral loci. In contrast, Ae1, Wx, and especially O2 displayed novel variation outside the area of origin of sorghum. There were many instances of alleles that were present in the varietal groups of secondary origin and absent, or very rare, in the area of origin. There might be various explanations for those situations, such as incomplete coverage of the survey in the area of origin, disappearance of some alleles in the area of origin, or introgression of new alleles in cultivated forms in varietal groups of secondary origin. However, those alleles that were specific to secondary groups were also the most recent ones in allele genealogies, suggesting they appeared in the course of the domestication process and concomitant varietal migrations. Some cases of potential novel alleles are more convincing than others. For Ae1 the meaning of the patterns observed is unclear. One of the possible “novel” haplotypes (H9) was observed both in western Africa and in Asia, which are not known to have exchanged much sorghum germplasm, and it was not completely absent in region B and C samples; the other (H6) was found in southern Africa in only three varieties. In both cases the frequency of the haplotype might have increased through drift from a preexisting allele that was present at low frequency in primary regions and went undetected in our study.
For Wx portion (2 + 3 + 4), one novel type (c) found only in western Africa and one (e) found only in southern Africa displayed only one variant base, whereas another (d), present in half of west African cluster 1 and totally specific to that cluster, exhibited three specific variant sites. It is noteworthy that those haplotypes for portion (2 + 3 + 4) were randomly associated with segment 1 haplotypes, showing that intragenic recombination occurred at the same pace as single-base substitutions. For O2, several novel haplotype branches were observed, which displayed between one and five mutations (Figure 3). In western Africa, one branch (haplotypes 11, 12, and 13 in cluster 2) displayed one or two haplotype-specific variants and another (H10 in cluster 1) displayed three specific variants. In southern Africa one branch (H2, H3, and H4 essentially in cluster 7) represented a series with one, two, and three variants, respectively, which displayed increasing specificity, and another (haplotypes H14–H17 in cluster 8) had a distal part that was specific and displayed two to four more variants than the proximal part. Altogether, this suggests that multiple mutations occurred and were retained during the geographical spread of cultivars outside the area of early domestication and it points at Wx and especially O2 as the cases with clearest evidence of such novel diversity. Olsen and Purugganan (2002) documented a similar situation with the Wx gene in rice, where several mutations were observed, which appeared posterior to the “glutinous-phenotype” mutation, itself occurring after the start of domestication.
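The screening logic behind these comparisons (a haplotype that is present in secondary varietal groups, absent or very rare in the area of origin, and derived in the allele genealogy) can be illustrated with a minimal sketch. Everything in it is hypothetical: the function name, the region and haplotype labels, the counts, and the 5% rarity threshold are assumptions chosen for illustration, not values or methods taken from this study.

```python
from collections import Counter


def candidate_novel_haplotypes(samples, origin_regions, derived, rare=0.05):
    """Flag haplotypes that look like post-domestication novelties.

    samples        : iterable of (region, haplotype) pairs, one per accession
    origin_regions : set of region labels treated as the area of origin
    derived        : set of haplotypes inferred as recent in the genealogy
    rare           : hypothetical frequency threshold for "very rare" in the origin
    """
    origin = Counter()
    secondary = Counter()
    for region, hap in samples:
        # Pool accessions into two compartments: area of origin vs. elsewhere.
        (origin if region in origin_regions else secondary)[hap] += 1

    n_origin = sum(origin.values())
    candidates = []
    for hap in sorted(secondary):
        origin_freq = origin[hap] / n_origin if n_origin else 0.0
        # Keep haplotypes seen outside the origin, absent or rare inside it,
        # and derived (recent) in the rooted genealogy.
        if origin_freq <= rare and hap in derived:
            candidates.append(hap)
    return candidates


# Toy data, invented for illustration only.
samples = (
    [("origin", "X1")] * 18 + [("origin", "X2")] * 2
    + [("west", "X1")] * 6 + [("west", "X3")] * 4
    + [("south", "X1")] * 5 + [("south", "X2")] * 1 + [("south", "X4")] * 4
)
derived = {"X2", "X3", "X4"}

print(candidate_novel_haplotypes(samples, {"origin"}, derived))
```

In this toy data set, X2 is derived but reaches 10% in the area of origin, so it is not flagged; such a haplotype would be more consistent with a frequency increase through drift from a preexisting allele. The sketch does not address the alternative explanations listed above (incomplete sampling, allele loss in the area of origin, or introgression), which require additional data.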
Mutation is generally considered to have little impact over domestication given the short time span. However, our results for sorghum indicated several instances where both mutation and intragenic recombination had generated novel alleles since the beginning of domestication. It is likely that the emergence of novel alleles at high frequency was made possible by positive selection. Cultivated forms are subjected to both natural and human selection. Selection for crop traits may screen among new recombinants and new mutants and differentially impact adaptive vs. neutral genes. It can foster rapid changes in gene diversity by selecting new alleles derived from recent mutations. This generates higher substitution rates in those genes that are subjected to selection. Our study suggests that certainly O2, and possibly Wx, were directly subjected to selection.
On a finer scale, our results and interpretation suggest that all mutations that contributed to the emergence of novel haplotypes, or at least many of them, were individually subjected to positive selection. For O2, those variants were localized in exon 1, where they were all nonsynonymous, but also in the promoter region and in intron 1; for Wx, they were found in exon 9 but most numerous in the 3′-UTR. The high frequency of those variants in noncoding regions is noteworthy, as polymorphisms that are subjected to selection are expected to be localized primarily in regulatory regions or in exons. Yet the scarce data available on noncoding regions in mammals suggest that the substitution rates in 5′ regions and 3′-UTR are 2.5 times as high as those at nondegenerate exon sites but similar to those at twofold degenerate exon sites and only 0.6 times as high as those at fourfold degenerate exon sites, introns, or pseudogenes (Li 1997).
What we are observing can be called “crop neodiversity”: it is novel diversity that is directly the result of human action through the selection of favorable mutants in the crop. This relates to the observations of Rasmusson and Phillips (1997), who proposed de novo variation as the substrate for sustained barley improvement from an initial narrow basis. Among the phenomena that could contribute to this process, the authors mentioned single-base mutations, possibly favored by DNA methylation (Coulondre et al. 1978), as well as intragenic recombination, which can be much more frequent (e.g., 1 cM for 10–50 kb in maize; Brown and Sundaresen 1992; Dooner 1986) than recombination along the whole genome. These are two phenomena that are depicted in our data.
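As a rough worked conversion, given only for illustration, the intragenic rate of 1 cM per 10–50 kb quoted above can be expressed per kilobase and per megabase. The last line assumes that 1 cM corresponds to a recombination fraction of 0.01, which is a reasonable approximation only over short distances; the symbol r for the per-kilobase recombination fraction per meiosis is introduced here and is not used elsewhere in the article.

```latex
\frac{1\ \text{cM}}{10\ \text{kb}} = 0.1\ \text{cM/kb} = 100\ \text{cM/Mb},
\qquad
\frac{1\ \text{cM}}{50\ \text{kb}} = 0.02\ \text{cM/kb} = 20\ \text{cM/Mb}

r_{1\,\text{kb}} \approx 0.01 \times (0.02 \text{ to } 0.1) = 2 \times 10^{-4} \text{ to } 1 \times 10^{-3}
```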
Crop phylogeographic analysis:
Almost all crops have rapidly spread through migration out of their area of origin. This is likely to have been accompanied by strong selection by man, both for his usage of the products and for the adaptation of the varieties to cultivation in new environments. We have proposed an interpretation of our data in terms of sorghum evolution since domestication. Including wild progenitors as a representation of the reservoir of initial allelic diversity and of potential contributors through introgression will strengthen the interpretation. Germplasm collections of the most important crops are often very large and are becoming accurately characterized with neutral markers for large numbers of accessions (http://www.generationcp.org); this represents a wealth of materials and information for developing such pattern analyses.
The research described here was supported by Centre de Coopération Internationale en Recherche Agronomique pour le Développement (CIRAD) and Universidade Católica de Brasília and was funded by a grant from the National Council for Scientific and Technological Development of Brazil to L.F.A.F., a grant from Génoplante to D.B., B.C., and M.D., and a grant from the Generation Challenge Programme to J.C.G.
- Received January 19, 2008.
- Accepted April 6, 2008.
- Copyright © 2008 by the Genetics Society of America