Abstract
We report a genetic recombination map for Sorghum of 2512 loci spaced at average 0.4 cM (∼300 kb) intervals based on 2050 RFLP probes, including 865 heterologous probes that foster comparative genomics of Saccharum (sugarcane), Zea (maize), Oryza (rice), Pennisetum (millet, buffelgrass), the Triticeae (wheat, barley, oat, rye), and Arabidopsis. Mapped loci identify 61.5% of the recombination events in this progeny set and reveal strong positive crossover interference acting across intervals of ≤50 cM. Significant variations in DNA marker density are related to possible centromeric regions and to probable chromosome structural rearrangements between Sorghum bicolor and S. propinquum, but not to variation in levels of intraspecific allelic richness. While cDNA and genomic clones are similarly distributed across the genome, SSR-containing clones show different abundance patterns. Rapidly evolving hypomethylated DNA may contribute to intraspecific genomic differentiation. Nonrandom distribution patterns of multiple loci detected by 357 probes suggest ancient chromosomal duplication followed by extensive rearrangement and gene loss. Exemplifying the value of these data for comparative genomics, we support and extend prior findings regarding maize-sorghum synteny—in particular, 45% of comparative loci fall outside the inferred colinear/syntenic regions, suggesting that many small rearrangements have occurred since maize-sorghum divergence. These genetically anchored sequence-tagged sites will foster many structural, functional and evolutionary genomic studies in major food, feed, and biomass crops.
As a model for the large genomes of many tropical grasses, sorghum [Sorghum bicolor L. Moench.; 748–772 million base pairs (Mbp); Arumuganathan and Earle 1991] is a logical complement to Oryza (rice; ∼420 Mbp; Arumuganathan and Earle 1991), a distant relative (tribe Oryzeae) that will be the first grass genome to be completely sequenced (Goff et al. 2002; Yu et al. 2002). Sorghum is an especially important bridge to several economically important large-genome crops in its own tribe (Andropogoneae) such as maize (∼2292–2716 Mbp) with which it may have shared common ancestry between 11 (Gaut and Doebley 1997) and 24 (Thomasson 1987) million years ago. Sorghum and sugarcane, a large-genome (∼2547–3605 Mbp) polyploid that ranks among the world's most economically important crops, may have shared a common ancestor as recently as 5 million years ago (Sobral et al. 1994), retain similar gene order (Ming et al. 1998), and even produce viable progeny in some intergeneric crosses (Dewet et al. 1976; P. L. Morrell and A. H. Paterson, personal communication). By contrast, rice and the maize/sorghum lineage may have diverged ∼50 million years ago (Linder 1987) and show much more chromosomal rearrangement (Paterson et al. 1995a). Analysis of the levels and patterns of genomic diversity within and between sorghum, sugarcane, rice, and maize (and others) promises to advance understanding of the biology and evolution of Poaceae grain and biomass crops and reveal new opportunities for their improvement.
Worldwide, sorghum is the fifth most important grain crop grown based on tonnage, after maize, wheat, rice, and barley (http://www.fao.org). Sorghum is unusually tolerant of low input levels, an essential trait for areas such as northeast Africa and the U.S. Southern Plains that receive too little rainfall for most other grains. In the more arid countries of northeast Africa, such as Sudan, sorghum contributes 39% of the calories in the human diet (http://www.fao.org; 1999 statistics). Increased demand for limited fresh water supplies, coupled with global climatic trends and expanding populations, suggests that dryland crops such as sorghum will be of growing importance.
Despite the likely growing importance of sorghum, its improvement has lagged behind that of maize, wheat, and rice, each of which has more than doubled in average yield on a worldwide basis in the last 38 years while sorghum yields have gained only 51% (average 1961–1963 compared to 1999–2001; http://www.fao.org).
In sub-Saharan Africa, already home to many of the world's hungry and with a population projected to double over the next 40 years (U.S. Census Bureau estimates 2002; http://www.census.gov), sorghum yields have gained only 6% over the last 38 years compared to 50% gains in wheat and maize (http://www.fao.org). In the U.S., sorghum was introduced over 200 years ago, possibly by Benjamin Franklin (Smith and Frederiksen 2000) and is now grown on 9–13 million acres. U.S. sorghum is principally used as an animal feed and therefore escapes direct notice by the general public, but is the 13th most valuable crop in the U.S. with a farm-gate value ranging from $0.8 to 2.0 billion/year (USDA 1992–2001 statistics).
S. bicolor is native to Africa. One other euploid species exists within the genus, S. propinquum, which is native to Asia and contains many “weediness” traits such as rhizomes, small seeds, and shattering. The genus also includes S. halepense, a tetraploid (2n = 40) thought to be derived from naturally occurring crosses between S. bicolor and S. propinquum (both 2n = 20). S. halepense is among the world's most noxious weeds, with widespread distribution. In the U.S., many local epithets for S. halepense have largely been supplanted by the term “Johnson grass,” first documented in an 1874 letter, referring to Colonel William Johnson, an Alabaman who sowed it on his farm (McWhorter 1971). The first U.S. federal appropriation for weed research targeted Johnsongrass (House Bill 121, 56th Congress, 1900).
Cross-fertility between S. bicolor and S. propinquum has permitted us not only to benefit from high levels of DNA polymorphism between them to build the detailed molecular map described herein, but also to conduct genetic analysis of many traits associated with grass domestication (e.g., Paterson et al. 1995a,b). The genetic map presented herein builds on and integrates much earlier work (Chittenden et al. 1994; Lin et al. 1995; Annen et al. 1998; Wyrich et al. 1998; Draye et al. 2001). Several other sorghum maps (Whitkus et al. 1992; Ragab et al. 1994; Xu et al. 1994; Dufour et al. 1997; Boivin et al. 1999; Peng et al. 1999; Subudhi and Nguyen 2000; Haussmann et al. 2002) provide seminal data on comparative genome organization and reveal important quantitative trait loci (QTL) but lack the high marker density needed for use in complex endeavors such as positional cloning of genes, genetic anchoring of bacterial artificial chromosome (BAC)-based physical maps, or assembly of genomic shotgun sequence. The only other high-density sorghum map (Menz et al. 2002) is composed largely of amplified fragment length polymorphisms; the difficulties associated with inferring orthology of these arbitrary-sequence markers across taxa constrain its value for comparative and evolutionary genomics. Our map is currently being used to anchor BAC-based physical maps of both S. bicolor and S. propinquum (Lin et al. 1999; Draye et al. 2001), to facilitate rapid gene isolation by map-based cloning and provide landmarks for eventual genomic sequence assembly. The genetically anchored probes used in this map are also being hybridized to BAC libraries from rice, sugarcane, and maize, fostering comparative genomics across the Poaceae.
MATERIALS AND METHODS
Laboratory procedures: The genetic population and molecular methods are as previously described (Chittenden et al. 1994), except that the mapping population was expanded to 65 individuals from 56, drawing additional F2 progeny at random from residual seeds of the original cross. Briefly, DNA was extracted from young leaves by a published protocol (Chittenden et al. 1994), ∼5 μg DNA per lane digested with 15 units of EcoRI, HindIII, or XbaI (Promega, Madison, WI), electrophoresed and blotted onto Hybond N+ (Amersham, Arlington Heights, IL), rinsed in 2× SSC, and stored at 4° until use. About 20–50 ng of PCR-amplified fragment was labeled with [32P]dCTP, hybridized to blots, washed, and exposed to X-ray film as described (Chittenden et al. 1994).
DNA markers and sequences: Prefixes of DNA markers used and their sources are as follows. Arabidopsis cDNA: AEST (R. Scholl, Arabidopsis Biological Resources Center, Ohio State University), AHD and HMG (T. Thomas, Texas A&M); Barley cDNA: BCD (M. Sorrells and S. Tanksley, Cornell); Johnsongrass rhizome cDNA: pHER, pSHR (Y. Si and A. H. Paterson, unpublished results); Maize PstI genomic clones: BNL, UMC (E. Coe and M. McMullen, University of Missouri); Maize cDNA: CSU (Coe, McMullen); Millet PstI genomic clones: M (M. Gale, John Innes Center); Oat cDNA: CDO (Sorrells, Tanksley); Sorghum cDNA: HHU (Wyrich et al. 1998), HHUK (Annen et al. 1998); Sorghum phytochrome genes: PHY (L. H. Pratt and M.-M. C.-Pratt, University of Georgia); Sorghum PstI genomic DNA: pSB, SHO (A. H. Paterson); Sugarcane cDNA: CDSB, CDSR (P. Moore, Hawaiian Agricultural Research Center); Sugarcane genomic clones: SG (Sorrells); Rice genomic clones: RG and cDNA: RZ (S. McCouch and S. Tanksley, Cornell), C, G, and R (T. Sasaki, RGP, Japan).
Sequences were obtained from the National Center for Biotechnology Information (NCBI) or developed in house by end sequencing of probes using standard methods. In house sequencing used a software pipeline in which sequence data in ABI trace file format were input into the programs PHRED (version 0.000925.c) and CROSS_MATCH (version 0.990329 with minmatch = 12 and minscore = 20) to trim poor quality and vector sequence. Residual vector and primer sequences were trimmed manually and sequences of <50 nucleotides in length were removed from further analysis. A list of GenBank accession numbers is available as supplementary documentation at http://www.genetics.org/supplemental/.
Table 1. Summary statistics for the SB × SP linkage groups
Map construction: A framework map of ∼600 codominant markers was constructed using the program MAPMAKER v2.0 on the PC, with error detection on (Lander et al. 1987). A new program written in Microsoft Visual Basic (J. E. Bowers and A. H. Paterson, unpublished data) was then used to insert additional markers into the framework. The algorithm used by this program was to determine the genotypes of each individual for each interval between markers already placed on the map, with multiple genotypes possible for individuals in which crossovers were observed between the framework loci. In cases where genotypes were uncertain due to dominant markers or missing data, the genotypes of the intervals were inferred from flanking loci, assuming that the minimum number of recombinations had occurred. An unmapped locus was tested against the possible genotypes for all intervals already on the map, in search of a perfect match. If such a match was found the marker was assigned to the appropriate interval, and then the framework was recomputed. If no perfect match was found, a second pass was made looking for matches to all but one individual, followed by subsequent passes with higher numbers of nonmatching individuals. Loci from these subsequent passes were rechecked for scoring errors in the individual that did not fit the expected pattern. If the data were determined to be correct the locus was then added to the framework map with a new recombination event not observed in the previous map. Loci mapping to the ends of the chromosomes could not be placed with this approach and had to be added to the framework manually or with MAPMAKER.
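The interval-matching passes described above can be sketched roughly as follows. This is an illustration in Python, not the unpublished Visual Basic program; the function name `place_marker`, the genotype codes `A`, `H`, and `B`, and the `max_mismatch` limit are assumptions made for the example. The sketch handles only codominant calls and missing data, and it omits both the inference of uncertain interval genotypes and the recomputation of the framework after each placement.

```python
def place_marker(marker, interval_genotypes, max_mismatch=2):
    """Find the framework intervals that best fit an unmapped marker.

    marker: one genotype code per individual ('A', 'H', 'B'), or None if missing.
    interval_genotypes: one entry per framework interval; each entry holds, per
        individual, the set of genotype codes possible within that interval
        (more than one code when a crossover falls inside the interval).
    Returns (number of nonmatching individuals, list of interval indices).
    """
    # Pass 0 looks for perfect matches, pass 1 tolerates one nonmatching
    # individual, and so on (hypothetical cap: max_mismatch).
    for allowed in range(max_mismatch + 1):
        hits = []
        for idx, interval in enumerate(interval_genotypes):
            mismatches = sum(
                1 for call, possible in zip(marker, interval)
                if call is not None and call not in possible
            )
            if mismatches == allowed:
                hits.append(idx)
        if hits:
            return allowed, hits
    return None, []


# Toy data (hypothetical): three intervals, four individuals.
intervals = [
    [{"A"}, {"A"}, {"H"}, {"B"}],
    [{"A"}, {"A", "H"}, {"H"}, {"B"}],  # individual 2 has a crossover here
    [{"A"}, {"H"}, {"H"}, {"B"}],
]
print(place_marker(["A", "H", "H", None], intervals))  # (0, [1, 2])
```

A marker that fits only with one nonmatching individual would be returned with a count of 1, which corresponds to the case the text says was rechecked against the films before a new recombination event was accepted.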
After the map had been constructed, it was manually edited to reduce the number of recombinations by exporting the locations of crossovers observed in the map into a spreadsheet. Instances with multiple recombinations for an individual progeny plant were reordered if possible to reduce the total number of recombination events observed. This step involved extensive checking of the raw data (films) for errors, with the plants apparently responsible for double recombinations being rechecked. Ostensibly codominant markers that could not be placed on the map were split into two dominant markers to attempt their mapping separately. Final map distances were computed using Kosambi (1944) centimorgans (cM), and maps were drawn by another Visual Basic program written for this purpose.
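For reference, the Kosambi mapping function converts a recombination fraction r into map distance as d = 25 ln[(1 + 2r)/(1 − 2r)] cM. The short Python sketch below applies that standard formula; the function name `kosambi_cm` is an assumption, and the sketch is not taken from the authors' software.

```python
import math


def kosambi_cm(r):
    """Kosambi map distance in cM for a recombination fraction r (0 <= r < 0.5)."""
    if not 0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")
    return 25.0 * math.log((1 + 2 * r) / (1 - 2 * r))


# For small r the distance is close to 100 * r; it grows faster as r nears 0.5.
for r in (0.01, 0.10, 0.20):
    print(r, round(kosambi_cm(r), 2))
```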
RESULTS
Genetic map: The SB × SP map (Table 1; Figure 1 and available at http://www.plantgenome.uga.edu/sorghummap) is composed of 2512 loci on 10 linkage groups that collectively span 1059.2 cM (Kosambi 1944). This is a 2236-locus (about ninefold) increase compared to our previously published map of 276 loci (Chittenden et al. 1994), yet the recombinational length has actually been reduced from 1445 cM to the current 1059.2 cM largely by virtue of a sufficiently high density of markers to distinguish errors from true double recombinants. The map is based on a total of 1376 detected crossovers (see materials and methods), which would correspond to 1386 potentially distinct map locations; we have identified markers at 853 (61.5%) of these possible locations. The largest gap between two loci corresponds to only 7.8 cM or 10 crossovers and only seven intervals in the map were >5 cM.
On the basis of the 65 F2 plants used, a single recombination event yields an estimate of 0.77 cM between consecutive loci, which defines the resolution limit of the map. Consequently, loci are plotted to intervals of this size (in the figure rounded to one decimal place).
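The resolution figure follows from simple arithmetic: each F2 plant samples two meiotic products, so one crossover among 2N gametes corresponds to 100/(2N) cM. A minimal Python check, with `cm_per_crossover` as a name assumed for illustration:

```python
def cm_per_crossover(n_f2_plants):
    """Map distance (cM) represented by one crossover among 2N sampled gametes."""
    gametes = 2 * n_f2_plants
    return 100.0 / gametes


print(round(cm_per_crossover(65), 2))  # 0.77
```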
All of the restriction fragment length polymorphism (RFLP) markers tested could be placed on the map although a small number (<20) that had initially been scored were determined in retrospect to be too faint for accurate scoring and were discarded. A similar number of markers with two segregating bands of nearly the same migration rates, which could not be reliably distinguished from one another, were also discarded. Another group of <20 markers that could not be mapped showed segregation ratios approaching 15:1 and were assumed to be caused by two loci with indistinguishable band sizes and were therefore discarded.
Duplicate probes were removed from the map by inspection of genomic hybridization patterns for cosegregating loci and also by sequence comparisons of most probes. Some probes used in past studies were shown to be identical to newly mapped sorghum probes and in a few cases cDNAs from other species corresponded closely to sorghum probes or to one another. In cases where RFLP markers had similar or identical sequences and mapped to similar or identical loci, one of the duplicates was removed from the map. In total, this resulted in the removal of 336 markers at 386 loci (which are not included in the 2050 probes and 2512 loci that compose the map). The genetic locations and corresponding information for these loci remain available at our web site (http://www.plantgenome.uga.edu/sorghummap).
Figure 1. Sorghum genetic map. Distances along the map are in Kosambi centimorgans. Marker prefixes are summarized in materials and methods. Text color indicates loci that are codominant (black), dominant for the S. bicolor allele (blue), or dominant for the S. propinquum allele (green). Loci revealed by probes that contain SSRs are indicated by a percent sign. Approximate centromere positions (determined as described in text) are indicated by an O. Space constraints prevent some markers from being placed immediately next to their map location. In these cases the (superscripted, parenthetical) map location was printed in the smaller font followed by the list of markers mapping to that location, with a line of reduced length plotted at the proper location on the map. For chromosomal regions with high marker density some loci were listed at the bottoms of the figures. Multiple markers mapping to the same chromosomal location show identical segregation patterns, and their physical order cannot be determined from present data.
Recombinational interference: Recombinational interference was assessed by comparing the frequency of occurrence of “double crossover” genotypes (i.e., aa–ab–aa; bb–ab–bb) to “adjacent crossover” genotypes (i.e., aa–ab–bb; bb–ab–aa) as a function of the size of the interval that contains the two crossovers required to produce each genotype. In the absence of interference, these two different classes of genotypes would be equally probable; however, only 121 double crossovers were found in the population vs. 262 adjacent crossovers, a highly significant difference (Figure 2). The numbers of observed genotypes in the two categories differ significantly (P < 0.05) from the expected (equal numbers) for cases in which the two recombination events were separated by 0–10 cM, 10–20 cM, and 40–50 cM and narrowly missed significance for the 20–30 cM (0.07) and 30–40 cM (0.06) spacings. Over intervals of >50 cM, no significant differences were found in the frequency of double vs. adjacent crossovers.
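The two genotype classes compared above can be told apart mechanically. The Python sketch below is an illustration only, assuming genotype calls coded `aa`, `ab`, and `bb` in map order for one plant on one linkage group; the function name `classify_crossover_pairs` is hypothetical, and the sketch ignores missing data, dominant markers, and the distance between the two crossovers.

```python
from itertools import groupby


def classify_crossover_pairs(genotypes):
    """Count double (aa-ab-aa, bb-ab-bb) and adjacent (aa-ab-bb, bb-ab-aa) patterns."""
    # Collapse runs of identical calls so each genotype change marks a crossover.
    runs = [call for call, _ in groupby(genotypes)]
    counts = {"double": 0, "adjacent": 0}
    for left, mid, right in zip(runs, runs[1:], runs[2:]):
        if mid == "ab":
            # A heterozygous segment flanked by the same homozygote is a double
            # crossover; flanked by opposite homozygotes it is an adjacent crossover.
            counts["double" if left == right else "adjacent"] += 1
    return counts


print(classify_crossover_pairs(["aa", "aa", "ab", "ab", "aa"]))
print(classify_crossover_pairs(["aa", "ab", "bb", "bb", "ab", "bb"]))
```

Only patterns with the heterozygous segment in the middle are counted, matching the genotypes listed in the text.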
Segregation distortion: Five regions on the genetic map showed segregation distortion significant at the 5% level. The apices of distortion in the five regions were on LG B near cM 50.0, LG C near cM 46.2, LG D near cM 66.2, LG G near cM 26.2, and LG I near cM 0.0. Curiously, all five regions showed segregation distortion favoring the S. bicolor alleles. By far the most striking case was on LG C: the apex of the distortion was near the locus CSU507 and comprised a segregation ratio of 41:17:2 (homozygous S. bicolor:heterozygote:homozygous S. propinquum), significantly (P < 3 × 10⁻¹²) different from the expected 1:2:1 ratio. In a larger set of F2 progeny from the same cross (Lin et al. 1995; Paterson et al. 1995a), we found similarly distorted segregation (236:94:8) in this region.
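The departure of 41:17:2 from a 1:2:1 expectation can be checked with a goodness-of-fit calculation. The text does not name the test used, so the Python sketch below assumes a Pearson chi-square with 2 d.f., for which the upper-tail probability is exp(−χ²/2); the function name `chi2_1_2_1` is hypothetical.

```python
import math


def chi2_1_2_1(n_aa, n_ab, n_bb):
    """Pearson chi-square (2 d.f.) and P-value for an F2 1:2:1 expectation."""
    total = n_aa + n_ab + n_bb
    observed = (n_aa, n_ab, n_bb)
    expected = (total / 4, total / 2, total / 4)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.exp(-chi2 / 2)  # exact upper tail of chi-square with 2 d.f.
    return chi2, p_value


chi2, p_value = chi2_1_2_1(41, 17, 2)
print(round(chi2, 2), p_value)
```

Under this assumed test the statistic for 41:17:2 is about 62, and the P-value is below the 3 × 10⁻¹² bound quoted in the text.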
Patterns of DNA marker distribution: We evaluated the distribution of DNA markers across the sorghum map by comparing intervals of exactly 10.0 cM in length, starting from the top of each chromosome as drawn (Figure 1), except that the last interval in each group was allowed to range from 5 to 15 cM to accommodate the varying lengths of the linkage groups. On the basis of the total number of loci per linkage group, the Poisson probability distribution function was applied to identify intervals that contained significant excesses or deficiencies of various classes of probes. We note that two regions (C04 and D04-07) were preferentially enriched for markers because they contain genes that we seek to clone [C04, the sorghum Sh1 gene regulating shattering of the mature inflorescence (Paterson et al. 1995b); and D04-07, the Ma1 gene regulating photoperiodic flowering, dw2 gene regulating plant stature (Lin et al. 1995; Paterson et al. 1995a), and pApo1 gene regulating apomixis (R. Jessup, G. Burow, M. Hussey and A. H. Paterson, personal communication)]. The average number of loci in the deliberately enriched intervals plus the short terminal bins was 23, virtually identical to the average across the remainder of the genome (23.28); therefore elimination of these anomalous regions has no effect on the analyses.
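A Poisson tail probability of the kind used to flag marker-rich and marker-poor intervals can be computed as sketched below in Python. The function names are assumptions, and the example uses the genome-wide average of 23.28 loci per interval purely for illustration; the text indicates that expectations were derived per linkage group, and the observed counts shown here are hypothetical.

```python
import math


def poisson_pmf(i, lam):
    """P(X = i) for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** i / math.factorial(i)


def poisson_upper_tail(k, lam):
    """P(X >= k): chance of seeing k or more loci in an interval."""
    return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))


def poisson_lower_tail(k, lam):
    """P(X <= k): chance of seeing k or fewer loci in an interval."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


lam = 23.28                               # average loci per interval (from the text)
p_excess = poisson_upper_tail(40, lam)    # hypothetical interval with 40 loci
p_deficit = poisson_lower_tail(10, lam)   # hypothetical interval with 10 loci
print(p_excess < 0.01, p_deficit < 0.01)
```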
Virtually every linkage group has at least one interval containing more loci than would be expected to occur by chance in 1% or fewer cases (A06-7; B01 and B08; C02, -04, and -08; D02 and D04-07; E05; F06-8; G04 and G06-7; H04-05; I06; and J04 and -06).
Significant marker deficiencies were associated with 13 intervals, including A01*, A02; B10 and B12*; C06, C07, and C12*; F01*, F05, F11, and F13*; G09; and H02. These included a disproportionately large number, 5 (25%), of the terminal intervals (*), but the reduced length of three terminal intervals (B12 = 5.8 cM; C12 = 7.5 cM; F13 = 8.7 cM) contributed partly to their marker deficiencies. The distributions of genomic and cDNA clone-derived loci over the intervals were closely correlated (r = 0.79).
Distribution of dominant loci: A total of 664 (26%) loci showed dominant inheritance, segregating as presence of an allele from one parent and absence from the other parent. A total of 395 (15.7%) of the dominant alleles were from SB and 269 (10.7%) from SP, a highly significant difference (χ² = 23.9, 1 d.f., P < 1.1 × 10⁻⁶). Distribution of dominant loci is shown in Figures 1 and 3.
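The reported statistic can be reproduced from the two counts alone, assuming the null expectation is an equal number of SB- and SP-derived dominant loci. In the Python sketch below the function name `chi2_equal_split` is hypothetical; the P-value uses the identity that the upper tail of a chi-square with 1 d.f. equals erfc(√(χ²/2)).

```python
import math


def chi2_equal_split(count_a, count_b):
    """Chi-square (1 d.f.) and P-value for two counts tested against a 1:1 ratio."""
    expected = (count_a + count_b) / 2
    chi2 = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail, chi-square with 1 d.f.
    return chi2, p_value


chi2, p_value = chi2_equal_split(395, 269)
print(round(chi2, 1))  # 23.9
```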
Among the 395 SB-derived dominant loci, 74 (18.7%) are in the single 10-cM interval C05, far beyond the random expectation (>8 loci would have been expected in only 1% of cases). The same interval also is enriched for codominant loci (44), but contains only 2 SP-derived dominant loci (nominally below the average of 2.7). By far the largest concentration of SP-derived dominant loci (23, 8.6%) was in interval H05—while this interval is generally marker rich, the number of SP-derived dominants in this interval is ∼50% higher than the number of SB-derived dominants (15), the opposite of their 50% lower abundance elsewhere in the genome. Several other intervals are also preferentially enriched for dominant loci from one parent or the other: F07 contains an abundance of 9 (P ≤ 0.0009) SP-derived dominant loci vs. 0 SB-derived dominants (P ≤ 0.085); H01 contains an abundance of SB-derived dominants (P < 0.0013) vs. only 1 SP-derived dominant (P ≤ 0.25) and I09 contains an abundance of 7 (P ≤ 0.0009) SP-derived dominants, vs. 1 SB-derived dominant (P ≤ 0.29).
Even after removing the dominant probes mapping to interval C05 of LG C, a significant excess of S. bicolor-derived dominant markers (321, vs. 267 S. propinquum dominants, significant at the 5% level) still remains. Curiously, this excess is explained almost completely by one marker class, the pSB probes, which were derived from S. bicolor hypomethylated (PstI-digested) genomic DNA. The pSB clones detected 152 S. bicolor dominants and 102 S. propinquum dominants (exclusive of probes mapping to interval C05). After removing the pSB clones, there remains a nonsignificant difference of 169 S. bicolor dominants vs. 165 S. propinquum dominants for non-pSB probes outside of interval C05.
Simple sequence repeat-containing loci: On the basis of the sequences of 1933 probes (see http://www.plantgenome.uga.edu/sorghummap for GenBank accession numbers) we were able to identify 130 simple sequence repeat (SSR)-containing sequences (defining an SSR as 6 or more repeats of a dinucleotide or repeats stretching 15 or more base pairs of longer repeat units). Although the distributions of genomic and cDNA clone-derived loci were closely correlated (r = 0.79), the genomic distribution of the SSRs was only loosely related to that of the entire population of mapped DNA probes (r = 0.33), suggesting that SSRs may locate in different genomic domains more frequently than low-copy probes. The relationship between SSR distribution and probe distribution was somewhat closer (r = 0.43) after removing the strong biases in distribution of dominant loci, partly attributable to possible genomic rearrangements (see below). The map location of SSR-containing clones is shown in Figure 1. Further characterization of a subset of the SSRs has been described (Schloss et al. 2002).
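The SSR definition given above translates into a simple pattern search. The Python sketch below is one possible reading, not the authors' procedure: the function names, the upper limit of 6 bp on the repeat unit (`max_unit`), and the exclusion of units that are themselves repeats of a shorter motif are all assumptions made for the example.

```python
import math
import re


def is_primitive(unit):
    """True if unit is not itself a repetition of a shorter motif (e.g. 'AA', 'ACAC')."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq, max_unit=6):
    """Return (start, unit, repeat_count) for each SSR found in seq.

    Dinucleotides need 6 or more repeats; longer units need enough repeats
    to span at least 15 bp.
    """
    seq = seq.upper()
    hits = []
    for n in range(2, max_unit + 1):
        min_repeats = 6 if n == 2 else math.ceil(15 / n)
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (n, min_repeats - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            if is_primitive(unit):
                hits.append((match.start(), unit, len(match.group(0)) // n))
    return hits


print(find_ssrs("GG" + "AC" * 6 + "TT"))  # [(2, 'AC', 6)]
print(find_ssrs("CAG" * 5))               # [(0, 'CAG', 5)]
```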
Distribution of duplicate loci: Among the 2050 probes mapped, a total of 357 revealed DNA polymorphisms at multiple loci that could be mapped, with 279 detecting 2 loci, 58 detecting 3 loci, 13 detecting 4 loci, 6 detecting 5 loci, and 1 detecting 6 loci. The distribution of duplicated loci across the genome is illustrated in Figure 4, composed of 606 data points (keeping in mind that 2-locus probes generate 1 point of intersection, 3-locus probes generate 3 points, 4-locus probes generate 6 points, 5-locus probes generate 10 points, and 6-locus probes generate 15 points). A clickable web-based version of this figure is available at http://www.genetics.org/supplemental/, which displays the probes and exact loci involved. On the basis of a chi-square contingency test, the distribution of duplicate loci over pairs of linkage groups was not random (χ² = 224.06, with 81 d.f.). Several pairs of linkage groups showed striking excesses of duplicated loci (A and G, C and G, C and H, E and H, E and I). Associations of individual linkage groups with multiple partners (for example G with A and C), together with our prior observations (Chittenden et al. 1994) and other work (Whitkus et al. 1992; Gaut and Doebley 1997; Gaut 2001), suggest that if duplication in sorghum is due to paleo-ploidy, then the polyploidization event must be very ancient, surely predating the Sorghum-Zea divergence. Therefore we reevaluated the data on the basis of smaller intervals, breaking each sorghum chromosome into four “intervals” of equal length in centimorgans. This yielded an overall contingency chi-square of 2287.53 (with 1521 d.f., P < 7 × 10⁻³³), further supporting the notion that duplicated loci are not randomly distributed across chromosome pairs. Among the 820 possible comparisons (including intrainterval comparison), a total of 22 pairs (2.7%) of intervals shown in Figure 4 showed positive deviations from the random expectation that were significant at 0.005 (as measured by 1 d.f. chi-square), about 5.4 times higher than the random expectation. Respectively, 12, 16, 10, and 2 intervals showed correspondence with 0, 1, 2, or 3 other intervals in the genome. A total of 74 duplicate loci are intrachromosomal, not significantly different from the random expectation.
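The counts in this paragraph can be verified with a few lines of arithmetic: a probe detecting k loci contributes k(k − 1)/2 pairwise points to the grid, and 40 intervals (10 linkage groups × 4) give 40 × 39/2 between-interval comparisons plus 40 within-interval ones. The Python check below uses only numbers from the text; the variable names are chosen for illustration.

```python
from math import comb

# Number of probes detecting 2, 3, 4, 5, or 6 mapped loci (from the text).
probes_by_locus_count = {2: 279, 3: 58, 4: 13, 5: 6, 6: 1}

n_probes = sum(probes_by_locus_count.values())
# Each probe with k loci yields C(k, 2) points of intersection.
n_points = sum(n * comb(k, 2) for k, n in probes_by_locus_count.items())

n_intervals = 10 * 4  # 10 linkage groups, 4 intervals each
# Unordered pairs of distinct intervals plus each interval against itself.
n_comparisons = comb(n_intervals, 2) + n_intervals
# Degrees of freedom for a 40 x 40 contingency table.
dof = (n_intervals - 1) ** 2

print(n_probes, n_points, n_comparisons, dof)  # 357 606 820 1521
```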
Correspondence to gene arrangements in other taxa: Table 2 summarizes the sources of clones and loci that have been mapped to date, illustrating the opportunities to use this map as a basis for comparisons of many Poaceae taxa.
As an especially important example of the utilization of these data, Figure 5 illustrates comparative alignments of the sorghum and maize genomes based on 952 loci from the maize “bins” map (Gardiner et al. 1993; Davis et al. 1999). A clickable version of Figure 5 is available (at http://www.genetics.org/supplemental/) that shows the specific probes and loci involved. This comparison represents a considerable increase over previously published data (Whitkus et al. 1992; Pereira et al. 1994; Ragab et al. 1994; Paterson et al. 1995a; Dufour et al. 1996). The distributions of loci over the 100 possible combinations of maize chromosomes and sorghum linkage groups were clearly not random (contingency χ² = 790.04, with 81 d.f., P < 9 × 10⁻¹¹⁷). A total of 19 (19%) cells with the largest excesses (from 7.5 to 36.6) of observed data over random expectations account for 520 (55%) of the corresponding points and 74% (582.84) of the chi-square deviation from randomness, suggesting the correspondences illustrated in Figure 5 and listed in Table 3.
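A contingency chi-square of the kind reported for the 10 × 10 grid of maize chromosomes and sorghum linkage groups can be computed as sketched below. This Python function is a generic Pearson calculation written for illustration; the name `contingency_chi2` and the toy tables are assumptions, and the per-cell counts behind the published statistic are not reproduced here.

```python
def contingency_chi2(table):
    """Pearson chi-square and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Toy 2 x 2 table (hypothetical counts) with loci concentrated on the diagonal.
print(contingency_chi2([[10, 0], [0, 10]]))  # (20.0, 1)
```

For a 10 × 10 table the function returns 81 degrees of freedom, matching the value quoted in the text.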
Figure 2. Summary of recombinational interference. The frequencies of adjacent crossovers (two crossovers occurring on different chromosomes within the same individual) and double crossovers (occurring on the same chromosome in the same individual) are plotted vs. the distance between the crossovers.
Marker sequence annotation: Sequences were annotated by local alignment searches with the programs blastn and tblastx against publicly available databases of the NCBI as of November 21, 2002. The default matrix BLOSUM62 and a cutoff of 1 × 10⁻⁶ were used in all BLAST searches. The NCBI database was subdivided into several taxon-specific groups to allow for the efficient determination of not only the best overall hit, but also the best hit among closely related species, excluding unannotated expressed sequence tag and genomic survey sequence database entries. Additional analyses included the use of hidden Markov models to classify sequence data by protein sequence signature. The program InterProScan (Zdobnov and Apweiler 2001) was used to compare the translated sorghum sequences against several protein databases (Pfam, SMART, and ProDom), and Gene Ontology (Ashburner et al. 2000) numbers for each of these classifications were obtained. The results of these analyses revealed that the 578 sorghum query sequences could be classified into 205 distinct protein families and 136 different functional molecular groups. The most common molecular functional groups were “ATP binding” (120 hits) and “Protein Kinase” (56 hits). Results of sequence similarity analyses are available at http://www.genetics.org/supplemental/.
DISCUSSION
This genetically anchored set of sequence-tagged sites provides transferable DNA markers suitable for a wide range of investigations in structural, functional, and evolutionary genomics in several major grain and biomass crops. Although the map was created using the RFLP method and has been applied to several goals by this technology (e.g., Lin et al. 1995; Paterson et al. 1995a,b; Katsar et al. 2002; Ming et al. 2002), genetically mapped sequence tagged sites such as these can be used to discover single-nucleotide or small insertion/deletion polymorphisms that can then be genotyped by many alternative technologies. This possibility increases the value of these loci and reduces the costs associated with their wider utilization. A total of 130 loci that contain simple-sequence repeats have the further advantage of being relatively allele rich (Schloss et al. 2002), a benefit in studies that require differentiation between closely related genotypes.
This framework of genetically anchored sequence-tagged sites will also provide a foundation for physical mapping and ultimately assembling a robust finished sequence of the sorghum genome. The present map permits us to assign loci to bins of ∼0.77 cM; on average, this represents ∼300 kb of genomic DNA based on a consensus genome size estimate of 750 Mbp (although we have recently estimated the genome to be somewhat smaller, ∼690 Mbp; Peterson et al. 2002). To orient different loci within 0.77-cM bins, we are presently hybridizing the genetically mapped probes to BAC libraries for both S. propinquum and S. bicolor. Since the two BAC libraries each provide ∼10× coverage of the genome and are composed of individual BACs that average ∼120 kb in length, this will permit us to resolve the order of closely linked loci to an average resolution of ∼12 kb, assuming that the breakpoints of individual BACs are more or less evenly distributed through the genome. By simply hybridizing the 2050 mapped probes to the 10×-coverage BAC libraries, we expect to identify ∼20,000 BACs in each library, comprising ∼50% of the genome. Further, both libraries have been fingerprinted (http://www.genome.arizona.edu/fpc/sorghum/), permitting the resulting “contigs” to be extended further. By selective BAC end sequencing and the use of comparative approaches made possible by the alignment of our genetically mapped sequences to the nearly completed rice sequence, a robust genetically anchored physical map is expected to coalesce.
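The BAC-based estimates in this paragraph follow from two simple ratios, shown in the Python snippet below. It assumes, as the text does, that BAC ends are evenly distributed, and additionally that each probe hits about as many BACs as the library's fold coverage; the variable names are chosen for illustration.

```python
bac_insert_kb = 120   # average BAC insert size (from the text)
coverage = 10         # approximate fold coverage of each BAC library
n_probes = 2050       # genetically mapped probes

# With evenly spaced BAC ends, neighboring breakpoints lie about
# insert size / coverage apart, which sets the ordering resolution.
resolution_kb = bac_insert_kb / coverage

# Each probe is expected to identify roughly `coverage` BACs per library.
expected_bac_hits = n_probes * coverage

print(resolution_kb, expected_bac_hits)  # 12.0 20500
```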
Figure 3. Distribution of codominant and dominant markers along the sorghum map. For 10-cM intervals along each linkage group, the numbers of codominant (solid), S. bicolor dominant (open), and S. propinquum dominant (shaded) loci are plotted.
Figure 4. Patterns of duplication within the sorghum genome. In this Oxford grid, each dot represents a genetically mapped locus detected by a probe that segregated at two or more polymorphic loci in sorghum, with the x- and y-axis representing chromosomal locations. Red circles along the axes represent the approximate locations of the centromeres. The total number of probes mapping to each pair of sorghum linkage groups is shown in each cell. Areas highlighted in yellow represent regions of significant marker abundance between a pair of linkage groups (determined as described in text). Note that some dots represent multiple probes with the same genetic locations; for a detailed list of exact information for each cell, see http://www.genetics.org/supplemental/.
Nonrandom patterns of DNA marker distribution provide clues to the locations of interesting and important features of sorghum genome organization. On most chromosomes, at least one significant concentration of loci appears to correspond to the centromeric region. We have recently applied overgo (Cai et al. 1998) probes for sorghum centromeric repetitive sequences homologous to pHind22 and Cen38 (Miller et al. 1998; Zwick et al. 2000) to the SP and SB BAC libraries. Co-hybridization of these probes with genetically mapped RFLPs has associated concentrations of centromeric repeats with marker-dense regions of 8 of the 10 linkage groups (LG A: DM064, cM57.7; B: DM007, cM57.7; C: pSB1406, cM47.7; D: pSB580, cM47.7 and pRC162, cM60; E: CSU0462 and pRC182, cM47; G: R2447, cM64.6; I: 5C04A07, cM69.3; and J: pSB0019, cM50.8). Due to the repetitive nature of these probes and the possibility that not all copies are centromeric, these data can be taken only as tentative indications of the possible locations of the sorghum centromeres. For example, on two linkage groups we found associations with mapped probes in regions of normal marker density (LG A: pSB1075, cM98.5; LG H: HHU49, cM68.5). Further, we found no association on one linkage group (F). More definitive mapping of the locations of the sorghum centromeres is in progress by probing synaptonemal complex spreads with genetically mapped probes or their corresponding BACs by fluorescence in situ hybridization (D. G. Peterson and A. H. Paterson, unpublished data).
Clearly, more information will be needed to explain the multiple, dispersed marker-dense regions found on several linkage groups. For example, linkage group B has one terminal concentration of markers and another interstitial concentration. We have recently shown that some sorghum chromosomes have cytologically distinguishable knobs (D. G. Peterson and A. H. Paterson, unpublished observations), and future studies will investigate whether these could account for some marker excesses or deficiencies. Linkage groups C, D, G, and J also show multimodal distributions of marker density that warrant further study.
Summary of probe sources
Figure: Patterns of colinearity between sorghum and maize. In this Oxford grid, each dot represents a locus detected by a probe that was genetically mapped in both sorghum (left) and maize (top), with the x- and y-axes representing chromosomal locations in each taxon. The total number of probes mapping to each pair of maize and sorghum chromosomes is shown in each cell. Lines highlight the regions for which we have inferred synteny between maize and sorghum, as summarized in Table 3. Note that some dots represent multiple probes with the same genetic locations; for a detailed list of exact information for each cell, see http://www.genetics.org/supplemental/.
Differences in the abundance of dominant genetic marker loci suggest that a chromosome structural rearrangement has occurred since the divergence of S. bicolor and S. propinquum from a common ancestor. The single 10-cM interval C05 contains 74 (18.7%) of the 395 SB-derived dominant loci found, far beyond the random expectation (see above), and 71 of these cosegregate at the single location cM 46.2 (along with 6 codominants and one locus dominant for the SP allele). Curiously, this interval is also the apex of the most pronounced segregation distortion found (41:17:2, favoring bicolor homozygotes as described above). The DNA sequences of some of the probes that detect S. bicolor-dominant markers at LG C, cM 46.2 correspond to various portions of the ribosomal DNA [specifically AEST602 matches 18S rRNA (GenBank accession no. X16077) at e < 10^-200 and C152 and pRC017 match the 25S ribosomal RNA gene (GenBank accession nos. M11585 and AY108843) at e = 10^-170 and e = 4 × 10^-71, respectively].
These three probes also mapped as dominant markers for the S. propinquum allele on LG H at cM 32.3–40.0, near the interval (H05) that contained by far the largest concentration of SP-derived dominant loci (23, 8.6%). This suggests that the ribosomal DNA and a large flanking area may have moved in one of the two sorghums since their divergence from a common ancestor, a hypothesis that we are further investigating (D. G. Peterson, J. E. Bowers and A. H. Paterson, unpublished data) and that is consistent with recent findings in rice (Shishido et al. 2000) and legumes (Singh et al. 2001). On the basis of genotypes inferred from segregation at nearby codominant loci, all 435 plants that have been studied to date from this cross (including those from a larger population used for mapping QTL; e.g., Lin et al. 1995; Paterson et al. 1995a) possess at least one copy of the ribosomal DNA on either LG C or LG H. These results are consistent with a requirement for a copy of ribosomal DNA for survival of gametes. We have also noticed some degree of enrichment of the two affected genomic intervals for QTL that differentiate between SB and SP [specifically for the number of seedling tillers and regrowth on H05 (Paterson et al. 1995b) and for the number of seedling tillers, three measures of rhizomatousness, and seed weight on C04 (Paterson et al. 1995a,b)].
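The arithmetic behind the two observations above can be illustrated with a minimal sketch. It rests on assumptions that are not stated in the text: that 41:17:2 are counts of SB/SB : SB/SP : SP/SP genotypes in an F2 with a 1:2:1 Mendelian expectation, and that the ribosomal DNA lies only on LG C in S. bicolor and only on LG H in S. propinquum, at two unlinked loci. The function name and variable names are illustrative only, and the tests actually used by the authors may differ.

```python
import math

def chi_square_gof(observed, expected_ratio):
    """Return (statistic, expected counts) for a chi-square goodness-of-fit test."""
    total = sum(observed)
    ratio_sum = sum(expected_ratio)
    expected = [total * r / ratio_sum for r in expected_ratio]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected

# Assumption: 41:17:2 are genotype counts (SB/SB, SB/SP, SP/SP) in an F2,
# compared with the Mendelian 1:2:1 expectation.
stat, expected = chi_square_gof([41, 17, 2], [1, 2, 1])
# For 2 degrees of freedom the chi-square tail probability is exp(-x / 2).
p_distortion = math.exp(-stat / 2)

# Assumption: an F2 plant lacks rDNA only if it is SP/SP at the LG C locus
# and SB/SB at the LG H locus; unlinked Mendelian loci give 1/4 * 1/4.
p_no_rdna = 0.25 * 0.25
n_plants = 435
expected_without = n_plants * p_no_rdna          # plants expected to lack rDNA
p_zero_observed = (1 - p_no_rdna) ** n_plants    # chance of seeing none by luck

print(f"chi-square = {stat:.2f}, expected counts = {expected}")
print(f"P(distortion by chance) = {p_distortion:.2e}")
print(f"plants expected without rDNA = {expected_without:.1f}")
print(f"P(zero such plants among {n_plants}) = {p_zero_observed:.2e}")
```

Under these assumptions roughly 27 of 435 plants would be expected to lack ribosomal DNA, so observing none is hard to explain without selection. The segregation distortion on LG C itself violates the Mendelian assumption, so the numbers are only indicative.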
Correspondence of maize chromosomes to sorghum linkage groups
The finding that many S. bicolor hypomethylated (PstI-digested) genomic probes lacked a homolog in S. propinquum suggests that there has been considerable and rapid divergence or deletion of low-copy DNA in these taxa. In contrast to cDNAs and excepting the probes mapping near the ribosomal DNA discussed above, a total of 728 pSB probes detect 152 dominant loci that lack an S. propinquum allele vs. only 102 loci that lack an S. bicolor allele, a highly significant difference. This suggests that a portion of the sorghum genome may be composed of rapidly evolving low-copy DNA, such as has been reported for tomato (Zamir and Tanksley 1988). However, this portion is likely to be relatively smaller in sorghum than in tomato, as Cot analysis shows that sorghum has a much smaller low-copy DNA fraction (Peterson et al. 2002). Finally, we note that the lack of an RFLP allele at a dominant locus does not necessarily reflect deletion of the locus, but could be attributable to comigration with monomorphic loci, gain/loss of restriction sites creating short or long fragments that are not captured on Southern blots, or other artifactual reasons, which presumably account for many of the 102 loci that lack an S. bicolor allele for these S. bicolor-derived probes.
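One plausible way to check the 152 vs. 102 comparison is a chi-square test against an equal split, sketched below. The choice of test and the function name are assumptions for illustration; the text does not say which test was used.

```python
import math

def chi_square_equal_split(count_a, count_b):
    """Chi-square test (1 df) of two counts against a 1:1 expectation."""
    total = count_a + count_b
    expected = total / 2
    stat = ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected
    # For 1 degree of freedom, the tail probability equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Counts from the text: dominant loci lacking an S. propinquum allele
# vs. dominant loci lacking an S. bicolor allele (pSB probes only).
stat, p_value = chi_square_equal_split(152, 102)
print(f"chi-square = {stat:.2f}, P = {p_value:.4f}")
```

With these counts the statistic is about 9.8 on one degree of freedom, in line with the description of the difference as highly significant.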
The genomic distribution of mapped (i.e., polymorphic in SB × SP) loci shows little relationship to differences in levels of intraspecific allelic diversity in different chromosomal regions (Dvorak et al. 1998; Hamblin and Aquadro 1999). In a separate study (P. Morrell, J. E. Bowers and A. H. Paterson, personal communication), we have shown that allelic diversity is not randomly distributed across the sorghum chromosomes but is highly structured. For 183 loci representing most of the 10-cM bins in this study, we have estimated allelic richness (not shown) from a worldwide sample of 55 land-race and wild accessions representing the breadth of diversity in the Sorghum genus (P. J. Morrell and A. H. Paterson, personal communication); curiously, these estimates of allelic diversity showed remarkably little relationship with any of the factors studied herein. Correlations of allelic diversity with total mapped locus abundance per interval (0.0006), codominant locus abundance (-0.038), and SSR abundance (-0.16) were uniformly weak. An important future investigation will be to study how marker density and/or allelic diversity correlate with the distributions of phenotypically significant variants such as QTL.
By virtue of a very high level of DNA polymorphism (Chittenden et al. 1994), the SB × SP cross has proven especially facile for “comparative mapping” of DNA clones that have been previously mapped in other taxa. To foster opportunities to use the relatively small genome of sorghum to help advance genomics in the larger genomes of many other tropical Poaceae, we have mapped 865 heterologous DNA clones from eight other taxa (Table 2). In one example, we show herein the alignment of the sorghum genome to the physically fourfold larger genome of maize. Most maize chromosomes correspond to nonoverlapping regions of only one sorghum chromosome but most sorghum chromosomes correspond to nonoverlapping regions of two maize chromosomes, reiterating the recent duplication in maize. Sorghum is an especially valuable guide for genomic analysis of Saccharum [sugarcane, one of the world's most important crops, with the 2001/2002 world cane sugar crop forecast at a near record 126.8 million metric tons (FAS 2001)], which may have shared a common ancestor as recently as 5 million years ago (Sobral et al. 1994). The present data supplement and complement our prior efforts in this regard (Ming et al. 1998, 2002). Other work in progress uses probes described herein (Table 3) together with species-specific recombination data to address the comparative organization of sorghum and other tropical grasses including Pennisetum (R. Jessup, M. Hussey and A. H. Paterson, personal communication), Cynodon (C. Bethel, E. Sciara and A. H. Paterson, personal communication), Echinochloa (T. Fukao, A. H. Paterson and M. Rumpho, unpublished data), and Panicum (A. Missaoui, A. H. Paterson and J. Bouton, personal communication).
Despite the clear value of the comparative approach for fostering progress in study of gene arrangement in complex genomes (e.g., Saccharum) or underexplored taxa (e.g., Pennisetum, Cynodon, Echinochloa, and Panicum), it is equally important to note that a remarkable 45% of comparative data fell in regions other than those we infer to correspond between sorghum and maize. Many of these incongruities are likely to reflect nonchromosomal rearrangement mechanisms that are becoming clear from microsynteny studies (Tikhonov et al. 2000) and studies of ancient duplication (Bennetzen 2000; Paterson et al. 2000) or, possibly, rapid divergence or deletion of hypomethylated DNA as we report above. A few tantalizing hints of the possibility of more ancient duplication events in maize are suggested by locus arrangements in several regions (e.g., maize chr. 1/sorghum LG A), but await more data to test with confidence. One sorghum linkage group (H) that is well populated with DNA markers (Table 1) shows remarkably little correspondence to any maize chromosome (just a small portion of maize chromosome 4). This hints at the possibility that large segments of chromatin may have been lost during the maize-sorghum divergence; however, a conclusive test awaits more data.
While our results clearly reinforce the evidence in support of the duplication of most regions of the maize genome, many questions remain about the levels, patterns, and antiquity of chromatin duplication within sorghum itself. The patterns of distribution of duplicate loci in sorghum are clearly not random, with many small islands of colinearity evident, and adjacent intervals often showing correspondence to syntenic intervals (A2, -3, and -4 to G3, -1, and -4; E3-4 to H2-1; F3-4 to I3, I1; J2, -4 to I2-1). However, for ∼30% of the genome we can discern no corresponding duplicated region, and another 30% shows correspondence to two or more unlinked regions. Duplication of sorghum chromatin appears to more closely resemble the pattern observed for rice, in which the completed sequence (Goff et al. 2002; Yu et al. 2002) has largely borne out early hints (Kishimoto et al. 1994; Nagamura et al. 1995) of ancient segmental duplication in some regions. The correspondence of some sorghum genomic intervals to two or more unlinked intervals may reflect either very localized colinearity or, possibly, recent duplications superimposed on ancient ones, which may be present in maize as we speculate above. Much more data will be needed to unravel the details of the relationship(s) between individual duplicated segments in sorghum, as well as their relationships (if any) to those in close relatives such as sugarcane and maize or distant relatives such as rice or even Arabidopsis.
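The per-cell probe tallies of an Oxford grid of duplicated loci, as discussed above, can be sketched as follows. This is an illustration only: the probe names, linkage groups, and positions are invented toy data, the function name is hypothetical, and duplicates within a single linkage group are ignored for simplicity.

```python
from collections import Counter
from itertools import combinations

def oxford_grid_counts(probe_loci):
    """Count, for each pair of linkage groups, the probes mapping to both.

    probe_loci maps a probe name to a list of (linkage_group, cM) loci.
    Duplicates within one linkage group are ignored in this sketch.
    """
    counts = Counter()
    for loci in probe_loci.values():
        groups = sorted({lg for lg, _cm in loci})
        for pair in combinations(groups, 2):
            counts[pair] += 1
    return counts

# Hypothetical toy data, not taken from the map reported here.
toy_probes = {
    "probe1": [("A", 20.0), ("G", 35.5)],
    "probe2": [("A", 22.1), ("G", 30.2)],
    "probe3": [("E", 40.0), ("H", 12.3)],
    "probe4": [("A", 80.0), ("G", 10.0), ("H", 55.0)],
    "probe5": [("C", 46.2), ("C", 60.0)],
}

grid = oxford_grid_counts(toy_probes)
for (lg1, lg2), n in sorted(grid.items()):
    print(f"LG {lg1} x LG {lg2}: {n} probe(s)")
```

Cells with counts well above what the marginal totals would predict are the candidates for duplicated regions. Judging significance would need an explicit null model, which this sketch does not attempt.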
Acknowledgments
We honor the memory of coauthor Keith F. Schertz, who made many of these discoveries possible while teaching several of us about sorghum and about much more. Science is richer for his efforts, and we are poorer for his passing. We thank the USDA-National Research Initiative, National Science Foundation Plant Genome Research Program, International Consortium for Sugarcane Biotechnology, and U.S. Golf Association for financial support of various aspects of this work.
Footnotes
- Communicating editor: J. A. Birchler
- Received February 5, 2003.
- Accepted May 5, 2003.
- Copyright © 2003 by the Genetics Society of America