Sponsored by the National Science Foundation and the U.S. Department of Agriculture, a wheat genome sequencing workshop was held November 10–11, 2003, in Washington, DC. It brought together 63 scientists of diverse research interests and institutions, including 45 from the United States and 18 from a dozen foreign countries (see list of participants at http://www.ksu.edu/igrow). The objectives of the workshop were to discuss the status of wheat genomics, obtain feedback from ongoing genome sequencing projects, and develop strategies for sequencing the wheat genome. The purpose of this report is to convey the information discussed at the workshop and provide the basis for an ongoing dialogue, bringing forth comments and suggestions from the genetics community.
WHEAT AS AN IMPORTANT CROP SPECIES
Wheat was the first domesticated crop and is the youngest polyploid species among the agricultural crops (see Figures 1–3 for background information). Together with rice and maize, wheat provides >60% of the calories and proteins for our daily life. Wheat is best adapted to temperate regions, unlike rice and maize, which prefer tropical environments. Wheat occupies 17% of all crop area (in 2002, 210 million hectares vs. 147 million for rice and 139 million for maize). The trade value of wheat exceeds that of any other cereal species, including rice and maize: $31 billion of world trade in 2001 vs. $13 and $19 billion for rice and maize (FAOstat database: http://apps.fao.org/default.jsp). To meet human needs by 2050, grain production must increase at an annual rate of 2% on an area of land that will not increase much beyond the present level. Significant advances in the understanding of the wheat plant and grain biology must be achieved to increase absolute yields and protect the crop from an estimated average annual loss of 25% caused by biotic (pests) and abiotic stresses (heat, frost, drought, and salinity). Genome sequencing is a widely accepted mechanism for accelerating achievement of these objectives, because it leverages similar work from other crops and plants and enables more rapid genetic improvement. In addition to food security, wheat genome sequencing will lead to improved human health and nutrition.
CEREAL GENOME STRUCTURE
Rice, maize, and wheat, which coevolved from a common ancestor ∼55–75 million years ago (Kellogg 2001; Figure 1), differ greatly in genome size. Among agricultural crops, common bread or hexaploid wheat (Triticum aestivum L., 2n = 6x = 42, AABBDD) has the largest genome at 16,000 Mb, ∼8-fold larger than that of maize and 40-fold larger than that of rice (Arumuganathan and Earle 1991). Amplification of transposable elements (TEs), coupled with duplication of chromosome segments, was a major driving force for cereal genome expansion, although polyploidization also contributed to the large genome size of wheat. The rice genome at 430 Mb consists of at least 22% TEs (Ma et al. 2004). The maize genome at ∼2500 Mb consists of >50% TEs (Meyers et al. 2001; Whitelaw et al. 2003). About 90% of the wheat genome consists of repeated sequences, and 70% consists of known TEs (Li et al. 2004). Low-copy TEs and miniature inverted repeat TEs are most often associated with active genes, whereas high-copy TEs mainly insert in the intergenic space (SanMiguel et al. 1996, 2002). Gene distribution along chromosomes is relatively homogeneous in the small genome of rice, but gene clusters (gene-rich regions) are separated by long stretches of TEs (gene-poor or gene-free regions) in the wheat genome, as demonstrated by deletion mapping (Gill et al. 1996a,b; Faris et al. 2000) and BAC-based physical mapping (J. Dvořák, unpublished results). Within some gene-rich regions of the wheat genome, gene density is similar to that of smaller genomes (Feuillet and Keller 1999).
Comparative mapping of cereal genomes using a standard set of probes showed extensive conservation in gene content and order at a low-resolution genetic map level. It seems a logical choice to use the small genome of rice as a surrogate for positional cloning of agriculturally important genes from large cereal genomes based on microcolinearity. However, small translocations, deletions, inversions, and duplications often violate microcolinearity and complicate this process (Tikhonov et al. 1999; Keller and Feuillet 2000; Dubcovsky et al. 2001; Li and Gill 2002; Sorrells et al. 2003). Most importantly, there is little colinearity for disease resistance genes due to their rapid evolution among grasses (Leister et al. 1998). Sequence comparisons between rice and the wheat genomic regions that harbor the genes Lr10, Pm3 (Guyot et al. 2004), Q, and Tsn1 (J. D. Faris, J. P. Fellers, H. Lu, K. M. Haen and B. S. Gill, unpublished results) detected extensive microrearrangements or complete loss of microcolinearity. All this suggests that rice will often not be a good model for positional cloning in wheat. However, sequencing the wheat genome will advance comparative studies of grass genomes and lead to better understanding of the relationship among grass lineages (Freeling 2001).
Comparative analysis of orthologous sequences from two related species has been a powerful tool for de novo prediction of genes and identification of noncoding functional elements on the basis of the assumption that sequences conserved among species that have diverged for many millions of years must have functional roles in their genomes. Comparison of human and mouse genome sequences increased the specificity of genome annotation in both species (Mouse Genome Sequencing Consortium 2002). Alignment of sequences orthologous to a 1.8-Mb region of human chromosome 7q31, from several animal genomes, detected numerous functional elements such as transcription-factor-binding sites and noncoding RNA transcripts (Margulies et al. 2003). Similarly, a significant portion of rice genes were identified on the basis of their similarities with Arabidopsis genes at the protein sequence level. Comparison of genomic sequences of 52 orthologous genes of rice and maize indicated that most genes contain conserved noncoding sequences (CNSs) and that upstream regulatory genes tend to be enriched in CNSs (Inada et al. 2003). From this point of view, sequencing the wheat genome will facilitate annotation of all plant genomes, especially grass genomes.
CURRENT UNDERSTANDING OF THE WHEAT GENOME AND ITS SEQUENCE
Common wheat is an allohexaploid consisting of seven groups of chromosomes, each group containing a set of three homeologous chromosomes belonging to the A, B, and D genomes, derived from a common ancestor (Figures 2 and 3). Despite their close homology, homeologs are normally prevented from pairing by the Ph1 gene on the long arm of chromosome 5B. Thus, common wheat functions much like a diploid organism, although it is able to tolerate aneuploidy due to the buffering effect of polyploidy. Sets of viable mono-, tri-, and tetrasomic cytogenetic stocks were developed for all chromosomes, and nullisomics were developed for 11 chromosomes (Sears 1954). Because the loss of a pair of chromosomes can be compensated by two additional doses of a homeolog, 42 compensating nulli-tetrasomics were developed (Sears 1966). The monosomic chromosomes tend to misdivide, and this property was exploited to produce a series of chromosome-arm aneuploids: monotelosomics, ditelosomics, tritelosomics, and isochromosome lines (Sears and Sears 1978). More recently, taking advantage of the gametocidal chromosome introduced from Aegilops cylindrica Host, Endo and Gill (1996) developed 436 segmental deletion lines in Chinese Spring (CS). All of these genetic stocks, which are in the CS background, have been used to localize genes or markers to a specific chromosome, chromosome arm, or subarm region and play a central role in wheat genetics and genomics.
The 21 wheat chromosomes can be readily identified by heterochromatic banding (Gill et al. 1991; see Figure 3) or in situ hybridization patterns using repetitive DNA probes (Pedersen and Langridge 1997). A specific chromosome or chromosome arm can be flow sorted at high purity using the genetic stocks (Vrána et al. 2000). These sorted chromosomes have been used for construction of chromosome-specific BAC libraries (Safar et al. 2004), together with other genetic and molecular resources for wheat genome sequencing (summarized in Table 1).
Genes and recombination events are not randomly distributed along wheat chromosomes. They are clustered in the distal regions, while proximal regions are largely gene poor or gene free (Werner et al. 1992; Gill et al. 1993, 1996a,b; Kota et al. 1993; Delaney et al. 1995a,b; Mickelson-Young et al. 1995; Faris et al. 2000; Weng et al. 2000; Qi et al. 2003). A detailed study of the short arm of the group 1 chromosomes demonstrated that 70% of the genes and 82% of the total recombination distance were contained within two major gene-rich regions (1S0.8 and 1S0.5) that physically encompass only 14% of the arm (Sandhu et al. 2001). This picture of uneven wheat gene distribution has been strongly supported by the recent assignment of nearly 6000 wheat expressed sequence tags (ESTs) to 159 deletion bins across the 21 chromosomes (Qi et al. 2004). Associated with recombination, gene duplications also show similar distribution patterns along the wheat chromosomes (Akhunov et al. 2003).
Currently, ESTs (∼500,000 to date) are the largest sequence resource for wheat. ESTs are derived from cDNA clones, and as such they do not contain promoters, introns, or other nontranscribed functional elements. The gene coverage of human ESTs is 75%; of mouse, 56% (Mouse Genome Sequencing Consortium 2002); of Arabidopsis, 60% (Arabidopsis Genome Initiative 2000); of tomato, 47% (Van der Hoeven et al. 2002); and of rice, 36% (Feng et al. 2002). It is estimated that the gene coverage of the wheat EST collection is ∼60%, close to that of Arabidopsis (Li et al. 2004), indicating that ∼40% of wheat genes are not represented in EST collections.
In addition to the more than half a million ESTs, ∼6 Mb of wheat genomic DNA has been sequenced, including ∼3 Mb from a random shotgun genomic library and ∼3 Mb from large-insert genomic clones. BAC clones selected as hybridizing with specific genes revealed that gene density varies greatly, ranging from 1 (Faris et al. 2003) to 16 genes/BAC (Brooks et al. 2002), and that genes tend to be clustered into gene islands (Wicker et al. 2001; Brooks et al. 2002; SanMiguel et al. 2002). Feuillet and Keller (1999) reported a gene density of 1 gene/4–5 kb within a small segment on chromosome 1A. The 600-kb contig at the Tsn1 locus on chromosome 5B contained 13 genes at an average gene density of 1 gene/46 kb (J. D. Faris, J. P. Fellers, H. Lu, K. M. Haen and B. S. Gill, unpublished results). However, 9 of these genes were located within a 90-kb segment, resulting in a gene density of 1 gene/10 kb. In contrast, Faris et al. (2003) found only three known genes within a 300-kb BAC contig spanning the Q locus on chromosome 5A, yielding an estimate of 1 gene/100 kb. Large tracts of repetitive elements with very few intervening low-copy noncoding sequences separated the three genes.
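The gene densities quoted above follow directly from the contig sizes and gene counts. As a minimal sketch (the function name and the region labels are illustrative choices, not taken from the cited studies), the arithmetic can be reproduced as follows:

```python
def kb_per_gene(span_kb, n_genes):
    """Average spacing between genes, in kb per gene."""
    return span_kb / n_genes

# (span in kb, number of genes) as reported in the text above
regions = {
    "Tsn1 contig, chromosome 5B": (600, 13),
    "Tsn1 gene-rich segment": (90, 9),
    "Q locus contig, chromosome 5A": (300, 3),
}

for name, (span_kb, n_genes) in regions.items():
    print(f"{name}: 1 gene/{kb_per_gene(span_kb, n_genes):.0f} kb")
```

The roughly 10-fold spread between the gene-rich Tsn1 segment and the Q locus is the kind of local variation that the gene-island model is meant to describe.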
The physical map of the D-genome donor species Aegilops tauschii Coss. is under construction. Five BAC libraries have been constructed and fingerprinted using a new, high-resolution method. Briefly, BAC DNA is simultaneously digested with four 6-bp restriction endonucleases (BamHI, EcoRI, XbaI, and XhoI) and a 4-bp restriction endonuclease (HaeIII). Subsequently, each of the four recessed 3′ ends generated by the four 6-bp restriction enzymes is labeled with a different fluorescent dye and the fragments are sized using a capillary DNA sequencer (Luo et al. 2003a). At the same time, wheat RFLP markers and ESTs have been placed onto the physical map to anchor the BAC contigs to genetic maps and deletion bins (Luo et al. 2003b).
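The logic of this five-enzyme fingerprint can be illustrated with an in silico digest. The sketch below is a simplification under stated assumptions and is not the software used by Luo et al. (2003a): the clone is treated as a linear top-strand sequence, fragment ends are tagged with the enzyme name rather than a dye, and only fragments with at least one end produced by a 6-bp cutter are reported, since HaeIII leaves blunt ends that would not be labeled. The recognition sites and cut offsets are the standard ones for these enzymes; the function names are hypothetical.

```python
import re

# Recognition site and top-strand cut offset for each enzyme.
SIX_CUTTERS = {
    "BamHI": ("GGATCC", 1),  # G^GATCC
    "EcoRI": ("GAATTC", 1),  # G^AATTC
    "XbaI": ("TCTAGA", 1),   # T^CTAGA
    "XhoI": ("CTCGAG", 1),   # C^TCGAG
}
FOUR_CUTTER = {"HaeIII": ("GGCC", 2)}  # GG^CC, blunt ends

def cut_sites(seq, enzymes):
    """Return (cut position, enzyme name) for every site in seq."""
    sites = []
    for name, (site, offset) in enzymes.items():
        # A lookahead match also finds overlapping occurrences.
        for match in re.finditer(f"(?={site})", seq):
            sites.append((match.start() + offset, name))
    return sites

def fingerprint(seq):
    """List (fragment size, labeling enzymes) for labeled fragments."""
    seq = seq.upper()
    cuts = sorted(cut_sites(seq, {**SIX_CUTTERS, **FOUR_CUTTER}))
    bounds = [(0, None)] + cuts + [(len(seq), None)]
    fragments = []
    for (start, left), (end, right) in zip(bounds, bounds[1:]):
        labels = sorted({e for e in (left, right) if e in SIX_CUTTERS})
        if labels and end > start:
            fragments.append((end - start, labels))
    return fragments

# Toy sequence with one BamHI, one HaeIII, and one EcoRI site.
toy = "AAAA" + "GGATCC" + "TTTTT" + "GGCC" + "AAA" + "GAATTC" + "CC"
print(fingerprint(toy))
```

In this toy digest the HaeIII site shortens the labeled fragments without contributing a label of its own, which is the role the 4-bp cutter plays in bringing fragments into the size range of a capillary sequencer.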
SELECTION OF A TARGET WHEAT GENOME FOR SEQUENCING
Ancient or modern farmers have grown four wheat species: einkorn (monococcum), emmer (durum), timopheevi, and common (hexaploid, or bread) wheat. However, only durum and common wheat are currently used for food production, accounting for 4 and 96% of the total wheat acreage, respectively. The diploid relatives of bread wheat have smaller genomes than that of hexaploid wheat (5500 Mb vs. 16,000 Mb), and sequencing one of them should require approximately one-third of the time and expense of sequencing hexaploid wheat. However, in addition to its great economic importance, there are several lines of evidence that led many participants to favor sequencing common wheat. The A, B, and D subgenomes of common wheat have undergone dynamic evolution since they came together to form hexaploid wheat (see Figures 1 and 2). Today, they differ significantly from one another and from the genomes of cultivated diploid wheats. Sequencing hexaploid wheat could yield the greatest store of important new information about wheat and crop plant biology and provide the greatest return on investment.
First, it is the wild wheat species T. urartu and not einkorn wheat (the cultivated diploid) that is the A-genome donor of polyploid wheat (Dvořák 1998). The A genomes in these two diploid wheat species have diverged for a million years (Huang et al. 2002). During this period, the A genomes of diploid and polyploid wheat accumulated many genetic differences. Both B- and D-genome donors exist only as wild species. It would be difficult to find a diploid species that truly reflects the related, but highly diverged, genome found in hexaploid wheat. Therefore, although the physical map of the diploid D genome will be extremely valuable, hexaploid wheat should be the focus of a sequencing effort as all the characterized genes, mapping populations, and cytogenetic stocks exist in hexaploid wheat.
Second, polyploidy is a major force in plant evolution and especially in agriculture as most crop plants are also polyploid. At least 70% of angiosperms are thought to have undergone one or more cycles of polyploidization (Stebbins 1966; Masterson 1994). Rapid genetic and epigenetic changes and restructuring of the genomes occur in synthetic amphiploids where different genomes are forced to share the same nucleus (Comai 2000; Wendel 2000). Sequence elimination (Feldman et al. 1997; Liu et al. 1998; Ozkan et al. 2001), reactivation of transposable elements (Kashkush et al. 2003), and changes in methylation (Shaked et al. 2001) and gene expression patterns (He et al. 2003) were observed in wheat upon amphiploid formation. Actually, common wheat experienced two sequential polyploidization events at different times (Figure 2). Sequencing the genome of common wheat will offer a unique opportunity to elucidate mechanisms of polyploid speciation and evolution.
Third, many agronomically important genes or their alleles are chromosome specific and are not triplicated (see Figure 3). For example, known useful alleles of pest resistance genes such as those against fungal diseases (e.g., rusts and powdery mildew) and insects (e.g., greenbug, Russian wheat aphid, and Hessian fly) are present in only one of the three homeologous chromosomes; the Ph pairing-control genes mentioned above are located on chromosomes 5B and 3D; grain hardness gene Ha is located only on 5D because the copies on 5A and 5B were eliminated after polyploidization (Gautier et al. 2000).
Comparative sequence analyses have demonstrated that plant genomes are more dynamic than animal genomes. Studies on the bz (Fu and Dooner 2002) and z1C (Song and Messing 2003) regions of maize revealed significant variation in local gene content and colinearity among inbreds. A similar situation was also found in Ae. tauschii (S. Brooks and J. P. Fellers, unpublished results) and T. monococcum L. (Scherrer et al. 2002). Even though diploid wheats and the subgenomes of hexaploid wheat can be expected to differ significantly from one another, it seems unlikely that an orthologous locus would have been deleted from all three homeologous chromosomes in hexaploid wheat. Thus, sequencing the genome of common wheat would be the best way to harvest all the genes present in the diploid ancestors. ESTs and/or filtered genome sequences of diploid ancestors will be helpful in assigning the wheat BAC sequences to a specific subgenome.
APPROACHES TO SEQUENCING LARGE GENOMES
Depending upon the amount of information required and resources available, a genome can be sequenced by three approaches or their combinations: clone by clone (CBC), whole genome shotgun (WGS), or selective gene sequencing. To date, most finished complex genomes, chromosomes, or subchromosome regions have been sequenced by a CBC approach (C. elegans Sequencing Consortium 1998; Arabidopsis Genome Initiative 2000; International Human Genome Mapping Consortium 2001; Feng et al. 2002; Sasaki et al. 2002; Wood et al. 2002; Rice Chromosome 10 Sequencing Consortium 2003; Sherer et al. 2003). This strategy requires large insert libraries and fine clone-based physical maps for minimal tiling paths (MTPs). Although the most costly approach, CBC produces long, if not complete, pseudomolecules of a genome or a chromosome and provides the most comprehensive information about structure and function of a genome.
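Selecting a minimal tiling path from a physical map can be viewed as an interval-covering problem. The sketch below uses a standard greedy cover and assumes that each clone has already been placed on a contig with known start and end coordinates. The clone names, the coordinates, and the function itself are hypothetical, and a real MTP would also enforce a minimum overlap between neighboring clones.

```python
def minimal_tiling_path(clones):
    """Pick the fewest clones that cover a contig end to end.

    clones maps a clone name to its (start, end) contig coordinates.
    Raises ValueError if the clones leave a gap.
    """
    ordered = sorted(clones.items(), key=lambda item: item[1])
    covered = min(start for start, _ in clones.values())
    target_end = max(end for _, end in clones.values())
    path, i = [], 0
    while covered < target_end:
        best = None
        # Among clones starting inside the covered region,
        # keep the one that reaches farthest to the right.
        while i < len(ordered) and ordered[i][1][0] <= covered:
            if best is None or ordered[i][1][1] > best[1][1]:
                best = ordered[i]
            i += 1
        if best is None or best[1][1] <= covered:
            raise ValueError("gap in contig: no clone extends coverage")
        path.append(best[0])
        covered = best[1][1]
    return path

# Hypothetical BAC positions (kb) on one contig.
example = {
    "A": (0, 100),
    "B": (20, 150),
    "C": (90, 220),
    "D": (140, 260),
    "E": (210, 300),
}
print(minimal_tiling_path(example))
```

Here three of the five clones suffice, which illustrates why a fine physical map reduces the number of clones that must be sequenced.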
In contrast, WGS takes advantage of computing power to produce draft sequences for a genome relatively quickly by sequencing and assembling small insert libraries (Adams et al. 2000; Venter et al. 2001; Goff et al. 2002; Yu et al. 2002). The draft sequences can be used to extract important information such as gene content (Venter et al. 2001; Yu et al. 2002) and compositional gradients of genes (Wong et al. 2002) and to develop markers (Goff et al. 2002). The genome image inferred from draft sequences is incomplete, particularly with respect to gene context. The high proportions of repeated sequences in large genomes pose a major difficulty for WGS in computing capacity, sequence assembly, and financial cost. WGS is very useful for small genomes (Adams et al. 2000; Galagan et al. 2003), especially for labs with limited genomics resources. A WGS variant is the whole chromosome shotgun, in which a separated chromosome rather than the whole genome is used for library construction (Churcher et al. 1997; Bowman et al. 1999; Glockner et al. 2002). For sequencing the large genome of mouse, a mixed strategy was adopted. Assembly of sevenfold WGS sequences generated a draft genome sequence. CBC sequencing of BACs created a hybrid WGS-BAC assembly while BACs were used for finishing (Mouse Genome Sequencing Consortium 2002).
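To give a sense of the scale that genome size imposes on a shotgun strategy, the following sketch applies the textbook idealized coverage model (uniformly random reads, no repeats) to a 16,000-Mb genome. The 500-bp read length is an assumption made only for illustration, and the model ignores the repeat content that is the main practical obstacle in wheat.

```python
import math

def reads_needed(genome_bp, coverage, read_bp):
    """Number of reads required to reach a given fold coverage."""
    return math.ceil(coverage * genome_bp / read_bp)

def uncovered_fraction(coverage):
    """Expected fraction of bases hit by no read under random sampling."""
    return math.exp(-coverage)

wheat_bp = 16_000_000_000  # 16,000 Mb, as stated for hexaploid wheat
read_bp = 500              # assumed read length, for illustration only

print(reads_needed(wheat_bp, 7, read_bp))
print(f"{uncovered_fraction(7):.4%}")
```

Under these assumptions, sevenfold coverage of a wheat-sized genome would require on the order of a few hundred million reads before any assembly is attempted.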
In recent years, two genome filtration strategies, methylation filtration (MF; Rabinowicz et al. 1999) and C0t-based cloning and sequencing (CBCS; Peterson et al. 2002) or high C0t (HC; Yuan et al. 2003), were proposed for selectively sequencing the gene space of large genomes. MF is based on the characteristic of plant genomes in which genes are largely hypomethylated but repeated sequences are highly methylated. Methylated DNA is cleaved when transferred into an Mcr+ Escherichia coli strain and only hypomethylated DNA is recovered. CBCS/HC separates single- and low-copy sequences, including most genes, from the repeated sequences on the basis of their differential renaturation characteristics. Both MF and HC have been used for efficient characterization of the maize gene space (Palmer et al. 2003; Whitelaw et al. 2003), although their ability to discover >90% of maize genes has not yet been proven. Combining CBCS with genome filtration can reduce the cost greatly while retaining high coverage of genic regions. Another alternative may be identification of gene-rich regions on a detailed physical map and sequencing large-insert clones from these regions.
WHEAT GENOME SEQUENCING: WHAT APPROACHES ARE APPROPRIATE?
In the workshop, various approaches to sequencing the wheat genome were considered. These included selected BAC/CBCS, MF, HC, and/or a combination approach. The discussion was focused on the relative efficiency of each strategy in relation to cost and division of labor among the international participants.
The WGS approach was considered too difficult mainly because of the large size and highly repetitive nature of the wheat genome. Several participants proposed a selected BAC approach, in which the gene-containing BACs were isolated by hybridization with ESTs and fingerprinted to construct MTPs and the gene-rich MTPs were sequenced. It was argued that a global physical map should be considered rather than only the gene-rich regions for greater impact on map-based cloning of agriculturally important genes. For gene filtration, preliminary results showed that MF could enrich wheat genes by 2- to 3-fold (Li et al. 2004) or even 5-fold (P. Rabinowicz, A. Bedell, M. A. Budiman, N. Lakey, A. O'Shaughnessy, V. Balija, L. Nascimento, W. R. McCombie and R. A. Martienssen, unpublished results). However, MF results in barley, a close relative of wheat, show a higher level of enrichment, suggesting that the effectiveness of MF in wheat may be underestimated (P. Rabinowicz, A. Bedell, M. A. Budiman, N. Lakey, A. O'Shaughnessy, V. Balija, L. Nascimento, W. R. McCombie and R. A. Martienssen, unpublished results). CBCS/HC enriches for genes by ∼10-fold (D. Lamoureux and B. S. Gill, unpublished results). It was suggested that a combination BAC/CBC approach and gene filtration would greatly reduce the cost while retaining high coverage of genic regions. Comprehensive genetic maps were considered pivotal for the assembly of BAC contigs.
Considering the large genome size and the possibility of international cooperation, a chromosome-based approach was suggested, in which a specific chromosome would be flow sorted (Vrána et al. 2000; Safar et al. 2004) and used to construct a BAC library and a gene filtration library. One advantage of this approach lies in its potential for a division of labor. One country or center could concentrate on one chromosome or one homeologous group.
There was general agreement in the meeting and in follow-up communications within the group that wheat genome sequencing should be conducted in three phases: pilot, assessment, and scale up. The pilot phase was recommended to be a 5-year wheat genome project focused mainly on physical and genetic mapping along with sample sequencing of the wheat genome aimed at better understanding wheat genome structure. This would involve generation of more refined physical maps for wheat and anchoring of these to the genetic map. With respect to sequencing, two types of pilot sequencing were recommended. First, more extensive sampling of the wheat genome would be conducted by sequencing a large number of BACs (>100) that represent the gene-rich regions of the wheat genome and another >100 that would provide a random sampling of the genome. This would provide a test of the gene-rich-region model for the structure of the wheat genome. Second, the pilot phase of sequencing should involve deeper sequencing of enriched libraries (MF and high C0t) to provide a more statistically significant sampling of the wheat genome. The pilot phase would also measure the appropriateness of chromosome-arm enrichment methods, which have great promise but have not yet been thoroughly sampled. Another vital component of the pilot phase is development and validation of wheat genome annotation methods. The current use of nonstandardized annotation methods prevents assessment of gene content across research groups. The assessment phase will involve a collective analysis of all available wheat sequence and annotation data. This will then allow for a well-developed, scientifically supported, and economically feasible approach to be proposed for sequencing the entire wheat genome. Thus, the scale-up phase will not be initiated until completion of the assessment phase.
There was a strong consensus among the workshop participants that sequencing the wheat genome would deliver positive impacts across the spectrum of education, new research technologies, and application to the agricultural industry and provide new insights into the functioning of a polyploid genome. Wheat genome sequencing could also be expected to stimulate national and international collaboration. An international wheat genome project could be established through the following steps:
1. Constructing an accurate, sequence-ready, global physical (BAC-contig) map anchored to the high-resolution genetic and deletion maps of the 21 chromosomes (see 4 below) of the hexaploid wheat genotype Chinese Spring.
2. Exploring the use of flow-sorted chromosome- and arm-specific libraries in the assembly of the global physical map and in preparation for the sequencing of the gene-containing regions of homeologous chromosome groups.
3. Identifying genomic sequence tags using gene-enrichment procedures such as hi-C0t or methyl filtration, ESTs, and full-length cDNAs of 2x, 4x, and 6x wheat for an accurate estimation of the wheat unigene set.
4. Leveraging rice sequence and wheat-rice gene synteny, comparative genetics, and wheat unigenes toward the development of high-resolution genetic and deletion maps of the 21 chromosomes of Chinese Spring wheat.
5. Identifying a random set of 100 gene-containing BACs from the physical map and another 100 random BACs for sample sequencing. This will provide a test of the gene-rich model and allow refining the technology for assembling sequences with a high repetitive sequence content. Sample sequencing of BACs from different ploidy wheats and genotypes should also be undertaken.
6. Integrating bioinformatics at every step for project management, data analysis, improved methods of sequence annotation, and dissemination of data.
7. Engaging all wheat stakeholders and educational institutions (K–12) globally, especially in developing countries, and locally in all aspects of the research, technology transfer, workforce training, and promotion of science.
8. Maintaining all data, materials, and resources in the public domain and free of intellectual property rights.
9. Organizing an international steering committee to coordinate and execute all aspects of the wheat genome sequencing project.
This workshop was supported by grants from the National Science Foundation (DBI 0344938) and U.S. Department of Agriculture-National Research Initiative (2003-38836-01639) to B.S.G. This is Kansas Agricultural Experiment Station journal article no. 05-109-J.
- Genetics Society of America