Transposable elements (TEs) constitute >80% of the wheat genome but their dynamics and contribution to size variation and evolution of wheat genomes (Triticum and Aegilops species) remain unexplored. In this study, 10 genomic regions have been sequenced from wheat chromosome 3B and used to constitute, along with all publicly available genomic sequences of wheat, 1.98 Mb of sequence (from 13 BAC clones) of the wheat B genome and 3.63 Mb of sequence (from 19 BAC clones) of the wheat A genome. Analysis of TE sequence proportions (as percentages), ratios of complete to truncated copies, and estimation of insertion dates of class I retrotransposons showed that specific types of TEs have undergone waves of differential proliferation in the B and A genomes of wheat. While both genomes show similar rates and relatively ancient proliferation periods for the Athila retrotransposons, the Copia retrotransposons proliferated more recently in the A genome whereas Gypsy retrotransposon proliferation is more recent in the B genome. It was possible to estimate for the first time the proliferation periods of the abundant CACTA class II DNA transposons, relative to that of the three main retrotransposon superfamilies. Proliferation of these TEs started prior to and overlapped with that of the Athila retrotransposons in both genomes. However, they also proliferated during the same periods as Gypsy and Copia retrotransposons in the A genome, but not in the B genome. As estimated from their insertion dates and confirmed by PCR-based tracing analysis, the majority of differential proliferation of TEs in B and A genomes of wheat (87 and 83%, respectively), leading to rapid sequence divergence, occurred prior to the allotetraploidization event that brought them together in Triticum turgidum and Triticum aestivum, <0.5 million years ago. More importantly, the allotetraploidization event appears to have neither enhanced nor repressed retrotranspositions. 
We discuss the apparent proliferation of TEs as resulting from their insertion, removal, and/or combinations of both evolutionary forces.
GENOMES of higher eukaryotes, and particularly those of plants, vary extensively in size (Bennett and Smith 1976, 1991; Bennett and Leitch 1997, 2005). This is observed not only among distantly related organisms, but also between species belonging to the same family or genus (Chooi 1971; Jones and Brown 1976). More than 90% of genes are conserved in sequenced plant genomes (Bennetzen 2000a; Sasaki et al. 2005; Jaillon et al. 2007) and thus differences in gene content explain only a small fraction of the genome size variation. It is widely accepted that whole-genome duplication by polyploidization (Blanc et al. 2000; Paterson et al. 2004; Adams and Wendel 2005) and differential proliferation of transposable elements (TEs) are the main driving forces of genome size variation. The differential proliferation of TEs results from their transposition (SanMiguel et al. 1996; Bennetzen 2000b, 2002a,b; Kidwell 2002; Bennetzen et al. 2005; Hawkins et al. 2006; Piegu et al. 2006; Zuccolo et al. 2007) as well as the differential efficiency of their removal (Petrov et al. 2000; Petrov 2002a,b; Wendel et al. 2002).
Polyploidization and differential proliferation of TEs are particularly obvious in the case of wheat species belonging to the closely related Triticum and Aegilops genera. Rice (Oryza sativa), Brachypodium, and diploid Triticum or Aegilops species underwent the same whole-genome duplications (Adams and Wendel 2005; Salse et al. 2008), but Triticum or Aegilops genomes are >10 times larger (Bennett and Smith 1991), mainly due to proliferation of repetitive DNA, which represents >80% of the genome size (Smith and Flavell 1975; Vedel and Delseny 1987). Diploid wheat species can differ in their genome sizes by hundreds or even thousands of megabases (Bennett and Smith 1976, 1991; http://data.kew.org/cvalues/homepage.html). For example, the genome size of Triticum monococcum (6.23 pg) is 1.3 pg greater than that of Triticum urartu (4.93 pg) (Bennett and Smith 1976, 1991), although these species diverged <1.5 million years ago (MYA) (Dvorak et al. 1993; Huang et al. 2002; Wicker et al. 2003b). Similarly, the calculated size of the B genome of polyploid wheat species (7 pg) is higher than that of any diploid wheat species (http://data.kew.org/cvalues/homepage.html).
The genome size variation within wheat is also accentuated by frequent allopolyploidization events, among which two successive events have led to the formation of the allohexaploid bread wheat Triticum aestivum (2n = 6x = 42, AABBDD). The first event led to the formation of the allotetraploid Triticum turgidum (2n = 4x = 28, AABB) and occurred <0.5–0.6 MYA between the diploid species T. urartu (2n = 2x = 14, AA), donor of the A genome, and an unidentified diploid species of the Sitopsis section, donor of the B genome (Feldman et al. 1995; Blake et al. 1999; Huang et al. 2002; Dvorak et al. 2006). The second allopolyploidization event occurred 7000–12,000 years ago, between the early domesticated tetraploid T. turgidum ssp. dicoccum and the diploid species Aegilops tauschii (2n = 14), donor of the D genome, resulting in hexaploid wheat (Feldman et al. 1995).
The amount of available wheat genomic sequence is very limited compared to other organisms (reviewed by Sabot et al. 2005; Stein 2007; http://genome.jouy.inra.fr/triannot/index.php and http://www.ncbi.nlm.nih.gov/). Individual bacterial artificial chromosome (BAC) clones, selected primarily because they contained genes of agronomic interest, have been sequenced. Analyses of randomly chosen BAC clones from wheat have also been performed (Devos et al. 2005), and 2.9 Mb of sequences from a whole-genome shotgun library of Ae. tauschii were analyzed by Li et al. (2004). More recently, a detailed analysis of 19,400 BAC-end sequences of chromosome 3B, representing a cumulative sequence length of nearly 11 Mb (1.1% of the estimated chromosome length), was reported (Paux et al. 2006). Altogether, these sequencing efforts have confirmed previous estimates of the amount of repetitive DNA in the wheat genome (∼80%) (Smith and Flavell 1975; Vedel and Delseny 1987) and have identified the major types of TEs (Wicker et al. 2002; Sabot et al. 2005).
Because of the limited genomic sequence information, the extent to which various TEs contribute to the wheat genome and affect its size variation, or how they are distributed among different genomes, remains unexplored. Little is known about the dynamics of TEs, their proliferation processes, and whether they proliferated gradually or in waves of sudden bursts of insertions. In this study, 10 genomic regions from wheat chromosome 3B were sequenced and used to constitute, along with three other genomic sequences, 1.98 Mb of sequence from the wheat B genome. Transposable element dynamics and proliferation in these B-genome sequences were analyzed and compared to those in 3.63 Mb of sequence from 19 genomic regions of the wheat A genome. Our study provides novel insights into the dynamics and differential proliferation of TEs as well as their important role in the evolution and divergence of the wheat B and A genomes.
MATERIALS AND METHODS
Plant material and genomic DNA isolation:
Hexaploid wheat deletion lines used to map the 10 BAC clones on different deletion bins of chromosome 3B (see results) were originally described by Qi et al. (2003) and kindly provided by Catherine Feuillet (INRA, Clermont-Ferrand, France). Hexaploid wheat genotypes were kindly provided by Joseph Jahier (INRA, Rennes, France). Tetraploid wheat genotypes were kindly provided by Moshe Feldman (Weizmann Institute). Genomic DNA was extracted from leaves as described by Graner et al. (1990).
Primer design and PCR-based tracing of retrotransposon insertions:
The program Primer3 (Rozen and Skaletsky 2000) was used to design oligonucleotide primers on the basis of TE–TE or TE-unassigned DNA junctions. We often designed and used several pairs (including nested pairs) of PCR primers. Internal controls (PCR primers designed within the TE) were also used. Primer sequences are given in supplemental Table 1. PCR reactions were carried out in a final volume of 10 μl with 200 μM of each dNTP, 500 nM each of forward and reverse primers, and 0.2 units Taq polymerase (Perkin Elmer). PCR amplification was conducted using the following “touchdown” procedure: 14 cycles (30 sec 95°, 30 sec 72° minus 1° for each cycle, 30 sec 72°), 30 cycles (30 sec 95°, 30 sec 55°, 30 sec 72°), and one additional cycle of 10 min 72°. Amplification products were visualized using standard 2% agarose gels.
BAC sequencing, sequence assembly, and annotation:
BAC shotgun sequencing was performed at the Centre National de Séquençage (Evry, France) essentially as described by Chantret et al. (2005). Genes, TEs, and other repeats were identified by computing and integrating results on the basis of BLAST algorithms (Altschul et al. 1990, 1997), predictor programs, and different software and procedures, detailed below. Cross-analysis of the information obtained for genes and TEs as well as for repeats and unassigned DNA was integrated into ARTEMIS (Rutherford et al. 2000). Sequence annotation and analysis were performed as described in supplemental Method 1. The 10 BAC clone sequences were submitted to EMBL under the following accession nos.: TA3B54F7, AM932680; TA3B63B13, AM932681; TA3B63B7, AM932682; TA3B81B7, AM932683; TA3B95C9, AM932684; TA3B95F5, AM932685; TA3B95G2, AM932686; TA3B63C11, AM932687; TA3B63E4, AM932688; TA3B63N2, AM932689. Accession numbers for the three publicly available genomic sequences from the wheat B genome (Sabot et al. 2005; Gu et al. 2006; Dvorak et al. 2006) are CT009588, AY368673, DQ267103.
Publicly available genomic sequences from the wheat A genome:
The retained publicly available A-genome sequences consist of 19 sequenced and well annotated BAC clones or contigs (SanMiguel et al. 2002; Yan et al. 2002, 2003; Wicker et al. 2003b; Chantret et al. 2005; Isidore et al. 2005; Dvorak et al. 2006; Gu et al. 2006; Miller et al. 2006), representing >3.5 Mb. Accession numbers for the analyzed BAC sequences are the following: diploid A genome—AF326781, AF488415, AY146588, AY188331, AY188332, AY188333, AY491681, AY951944, AY951945, DQ267106, AF459639; tetraploid A genome—AY146587, AY485644, AY663391, CT009587, DQ267105; hexaploid A genome—AY663392, CT009586, DQ537335.
Chromosome 3B BAC clones and fluorescent in situ hybridization:
The 10 BAC clones and/or their subclones were originally mapped by fluorescence in situ hybridization (FISH) on flow-sorted 3B chromosomes using the Cot-1 fraction as blocking DNA to suppress hybridization of repeated sequences (Dolezel et al. 2004; Safar et al. 2004; M. Kubalakova and J. Dolezel, personal communication). Further FISH hybridization experiments were conducted, without Cot-1 DNA, on mitotic metaphase chromosomes of hexaploid wheat (T. aestivum) cv. Chinese Spring. The FISH hybridization protocol is presented in supplemental Method 2.
Estimation of Long Terminal Repeat-retrotransposon insertion dates:
For all genomic sequences of the B and A genomes of wheat, retrotransposon copies with both 5′ and 3′ long terminal repeats (LTRs) and target-site duplications (TSD) were considered as corresponding to original insertions and analyzed by comparing their 5′ and 3′ LTR sequences. The two LTRs were aligned and the number of transition and transversion mutations was calculated using MEGA3 software (Kumar et al. 2004). A mutation rate of 1.3 × 10⁻⁸ substitutions/site/year (SanMiguel et al. 1998; Ma et al. 2004; Ma and Bennetzen 2004; Wicker et al. 2005; Gu et al. 2006) was used. The insertion dates and their standard errors (SE) were estimated using the formula T = K2P/2r (Kimura 1980), where K2P is the Kimura two-parameter distance between the two LTRs and r is the substitution rate.
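The dating formula above can be sketched in a few lines of Python. This is an illustrative sketch only, not the MEGA3 procedure used in the study: the function names `k2p_distance` and `insertion_date` are hypothetical, the input is assumed to be two already-aligned LTR strings of equal length, and columns containing gaps or ambiguous bases are simply skipped (an assumption about gap handling). The distance and its standard error follow the standard Kimura (1980) two-parameter expressions.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(ltr5, ltr3):
    """Return (K, SE) for two aligned LTR sequences (Kimura two-parameter model).

    Columns with gaps or ambiguous bases are ignored (assumption).
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    n = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        n += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if n == 0:
        raise ValueError("no comparable sites")
    p = transitions / n
    q = transversions / n
    w1 = 1.0 - 2.0 * p - q
    w2 = 1.0 - 2.0 * q
    if w1 <= 0 or w2 <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    k = -0.5 * math.log(w1 * math.sqrt(w2))
    a_coef = 1.0 / w1
    b_coef = 0.5 * (1.0 / w1 + 1.0 / w2)
    var = (a_coef**2 * p + b_coef**2 * q - (a_coef * p + b_coef * q) ** 2) / n
    return k, math.sqrt(max(var, 0.0))

def insertion_date(k, rate=1.3e-8):
    """Convert an LTR divergence K into years since insertion: T = K / (2r)."""
    return k / (2.0 * rate)
```

With the rate used here, a divergence of 0.078 corresponds to 3 million years, and the SE of K can be converted to an SE on the date with the same division by 2r.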
All statistical analyses and the different tests (Kolmogorov–Smirnov, bootstrap, and probability density functions) were done with the R software (http://www.r-project.org). Kolmogorov–Smirnov tests (Férignac 1962) were applied to check whether the distribution of insertion dates of retrotransposons deviates from uniformity, and whether the distributions differ when comparing different TE families or superfamilies within and between the B and A genomes. Probability density of TE insertion dates was estimated using Gaussian kernel density estimation (Silverman 1986), taking into account the measured standard deviation for each individual insertion date (Kimura 1980).
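One way to read the kernel density step is as an average of Gaussian curves, each centered on one insertion date and as wide as that date's own standard deviation. The Python sketch below illustrates that reading. The exact R procedure is not given in the text, so using each SD as the kernel bandwidth is an assumption, and the function name and example values are hypothetical.

```python
import math

def insertion_density(t, dates, sds):
    """Density at time t as the mean of Gaussians N(date_i, sd_i).

    Each insertion contributes a kernel whose width is its own standard
    deviation (assumed reading of the per-date SD weighting).
    """
    if len(dates) != len(sds) or not dates:
        raise ValueError("dates and sds must be non-empty and of equal length")
    total = 0.0
    for mu, sd in zip(dates, sds):
        if sd <= 0:
            raise ValueError("standard deviations must be positive")
        z = (t - mu) / sd
        total += math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))
    return total / len(dates)

# Hypothetical insertion dates (MYA) and their standard deviations.
example_dates = [0.5, 1.0, 2.0]
example_sds = [0.1, 0.2, 0.3]
curve = [insertion_density(x / 100.0, example_dates, example_sds) for x in range(0, 301)]
```

Precisely dated insertions produce narrow peaks and poorly dated ones are spread out, so the curve does not overstate the timing of uncertain insertions.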
RESULTS
Constitution of a genomic sequence data set representative of the wheat B genome—analysis of 10 BAC sequences from the wheat chromosome 3B:
Only three large well-annotated genomic sequences (BAC clones), representing 0.55 Mb of sequence, were available for the wheat B genome (Sabot et al. 2005; Dvorak et al. 2006; Gu et al. 2006). To obtain more representative genomic sequences, we sequenced and annotated 10 BAC clones of wheat chromosome 3B, representing 0.15% of the chromosome length (1.43 Mb) (Figure 1). Detailed annotation files are deposited at EMBL/GenBank Data Libraries.
These sequenced genomic regions show a high proportion of TEs, which represent 79.1% of the cumulative sequence length (Figure 1, supplemental Table 2). Other repeated DNA sequences represent 2.4% and unassigned DNA sequences account for 17.5% of the cumulative sequence length.
We conducted gene prediction analysis for the remaining 18.5% of non-TE and nonrepeated DNA, using different search programs (see supplemental Method 1 and supplemental Text 1 for detailed description). Genes of known and unknown functions or putative genes were defined on the basis of predictions and the existence of rice or other Triticeae homologs. Hypothetical genes were identified on the basis of prediction programs only. Pseudogenes were not well predicted, and frameshifts needed to be introduced within the coding sequence (CDS) structure to better fit a putative function on the basis of BLASTX (mainly with rice). Truncated pseudogenes (genes disrupted by large insertions or deletions) and highly degenerated CDS sequences were considered as gene-relics. Combined together, all these types of gene sequence information (GSI) account for only 1.0% of the sequence and are present in seven BAC clones (one or two genes per clone) while the remaining three BAC clones (TA3B95C9, TA3B95G2, TA3B63N2) contain no genes (indicated in Figure 1A and detailed in supplemental Text 1, supplemental Table 3, and supplemental Table 4).
Six genes (of known or unknown function) and two putative genes were identified using the FGENESH prediction software (http://www.softberry.com) and by identification of homologs in rice (Figure 1A, supplemental Table 3). Six additional “gene-relics” or “pseudogenes” were also identified on the basis of colinearity with rice (Figure 1A, supplemental Table 3). Finally, 10 CDS, designated as “hypothetical genes,” were identified according to the FGENESH prediction program only (Figure 1A, supplemental Table 4).
TE prediction, annotation, classification, and nomenclature were performed essentially as suggested by the unified classification system for eukaryotic TEs (Wicker et al. 2007) with two modifications. The Athila retrotransposons were analyzed separately from the other Gypsy retrotransposons (see also supplemental Methods 1). The Sukkula retrotransposons were considered as belonging to the Gypsy superfamily because of similarities with the Erika (Gypsy) elements. The 79.1% of TEs were shown to be composed of a wide variety of TEs, distributed as follows: 61.9% class I (171 TEs from 48 families), 16.2% class II (113 TEs from 28 families), and 1.0% unclassified TEs (18 TEs from 9 families) (Figure 1). The CACTA TEs represent the majority (96%) of class II TEs. More details about the TE composition in the 10 different BAC clones of wheat chromosome 3B are provided in supplemental Text 2.
Twenty-one transposable element families, some of which are present in several copies, were identified for the first time in this study (Figure 1A, indicated by arrows). They account for 9.8% by number and 7.9% by length of the overall sequences. Class I retrotransposons are the category for which we found the majority of novel TE families (17). Description of these novel TEs, their features, and the suggested nomenclature are presented in supplemental Text 2 and supplemental Table 5.
The 10 sequenced BAC clones or their subclones were originally mapped by FISH on flow-sorted 3B chromosomes, using the Cot-1 fraction as blocking DNA to suppress hybridization of repeated sequences (Dolezel et al. 2004; Safar et al. 2004; M. Kubalakova and J. Dolezel, personal communication). As described by Devos et al. (2005) and Paux et al. (2006), specific PCR markers, based on TE–TE or TE-unassigned DNA junctions, were used to confirm the different BAC clone map positions on the deletion bins (Qi et al. 2003) of chromosome 3B (except TA3B63E4) (Figure 1B). Details of PCR markers and genotyping results are given in supplemental Table 6.
Representation of transposable elements and the wheat B genome:
Five BAC clone sequences were publicly available from the B genome of wheat (Sabot et al. 2005; Dvorak et al. 2006; Gu et al. 2006). Four of these were sequenced for two orthologous regions in tetraploid and hexaploid wheat species (one BAC clone per region and per species) (Sabot et al. 2005; Gu et al. 2006). As they share nearly identical sequences (99%) with common TE insertions, they were considered as redundant in our study and only the longest BAC clone sequences (three in total) were used to calculate and assess TE proliferation. These, added to the above-described 10 genomic region sequences of wheat chromosome 3B, constitute 1.98 Mb of sequence from the wheat B genome. Four main TE superfamilies occupy 66.5% of the analyzed B-genome loci: the Athila superfamily (54 elements), the Copia superfamily (57 elements), the Gypsy superfamily (79 elements), and the CACTA superfamily (70 elements) (Table 1). Interestingly, proportions of the Athila, Copia, and Gypsy retrotransposons (respectively, 10.8, 14.2, and 28.1%) (Table 1) are very similar to estimates based on 11 Mb of BAC-end sequences of chromosome 3B (Paux et al. 2006). The major deviation concerns the proportion of CACTA class II TEs, which is higher in the 13 genomic regions (13.4%) than in the overall BAC-end sequences (4.9%), probably due to their clustering in some BAC clones that we have sequenced, such as TA3B54F7 (40.5% of CACTA TEs) (Figure 1).
The 13 sequences represent only ∼0.03% of the B genome. However, statistical tests, using SE as well as a bootstrap analysis with 10,000 resamplings, confirm the robustness of estimations of sequence proportions of the Gypsy, Copia, Athila, and CACTA TE superfamilies (Table 1). We also evaluated the variation of mean sequence proportions estimated for the four TE superfamilies by comparing all possible clone number representations and combinations (from 1 to 12 BAC clones) (Figure 2). Results show that representing the wheat B genome with a low number of BAC clones results in very variable proportions of the TE sequences (Figure 2). These variations decrease significantly by increasing the number of considered BAC clones (Figure 2). This confirms the usefulness of our effort in sequencing more BAC clones for better representation of the wheat B genome.
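The bootstrap check described above can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the R code used in the study: BAC clones are taken as the resampling unit, each clone is represented by a hypothetical `(te_bp, total_bp)` pair, and the proportion is pooled over the resampled clones.

```python
import random
import statistics

def bootstrap_te_proportion(clones, n_resamples=10_000, seed=0):
    """Bootstrap the pooled TE proportion by resampling clones with replacement.

    clones: list of (te_bp, total_bp) pairs, one per BAC clone (hypothetical layout).
    Returns (mean, standard error, (2.5th percentile, 97.5th percentile)).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(clones, k=len(clones))
        te_bp = sum(te for te, _ in sample)
        total_bp = sum(total for _, total in sample)
        estimates.append(te_bp / total_bp)
    estimates.sort()
    low = estimates[int(0.025 * n_resamples)]
    high = estimates[int(0.975 * n_resamples) - 1]
    return statistics.mean(estimates), statistics.stdev(estimates), (low, high)

# Hypothetical clones: (bp annotated as one TE superfamily, clone length in bp).
example_clones = [(30_000, 120_000), (55_000, 150_000), (20_000, 140_000), (48_000, 160_000)]
mean_prop, se_prop, interval = bootstrap_te_proportion(example_clones, n_resamples=1_000)
```

A narrow interval across resamples suggests that the estimate does not hinge on any single clone, which is the sense in which the text speaks of robustness.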
It is also interesting to note that direct FISH hybridization, using the whole BAC clone as a probe, resulted in dispersed and mostly homogeneous signals across all wheat chromosomes for 8 of the 10 BAC clones of wheat chromosome 3B (all except TA3B63C11 and TA3B54F7) (Safar et al. 2004 and supplemental Figure 1), thus confirming sequencing results that show high TE composition.
Constitution of a genomic sequence data set representative of the wheat A genome:
The publicly available A-genome sequences that we were able to use are more abundant and consist of 20 sequenced and well-annotated BAC clones or contigs. Ten of these were comparatively sequenced for five orthologous regions of the wheat A genome at the diploid, tetraploid, and/or hexaploid levels and were partially overlapping (Wicker et al. 2003b; Chantret et al. 2005; Isidore et al. 2005; Dvorak et al. 2006; Gu et al. 2006), while others were determined at only one ploidy level (mostly diploid) (SanMiguel et al. 2002; Yan et al. 2002, 2003; Miller et al. 2006). Comparisons show that no shared TE insertions were observed between orthologous regions (from two ploidy levels), except in the region of the high-molecular-weight (HMW) glutenin gene, the sequences of which were nearly identical at the tetraploid and hexaploid levels (Gu et al. 2006). Thus, we used only the sequence from hexaploid wheat to represent the HMW glutenin gene region and considered all the other different orthologous regions (from different ploidy levels) separately. This led to 19 BAC clones, representing 3.63 Mb of sequence, that were analyzed for the wheat A genome.
The Gypsy TEs were found to occupy 19.7%, the Athila TEs 10.4%, the Copia TEs 21.8%, and the CACTA TEs 9.4% of the cumulative sequence length (Table 1). As for the B-genome sequences, we also analyzed and validated the robustness of the estimation of sequence proportions of the main TE superfamilies and their representation of the A genome (Figure 2). Similar proportions of the Gypsy, Copia, Athila, and CACTA TEs were found whether the 11 genomic sequences from the diploid A genome or those determined from A genomes of tetraploid (five regions) and hexaploid (three regions) wheat species were considered separately or combined (data not shown).
Comparison of TE sequence proportions and ratios of complete to truncated copies:
Our analysis showed a significantly higher proportion of Gypsy retrotransposons in the wheat B-genome sequences than in the A genome (Table 1). Conversely, a higher proportion of Copia retrotransposons is observed in genomic sequences of the wheat A genome than in the B genome (Table 1). Proportions of the Athila and CACTA TEs were not statistically different between the two genomes (Table 1).
Major differences were found between the three main retrotransposon superfamilies in the ratio of complete (intact) copies, defined as having both LTRs and target-site duplications (TSD), to degenerated and truncated copies that resulted from LTR-mediated unequal homologous recombination or illegitimate DNA recombination (Devos et al. 2002; Ma et al. 2004; Ma and Bennetzen 2004; Vitte and Bennetzen 2006) (Table 1). In the B-genome sequences, the Athila and Copia retrotransposons show low ratios of complete to incomplete retrotransposons (respectively, 0.32 and 0.46), whereas the Gypsy retrotransposons show the highest ratio (0.98) (Table 1). In comparison, the 3.63 Mb of genomic sequence of the wheat A genome shows a lower ratio (0.45) of complete to incomplete Gypsy retrotransposons, whereas the ratio for Copia retrotransposons (0.67) is higher than that observed in the B genome (Table 1). The Athila retrotransposon ratio in the A genome is comparable to the ratio in the B genome (0.36 and 0.32, respectively).
CACTA TE original insertions are characterized by the “CACTA” sequence and 3-bp TSD sequence motifs surrounding terminal inverted repeats (TIR) at both ends. We used these signatures to define complete CACTA copies, where the “CACTA,” TIR, and TSD sequence motifs are observed at both ends, and truncated copies, where the “CACTA” and TSD motifs are absent from one or both ends. The ratio of complete to incomplete copies of the CACTA class II TEs was about four times lower in the wheat B genome (ratio of 0.37) than in the A genome (ratio of 1.52) (Table 1).
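The complete versus truncated rule for CACTA copies can be written down as a small Python sketch. The representation is hypothetical: each copy is a pair of sets listing which signatures were observed at its 5′ and 3′ ends, and any copy that is not complete is counted as truncated, which is a simplifying assumption.

```python
REQUIRED = frozenset({"CACTA", "TIR", "TSD"})

def is_complete(end_5, end_3):
    """A copy is complete when all three signatures are seen at both ends."""
    return REQUIRED <= set(end_5) and REQUIRED <= set(end_3)

def complete_to_truncated_ratio(copies):
    """Ratio of complete to truncated copies for a list of (end_5, end_3) pairs."""
    complete = sum(1 for end_5, end_3 in copies if is_complete(end_5, end_3))
    truncated = len(copies) - complete
    if truncated == 0:
        raise ValueError("no truncated copies; ratio is undefined")
    return complete / truncated

# Hypothetical annotations: one complete copy and two truncated ones.
example_copies = [
    ({"CACTA", "TIR", "TSD"}, {"CACTA", "TIR", "TSD"}),
    ({"TIR"}, {"CACTA", "TIR", "TSD"}),   # 5' end lacks CACTA and TSD
    (set(), set()),                        # both ends degraded
]
example_ratio = complete_to_truncated_ratio(example_copies)
```

The same counting applies to LTR retrotransposons if the required signatures are replaced by the presence of both LTRs and the TSD.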
Insertion dates and proliferation of LTR retrotransposons:
To understand differences in sequence proportions and the ratios of complete to truncated copies between retrotransposon superfamilies, as well as between the B and A genomes, we compared TE proliferation periods and rates.
The two LTRs are identical at the time of retrotransposon insertion and their sequence divergence reflects the time elapsed since the insertion (SanMiguel et al. 1998). Several studies have shown that LTRs evolve at approximately twice the rate of genes and UTR regions, and we used a rate of 1.3 × 10⁻⁸ substitutions/site/year (Ma et al. 2004; Ma and Bennetzen 2004; Wicker et al. 2005; Gu et al. 2006).
We calculated the LTR divergence and dates of insertion of the Athila, Copia, and Gypsy retrotransposons (complete copies with both LTRs and TSD) found in the wheat B and A genomes (Figure 3). Such TE insertion dates offer a very important insight into the relative timing of various events, regardless of the approaches used to estimate nucleotide substitution rates or the molecular clock calibration points used in these calculations.
The vast majority of complete retrotransposons in the B and A genomes of wheat (86 and 92%, respectively) were estimated to be <3 million years old (Figure 3), in agreement with several previous studies of grasses and other plant species (SanMiguel et al. 1998, 2002; Wicker et al. 2003b, 2005; Gao et al. 2004; Ma et al. 2004; Du et al. 2006; Piegu et al. 2006; Wicker and Keller 2007). This is explained by the fact that LTR retrotransposons are continuously removed by unequal homologous recombination and illegitimate DNA recombination as new ones are inserted (Vicient et al. 1999; Devos et al. 2002; Ma et al. 2004; Pereira 2004). Insertion of the Egug element (RLGa_Egug_TA3B95C9-1 ∼5 MYA; divergence of 0.131) is the oldest such event found in our study and the most recent one is the Sukkula insertion (RLG_Sukkula_TA3B63B7-2) for which only a 1-base indel differentiates the two LTRs of 4192/4193 bp.
Comparison of LTR divergence dates revealed that different LTR–retrotransposon superfamilies and families proliferated at different periods and rates during evolution of the wheat B and A genomes (Figure 3). We applied Kolmogorov–Smirnov tests to check whether within the last 3 million years (0.078 divergence) the distribution of insertion dates of retrotransposons deviates from uniformity (thus confirming a burst of higher proliferation), and whether these dates are different when comparing different retrotransposon families or superfamilies within and between the wheat B and A genomes (thus illustrating differential proliferation). This was done for all complete copies of the three main retrotransposon superfamilies as well as for the most abundant retrotransposon families (nine) that have five or more complete copies in the B and/or A genomes (Figure 3).
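The uniformity check can be sketched as a one-sample Kolmogorov–Smirnov test against a uniform distribution over the 3-million-year window. The Python sketch below is illustrative only: the study used R, whereas the P-value here comes from the asymptotic Kolmogorov series with a common small-sample correction, so it is an approximation. The function name and example dates are hypothetical.

```python
import math

def ks_uniform(dates, window=3.0):
    """One-sample KS test of insertion dates against Uniform(0, window).

    Returns (D, approximate P-value). The P-value uses the asymptotic
    Kolmogorov distribution, not an exact small-sample computation.
    """
    xs = sorted(dates)
    n = len(xs)
    if n == 0:
        raise ValueError("at least one date is required")
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = min(max(x / window, 0.0), 1.0)      # uniform CDF on [0, window]
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return d, 1.0                              # series converges slowly; P is ~1 here
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)

# Hypothetical dates (million years): evenly spread versus clustered near the present.
even_dates = [(i - 0.5) / 20 * 3.0 for i in range(1, 21)]
burst_dates = [0.01 * i for i in range(1, 21)]
```

A small P-value for a set such as `burst_dates` indicates that insertions are concentrated in part of the window rather than spread evenly across it.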
Superfamily level comparison:
The combination of all complete retrotransposon copies at the superfamily level (Figure 3A) indicated that the distribution of the Gypsy retrotransposon insertion dates in both B and A genomes and that of Copia retrotransposons in the A genome were significantly different from uniform (P-value <0.01) because of their higher proliferation during the last 2 million years (Figure 3A). Proliferation of the Copia retrotransposons in the B genome was uniform and low all across the 3-million-year period, whereas proliferation of the Athila retrotransposons was different from a uniform distribution in both genomes at P-value <0.1.
One possible reason for the non-uniform distributions of retrotransposon insertion dates within the 3-million-year period is that older insertions are more likely to have been removed (completely or partially) from the genome (see above). Therefore, we checked whether distributions of insertions are significantly different from a uniform distribution for the most recent period of evolution, during which the impact of DNA removal should be lower. To carry out this analysis, we divided the LTR–retrotransposon insertions according to the median (of their distribution), which varies depending on the retrotransposon superfamily and family (Figure 3A, gray circle). Kolmogorov–Smirnov (Férignac 1962) tests were then conducted on the half of the complete copies that show the most recent insertion dates. Distribution of insertion dates of the Gypsy retrotransposons in the wheat B genome and that of the Copia retrotransposons in the B and A genomes can be considered as uniform (P-value >0.05, Figure 3A), indicating that they have constantly proliferated during this most recent period. In contrast, the distribution of Athila retrotransposons in the wheat B and A genomes and that of Gypsy retrotransposons in the A genome are not uniform (P-value <0.05, Figure 3A), consistent with a decreasing proliferation during the most recent period.
Comparison of the proliferation of the three retrotransposon superfamilies shows that distribution of the Athila retrotransposons is statistically different from that of the Gypsy retrotransposons (Figure 3A, P-value <0.05) in the B genome. The Athila distribution is significantly different from that of the Gypsy and Copia retrotransposons (Figure 3A, P-value <0.05) in the A genome.
Comparison of the distributions of the three retrotransposon superfamilies between the B and A genomes shows that Copia distributions are significantly different (Figure 3A, P-value <0.05) due to their higher proliferation and more recent insertions in the A genome. Both genomes show a similar old distribution of the Athila retrotransposons (Figure 3A). Distributions of the Gypsy retrotransposons were not statistically different between the two genomes for the entire 3-million-year period (Figure 3A, P-value >0.05). However, separate Kolmogorov–Smirnov tests for the most recent period show that these have proliferated less in the wheat A genome (P-value = 0.052, Figure 3A), unlike in the wheat B genome (P-value = 0.38, Figure 3A).
Distribution of the most abundant retrotransposon families:
Some specific retrotransposon families were abundant in the B and/or A genomes. This is the case of the Angela and Wis families, together representing 72 and 85% of the Copia superfamily in the B and A genomes, respectively (Figure 3B). This is also the case of the Sabrina family representing 62 and 63% of the Athila superfamily in the B and A genomes, respectively (Figure 3B). There are more families that compose the Gypsy retrotransposon superfamily, the most abundant being Fatima, representing 25% in both genomes (Figure 3B).
Kolmogorov–Smirnov tests show nonsignificant deviations (P-value >0.05) from uniform distributions for all nine retrotransposon families (with five or more observed complete copies in at least one genome), with the exception of the Jeli (Gypsy) elements in the B genome and the Angela (Copia) elements in the A genome, which have more recently proliferated (Figure 3B). Separate analysis for the most recent period, corresponding to half of the complete copies, shows that, as expected from the superfamily-level analysis, the Wham family in the A genome and the Sabrina family in the B genome have not recently proliferated (P-value <0.05, Figure 3B).
Distribution of insertion dates of the Wham and Sabrina families is different from almost all the other seven families within and between the B and A genomes (P-value <0.05). Distribution of insertion dates of the Angela family in the wheat A genome is statistically different (P-value <0.05) from that of the Fatima family in both genomes. Distributions of insertion dates of the remaining families do not show statistical differences (P-value >0.05) within and between the wheat B and A genomes (Figure 3B).
Moreover, some retrotransposon families were abundant and present in several complete copies in only one genome (Romani, Daniela, Erika, and Wham for the A genome; Egug and Jeli for the B genome) but absent or presenting few copies in the other (Figure 3B). It is likely that this corresponds to differential proliferation of the considered retrotransposons, as different copies were detected in different genomic regions of wheat B or A genomes.
LTR–retrotransposon proliferation was neither enhanced nor repressed by the allotetraploidization event:
The allotetraploidization event that brought the B and A genomes of wheat together in one nucleus was estimated to have occurred no more than 0.5–0.6 MYA (Huang et al. 2002; Dvorak et al. 2006; Chalupska et al. 2008). This corresponds to a divergence interval of 0.013–0.016, using the corrected rate of 1.3 × 10⁻⁸ substitutions/site/year for more rapid divergence of LTRs (Ma et al. 2004; Ma and Bennetzen 2004; Dvorak et al. 2006).
Comparisons show that retrotransposon insertions continued in wheat B and A genomes during the last 0.5–0.6 million years, apparently without being either enhanced or repressed by the allotetraploidization event (Figure 3). For example, analysis of genomic sequences available from the three ploidy levels of the A genome does not show differences in proliferation periods and rates of retrotransposons (Figure 3).
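The correspondence between dates and LTR divergence quoted above is consistent with the usual relation T = K/(2r), in which the two LTRs of an element accumulate substitutions independently after insertion. A minimal sketch, assuming that relation (the function names are illustrative, not taken from the study):

```python
LTR_RATE = 1.3e-8  # substitutions/site/year, the corrected LTR rate quoted in the text

def years_to_divergence(years, rate=LTR_RATE):
    """Expected divergence K between the two LTRs after `years`: K = 2 * r * T."""
    return 2.0 * rate * years

def divergence_to_years(k, rate=LTR_RATE):
    """Estimated insertion date from LTR divergence K: T = K / (2 * r)."""
    return k / (2.0 * rate)

lower = years_to_divergence(0.5e6)  # 0.013
upper = years_to_divergence(0.6e6)  # 0.0156, i.e. 0.016 to three decimals
```

The two bounds reproduce the 0.013–0.016 divergence interval given for 0.5–0.6 MYA.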
To check the accuracy of these observations and to calibrate the divergence rate used for coding sequences, on one hand, and that used for LTRs of retrotransposons, on the other hand, we traced several retrotransposons for their insertion prior or posterior to the allopolyploidization event. A PCR-based tracing strategy, derived from the retrotransposon-based insertion polymorphism method (Flavell et al. 1998; Devos et al. 2005; Paux et al. 2006), was developed for 21 retrotransposon insertions from the B genome, sampled as having different estimated insertion dates (Figure 3, indicated by gray triangles). It simply relies on primers designed in both the retrotransposon and its flanking sequences (either unassigned DNA or an older preinserted TE sequence) so that PCR amplification will be specific to the retrotransposon insertion. As the diploid wheat species donor of the B genome is unknown (Feldman et al. 1995; Blake et al. 1999; Huang et al. 2002), we analyzed the occurrence (i.e., presence or absence) of the 21 retrotransposon insertions in hexaploid (T. aestivum) and tetraploid (T. turgidum) wheat genotypes, which carry the wheat B genome. Examples of PCR-based tracing of the 21 original retrotransposon insertions in the wheat genotypes compared with their estimated insertion dates (±SE) are presented in Figure 4. Full tracing results are supplied in supplemental Table 7 and sequences of the PCR primers in supplemental Table 1. With the exception of Jeli_TA3B95C9-1, all the other 7 most recently inserted retrotransposons, which have calculated insertion date intervals (means ±SE) equal to or less than the 0.5–0.6 MYA interval (divergence 0.013–0.016), were detected in some but not all genotypes carrying the B genome, suggesting their occurrence after the tetraploidization event (Figure 4 and supplemental Table 7). 
In contrast, all 13 retrotransposon insertions, which have calculated insertion intervals (means ±SE) >0.7 MYA, were detected in all tested genotypes carrying the B genome, suggesting their occurrence prior to the allotetraploidization event (Figure 4 and supplemental Table 7). Given the uncertainty in calculating intervals of insertion dates, the PCR-based tracing method confirms the calibration of LTR divergence on that of gene divergence. More importantly, it also confirms that retrotranspositions (insertions) were not enhanced or repressed by the allotetraploidization event.
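The interpretation of the tracing results can be summarized as a simple decision rule. The sketch below is only an illustration of that reasoning; the function name, the genotype labels, and the wording of the returned calls are hypothetical.

```python
def classify_insertion(presence):
    """Interpret PCR presence/absence of one insertion across B-genome genotypes.

    presence: dict mapping a genotype label to True (insertion-specific
    product amplified) or False (no product). Labels are hypothetical.
    """
    calls = list(presence.values())
    if not calls:
        raise ValueError("no genotypes tested")
    if all(calls):
        # Detected in every genotype: suggests insertion before the event.
        return "before allotetraploidization"
    if any(calls):
        # Polymorphic among genotypes: suggests insertion after the event.
        return "after allotetraploidization"
    return "not detected"

old_call = classify_insertion({"tetraploid_1": True, "hexaploid_1": True})
recent_call = classify_insertion({"tetraploid_1": False, "hexaploid_1": True})
```

A call obtained this way is only as reliable as the genotype sample, which is why it is compared with the estimated insertion dates rather than used alone.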
Relative proliferation periods of the CACTA class II transposable elements:
The CACTA class II DNA TEs represent an important proportion of the B- and A-genome sequences (13.4 and 9.4%, respectively). As for the main LTR–retrotransposon superfamilies, ratios of complete to truncated copies are very different for B (0.37) and A (1.52) genomes (Table 1). In contrast to LTR retrotransposons, the CACTA TEs do not have long repeats or other features, which would allow determination of their insertion dates on the basis of sequence divergence. Therefore, their proliferation periods and rates were evaluated indirectly, relative to their level of insertions into or by other CACTA TEs and, more importantly, by elements of the three main LTR–retrotransposon superfamilies for which proliferation periods and rates were evaluated on the basis of the dates of insertions (described above). This was calculated for all CACTA TE copies as well as for complete and truncated copies separately (Table 2).
In the wheat B genome, the majority of CACTA TE insertions (mainly those detected as truncated copies) occurred in DNA annotated as unassigned (Table 2). For the rest, significantly higher insertions of CACTA TEs into Athila and other CACTA TEs than into Copia and Gypsy retrotransposons were observed. The two latter retrotransposon superfamilies were significantly more inserted into, rather than by, CACTA TEs (Table 2). These observations indicate that proliferation of the CACTA TEs in the B genome of wheat started before, and continued during and after Athila retrotransposon proliferation, whereas very few insertions occurred during the last waves of high proliferation of Copia and Gypsy.
Similarly, a high level of insertions into unassigned DNA was observed for the CACTA TEs in the A genome. However, for the remaining insertions, no clear period of proliferation could be determined as these show similar levels of insertions into or by all other TE superfamilies (Table 2). These observations, combined with the observed higher level of complete copies (Table 1), suggest that the CACTA TE proliferation continued in the wheat A genome during the last waves of proliferation of Copia and Gypsy, unlike those in the B genome.
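The relative-dating logic used for the CACTA TEs rests on counting nested insertions in both directions. A minimal sketch of such a tally, assuming each nesting event is recorded as an (inserted element, host element) pair of superfamily names; the function name and the toy events are hypothetical and do not reproduce Table 2.

```python
from collections import Counter

def tally_nesting(events, focal="CACTA"):
    """Count nesting events involving the focal superfamily.

    events: iterable of (inserted, host) superfamily-name pairs.
    Returns (into, by):
      into[host]   = focal elements found inserted into `host`
      by[inserted] = `inserted` elements found inserted into a focal element
    """
    into, by = Counter(), Counter()
    for inserted, host in events:
        if inserted == focal:
            into[host] += 1
        if host == focal:
            by[inserted] += 1
    return into, by

# Toy events only.
events = [("CACTA", "Athila"), ("CACTA", "Athila"), ("CACTA", "unassigned"),
          ("Copia", "CACTA"), ("Gypsy", "CACTA"), ("Gypsy", "CACTA")]
into, by = tally_nesting(events)
```

Many CACTA insertions into a superfamily suggest CACTA activity after that superfamily had proliferated, whereas many insertions of a superfamily into CACTA TEs suggest that the superfamily proliferated later.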
To constitute representative genomic sequences of the wheat B genome, in this study we have sequenced 10 BAC clones of chromosome 3B, representing the largest number of genomic regions sequenced for a single wheat chromosome and a cumulative sequence length of 1.429 Mb (0.15% of the chromosome length). As expected, TE proliferation was pronounced (TEs representing 79.1% of the sequence). Five of the BAC clones contain genes, at a density of one or two genes per clone; two other BAC clones contain gene relics or pseudogenes, whereas the three remaining BAC clones contain no genes. This confirms the previous conclusion about the more random distribution of genes on the wheat genome (Devos et al. 2005). Interestingly and in comparison with rice, a high level of “truncated genes” was revealed [six gene relics or pseudogenes, several of which resulted from TE insertions (three confirmed cases)]. If the confirmed gene number (excluding hypothetical genes) identified in the 1.43-Mb sequences (eight) is extrapolated to the whole wheat chromosome 3B of 1 Gb estimated size, then 5594 genes might be present. A slightly higher number (6000) was calculated from BAC-end sequence analysis (Paux et al. 2006).
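The extrapolation is a simple proportionality, which assumes that the gene density of the sampled regions holds for the whole chromosome. A minimal sketch, taking 1 Gb as 1000 Mb (the function name is illustrative):

```python
def extrapolate_gene_count(genes_found, sampled_mb, chromosome_mb):
    """Scale the observed gene count to the chromosome, assuming uniform density."""
    return genes_found / sampled_mb * chromosome_mb

# Eight confirmed genes in 1.43 Mb, chromosome 3B taken as 1000 Mb.
estimate = extrapolate_gene_count(8, 1.43, 1000)
```

Rounded to the nearest gene, `estimate` gives the 5594 genes quoted above.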
Representation of transposable elements:
In this study, TE dynamics, proliferation, and evolutionary pathways were analyzed and compared in 1.98 Mb of sequence from 13 BAC clones of the wheat B genome and 3.63 Mb of sequence from 19 BAC clones of the wheat A genome. These genomic sequences represent very small fractions (<0.03%) of their respective genomes. Nevertheless, it has been argued that, for studying abundant repeats, sequencing and annotation of a small proportion of the genome can be representative (Brenner et al. 1993; Vitte and Bennetzen 2006; Liu et al. 2007). We have been able to confirm the adequate representation where less variation in the proportion of the main TE superfamilies was observed when analyzing a large number of BAC clones (Figure 2). Interestingly, TE proportions observed in the 13 genomic regions of the B genome of wheat are similar to those obtained from 11 Mb of BAC-end sequences of wheat chromosome 3B (Paux et al. 2006). Similarly, TE proportions were not significantly different for the wheat A genome when they were compared with the different ploidy levels (see results).
Although they are representative of abundant wheat TEs available in the TREP database (Wicker et al. 2002; http://wheat.pw.usda.gov/ITMI/Repeats), the class I and class II TEs observed in the genomic sequences of the wheat B and A genomes may not cover all wheat TEs. It is expected that more wheat TEs will be identified as more wheat genomic sequences become available. This is particularly supported by the identification in this study of >21 different novel TE families, most of which (17) are retrotransposons. We also believe that low-copy TEs and those that tend to “compartmentalize” in specific regions, such as pericentromeric heterochromatin regions (which is not the case in our regions), would be missed, over-, or underrepresented in this study (Ma and Bennetzen 2006; Liu et al. 2007). This could be the case for the CACTA TEs, which show the highest variation in sequence proportion between regions because they tend to be clustered in the Triticeae genomes (our unpublished results and Wicker et al. 2003a, 2005).
Transposable elements proliferated differentially in the B and A genomes of wheat:
Abundance of TEs varies widely across different organisms. Human (Homo sapiens) DNA is composed of 45% repetitive sequences (Lander et al. 2001), Drosophila melanogaster DNA of 3.9% (Kaminker et al. 2002), and maize DNA of 67% (Haberer et al. 2005; Liu et al. 2007), whereas TE content in the wheat genomic sequences analyzed in this study or in other studies (Li et al. 2004; Gu et al. 2006; Paux et al. 2006) is ∼80%. Proportions of different classes of TEs also vary among organisms. Class II TEs are more than 10 times less abundant than class I TEs and constitute a small fraction (<2%) of the human, rice (Piegu et al. 2006), maize (Kronmiller and Wise 2008), Arabidopsis, and cotton (Hawkins et al. 2006) genomes. In comparison, class II TE abundance is relatively high in the wheat B and A genomes (14.1 and 9.9%, respectively), the majority of which (95%) are CACTA TEs, which are particularly abundant in the Triticeae genomes (Wicker et al. 2003a, 2005). Class I retrotransposon abundance is relatively high in several plant genomes: 58.7 and 56.6% estimated in this study for the wheat B and A genomes, respectively; 40–50% in cotton species (Hawkins et al. 2006); 35–60% in rice species (Piegu et al. 2006); and 64% in maize (Liu et al. 2007; Kronmiller and Wise 2008).
In this study, the combination of TE sequence analysis and classification, comparison of proportions of complete to incomplete copies, TE insertion date estimations, and PCR-based tracing of insertions allows us to compare TE proliferation periods and rates in the wheat B and A genomes (Figure 5). TEs appear to proliferate differentially in waves of high activity followed by periods of low activity (Figure 5). Both genomes show similar rates and relatively old proliferation periods for the Athila retrotransposons (Figure 5). However, the Copia retrotransposons have proliferated relatively more recently in the A genome whereas a more recent Gypsy proliferation is observed in the B genome. Due to their biology and replication mechanism, it was not possible to directly estimate the CACTA class II TE insertion dates. We have estimated their proliferation periods and rates relative to those of the three main LTR retrotransposon superfamilies. In the wheat B genome, the CACTA TE high proliferation period started before and overlaps with that of the Athila retrotransposons. In the wheat A genome, in addition to the relatively old proliferation similar to that in the B genome, CACTA TEs continued to proliferate during the same period as Gypsy and Copia retrotransposons. Determining the ancient proliferation periods of CACTA TEs partially explains why CACTA TEs often tend to be clustered together (see results and Wicker et al. 2003a, 2005), although they were detected in almost all analyzed BAC clones. Differential proliferation of TEs provides a valid explanation for the size variation of closely related wheat genomes (Bennett and Smith 1976, 1991; http://data.kew.org/cvalues/homepage.html).
Four families (Angela, Wis, Sabrina, Fatima) were abundant, representing the majority of LTR retrotransposons in the B and A genomes of wheat, some of which proliferated differentially (see results). Proliferation of specific types of TEs in specific genomes (or species), leading to rapid genome size variation and sequence divergence, has also been observed in other plant species. Analysis of maize (Zea mays) genomic sequences suggests that the high percentage of LTR retrotransposons is due to proliferation of only a few families of TEs (Meyers et al. 2001; Liu et al. 2007; Kronmiller and Wise 2008). Similarly, comparison of TE proportions between various cotton species (Gossypium species) revealed differential lineage-specific expansion of various LTR–retrotransposon superfamilies and families, leading to threefold genome size differences (Hawkins et al. 2006). Species-specific differential retrotransposon expansions are also the cause of the size doubling of the Oryza australiensis genome as compared to cultivated rice (O. sativa) (Piegu et al. 2006).
This is the first time that dynamics as well as proliferation periods and rates of TEs have been compared between two closely related wheat genomes. This was possible only because in this study we sequenced 10 different genomic regions that constituted a genomic sequence data set representative of the wheat B genome. For the wheat A genome, more representative genomic sequence data were rendered publicly available. There have been initial attempts to evaluate TE proliferation in the wheat genomes. Li et al. (2004) analyzed the D genome of the diploid Ae. tauschii and showed that the copy number of most TEs has increased gradually following polyploidization. However, they used dot blots, which are not very accurate. Sabot et al. (2005) have updated TE annotation in wheat genomic sequences and reported their composition and distribution in relation to genes. They suggested that Copia TEs have been most active in the wheat A, B, and D genomes, combined together (Sabot et al. 2005). Accurate comparison of dynamics as well as proliferation periods and rates between individual genomes of wheat could not be conducted in the study of Sabot et al. (2005) as, in the genomic sequences available at that time, the A genome was overrepresented whereas the B genome was underrepresented. By using more representative genomic sequences in this study, we showed the more recent activation of the Copia and CACTA TEs in the wheat A genome but not in the B genome, in which a more recent Gypsy proliferation is observed. Overrepresentation of the A-genome sequences in the study of Sabot et al. (2005) may explain why they found that Copia TEs have been most active in the wheat A, B, and D genomes combined together. Thus our analysis, using representative sequence data sets, for the first time shows differential proliferation of TEs between the wheat A and B genomes and illustrates the inadequacy of combining sequence data sets from different genomes as was previously done.
Neither enhancement nor repression of transposable element proliferation following allotetraploidization:
As estimated from their insertion dates and confirmed by PCR-based tracing analysis, the majority of the differential proliferation of TEs in B and A genomes of wheat (87 and 83%, respectively) occurred prior to the allotetraploidization event that brought them together in T. turgidum and T. aestivum <0.5 MYA (Huang et al. 2002; Dvorak et al. 2006; Chalupska et al. 2008). More importantly, the allotetraploidization event appears to have neither enhanced nor repressed retrotranspositions. We suggest that, in addition to the Ph1 gene preventing homeologous recombination (Griffiths et al. 2006), differential proliferation of TEs has also contributed to the rapid divergence of the B and A genomes of the wheat diploid progenitors and the relative stability of the natural wheat allopolyploids that arose thereafter.
Different levels of stability, estimated as elimination of DNA sequences, were observed in newly synthesized wheat allopolyploids, depending on wheat genome combinations (Feldman and Levy 2005 and our unpublished results). The natural wheat allopolyploids combining the B and A genomes are relatively stable and cannot be exactly resynthesized because the diploid progenitor of the B genome is unidentified (Feldman et al. 1995; Blake et al. 1999; Huang et al. 2002; Dvorak et al. 2006). Nevertheless, by studying a synthetic wheat allotetraploid combining the A and S genomes (the closest identified diploid relatives to the progenitors of the A and of the B genomes of natural wheat polyploids), Kashkush et al. (2003) reported on transcriptional activation of the Wis LTR retrotransposon but not its transposition following allotetraploidization. This is in agreement with the lack of enhancement of transpositions observed in this study in wheat natural allopolyploids combining the A and B genomes. Comparatively, less TE proliferation, estimated as the increased rate of deletions and the decreased rate of insertions, was recently observed in the cotton polyploid species Gossypium hirsutum as compared to its diploid progenitors Gossypium arboreum and Gossypium raimondii (Grover et al. 2008).
Apparent transposable element proliferation as a balance between two evolutionary forces, TE “transposition” and TE removal:
As in this study, the vast majority of complete retrotransposons studied so far were also estimated to be <3 million years old (SanMiguel et al. 1998, 2002; Wicker et al. 2003b, 2005; Gao et al. 2004; Ma et al. 2004; Du et al. 2006; Piegu et al. 2006; Wicker and Keller 2007). These findings imply that there are mechanisms of active deletion of LTR retrotransposons from the genome, such as unequal homologous recombination and illegitimate recombination (Vicient et al. 1999; Devos et al. 2002; Ma et al. 2004; Pereira 2004). Proliferation periods and rates estimated for TEs at a given evolutionary period are the result of two antagonistic evolutionary forces: TE insertion activity (transpositions) (Bennetzen and Kellogg 1997) and the removal of TEs (Petrov et al. 2000; Petrov 2002a). Thus, it is not clear whether the insertion and/or truncation (removal) rates of TEs are constant or vary during genome evolution. The “burst of insertions” described for TEs could correspond to periods of (i) high insertion activity, (ii) low rates of TE removal, and/or (iii) combinations of both evolutionary forces.
The fact that Copia retrotransposons have been active until recently in the Arabidopsis thaliana genome allowed Pereira (2004) to calculate the rate of their elimination (or half-life) as 472,000 years, outside of centromeric regions. Using this method and assuming that repetitive sequences are removed from the genome at a constant rate, a higher half-life (790,000 years) was calculated for Copia removal in rice (Wicker and Keller 2007). As the insertion-date distribution of Copia retrotransposons in Triticeae (wheat and barley) is not exponential, Wicker and Keller (2007) suggested that their half-life is much longer than in rice, thus representing a major difference between small and large genomes of plants. Similar distributions are observed in our study for all three retrotransposon superfamilies in both B and A genomes of wheat. Our analysis suggests that lower proliferation of the LTR retrotransposons during the most recent period could account for these apparent nonexponential distributions of insertion dates (including Copia retrotransposons) (Figure 5).
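The half-life reasoning can be made concrete with a constant-rate removal model, under which the fraction of copies surviving to a given age decays exponentially; with constant insertion as well, surviving copies would then show an exponential distribution of insertion dates. A minimal sketch under that assumption (the function name is illustrative):

```python
def fraction_remaining(age_years, half_life_years):
    """Fraction of copies of a given age still present, assuming constant-rate removal."""
    return 0.5 ** (age_years / half_life_years)

ARABIDOPSIS_HALF_LIFE = 472_000  # years, the value attributed to Pereira (2004) in the text

at_half_life = fraction_remaining(472_000, ARABIDOPSIS_HALF_LIFE)  # 0.5
at_3_my = fraction_remaining(3_000_000, ARABIDOPSIS_HALF_LIFE)     # about 0.012
```

Under this model only about 1% of copies inserted 3 million years ago would survive with such a half-life, so a deficit of very recent copies, as suggested above, is one way in which an observed distribution can depart from the exponential expectation.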
Our study clearly shows that, during their evolution, specific types of TEs have undergone differential proliferation in specific wheat genomes (or species) but not in others, leading to rapid sequence divergence. Little is known about the mechanistic causes that lead to differential proliferation of a single or related group of TEs across the genome of a specific species. These rapid TE expansions could correspond to periods of relaxed selection pressure such as genome duplication, interspecific hybridizations (although this was not revealed in our study), or stress conditions. It is also possible that TE proliferation could be caused by advantageous mutations in the TE sequence. A third alternative is differential deregulation of epigenetic silencing that allows specific TE families to proliferate in specific genomes.
We sincerely thank J. Dolezel and M. Kubalakova (Institute of Experimental Botany, Olomouc, Czech Republic) for providing FISH mapping information for BAC clones B95G2, B95C9, B63B7, and B54F7; Joseph Jahier [Institut National de la Recherche Agronomique (INRA), Rennes, France] and Moshe Feldman (Weizmann Institute of Science) for valuable discussions and for providing wheat genotypes; Catherine Feuillet (INRA, Clermont-Ferrand, France) for providing the wheat deletion lines; Thomas Wicker (Zurich University) for valuable advice on novel transposable element classifications and CACTA TE evolution; Piotr Gornicki (University of Chicago) and anonymous reviewers for valuable discussion and constructive criticisms; and Heather McKhann (Centre National de Génotypage, Etude du Polymorphisme Génomique Vegetal-INRA, Evry, France) for valuable discussion and revision of the manuscript. This project was supported by the National Center for Sequencing (Centre National de Séquençage-Génoscope)/APCNS2003-Project: Triticum species comparative genome sequencing in wheat (http://www.genoscope.cns.fr/externe/English/). PCR-based tracing of retrotransposons insertions was funded by the Agence Nationale pour la Recherche Biodiversité Project (ANR-05-BDIV-015) and the ANR-05-Blanc project-ITEGE.
- Received June 6, 2008.
- Accepted August 7, 2008.
- Copyright © 2008 by the Genetics Society of America