Altered Patterns of Fractionation and Exon Deletions in Brassica rapa Support a Two-Step Model of Paleohexaploidy
Haibao Tang, Margaret R. Woodhouse, Feng Cheng, James C. Schnable, Brent S. Pedersen, Gavin Conant, Xiaowu Wang, Michael Freeling, J. Chris Pires


The genome sequence of the paleohexaploid Brassica rapa shows that fractionation is biased among the three subgenomes and that the least fractionated subgenome has approximately twice as many orthologs as its close (and relatively unduplicated) relative Arabidopsis than had either of the other two subgenomes. One evolutionary scenario is that the two subgenomes with heavy gene losses (I and II) were in the same nucleus for a longer period of time than the third subgenome (III) with the fewest gene losses. This “two-step” hypothesis is essentially the same as that proposed previously for the eudicot paleohexaploidy; however, the more recent nature of the B. rapa paleohexaploidy makes this model more testable. We found that subgenome II suffered recent small deletions within exons more frequently than subgenome I, as would be expected if the genes in subgenome I had already been near maximally fractionated before subgenome III was introduced. We observed that some sequences, before these deletions, were flanked by short direct repeats, a unique signature of intrachromosomal illegitimate recombination. We also found, through simulations, that short—single or two-gene—deletions appear to dominate the fractionation patterns in B. rapa. We conclude that the observed patterns of the triplicated regions in the Brassica genome are best explained by a two-step fractionation model. The triplication and subsequent mode of fractionation could influence the potential to generate morphological diversity—a hallmark of the Brassica genus.

ANCIENT polyploidies are prevalent in most eukaryotic lineages, including plants (Van De Peer et al. 2009; Jiao et al. 2011; Proost et al. 2011), fungi (Kellis et al. 2004), and animals (Jaillon et al. 2004; Aury et al. 2006). Much progress has been made in dating these evolutionary events and quantifying the retention and loss of gene duplicates after them. Gene content influences the potential for diversification and specialization of biological functions (Force et al. 1999) and the potential for increases in morphological complexity (Thomas et al. 2006). In both the eudicot and the monocot clades of flowering plants, there have been multiple rounds of polyploidy followed by selective gene losses (Tang et al. 2008, 2010), leaving the gene repertoire of many angiosperm species greatly expanded from an estimated ancestral (i.e., in the last common ancestor) gene number of 12,000–14,000 loci (Sterck et al. 2007; Tang et al. 2008).

Despite the initial expansion of gene numbers immediately following genome duplications, most lineages have since experienced drastic gene loss, genome downsizing (Bennett and Leitch 2005; Leitch and Leitch 2008), and ultimately genetic “diploidization” at many loci (Wolfe 2001). A number of mechanisms could lead to the diploidization, among which the “fractionation” of duplicate genes is a major force (Langham et al. 2004; Thomas et al. 2006). During the fractionation process, many gene copies with redundant functions and with product levels not under stringent control [“gene dosage” theory (Birchler and Veitia 2010)] tend to be lost, resulting in a reduction of gene complement that offsets the initial expansion from genome mergers. In the paleotetraploid maize, using the sorghum genome as an outgroup, the fractionation mechanism was shown to be predominantly short deletions, probably via intrachromosomal recombination, and is certainly not randomization by nucleotide substitutions (Woodhouse et al. 2010). By whatever mechanism, the initially near-identical subgenomes generated by whole-genome duplication events do not fractionate equally—one subgenome consistently has more genes retained on it than the other; this holds true for eukaryotes ranging from Paramecium to flowering plants and fish (Sankoff et al. 2010). This phenomenon, called “fractionation bias,” was first described in the Arabidopsis genome (Thomas et al. 2006) and later generalized throughout major eukaryote lineages with paleopolyploidies (Sankoff et al. 2010).

In plants, if not all eukaryotes, when two genomes find themselves in the same nucleus, one subgenome—as defined by fractionation bias—expresses its genes to a higher mRNA level than does the other subgenome. This is the phenomenon of genome dominance (Schnable et al. 2011). Since the 12-million-year-old maize paleotetraploid displays substantial genome dominance (Schnable et al. 2011), the result that Brassica rapa subgenome III expresses its genes to a higher level than does either subgenome I or II was more affirming than surprising (Wang et al. 2011). However, genome dominance is most evident when tetraploidy was recent: in synthetic and natural hybrids and allotetraploids of cotton (Flagel and Wendel 2010), wide hybrids of Arabidopsis species (Wang et al. 2006; Chang et al. 2010), allotetraploids of Tragopogon species (Buggs et al. 2010a,b), and synthesized Brassica lines (Gaeta et al. 2007; Xiong et al. 2011). It is not yet fully understood why genome dominance persists until today after tens of millions of years of evolution.

The diploid Brassica species were first hypothesized to have been triplicated on the basis of comparative mapping studies (Lagercrantz and Lydiate 1996; Lagercrantz 1998; Parkin et al. 2003, 2005). There was some skepticism on the basis of the observation that most loci were not triplicated; however, subsequent BAC-FISH (Lysak et al. 2005) and comparative BAC sequencing studies (Yang et al. 2006) further supported the triplication hypothesis. The recent sequencing of B. rapa has confirmed the genome triplication event that occurred in the common ancestor of all Brassica species (Wang et al. 2011).

It was demonstrated that B. rapa underwent biased fractionation—subgenome III has retained almost two-thirds of Arabidopsis thaliana orthologous genes, while subgenomes I and II have retained significantly fewer genes (Wang et al. 2011). On the basis of biased fractionation results much like those in B. rapa, the eudicot paleohexaploidy, known as the gamma event, was proposed to have happened by a two-step fractionation process (Lyons et al. 2008). Fortunately, the relatively recent paleohexaploidy in B. rapa and the position of the Arabidopsis genome as an outgroup provide a phylogenetic system with superior analytical power. The two-step fractionation hypothesis was suggested for Brassica’s biased fractionation, to explain the fact that subgenome I is the most fractionated genome and subgenome III the least fractionated genome (Wang et al. 2011). However, this hypothesis was not formally tested.

Herein, we test the “two-step fractionation” hypothesis by examining short, exonic deletions in retained Brassica genes, using Arabidopsis as the outgroup. Such deletions were associated with recent, ongoing biased fractionation in maize (Woodhouse et al. 2010). We found that subgenome II had more deletions than subgenome I or subgenome III, suggesting that a two-step process of genome fractionation did indeed occur. We also show that deletions tend to accumulate in multicopy retained genes rather than in genes retained as a single copy, a phenomenon best explained by relaxed selection in duplicate genes.


Partitioning of subgenomes according to number of retained genes

The identification of orthologous regions and partitioning into subgenomes follow the method described in supporting information, File S1, in the B. rapa release (Wang et al. 2011). Briefly, multiple chromosomal segments in B. rapa that are orthologous to the same A. thaliana segment are numbered accordingly, using the established “A to X” numbering system (Wang et al. 2011). All B. rapa segments that match to the same A. thaliana segment are partitioned into three subgenomes (for example, segments matching A. thaliana segment R are partitioned into R-I, R-II, and R-III) (Figure 1A). We exhaustively enumerated all partitions and evaluated each partition on the basis of heuristic rules that were detailed in Wang et al. (2011). After the partitioning, we counted the number of syntenic orthologs within each subgenome. According to the number of retained orthologous genes in each subgenome, each segment was classified and named I, II, and III for “most fractionated,” “moderately fractionated,” and “least fractionated,” respectively, for each A to X segment (Figure 1B). We then examined the number of orthologous genes in each block; nearly all blocks showed a significant difference (with P-value cutoff = 0.01) in gene numbers between the three subgenomes, with the only exception being block T. Finally, we concatenated each set of most fractionated (A-I, B-I . . . , to X-I), moderately fractionated (A-II, B-II . . . , to X-II) and least fractionated blocks (A-III, B-III . . . , to X-III), respectively, for downstream analyses.

Figure 1 

(A) Dot plot between B. rapa and A. thaliana, with B. rapa segments that are derived from the same A. thaliana origin grouped together, to illustrate the partitioning and test of nonrandom fractionation among B. rapa triplicated regions. This shows only one of the 24 sets of blocks (block R). The table under the dot plot contains the counts of A. thalianaB. rapa orthologs in the respective subgenomes. Gene losses are not equally distributed in most of the duplicated blocks, as tested by a χ2-test (P = 1 × 10−28 in the case of block R). (B) The partitioning of B. rapa chromosomes into three inferred subgenomes following the partitioning algorithm in Wang et al. (2011).

Determining the sequence divergence among B. rapa homeologs

For paired genes inferred from syntenic alignments, we aligned the protein sequences using CLUSTALW (Larkin et al. 2007) and used the protein alignments to guide coding sequence alignments by PAL2NAL (Suyama et al. 2006). To calculate Ks, we used the Nei–Gojobori method implemented in the yn00 program in the PAML package (Yang 2007). A Python script was used to create a pipeline for all the calculations and is available at The actual distribution of Ks values is modeled and fitted as a log-transformed normal distribution (Tang et al. 2008).

Automated cataloging of internal deletion sites within the B. rapa genes

The sites of deletions were identified by an automated pipeline, illustrated in Figure 2. Using the A. thaliana orthologs as reference, we aligned one, two, or three homeologs in B. rapa. For each of 3648 (A. thaliana, B. rapa) pairs, we detected deletions of various sizes in the B. rapa gene compared to the A. thaliana gene. The DNA sequences of the complete genes (containing all exons and introns) were extracted for the BLASTN comparisons. For each gene pair, we used BLASTN with parameters favoring short, strong sequence matches (word size 7, spike length 15 bp, low-complexity filter off). We identified all collinear high-scoring segment pairs (HSP) through the “heaviest increasing subsequence” algorithm (Kurtz et al. 2004). There are unmatching sequences (gaps) between adjacent HSPs. For each gap pair, we noted the size in A. thaliana, as well as in B. rapa, and identified all the sites that were smaller in the B. rapa genes. Links to the GEvo (Lyons and Freeling 2008) URL were configured in the spreadsheet to assist manual proofing.

Figure 2 

The pipeline for automated deletion discovery. We first compared between A. thaliana and B. rapa orthologous genes using BLASTN. From the initial BLASTN HSPs, we computed a set of collinear HSPs. The unmatching regions in A. thaliana and B. rapa are compared in a pairwise fashion, recording the sizes of the corresponding gaps in A. thaliana and B. rapa, in a notation of “Bite (AB)”. We selected only the deletions that have A > 30 bp and B < 10 bp, to screen for substantial downsizing in the B. rapa sequence. As examples, the bites in black color are selected on the basis of these criteria whereas the gray ones are ignored.

The changes in the sizes of the sequences are documented in the notation “Bite (AB)”, which means there are A bases in A. thaliana, but B bases in B. rapa (Figure 2). For example, “Bite (81 → 0)” means 81 A. thaliana bases were removed in the B. rapa gene. A very useful effect of this notation is that it is also possible for “B” to be negative. For example, “Bite (82 → −7)” means 82 A. thaliana bases were removed and adjacent HSPs overlap by 7 bases. This is an indication of 7 bases of flanking direct repeats (as proposed in Woodhouse et al. 2010). We applied cutoffs of A > 30 bp and B < 10 bp, to select DNA chunks that decreased in size from A. thaliana to B. rapa. Exonic deletions were further identified for the deletion locations that intersect A. thaliana exon locations, using the tool INTERSECTBED (Quinlan and Hall 2010).

The full catalog containing a total of 4539 deletion sites along with their locations, gene identifiers, deleted bases, and GEvo links is available in File S1.

Simulation of deletions of homeologs and likelihood-ratio test for model selection

On the basis of the initial hypothesis of a deletion mechanism that independently eliminates one gene at a time, a simulation of gene loss was carried out. Starting with a length equal to the number of all genes, genes were deleted at random until the simulated number of deletions was equal to the true observed number. The distribution of apparent deletion lengths for the run was then saved, and the preceding steps were repeated 1000 times. This gives a distribution of deletion lengths.

A genetic algorithm (GA) using 20 character states, each representing a deletion length of various lengths, was used to determine, given the region length and the distribution of observed deletion lengths, the most likely deletion model to achieve the best match between simulated and observed data. The fitness values of solutions in the genetic algorithm were scored after each step with the fittest solutions being those where the simulated number of deletion runs was least different from the observed number of runs. The components for our deletion size simulation include the following:

  • Simulate under “model 1 (with only deletion size of one gene)” and then report counts for various deletion sizes.

  • Simulate under “model 1+2 (with deletion size up to two genes)” and then report counts for various deletion sizes.

  • Continue the simulation. Add one more deletion size for each new model.

  • Likelihood-ratio test to see which model gives the best likelihood while keeping the model as simple as possible (based on Occam’s razor): The likelihood function is defined as lnL=iCilnpi, where Ci is the simulated count and pi is the actual frequency of deletion size i.

Scripts that perform the simulations and likelihood calculation are available at


One subgenome has retained significantly more genes than the other two

As noted by Wang et al. (2011), subgenomes I, II, and III have retained 5966, 7679, and 11,536 genes, respectively (ignoring genes that do not show conserved synteny with A. thaliana, e.g., those that are unique to B. rapa or have transposed) (Table 1). We find a similar trend in number of nucleotides per subgenome and number of genes per subgenome (both retained and nonretained) (Table 1). The difference in size among all three subgenomes (subgenome I is the smallest and subgenome III the largest) is primarily due to the level of biased fractionation among the three subgenomes. Conversely, whole-gene deletions are 2.1 times more frequent in I than in III (10,423 deletions in subgenome I vs. 4853 deletions in subgenome III). There are also significant differences in numbers of singletons (no whole genome duplicates) in the three subgenomes. The genes that exist only on I, II, and III total 1592, 2449, and 5211, respectively, which suggests that most single-copy genes are retained in the least fractionated subgenome III. In contrast, the differences in gene densities of the three subgenomes are less dramatic than the sheer counts (Table 1). We conclude that (1) the observed gene retention bias cannot be explained by uneven gene density (for example, varied level of heterochromatic vs. euchromatic sequences) and (2) the sequence removal mechanism that has shaped the retention bias did not exclusively target gene-rich regions.

View this table:
Table 1  The number of genes and retained genes in each of the three subgenomes in B. rapa when compared to A. thaliana

Genetic distances to A. thaliana orthologs cannot distinguish among B. rapa subgenomes

The median Ks value between A. thalianaB. rapa orthologs is 0.48, while the median Ks value between B. rapaB. rapa homeologs is 0.37 (Figure 3), supporting the conclusion that Brassica hexaploidy occurred after its divergence from Arabidopsis (Wang et al. 2011). Both A. thalianaB. rapa gene pairs and the B. rapaB. rapa gene pairs show a unimodal peak in the Ks distribution (Figure 3).

Figure 3 

Ks distribution between A. thalianaB. rapa orthologs and B. rapaB. rapa homeologs. Solid lines are the observed distribution, and dashed lines are the fitted distribution based on log-normal distribution (Tang et al. 2008).

We collected A. thaliana genes that were represented in B. rapa by two or three orthologs. For each of these orthologs, we noted their subgenome assignment and determined their Ks value in comparison with their single A. thaliana ortholog. The A. thalianaB. rapa Ks values were compared in a pairwise fashion with the “winner” subgenome inferred (Table 2). The distance between A. thaliana–subgenome III appears to be slightly larger than the distance between A. thaliana–subgenome II (χ2-test, P = 0.004), while the other two pairwise comparisons are not significant at the α = 0.01 level. This suggests that although Ks is able to clearly differentiate between the time of Arabidopsis–Brassica divergence and the hexaploidy, it fails to differentiate the three subgenomes within the hexaploidy.

View this table:
Table 2  “Horse race” Ks comparisons

Additionally, we employed a tree-based method to attempt to differentiate the B. rapa triplets. We used PhyML (Guindon and Gascuel 2003) to construct the phylogeny of the 3 B. rapa genes, using the single A. thaliana gene as an outgroup. We evaluated a total of 1655 trees with B. rapa triplets. A total of 952 (58%) trees had poor branch support (aLRT value ≤ 0.8), suggesting that in most cases, the relationships among the B. rapa triplets are poorly resolved. Even among the trees that have good resolution on the splitting of the triplets, 243 trees have the “(I, II), III” topology, 203 trees have the “(I, III), II” topology, and 257 trees show the “(II, III), I” topology. These counts do not favor a dominant topology (P = 0.04, χ2-test, significance level = 0.01).

In general, our findings are in agreement with previous results (Wang et al. 2011). On the sequence level, all three subgenomes appear equally diverged from A. thaliana, as would be expected if the divergence between the Arabidopsis and Brassica genomes predated the triplication.

Deletions in B. rapa genes through comparison to the A. thaliana ortholog

To understand the mechanism underlying biased whole-gene removals in the three subgenomes, we asked whether there are differences in the rate of sequence removals within the gene sequences. We cataloged a list of sequence removal events on the basis of pairwise comparisons between each B. rapa gene and its A. thaliana orthologs through an automated pipeline. Briefly, we listed the intervening gaps between the adjacent matching regions (HSPs) and checked whether the corresponding gap in B. rapa was substantially smaller than the corresponding gap in A. thaliana (see Methods) (Figure 2). In this study, we focused only on the deletions that are >30 bases. Shorter deletions are likely affected by the artifacts of sequence alignments, so this arbitrary cutoff is a result of our favoring accuracy over sensitivity.

Using our automated deletion discovery pipeline, we identified a total of 4539 deletion sites of 3648 B. rapa genes examined (14.5% of B. rapa genes inspected in this study) (all deletion sites identified are available in File S1). Some B. rapa genes have experienced more than one deletion. Gap sizes ranged from 31 bases (just above the computational cutoff) to 1363 bases, with the size distribution shown in Figure 4. There is an apparent excess of deletion sizes between 70 and 80 bases, in addition to the peak at smaller deletion size ranges.

Figure 4 

Size distribution of the deletions in B. rapa genes (sequences present in A. thaliana but removed in B. rapa) that we cataloged in this study. The distribution stops at 30 since we focus only on the deletions that are >30 bases (a computational cutoff).

Different parts of the genes have experienced different rates of deletion. Deletions in the intronic sequences were approximately eight times more likely than in exonic sequences (1.63% of total intronic bases vs. 0.24% of exonic bases; Table 3). 5-′ or 3-′ untranslated regions (UTRs) have incurred the fewest deletions, even fewer than in exons (0.04% of UTRs vs. 0.24% of exonic bases), suggesting that some UTRs have functional roles and are under strong purifying selection. Inferred deletions that fall within sequences corresponding to Arabidopsis exons are likely to be the most reliable and therefore are the types of deletions we used to investigate the mechanism of biased gene deletion.

View this table:
Table 3  The locations of the deletion sites in A. thaliana genes identified through A. thalianaB. rapa comparisons

Distribution of deletion sites among B. rapa homeologs

We anticipated that exonic deletions would be rarest in genes within subgenome III and that genes in subgenome I would be the most likely to have gaps. Unexpectedly, we found that a higher proportion of genes in subgenome II had deletions vs. those in subgenome I (7.9% vs. 7.1%; Table 4). This is true whether we count the number of deletion sites or count the number of deleted bases. These data track observation of whole-gene fractionation bias among the three subgenomes, in that subgenomes I and II had more numbers of genes with deletions than subgenome III. However, subgenome II still had more genes with exonic deletions than expected, given the overall genome fractionation bias as discussed earlier.

View this table:
Table 4  Number of deletions and exonic deletions within B. rapa genes, grouped on the basis of their subgenome assignments and copy numbers

We also observed differences of deletion frequencies in singlet, doublet, or triplet genes in B. rapa. Singlet genes contain significantly fewer deletions than doublets or triplets. A total of 6.4% vs. 7.1% and 7.3% of the singlet, doublet, and triplet genes contain deletions, respectively (Table 4). This is consistent with different selection regimes on single-copy genes relative to genes with duplicate copies. Single-copy genes are expected to be under stronger purifying selection than genes with duplicate copies that can potentially buffer their functions. We further note that the real differences of the strength of purifying selection on singlet and duplicate genes might be larger than we have observed. The deletions we have counted include the selectively neutral as well as deleterious deletions. Indeed, there is a background rate for neutral deletions, which are expected to be the same between singlet and duplicate genes. This background component in our deletion counts dilutes the signal, reflecting only the purifying selection.

Direct repeats flanking the removed sequences

In maize, sequence deletions flanked by direct repeats (Woodhouse et al. 2010) are associated with the biased fractionation among homeologous regions. In B. rapa, about one-third of exonic deletions were flanked by direct repeats, with length up to 19 bp (see Methods). Only one copy of the two original repeat units remained at the deletion site, probably as a direct result of the deletion mechanism (Figure 5). These data suggest that fractionation via small deletions occurs in B. rapa as it does in maize (Woodhouse et al. 2010) and may be a phenomenon general to plants.

Figure 5 

(A) A GEvo graphic of BLASTN output between orthologous genes in A. thaliana and three B. rapa homeologs. The top panel is a region in A. thaliana and used as the reference, and the following three panels are three B. rapa regions that were derived from the recent hexaploidy event. Arrows represent gene models and colored rectangles show the extents of BLASTN matches (high-scoring sequence pairs, HSPs). The colored rectangles (pink, tan, and brown) represent HSPs or regions with high sequence similarity to each other. A. thaliana is the reference sequence (top panel). As can be seen, a gap is evident (blue arrow) when comparing the HSPs of AT1G68590 and Bra038364 (bottom panel). The deleted sequence (circled in blue) is evident in comparison to other B. rapa homeologs. An overlap between the HSP blocks that flank the deletion can be seen (blue circle); this indicates that the predeleted sequence was flanked by direct repeats. To reproduce this analysis, go to ( (B) ClustalW alignment of the A. thaliana and the three B. rapa sequences from A. The sequences in the blue box in the whole B. rapa homeologs indicate the locations of the direct repeat sequence that originally flanked the deletion in the homeolog containing the deletion (Bra038364). (C) Proposed mechanism for the within-gene deletion via intrachromosomal illegitimate recombination.

Several instances of the flanking repeats are given in Table 5. Some repeats are low-complexity simple sequence repeats (SSRs), e.g., trinucleotide repeats (GAT)n, (TTC)n (Table 5). SSRs have been shown to have high potential for illegitimate recombination of genes (Rocha et al. 2002). Other repeat instances with higher nucleotide complexity are also present. Direct repeats are known to be hotspots of homologous recombination between the repeat units, making the intervening sequences more easily removed (Figure 5C).

View this table:
Table 5  Partial list of instances of internal deletions within B. rapa genes that are flanked by direct repeats

Distribution of transposable-element–related sequences

Any sort of mechanism that removes DNA in the genome could potentially be “induced” by a transposon bloom. Since we are testing the two-step model for paleohexaploid fractionation, a past transposon bloom could affect each subgenome in different ways. The deletion mechanism in plants has been hypothesized as an adaptation to fight “genetic obesity” (Devos et al. 2002).

Identification of the B. rapa interspersed repetitive elements followed published methods (Wang et al. 2011). Elements were categorized into classes, with long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), long terminal repeats (LTRs), and DNA transposons being the largest classes (Table 6). Among all major classes of transposable elements (TEs), the distributions are not biased toward any subgenome, suggesting that the “background” insertion and removal rates are equal across the three subgenomes, at least when viewed as they exist today. None of the most abundant TE families with the large counts (>500 copies) across the genome showed any preference for a single subgenome (P-value cutoff = 0.01, χ2-test).

View this table:
Table 6  The number of major classes of transposable elements in three subgenomes


Two-step genome merger model could explain the retention bias

The two-step model for paleohexaploidy formation and fractionation (Lyons et al. 2008) suggests that two of the genomes came together first, and then the third genome was added some time later (Figure 6). The common way to form a hexaploid is between a diploid (2N) and a tetraploid (4N) cross, resulting in a triploid, which on doubling produces a hexaploid. If this were the case, two subgenomes could be in the same nucleus for a longer period of time (as a viable tetraploid) than the third, which is then relatively less fractionated than the first two. Additional support comes from the gene loss pattern between subgenomes I and II, where low-density regions of one of the two more fractionated genomes are compensated by less loss on the other, which indicates that the tetraploid genome (I + II) could be viable since most genes tend to have at least one copy (Wang et al. 2011).

Figure 6 

The proposed “two-step” model of genome mergers. First, genome I and genome II form a tetraploid, and subsequent addition of genome III forms the hexaploid. Such stepwise genome additions involve shifting roles of the “dominant” genome: in the formation of the tetraploid, subgenome II was the dominant genome, whereas in the hexaploid, subgenome III became the new dominant genome.

We devised an experiment to test for the B. rapa genes that recently underwent fractionation. We reasoned that, if subgenomes I and II had been “at war” for a long time, then perhaps the nondominant genome, I, had already lost nearly all the genes it could lose, so that, when III entered the fray, subgenome II would still have removable genes to be deleted. This is indeed what we observe: subgenome II is the one that has incurred the most exonic deletions, rather than subgenome I.

While the two-step model is the general expectation for the formation of a hexaploid, our model does require fractionation to also have occurred in two distinct steps. There are other alternatives that we have to rule out. For example, the two genome mergers might have occurred sequentially so there was no time to fractionate after the first step. While this alternative is technically two-step formation, in terms of fractionation it is one step. We rendered this hypothesis unlikely.

Rates of recent deletions are actively influenced by selection on gene functions

We found that deletions in the intronic sequences have occurred approximately eight times more frequently than deletions in the exonic sequences (Table 3). Intuitively, corresponding exons are usually much more conserved in genome comparisons than the sequences of the introns, when exons are under strong selective pressure to retain coding capacity. Although the majority of the deletions occurred in the intronic sequences, there are still deletions in the exonic sequences that are tolerated by the B. rapa genes in our comparisons. Both functionally and in terms of technical reliability, exonic deletions are the more relevant to our study.

We consider the deletions in exonic sequences to be recent events, following the assumption that once a gene undergoes short deletions within exons, it is increasingly likely to be rendered nonfunctional, and more deletions will eventually accumulate until the gene sequence is no longer identifiable. Therefore, the presence of one or a few small deletions in the exonic sequences of any gene suggests that the deletion event was relatively recent.

On the whole-gene level, we found that singly retained genes contain fewer sequence removals than the doublets or triplets. In general, genes with duplicate copies show a higher likelihood of functional compensation than single-copy genes (Gu et al. 2003). However, there are many exceptions to this rule. Many duplicate genes that have survived the sequence removal processes following polyploidy have diversified in their regulatory roles (Tang et al. 2010) and have acted as the hub or bottleneck enzymes of metabolic pathways (Wu and Qi 2010). One well-studied example is the FLC gene, for which all three copies are retained in the B. rapa genome (Schranz et al. 2002). All three B. rapa homeologs are additive, in a dosage-dependent manner. We failed to find exonic deletions in any one of the three copies of the FLC homeologs in B. rapa, suggesting that the structures of the three copies are relatively intact. There is likely selective advantage in retaining multiple copies of those genes critical to the adaptation to the environment, e.g., flowering time, so that the actual dosage of the protein can be finely modulated.

Recent deletions in B. rapa homeologs are affected by a complex interplay of genome dominance

Our initial expectation was that the ratios of small deletion events per subgenome should match the gene fractionation ratio per subgenome. In maize, the deletion bias matched the fractionation bias between homeologs (Woodhouse et al. 2010). However, in B. rapa, the ongoing deletions do not closely reflect the pattern of gene loss: among the three subgenomes of B. rapa, subgenome II appears to have undergone more deletions than either subgenome I or subgenome III (Table 4).

Our proposed two-step fractionation hypothesis is capable of explaining this unexpected deviation of subgenome II from the general trend of fractionations. During the first genome merger, fractionation bias favored the retention of genes from one subgenome vs. another, and then the more fractionated genome may have reached its level of saturation for fractionation (subgenome I). When a new genome was introduced, the earlier, more retained genome (subgenome II) may have undergone more fractionation in comparison to the new genome as well as the older, greatly fractionated genome (Figure 6). In this case, one would expect to see more deletions in the genome undergoing more recent fractionation. We would then hypothesize that the genome now undergoing the most fractionation is subgenome II. The two-step process involves a shifting role for subgenome II: after the first step, it is the dominant genome, but only until the second step, when subgenome III dominates both other subgenomes (I and II veteran subgenomes) (Figure 6).

Deletions, not point mutations, are the major mechanism for gene inactivation

We failed to observe consistent relationships among the three subgenomes using both the synonymous substitutions (Ks) and the gene tree method, indicating that base substitutions do not account for the fractionation biases or the three parental species are approximately equally diverged. In contrast, the deletion rates are distinctly different among the three subgenomes, suggesting that the deletions contribute more to the fractionation process than point mutations.

In mammals, the mode of gene inactivation appears to be a pseudogene pathway, which mostly involves point mutations that over time accumulate to render gene products nonfunctional (Schrider et al. 2009). In the case of B. rapa, we favor the sequence removal model as we have observed the presence of small deletions within exons that track the fractionation biases (more exonic deletions in subgenomes I and II than in subgenome III), while in comparison, frequencies of point mutations among the three subgenomes are similar.

Sequence removals are likely facilitated by illegitimate recombination via direct repeats

We argue that fractionation in B. rapa appears to be due in part to a short deletion mechanism via illegitimate recombination, similar to previous observations in maize (Woodhouse et al. 2010). Direct repeats exhibit high levels of recombination intensity (Rocha 2003). Bzymek and Lovett (2001) proposed three major mechanisms for illegitimate recombination: simple replication slippage, sister-chromosome exchange-associated slippage, and single-strand annealing. Presence of the genomic repeats in proximity, e.g., simple di- and trinucleotide repeats, increases the likelihood of illegitimate recombination (Table 5).

Illegitimate recombination is not the only avenue for sequence removals. Interspersed repeats also tend to carry and transpose DNA segments. We could not find such a bias on the basis of our scan for major types of repeat elements (Table 6), suggesting that the sequence removals—especially the “biased” removal patterns—are unlikely to be caused by interspersed repetitive elements, e.g., transposons or retrotransposons.

Cumulative small deletions set the ground for whole-gene removals

We simulated the process of generating the observed gap patterns. Our likelihood-ratio test combined with simulations (see Methods) suggested that small deletion sizes consisting of mostly single-gene and a small number of two-gene deletions are sufficient to generate the observed patterns. The pattern of predominantly one- or two-gene removals at a time is in concordance with the observation of small, internal deletion of genes, in which small chunks of sequences are removed per deletion event. However, when enough sequence removals have accumulated and/or removal of sequences containing critical functional domains has taken place, the entire gene function is compromised and thus more mutations will follow, since there is no purifying selection in place to protect against deleterious mutations. We therefore view the short deletion mechanisms as cumulative mutations that eventually resulted in whole-gene removals that have shaped the gene loss patterns we observe.

Alternative hypotheses that might explain genome dominance in B. rapa

Although we have support for the “two-stage” scenario through the observation that subgenome II contains the most exonic deletions, we do not exclude other possibilities that might also contribute to the current gene loss pattern among B. rapa subgenomes. For example, paleohexaploidy could have quickly taken place with no initial differences in fractionation, but the three genomes may instead have acquired different epigenetic marks, and these epigenetic differences included the fractionation differences observed in B. rapa today. “Genomic dominance” of subgenome D over subgenome A is clear in the allotetraploid cotton genome (Flagel and Wendel 2010). Differential epigenetic marks resulting in the differential expression strengths of duplicate genes might correlate with gene retention favoring the homeolog with higher expression levels. Work in A. suecica showed that homeologous gene loss is certainly correlated with levels of expression and perhaps histone modifications as well (Chang et al. 2010). Genomic and epigenomic changes directed toward one parental genome have also been observed in B. napus (in which the C subgenome tended to be preferentially modified) (Gaeta et al. 2007) and in Triticale polyploids (in which the rye subgenome tended to be preferentially modified) (Ma and Gustafson 2006). Biases in the strength and patterns of epigenetic modifications can lead to different selective constraints on the subgenomes. Such biases may be still ongoing in the modern-day Arabidopsis and maize genomes, and the bias is associated with differential epigenetic marks that result in differential expression levels between homeologs. Biased gene loss is the result of selection against the loss of the homeolog copy that has a higher expression value (which is more likely to compromise the biological function) (Schnable and Freeling 2011).

In any case, there is nothing equal about the behavior of the three different genomes in B. rapa. The competing model that involves differential epigenetic marking might have impact on the subgenome differences we observe, which we hope to evaluate by studying the patterns of gene expression or histone modifications in B. rapa using high-throughput RNA-seq or CHIP-seq data.


We appreciate financial support from the U.S. National Science Foundation (MCB-0820821 to M.F.).


  • Received November 29, 2011.
  • Accepted January 31, 2012.

Literature Cited

View Abstract