The r1 and b1 genes of maize, each derived from the chromosomes of two progenitors that hybridized >4.8 million years ago (MYA), have been a rich source for studying transposition, recombination, genomic imprinting, and paramutation. To provide a phylogenetic context to the genetic studies, we sequenced orthologous regions from maize and sorghum (>600 kb) surrounding these genes and compared them with the rice genome. This comparison showed that the homeologous regions underwent complete or partial gene deletions, selective retention of orthologous genes, and insertion of nonorthologous genes. Phylogenetic analyses of the r/b genes revealed that the ancestral gene was amplified independently in different grass lineages, that rice experienced an intragenomic gene movement and parallel duplication, that the maize r1 and b1 genes are descendants of two divergent progenitors, and that the two paralogous r genes of sorghum are almost as old as the sorghum lineage. Such sequence mobility also extends to linked genes. The cisZOG genes are characterized by gene amplification in an ancestral grass, parallel duplications and deletions in different grass lineages, and movement to a nonorthologous position in maize. In addition to gene mobility, both maize and rice regions experienced recent transposition (<3 MYA).
THE grass family (Poaceae) is well known for the economic importance of its members, such as rice (Oryza sativa), maize (Zea mays), barley (Hordeum vulgare), oats (Avena sativa), sorghum (Sorghum spp.), wheat (Triticum spp.), and rye (Secale cereale). Even though nuclear DNA content among members of the family varies extensively, >40-fold (Bennett and Smith 1976; Bennett et al. 2000), comparative mapping studies (Hulbert et al. 1990; Ahn and Tanksley 1993; Moore et al. 1995; Gale and Devos 1998; Feuillet and Keller 2002) have demonstrated that significant portions of the genomes are conserved (collinear) and that, aside from polyploidization, the large size of a genome is mostly due to a high content of repetitive elements (SanMiguel and Bennetzen 1998; Meyers et al. 2001). Large-scale studies using genetic maps based on DNA markers are limited to the detection of large-scale chromosomal rearrangements, such as translocations, intrachromosomal inversions, genome replication, chromosomal fusion, and rearrangements that characterize specific clades. For instance, three translocations marked the ancestral lineage of Panicoideae (Gale and Devos 1998) and genome replication, chromosomal fusion, and rearrangements occurred on the Brassica lineage after its divergence from the Arabidopsis thaliana lineage (Lagercrantz 1998). However, evolutionary forces that lead to speciation of closely related species work locally within the genomes. Recently, to uncover characteristics of local evolution of plant genomes, several researchers compared the genomic content and organization of orthologous regions at the sequence level across various grass species. Regions analyzed to date are the sh2/a1 region (Chen et al. 1997, 1998), the Lrk region (Feuillet and Keller 1999), the adh1/2 region (Tikhonov et al. 1999; Ilic et al. 2003), the 22-kD Zein cluster region (Song et al. 2002), the Vrn1 region (Ramakrishna et al. 2002a), the Rp1 region (Ramakrishna et al. 2002b), the Rph7 region (Brunner et al. 2003), and the lrs/lg region (Langham et al. 2004). Although these studies generally confirmed gene conservation (microcollinearity) among orthologous segments of various grass taxa, they also reported small-scale genic rearrangements, such as gene insertions, deletions, amplifications, inversions, and translocations.
These studies are too few in number to produce a general picture of genome evolution. Only a few of them compare sequences from more than two species and very few studied duplicated (homeologous) regions of polyploid species. Furthermore, only a small number of them applied evolutionary analytical methods. Phylogenetic analyses, which help interpret the pattern and direction of genome evolution and facilitate reconstruction of the common ancestor, can provide more accurate insight into the history of gene duplication and genic rearrangements when coupled with small-scale comparative genomic studies. In this respect, rice (with 430 Mb of nuclear DNA per haploid genome), sorghum (750 Mb), and maize (2400 Mb) represent suitable species to study the patterns of local genomic evolution because of their moderate difference (5.6-fold) in genome size and known phylogenetic relationships. Because the divergence of a rice ancestor from the lineage that gave rise to sorghum and maize represents the largest temporal distance within the grass family, ∼50–70 million years ago (MYA; Wolfe et al. 1989), tracing genomic rearrangements between these taxa may allow us to reconstruct the genomic structure of the common ancestor of grasses. The time of sorghum and maize divergence from a common ancestor has been estimated to be 11.9 MYA (Swigoňová et al. 2004). An additional benefit is that the rice genome has largely been sequenced, with its chromosomes now available as 12 pseudomolecules (http://www.tigr.org/tdb/e2k1/osa1/pseudomolecules/info.shtml).
Here, we focus on the r/b chromosomal region mostly because the r1 and b1 genes of maize have been subjects of many genetic studies (Neuffer et al. 1997) and a comprehensive analysis of molecular evolution of the gene family in grasses was lacking. The r/b genes, together with members of the c/pl gene family, coregulate transcription of structural genes in the anthocyanin biosynthetic pathway (Ludwig and Wessler 1990; Dooner et al. 1991). Red and purple anthocyanins, the common plant pigments, are produced in a wide variety of tissues where they serve many different functions, such as attracting pollinators and dispersal agents and protecting against insects, pathogens, and UV light. Biochemical and molecular studies showed that one functional allele from each gene family is required for activation of the structural genes of the anthocyanin biosynthetic pathway and that the pigmentation pattern of a plant tissue reflects allelic constitution at both of the regulatory loci (Dooner 1983; Cone et al. 1986; Goff et al. 1991, 1992; Bodeau and Walbot 1992; Tuerck and Fromm 1994; Sainz et al. 1997; Grotewold et al. 1998; Lesnick and Chandler 1998).
The r/b gene family in maize is represented by the R locus (Dellaporta et al. 1988) on the long arm of chromosome 10 (here referred to as maize-10L), the B locus (Chandler et al. 1989) on the short arm of chromosome 2 (maize-2S), and displaced genes, such as the lc (Ludwig et al. 1989) and sn (Consonni et al. 1992) genes located ∼2 cM distal to the R locus (Dooner and Kermicle 1976; Gavazzi et al. 1990) and the hopi gene located 4.5 cM distal to the R locus (Petroni et al. 2000). (To obtain consistency throughout the manuscript, despite the previously published form, we use uppercase letters for loci and lowercase letters for genes.) Depending on the maize variety, the R locus may contain one (Dellaporta et al. 1988) or up to four r genes (Eggleston et al. 1995). Unlike the R locus, the B locus contains only one gene, the b1 gene (Patterson et al. 1991). Both the r1 and the b1 genes in maize can induce tissue pigmentation in other plants, such as wheat, barley, sorghum, petunia, Arabidopsis, Nicotiana, and Lycopersicon (Lloyd et al. 1992; Bilang et al. 1993; Casas et al. 1993; Quattrocchio et al. 1993; Galway et al. 1994).
In this study, we present a comparison of genomic content of orthologous chromosomal regions containing the r/b gene homologs from rice, maize, and sorghum. We examine which types of genome evolution resulted in the observed genomic structure of these regions across the three species and between the two duplicated (homeologous) regions of maize. We present a detailed analysis of molecular evolution of the r/b gene family in grasses and examine the evolutionary pattern of gene duplication of the tightly linked cis-zeatin O-glucosyltransferase (cisZOG) gene. We also examine the content and history of retrotransposition within these regions.
MATERIALS AND METHODS
Bacterial artificial chromosome isolation and mapping of r/b homologs:
An r1 probe obtained from maize by PCR of the lc gene sequence (GenBank accession no. M26227) was used to screen high-density filters for maize (Zea mays L. cv B73) and sorghum (Sorghum bicolor cv BTx623) bacterial artificial chromosome (BAC) libraries (Yim et al. 2002; http://www.tamu.edu/bacindex.html). Positive clones were digested by NotI and HindIII for fingerprint analysis and further analyzed by restriction mapping. The purified BAC DNA, isolated using a BAC DNA isolation kit (QIAGEN, Valencia, CA), was physically sheared and ligated into a pUC vector for shotgun libraries, as previously described by Song et al. (2001). Sequencing was done on an ABI 3700 capillary sequencer using the ABI PRISM BigDye Terminator Cycle Sequencing Ready Reaction kit (Applied BioSystems, Foster City, CA). Base calling and assembly were based on phred/phrap programs (Ewing et al. 1998). About 10× coverage was generated for all of the BACs and sequence gaps were finished by specific primers identified with the help of consed (Gordon et al. 1998) or by transposon minilibraries, constructed according to the manufacturer's instructions (Finnzyme, Helsinki).
Genes predicted by FGENESH (http://www.softberry.com/berry.phtml?topic=gfind&prg=FGENESH) and GENSCAN (http://genes.mit.edu/GENSCAN.html), using the monocots option, were further analyzed by homology searches using a variety of BLAST searches (Altschul et al. 1997) against GenBank nucleotide, protein, and EST databases. The intergenic regions were analyzed separately for the presence of transposable elements using homology-based searches against GenBank and the TIGR Gramineae repeat database (http://tigrblast.tigr.org/euk-blast/index.cgi?project=plant.repeats).
Genomic sequences of the r/b homologs identified within the studied genomic intervals were aligned against the published cDNA of the lc gene (Ludwig et al. 1989) using the BESTFIT program of the Wisconsin GCG package (Devereux et al. 1984). Predicted ORFs of the cisZOG gene were aligned against the published cDNA of cisZOG1 and cisZOG2 genes of maize (AF318075 and AY082660, respectively). Multiple sequence alignment of full-length coding sequences performed by ClustalX (Thompson et al. 1997) was manually adjusted with reference to amino acid alignment using MacClade 4 (Maddison and Maddison 2000). In addition to the r/b genes identified within the studied intervals, 8 full-length and 15 partial sequences of r/b homologs were used from GenBank: maize lc (M26227), sn (X60706), hopi (AJ251719), rs (X15806), and b-Peru (X57276); rice plw-OSB1 (AB021079), plw-OSB2 (AB021080), ra (U39860–65), and rb (AP003310, U39867, U39868, U39866); Phyllostachys acuta r gene (U11448); Pennisetum glaucum r genes (U11444–47); and Tripsacum australe r gene (U11449).
Phylogenetic analyses and rate estimation:
Coding sequences of the r/b homologs corresponding to part of the second through part of the ninth exons of the lc gene and the entire coding sequence of the cisZOG homologs were used for phylogenetic analyses and rate estimations. Phylogenetic analyses were performed using PAUP* 4.0b10 (Swofford 1999). Maximum parsimony (MP) analyses were performed with branch-and-bound search, and maximum likelihood (ML) analyses were performed using heuristic searches with 100 random taxon addition replicates with TBR branch swapping. Evolutionary models used in ML analyses were selected using Modeltest 3.06 (Posada and Crandall 1998) with the Akaike Information Criterion. In the case of the r/b homologs, we also performed Bayesian analysis using MrBayes v3.0b4 (Huelsenbeck and Ronquist 2001) on data partitioned according to variable and conserved regions (see below). Support for individual nodes within tree topologies was evaluated by a nonparametric bootstrap analysis (Felsenstein 1985) performed with 100 pseudoreplicates with 10 random taxon addition heuristic searches in each pseudoreplicate. Synonymous and nonsynonymous distances were estimated by the maximum likelihood method implemented in codeml of PAML (Yang 1997) under the F3x4 model (Goldman and Yang 1994). Gene rates at synonymous sites were calculated as described by Swigoňová et al. (2004). For the cisZOG genes, the program r8s (Sanderson 2002a), which implements the semiparametric method using penalized likelihood (Sanderson 2002b), was used to assess variance in evolutionary substitution rates and incorporate such variance into the estimation of ages of lineages. The absolute ages of gene duplications were estimated using two calibration points. One was set to 50 MYA for the divergence of rice, sorghum, and maize (Wolfe et al. 1989) and another to 11.9 MYA for the divergence of sorghum and maize (Swigoňová et al. 2004).
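The rate calculation itself is not spelled out in this section. As a rough illustration only, the sketch below assumes the conventional molecular-clock relation, in which a synonymous distance dS between two lineages accumulates along both branches since their divergence T years ago, so that rate = dS / (2T), and a duplication age follows from the same relation rearranged. The function names and the example distances are hypothetical, and the sketch is not the penalized-likelihood procedure of r8s.

```python
# Minimal molecular-clock sketch (assumed relation: rate = dS / (2 * T)).
# Function names and example values are hypothetical illustrations.

SORGHUM_MAIZE_SPLIT_YEARS = 11.9e6  # calibration point quoted in the text


def synonymous_rate(ds: float, divergence_years: float) -> float:
    """Substitutions per synonymous site per year for a pair of orthologs."""
    if divergence_years <= 0:
        raise ValueError("divergence time must be positive")
    return ds / (2.0 * divergence_years)


def duplication_age(ds_paralogs: float, rate: float) -> float:
    """Age in years of a duplication, given the dS between the two paralogs."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ds_paralogs / (2.0 * rate)


if __name__ == "__main__":
    # Hypothetical dS values, chosen only to show the arithmetic.
    rate = synonymous_rate(0.155, SORGHUM_MAIZE_SPLIT_YEARS)
    age = duplication_age(0.120, rate)
    print(f"rate = {rate:.2e} per site per year, age = {age / 1e6:.1f} MYA")
```

A strict clock of this kind ignores rate variation among lineages, which is why a relaxed method such as penalized likelihood is the safer choice when a rate heterogeneity test rejects constancy.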
Dating of retrotransposon insertion:
Both long terminal repeats (LTRs) of each identified LTR-retrotransposon were aligned using ClustalX (Thompson et al. 1997) and the resulting alignment was manually reviewed. Distance estimations between pairs of LTRs were based on Kimura's (1980) two-parameter model (K2P). Ma and Bennetzen (2004) recently showed that nucleotide substitutions in the LTR-retrotransposons of rice (and, presumably, other grasses) occur at rates at least twofold higher than synonymous substitutions in functioning grass genes; we therefore used their proposed rate of 1.3 × 10⁻⁸ substitutions/site/year for dating LTR-retrotransposon insertion times.
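The dating step can be sketched as follows. The K2P distance is the standard formula K = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The conversion to time, T = K / (2r), is the relation conventionally used for LTR pairs and is an assumption here, since the text does not state the conversion explicitly. Function names are hypothetical.

```python
import math

LTR_RATE = 1.3e-8  # substitutions/site/year (Ma and Bennetzen 2004)

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(ltr5: str, ltr3: str) -> float:
    """Kimura two-parameter distance between two aligned LTR sequences."""
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to equal length")
    transitions = transversions = compared = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time_years(k: float, rate: float = LTR_RATE) -> float:
    """Assumed conversion T = K / (2r): both LTRs diverge after insertion."""
    return k / (2.0 * rate)
```

Under these assumptions, a pair of LTRs with K = 0.026 would date to about 1 million years, and identical LTRs give an insertion time of zero.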
RESULTS
Isolation and sequencing of r/b orthologous regions in maize and sorghum:
Analysis of r/b homologs derived from a common ancestral chromosome requires the isolation of BAC clones from closely related, orthologous chromosomal regions. Screening of the sorghum and maize BAC libraries with the r1 probe yielded such clones, which were assembled by fingerprint analysis into a single BAC contig in sorghum and two BAC contigs in maize. This result was expected, given the duplicated nature of the r/b region within the maize genome. To extend the sequence window for comparative analyses, we selected and fully sequenced one additional BAC clone corresponding to the 3′ extension of the maize-10L region (Table 1). To examine whether conserved genes are located within a single BAC-sized extension of the sequenced contig, another BAC clone, representing the 3′ extension of the maize-2S region, was only partially sequenced (assembly ended in phase 2, 9 ordered contigs, 134,220 bp). Two additional conserved genes were identified in this extension. See Table 1 for descriptions of the BAC contigs used in our study.
Genomic comparison of orthologous regions:
Genes identified within the studied segments are listed in Table 2. Three of the 15 candidate genes in rice and 3 of the 13 candidate genes in sorghum are represented by multiple copies. All 11 genes recognized in the maize-10L segment and all 5 genes within the maize-2S interval are present in single copies. The rice and sorghum regions (Figure 1) share 9 consecutive genes in the same order (collinear). The observed gene density in the rice segment, 11.4 kb/gene, is slightly lower than estimates for the entire genome (7–9 kb/gene; Song et al. 2001; Feng et al. 2002; Goff et al. 2002; Sasaki et al. 2002; Yu et al. 2002; Rice Chromosome 10 Sequencing Consortium). The sorghum r/b region with 7.9 kb/gene exhibits a density similar to those previously reported from other sorghum regions (Chen et al. 1997; Tikhonov et al. 1999; Ramakrishna et al. 2002a). The gene density significantly differed between the two duplicated regions of maize (26.4 kb/gene in maize-10L and 49.1 kb/gene in the maize-2S segment). Within the window of 9 genes shared by rice and sorghum, the microcollinearity of the two regions is disrupted by the presence of genes 3A, 3B, 6, and 8 in the sorghum segment (Table 2).
Six of the consecutive genes shared by rice and sorghum (corresponding to rice genes 4, 5, 6, 7, 8, and 9) were also detected within the homeologous regions of maize; however, only the genes corresponding to rice genes 4, 6, and 8 are present in both maize regions as homeologous duplicates. Two of these shared genes, the r/b gene homolog and the glutathione peroxidase, are of full length. The third conserved gene (corresponding to rice gene 6) is truncated in both the rice and the maize-2S segments. Both maize segments experienced gene deletions, but the pattern of deletion differs between the maize-10L and -2S segments. Disregarding gene amplifications, the maize-10L region experienced deletion of five genes (corresponding to rice genes 7, 9, 10, 11, and 12) and the maize-2S region experienced deletion of at least one gene (corresponding to rice gene 5). Microcollinearity on maize-10L is interrupted by the presence of two genes (5 and 6) not found in this region in rice and sorghum and by the absence of the cisZOG gene (rice gene 7). On maize-2S, genes 2 and 3 (corresponding to rice genes 7 and 6, respectively) are inverted relative to their positions in rice and sorghum (Figure 1).
The rice region was found to contain many fragmented LTR-retrotransposons (not shown) and two intact LTR-retrotransposons (Figure 1). We did not find any fragmented or intact retrotransposons in the sorghum segment, in accord with results for the adh1/2 (Tikhonov et al. 1999) and the orp1/2 regions (Lai et al. 2004). One DNA transposon, the putative TNP2, is shared by the maize-10L region and sorghum, suggesting that its insertion predated the time of sorghum and maize divergence. Both maize chromosomal segments are rich in LTR-retrotransposons. The identified retrotransposons constitute 55% of the maize-10L and 76% of the maize-2S genomic sequences. Of the 17 LTR-retrotransposons identified within the maize-10L region, only 4 are nested. In the maize-2S region, 5 LTR-retrotransposons are nested and 1 is present as a solo LTR within another LTR-retrotransposon. In both maize regions the identified LTR-retrotransposons seem to be full length.
Molecular analysis of the r/b homologs:
To investigate the relationship among the r/b gene homologs of sorghum and maize, we used rice as the outgroup. To extend the taxon sampling, we also included full-length sequences of previously described r/b homologs from maize (lc, sn, hopi, rs, and b-Peru) and rice (ra, rb, plw-OSB1, and plw-OSB2). The aligned coding sequences, containing 2016 nucleotides, correspond to part of the second exon through part of the ninth exon of the lc gene of maize (http://pgir.rutgers.edu/).
The rice region contains three r/b homologs (further referred to as r/b-O1, r/b-O2, and r/b-O3), the sorghum region contains two (r/b-S1 and r/b-S2), and each maize region contains one r/b homolog [r1 on maize-10L (representing a simplex R-P allele) and b1 in the maize-2S region]. [In this study, names of genes identified within the studied segments are written as the name of a gene homolog (such as r/b or cisZOG) followed by a hyphen, the first letter of the genus name of the organism (O for Oryza, S for Sorghum, Z for Zea), and a number indicating its order of occurrence on the studied genomic region from the 5′ to the 3′ end. In the case of the r/b genes of maize, we follow the traditional nomenclature (r1 and b1).] The nucleotide sequences of the two r/b homologs of sorghum are 89.8% identical and on average they share 86.8 and 88.2% nucleotide identity with the maize r1 (shown as R-P in Figure 2) and b (shown as B-b in Figure 2) genes, respectively. The two maize r/b homologs are 87.5% identical in their nucleotide sequences. Rice r/b-O1 contains a premature stop codon at position 520–522 in the nucleotide alignment (http://pgir.rutgers.edu/) and is 99.2% identical to plw-OSB1 and 98.8% identical to the ra gene. Gene r/b-O2 is 99.8% identical to plw-OSB2 and also lacks a C terminus, as previously reported for plw-OSB2 (Sakamoto et al. 2001). Both r/b-O2 and plw-OSB2 have premature stop codons in their sequences, the first one at position 1648–1650 in the alignment. Therefore, substitution rates were estimated for all pairs of genes excluding r/b-O1, r/b-O2, and plw-OSB2, since these three are expected to evolve under different constraints. Rice r/b-O3 exhibits 70.1 and 60.9% nucleotide identity with r/b-O1 and r/b-O2, respectively.
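Pairwise identity values such as those quoted above can be computed from an alignment along the following lines. This is a minimal sketch: it assumes that columns containing a gap in either sequence are excluded from the comparison, which may differ from how the alignment software used in the study counted gaps, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical positions among ungapped aligned columns."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are not scored
        compared += 1
        if x == y:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment, not real r/b sequence data.
    print(percent_identity("ACGT-ACGTA", "ACGTTACGAA"))
```

Whether gaps count as mismatches changes the reported percentage, so the gap convention should be stated alongside any identity value.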
Phylogenetic analyses of amino acid sequences as well as nucleotide sequences recovered the same topology for the gene tree (as shown in Figure 2A). All rice r/b sequences were recovered in a monophyletic group with two clades, one containing r/b-O2, plw-OSB2, and r/b-O3 sequences and the other containing the r/b-O1, plw-OSB1, and ra genes. The two sorghum r/b homologs were recovered as a sister clade to the clade containing maize b-Peru and the maize r/b homolog from chromosome 2S of B73 (b, also known as B-b). The r/b gene homolog from maize chromosome 10L (r1, also known as R-P) belongs to a clade that contains the r/b homologs rs, lc, sn, and hopi, all mapped to the long arm of chromosome 10.
To further examine the stability of the recovered relationships among the 15 full-length sequences of the r/b homologs (as shown in Figure 2A) and to investigate the phylogenetic relationship of the r/b homologs among other members of the grass family, we extended our data set with an additional 15 partial sequences of r/b homologs that were already available in GenBank. Those sequences, covering only the conserved basic helix-loop-helix domains and the 5′ end of the C-terminal domains (exons 8 and 9 of the lc gene of maize), are from wild species of rice and also from P. acuta (tribe Bambuseae), P. glaucum (tribe Paniceae), and T. australe (tribe Andropogoneae). The recovered topology is largely in agreement with the topology recovered from the full-length sequences. In addition, the rice sequences were recovered in two well-supported clades (Figure 2B). One clade contains only sequences from the rice Ra locus (chromosome 4), while the other clade includes two sister clades, one containing r/b homologs mapping to chromosome 4 (the Ra locus) and another containing sequences from chromosome 1 (the Rb locus). Prompted by the recent study of Paterson et al. (2004), who reported that the ancestral lineage leading to cereal taxa experienced whole-genome duplication at ∼70 MYA, we tested whether the Ra and Rb regions of rice are indeed the product of the ancient duplication by examining the genic structure surrounding the Rb locus. Using the Rice FPC map of the Arizona Genomics Institute (http://www.genome.arizona.edu/fpc/rice/) we identified three overlapping BAC clones (AP003432, AP003310, and AP002908) that produced a 361,635-bp-long contig covering the Rb locus. Within this contig we identified two r/b homologs, here called rb1 and rb2, that were included in additional analyses. 
Sequence annotation did not recover any additional shared genes between the regions of rice chromosome 4 (Ra locus) and chromosome 1 (Rb locus), indicating that those two loci are not the products of a genomic duplication. Phylogenetic analysis using partial sequences of r/b homologs recovered rb1 and rb2 as sister taxa at the base of the r/b homologs from the Rb locus of other rice species (as shown in Figure 2B). Similarly, analysis of the full-length sequences placed the two Rb genes as a sister clade to the r/b-O2, r/b-O3, and plw-OSB2 clade (Figure 2A), indicating that the ra and rb genes are paralogs resulting from duplication of one single ancestral r/b gene at the lineage leading to rice. Using the r/b sequence from Antirrhinum majus delila (M84913) as the outgroup we estimated that the duplication of the ancestral r/b gene of rice occurred at ∼27.5 MYA.
In addition, all four r/b homologs of P. glaucum form a monophyletic clade, indicating their origin by duplication of a single ancestral gene. Furthermore, the Tripsacum r/b gene is basal to sorghum and maize r/b genes. However, the two sorghum r/b homologs were recovered in polyphyletic position at the base of maize genes from maize chromosome 2S (Figure 2B). Yet, one has to keep in mind that this data set contains shorter sequences and a larger number of terminal taxa. Therefore, we performed an additional Bayesian analysis of the full-length sequence data. We divided the data set into five variable and three conserved regions and estimated the parameters of each partition independently. The recovered topology supported the sister relationship of the two sorghum r/b sequences with a Bayesian posterior probability of 100, further supporting the hypothesis that they represent paralogous genes that originated from a duplication of an ancestral gene after sorghum and maize lineages diverged from a common ancestor.
Molecular analysis of the cisZOG gene:
Given the range of mobility exhibited by the R/B locus in the last 50 million years, one wonders whether it might relate to its function and position in its chromosomal region. However, another locus in this region, just 43 kb downstream of the b1 gene, with even greater mobility, has a completely different function. The CisZOG locus encodes a protein that promotes cell division and plant growth (Mok 1994). Using sequence divergence we investigated the history of its gene duplication across the three grass species. All identified cisZOG ORFs consist of one 1377- to 1407-bp exon. Three copies of the cisZOG gene were predicted in the rice chromosomal segment (see genes 7A, 7B, and 7C in Figure 1, hereafter referred to as cisZOG-O1, cisZOG-O2, and cisZOG-O3, respectively). Genes cisZOG-O1 and cisZOG-O2 are 95.7% identical. The cisZOG-O3 gene shares 73.95 and 74.95% nucleotide identity with cisZOG-O1 and cisZOG-O2, respectively. Translation of the nucleotide sequences to the amino acid sequences revealed the presence of premature in-frame stop codons within the cisZOG-O3 gene at positions 1141–1143 and 1405–1407 of the sequence alignment (http://pgir.rutgers.edu/). A homology search using BLAST recovered another putative cisZOG gene on rice chromosome 7 (GenBank accession no. AP004378). This gene is 72.2% identical at the nucleotide sequence level to cisZOG-O1, 72.7% to cisZOG-O2, and 70% to the cisZOG-O3 gene. Phylogenetic analyses as well as a rate heterogeneity test showed that this gene, from rice chromosome 7, evolves at a significantly increased rate compared to all other cisZOG genes from the four studied regions. Moreover, its relative position to the other cisZOG homologs varied in different analyses. Thus, the chromosome 7 cisZOG gene is suspected to evolve under different constraints and was not used in further analyses.
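A scan for premature in-frame stop codons of the kind reported for cisZOG-O3 can be sketched as below. The sketch assumes an ungapped coding sequence that starts in reading frame 1 and uses the standard stop codons; the positions it returns are therefore sequence coordinates, not the alignment coordinates quoted in the text. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds: str) -> list:
    """Return 1-based (start, end) positions of in-frame stop codons.

    The final complete codon is skipped because a terminal stop codon is
    expected there. Assumes an ungapped sequence beginning in frame 1.
    """
    seq = cds.upper()
    n_codons = len(seq) // 3
    hits = []
    for i in range(n_codons - 1):
        codon = seq[3 * i:3 * i + 3]
        if codon in STOP_CODONS:
            hits.append((3 * i + 1, 3 * i + 3))
    return hits


if __name__ == "__main__":
    # Toy sequence with one internal stop codon, not real cisZOG data.
    print(premature_stops("ATGTAAGGGTGA"))
```

Reporting positions as inclusive 1-based ranges matches the style used in the text (for example, 1141–1143).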
Five cisZOG genes were identified within the sorghum interval (further referred to as cisZOG-S1 through cisZOG-S5), one of which is in an inverted position (7C in Figure 1). While the maize-2S region contains one cisZOG gene (cisZOG-Z1), the orthologous region of maize-10L lacks this gene. Martin et al. (2001) reported isolation of the cisZOG1 gene from maize, which exactly matches cisZOG-Z1 from the maize-2S region. Veach et al. (2003) reported isolation of another gene, cisZOG2 (referred to here as cisZOG-Z2), from the maize genome, but its chromosomal location was not known. The nucleotide sequences of cisZOG-Z1 and cisZOG2 are 98.1% identical. We included sequences of both maize genes in our analyses. Maximum parsimony analyses of nucleotide and amino acid sequences resulted in gene trees with the same topology as the maximum likelihood tree shown in Figure 3A. The estimated phylogram shows several long branches, for instance the lineages of cisZOG-O3 and cisZOG-S5. A rate heterogeneity test revealed that rates of substitution vary significantly among lineages. To decrease the effect of long branches on the resulting phylogeny, we examined the number of variable sites in nonoverlapping segments of 100 sites. In the following analyses, we excluded the three most variable regions from the data. Analyses of this reduced data set (1098 characters) resulted in a tree topologically identical to that shown in Figure 3A. However, the rate heterogeneity test still uncovered significant differences for several lineages of the gene tree. Thus we performed several additional analyses with more extensive exclusion of variable characters and also exclusion of sequences that showed significant increases in their rate of change. The resulting trees, although often yielding less resolution (lineages were collapsing to polytomies), all had the same backbone topology as shown in Figure 3A.
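The window-based exclusion of variable regions described above can be sketched as follows. This is an illustration under stated assumptions: a site is counted as variable when more than one nucleotide state occurs in the column (gaps and N ignored), a final window shorter than the window size is kept as its own segment, and ties are broken by position. The function names are hypothetical.

```python
def variable_sites_per_window(alignment, window=100):
    """Count variable columns in nonoverlapping windows of an alignment.

    Returns a list of (start, end, n_variable) with 0-based, end-exclusive
    coordinates. Gaps ("-") and "N" are ignored when scoring a column.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same aligned length")
    counts = []
    for start in range(0, length, window):
        end = min(start + window, length)
        n_variable = 0
        for col in range(start, end):
            states = {seq[col].upper() for seq in alignment} - {"-", "N"}
            if len(states) > 1:
                n_variable += 1
        counts.append((start, end, n_variable))
    return counts


def drop_most_variable(alignment, window=100, n_drop=3):
    """Remove the n_drop windows with the most variable sites."""
    counts = variable_sites_per_window(alignment, window)
    worst = sorted(counts, key=lambda w: w[2], reverse=True)[:n_drop]
    excluded = set()
    for start, end, _ in worst:
        excluded.update(range(start, end))
    return ["".join(ch for i, ch in enumerate(seq) if i not in excluded)
            for seq in alignment]
```

With the defaults (windows of 100 sites, three windows dropped) the reduced alignment is up to 300 columns shorter than the input.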
Using the estimated phylogenetic relationship among the cisZOG genes, as shown in Figure 3A, we traced the history of gene duplication. For estimation of duplication times we used the cisZOG homologs from soybean (AF489873 and AF489874), Phaseolus (AF489876 and AF489877), and Lycopersicon (AY082661) as the outgroups. When the gene tree was rooted at the outgroup species (as marked in Figure 3A), cisZOG-S5 was recovered at the base of the ingroup genes. Hence, cisZOG-S5 represents the oldest gene (here marked as a descendant of the A1 gene; Figure 3B) that was deleted independently on the maize and rice lineages. Furthermore, rice cisZOG-O1 and cisZOG-O2 are close relatives of cisZOG-S1, cisZOG-Z1, and cisZOG-Z2; therefore, they must have shared a common ancestor, which we will call the A2 gene (Figure 3B). The ancestral A2 gene duplicated independently on the lineage leading to rice (∼32 MYA) and on the lineage leading to maize (∼6.7 MYA). While rice cisZOG-O1 and cisZOG-O2 are located within the same chromosomal region, one maize gene, cisZOG-Z2, was transferred to a nonorthologous location in the maize genome. Since rice cisZOG-O3 and sorghum cisZOG-S2 form a clade in all analyses performed, they must have shared a common ancestor, which we will call the A3 gene. Because the common ancestry of cisZOG-S2, -S3, and -S4 genes can be traced to an earlier date than the split of rice and sorghum lineages, there must have been a duplication on the ancestral lineage of grasses that gave rise to genes A3 and A4. The ancestral gene A3 gave rise to cisZOG-O3 and cisZOG-S2 and gene A4 gave rise to sorghum cisZOG-S3 and cisZOG-S4 genes. The duplication of the A4 gene, however, happened before the time when the two progenitors of the maize genome diverged from sorghum (∼40 MYA). Thus, the immediate ancestor of rice, sorghum, and maize had four copies of cis-zeatin O-glucosyltransferase (A1, A2, A3, and A4; see Figure 3B). 
While the rice lineage experienced a duplication and two deletions of the ancestral genes, sorghum still contains descendants of all four ancestral genes. The most extensive gene deletion occurred on the maize lineage, which contains only two duplicates of one ancestral gene; nine genes were removed from its genome.
Retrotransposons and their times of insertion:
While gene mobility has occurred throughout the evolution of this chromosomal region, even before speciation, sequence mobility of transposable elements in this region occurred mostly during very recent times. Two full-length LTR-retrotransposons, closely related to maize Tekay and Huck, were identified in the rice segment (Figure 1). They inserted in the rice genomic sequence at ∼0.8 and 0.02 MYA, respectively. Large numbers of fragmented retrotransposons were also found in the rice region (not shown). The maize-10L region contains 17 LTR-retrotransposons, representing 7 different families, and the maize-2S region contains 12 LTR-retrotransposons from 5 families. The most common were Ji (38%), Huck (24%), and Opie (17%). Eight LTR-retrotransposons in the maize-10L region are intact and not in a nested organization. Both LTRs were identified for 22 retrotransposons. Six of the 22 LTR-retrotransposons were inserted within another LTR-retrotransposon (see Figure 1). In all six cases, the two LTRs of the recipient retrotransposon showed greater nucleotide variation than the LTRs of the inserted retrotransposon, as expected if the recipient is the older element. Using an intergenic retrotransposon nucleotide substitution rate of 1.3 × 10⁻⁸ substitutions/site/year (Ma and Bennetzen 2004), we calculated the time of insertion of the 22 LTR-retrotransposons (Table 3). All inserted in the maize genomic sequences within the last 3 million years. The most pronounced activity has occurred within the last 1 million years, when >80% of the LTR-retrotransposons inserted.
Microcollinearity at the R and B loci in rice, sorghum, and maize:
While there was already genetic and molecular evidence that the r1 gene could exist in multiple copies even in distant positions, we have now found through comparison of genomic sequences that its dynamic properties have already existed throughout the evolution of the grass family. Moreover, the larger window of sequences surrounding this locus also shows that such properties are far more general than anticipated. The orthologous r/b regions available for this analysis span a window of 9 genes for comparison of rice and sorghum and a window of 6 genes for simultaneous comparison of all four genomic segments. Microcollinearity of the rice and sorghum genomic regions is disrupted by 4 genes, here labeled 3A, 3B, 6, and 8. They are not present in any of the other orthologous segments, suggesting that they have been inserted within the sorghum interval after sorghum diverged from maize. The duplicated gene 3 shows sequence similarity to the kelch repeat-containing F-box family protein whose closest homologs are found on rice chromosomes 10 and 1, respectively. Gene 6 shows similarity to a no apical meristem (NAM) protein of rice and Arabidopsis and other NAC domain proteins. In a recent study, Ooka et al. (2003) found that NAM proteins are members of NAC gene families that have at least 75 predicted genes in the rice genome and 105 genes in the Arabidopsis genome. Gene 8 is similar to a probable small nuclear ribonucleoprotein, and homologous sequences were recovered from rice chromosomes 2, 3, 7, 9, and 12. The maize-10L region contains 2 inserted genes, 5 and 6, that are not shared with any other of the studied intervals. Gene 5 shows sequence similarity to an S-domain receptor-like protein kinase gene and similar sequences were recovered from rice chromosomes 1 and 7. This gene family in maize contains at least 4 genes (Zhang and Walker 1993) and multiple members were also reported from Arabidopsis and Brassica (Braun and Walker 1996). 
Gene 6 is similar to the aldose reductase-related protein and this gene was also predicted from rice chromosomes 1 and 10. In summary, the genic interruption of microcollinearity between rice, sorghum, and maize is caused mostly by insertion of high-copy-number genes in sorghum and maize that probably occurred after sorghum diverged from the maize progenitors over the past 11.9 million years (Swigoňová et al. 2004). The fact that gene movement may be associated with gene amplification was previously pointed out by other studies (Song et al. 2002; Lai et al. 2004).
In maize, deletion of homeologous copies of duplicated genes appears to be randomized across the homeologous regions, since both segments experienced gene deletions. While the r/b region of maize-10L lost 56% (five of nine) of the genes that can be traced to the common ancestor of grasses, the maize-2S region lost only 17% (one of six) of the genes. This is in contrast to the 40% gene loss reported from each of the homeologous regions surrounding the Adh locus of maize (Ilic et al. 2003). This nonuniform deletion of duplicated genes from the homeologous loci of maize was also observed in a recent report by Lai et al. (2004). Deletion of redundant homeologous copies of maize genes over the last 12 million years has been accompanied by retention of at least one copy of each gene within the two r/b regions, similar to that observed in the maize adh region (Ilic et al. 2003). Lai et al. (2004) found only one gene that was conserved between sorghum and rice and was lost from both duplicated regions of maize. It is not yet known whether this gene was deleted from the maize genome or was transferred to a nonorthologous position.
The potential of homeologous duplicates to develop new functions seems to be rarely realized, since the most common fates of duplicated genes are gene silencing of one copy (Walsh 1995), interlocus recombination and concerted evolution of duplicates (Wendel et al. 1995; Wendel 2000), or retention of subfunctional duplicates (Cronn et al. 1999; Adams et al. 2003). Within the studied r/b region, three genes are present as homeologous duplicates. Both copies of the r/b gene and glutathione peroxidase gene (gp) are full length. However, the third gene, which shows similarity to an unknown conserved expressed protein from A. thaliana (corresponding to rice gene 6), is present only as a fragment in the maize-2S region. For the r/b genes, the estimated nonsynonymous substitution rate (Ka) was 0.088 (±0.010) and the synonymous substitution rate (Ks) was 0.309 (±0.042), yielding a Ka/Ks ratio of 0.284. For the gp gene, Ka was 0.035 (±0.010) and Ks was 0.128 (±0.038), giving a Ka/Ks ratio of 0.273. Relative rate tests (Muse and Gaut 1994) performed on the r/b and gp duplicated genes from the maize B73 genome that we studied, using rice as the outgroup, did not recover a significant difference in rates between the homeologous duplicates. Therefore, within the r/b region, one gene encountered possible silencing of one of the homeologous copies by gene fragmentation and two genes appear to be under selection to retain their function, i.e., purifying selection (Yang and Bielawski 2000).
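The ratios quoted above can be recomputed from the rounded Ka and Ks estimates with a minimal sketch. The function name `ka_ks_ratio` and the threshold of 1 used to label the mode of selection are illustrative assumptions; the threshold follows the usual reading that Ka/Ks below 1 indicates purifying selection.

```python
def ka_ks_ratio(ka, ks):
    """Ratio of nonsynonymous (Ka) to synonymous (Ks) substitution rates."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    return ka / ks


def selection_mode(ratio):
    """Conventional interpretation: a ratio below 1 suggests purifying selection."""
    return "purifying" if ratio < 1 else "neutral or positive"


# Rounded point estimates reported in the text for the two full-length duplicates.
ESTIMATES = {"r/b": (0.088, 0.309), "gp": (0.035, 0.128)}

if __name__ == "__main__":
    for gene, (ka, ks) in ESTIMATES.items():
        ratio = ka_ks_ratio(ka, ks)
        print(f"{gene}: Ka/Ks = {ratio:.3f} ({selection_mode(ratio)} selection)")
```

Because the inputs are rounded to three decimals, the recomputed values can differ from the published ratios in the last digit; both remain near 0.28 and 0.27, well below 1.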
While the r/b region of sorghum has a low content of transposable elements, as similarly reported for adh (Tikhonov et al. 1999), orp (Lai et al. 2004), and zein (Song et al. 2002) regions, other chromosomal segments contain up to 15–20% of retrotransposons (Ramakrishna et al. 2002a,b). Unlike sorghum, the rice r/b genomic sequence contains, in addition to two full-length LTR-retrotransposons, numerous fragmented LTR-retrotransposons in various states of truncation. Recently, Ma et al. (2004) reported analysis of 1000 LTR-retrotransposons from the rice genome, concluding that >75% of them are fragmented. They suggested that fragmentation of LTR-retrotransposons in rice is primarily due to unequal homologous and/or illegitimate recombination that leads to removal of retrotransposons from the rice genome. In sorghum, one solo LTR-retrotransposon was reported from the sh2/a1 (Chen et al. 1998) region and additional fragmented LTR-retrotransposons have also been found within the vrn1-homologous (Ramakrishna et al. 2002a) and rp1-homologous (Ramakrishna et al. 2002b) regions, indicating that events similar to those that shaped the rice genome also influenced the evolution of the sorghum genome, although the dynamics of such rearrangements seem to differ significantly between the two species. For maize, it has been previously estimated that 60–80% of its genome is composed of repetitive DNA (SanMiguel et al. 1996; SanMiguel and Bennetzen 1998; Meyers et al. 2001). Consistent with reports from other regions of maize (SanMiguel et al. 1998), the r/b region contains ∼65% LTR-retrotransposons that have inserted within the last 2.7 million years. Unlike rice, most of the LTR-retrotransposons in the maize r/b region are full length, suggesting that maize may be on the route to genome expansion.
Evolution of the r/b gene family:
Purugganan and Wessler (1994) analyzed partial nucleotide sequences of r/b gene homologs (474 nt in total), covering the most conserved region of the genes, the bHLH domains and the C-terminal domains (exons 8 and 9), from six grass species. They reported that the two maize r/b homologs, the r1 (chromosome 10L) and b1 (chromosome 2S) genes, are as old as sorghum and maize. Hu et al. (1996), analyzing the amino acid sequences of the same region from maize and different rice species, reported that the r/b gene families arose recently and independently in different grasses. Furthermore, they concluded that in rice there are two clades, Ra (grouping sequences mapping to rice chromosome 4) and Rb (grouping sequences from rice chromosome 1), which evolved by duplication of an ancestral gene. Having full genomic sequences of the r/b genes from orthologous regions of rice, sorghum, and maize, we reinvestigated the pattern of molecular evolution of the r/b gene family in grasses and examined the relationship of the maize r/b homologs with the r/b genes of sorghum.
Phylogenetic analyses of full-length coding sequences of the r/b genes, based on nucleotide as well as on amino acid sequences (Figure 2A), showed that genes form clades according to the grass taxon they belong to, indicating that the common ancestor of grasses had a single r/b gene and that this ancestral r/b gene duplicated independently in the lineages leading to rice, sorghum, and maize. Including r/b sequences from additional grass taxa (Figure 2B), we recovered a gene tree topology similar to that observed by Hu et al. (1996), confirming the independent amplification of an ancestral r/b gene in different grasses. Unlike Hu and colleagues' study, our study included multiple r/b gene homologs from both rice loci. Within the studied rice region we identified three r/b homologs at the Ra locus. Two additional genes were found on chromosome 1, the Rb locus. We also included the two genes identified at the Ra locus of O. sativa T65-Plw, the plw-OSB1 and plw-OSB2 genes (Sakamoto et al. 2001). Interestingly, the rice genes are not recovered in monophyletic clades on the basis of their chromosomal position. The r/b homologs mapping to rice chromosome 1, the Rb locus, were recovered as a sister clade to a group of several r/b homologs mapping to rice chromosome 4, the Ra locus (Figure 2, A and B). According to the recent findings that the cereal genomes descended from a lineage that experienced a whole-genome duplication at ∼70 MYA (Paterson et al. 2004), we further examined whether the rice Rb locus is in synteny with the Ra locus. Annotation of a >360-kb contig surrounding the Rb locus did not recover any additional genes shared with the Ra locus, suggesting that the genomic region bearing the Rb locus is not a product of genomic duplication of the Ra locus. 
Unlike the Rb region, the Ra region of rice chromosome 4 is syntenic with the r/b regions of sorghum and maize; this finding indicates that the ancestor of rice had a single r/b gene at the Ra locus of chromosome 4 that later duplicated (∼27.5 MYA). One of the duplicates gave rise to the Ra-like paralogs and another to the Rb-like paralogs, one of which moved to a new chromosomal location (in the window of 5–13 MYA), the so-called Rb locus of chromosome 1 (shown in Figure 2C) where it further duplicated (∼4.7 MYA).
Our data also show that while the r/b gene complex at the maize R locus (maize chromosome 10L) arose by recent amplification of a single ancestral r gene (within the last 3 million years), the age of the two paralogous copies in sorghum can be traced to the base of the sorghum lineage (∼8.3 MYA). In a study based on systematic analyses of 11 homeologous duplicates from five duplicated chromosomal regions of maize, Swigoňová et al. (2004) showed that the node of the two maize progenitors and sorghum could not be unequivocally resolved; however, they also pointed out that only the r/b gene tree differed significantly from the unresolved tree in several tests. In this study, we performed analyses of nucleotides as well as amino acid sequences on a data set containing additional sequences from other maize varieties and several additional grass species (Figure 2). Our results consistently show that the maize b genes (maize chromosome 2S) are more closely related to both sorghum paralogs than to any of the maize r genes (chromosome 10L). Therefore, analyses of the r/b genes support the hypothesis that maize arose as an allotetraploid, as previously suggested from distance analyses of mdh and wx genes by Gaut and Doebley (1997) and that the R and B loci are allelic descendants of two different diploid maize progenitors.
The maize r and b genes are homeologous duplicates and both are selected to retain their function. The r and b genes differ in their tissue specificity and developmental timing (Styles et al. 1973). The r genes specify anthocyanin pigmentation in the aleurone, anthers, or coleoptiles, and the b genes determine pigmentation in leaves, sheaths, and tassel (Styles et al. 1973). In seed tissues, the r and b genes can functionally substitute for one another to determine pigment accumulation (Styles et al. 1973; Goff et al. 1990). Because the translated amino acid sequences of the r and b genes are highly similar (>91%) while the 5′ controlling regions are significantly heterogeneous, it has been postulated that the protein products are functionally equivalent and that the allelic diversity is inherent in differential expression of the r and b genes. The strongest evidence for this argument comes from the finding that the elements controlling tissue specificity reside within the 5′ variable region (Radicella et al. 1992). Additional functional differentiation occurred at the R locus, where various alleles exhibit a diverse array of tissue-specific expression patterns.
History of the cisZOG gene duplication:
Because the cisZOG locus is intimately associated with the r/b genes and has undergone extensive gene duplication and deletion, we felt that an analysis of its evolutionary history would provide an interesting comparison with the dynamics of the adjacent r/b genes. The cis-zeatin O-glucosyltransferase genes encode enzymes that O-glucosylate cis-zeatin, a cytokinin; cytokinins are naturally occurring plant hormones that promote cell division and plant growth through regulation of tissue differentiation, bud formation, senescence, and seed germination (Mok 1994). Two cisZOG genes were previously identified in maize; the first was the cisZOG1 gene, reported by Martin et al. (2001), which is 100% identical to the one identified within our maize-10L region, and the other was the unmapped gene cisZOG2, reported by Veach et al. (2003). The two genes are 98.1% identical in nucleotide sequences and 98.8% identical in their amino acid sequences. Both genes were found to be highly expressed in roots, while their expression pattern in kernels differed (high for cisZOG1 and low for cisZOG2). The observed difference in gene expression in kernels was suggested to be the result of promoter sequence heterogeneity (Veach et al. 2003).
Within the studied intervals, we identified three cisZOG genes in tandem triplication in the rice interval, five genes in sorghum, and one gene in the maize-10L region. Using phylogenetic analyses of nucleotide and amino acid sequences, we investigated the evolutionary pattern of cisZOG gene duplication in the three grass taxa. The reconstructed history of cisZOG duplication (Figure 3B) showed that the most recent common ancestor of grasses contained four copies of the cisZOG gene. In the rice lineage, one ancestral gene duplicated (∼32 MYA) and two were lost. The sorghum lineage retains descendants of all four ancestral genes, one of which duplicated before the time when sorghum diverged from maize progenitors (∼40 MYA). The two maize cisZOG genes are the result of a recent duplication of one ancestral gene (∼6.7 MYA) that was followed by a transfer of one of the paralogous copies to a nonorthologous location. The two maize progenitors each had at least five copies of the cisZOG gene. Nevertheless, all copies but one were deleted, indicating extensive gene loss in maize. Hence, for both the r/b genes and the cisZOG genes in this region, extensive gene duplication has occurred. Interestingly, the timing of the events and the regions duplicated appear to be completely different, indicating that neither large-scale tandem duplication nor large deletions occur at a detectable rate in this region. It is not clear whether the numerous small rearrangements in this region are an indication of an unusual lability of this chromosomal segment in several grass species or whether they might be an outcome of selective pressures for amplification and diversification of these gene families.
Timing and nature of LTR-retrotransposons in maize:
As in all other studied maize segments, the exceptional dynamism of transposon accumulation over the last few million years has led to tremendous differentiation of the chromosomal segments, as shown by the unique configurations in the two homeologous regions and in comparison to the apparently retrotransposon-free sorghum segment. In our study, the recognized LTR-retrotransposons represented 55 and 76% of the maize-10L and maize-2S segments, respectively, with Ji, Opie, and Huck the most common types. Although the oldest retrotransposon activity in these regions can be traced to 2.7 MYA (Table 3), the majority (81%) of the LTR-retrotransposons inserted within the last 1 million years. Unlike the multilevel nested LTR-retrotransposons reported from other maize regions (SanMiguel et al. 1998; Ramakrishna et al. 2002a; Ma et al. 2004), nesting of the LTR-retrotransposons was minimal in both maize regions (see Figure 1). About 47% (8/17) of the LTR-retrotransposons within the maize-10L region contain no retrotransposon inserts, indicating that the nested nature of retrotransposons is heterogeneous in the maize genome.
Composition and structure of the r/b region of the common ancestor of grasses:
Since the rice r/b region contains only nine genes, all shared with sorghum and maize, it is simplest to assume that the gene content of the rice region represents the gene content of the common ancestor of grasses, with a few exceptions caused by subsequent gene family duplication and deletion in the rice lineage. First, the common ancestor of grasses contained only one r/b homolog, which later amplified independently in different grass lineages. Second, the cisZOG gene was represented by at least four copies (the three copies in the rice segment represent only two of the ancestral genes, see above) and these experienced parallel amplifications and deletions along different grass lineages. Third, the single ancestral copper-transporting ATPase gene duplicated in the ancestral lineage that gave rise to maize and sorghum. We cannot judge the repetitive DNA content of this region in an ancestral grass, because the rate of removal of these sequences is so high that all traces of such elements are lost within 5–10 million years (Ma et al. 2004). While the rice region contains numerous fragmented retrotransposons, sorghum is lacking recognizable traces of retrotransposon activity and maize contains mostly full-size or nested retrotransposons with rather few solo LTR-retrotransposons. Therefore, different grass lineages show lineage-specific patterns of local evolution in genic and intergenic sequences. Recent studies comparing the genomic structure at the bz region and the region containing the 22-kD zein gene cluster of two different maize inbred lines (Fu and Dooner 2002; Song and Messing 2003) demonstrated that severe structural polymorphism can be observed even within haplotypes of a single species. To understand the full dynamics of plant genome evolution, finer-scale studies, investigating the pattern of local evolution among closely related taxa and different subspecies, will be needed.
We thank H. K. Dooner for valuable comments on the manuscript. We also thank W. Ramakrishna and V. Llaca for technical assistance in sequencing. This work was supported by National Science Foundation Plant Genome Program grant no. 9975618.
- Received August 9, 2004.
- Accepted October 13, 2004.
- Genetics Society of America