Comparative genomics is a powerful tool for gaining insight into genomic function and evolution. However, in plants, sequence data that would enable detailed comparisons of both coding and noncoding regions have been limited in availability. Here we report the generation and analysis of sequences for an unduplicated conserved syntenic segment (CSS) in the genomes of five members of the agriculturally important plant family Solanaceae. This CSS includes a 105-kb region of tomato chromosome 2 and orthologous regions of the potato, eggplant, pepper, and petunia genomes. With a total neutral divergence of 0.73–0.78 substitutions/site, these sequences are similar enough that most noncoding regions can be aligned, yet divergent enough to be informative about evolutionary dynamics and selective pressures. The CSS contains 17 distinct genes with generally conserved order and orientation, but with numerous small-scale differences between species. Our analysis indicates that the last common ancestor of these species lived ∼27–36 million years ago, that more than one-third of short genomic segments (5–15 bp) are under selection, and that more than two-thirds of selected bases fall in noncoding regions. In addition, we identify genes under positive selection and analyze hundreds of conserved noncoding elements. This analysis provides a window into 30 million years of plant evolution in the absence of polyploidization.
GENOME sequences are now rarely studied in isolation, but instead are examined alongside their neighbors on the tree of life. Most animal species of primary research importance in genetics—including human, mouse, Drosophila melanogaster, and Caenorhabditis elegans—now belong to whole “sequenced clades,” consisting of at least half a dozen and in some cases more than two dozen sequenced species (e.g., Lindblad-Toh et al. 2005; Rhesus Macaque Genome Sequencing and Analysis Consortium 2007; Clark et al. 2007; Miller et al. 2007; Stark et al. 2007) (http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/CaenorhabditisSEQ.pdf). The same is true of the model yeast Saccharomyces cerevisiae (Cliften et al. 2003; Kellis et al. 2003). The species within each of these clades are evolutionarily close enough that noncoding as well as coding sequences can be aligned, yet distant enough that genomic comparisons reveal clear signatures of natural selection. In addition, the generally similar physiology, behavior, and genetics of the organisms within each clade help to facilitate comparative analyses. Comparative genomic analyses of sequenced clades have, among other things, allowed for the identification of new genes, regulatory elements, noncoding RNAs, and conserved sequences of unknown function (e.g., Guigó et al. 2003; Kellis et al. 2003; Bejerano et al. 2004; Siepel et al. 2007; Stark et al. 2007); shed light on duplication and rearrangement histories (Murphy et al. 2005; Jiang et al. 2007); produced refined phylogenies (Thomas et al. 2003; Murphy et al. 2007); and enabled the detection of rapidly evolving coding and noncoding sequences (Clark et al. 2003; Pollard et al. 2006).
In plants, however, comparable sequenced clades have not yet emerged. The main embryophytic (land-plant) species that have been fully sequenced—Arabidopsis thaliana (Arabidopsis Genome Initiative 2000), Oryza sativa (Goff et al. 2002; Yu et al. 2002), Medicago truncatula (Cannon et al. 2006), and Populus trichocarpa (Tuskan et al. 2006)—have been selected primarily for their individual importance as model species or agricultural crops, rather than for their value in comparative genomics. These genomes are sufficiently distant from one another that they generally do not align outside of coding regions. In addition, each genome has been considerably scrambled with respect to the others by millions of years of rearrangement, duplication, insertion, and deletion, further complicating comparative analyses. Consequently, with a few exceptions (Inada et al. 2003; Ma and Bennetzen 2004; Haberer et al. 2006; Freeling et al. 2007; Thomas et al. 2007), comparative genomic studies of plants have largely focused on content of protein-coding genes and repetitive elements (Ku et al. 2000; Quiros et al. 2001; Song et al. 2002; Ilic et al. 2003), rather than on the kind of detailed analysis of orthologous functional elements that has been possible in animals.
Moreover, comparative studies of plant genomes so far have largely dealt with species that have experienced recent whole-genome duplications (WGDs) (Ku et al. 2000; Quiros et al. 2001; Song et al. 2002; Ilic et al. 2003; Zhu et al. 2003). These studies have revealed striking differences between species in genome organization, perhaps induced by the massive genetic redundancy created by WGD (Lynch and Conery 2003; Semon and Wolfe 2007). However, they leave open the question of how plant genomes evolve in the absence of WGD, and they complicate comparisons with animal genomes, in which WGD is much less common (Otto and Whitton 2000). Furthermore, WGD creates additional challenges in comparative genomics, by producing dramatic differences in genome size and number of genes, many-to-many relationships among orthologs, and frequent disruptions in synteny.
The Solanaceae are highly important among flowering plants that have diversified in the absence of WGD. The Solanaceae family comprises >3000 species, including aquatic plants, desert dwellers, trees, ornamentals, and familiar crops such as tomato, potato, and pepper. It ranks third among plant families in economic importance, it is the most valuable in terms of vegetable crops, and it includes important model systems for fruit development (tomato and pepper), tuber development (potato), plant defense (tomato and tobacco), and anthocyanin pigments (petunia). Despite their great phenotypic diversity, all Solanaceae derived ∼40 million years ago (MYA) from an ancestral diploid with x = 12 chromosomes, and nearly all family members have maintained this chromosome number (Wikstrom et al. 2001; Wu et al. 2006). Moreover, members of the related family Rubiaceae (coffee family) are also diploid with x = 11 or x = 12, implying that any WGD in the history of the Solanaceae and the Rubiaceae occurred before their divergence ∼85 MYA (F. Wu and S. D. Tanksley, unpublished data). Comparative genomics of the Solanaceae has been an active area of research for two decades (Tanksley et al. 1988). In addition, an international project is underway to sequence the full euchromatic portion of the tomato genome (http://www.sgn.cornell.edu/about/tomato_sequencing.pl), and a well-developed bioinformatics infrastructure with strong support for comparative analyses is available (Mueller et al. 2005). While several recent studies have included comparative analyses of solanaceous ESTs (Van Der Hoeven et al. 2002; Ronning et al. 2003; Blanc and Wolfe 2004a; Rensink et al. 2005), so far a large-scale comparative analysis of genomic sequences in the Solanaceae has not been possible.
Here we report a comparative analysis of a conserved syntenic segment (CSS) in the genomes of five Solanaceae species. We compare the previously sequenced 105-kb ovate-containing region from chromosome 2 of tomato (Ku et al. 2000) with newly sequenced orthologous regions of the potato, eggplant, pepper, and petunia genomes. This CSS is present as a single copy in all five species, and it contains 17 distinct genes with mostly conserved order and orientation. However, its general conservation is punctuated by numerous small-scale differences, due to nucleotide substitutions, insertions and deletions, tandem duplications of individual genes, inversions, and transpositions. Our detailed comparison of these sequences provides new insights into the evolutionary history of an important group of plants, the evolutionary dynamics of plant genomes in the absence of WGD, and the selective pressures experienced by both coding and noncoding functional elements.
MATERIALS AND METHODS
Identification and sequencing of BACs:
The starting point for the study was a gene-rich BAC from the long arm of tomato chromosome 2 known to contain the ovate locus (BAC19, LE_HBa0106H06) (Ku et al. 2000). The 16 predicted genes from the BAC were first tested for copy number in the tomato genome via genomic Southern hybridization. Hybridization was carried out at 60° overnight using probes labeled with 32P and washed in 2× SSC for 20 min and in 1× SSC for 10 min. A hybridization probe for each gene was based on a single exon or two nearby exons (≥350 bp; see supplemental Table S1). The majority of the genes in the BAC were shown to be single copy by virtue of hybridization to a single restriction fragment in digests of at least one restriction enzyme. Several of these single-copy probes were genetically mapped on the high-density tomato genetic map (Frary et al. 2005) (http://www.sgn.cornell.edu/). All mapped to the expected position on chromosome 2.
Five of the single-copy probes (for genes 1, 5, 8, 15, and 17; see supplemental Table S2) were then used to screen BAC libraries for potato (Solanum bulbocastanum) (Song et al. 2000), eggplant (S. melongena cv. “black eggplant”) (J. Vrebalov and J. Giovannoni, unpublished data), petunia (Petunia inflata) (McCubbin et al. 2000), and pepper (Capsicum annuum) (J. Vrebalov and J. Giovannoni, unpublished data). Positive BACs were confirmed by Southern hybridization on HindIII-digested BAC DNA and further selected using three additional probes (for genes 3, 12, and 14; supplemental Table S2) for maximum gene overlap with the tomato BAC. All probes were confirmed to be present in single copy in potato, eggplant, pepper, and petunia by Southern hybridization with genomic sequences digested by more than two restriction enzymes. On the basis of these results, a single BAC from each species, hybridizing to the maximum number of tomato gene probes, was selected for further analysis: tomato (106H06), potato (027I18), eggplant (077N19), pepper (215H17), and petunia (126I14) (Figure 1). These BACs were, respectively, 135, 105, 122, 106, and 139 kb in size.
Each BAC clone was shotgun sequenced to 10× coverage and assembled using Phred and Phrap with default parameters. Gaps and low-quality regions were finished with sequences from PCR products to obtain final assemblies with minimum quality scores of 25. These were checked by BAC end sequences and by comparing virtual (electronic) and empirical (lab) restriction digests using HindIII, EcoRI, and BamHI.
Alignment and annotation of BAC sequences:
A multiple alignment for the entire CSS was constructed from the assembled sequences using BLASTZ (Schwartz et al. 2003) and the threaded blockset aligner (TBA) (Blanchette et al. 2004). Before processing by TBA, pairwise BLASTZ alignments were filtered by the University of California, Santa Cruz (UCSC), alignment “chaining” and “netting” pipeline (Kent et al. 2003), which uses conserved synteny to help ensure that orthologous sequences are aligned. The chains, nets, and multiple alignment were displayed and manually inspected in a “Solanaceae Genome Browser” based on the UCSC platform (http://genome-mirror.bscb.cornell.edu/cgi-bin/hgGateway?db=sol1).
Ab initio gene predictions were obtained for each BAC, using four different computational gene-finding programs—FGENESH (Solovyev et al. 1994), GenemarkHMM (Borodovsky and McIninch 1993), Genscan+ (Burge and Karlin 1997), and GlimmerM (Salzberg et al. 1998) (Arabidopsis training data set; Wortman et al. 2003). Independently, each BAC was screened against a large Solanaceae EST database (239,593 tomato ESTs, 134,365 potato ESTs, 3181 eggplant ESTs, and 20,738 pepper ESTs; http://www.sgn.cornell.edu/) and against the Arabidopsis proteome. An initial set of gene annotations was defined with the requirement that each gene be supported by at least two computational gene predictions, at least one solanaceous EST (with >95% identity over 80% sequence length if from the same species or BLASTN E-value <10−10 if from another species), or at least one Arabidopsis protein (tBLASTX E-value <10−10). These candidate gene structures were then evaluated for cross-species support, using the clean_genes program [part of the PHAST package (Siepel et al. 2005)], and were inspected manually in the Solanaceae Genome Browser alongside the multiple alignments, EST alignments, alignments of full-length mRNA sequences from GenBank, and protein alignments. This inspection turned up two apparent pseudogenes in tomato (both derived from gene 10; see results) and allowed for some minor refinements in the positions of splice sites, but otherwise supported the candidate predictions. Putative functions for genes were assigned on the basis of Arabidopsis homologs and predicted domains, where available (Table 1).
Repeat elements and low-complexity sequences within all sequences were soft masked using TRF (Benson 1999) and RepeatMasker (http://www.repeatmasker.org). A custom RepeatMasker library was produced by concatenating repeat elements from the Solanaceae Genomics Network Repeat Database (http://www.sgn.cornell.edu), TIGR plant repeats (http://www.tigr.org/tdb/e2k1/plant.repeats), Munich Information Center for Protein Sequence (MIPS) plant repeats (http://mips.gsf.de/proj/plant/webapp/recat), and plant repeats within the RepeatMasker library. The annotated repeats are displayed in the RepeatMasker and Simple Repeats tracks in the Solanaceae Genome Browser.
Alignments of orthologous coding regions:
For the 12 multispecies genes (Table 2), multiple alignments of orthologous protein-coding DNA sequences were extracted from the CSS-wide multiple alignment by concatenating the segments corresponding to the exons of each gene, as defined by the tomato gene models. Manual inspection suggested that no realignment was needed. For use in the analyses based on codon models, a hand-curated version of these alignments was created without frameshifts or stop codons. These alignments were truncated at frame-shifting insertions and deletions (indels) or premature stops near the 3′ ends of genes (Table 2), and any out-of-frame sequences between compensatory frame-shifting indels were masked by replacing them with N's. For the estimation of dates of divergence, the closest Arabidopsis ortholog of each gene was incorporated into these alignments (excluding gene 17), and for the analysis of gene trees, all paralogs and putative homologs in Arabidopsis and rice (based on TBLASTX matches and data reported by Ku et al. 2000) were added. These expanded alignments were created by aligning predicted peptide sequences with T_Coffee (Notredame et al. 2000) and then reverse translating to DNA sequences. They were also truncated at premature stop codons as necessary.
Gene trees were estimated by maximum likelihood using PhyML (Guindon and Gascuel 2003). In all cases the inferred tree topology was consistent with the species phylogeny shown in Figures 3 and 4, which is in agreement with previous phylogenetic studies of the Solanaceae (Olmstead and Palmer 1997; Olmstead et al. 1999). In an initial analysis, the two petunia copies of gene 12 grouped together, as expected, but the two pepper copies of gene 7 did not. However, a follow-up analysis using codon models supported a topology for gene 7 in which the two pepper genes grouped together (Figure 7A), as expected from the copy numbers of this gene in the different species.
Maximum-likelihood estimates of dN, dS, and ω = dN/dS were obtained using the codeml program (Yang 1997), with F3 × 4 codon frequencies, equal amino acid distances (aaDist = 0), a single ω across sites and across branches (model = 0, NSsites = 0), and the tree topology of Figure 3. Estimates were obtained separately for each of the multispecies genes, using the hand-curated alignments of coding regions, and for a pooled data set in which all alignments were concatenated. Fourfold degenerate (4D) sites were extracted using msa_view (from PHAST) and substitution rates for these sites were estimated using phyloFit (also from PHAST) with the general reversible (REV) model (Tavaré 1986). For each type of site, only sequences with three or more aligned orthologous sequences, including tomato, were included in the analysis.
Dates of divergence were estimated by applying codeml as above, but assuming a global molecular clock (clock = 1). The “fossil calibration” feature (Yang and Yoder 2003; see the “Global and Local Clocks” section of the PAML manual) was used to fix the Arabidopsis/Solanaceae (AS) divergence at the estimated dates of 110, 120, and 130 MYA, and the other divergence times were then estimated by maximum likelihood. The standard errors of the estimates were small compared to the uncertainty in the AS divergence date and were therefore ignored. The data do not strictly support the hypothesis of a global clock (likelihood-ratio test, LRT P = 7 × 10−15), but the branch length estimates were not dramatically altered by the assumption of a clock, and violations of this assumption are not expected to have a dramatic effect on the estimated dates.
The codeml program was also used to perform LRTs for positive selection, based on the branch-site model of Yang and Nielsen (2002) (model = 2, NSsites = 2). A separate LRT was performed for each gene and each branch of the tree by running codeml twice: once with fix_omega = 0 (alternative model) and once with fix_omega = 1, omega = 1 (null model). Nominal P-values were computed by assuming that twice the difference of the log likelihoods of these two models should have a null distribution that is a 50:50 mixture of a χ2-distribution and a point mass at zero (Zhang et al. 2005). These P-values were then corrected for multiple comparisons, using the method of Benjamini and Hochberg (1995).
Proportion of nucleotide sites under selection:
Conservation scores were produced using the program phastOdds (from PHAST) in sliding windows of 5, 10, and 15 bp. PhastOdds computes log-odds scores for each base, comparing a phylogenetic model of conserved evolution with a model of nonconserved evolution. The scores were averaged within windows. Specifically, the score Si for a window of size d and radius r = ⌊d/2⌋, centered at position i, was computed as(1)where Xj is the jth column of the multiple alignment, ψc is a phylogenetic model for conserved sites, ψn is a phylogenetic model for nonconserved sites, and P(Xj | ψx) is computed by Felsenstein's pruning algorithm (Felsenstein 1981). (A similar scoring procedure is described in more detail by Siepel et al. 2005.) The models ψc and ψn were estimated using the phastCons program with the REV model (see below). An alternative analysis in which ψc was estimated from coding exons and ψn was estimated from fourfold degenerate sites to estimate ψn produced nearly identical results. Alignment gaps were treated as missing data. Neutral scores were obtained by concatenating alignment columns from 4D sites into a pseudoalignment, randomly permuting them, and then applying phastOdds to this alignment.
The complete distribution of conservation scores, fall, was modeled as a mixture of neutral and selected components, fall(S) = πnfn(S) + πsfs(S), with mixture coefficients πn and πs (0 ≤ πn, πs ≤ 1, πn + πs = 1). The distributions fall and fn were obtained by Gaussian kernel density estimation from the sets of all scores and of neutral scores, respectively, excluding sites with bases from fewer than three species. The density function in R was used with kernel = “gaussian,” bandwidth (bw) of 0.15, 0.20, or 0.25, and n (the number of points) of 1024. The lower bound for πs was then estimated as πs = 1 − minS[fall(S)/fn(S)], as described by Chiaromonte et al. (2003). In this minimization, only scores S with fall > δ and fn(S) > δ (for small positive δ) were considered, to avoid distortion from regions of sparse data. Values of δ between 0.0001 and 0.01 produced essentially identical results. To estimate confidence intervals, bootstrap resampling both of all sites and of neutral sites was performed. Kernel density estimation and estimation of πs were performed for each of 1000 samples, and the 0.025 and 0.975 quantiles of the estimates of πs were taken as 95% confidence intervals. Estimates of πs were converted to estimates of γ by multiplying them by the fraction of bases that were aligned in three or more species (here 0.726), under the conservative assumption that unaligned bases are not under selection.
The posterior probability that each window Wi with score Si is under selection was computed as(2)where Zi is a random variable equal to 1 if Wi is under selection and equal to 0 otherwise (Chiaromonte et al. 2003). On the basis of the gene models and EST/mRNA data, each base was assigned to one of four annotation classes (see Figure 6), and each window was assigned to the class containing the largest number of its bases. These posterior probabilities were then used to compute expected fractions of windows that are under selection within each class and expected fractions of all selected windows that come from each class.
To test the sensitivity of the analysis to the alignment methods, all steps were repeated using an alternative alignment constructed by the Pecan program (B. Paten, K. Beal and E. Birney, unpublished data; http://www.ebi.ac.uk/∼bjp/pecan/). This analysis produced very similar results, but slightly higher estimates of γ (see supplemental material).
A history of indels was reconstructed by parsimony from the alignments of orthologous coding regions, using the program indelHistory (from PHAST). The inferred events were classified as insertions on particular branches of the phylogeny, deletions on particular branches, or ambiguous indels (with no outgroup data). Normalized indel rates for each gene were computed by dividing the estimated number of indels in that gene by the product of its length and the total neutral (dS) branch length of the phylogeny of the available species (with dS values as shown in Figure 3A). This normalization corrects for gene length (more indels are expected in longer genes), branch length (more indels are expected on longer branches), and differences in the sets of species represented at each gene (more indels can be observed where there are more data). The normalized indel rates were multiplied by 100 and expressed in units of indels per 100 neutral substitutions. A similar normalization was used to compare indel rates on different branches of the tree.
Identification and characterization of conserved elements:
Conserved elements were identified with phastCons, after tuning the parameters γ and ω to obtain 60% coverage of the annotated coding regions by conserved elements (see Siepel et al. 2005). All parameters were estimated from the data (including substitution model parameters, the branch lengths of the tree, and the scaling parameter ρ). The expected minimum length of a detectable conserved element was estimated as described by Siepel et al. (2005). Elements under lineage-specific selection were identified with DLESS and significance was assessed with phyloP (Siepel et al. 2006). Predictions with P ≥ 0.05 were discarded. The model estimated from 4D sites was used as the neutral model, and the tuning parameters were set to their default values. All phastCons and DLESS elements were analyzed with RNAz (Washietl et al. 2005), searched against the RFAM database (Griffiths-Jones et al. 2005) with INFERNAL (Eddy 2002), and examined for pre-miRNA and snoRNA structures with RNAmicro (Hertel and Stadler 2006) and snoReport (Hertel et al. 2008), respectively.
Known binding-site sequences from solanaceous plants were collected from the TRANSFAC (Wingender et al. 1996) and PLACE (Higo et al. 1999) databases, as well as from various sources in the primary literature. These included 17 transcription factors (TFs) with three or more independent sites. Position-specific score matrices (PSSMs) were derived for these 17 TFs by standard methods (supplemental Figure S4). The noncoding portion of the tomato genome was scanned for significant matches to each of these PSSMs, by computing log-odds scores with respect to a third-order Markov background model (estimated from all noncoding regions) and retaining all predictions with empirical P < 1.5 × 10−4 (as assessed by simulation from the background model). These predictions are displayed in the “Motif Predictions” track of the Solanaceae Browser.
Sequences, annotations, alignments, and genome browser:
BACs corresponding to a CSS from tomato chromosome 2 were isolated from tomato, potato, eggplant, pepper, and petunia and were sequenced (Figure 1). Gene annotations for each sequenced BAC were prepared by a combination of computational and manual methods, and a multiple alignment of all sequences was constructed, using methods that exploited the conserved synteny of the region to ensure that orthologous sequences were aligned (see materials and methods). Because the tomato sequence is the most complete and best annotated (due to reasonably extensive mRNA and EST data for tomato), it was selected as the reference sequence for the multiple alignment and was used as the main source of gene annotations. The BAC sequences were also annotated with the positions of transposable elements, simple sequence repeats, conserved elements, known regulatory motifs, and other features, as discussed below. Nearly all coding bases, and most noncoding bases, are aligned in the region (supplemental material; supplemental Table S3). The sequences, alignments, and annotations are displayed in a publicly available Solanaceae Genome Browser based on the UCSC platform (Kent et al. 2002) (Figure 2; http://genome-mirror.bscb.cornell.edu/cgi-bin/hgGateway?db=sol1).
Conservation of gene content, order, and structure:
The CSS contains 17 distinct genes, which are generally well conserved across species (Figure 1 and Table 1). However, small-scale duplications and losses have resulted in some differences in gene content. For example, Gene 9 is found in the same position and orientation in potato, eggplant, and petunia but is absent in tomato and pepper (Figure 1). An examination of the phylogenetic tree of these species (Figure 3) indicates that this gene must have been lost independently in both the tomato and the pepper lineages.
Gene 10 has the same position and orientation in all species but tomato. In its place in the tomato genome are two apparent pseudogenes, both aligning to portions of gene 10 from the other species. One pseudogene (tomato.10p) has what appears to be the ancestral position and orientation, while the other (tomato.10p′) is found in the same orientation but ∼3 kb upstream. Moreover, these two copies have a large (∼500-bp) region of similarity, suggesting that tomato.10p′ arose from the ancestral gene by a partially duplicative transposition. On the basis of the degree of divergence of these two sequences (∼12%), this event appears to have occurred soon after the separation of tomato and potato, ∼6 MYA. It is possible that this rearrangement is an example of transposon-mediated exon shuffling, as observed in grass genomes (Bennetzen 2007), but no known transposon was identified in the immediate vicinity.
Two other genes have apparently undergone gene expansion via tandem duplication—gene 7, which is present in two adjacent copies in pepper, and gene 12, which is present in two adjacent copies in petunia (Figure 1). In both cases, the duplicate copies are present in the same orientation, consistent with the mechanisms of tandem duplication. These duplications are lineage specific, and phylogenetic trees estimated by maximum likelihood suggest that they occurred relatively recently (see below). Both copies of both genes have intact open reading frames.
Gene order, like gene content, is largely conserved within the CSS. The only major exception is that the order and the orientation of genes 15 and 16 are reversed in petunia relative to tomato, potato, and eggplant—apparently the result of an inversion of ∼20 kb. The petunia inversion is likely to be a derived condition as the tomato/potato/eggplant configuration for these genes is shared with Arabidopsis (Figures 1 and 2). There is also a much smaller inversion of ∼800 bp in potato (Figure 2). Despite these differences, the CSS shows considerably more conservation than previously observed in comparisons of plant genomes (e.g., Ku et al. 2000; Song et al. 2002; Ilic et al. 2003), perhaps in part due to the absence of WGD in the evolution of the Solanaceae (see discussion).
On the basis of the multiple alignment for the region, we analyzed the open reading frames (ORFs) and exon–intron structures of orthologous genes within the CSS, focusing on the 12 genes that were present in tomato and sequenced in at least two other species (henceforth, the multispecies genes; Table 2). Seven of the 12 multispecies genes have well-conserved ORFs and exon–intron structures, with aligned start/stop codons and splice sites, no premature stop codons, and no frameshift indels (Table 2). Only one gene, gene 14, displays extensive disruptions to its ORF, primarily owing to a long (748 bp) CT-rich low-complexity region near its 3′ end, which contains several frame-shifting indels. The remaining four genes show minor changes in ORFs or structure, including compensatory frameshift indels, shifts in splice-site position, and frameshift indels near their 3′ ends that cause changes in stop codon position. Thus, the ORFs and exon–intron structures in this region have generally been fairly well conserved through the evolution of the Solanaceae, but more than a third of genes do show differences among species.
Rates of neutral substitution:
We used synonymous substitutions within the coding regions of the multispecies genes to estimate rates of neutral substitution in the Solanaceae. For this analysis, we pooled data from all 12 multispecies genes (Table 2), but excluded regions in which gene structure or reading frame was not conserved across species. The tree topology of Figure 3 was assumed, on the basis of previous studies and our own analysis (materials and methods). Using the codon model of Yang et al. (1998), we obtained estimates of dS for each branch that summed to 0.78 synonymous substitutions per synonymous site (Figure 3A). By extracting just the 4D sites (which have no effect on the encoded amino acid) and estimating branch lengths under the REV model of nucleotide evolution (Tavaré 1986), we arrived at a similar estimate of 0.73 substitutions/site (Figure 3C). The total neutral divergence of these Solanaceae genomes, therefore, is comparable to the neutral divergence within each of the groups of mammalian, Drosophila, Caenorhabditis, and Saccharomyces genomes that have recently been widely analyzed (e.g., Siepel et al. 2005). Indeed, it is nearly identical to the neutral divergence of the six eutherian mammalian genomes (human, chimpanzee, macaque, mouse, rat, and dog) that have been completely sequenced at present (0.83 substitions/site, as estimated from 4D sites in the ENCODE regions; Margulies et al. 2007). Thus, many of the analytical methods and computational tools that have been developed for comparative genomics of mammals should be well suited for these plant genomes.
Dates of divergence:
While there have been several detailed studies of the phylogeny of the Solanaceae (Olmstead and Palmer 1992; Olmstead et al. 1999), there are, to our knowledge, no published estimates of absolute dates of divergence throughout the family. However, good estimates are available for the date of the Arabidopsis/Solanaceae (rosid/asterid) divergence. By conditioning on these dates and assuming a molecular clock within the Solanaceae, we were able to obtain an approximate timescale for our five-species phylogeny.
In recent studies of divergence dates throughout the angiosperms, the AS divergence was estimated at 125 ± 5 MYA (Wikstrom et al. 2001) and ∼115 MYA (Magallon and Sanderson 2005). These studies were based on nonparametric rate smoothing and/or penalized likelihood methods, with multiple dates from the fossil record as calibration points. Earlier estimates of the AS divergence ranged from 112 to 156 MYA (Yang et al. 1999), but the fossil record for tricolpate pollen (present only in eudicots, which include rosids and asterids) strongly suggests that the AS divergence occurred ≤125 MYA (Bell et al. 2005). Here we assume a fairly inclusive range of 110–130 MYA for the AS divergence and use an intermediate value of 120 MYA as a point estimate.
We incorporated the closest Arabidopsis ortholog for each gene (Figure 1) into the multiple alignments for 11 of the multispecies genes (excluding gene 17, which has two equally distant orthologs in Arabidopsis). Phylogenies estimated separately for each gene strongly suggested that the identified Arabidopsis genes were truly orthologs and not more distant “outparalogs” (Sonnhammer and Koonin 2002), because they showed fairly consistent levels of divergence from the Solanaceae in all cases. On the basis of a pooled data set including all genes, we estimated dates of divergence on the basis of the method of Yang and Yoder (2003), assuming a codon model and a molecular clock. The AS divergence was fixed at the point estimate of 120 MYA, the lower limit of 110 MYA, or the upper limit of 130 MYA. This analysis yielded estimates of 6.2 MYA (range 5.1–7.3 MYA) for the divergence of tomato and potato, 13.7 MYA (11.6–16.0 MYA) for tomato and eggplant, 19.1 MYA (16.2–22.2 MYA) for tomato and pepper, and 31.2 MYA (26.9–35.9 MYA) for tomato and petunia (Figure 4; Table 3). Thus, these species appear to have diversified over the course of ∼30 MYA, with major splits occurring in intervals of ∼5–10 MY. For several reasons—including the dependency on clock assumptions, the relatively small number of genes considered, and the possibility of misidentification of Arabidopsis orthologs—the estimated dates should be considered rough approximations, but they do provide a general idea of the timeline for the diversification of the Solanaceae (see discussion).
Proportion of nucleotide sites under selection:
We estimated the proportion of nucleotides in the CSS that have been subject to natural selection during the evolution of the Solanaceae (here denoted γ) by the mixture-decomposition method that was used in the comparative analysis of the human and mouse genomes (Mouse Genome Sequencing Consortium 2002). This method requires only a conservation score for each nucleotide and the identification of a subset of nucleotides that are believed to be neutrally evolving. We used phylogenetic conservation scores computed by the phastOdds program in sliding windows of between 5 and 15 bp (see materials and methods) and treated the 1827 4D sites from the 12 multispecies genes as neutrally evolving (Table 2). The method decomposes the distribution of all conservation scores into a neutral component and a nonneutral, or “selected,” component, using the designated neutral sites as a guide. An estimate of γ can be obtained from the estimated mixture coefficient for the selected component of the distribution, by conservatively assuming any unaligned bases are not under selection (materials and methods). The method is essentially nonparametric and does not require any assumptions about the selected distribution (which is likely to be complex). However, it yields only an estimate of a lower bound for γ.
With 10-bp windows, this analysis yielded a lower-bound estimate of γ = 0.337 (bootstrapping 95% C.I., 0.303–0.373; Figure 5 and supplemental Table S4). Estimates were slightly lower with 5-bp windows (0.322; 95% C.I., 0.283–0.364) and slightly higher with 15-bp windows (0.375; 95% C.I., 0.355–0.395), probably because the power to detect selection increases with window size. They also depended slightly on a smoothing parameter for the score distributions. (The values given here are based on an intermediate smoothing bandwidth of 0.20; see supplemental Table S4 for full results.) Nevertheless, all estimates were between ∼0.3 and ∼0.4, suggesting that at least about a third of the CSS has evolved under natural selection. Only ∼18% of bases in the region fall in protein-coding regions, so at least roughly half of selected bases must have noncoding functions. Interestingly, ∼5% of windows under selection—or ∼1.5% of all windows—have lower, rather than higher, conservation scores than would be expected under neutrality (left peak of solid line in Figure 5). These windows may have been subject to positive selection during the evolution of the Solanaceae. Our estimates of γ are expected to be conservative as lower bounds, although elevated rates of substitution in 4D sites could conceivably produce an upward bias (see discussion).
To gain insight into the functional roles of the selected bases, we partitioned the sites in the CSS into four classes—coding (CDS), other transcribed (predominantly UTR), intronic, and intergenic sites—and examined the score distribution within each class. These classes of sites show clear differences in their scores, with coding sites being most conserved, intergenic sites being least conserved, and the other classes displaying intermediate levels of conservation (supplemental Figure S2), consistent with observations in other species. Assuming 10-bp windows, we estimate that at least 60% of coding windows, 39% of other transcribed and intronic windows, and 23% of intergenic windows are under selection (Figure 6A). By this analysis, only about one-third (32.9%) of the sites under selection come from coding regions, while 38.5% come from intergenic regions and 25.5% come from intronic regions. The remaining 3.1% come from other transcribed regions (Figure 6B).
Evolution of protein-coding sequences:
Synonymous and nonsynonymous substitutions:
From the pooled set of coding regions used to estimate dS (see above) we estimated dN = 0.16 nonsynonymous substitutions per nonsynonymous site (Figure 3B) and a ratio of ω = dN/dS = 0.21, indicating moderate purifying selection (e.g., Nielsen 2005). This estimate of ω is somewhat larger than genomewide estimates for Drosophila, Caenorhabditis, and Saccharomyces genomes (∼0.06–0.11; Kellis et al. 2003; Stein et al. 2003; Clark et al. 2007). It is also larger than estimates for nonprimate mammals (0.10–0.14), but it is similar to estimates for hominids (0.17–0.25) (Rhesus Macaque Genome Sequencing and Analysis Consortium 2007; Kosiol et al. 2008). Relatively few genomewide estimates are available for plants, but Tuskan et al. (2006) have reported ω = 0.40 on the basis of single-nucleotide polymorphisms in Populus, and Yu et al. (2005) have reported average KA/KS estimates (which are roughly comparable to ω-estimates) of 0.35 for the indica and japonica subspecies of O. sativa, with higher values in duplicated regions (0.720 in segmental duplications). Earlier estimates, based on smaller data sets, included KA/KS = 0.14 for A. thaliana and Brassica rapa ssp. pekinensis orthologs (Tiffin and Hahn 2002) (although this estimate may reflect a downward bias from the use of EST data; see, e.g., Wright et al. 2004), and KA/KS = 0.20 for duplicate genes in Arabidopsis (Zhang et al. 2002). Overall, these Solanaceae genes appear to be under somewhat weaker purifying selection than the genes of most animals, but under stronger purifying selection than the genes of at least some other plants. It is possible that the absence of a recent whole-genome duplication in the Solanaceae has contributed to the reduced estimates of ω in comparison with Populus and O. sativa.
Separate estimates of ω per gene indicate that most genes have ω < 0.2, suggesting moderate to strong purifying selection, but a few have values closer to 1, suggesting relaxation of constraint (Table 2). Gene 11, with the largest estimate of ω (0.53), also displays several differences between species in its ORF and exon–intron structure. We also tested all genes for positive selection, both across all branches and on individual branches of the phylogeny, using LRT methods (Yang and Nielsen 2002). Most genes (including gene 11) showed no significant evidence of positive selection, but evidence was detected for genes 7 and 12, both of which have undergone tandem duplications (Figure 7). This is consistent with Yu et al.'s (2005) finding of significantly elevated ω-estimates for tandemly duplicated genes and with observed enrichments for positive selection among duplicated genes (e.g., Rhesus Macaque Genome Sequencing and Analysis Consortium 2007). It should be noted that interlocus gene conversion between duplicate copies of genes can complicate analyses of positive selection.
Insertions and deletions:
Using a maximum parsimony-based method (materials and methods), we reconstructed an indel history (history of insertion and deletion events along the branches of the phylogeny) for each of the 12 multispecies genes. A total of 158 coding indels were identified, or an average of ∼13 per gene. However, there was considerable variation in indel number, with 6 genes contributing ≤5 indels each, and 2 genes (genes 11 and 14) responsible for more than half of all observed indels (Table 2). Because different numbers of species are represented at each gene and the genes have different lengths, indel rates were expressed in normalized units of indel events per 100 neutral substitutions (indels/hns) (materials and methods). The average normalized rate was estimated to be 1.7 indels/hns, with a range of 0–5.8 indels/hns per gene. The excess of indels in gene 14 is most likely explained by replication slippage in the 748-bp CT-rich low-complexity region near its 3′ end. Gene 11, despite its large number of indels, does not show a substantial elevation in its normalized indel rate (1.9 indels/hns).
No correlation was observed between ω and normalized indel rates (R2 = 0.001). This finding, together with the excess of indels in the low-complexity region of gene 14 and in several simple sequence repeats, suggests that mutation (e.g., replication slippage), rather than selection, may be responsible for most of the variance in indel rates. As expected, the vast majority (78%) of the observed indels have lengths that are exact multiples of three and therefore leave the open reading frames of their genes intact. The remainder generally occur near the 3′ ends of genes or co-occur with compensatory indels that maintain the reading frame. No significant differences were observed in the indel rates on individual branches of the tree.
Almost two-thirds of the observed indels (100/158) could be identified, on the basis of outgroup data, as either insertions or deletions, with deletions outnumbering insertions by ∼3:2 (Table 2). This preference for deletions over insertions is consistent with genomewide observations in mammals (Cooper et al. 2004; Rat Genome Sequencing Project Consortium 2004) and plants (Ma and Bennetzen 2004), but is not as pronounced as what has been observed in Drosophila (Petrov et al. 1996). The average sizes of insertions and deletions were similar (6.7-bp insertions, 7.8-bp deletions). Five of seven phylogenetically informative indels (i.e., shared by between 2 and n − 2 species in a region with n ≥ 4 aligned species) in nonrepetitive regions supported the inferred tree topology (Figure 3). The remaining two indels occurred in exon 3 of gene 12 and supported a phylogeny in which eggplant and pepper are grouped as sister taxa (supplemental Figure S3). These two indels might reflect homoplasy events or evolutionary scenarios—such as incomplete lineage sorting or duplication followed by loss—that would cause this exon's phylogeny to differ from the species tree.
The BAC sequences from all five species were screened for transposable elements (TEs), simple sequence repeats, and low-complexity sequences using TRF (Benson 1999) and RepeatMasker (http://www.repeatmasker.org), with a custom RepeatMasker library assembled from several databases (materials and methods). The estimated repeat content ranged from 9.6% of bases in eggplant to 23.0% in potato (supplemental Table S6). In all cases, TEs accounted for the majority of repeats, covering between 5.8% (eggplant) and 16.4% (potato) of bases. By applying the same methods to four BACs identified as being in euchromatin by Wang et al. (2006), we obtained estimates of repeat content ranging from 11.5 to 23.0% (median 12.8%), suggesting that the CSS is fairly typical of the euchromatic portions of Solanaceae genomes in this respect. Notably, the overall fractions of repetitive bases in these genomes are considerably higher, because they consist predominantly of repeat-rich heterochromatin (∼75% of the tomato genome is estimated to be heterochromatin; Wang et al. 2006). On the basis of 4300 randomly sheared sequences, we estimate the overall repeat content of the tomato genome to be ∼56% (L. Mueller, unpublished data), which is comparable to reported values for grass genomes—e.g., 58–66% in maize (Messing et al. 2004; Haberer et al. 2005) and 35% in rice (International Rice Genome Sequencing Project 2005).
We observed a strong depletion of TEs in introns relative to intergenic regions (P < 2.6 × 10−22, Fisher's exact test), probably because intronic TEs tend to disrupt transcription or splicing and hence are eliminated by natural selection. This depletion was observed in all species except eggplant, where an unusually low repeat content suggested the possibility of poor detection sensitivity. As in maize and rice (Haberer et al. 2005), LTR transposons were the most abundant among the annotated classes of repeats, except in tomato, where DNA transposons were more abundant (supplemental Table S6).
The mixture decomposition analysis above provided bulk statistical estimates of fractions of the genome under selection, but did not identify specific bases of interest. Therefore, the program phastCons (Siepel et al. 2005) was used to predict specific genomic elements that are significantly more conserved across species than would be expected under neutral evolution and hence are likely to be under purifying selection (Figure 2). The program was calibrated by requiring 60% coverage of coding regions by conserved elements, as estimated above. The predicted conserved elements covered 20.2% of all bases, including 27.9% of transcribed noncoding, 16.7% of intronic, and 8.3% of intergenic bases (supplemental Table S7). These fractions are somewhat lower than those predicted by the mixture decomposition method, most likely because many selected bases are not detectable by phastCons, due to short element length, weak selective pressure, or positive rather than negative selection. (For this data set, the expected minimum length of a detectable element is ∼10 bp, assuming all bases within the element obey phastCons's model of conserved evolution.)
The predicted elements that fell outside of annotated protein-coding genes were identified as conserved noncoding sequences (CNSs; Figure 2B). These CNSs covered 11.2% of all noncoding bases, substantially more than the ∼2% found in an analysis of rice and maize (using somewhat different methods) (Inada et al. 2003). As in animal genomes, they were shorter and less variable in length than conserved coding sequences (median lengths 26 and 60 bp), indicating possible cis-regulatory function. They were similar in length to CNSs recently identified in Arabidopsis (median length 24 bp; Thomas et al. 2007). The CNSs were considerably more conserved than nonsynonymous sites in protein-coding genes, with substitution rates ∼15% lower (Figure 3D), but this difference may simply reflect the fact that stronger conservation is required to detect shorter elements.
In addition, elements showing evidence of lineage-specific differences in selective pressure were identified using the program DLESS (Siepel et al. 2006). The vast majority of the 97 identified elements were predicted to be under purifying selection on all branches of the phylogeny and corresponded closely with phastCons predictions. However, 3 elements were significantly diverged in one species yet conserved in the others (supplemental Table S8). These elements were predicted by DLESS as lineage-specific “losses”—that is, elements under purifying selection that had experienced relaxation of constraint on a particular branch of the phylogenetic tree. However, they could instead be subject to lineage-specific positive selection, which the program does not explicitly model. One element overlapped an exon of gene 11 (Figure 2D), another overlapped an exon of gene 12, and the third fell in an intron of gene 12. Notably, 2 of the 12 multispecies genes showed evidence of lineage-specific selection. No significant evidence of positive selection could be found in the protein-coding regions of these elements, using the branch-site LRT of Yang and Nielsen (2002), and no motif or other functional evidence could be associated with the noncoding elements. Regardless, these elements show striking cross-species differences at the sequence level and, as potential determinants of phenotypic differences between species, they may be worthy of future study.
Using RNAz (Washietl et al. 2005), the elements identified by phastCons and DLESS were examined for evidence of RNA secondary structure. There were 36 distinct elements (71 including overlapping elements from the two sets) with at least one moderately high-scoring predicted structure (probability >50%). Based on an estimated false-positive rate of 37%, an expected 23 of these 36 elements are true positives. Just under one-fourth of the predictions were in introns or intergenic regions, while the remainder overlapped coding regions, on either the sense or the antisense strand. Only 11% (8/71) of predicted structures showed significant similarity to known structures in the RFAM database (Griffiths-Jones et al. 2005) (supplemental Table S9), but this could in part reflect a dearth of plant RNAs in the database. An example of a high-scoring structure with a significant match in the database is shown in Figure 8. In addition to the RNAz predictions, we found five predicted C/D box snoRNAs using snoReport (Hertel et al. 2008), two of which matched known snoRNA structures in RFAM. We also searched for miRNA precursors using RNAmicro (Hertel and Stadler 2006) but found none. Interestingly, we observed a significant enrichment for 3′ exons among the exons of multi-exon genes that had overlapping RNAz-predicted structures (P = 0.045, Fisher's exact test; predictions on sense strand only). Overall, 7 of 23 exons with overlapping structures were last exons, and most of these structures did not extend substantially into the 3′-UTR. This is consistent with data indicating that a large proportion of human and mouse stop codons fall in RNA loop structures, which perhaps contribute to efficient termination of translation (Shabalina et al. 2006).
We scanned the noncoding regions of the CSS for matches to 16 known Solanaceae regulatory motifs (supplemental Figure S4), using position-specific score matrices (PSSMs) and stringent significance thresholds (materials and methods). In some cases, predicted binding sites corresponded closely to CNSs identified by phastCons (Figure 2C). However, of 480 predicted binding sites, only 26 overlapped CNSs, and none of the motifs were significantly enriched or depleted for matches in CNSs (supplemental Table S10). Thus, these regulatory elements appear not to be maintained (in position or sequence) by strong purifying selection. However, weak detection power for binding sites and for conserved sequences that are short in length or subject to weak selection may contribute to these observations.
Comparative genomics has a venerable history in plants, going back at least 20 years (Bonierbale et al. 1988; Tanksley et al. 1988; Gale and Devos 1998). However, sequence data that would allow detailed comparisons of both coding and noncoding regions in plant species at moderate evolutionary distances have been slow to emerge. In this article, we take a step toward addressing this deficiency by presenting an analysis of orthologous genomic sequences in plants similar to the ones that have been so valuable in recent functional and evolutionary studies of animal genomes. Our analysis of an unduplicated CSS from five species in the family Solanaceae reveals, among other things, that the CSS is generally well conserved but has numerous small-scale differences between species, that at least about a third of the region is under selection, that it contains more selected bases in noncoding than in coding regions, and that most of its genes have experienced moderate to strong purifying selection, but two recently duplicated genes show evidence of positive selection.
With evidence for only two gene-gain and two gene-loss events since the divergence of the Solanaceae, the CSS shows greater conservation of gene content than has been observed in other comparative analyses of plant genomes. For example, CSSs in the dicot species P. trichocarpa, A. thaliana, and M. truncatula display frequencies of gene loss in at least one species of ≥40% (Kevei et al. 2005). Similarly high rates of gene loss have been observed in comparisons of tomato and Arabidopsis (Ku et al. 2000) and of the grass species maize, sorghum, and rice (Ilic et al. 2003). Notably, these other comparisons have all included species that have undergone relatively recent WGDs (e.g., maize or Arabidopsis), and the species with the most recent WGD has typically experienced the most gene loss (Ilic et al. 2003). Thus, it seems likely that the increased conservation of gene content observed in the Solanaceae is at least in part due to the absence of a recent WGD, as would be predicted by theory (Walsh 1995; Lynch and Conery 2003). Further support for this conjecture is provided by the high rate of gene loss in the six regions of the Arabidopsis genome homologous to the CSS (Figure 1), which evidently arose from duplications subsequent to the Arabidopsis/Solanaceae divergence. [Three of these segments—At2, At4, and At5-B—exhibit some conserved synteny and can be traced back to the α-WGD in Arabidopsis (Bowers et al. 2003).] Fifteen of the Solanaceae genes have at least one identifiable Arabidopsis homolog in these regions, and 8 have two or more homologs. Remarkably, in 7 of these 8 cases, all but one homolog has been pseudogenized. Altogether, more than half of the Arabidopsis genes in these regions have been lost, and these gene-loss events have been accompanied by a large number of rearrangement and insertion events. The degree of conservation in the Solanaceae is strikingly high by contrast.
We estimate that the five Solanaceae species studied here all derive from a most recent common ancestor that lived ∼30 MYA. This would imply that these species diversified well after the breakup of the Gondwana supercontinent, because most major continental separations are estimated to have occurred between 180 and 105 MYA (McLoughlin 2001). Therefore, our estimates support a hypothesis of long-distance—rather than intra-Gondwanan—seed dispersal to explain the worldwide distribution of the Solanaceae (D'Arcy 1991; Olmstead and Palmer 1992, 1997). They are consistent with Olmstead and Palmer's (1997) proposal of a New World origin and initial diversification of the Solanum, followed by dispersal to and further diversification in Africa, Australia, and other parts of the world. However, certain genera of Solanaceae, such as Schizanthus and Schwenkia, are believed to have branched off somewhat earlier than petunia (Olmstead and Palmer 1992; Olmstead et al. 1999), and our data do not help in dating these events.
Our estimate of 5.1–7.3 MYA for the tomato/potato divergence is substantially older than a recent estimate of 1.6–3.3 MYA, based on a large collection of ESTs (Blanc and Wolfe 2004b). This difference appears to stem primarily from the assumption in that study of an absolute rate of 1.5 × 10−8 synonymous substitutions per site per year, as estimated for the Adh and Chs loci in Arabidopsis (Koch et al. 2000). If this substitution rate were assumed for our data, the AS divergence would be estimated at only 76.5 MYA, in conflict with most other evidence. Thus, it appears that Koch et al.'s estimated substitution rate is too high, at least for the Solanaceae. By instead conditioning on the best available estimated dates for the AS divergence, we estimate a substitution rate in the Solanaceae of only 9.6 × 10−9 (8.3 × 10−9–1.1 × 10−8) synonymous substitutions per site per year, ∼1.6 times lower than Koch et al.'s estimate. Interestingly, Tuskan et al. (2006) observed a similar, but more dramatic, rate slowdown in Populus compared with Koch et al.'s estimate for Arabidopsis. It is worth stressing that our methods assume a molecular clock not only within the Solanaceae but also between Arabidopsis and the Solanaceae. If substitution rates in the Solanaceae are indeed lower than those in Arabidopsis, then we will have underestimated the divergence times within the Solanaceae. Such a difference in rates would imply that the root of the phylogeny in Figure 4 should slide toward the Solanaceae, which would have the effect of pulling the Solanaceae divergence events back in time. Any underestimation of the neutral distance between Arabidopsis and the Solanaceae due to saturation of synonymous sites would have a similar effect on the estimated divergence times. For these reasons, we believe the tomato/potato divergence occurred at least ∼5–7 MYA.
The proportion of nucleotide sites in a genome that are subject to natural selection (γ) is a key quantity in understanding how genomes evolve and in evaluating the completeness of functional annotations. There has been considerable interest in estimating this quantity from comparative sequence data, especially for mammalian (e.g., Shabalina et al. 2001; Mouse Genome Sequencing Consortium 2002; Cooper et al. 2004; Smith et al. 2004; Lindblad-Toh et al. 2005), Drosophila (Bergman and Kreitman 2001; Halligan et al. 2004; Andolfatto 2005), and Caenorhabditis (Shabalina and Kondrashov 1999) genomes. However, we know of no published estimate of this quantity for plant species. In this study, we find evidence that at least about a third of bases within the CSS are under selection and that more than two-thirds of selected bases fall in noncoding regions. Therefore, like mammalian and Drosophila genomes (Mouse Genome Sequencing Consortium 2002; Chiaromonte et al. 2003; Andolfatto 2005), this region appears to contain a large amount of genomic “dark matter”—that is, functional DNA that is invisible to current methods for genome annotation. It is not known how representative the CSS is of the entire Solanaceae genomes, but it is worth noting that the gene density of the region is fairly typical for euchromatin. The CSS has one gene per 7 kb compared with one gene per ∼6.7 kb in the euchromatic portion of the tomato genome (Wang et al. 2006). Values of γ in the gene-poor heterochromatin are likely to be considerably lower. Notably, because they are based completely on between-species divergence, our estimates should primarily reflect ancient selection—operating on the scale of millions of years—rather than more recent, possibly human-induced selection.
Several sources of possible error—such as lineage-specific selection, (negative) selection in so-called neutral sites, and failure to align divergent sequences—would tend to produce underestimates, rather than overestimates, of γ. However, spuriously high measurements of the neutral rate of substitution (i.e., spuriously low neutral conservation scores) could lead to an upward bias in γ. Such a bias could result from choosing neutral sites that were actually evolving faster than average or from alignments of nonorthologous neutral bases. Our use of 4D sites (which fall in easy-to-align coding regions) and our synteny-based alignment methods make it unlikely that alignment error in neutral sites has a major effect on γ. We have also repeated the analysis with an alternative alignment and arrived at very similar results (supplemental material), providing further evidence that alignment error is not a dominant factor in the analysis. In addition, 4D sites have generally been found to evolve somewhat more slowly, rather than faster, than other neutral sites (e.g., Hardison et al. 2003). Indeed, we find that a phylogeny estimated from 1915 sites in ancestral repeats in the CSS has >10% greater total branch length than the one estimated from 4D sites (data not shown). This suggests that forces that may produce elevated rates of substitution in 4D sites, such as CpG methylation (Hobolth et al. 2006) or genetic draft, are probably more than offset by weak negative selection on synonymous sites. Another possible source of bias is that the phastOdds scores—which depend on estimated models of neutral and conserved evolution—are influenced by differences between 4D and other sites in base composition, branch length proportions, and substitution patterns. However, two quite different estimation schemes for these models, based on different sets of sites, led to very similar estimates of γ (materials and methods), suggesting that the mixture decomposition is being driven by conservation and not by ancillary properties. Taken together, these lines of evidence suggest that our estimate of is conservative as a lower bound for the fraction of bases under selection in the CSS.
Our ability to address major questions about the evolution of the Solanaceae and of other plant species in this study has clearly been limited by having sequence data for only one ∼100-kb region. The completion of the tomato genome sequence, and new sequence data for other Solanaceae species and for coffee, will allow for more definitive answers to these and many other questions. Nevertheless, this study demonstrates the potential of comparative sequence analysis to provide insight into the evolutionary forces that have shaped modern-day plant genomes.
We thank Rod Wing for assistance in BAC sequencing and T. Vinar, L. Mueller, J. S. Pedersen, and Y. Xu for assistance in the analysis. This work was partially supported by National Science Foundation grants DBI-0116076 (A.S.) and 0421634 (S.T.) and National Institutes of Health training grant T32 GM007617-27 (A.D.).
- Received February 11, 2008.
- Accepted June 23, 2008.
- Copyright © 2008 by the Genetics Society of America