Comparative Sequencing in the Genus Lycopersicon: Implications for the Evolution of Fruit Size in the Domestication of Cultivated Tomatoes
T. Clint Nesbitt, Steven D. Tanksley


Sequence variation was sampled in cultivated and related wild forms of tomato at fw2.2—a fruit weight QTL key to the evolution of domesticated tomatoes. Variation at fw2.2 was contrasted with variation at four other loci not involved in fruit weight determination. Several conclusions could be reached: (1) Fruit weight variation attributable to fw2.2 is not caused by variation in the FW2.2 protein sequence; more likely, it is due to transcriptional variation associated with one or more of eight nucleotide changes unique to the promoter of large-fruit alleles; (2) fw2.2 and loci not involved in fruit weight have not evolved at distinguishably different rates in cultivated and wild tomatoes, despite the fact that fw2.2 was likely a target of selection during domestication; (3) molecular-clock-based estimates suggest that the large-fruit allele of fw2.2, now fixed in most cultivated tomatoes, arose in tomato germplasm long before domestication; (4) extant accessions of L. esculentum var. cerasiforme, the subspecies thought to be the most likely wild ancestor of domesticated tomatoes, appear to be an admixture of wild and cultivated tomatoes rather than a transitional step from wild to domesticated tomatoes; and (5) despite the fact that cerasiforme accessions are polymorphic for large- and small-fruit alleles at fw2.2, no significant association was detected between fruit size and fw2.2 genotypes in the subspecies—as tested by association genetic studies in the relatively small sample studied—suggesting the role of other fruit weight QTL in fruit weight variation in cerasiforme.

DOMESTICATION of crops was one of the most profound and rapid events in plant evolution, irreversibly altering the distribution of plant species on the earth and enabling human civilization to come into existence. Domestication of individual plant species was usually enabled by one or more dramatic changes in the anatomy of the species, allowing certain desirable parts of the plant (from a human perspective) to become greatly exaggerated (e.g., seed-bearing cob in maize or fruit of tomato, melon, etc.). Over recent years, evidence has accumulated to support the hypothesis that the majority of these dramatic anatomical changes can be attributed to a few loci and that selection for these loci by our ancestors rendered alterations in overall genomic diversity of the species (Doebley et al. 1997; Grandillo et al. 1999).

In 1997, Doebley et al. reported the cloning of teosinte branched1 (tb1), a key gene associated with the evolution of wild Mexican grass teosinte into modern maize. Further studies have documented the changes in genetic variability in and around the tb1 locus (Wang et al. 1999). Other than in maize, the molecular events accompanying domestication are largely unknown. Recently, however, fw2.2, a major quantitative trait locus (QTL) underlying the domestication of tomato, was cloned (Frary et al. 2000). fw2.2 encodes a protein controlling fruit growth and mutations at this locus resulted in a major increase in fruit size during tomato domestication (Alpert et al. 1995; Frary et al. 2000). This locus makes the largest contribution to the difference in fruit size between most cultivated tomatoes and their small-fruited wild species counterparts (Alpert et al. 1995).

Lycopersicon (Mill.), the genus that includes the cultivated tomato, is composed of nine small-fruited species, most of which are limited in distribution to a small area in western Peru, Chile, and Ecuador (Rick 1976). Only Lycopersicon esculentum var. esculentum, the domesticated tomato, and L. esculentum var. cerasiforme, its small-fruited feral putative congener, are found outside this narrow range, being common throughout many parts of the world, especially in Mesoamerica and the Caribbean (Rick 1976). Historical and linguistic studies suggest that the cultivated tomato was most likely selected from wild forms of cerasiforme (Jenkins 1948; Rick 1976); however, phylogenetic/diversity studies based on isozymes and DNA polymorphisms have not clarified this issue (Rick et al. 1974; Rick and Fobes 1975; Miller and Tanksley 1990; Williams and St. Clair 1993).

While the geo-historical events underlying tomato domestication are poorly understood, even less is known about the impacts of domestication on genome diversity in tomato. Currently, fw2.2 is the only cloned locus known to be involved in the domestication of tomato fruit. The goal of this study was to apply phylogenetic and population genetic techniques to determine the nature and origin of the mutations in fw2.2 that have enabled domestication and to understand the impact of domestication-related selection at the locus on the tomato genome. In an attempt to shed light on these issues, a series of fw2.2 alleles (both coding and upstream regions) were sequenced in accessions of (1) modern tomato, (2) L. esculentum var. cerasiforme, and (3) L. pimpinellifolium. Variation at fw2.2 was then contrasted with variation in other loci believed not to be involved in fruit size control: orf44, an anonymous gene adjacent to fw2.2; Adh2 (encoding alcohol dehydrogenase); and two random, single-copy sequences, TG10 and TG11. The latter three loci are on different chromosomes than fw2.2 and hence would not be subjected to “hitchhiking” effects due to linkage disequilibrium. These studies also permit an estimate of radiation time for the genus Lycopersicon and the divergence of cultivated tomato from its closest living wild relative, L. pimpinellifolium.

Table 1.

Lycopersicon accessions used in this study, showing the loci sequenced from each


Plant materials: The plant accessions used in this study are listed in Table 1. The accessions of L. cheesmanii, L. hirsutum, L. parviflorum, L. pennellii, L. peruvianum, and L. pimpinellifolium chosen for this study have been used in previous mapping populations and are known to carry alleles at the fw2.2 locus associated with a small-fruited phenotype, referred to as “small-fruit alleles” (Grandillo et al. 1999). The modern cultivars of L. esculentum var. esculentum used in the study carry the “large-fruit allele” of fw2.2 (Grandillo et al. 1999; S. D. Tanksley, unpublished data). Accessions of L. esculentum var. cerasiforme represent the “core collection” of the Tomato Genetic Resources Center, University of California at Davis.

Locus selection and primer design: In addition to the coding sequence (dubbed “orfx” in Frary et al. 2000) and ∼2.7 kb upstream of the fw2.2 locus (Figure 1A), several additional loci were selected to be used as controls for sequence comparisons: (1) orf44, the open reading frame of unknown function immediately adjacent to fw2.2 (Frary et al. 2000; see Figure 1A); (2) a 489-nucleotide region of the Adh2 gene, including parts of exons 1-4 and introns 1-3 (Figure 1B); and (3) two unlinked single-copy genomic clones, TG10 and TG11 (Bernatzky and Tanksley 1986). The Adh2 gene was chosen because (1) it is on a different chromosome than fw2.2, (2) it is a relatively highly conserved gene, containing several introns and exons in a short region, and (3) its function is not directly related to early floral organ development (Longhurst et al. 1994) and thus is not necessarily subject to the same selection pressures or history experienced by fw2.2. TG10 and TG11 are anonymous genomic sequences unlinked to fw2.2 (chromosomes 9 and 10, respectively; Bernatzky and Tanksley 1986). The sequences of TG10 and TG11 contain no continuous open reading frames, have no significant similarity to any sequences in the GenBank nucleotide databases (BLASTN and TBLASTX), and are used to represent intragenic, relatively less-conserved noncoding sequence. For some accessions, two restriction fragment length polymorphism (RFLP) markers, TG91 and TG167, flanking the fw2.2 region (Frary et al. 2000), were also sequenced. For each locus, primers were designed from available L. esculentum var. esculentum sequence, and these primer sets successfully amplified single bands in all other taxa (see below for conditions). A summary of primer sequences used for amplification is listed in Table 2.

DNA isolation, PCR amplification, purification, and sequencing: Tomato genomic DNA used for sequence analysis in this study was isolated from greenhouse-grown plants using the protocol described by Fulton et al. (1995). Using this DNA, PCR fragments were amplified and directly sequenced. Each PCR reaction used 0.5 μl (∼100 ng) of tomato DNA and was amplified with the following thermocycler conditions: 94° denaturation (1 min), 50° annealing (1 min), and 68° elongation (2 min), for 35 cycles. PCR products used as templates for sequencing were first examined by gel electrophoresis and then cleaned using QIAGEN’s (Valencia, CA) Qia-Quick spin columns. Fragments were sequenced in both directions from the same primers used for amplification, unless stated otherwise. All new sequences generated in this study have been submitted to the GenBank sequence database (accession nos. AY097061-AY097189).

Sequence analysis tools: Examination and manipulation of nucleotide sequences were conducted using the suite of programs in DNASTAR’s (Madison, WI) Lasergene software package. Sequence alignments were first generated using the Clustal V method of DNASTAR Megalign (gap penalty = 10, gap length penalty = 10) and then refined by hand. Multiple sequence reads for very long regions [fw2.2 5′ untranslated region (UTR)] were assembled into contigs using the Phred/Phrap (Ewing and Green 1998; Ewing et al. 1998) and Consed (Gordon et al. 1998) software packages. Phylogenetic inferences were drawn with the assistance of beta versions of PAUP* 4.0 (Swofford 1998). Trees presented in this study were identified as the single most-parsimonious tree (unless stated otherwise) using a branch-and-bound search, treating gaps in the alignment as missing, and using sequence from L. pennellii LA716 as the outgroup. Sequence divergence estimates and other molecular population genetics statistics were generated using the DnaSP v3.53 software package (Rozas and Rozas 1999). Sliding-window analysis of nucleotide variability was conducted using the SWAN program of Proutsky and Holmes (1998). “Statistical parsimony” analysis (Templeton et al. 1992) used the TCS v1.13 software package (Clement et al. 2000), and subsequent nested analysis of variance (NANOVA) used SPSS for Windows v10.0.

Figure 1.

—Fragments amplified for sequence analysis. (A) The fw2.2 region of tomato chromosome 2 (illustration based on Frary et al. 2000), including the upstream and coding regions of fw2.2 and the coding region of orf44. The region upstream of the fw2.2 open reading frame was amplified as four separate fragments (frags 1-4). The positions of RFLP markers TG91 and TG167 (amplified and sequenced in some accessions) are also noted. (B) The Adh2 gene (illustration based on Longhurst et al. 1994). Amplification includes introns 1-3 and parts of exons 1-4. Note that the fragment amplified from Adh2 does not include the region similar to Adh2 pseudogenes PSA1 and PSA2 (see Longhurst et al. 1994). Nucleotide sequences of individual primers, depicted in the figure as short arrows below each amplified fragment, are given in Table 2. Not shown: TG10 and TG11 fragments amplified and sequenced.

Fruit weight evaluation of L. esculentum var. cerasiforme accessions: To evaluate the association of fruit weight with fw2.2 alleles among L. esculentum var. cerasiforme accessions, a single plant of each cerasiforme accession listed in Table 1 was grown in the field in Ithaca, New York, during the summer season of 2000. Fifteen red fruits of each accession were collected at maturity and weighed individually.


Sequence divergence within the genus Lycopersicon: On the basis of the sequences of the four loci examined, divergence estimates of various Lycopersicon alleles from L. esculentum var. esculentum alleles are presented in Table 3: Ks is calculated as the number of synonymous nucleotide substitutions per site, Ka is the number of nonsynonymous substitutions per site, and K is the number of substitutions per site in noncoding sequence. The values are calculated using the Jukes-Cantor method (α= 1, β= 1) and represent divergence from the allele of L. esc. var. esculentum cv M82 (the allelic sequences of this accession are identical to those of other L. esc. var. esculentum accessions examined, with the exception of a single-nucleotide substitution observed in the TG10 allele of TA1210; see Figure 2). Standard errors for the divergence estimates were calculated using the method proposed by Kimura (1980). In general, sequence divergence between species represents a few substitutions per hundred sites, even between the most distantly related species in the genus. At a few loci (e.g., Adh2), no sequence variation was detected among some of the species tested.
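The Jukes-Cantor correction mentioned above can be illustrated with a minimal sketch. It assumes two pre-aligned sequences and skips any site with a gap or ambiguous base. The function name `jukes_cantor` is hypothetical, and the standard error shown is the usual large-sample approximation for the Jukes-Cantor distance, which is not necessarily the estimator used to produce Table 3.

```python
import math


def jukes_cantor(seq_a, seq_b):
    """Return (K, SE) for two aligned sequences under the Jukes-Cantor model.

    K  = -(3/4) * ln(1 - 4p/3), where p is the proportion of differing sites.
    SE = sqrt(p * (1 - p) / n) / (1 - 4p/3)  (large-sample approximation).
    Sites with a gap or ambiguous base in either sequence are ignored.
    """
    valid = "ACGT"
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in valid and b in valid]
    n = len(pairs)
    if n == 0:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / n
    if p >= 0.75:
        raise ValueError("p >= 0.75: Jukes-Cantor distance is undefined")
    k = -0.75 * math.log(1.0 - 4.0 * p / 3.0)
    se = math.sqrt(p * (1.0 - p) / n) / (1.0 - 4.0 * p / 3.0)
    return k, se


# Hypothetical example: 3 differences in 100 aligned sites.
k, se = jukes_cantor("A" * 100, "C" * 3 + "A" * 97)
print(f"K = {k:.4f} +/- {se:.4f} substitutions per site")
```

For closely related sequences such as these, the corrected distance K is only slightly larger than the raw proportion of differences p.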

To pool data from multiple loci, the significance of the variability in divergence values must be evaluated. The allelic divergence values estimated for given species pairs appear to be highly variable across loci examined. For example, K estimates for the divergence of alleles of L. hirsutum and L. esculentum cv. M82 range from ∼5 to 76 substitutions per thousand sites, depending upon the locus examined. Some of this variability is likely to be due to differences in lengths of sequence examined at each locus (i.e., sampling error). To test whether the observed heterogeneity is significant, a simple analysis of variance of the divergence estimates (nonzero values only) was conducted for each species comparison, using the standard errors in Table 3. In most cases, analysis of variance of Ks values could not be conducted due to the invariant nature of the sequences (i.e., no variance estimates). Where analysis could be conducted on Ks estimates (L. pimp. LA369, L. hirs., and L. penn.), no significant difference was found among the values. On the other hand, in most cases heterogeneity among K estimates was significant—i.e., between-locus variation was significantly greater than within-locus variation (P < 0.05). The only exception was among the K estimates between M82 and L. cheesmanii, which were not significantly variable. Thus, because of this significant heterogeneity among divergence estimates, any inferences based upon pooled silent-site sequence data should be made with caution. Finally, Ka values are also significantly heterogeneous among the loci (i.e., in general, orf44 is more conserved than fw2.2), but this result is not surprising as it is not uncommon for different genes to experience different degrees of conservation.

Table 2.

Oligonucleotide primers used for fragment amplification and sequencing

Estimated divergence times for the genus Lycopersicon: To provide a temporal context in which to evaluate the evolution of fw2.2 alleles, an attempt was made to date the divergence times of species in the genus Lycopersicon. However, this exercise was done with the knowledge that rates of nucleotide substitution are notoriously variable in plants, making it extremely difficult to arrive at a suitable rate for use with molecular clock models (Muse 2000). Gaut (1998) estimated a rate of 6.03 × 10⁻⁹ synonymous substitutions per site per year (ds) for plant nuclear genes, and a recent report applied this estimate to comparisons of L. esculentum and Arabidopsis thaliana (Ku et al. 2000). Given the significant locus-dependent variability in allelic divergence estimates, inferences of divergence time of species within the genus based upon these data are somewhat tenuous. Nonetheless, divergence times inferred from pooled silent-site divergence could be taken to represent very general estimates of the timing of genus radiation. Using these assumptions, Table 3 shows the estimated time, in millions of years before present (BP), that a given accession and L. esculentum cv. M82 diverged from a common ancestor. These results suggest that the genus Lycopersicon began its initial radiation >7 million years ago and that L. esculentum and its nearest relatives, L. cheesmanii and L. pimpinellifolium, diverged from a common ancestor ∼1 million years BP. These dates are consistent with a recent study, which suggested that the genus Solanum, the paraphyletic taxon that includes Lycopersicon, diverged from its nearest related genus ∼12 million years BP (Wikstrom et al. 2001).
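The molecular-clock arithmetic behind such estimates can be sketched as follows. The sketch assumes the conventional relation T = K / (2r), in which both lineages accumulate substitutions at the same rate r since their common ancestor; the function name and the example divergence value are hypothetical and are not taken from Table 3.

```python
# Rate cited in the text (Gaut 1998): synonymous substitutions per site per year.
GAUT_RATE = 6.03e-9


def divergence_time_my(k, rate=GAUT_RATE):
    """Return time since the common ancestor, in millions of years.

    Assumes a strict molecular clock: the pairwise divergence K is shared
    equally between two lineages, so T = K / (2 * rate).
    """
    if k < 0 or rate <= 0:
        raise ValueError("k must be non-negative and rate must be positive")
    return k / (2.0 * rate) / 1e6


# Hypothetical silent-site divergence of about 12 substitutions per 1000 sites.
print(f"{divergence_time_my(0.01206):.2f} million years")
```

Under this assumed relation, a silent-site divergence of roughly 0.012 corresponds to about 1 million years, and roughly 0.084 corresponds to about 7 million years.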

Gene trees of Lycopersicon sequences: To evaluate the relationships among the species in the genus Lycopersicon, parsimony-based gene trees inferred from each of the sequences used in this study are shown in Figure 2. Because they introduce a large number of incongruities into the gene trees, the cerasiforme alleles are omitted from these trees for clarity and are discussed further below. In the cases of fw2.2, orf44, and Adh2, both introns and exons together were used to generate the trees. In general, ∼500 nucleotides that include some noncoding sequence were adequate to resolve the relationships among the alleles of most species. Additionally, Figure 3 shows a tree based upon combined data.

Table 3.

Sequence divergence (Ks, Ka, and K) for selected nuclear loci in Lycopersicon, with standard errors

Figure 2.

—Six gene trees of sequences from the genus Lycopersicon. Adh2, TG10, and TG11 are unlinked loci. The sequences used to infer trees of fw2.2, orf44, and Adh2 include both introns and exons. To depict all trees on comparable scales, branch lengths (number of inferred steps) are divided by the length of the sequence (in nucleotides) used to construct the tree. In each case, a single most-parsimonious tree was identified. Percentages of 100 bootstrap replications are given for nodes with bootstrap values >50%. Tree statistics are as follows: fw2.2 5′ UTR, tree length (l) = 384, consistency index (CI) = 0.94, retention index (RI) = 0.91; fw2.2, l = 48, CI = 0.96, RI = 0.95; orf44, l = 37, CI = 1.00, RI = 1.00; Adh2, l = 24, CI = 0.95, RI = 0.89; TG10, l = 24, CI = 1.00, RI = 1.00; and TG11, l = 22, CI = 0.91, RI = 0.89.

The branching patterns of these individual and combined gene trees are generally consistent with most other published trees of the genus Lycopersicon (Palmer and Zamir 1982; Miller and Tanksley 1990; Breto et al. 1993). However, an anomalous placement of L. hirsutum near L. pimpinellifolium accessions in the TG11 tree was noted, suggesting that some lineage sorting or introgression may be associated with this species. Additionally, some sources have suggested that L. peruvianum may be an artificial, heterogeneous taxon (Rick 1963, 1986; Miller and Tanksley 1990), having one subgroup of individuals most closely related to L. pennellii and L. hirsutum and a second group more closely related to L. parviflorum. The L. peruvianum accession used in this study, LA1708, appears to fall into the latter group.

Relative rate test: Differences in the relative rates of nucleotide substitution between lineages could be indicative of differences in past selection pressure experienced by each lineage. Selection during the process of tomato domestication could conceivably have led to a greater accumulation of nucleotide change either in the species L. esculentum in general or at the fw2.2 locus in particular. To test these hypotheses, the simplified relative rate tests proposed by Tajima (1993) were applied to each of the five loci used in this study. Using L. pennellii as the outgroup, each locus was tested to determine if the L. esculentum var. esculentum sequence had evolved at a different rate than that of the sequence from L. pimpinellifolium or L. cheesmanii, its nearest wild relatives. The null hypothesis predicts that the branch length from L. pennellii to L. esculentum will be the same as the lengths from L. pennellii to L. pimpinellifolium or to L. cheesmanii.

For all five loci examined, using both Tajima’s D1 (assumes rates of transition and transversion are equal) and D2 (does not assume equal rates) tests, none of the test statistics were significant, providing no support for differences in mutation rates in the lineages leading to these four species. However, the statistical power of the relative rate tests is probably not very strong due to the limited number of substitutions among taxa. To increase testing power, the Tajima D1 and D2 tests were also conducted on the pooled sites from all five loci, but the test statistics were also not statistically significant in this case. Thus, neither fw2.2 nor other tested loci appear to have diverged at a faster rate in the lineage leading to cultivated L. esculentum. The corollary is that there is no evidence that the fw2.2 allele of L. esculentum var. esculentum has accumulated more (or fewer) changes than the alleles carried by related wild species.
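The logic of the relative rate comparison described above can be sketched in a few lines. This is a minimal illustration of the equal-rate form of Tajima's (1993) test (the one the text calls D1), assuming three pre-aligned sequences; the function name and the toy sequences are hypothetical, and the variant that separates transitions from transversions is not shown.

```python
import math


def tajima_relative_rate(seq1, seq2, outgroup):
    """Equal-rate relative rate test on three aligned sequences.

    m1 counts sites where seq1 alone differs (seq2 matches the outgroup);
    m2 counts sites where seq2 alone differs (seq1 matches the outgroup).
    Under equal rates, (m1 - m2)^2 / (m1 + m2) is approximately chi-square
    distributed with 1 degree of freedom.
    Returns (m1, m2, chi2, p_value).
    """
    valid = "ACGT"
    m1 = m2 = 0
    for a, b, o in zip(seq1.upper(), seq2.upper(), outgroup.upper()):
        if a not in valid or b not in valid or o not in valid:
            continue  # skip gaps and ambiguous bases
        if a != b and b == o:
            m1 += 1
        elif a != b and a == o:
            m2 += 1
    if m1 + m2 == 0:
        return m1, m2, 0.0, 1.0  # no informative sites
    chi2 = (m1 - m2) ** 2 / (m1 + m2)
    # Upper-tail probability of a chi-square variate with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return m1, m2, chi2, p_value


# Toy example: two changes unique to lineage 1, one unique to lineage 2.
print(tajima_relative_rate("CCAAAAAAAA", "AAGAAAAAAA", "AAAAAAAAAA"))
```

With only a handful of informative sites, as in the toy example, the statistic has little power, which mirrors the caveat about the limited number of substitutions among these taxa.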

Figure 3.

—Tree from combined sequences for the genus Lycopersicon. The sequence used to generate the phylogeny is a concatenation of all sequences used to generate trees in Figure 2 (fw2.2 5′ UTR, fw2.2, orf44, Adh2, TG10, and TG11). Percentages of 100 bootstrap replications are given for nodes with bootstrap values >50%. Tree shown is the single most-parsimonious tree, length = 563, consistency index = 0.94, retention index = 0.89.

Sequence-based inferences of functional differences between fw2.2 alleles: Sequence analysis of the fw2.2 region has important implications for identifying the genetic polymorphism(s) in fw2.2 that is causally related to the variation in fruit weight associated with this locus. Frary et al. (2000) reported three nonsynonymous substitutions between L. esculentum and L. pennellii in the coding region of fw2.2. However, further sequencing of the fw2.2 transcription unit in other species of the genus reveals that two of the three substitutions are autapomorphies of L. pennellii. The third substitution (AA 3) is shared by all species of the genus except L. esculentum and L. cheesmanii; as this accession of L. cheesmanii is known to carry a small-fruit allele (Paterson et al. 1991), this substitution is not likely to be associated with a change in fruit size. Aside from these three changes, all of the fw2.2 alleles among the taxa examined are identical at the protein level. Furthermore, these three substitutions fall between the putative first (M1) and second (M12) methionine. Sequence-based promoter analyses, such as PROSCAN (Prestridge 1995) and the Hamming clustering method (Milanesi et al. 1996), fail to identify standard initiation motifs (TATA, CAAT box, GC box, etc.; reviewed in Bucher 1990) in the vicinity of either start site. Because some uncertainty is associated with the determination of the start site, the actual start site may be M12, making all of the potentially nonsynonymous substitutions among the alleles fall in the upstream, noncoding region. In either case, the phenotypic differences between large and small alleles of fw2.2 cannot be attributed to any functional differences in the FW2.2 protein itself.

Within the 2.7-kb region upstream of the fw2.2 start site, only eight synapomorphies are unique to the L. esculentum var. esculentum alleles: three transitions, one transversion, and four indels of 1, 2, 9, and 10 nucleotides (nt) in length, all deletions in var. esculentum. This suggests that the phenotype of fw2.2 is likely to be due to one or more nucleotide changes in the upstream promoter region of the gene and supports the hypothesis that phenotypic differences may be due to differential expression of large- and small-fruit alleles (Frary et al. 2000).

Sliding-window analysis (SWAN) of nucleotide variability: A sliding-window analysis was used to quantify the genus-wide nucleotide variability in the upstream UTR of fw2.2 in an attempt to determine whether any of the eight large-fruit synapomorphies described above fall within a relatively conserved domain of the fw2.2 promoter region. Nucleotide variability at the fw2.2 locus (including fw2.2 5′ UTR, fw2.2, and orf44) was calculated using the SWAN software package (Proutsky and Holmes 1998), and the results are shown in Figure 4. Figure 4A depicts the mean and standard deviation (SD) of nucleotide variability on the basis of the entire length of the sequence. To prevent the relatively conserved coding regions in the sequence (right half of graph) from biasing the mean and SD, Figure 4B shows the same graph, but calculates mean and SD upstream and downstream of the fw2.2 start site separately.

In Figure 4, A and B, there are clearly regions that are conserved more highly than others, in particular the coding regions of fw2.2 and orf44. Additionally, at least two regions in the fw2.2 5′ UTR show relatively low variability, although these “valleys” are not statistically significant (<2 standard deviations from the mean in both graphs). None of the eight large-fruit synapomorphies in the promoter region of fw2.2 (marked with “Δ”) appear to fall within well-conserved regions—on the contrary, they seem to lie in areas of average or higher variability. If any of the eight large-fruit synapomorphies do in fact fall within an important, conserved domain, those domains may be so short as to not stand out against the background of random variation in sequence variability along the length of the alignment.
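The general idea of a sliding-window variability profile, and of flagging "valleys" that fall more than two standard deviations below the mean, can be sketched as follows. This is only an illustration under simple assumptions: variability is measured here as the fraction of variable alignment columns per window, which is not necessarily the measure computed by the SWAN program, and the function names, window size, and step are hypothetical.

```python
from statistics import mean, pstdev


def variable_sites(alignment):
    """Return a 0/1 flag per alignment column: 1 if more than one base occurs."""
    flags = []
    for column in zip(*alignment):
        bases = {c.upper() for c in column if c.upper() in "ACGT"}
        flags.append(1 if len(bases) > 1 else 0)
    return flags


def sliding_variability(alignment, window=100, step=10):
    """Return [(window_start, fraction_of_variable_sites), ...]."""
    flags = variable_sites(alignment)
    profile = []
    for start in range(0, len(flags) - window + 1, step):
        profile.append((start, sum(flags[start:start + window]) / window))
    return profile


def low_variability_windows(profile, n_sd=2.0):
    """Return windows whose variability is more than n_sd SDs below the mean."""
    values = [v for _, v in profile]
    if not values:
        return []
    mu, sd = mean(values), pstdev(values)
    return [(s, v) for s, v in profile if v < mu - n_sd * sd]


# Toy alignment of two sequences; the second half is fully variable.
toy = ["AAAAAAAA", "AAAACCCC"]
profile = sliding_variability(toy, window=4, step=4)
print(profile)
print(low_variability_windows(profile))
```

Computing the mean and standard deviation separately for two sub-regions, as in Figure 4B, would amount to calling `low_variability_windows` on the upstream and downstream parts of the profile independently.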

Figure 4.

—SWAN of nucleotide variability in the fw2.2 region, including the 5′ UTR region of fw2.2, the fw2.2 transcription unit, and the adjacent orf44 transcription unit (depicted with heavy bars below the graph). (A) Mean and standard deviation of variability calculated from the full length of the sequence. (B) Mean and standard deviation calculated separately for the regions upstream (left) and downstream (right) of the fw2.2 start site. Solid horizontal bar, mean variability; dashed horizontal bars, one standard deviation from the mean. The positions of the eight large-fruit synapomorphies are denoted “Δ” beneath the graph. Nucleotide position numbers are relative to the fw2.2 start site. Sequences used for the calculation include all non-esculentum accessions shown in Figure 2 (eight accessions total). Accessions of L. esculentum were omitted from the SWAN analysis to prevent putative “large allele” mutations from adding to the calculation of variability.

Diversity of L. esculentum var. cerasiforme alleles across five loci: Because small-fruited L. esculentum var. cerasiforme is thought to be the wild progenitor of the large-fruited domesticated cultivars, a 951-nucleotide fragment of the fw2.2 5′ UTR (spanning five of the eight large-fruit synapomorphies) was sequenced from a sample of 39 cerasiforme accessions. The coding region of fw2.2 was not examined among the cerasiformes, as previous results suggested polymorphisms in this region are not likely to be important to variation in fruit size. The allelic diversity among the cerasiforme accessions, with sequences of the same fragment from the L. esculentum var. esculentum, L. cheesmanii, and L. pimpinellifolium accessions examined above, is depicted by the gene tree in Figure 5. Seven different haplotypes were identified among the cerasiforme accessions (denoted A-G). Most of the cerasiforme accessions carry the haplotype identical to the domesticated, large-fruited esculentum varieties.

Figure 5 also includes the country of origin of the accessions examined. Although the B haplotype—the allele identical to the “large allele” carried by var. esculentum—is distributed throughout the natural geographical range of var. cerasiforme, haplotypes E, F, and G appear to be restricted in distribution to areas sympatric with L. pimpinellifolium (Peru). Haplotypes A, C, and D are also found in areas sympatric with L. pimpinellifolium, in Ecuador and Peru, but are more frequently found outside this region.

Figure 5.

—Gene tree of sequences from L. pimpinellifolium (Pi), L. cheesmanii (Ch), L. esculentum var. esculentum (E), and L. esculentum var. cerasiforme (C), based on a 951-nucleotide subset of the fw2.2 5′ UTR. Tree shown is the single most-parsimonious tree, using deletions as a fifth state (tree length = 79, consistency index = 0.9241, retention index = 0.9250). Percentages of 100 bootstrap replications are given for nodes with bootstrap values >50%. The placements of character changes on the tree are as in the most-parsimonious tree; vertical hatch marks on branches denote individual substitutions or indels inferred along each branch, numbered by alignment position upstream of the fw2.2 start site. Solid hatches denote synapomorphies, and shaded hatches denote inferred homoplasies. Of the eight large-fruit allele synapomorphies in the fw2.2 5′ UTR (discussed in text), five are included in this tree and are marked with asterisks. The seven haplotypes observed among the cerasiforme accessions are denoted with boldface letters (A-G) to the right of the tree. Also included, in parentheses after the cerasiforme accession numbers, are (1) the overall rank in mean fruit weight among the 39 cerasiforme accessions (with 1 = smallest weight), (2) mean fruit weight (in grams) of each accession, and (3) the country of origin.

To contrast allelic diversity of fw2.2 with the rest of the genome, Adh2, TG10, and TG11 sequences from a sample of 10 of the 39 cerasiformes were examined. Cerasiforme alleles at each locus appear as a paraphyletic group with members grouping with alleles either from the domesticated esculentum or from the L. pimpinellifolium accessions (Figure 6). Moreover, cerasiforme alleles fall into different subclades, depending on which gene is examined. LA292 (C3 in Figure 6), for example, carries an esculentum-like allele at fw2.2 and Adh2, but a pimpinellifolium-like allele at TG10 and TG11. In contrast, the small set of domesticated esculentums always group together. In fact, with the exception of a single-nucleotide difference in the TG10 allele of TA1496 (E3), no allelic diversity was observed among the esculentums. The cerasiformes thus represent a diverse population containing an admixture of both esculentum- and pimpinellifolium-like alleles and suggest that the subspecies may be derived from hybridizations between L. esculentum domesticates and L. pimpinellifolium wild forms.

Figure 6.

—Allelic diversity at four loci within a sample of 10 cerasiformes. The four trees correspond to four of the six gene trees in Figure 2 (fw2.2 5′ UTR, Adh2, TG10, TG11) and show branching only among accessions of L. esculentum var. esculentum (E), L. esculentum var. cerasiforme (C), L. cheesmanii (Ch), and L. pimpinellifolium (P). Accessions included are: M82 (E1), TA496 (E2), TA1496 (E3), TA1210 (E4), LA1455 (C1), LA1226 (C2), LA292 (C3), LA1312 (C4), LA1420 (C5), LA1204 (C6), LA1712 (C7), LA1574 (C8), LA2688 (C9), LA1388 (C10), LA483 (Ch), LA1601 (P1), LA369 (P2), and LA1589 (P3). The asterisks in the TG11 tree denote that accession C5 (LA1420) appears to be heterozygous at this locus. All trees are drawn to the same scale, representing the number of inferred steps (for clarity, not corrected for sample length as in Figure 2). Sequence data from fw2.2 and orf44 were not collected for this subset of taxa.

If the presence of pimpinellifolium-like alleles represents recent introgression into L. esculentum var. cerasiforme from L. pimpinellifolium, then some linkage disequilibrium may be detectable by observing closely linked markers. TG91 and TG167, two RFLP markers flanking the fw2.2 region by <0.1 cM or 100 kb upstream and downstream, respectively (see Figure 1; Frary et al. 2000), were also sequenced in the accessions used above. Although there were polymorphisms between the L. esculentum var. esculentum and L. pimpinellifolium alleles at both loci, the 10 L. esculentum var. cerasiforme accessions were monomorphic and identical to the L. esculentum var. esculentum allele at both loci. Because the same accessions were polymorphic for fw2.2 alleles, this suggests that if the pimpinellifolium-like alleles are introgressions from L. pimpinellifolium, they must have occurred far enough in the past that linkage to TG91 and TG167 has been broken. TG91 and TG167 are also the only markers observed in this study for which no pimpinellifolium-like alleles are detected among the cerasiformes (with the caveat that only 10 accessions were sampled).

Molecular population genetics analysis of L. esculentum accessions: Sequence-based genetic analysis was performed on L. esculentum accessions (both cerasiforme and cultivated types) to make inferences about the history of L. esculentum population structure. A summary of basic population statistics is presented in Table 4. The most striking result in the table is the near absence of polymorphism among the four modern cultivars—only a single-nucleotide substitution in one var. esculentum accession was observed in a sample of >7 kb. While the sample of cultivars is small, it contained a sample of diverse types. Two accessions (M82 and TA496) are modern processing tomatoes producing “roma-type” fruit and two (TA1210 and TA1496) are heirloom varieties, one with extremely large fruit (TA1496) and one with bell-pepper-shaped fruit (TA1210). This lack of variation in var. esculentum is consistent with previous surveys of var. esculentum diversity, which determined levels of polymorphism among cultivated tomatoes to be extremely low (Miller and Tanksley 1990). This lack of diversity is most likely a reflection of at least three population bottlenecks in the history of modern cultivars: (1) initial domestication, (2) transfer of varieties to Europe by Spanish explorers, and (3) subsequent breeding efforts by primarily U.S. breeders (Rick 1976).
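As an illustration of the kind of statistic summarized in such a table, nucleotide diversity can be computed as the average number of pairwise differences per site among aligned sequences. The following is a minimal sketch under that standard definition; the function name and the toy alignment are hypothetical and are not the data or the exact estimator used for Table 4. The toy case mimics four sequences that differ by a single substitution in one of them.

```python
from itertools import combinations

def nucleotide_diversity(seqs):
    """Average pairwise differences per site for aligned, equal-length sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and aligned to the same length")
    pairs = list(combinations(seqs, 2))
    # Count mismatched sites for every pair of sequences.
    total_diffs = sum(sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs)
    return total_diffs / (len(pairs) * length)

# Hypothetical toy alignment: four sequences, one singleton substitution.
toy_alignment = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAT"]
pi = nucleotide_diversity(toy_alignment)
print(pi)
```

With a single substitution in one of four sequences, only three of the six pairwise comparisons show a difference, so the estimate shrinks rapidly as the alignment lengthens.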

Many population models infer historic selection pressures on the basis of observed violations of neutral nucleotide substitutions (Kimura 1980). For example, on the basis of this neutral theory, the Hudson-Kreitman-Aguadé (HKA) test predicts that loci that evolve at higher rates should have higher levels of within-species polymorphism as compared to polymorphism at other neutral loci (Hudson et al. 1987). Similarly, McDonald and Kreitman (1991) proposed the test that for neutral loci the ratio of fixed nonsynonymous to synonymous substitutions between species should be equal to the same ratio within species. However, the absence of polymorphism within var. esculentum accessions sequenced in this study limits the ability to apply these neutral theory-based methods [i.e., HKA tests whether 0 = 0; McDonald-Kreitman (MK) causes division by 0 errors]. That nucleotide diversity among var. esculentum alleles appears to be lower than diversity among var. cerasiforme alleles is consistent with a population bottleneck in the history of var. esculentum. However, nucleotide polymorphism at the level of individual genes may not be adequate to make robust population inferences about past selection pressures without a potentially prohibitively large amount of nucleotide sequence, i.e., >7 kb from many more than four accessions.
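The division-by-zero problem noted for the MK test can be made concrete with a small sketch. The function below compares the nonsynonymous-to-synonymous ratio between species with the same ratio within species and returns None where a ratio is undefined. The function name and all counts are hypothetical and chosen only for illustration; a full MK test would also apply a contingency-table significance test, which is omitted here.

```python
def mk_ratios(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn):
    """Return (between-species, within-species) nonsynonymous/synonymous ratios.

    A ratio is None when its synonymous count is zero (undefined).
    """
    def ratio(nonsyn, syn):
        return nonsyn / syn if syn != 0 else None

    return ratio(fixed_nonsyn, fixed_syn), ratio(poly_nonsyn, poly_syn)

# Hypothetical counts: fixed differences exist between species,
# but no polymorphism at all is observed within the species.
between, within = mk_ratios(fixed_nonsyn=2, fixed_syn=8, poly_nonsyn=0, poly_syn=0)
print(between, within)
```

When no within-species polymorphism is observed, the second ratio cannot be formed, so the comparison that the test relies on does not exist.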

Table 4.

Estimates of nucleotide diversity within the species L. esculentum

Test for association of genotype and fruit weight phenotype in L. esculentum var. cerasiforme: Mean fruit weight (from a 15-fruit sample) of each of the 39 cerasiforme accessions studied was superimposed upon the gene tree in Figure 5. The phenotypic data provide an ideal opportunity to evaluate so-called “measured genotypes” (Boerwinkle et al. 1987), in this case to assign fruit-weight effects to individual haplotypes of the fw2.2 5′ UTR. Clearly there is a large range in fruit weight among the cerasiformes, a nearly 12-fold difference from smallest to largest. Due to sequence identity or similarity to alleles of known phenotype (i.e., the alleles carried by the var. esculentum, L. pimpinellifolium, and L. cheesmanii accessions), the initial expectation was that plants carrying haplotypes A and B would have significantly larger fruit than those carrying all other alleles (C-G). Yet, although the cerasiformes in the A-B clade have slightly larger fruit (mean = 10.3 g, SD = 7.8) than those in the C-G clade (mean = 7.4 g, SD = 8.2), this difference is not significant (one-tailed t-test, P = 0.146).

To attribute phenotypic effects to individual haplotypes, the NANOVA method proposed by Templeton et al. (1987) was utilized. This method is based upon the assumption that changes in phenotype follow the same evolutionary history represented by the cladogram and is therefore dependent upon (1) confidence in the cladogram and (2) the assumption that recombinant alleles are rare. The Templeton-Crandall-Sing (TCS) methods (Templeton et al. 1992; Clement et al. 2000) were used to evaluate these assumptions. First, all seven cerasiforme haplotypes can be assembled into a single network (with no closed loops, which would signify recombination) within the 95% parsimony limit (13 steps); i.e., each step within the cladogram is likely to be parsimonious. Second, although the cladogram contains a number of homoplasies, no recombinant alleles could be identified using the TCS method; in particular, there were no postulated recombination events that could resolve two or more homoplasies (Aquadro et al. 1986). Therefore, NANOVA was performed using the most parsimonious tree of cerasiforme haplotypes.

The nesting categories used for NANOVA are illustrated in Figure 7. Because many of the haplotypes are separated by multiple steps—requiring a large number of inferred, intermediate haplotypes that make no statistical contribution to the model—a modification of the grouping method of Templeton et al. (1987) was used. Rather than strictly nest the groupings on the basis of single-step increments, nesting categories were based more generally upon “subclades.” The lowest level of nesting (level 0) represents individual haplotypes, labeled A-G. The next level of nesting (level 1) groups the haplotypes into four subclades, haplotypes A and B (1), C and D (2), E and F (3), and G (4), and the highest nesting level (level 2) divides the taxa into two groups, A and B (I) and C-G (II). Thus, the NANOVA model of fruit weight variance contains three terms: variation among level 2 clades, variation among level 1 clades within level 2 clades, and variation among level 0 clades within level 1 clades within level 2 clades.
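The three-term structure of the model can be sketched as a hierarchical partition of the sum of squares. The code below is a generic nested-ANOVA decomposition, not the computation performed for Table 5; the function names are hypothetical, and the fruit weights are invented values arranged in the level 2 / level 1 / level 0 grouping described above.

```python
def mean(values):
    return sum(values) / len(values)

def nested_anova_ss(data):
    """Partition sums of squares for data keyed by (level2, level1, haplotype)."""
    all_values = [w for ws in data.values() for w in ws]
    grand_mean = mean(all_values)

    def pool(depth):
        # Pool observations by the first `depth` elements of each key.
        groups = {}
        for key, ws in data.items():
            groups.setdefault(key[:depth], []).extend(ws)
        return groups

    level2, level1, level0 = pool(1), pool(2), pool(3)
    ss = {
        "among_level2": sum(len(ws) * (mean(ws) - grand_mean) ** 2
                            for ws in level2.values()),
        "level1_within_level2": sum(len(ws) * (mean(ws) - mean(level2[k[:1]])) ** 2
                                    for k, ws in level1.items()),
        "level0_within_level1": sum(len(ws) * (mean(ws) - mean(level1[k[:2]])) ** 2
                                    for k, ws in level0.items()),
        "residual": sum((w - mean(ws)) ** 2 for ws in level0.values() for w in ws),
        "total": sum((w - grand_mean) ** 2 for w in all_values),
    }
    df = {
        "among_level2": len(level2) - 1,
        "level1_within_level2": len(level1) - len(level2),
        "level0_within_level1": len(level0) - len(level1),
        "residual": len(all_values) - len(level0),
    }
    return ss, df

# Hypothetical fruit weights (g); keys follow the nesting: level 2, level 1, haplotype.
toy_weights = {
    ("I", "1", "A"): [4.0, 6.0],
    ("I", "1", "B"): [14.0, 18.0],
    ("II", "2", "C"): [3.0, 5.0],
    ("II", "2", "D"): [4.0, 4.0],
    ("II", "3", "E"): [6.0, 8.0],
    ("II", "3", "F"): [5.0, 7.0],
    ("II", "4", "G"): [15.0, 21.0],
}
ss, df = nested_anova_ss(toy_weights)
print(ss, df)
```

With two level 2 clades, four level 1 clades, and seven haplotypes, the three model terms carry 1, 2, and 3 degrees of freedom, respectively, and the four components sum to the total sum of squares.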

The results of NANOVA are summarized in Table 5. As with the one-tailed t-test above, the contrast expected to be most significant, variation between level 2 clades, is not significant. That is, there is no evidence that the fruit of plants carrying putative “large alleles” (inferred from sequence identity) are significantly larger than those carrying putative “small alleles.” However, several other terms in the model are significant. First, there is significant variation among level 1 clades. All of this variation can be attributed to variation among level 1 clades within level 2 clade II, because there is only one level 1 clade within level 2 clade I. Multiple comparisons among the three level 1 clades in level 2 clade II [using the Bonferroni method to account for multiple comparisons (Neter et al. 1985)] reveal that the contrasts of clades 2 vs. 4 and 3 vs. 4 are significant (P = 0.015 and 0.038, respectively), but 2 vs. 3 is not. Finally, the only significant variation identified within level 1 clades (that is, among haplotypes) was within level 1 clade 1; variation between haplotypes A and B is significant. Thus, the NANOVA method has identified three branches in the cladogram that have a change in fruit size associated with them: (1) the branch between haplotypes A and B, (2) the branch between level 1 clades 3 and 4, and (3) the branch between level 1 clade 2 and its inferred common ancestor with clade 4. These significant branches are marked with asterisks in Figure 7.
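For readers unfamiliar with the Bonferroni adjustment applied to the three pairwise contrasts, the following minimal sketch multiplies each unadjusted P value by the number of comparisons and caps the result at 1. The function name and the input P values are hypothetical; they are not the values underlying Table 5.

```python
def bonferroni_adjust(p_values):
    """Multiply each P value by the number of comparisons, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical unadjusted P values for three pairwise contrasts.
raw_p = [0.01, 0.02, 0.5]
adjusted_p = bonferroni_adjust(raw_p)
print(adjusted_p)
```

A contrast is declared significant only if its adjusted value remains below the chosen threshold, which keeps the overall error rate across the three comparisons at or below that threshold.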

Figure 7.

Haplotype categories used for nested analysis of variance, following the procedure outlined by Templeton et al. (1987). A-G represent the seven cerasiforme haplotypes of the fw2.2 5′ UTR, as defined in Figure 5, i.e., the “level 0 clades.” Arrows depict the phylogenetic relationships among the seven haplotypes inferred in Figure 3 (note: arrows represent multiple steps and are not drawn to scale). Solid lines enclose the four “level 1 clades” designated by numbers (1-4), and dashed lines enclose the two “level 2 clades” designated by Roman numerals (I and II). The small circle represents an inferred intermediate haplotype, and its exact categorical placement is irrelevant to the statistical analysis. Asterisks indicate those branches inferred to be significantly associated with variation in fruit weight.

Table 5.

Nested analysis of variance of fruit weight among 39 L. esculentum var. cerasiforme accessions, following the method of Templeton et al. (1987)

Lack of significance might suggest that the mutations associated with the fw2.2 phenotype may fall outside the sequenced promoter region and are not in perfect linkage disequilibrium with that region. Or, perhaps more likely, a large portion of the fruit weight variation in cerasiforme may be attributable to polymorphism at several of the other known fruit weight quantitative trait loci (Grandillo et al. 1999), and the contribution of fw2.2 is too small to be detected against this background. Further, it should be noted that although significant associations were detected between phenotype and some branches in the cladogram, it is not necessarily true that mutations along those branches cause the observed phenotype. Rather, phenotype could also be caused by changes outside of the sequenced region that are in linkage disequilibrium with those observed mutations. Finally, it is curious to note that the haplotypes most significantly associated with decreased fruit size (A, C, and D) are observed in accessions outside the natural range of L. pimpinellifolium.


The fw2.2 phenotype cannot be explained by differences in protein structure or function. Instead, data presented here support the observation of Frary et al. (2000) that the fruit-size phenotype is likely due to differences in expression of the gene, probably as a result of one of eight mutations in the 2.7 kb upstream of the fw2.2 gene. Further, the very low rate of nonsynonymous substitutions among the coding sequences of most taxa examined here (fw2.2, orf44, and Adh2) suggests that much of the phenotypic diversity within the genus may be due to the changes within the noncoding sequences in the genome. Although the observation of considerably lower diversity among var. esculentum accessions relative to var. cerasiforme accessions is consistent with a bottleneck in the history of var. esculentum, genetic diversity among var. esculentum accessions is too low to make neutral theory-based inferences about historic selection pressures. That is, the distinction between a selective sweep and neutral lineage sorting cannot be made at the loci examined. Tajima’s relative rate test, however, does suggest that the large-fruited var. esculentum allele of fw2.2 has not accumulated more (or fewer) substitutions than other alleles in the genus.

Phylogenies of Lycopersicon have been inferred using a variety of molecular methods: chloroplast DNA (Palmer and Zamir 1982), mitochondrial DNA (McClean and Hanson 1986), RFLPs (Miller and Tanksley 1990), and isozymes (Breto et al. 1993). This study represents the first reconstruction of Lycopersicon phylogeny based upon the sequence of individual nuclear loci. Although sequence distances between species are not great, they are generally large enough to produce robust phylogenies from a sample of 300-500 nucleotides. However, among L. pimpinellifolium, L. esculentum var. cerasiforme, and L. esculentum var. esculentum, incongruities are observed (Figures 5 and 6), which may be due to the fact that these species are entirely interfertile and gene flow among them has been well documented in areas where they are sympatric (Rick 1950, 1958; Rick et al. 1974; Rick and Fobes 1975; Rick and Holle 1990; Williams and St. Clair 1993). Frequent introgressions among these taxa make it extremely difficult, if not impossible, to track the exact origins of individual alleles; var. cerasiforme appears to represent an admixture of alleles from L. esculentum varieties and L. pimpinellifolium. There are pimpinellifolium-like alleles among cerasiforme accessions collected in areas that are not sympatric with L. pimpinellifolium (some accessions with haplotypes C and D). Although there is certainly a great deal of gene flow within L. esculentum, it also seems unlikely that the high proportion of large-fruit alleles among the cerasiformes could be explained entirely by recent introgressions from domesticated types. Thus it is probably reasonable to infer that the allelic diversity among the cerasiformes today is not entirely a result of recent introgression and may be similar to the diversity that would have been available to early tomato cultivators.

Because fruit of the cerasiformes are already considerably larger than those of the other members of the genus (Rick 1958), it is conceivable that the large allele of fw2.2 arose not among relatively recent domesticates selected from the cerasiformes in Mesoamerica, but further in the past, perhaps predomestication, when L. esculentum var. cerasiforme first diverged from the other species in the genus in the Andes. If these molecular-clock-based divergence dates are reasonably accurate (Table 3), then the large and small alleles could have diverged from a common ancestor >1 million years BP. Although the conversion of an fw2.2 allele from small phenotype to large need only have been the most recent substitution in its divergence from its common ancestor with L. pimpinellifolium, fw2.2 may have acquired its large-fruit nature long before humans entered the New World (Wenke 1990).
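The logic of a molecular-clock date can be sketched briefly: if two alleles differ by K substitutions per site and substitutions accumulate at a constant rate r per site per year along each lineage, the time since their common ancestor is K / (2r). The function name and the values of K and r below are hypothetical placeholders chosen only to show the arithmetic; they are not the distances or calibration behind Table 3.

```python
def divergence_time_years(k_per_site, rate_per_site_per_year):
    """Time since divergence, assuming a constant rate on both lineages."""
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    # Divide by 2r because both lineages accumulate substitutions independently.
    return k_per_site / (2.0 * rate_per_site_per_year)

# Hypothetical inputs: 1.4% divergence and 7e-9 substitutions/site/year.
t_years = divergence_time_years(k_per_site=0.014, rate_per_site_per_year=7e-9)
print(t_years)
```

Because the estimate scales inversely with the assumed rate, any uncertainty in the calibration carries over proportionally to the inferred date.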

Unlike teosinte branched1, fw2.2 is a QTL and does not condition a dramatic morphological change in tomato fruit, but rather an incremental one. An association of large-fruit phenotype with presence of putative large-fruit alleles of fw2.2 could not be detected among cerasiforme accessions against the background of what are likely to be many other genes affecting fruit weight in tomato (Grandillo et al. 1999). The range in fruit size among the cerasiforme accessions examined here is >15 times greater than the difference in size between near-isogenic lines differing at the fw2.2 locus (Alpert et al. 1995). If the variation in cerasiforme fruit size present today is at all representative of the variation present for the early agriculturalists, then they might not have even noticed a spontaneous mutation in the fw2.2 locus. Instead, the evolution of fruit size during the domestication of tomato is likely to represent a very long path of lineage sorting and gene “stacking” of alleles at many loci, some of which, including the large allele of fw2.2, could have existed for millennia before the first Americans.


The authors thank J. Doyle, M. Jahn, and A. Clark for their critical review of this manuscript and L. Patton and A. Patton for their encouragement in the field. The work was supported by grants from the U.S. Department of Agriculture National Research Initiative Cooperative Grants Program (no. 96-35300-36460), the Binational Research and Development Fund (no. US 2427-94), and the National Science Foundation (no. 9872617) to S.D.T.


  • Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. AY097061-AY097189.

  • Communicating editor: T. F. C. Mackay

  • Received December 5, 2001.
  • Accepted June 6, 2002.

