Abstract
We present a multilocus sequencing study to assess patterns of polymorphism and divergence in the closely related wild tomato species, Solanum peruvianum and S. chilense (Solanum section Lycopersicon, Solanaceae). The data set comprises seven mapped nuclear loci (≈9.3 kb of analyzed sequence across loci) and four local population samples per species that cover much of the species' range (between 80 and 88 sequenced alleles across both species). We employ the analytical framework of divergence population genetics (DPG) in evaluating the utility of the “isolation” model of speciation to explain observed patterns of polymorphism and divergence. Whereas the isolation model is not rejected by goodness-of-fit criteria established via coalescent simulations, patterns of intragenic linkage disequilibrium provide evidence for postdivergence gene flow at two of the seven loci. These results suggest that speciation occurred under residual gene flow, implying that natural selection is one of the evolutionary forces driving the divergence of these tomato species. This inference is fully consistent with their recent divergence, conservatively estimated to be ≤0.55 million years. We discuss possible biases in the demographic parameter estimates due to the current restriction of DPG algorithms to panmictic species.
THE biological and geographic determinants of species divergence have long been contentious, and it is now increasingly appreciated that patterns of genetic variation and differentiation may provide valuable insights into the evolutionary processes shaping this divergence. The importance of geographic isolation in facilitating evolutionary divergence as a consequence of mutation and genetic drift (or additionally, adaptive differentiation) was recognized early, and the process of allopatric speciation is uncontroversial on theoretical grounds (Mayr 1963; Losos and Glor 2003; Coyne and Orr 2004). If residual gene flow characterized the divergence of incipient species, however, modes other than strict allopatric speciation must be invoked, and these invariably require natural selection as one of the factors underlying species divergence. In addition to the putatively rare cases of sympatric speciation (e.g., Savolainen et al. 2006), divergence under residual gene flow may proceed in parapatry; i.e., geographically adjacent populations may be subject to directional selection that incidentally confers reproductive isolation (Endler 1977; Turelli et al. 2001; Gavrilets 2003). Another scenario is an initial period of divergence in allopatry followed by secondary contact allowing gene flow and thus direct selection for stronger interspecific barriers (reinforcement of reproductive isolation; Rice and Hostert 1993; Coyne and Orr 2004; Hoskin et al. 2005). Some researchers posit that interspecific hybridization and postdivergence gene flow following secondary contact may promote novel advantageous gene combinations in populations of mixed ancestry, perhaps contributing to adaptive divergence and speciation (e.g., Arnold 1997; Rieseberg et al. 2004; Seehausen 2004; Mallet 2005).
Multilocus DNA sequences collected within and among closely related species contain a wealth of historical–demographic information and are particularly informative when considered in the framework of genealogical (coalescence) models (e.g., Tajima 1989; Hudson 1990; Rosenberg and Nordborg 2002). As an extension of population genetic procedures to the species level, the analytical framework of divergence population genetics (DPG) encompasses coalescent-based models to infer historical attributes of lineage divergence from a common ancestor and to assess the utility of simple speciation models (Hey and Kliman 1993; Wakeley and Hey 1997; Wang et al. 1997; Machado et al. 2002; Hey and Machado 2003; Hey and Nielsen 2004). The DPG approach accommodates the stochastic nature of lineage sorting (Edwards and Beerli 2000; Hudson and Turelli 2003) and thus the (gradually decreasing) segregation of shared ancestral polymorphism in the descendant species, as these become more differentiated through genetic drift and the accumulation of new mutations.
The “isolation” model of speciation (the WH model; Wakeley and Hey 1997) assumes divergence in isolation without subsequent gene flow and as such is an explicit model of allopatric speciation. The WH model allows quantitative predictions about patterns of nucleotide diversity across multiple loci, and sequence data obtained from recently diverged taxa can provide scaled estimates of population-size changes and the timing of speciation, as well as probe for signatures of postdivergence gene flow (Wakeley and Hey 1997; Wang et al. 1997; Kliman et al. 2000; Machado et al. 2002; Broughton and Harrison 2003). More recently, bidirectional gene flow following initial species divergence has been incorporated as additional model parameters in the isolation-with-migration (IM) model (Nielsen and Wakeley 2001; Hey and Nielsen 2004), but a notable restriction of its current implementation is the assumption of nonrecombining data within loci. Despite this limitation, the IM model appears to enjoy increasing popularity in empirical studies (e.g., Dolman and Moritz 2006; Kronforst et al. 2006; Lawton-Rauh et al. 2007).
The currently available coalescent-based speciation models (WH and IM models) further assume panmixia within both extant and ancestral species, an assumption that is rarely tested or even discussed in empirical applications of these models. We suspect that not only DPG studies in plants but also those in many animal groups are at risk of using inappropriate models of divergence, given the likely importance of population subdivision in many taxa; similar concerns have recently been raised in the context of statistical phylogeography (Knowles and Carstens 2007). One of the goals of this study is to examine the utility of the DPG approach for species exhibiting population structure.
Multilocus genealogical studies of speciation scenarios are still very limited for plants (Ramos-Onsins et al. 2004; Städler et al. 2005; Zhang and Ge 2007). Building on our pilot study that was limited to single populations per species (Städler et al. 2005), this article provides an in-depth assessment of the divergence process between two closely related wild tomato species, Solanum peruvianum and S. chilense (Solanum section Lycopersicon, Solanaceae). Previous studies of Lycopersicon using a variety of molecular markers have generally found low levels of differentiation among species (e.g., Miller and Tanksley 1990; Baudry et al. 2001; Peralta and Spooner 2001), implying a fairly recent divergence of the tomato clade. In particular, our multilocus study of three self-incompatible species demonstrated variation between species pairs in the proportion of loci showing fixed interspecific differences vs. those with appreciable numbers of shared polymorphisms (Städler et al. 2005). These differential signals of genealogical divergence highlight the suitability of wild tomatoes as a plant speciation model under the DPG framework; they also imply widespread incomplete lineage sorting.
We used seven effectively unlinked nuclear loci to estimate population parameters and scaled divergence time between these outcrossing tomato species. While we cannot reject the isolation model based on overall goodness-of-fit criteria, patterns of linkage disequilibrium (LD) are indicative of historical gene flow between the diverging lineages. Moreover, we obtained consistent estimates of larger population size for S. peruvianum and a recent divergence time of ≤0.55 million years.
MATERIALS AND METHODS
Study system and sampling:
For this in-depth study, we chose two of the previously used self-incompatible wild tomato species. In contrast to our exploratory study of three taxa (Städler et al. 2005), we adopt the current taxonomic treatment of tomatoes as a section within the large genus Solanum (e.g., Spooner et al. 1993; Olmstead et al. 1999; Peralta and Spooner 2001). Our study species are the widely distributed S. peruvianum and the southernmost tomato species, S. chilense. Native to western South America, the two morphologically differentiated species have partly overlapping ranges in the arid coastal regions of southern Peru and northern Chile, west of the continental divide (Rick and Lamm 1955; Rick 1979, 1986; Taylor 1986; see Figure 1).
Geographic ranges of S. peruvianum (dark shading) and S. chilense (cross hatching) and approximate locations of the sampled populations. The S. peruvianum populations, from south to north (open diamonds), are Tarapaca (TAR, elevation 400 m), Arequipa (ARE, 2180 m), Nazca (NAZ, 2140 m), and Canta (CAN, 2050 m). The S. chilense populations, from south to north (solid dots), are Antofagasta (ANT, 2900 m), Tacna (TAC, 1250 m), Moquegua (MOQ, 2450 m), and Quicacha (QUI, 1830 m). ANT and TAR were previously studied by Städler et al. (2005; see their Figure 1). Note the region of sympatry in northernmost Chile and southern Peru. S. peruvianum sensu lato extends to northern Peru, but populations north of ∼9°–10°S latitude are now regarded as the separate species S. arcanum (see text).
Systematists have recently proposed recognizing four species in what traditionally was regarded as the polymorphic S. peruvianum (Peralta et al. 2005; Spooner et al. 2005). According to their proposition, our sampling of three natural populations in central and southern Peru (see below) would encompass both the new entity S. corneliomuelleri and S. peruvianum sensu stricto. However, there appear to be neither molecular data nor crossing results that would validate the proposed split of S. corneliomuelleri from S. peruvianum sensu stricto; we thus treat all of our new samples as S. peruvianum. There is no published evidence for interspecific hybridization between S. chilense and S. peruvianum in their natural habitats, in concordance with the strong reproductive barriers uncovered in experimental crossing studies (Rick and Lamm 1955; Rick 1979, 1986).
For each of the two study species, three new population samples were collected in southern and central Peru in May 2004 (T. Städler, G. Clostre and T. Marczewski); geographical locations and population nomenclature are shown in Figure 1. With the exception of the Canta population (S. peruvianum), all new samples are from regions of sympatry with the other species, even though this may not be true at a local scale. The Canta population, however, is far north of the S. chilense species range. Five to seven plants were collected per population and geographical coordinates and altitude were determined by GPS. We sampled ∼3–5 g of fresh leaf tissue per plant and stored it in plastic bags with silica gel until our return to Munich. Voucher specimens have been deposited at the herbaria Uníversidad San Marcos (Lima, Peru) and Munich Systematic Botany (Munich). Our exploratory study used one accession (equivalent to a population sample) of each species, obtained from the Tomato Genetics Resource Center at the University of California at Davis (Städler et al. 2005). Both these accessions were from northern Chile (S. chilense: accession LA2884, Antofagasta, five plants; S. peruvianum: LA2744, Tarapaca, five plants).
Choice of marker loci:
For this study focusing on multiple populations per species, we chose a subset of the loci previously used in our initial surveys (Baudry et al. 2001; Roselius et al. 2005; Städler et al. 2005); CT066, CT093, CT166, CT179, CT198, CT208, CT251, and CT268 are eight anonymous, single-copy cDNA markers previously mapped by Tanksley et al. (1992; see http://www.sgn.cornell.edu). Given that it proved impossible to sequence so many samples at all the previous loci, this reduced set of genes was chosen primarily because it yielded very similar proportions among the isolation model parameters as when using the full set of genes (Städler et al. 2005). Another study (Arunyawat et al. 2007) that focused on tests of neutrality, population subdivision, and linkage disequilibrium found evidence of non-neutral evolution at locus CT208 where a clinal pattern of variation was detected in S. chilense, possibly reflecting an incomplete (ongoing) selective sweep. Because it is paramount to avoid the inclusion of loci under positive selection in testing the isolation model, we decided to base our WH simulations on the seven remaining loci that show no obvious departures from neutral expectations.
Sequencing and haplotype determination:
Genomic DNA was extracted from dried leaf tissue using the DNeasy plant mini kit (QIAGEN GmbH, Hilden, Germany). PCR conditions generally followed those of our previous studies (Roselius et al. 2005; Städler et al. 2005; Arunyawat et al. 2007); they, as well as all PCR primer information, can be accessed at http://www.zi.biologie.uni-muenchen.de/evol/Downloads.html. Sequencing was performed on an ABI 3730 DNA analyzer (Applied Biosystems, Foster City, CA). Distinct haplotypes within heterozygous individuals were resolved by applying a suite of haplotype-specific sequencing primers. In most cases, we exploited putative or confirmed single nucleotide polymorphisms (SNPs) to anchor the 3′-end of sequencing primers that were intended to resolve the heterogeneous PCR products. This approach enabled us to verify SNPs (and indel variation) and to establish haplotype phase on the basis of overlapping information supported by multiple primer pairs. Sequence alignments were initially done either in Sequencher (Gene Codes, Ann Arbor, MI) or in Sequence Navigator (Applied Biosystems, Darmstadt, Germany) and adjusted manually in MacClade (Maddison and Maddison 1992).
Estimating nucleotide diversity within and between populations:
Standard population genetic analyses of the sequence data were performed using the program packages DnaSP, version 4.0 (Rozas et al. 2003) and SITES (Hey and Wakeley 1997). As a measure of intrapopulation polymorphism, we calculated both Watterson's (1975) estimator θW of the population mutation parameter, θ (= 4 Neu, where Ne is the effective population size and u the mutation rate per generation per site), and the average pairwise sequence divergence within samples, πw (Nei 1987). Given our sampling of multiple populations, we obtained species-wide estimates of the parameter θ by calculating πbetween (πb). This was done by including all pairwise comparisons of sequences obtained from different demes but excluding pairwise comparisons within demes, thus eliminating undesirable effects of the scattering phase of the coalescent process in substructured populations (Wakeley 1999, 2001). We also tested for departures from neutral equilibrium expectations by applying the intraspecific, standard Tajima's (1989) D test; for details on this and other tests of neutrality, see Arunyawat et al. (2007). In this article, we restrict our attention to SNPs (excluding all insertion–deletion polymorphisms) and report all estimates of nucleotide diversity as per-site values.
Fitting the isolation speciation model:
The multilocus data were fitted to the WH isolation model (Wakeley and Hey 1997; Wang et al. 1997). This simple model of allopatric speciation assumes that an ancestral, panmictic species characterized by the population mutation parameter θA gave rise to two extant species at time τ in the past (τ = 2ut, where t is the number of generations since speciation). The extant species are characterized by the mutation parameters θ1 and θ2, respectively. Hence, effective population size is assumed to be constant within species, but is allowed to change at the time of speciation. The model further assumes the neutrality of segregating variants and no gene flow (introgression) subsequent to the initial species divergence.
As shown by Wakeley and Hey (1997), the expectations of the observable quantities Sx1, Sx2, Ss, and Sf (exclusive polymorphisms for species 1 and 2, shared polymorphisms and fixed differences, respectively) are functions of the four model parameters. Hence, by equating observations with expectations, their moment-based algorithm yields the parameter estimates from multilocus data. For each pair of interspecific populations [e.g., Tarapaca (TAR)–Moquegua (MOQ), Canta (CAN)–Tacna (TAC), etc.] and for each of the seven loci, we calculated the number of observations for each of the four site categories (excluding sites with observable multiple hits) and simultaneously estimated the population recombination parameter γ (Hey and Wakeley 1997), using the program SITES. For each pairwise comparison, we ran 10,000 coalescent simulations using a modified WH program (Wakeley and Hey 1997; Wang et al. 1997; Städler et al. 2005).
To avoid biasing the recombination rate downward in the WH simulations, we let γ be “unknown” for the S. chilense samples, except in a few cases where γ for the S. peruvianum sample was estimated to be zero but the S. chilense sample had a γ > 0; in such cases, we used the S. chilense γ and let the S. peruvianum γ be “unknown.” As implemented in WH, the relative magnitudes of the model parameters θ1, θ2, and θA determine the level of recombination for “unknown” entries, such that γ2 = γ1 × (θ2/θ1) and γA = γ1 × (θA/θ1), where γ1 in our case is the previously estimated locus-specific recombination parameter for the S. peruvianum sample. This has the effect of imposing a fixed ratio of γ's in a given interspecific comparison, but allows for variation in levels of recombination across loci.
Prompted by our finding that species-wide comparisons do not exhibit fixed interspecific differences at any of the loci (conditions under which the WH parameter estimates are unreliable and simulations assessing the fit of the model fail; Wakeley and Hey 1997), we also used an alternative approach to estimate the scaled divergence time τ and the ancestral population size θA. Under the isolation model, the expected average number of interspecific pairwise differences, E (d12) = τ + θA (Wakeley and Hey 1997, p. 854), and a reasonable estimate of τ can be obtained from the minimum number of interspecific pairwise differences, min (kij), among all sampled sequences (Takahata and Nei 1985). We thus calculated d12 as well as min (kij) (summed over all loci) and estimated the ancestral size as θA = d12 − min (kij).
Linkage disequilibrium test of gene flow:
Machado et al. (2002) introduced a test of gene flow based on patterns of LD among specific classes of segregating sites, i.e., using a subset of total intragenic LD. Under a scenario of gene flow, LD among pairs of shared polymorphisms (average = ) in the recipient species should tend to be positive (i.e., preponderance of ancestral–ancestral and/or derived–derived SNP associations), and LD among pairs of sites where one member is a shared and the other an exclusive polymorphism (average =
) should tend to be negative (i.e., preponderance of ancestral-derived SNP associations). Both expected effects can be seen as a consequence of insufficient time for recombination to erode LD (given that introgression has occurred after initial species separation) compared to the situation where shared polymorphisms represent truly ancestral mutations, i.e., those preceding speciation.
Unlike our initial study (Städler et al. 2005), the availability of outgroup sequences from species outside the tomato clade allowed us to use the LD test statistic proposed by Machado et al. (2002). Here, we use sequence data generated from either S. lycopersicoides (CT093, CT268) or S. ochranthum (all other loci; Roselius et al. 2005) to polarize LD. It may be expected that polarized LD has greater power in detecting historical introgression than our previous, unpolarized LD-based test (Städler et al. 2005), although this has not been formally evaluated. Using the LD measure D′ (= D/Dmax, possible range from −1 to +1), the observed values of the test statistic x (; Machado et al. 2002; computed in the SITES program) were confronted with expectations generated by the same set of WH simulations that was used to test the quality of fit of the isolation model. These coalescent simulations thus incorporate the estimated demographic parameters and scaled divergence time as well as the locus-specific recombination rates that were estimated from the data. For these analyses, only loci with at least four pairs of sites in each of the above categories were used, which excludes loci with observed Ss values <4.
RESULTS
Single nucleotide polymorphism and distribution of species-wide diversity:
Levels of nucleotide variation at eight nuclear loci, patterns of population differentiation, and tests of neutrality have been summarized in Arunyawat et al. (2007). For our purpose of testing simple models of speciation, we utilized the seven loci that appear to evolve neutrally and recalculated population-specific mean diversity levels and multilocus Tajima's D values (Table 1). Species-wide nucleotide diversity (estimated with πb) is somewhat higher in S. peruvianum (0.01347) than in S. chilense (0.01173). For both species, three local population samples each harbor >80% of the species-wide diversity, while one population is characterized by lower levels of diversity (50–60% of species-wide diversity; ARE and ANT). The multilocus Tajima's (1989) D statistic is slightly, but not significantly, negative in the three high-diversity S. peruvianum samples TAR (−0.22), NAZ (−0.14), and CAN (−0.55), while estimates of D are close to zero in the three high-diversity S. chilense samples TAC (0.03), MOQ (−0.05), and QUI (0.13).
Average population-level and species-wide nucleotide diversity
Both the S. peruvianum ARE sample and the S. chilense ANT sample depart from the other populations by exhibiting lower polymorphism, an excess of intermediate-frequency polymorphism (positive D), and a deficit of distinct haplotypes (Table 1; Arunyawat et al. 2007). It is likely that these low-diversity samples reflect local or regional bottlenecks and/or greater degrees of isolation from neighboring demes (i.e., lower proportions of immigrants via gene flow), compared to the majority of our local samples. Because we are interested in demographic estimates reflecting species-wide patterns subsequent to their recent divergence, we decided to restrict our tests of the isolation model to the six high-diversity populations and certain pooled samples (see below).
Although there is clear evidence that levels of nonsynonymous polymorphism are markedly lower than levels of silent polymorphism (Arunyawat et al. 2007), as expected under purifying selection, D estimates based on only silent sites are very similar to those based on all sites (data not shown). This suggests that the segregating nonsynonymous variants are not under strong negative selection but rather behave in a nearly neutral manner. Because there was no obvious evidence of non-neutrality in the data comprising these seven loci, we felt justified in using the entire SNP data for testing the isolation model of speciation.
Polymorphic site categories in interspecific population contrasts:
The among-locus distributions of four categories of polymorphic sites, namely exclusive polymorphisms in species 1 and 2 (Sx1 and Sx2, respectively), polymorphisms that are shared among species (Ss), and fixed differences between the species (Sf) represent summary statistics of the data and collectively yield the WH isolation model parameters (Wakeley and Hey 1997). When the three high-diversity population samples per species were treated as combined (pooled) samples, there were no fixed differences between these two species at any of the loci (data not shown). However, under such conditions, the WH approach cannot yield reliable estimates of the model parameters and simulations assessing the fit of the model fail (Wakeley and Hey 1997; but see below).
As a first approximation to the model assumption of panmixia, we contrasted all nine combinations of interspecific pairs of high-diversity populations. Table 2 presents the distribution of polymorphic sites for the population pair characterized by the highest proportion of species-wide diversity (96.4% for CAN and 90.0% for MOQ). This locus-by-locus tabulation clearly shows two features of the multilocus data that are found in all interspecific population comparisons: all loci exhibit multiple shared polymorphisms while there are only very few fixed interspecific differences at one or two of the loci (Table 2). Table 3 summarizes the polymorphic site counts for all nine comparisons, and these data emphasize the ubiquity of shared polymorphisms and the paucity of fixed differences (Sf between two and seven for these interspecific comparisons based on single populations per species).
Distribution of polymorphic sites in the comparison of the most variable populations (Canta–Moquegua)
Summary of polymorphic site counts in 12 interspecific sample comparisons (S. peruvianum–S. chilense)
Parameter estimates and model fitting:
Supplemental Table 1 (http://www.genetics.org/supplemental/) presents WH parameter estimates for all nine pairwise (interspecific) population comparisons. Across these nine comparisons, the estimates of ancestral population size (θA) are comparable to those for the S. chilense samples (θ2), whereas the effective size of S. peruvianum (θ1) is estimated to be larger. The WH parameter estimates have broad confidence intervals (supplemental Table 1). The means of simulated values are generally very close to the point estimates, except that the simulated θ's for S. peruvianum samples are often somewhat higher than the corresponding point estimates (data not shown). A notable result is that the T estimates (= τ/θ1) imply very recent divergence from a common ancestor, in the range of 0.11–0.20 units (where time is measured in units of two Ne S. peruvianum generations).
Coalescent simulations implementing the estimated recombination levels did not reject the simple isolation model, as neither the Wang–Wakeley–Hey (wwh, Wang et al. 1997) nor the χ2 (Kliman et al. 2000) test statistics approach significance in any of the pairwise interspecific comparisons. In particular, the generally high values of Pwwh reflect the absence of even moderate differences in the proportions of Ss and Sf among loci (Table 2), while the fairly low value for the Nazca–Quicacha comparison reflects additional deviations due to variable levels of exclusive polymorphisms among loci for these two samples (supplemental Table 1 at http://www.genetics.org/supplemental/). Summarizing these contrasts between observed data and expectations under the simple WH isolation model, it is evident that our data fit the model surprisingly well; this is mainly due to the fairly even distribution of the polymorphic site classes across loci.
We also ran these analyses on three pooled data sets, which were motivated as follows. Arunyawat et al. (2007) found that genetic differentiation among the S. chilense populations Tacna and Moquegua is very low (FST < 0.02). In addition, the two most polymorphic S. peruvianum populations (Tarapaca and Canta; Table 1) exhibit the least differentiation within that species (FST ≈ 0.09). Additional considerations were the need to observe fixed interspecific differences in these pooled samples as well as high proportions of species-wide diversity (πw/πb). Hence, we generated a pooled TAC/MOQ (S. chilense) sample and contrasted it with both CAN and a pooled TAR/CAN sample (S. peruvianum); additionally, the pooled TAR/CAN sample was contrasted with MOQ. The number of fixed differences for these interspecific contrasts across all seven loci is two (Table 3); pooling other populations eliminates fixed differences entirely, as was observed for the species-wide data set.
Table 4 presents the WH estimates and goodness-of-fit statistics for these pooled sample contrasts. As was observed for the single-population comparisons, the demographic estimates imply a roughly stable population size for S. chilense compared to the ancestral species, while S. peruvianum appears to have undergone a population-size expansion. Likewise, coalescent simulations do not reject the simple isolation model, again reflecting the fairly even among-locus distributions of the polymorphic site classes (data not shown). Compared to the single-population contrasts, variances for the population mutation parameters of the extant taxa (θ1 and θ2) tend to be lower (Table 4), probably reflecting additional power due to the higher number of sequenced alleles/SNPs.
WH parameter estimates and isolation model fitting
Because an alternative approach to estimating the size of the ancestral species and the scaled divergence time has been suggested for cases of very recent divergence (i.e., lack of fixed interspecific differences; Wakeley and Hey 1997), we also calculated the minimum and average number of pairwise interspecific sequence differences [min (kij) and d12, respectively; Table 5]. From the data summed over all loci, we obtained estimates of θA = 91.7 and τ = 51. This alternative estimate of θA is very close to the conventional WH estimates, whereas the alternative estimate of τ represents a roughly twofold increase over most of the conventional WH estimates (Table 4, supplemental Table 1 at http:/www.genetics.org/supplemental/).
Minimum and average interspecific sequence differences
Linkage disequilibrium test of postdivergence gene flow:
We additionally tested for postdivergence gene flow by analyzing patterns of intragenic LD, using information that remains completely untapped by the multilocus goodness-of-fit test that is often the only assessment of the suitability of the isolation model. Table 6 presents empirical data and simulation results using the LD test statistic x, which should exhibit more positive values under a scenario of postdivergence gene flow. Five of six mean observed x values are significantly elevated, with stronger signals seen in the most inclusive sample contrast (TAR/CAN vs. TAC/MOQ) and the allopatric contrast CAN vs. TAC/MOQ. Locus CT066 exhibits significantly high x values for all samples, and locus CT166 shows significant or nearly significant x values for several of the contrasts. Finally, all three S. chilense samples show marginally significant x values for locus CT179, whereas there is no LD-based signal of gene flow at any other locus (Table 6). We acknowledge that after applying the conservative Bonferroni correction to accommodate multiple testing, none of the observed x values at individual loci would be considered significant. However, a binomial test reveals that the probability of observing four or more P-values < 0.07 (of 14 tested locus × sample combinations for each of the three sample contrasts shown in Table 6) by chance is very low (P < 0.014), providing evidence that patterns of intragenic LD are incompatible with divergence in isolation.
Linkage-disequilibrium test of postdivergence gene flow
In addition to the three pooled-sample comparisons, we also subjected the individual population contrasts to the LD test of postdivergence gene flow. Figure 2 plots the P-values for individual loci in both species, expressed as medians of the individual population contrasts. Consistent with the results for the pooled-sample comparisons, median P-values for loci CT066 and CT166 are between 0.030 and 0.124, reflecting significant or marginally significant values of the LD test statistic x in many single-population contrasts. These results suggest bidirectional interspecific gene flow following initial species divergence, whereas there is no compelling evidence for such a scenario at the other loci (Figure 2). We stress that the expected values of the LD test statistic x were obtained via coalescent simulations implementing the estimated demographic and divergence-time parameters for each data set. Because the τ estimates from the conventional WH analyses are expected to be underestimates due to violations of the panmixia assumption (see discussion), the LD-based results should be regarded as conservative because the timescale to erode any LD signal of historical gene flow by recombination was probably larger than simulated.
Scatter plot of average locus-specific rates of recombination and median P-values for single-population contrasts (LD test of gene flow). Recombination is expressed as γ per site, and the locus-specific values are the means obtained for the S. peruvianum populations (i.e., both species are plotted with the S. peruvianum recombination estimates). The P-values are locus-specific medians obtained from up to nine interspecific population contrasts. The lower left area of the plot contains the data for loci CT066 and CT166 (cf. Table 6).
DISCUSSION
The principal limitation of our previous study of speciation scenarios in wild tomatoes was the availability of only single populations per species (Städler et al. 2005). Here we have shown that consistent demographic estimates under the WH isolation model (Wakeley and Hey 1997; Wang et al. 1997) are obtained when high-diversity populations of both S. peruvianum and S. chilense are chosen to portray their respective genealogical histories. In contrast, our original S. chilense sample (Antofagasta) can now be interpreted as reflecting a local or regional bottleneck (on the basis of unusually low levels of polymorphism and strong haplotype structure) that is certainly not characteristic of the species-wide demographic history following speciation. Consequently, relying on such samples would result in misleading WH parameter estimates, as was anticipated in our previous study (Städler et al. 2005). Our current demographic estimates imply that neither extant species underwent a strong bottleneck at the time of (or following) lineage divergence and thus appear to rule out peripatric (peripheral isolates) speciation. The inferred historical demography, however, appears to be consistent with either initial range fragmentation or other speciation modes not relying on population bottlenecks.
Inferring demographic history and divergence time under population subdivision:
Coalescent models for subdivided species are a useful guide for interpreting the demographic inferences and identifying possible biases of our analyses under the WH framework. Wakeley (1999) introduced the distinction between the “scattering” and the “collecting” phase of the coalescent process in a subdivided population (island model). The brief scattering phase traces the genealogy of a local sample until all remaining lineages (looking backward in time) are located in different demes, and it is characterized by local coalescent events and migration events to different demes. The timescale of the ensuing, much longer collecting phase depends on the rate of migration between demes and the number and size of demes in the total population, in that ancestral lineages can coalesce only when they occupy the same deme (Wakeley 1999, 2001; Wakeley and Aliacar 2001). Given that we have sampled multiple gene copies per population, we have to consider effects of the scattering phase and the potential nonexchangeability of the sampled sequences.
Our high-diversity local samples contain significant proportions of the species-wide diversity, as suggested by the πw/πb ratios (Table 1) and the correspondingly moderate estimates of population differentiation in both S. peruvianum (FST ≈ 0.138) and S. chilense (FST ≈ 0.188; excluding the low-diversity samples ARE and ANT, respectively; Arunyawat et al. 2007). This implies that local coalescent events during the scattering phase are fairly rare. In this regard, some of our local samples (in particular CAN and MOQ) approximate species-wide samples where single sequences are obtained from a larger number of demes. However, we have noted that pooling of more than two populations per species eliminates all fixed interspecific differences, while single-population comparisons and selected pooled-sample contrasts show between two and seven fixed differences (Table 3). We thus “see” the expected effects of (i) underestimating the within-species diversity using single-deme samples and (ii) underestimating divergence between species when considering species-wide samples (Wakeley 2000, 2003; Ingvarsson 2004); both are consequences of the distribution of diversity within and between local demes under population subdivision and the (potentially) higher species-wide effective population size under restricted migration (Slatkin 1987; Strobeck 1987; Whitlock and Barton 1997; Pannell 2003).
We have attempted to counter the first effect by evaluating several pooled samples (constrained by the need to observe fixed differences in the data as well as high proportions of the species-wide diversity) and have calculated alternative estimates of θA and τ that should be less biased under population structure. All estimates of θA agree that the ancestral species had a fairly large effective size (θA ≈ 90) while the estimator of τ based on the minimum number of interspecific pairwise differences (τ = 51) is almost twice as large as most WH estimates (Table 4, supplemental Table 1 at http://www.genetics.org/supplemental/). Because gene divergence necessarily predates species divergence, this higher estimate of τ should be upwardly biased, but postdivergence gene flow would, of course, have the opposite effect. Using a weighted estimate of the neutral mutation rate (5.1 × 10−9/site/year; Roselius et al. 2005) and the per-site τ estimate (51/9093 ≈ 0.00561), a rough estimate of the divergence time for these species is 0.55 million years.
On the basis of the same multilocus sequence data, our observations of significantly negative Tajima's (1989) D and/or Fu and Li's (1993) D values in pooled samples can be explained only by demographic/range expansion in both species (Arunyawat et al. 2007). The signature of species-wide expansion is expected to be evident only under high rates of gene flow (when samples are taken from single demes; Ray et al. 2003) or generally in pooling population samples or using species-wide samples (Arunyawat et al. 2007). The reason is that, depending on levels of gene flow, local-sample genealogies are distorted due to scattering-phase coalescent events (leading to short external branches; Ray et al. 2003; De and Durrett 2007) whereas pooling has effects equivalent to adding migrants (i.e., unrelated individuals yielding sample genealogies with longer external branches).
Given these expectations, it is unclear why the WH estimates for the extant species (θ1 and θ2) do not reflect the marked expansion inferred from other features of the data (Arunyawat et al. 2007), not even for the most inclusive sample contrast TAR/CAN–TAC/MOQ. Part of the explanation could be that, given the low estimates of τ (whether τ ≈ 27 or 51), there is not as much information about the descendant species in the data as about their common ancestor, unlike under much longer divergence times (Wakeley and Hey 1997). The “boundary phenomenon” of observing no fixed differences in many pooled sample contrasts is likely to be another (yet related) factor that contributes to underestimating θ1 and θ2 in our WH analyses; it also prevented systematic evaluations of randomly generated multi-deme samples. While this particular constraint would not be expected in comparisons of more diverged species, even species-wide samples obtained from subdivided species do not satisfy the critical assumption of panmixia in both the WH and IM models of divergence; this seems underappreciated in empirical studies. There is a need for more realistic models of speciation that explicitly take population subdivision into account in extracting signals of species' demographic history from sequence data.
Fit of the isolation model:
Despite the fact that the genealogical histories of our samples violate the model's assumption of within-species panmixia, we found that the data fit the simple isolation model quite well, which is primarily a consequence of observing similar proportions of exclusive polymorphisms, shared polymorphisms, and fixed differences across loci (Tables 2 and 3). An evaluation of multilocus studies using the WH approach suggests that only very recent or current introgression may allow formal rejection of the isolation model, making this a very conservative test. For example, several multilocus DPG studies were able to reject the isolation model due to large differences in patterns of shared polymorphisms and fixed differences among loci, a signature attributed to recent interspecific gene flow at some (but not all) regions of the genome (Machado et al. 2002; Besansky et al. 2003; Ramos-Onsins et al. 2004). However, unless multilocus studies happen to include both types of loci, the (similar) proportions of observed site categories across loci may indicate a good fit of the isolation model, even under postdivergence or recent interspecific gene flow at all loci. This highlights the very conservative nature of the goodness-of-fit criteria in assessing the isolation model.
Sources of shared polymorphisms and signatures of historical gene flow:
In addition to reflecting truly ancestral mutations as envisaged under the isolation model, shared polymorphisms between recently diverged taxa can arise through introgression subsequent to species divergence, a biologically plausible process under parapatric speciation or upon secondary contact after some divergence in allopatry. One approach that may be informative about the historical source of mutations yielding shared polymorphisms among species uses the proportion of shared polymorphisms among all segregating sites. For the pooled S. chilense sample TAC/MOQ, this proportion is 80/348 (= 0.230) in contrast with the sympatric S. peruvianum sample TAR, whereas it is 83/336 (= 0.247) when compared to the allopatric CAN sample. Inspection of Table 3 reveals that these proportions are generally higher in single-population comparisons. Similar patterns hold when the focal species is S. peruvianum: in contrast with the S. chilense sample TAC–MOQ, this proportion is 80/302 (= 0.265) for the sympatric S. peruvianum sample TAR, whereas it is 83/363 (= 0.229) for the allopatric CAN sample. This latter, somewhat lower proportion, however, is unlikely to be due to consequences of allopatry but instead reflects the higher level of (exclusive) polymorphism in CAN (Table 3). Overall, these patterns indicate that shared polymorphisms tend to be geographically widespread in both species (whereas many exclusive polymorphisms are geographically restricted and overall rare), consistent with them reflecting ancestral mutations. These signatures appear to make a scenario of very recent introgression in areas of current sympatry unlikely, which is in agreement with the strong postzygotic barriers revealed in crossing experiments (Rick and Lamm 1955; Rick 1979, 1986; and see below).
A second approach for probing the historical genesis of shared polymorphisms among recently diverged species is implemented in the LD-based test of gene flow (Machado et al. 2002). Using coalescent simulations under the inferred demographic and divergence-time parameters, we found significantly elevated mean values of the LD test statistic x in five of six pooled-population contrasts (Table 6), as well as in several single-population comparisons (data not shown). More importantly, two loci exhibit consistently high x values across comparisons, reflecting stronger LD for a subset of intragenic LD than expected under isolation-model conditions (Figure 2). We emphasize that the coalescent simulations were run with what are likely underestimates of the true divergence time (see above), implying that the P-values are conservative in assessing the probabilities of the observed x values. Moreover, our seemingly incongruent results (failure to reject speciation by isolation due to the good fit of the multilocus distribution of the four classes of segregating sites vs. evidence for postdivergence gene flow based on patterns of intragenic LD) are not hard to reconcile, given the very conservative behavior of the goodness-of-fit tests and the different aspects of the data considered by each approach.
We argue that, in general, nonsignificant mean values of the LD test statistic are less informative than the data for single loci because a significantly high x at one or more loci might be offset by low values at other loci, conceivably yielding a nonsignificant mean x. Under this rationale, there appears to be evidence for bidirectional gene flow at loci CT066 and CT166. These two loci were among those suggestive of interspecific gene flow in our initial study using single populations per species (Städler et al. 2005). Importantly, this genealogical signal is not restricted to regions of current sympatry, as best seen in the contrasts involving the allopatric CAN population (Table 6). The Ss/Stotal ratios discussed above are in excellent agreement with this geographically dispersed signature of postdivergence gene flow.
Figure 2 displays the relationship between LD-based signatures of gene flow and estimates of the population recombination parameter γ across loci. As a group, the two loci with low P-values (i.e., high observed values of the test statistic x) are characterized by an average level of recombination that is ∼50% of that for the five other loci, although there is a lot of scatter for this latter group. Because LD characterizing a given suite of SNPs gradually decays over time because of recombination, any LD-based signature of gene flow should be maintained longer in regions of low recombination. However, many studies concur that the initial build-up of reproductive isolation is likely to involve a few loci or genomic regions (with those regions being protected from introgression), while most of the genome remains permeable to interspecific gene flow for much longer (Rieseberg et al. 1999; Turner et al. 2005; Machado et al. 2007; Yatabe et al. 2007). Accordingly, the observed scatter in the LD-based data when arranged by estimated recombination rate appears to be compatible with postdivergence gene flow at only some loci or with the cessation of gene flow at different points in time for several genomic regions (loci).
Our evidence for postdivergence gene flow between these taxa implies that some of the polymorphisms shared between S. peruvianum and S. chilense do not represent genuine ancestral mutations. Unlike some other large-scale DPG studies that found sharing of entire haplotypes between species with partially overlapping ranges (Machado et al. 2002; Besansky et al. 2003; Ramos-Onsins et al. 2004), our evidence for introgression is much more subtle and simultaneously more difficult to uncover. The implications of having found evidence for a divergence-with-gene-flow model of speciation are, perhaps, also more interesting than for species that continue to exchange genes.
Implications of patterns of postzygotic reproductive isolation:
Most accessions of S. peruvianum that have been tested are isolated from S. chilense by strong intrinsic postzygotic incompatibilities in that usually high incidences (≈97%) of embryonic breakdown were observed in experimental crosses (Rick and Lamm 1955; Rick 1979, 1986). These results have been obtained under noncompetitive interspecific pollination, and it is thus unknown if post-pollination or other prezygotic barriers contribute to reproductive isolation in sympatry. Interestingly, the postzygotic barrier is strongest in regions of sympatry and farther south, where only S. chilense occurs. Experimental crosses using S. peruvianum sensu lato accessions from northern Peru, however, yielded partially fertile F1 hybrids in appreciable frequencies (Rick and Lamm 1955; Rick 1986). Most of these northern Peruvian accessions have been proposed to constitute the novel species S. arcanum (Peralta et al. 2005; Spooner et al. 2005), but the striking feature of the strongest reproductive barriers concurrent with sympatry and low molecular divergence remains.
Our study provides a conservative estimate of divergence time (0.55 million years) that emphasizes the recency of speciation, and subsequent range expansions inferred for both taxa (Arunyawat et al. 2007) can explain both the geographically widespread signal of postdivergence gene flow and the strong reproductive isolation revealed in crosses using allopatric accessions (e.g., S. chilense from the southern part of its range). These combined biological facets are consistent with reinforcement of reproductive isolation; in taxa with substantial maternal provisioning of offspring, postzygotic barriers might arise by direct selection for hybrid inviability (Grant 1966; Coyne 1974; Wallace 1988; Coyne and Orr 2004). Our LD-based evidence for postdivergence gene flow is fully consistent with the rapid evolution of near-complete postzygotic isolation and implies the involvement of natural selection as one of the forces governing species divergence. As with all studies using a DPG approach, the inference of natural selection in the history of species divergence rests on evidence of nonallopatric speciation (i.e., evidence for postdivergence gene flow) and thus is necessarily indirect. Complementary approaches will be required to demonstrate (past) natural selection on genes or genomic regions directly involved in important trait differences or reproductive incompatibility among species, as discussed elsewhere (Städler et al. 2005; Noor and Feder 2006).
Conclusions:
Our extensive study of the two closely related wild tomato species S. peruvianum and S. chilense has uncovered evidence for species divergence under residual gene flow. The geographically dispersed signature of postdivergence gene flow (i.e., not restricted to regions of current sympatry) is consistent with introgression and subsequent spread through much of the species' ranges, either via range expansions or intraspecific gene flow. More generally, divergence under residual gene flow implies a period of secondary contact following some divergence in allopatry, if not a wholly parapatric or sympatric mode of speciation. The demographic estimates under the WH isolation model imply a modest population (or range) expansion for S. peruvianum and an effective size for S. chilense similar to that of the ancestral species. However, these WH estimates of demographic history are believed to be biased, as other aspects of the multilocus data provide clear evidence for marked (range) expansions in both taxa. These biases may be jointly caused by particular properties of our data set (lack of fixed differences) and the complexities of the coalescent process in subdivided populations that are not accounted for in current DPG implementations.
Acknowledgments
We thank Gertraud Feldmaier-Fuchs for expert laboratory assistance and Bernhard Haubold and Peter Pfaffelhuber for providing code and valuable input on coalescent simulations. We are most grateful to Asunción Cano for essential logistic and administrative help in Lima and to Gabriel Clostre and Tobias Marczewski for assistance throughout the collection trip in Peru. Collection and export of tomato samples was made possible through permits issued by the Peruvian Instituto Nacional de Recursos Naturales (INRENA), authorization nos. 020-2004-INRENA-IFFS-DCB and 003754-AG-INRENA. This work was funded by the Deutsche Forschungsgemeinschaft through its Priority Program “Radiations—Origins of Biological Diversity” (SPP-1127), grant Ste 325/5 to W.S., and by a fellowship from the Deutscher Akademischer Austauschdienst to U.A.
Footnotes
- Received September 11, 2007.
- Accepted October 18, 2007.
- Copyright © 2008 by the Genetics Society of America