Nucleotide variation in eight effectively unlinked genes was surveyed in species-wide samples of the closely related outbreeding species Arabidopsis halleri and A. lyrata ssp. petraea and in three of these genes in A. lyrata ssp. lyrata and A. thaliana. Significant genetic differentiation was observed more frequently in A. l. petraea than in A. halleri. Average estimates of nucleotide variation were highest in A. l. petraea and lowest in A. l. lyrata, reflecting differences among species in effective population size. The low level of variation in A. l. lyrata is concordant with a bottleneck effect associated with its origin. The A. halleri/A. l. petraea speciation process was studied, considering the orthologous sequences of an outgroup species (A. thaliana). The high number of ancestral mutations relative to exclusive polymorphisms detected in A. halleri and A. l. petraea, the significant results of the multilocus Fay and Wu H tests, and haplotype sharing between the species indicate introgression subsequent to speciation. Average among-population variation in A. halleri and A. l. petraea was ∼1.5- and 3-fold higher than that in the inbreeder A. thaliana. The detected reduction of variation in A. thaliana is less than that expected from differences in mating system alone, and therefore from selective processes related to differences in the effective recombination rate, but could be explained by differences in population structure.
MULTILOCUS analyses of nucleotide variation in closely related species not only reveal variation that has accumulated independently in each species since their divergence, but also may detect variation originally segregating in their common ancestor, which has not yet been lost. Indeed, early in the process of speciation, newly formed species share polymorphisms due to recent divergence from a common ancestor (Clark 1997; Wakeley and Hey 1997). As time goes on and species diverge further, polymorphisms that were once shared are lost or fixed due to selection or genetic drift, and new mutations arise independently in each species. The effects of these processes are observed late in speciation as the loss of shared variants between species and the accumulation of fixed differences between species. Between these two extremes, there is an intermediate stage where some loci share polymorphisms, while other loci exhibit fixed differences between species.
Speciation models have been developed to describe the expected behavior of genetic polymorphism within and between species following divergence (Wakeley and Hey 1997; Klimanet al. 2000; Machadoet al. 2002). In the isolation speciation model (Hey and Kliman 1993; Hiltonet al. 1994), an ancestral population splits into two descendant species with no subsequent gene flow. Population and speciation parameters can be inferred using this model from the observed numbers of shared and exclusive polymorphisms and fixed differences (Wakeley and Hey 1997; Wanget al. 1997). This type of analysis has been extensively applied to data from closely related Drosophila species (e.g., Hiltonet al. 1994; Klimanet al. 2000) and has recently been used in tomato (Baudryet al. 2001) and in maize species (Tiffin and Gaut 2001). Here, we use divergence population genetics (Klimanet al. 2000) to study speciation in closely related species of Arabidopsis.
Arabidopsis halleri and A. lyrata are closely related out-crossing species whose divergence from the inbreeder A. thaliana is estimated to have occurred ∼5 MYA (Kochet al. 2000). A. halleri ssp. halleri (henceforth A. halleri)is present in Central Europe, whereas A. lyrata ssp. petraea (henceforth A. l. petraea) is patchily distributed in Northern and Central Europe, and A. lyrata ssp. lyrata (henceforth A. l. lyrata) is found only in North America (Mitchell-Olds 2001; Mitchell-Olds and Clauss 2002). A. halleri and A. l. petraea distribution areas overlap in some Central European areas (Hegi 1986) and have been shown to produce semifertile hybrids in the laboratory (Macnairet al. 1999). On the other hand, A. l. petraea and A. l. lyrata are two subspecies of A. lyrata that are completely interfertile but have an allopatric distribution. Here, we study nucleotide variation in eight unlinked genes in species-wide samples of the closely related species A. halleri and A. l. petraea and in three of these genes in A. l. lyrata. We aim to establish whether the species A. halleri and A. l. petraea (and also the subspecies A. l. petraea and A. l. lyrata) have remained isolated since their split.
This work also aims to quantify species-wide levels of variation in the studied outbreeding Arabidopsis species and to establish, through comparison with A. thaliana, the effect of mating system and population structure on intraspecific variation. In mutation-drift equilibrium, differing levels of variation might indicate differences in either the effective population size or the neutral mutation rate. They might also reflect differences in mating system, which influence effective population size (Pollack 1987; Nordborg and Donnelly 1997) and overall rates of recombination (Golding and Strobeck 1980). The reduced effective recombination in species with a high level of inbreeding is expected to increase the effects of genetic hitchhiking accompanying selective sweeps (Maynard Smith and Haigh 1974; Kaplanet al. 1989) and background selection (Charlesworthet al. 1993) and may greatly influence genetic variability in inbreeding species (Nordborget al. 1996). Meta-population dynamics also have important effects on levels of genetic variation, and variability levels may even be quite similar in selfing and outcrossing species (Ingvarsson 2002). Although some studies have compared some of these species using microsatellite data (e.g., van Treurenet al. 1997; Westman and Kresovich 1998; Clausset al. 2002), to date few studies have used intraspecific DNA variation to investigate the effects of mating system in plants. In some comparisons, they have revealed the expected reduction of species-wide nucleotide polymorphism in inbreeders relative to outbreeders (Liu et al. 1998, 1999; Baudryet al. 2001; Clauss and Mitchell-Olds 2003; Wrightet al. 2003), but not in others (Savolainenet al. 2000).
MATERIALS AND METHODS
Species sampling: Table 1 describes the geographic origin of the individuals of each species used in this study and the genes sequenced for each individual. DNAs of most A. halleri and A. l. petraea individuals were kindly provided by Pierre Saumitou-Laprade and Helmi Kuittinen, respectively. In other cases, seeds were obtained from T. Mitchell-Olds' personal collection (some A. halleri and A. l. petraea individuals and all A. l. lyrata individuals) and from the Nottingham Arabidopsis Stock Centre (A. thaliana ecotypes Col-0, Mrk-0, Kas-1, Lip-0, Ler, and Ll-0; Nottingham, United Kingdom). Seeds were planted and grown in growth chambers under standard 16-hr light and 8-hr dark conditions. Leaves were harvested from individual plants prior to flowering, and genomic DNA was extracted as described (Rogers and Bendich 1985; Ausubelet al. 1997).
PCR amplification and sequencing: The genes studied code for enzymes in the phenylpropanoid pathway (CAD, cinnamoyl alcohol dehydrogenase; CHI, chalcone isomerase; CHS, chalcone synthase; DFR, dihydroflavonol reductase; F3H, flavanone-3-hydroxylase; and FAH-1, ferulate-5-hydroxylase), which may influence resistance to UV-radiation, and enzymes involved in glucosinolate biosynthesis (GS, glucosyltransferase; and MAM-L, methylthioalkylmalate-like synthase), and thus in plant defense against insects. The PCR amplification primers used, amplification conditions, and description of the fragment amplified for each of the eight genes studied are as described in Kuittinen et al. (2002). In those cases where the length of the amplified fragment was >600 bp, additional sequencing primers spaced at ∼400-bp intervals were designed.
In A. thaliana, sequences (one per ecotype) were obtained on both strands by direct sequencing of PCR products. In the other species, two sequences were obtained per individual. For the CHS, GS, and MAM-L loci, the PCR products were run on agarose gels, the appropriate-sized bands were excised, and the DNA was extracted from the gel slices using the QIA-quick gel extraction kit (QIAGEN, Valencia, CA). The purified fragments were subsequently cloned into a pCRII vector using the TOPO-TA cloning system (Invitrogen, Carlsbad, CA). After transformation of TOP10F bacteria, plasmid DNA was extracted from individual transformed colonies using a plasmid DNA isolation kit (QIAGEN). The M13 universal and M13 reverse primers, as well as internal locus-specific primers, were used for sequencing. Both strands of 4–18 plasmids per locus per plant were sequenced using the Big Dye Terminator 2.0 kit (Applied Biosystems, Foster City, CA) and run on an ABI 3700 sequencer (Applied Biosystems). For the CAD, CHI, DFR, F3H, and FAH1 loci, sequences were obtained by direct sequencing of PCR products after their purification with QIA-quick columns (QIAGEN); in all lines with two or more polymorphisms, the cloning and sequencing strategy indicated above was used.
Data analyses: Contigs were assembled using Sequencher (Gene Codes, Ann Arbor, MI), SeqMan (DNASTAR, Madison, WI), and SeqEd version 1.0.3 program (Applied Biosystems). Multiple sequences for each gene were aligned using either a combination of methods implemented in Megalign (DNASTAR) or ClustalW (Thompsonet al. 1994), with some additional manual alignment. Sequences were edited with the MacClade (version 3.06 or 4.0) program (Maddison and Maddison 1992, 2000).
The DnaSP version 3.98 (Rozas and Rozas 1999) and the SITES (Hey and Wakeley 1997) software packages were used for most analyses of DNA polymorphism and to perform some neutrality tests. The program HKA (distributed by Jody Hey through http://lifesci.rutgers.edu/heylab) was used to perform multilocus HKA tests and to estimate population parameters. The program MEGA version 2.0 (Kumaret al. 1994) was used for phylogenetic reconstruction. The program WH (as described in Wanget al. 1997) was used to test the isolation model of speciation.
Nucleotide variation was estimated as nucleotide diversity, π (Nei 1987). The level of variation between locations (πB), which refers to the average number of pairwise differences between geographic locations, was used as a measure of species-wide diversity (as proposed by Wakeley and Aliacar 2001). Nucleotide diversity within locations (πw) was also estimated. Note that our πB does not refer to net variation between locations, i.e.,to πT –πw (Nei 1987). Synonymous and nonsynonymous variation was estimated according to Nei and Gojobori (1986). Insertion/deletion polymorphisms were excluded from the analyses. Hudson's Snn statistic was used to test for population differentiation (Hudson 2000).
Several tests were used to detect possible departure from the predictions of an equilibrium neutral model: tests based on either the frequency spectrum of polymorphisms or the haplotype distribution (Tajima 1989; Fu and Li 1993; Fu 1996; Fay and Wu 2000) and tests based on the relationship between intraspecific and interspecific variation (Hudsonet al. 1987; McDonald and Kreitman 1991). Segregating sites that included multiple hits were excluded from the calculation of the neutrality tests.
For the Fay and Wu H test (Fay and Wu 2000), coalescent simulations that included recurrent mutation were performed, which results in a more conservative test. The coalescent simulations were done using the level of variation (π) rather than a fixed number of mutations. Analyses were performed with total and with silent positions, using both an unbiased and a biased transition/transversion ratio (ratio values equal to 0.5 and 2.0, respectively). Time of divergence to the outgroup was estimated taking into account the ratio of divergence to nucleotide diversity (K/π). Therefore, the time to isolation (Td) of the outgroup sequence was calculated as ((K/π) – 1)/2 (Hudsonet al. 1987), where Td is expressed in 4Ne generations and Ne is the effective population size of the studied population.
The nonparametric sign test (Sokal and Rohlf 1995) was performed to detect positive (or negative) tendencies across loci, for a given statistic. After performing coalescent simulations for each locus, conditioned on the number of segregating sites, the median value of the test statistic under consideration (e.g., Tajima's D) was obtained for each locus. The observed number of loci with positive or negative values can be compared to the null hypothesis, where positive or negative values both have probability (p) of ½ under neutrality. The multilocus P value was then obtained by calculating the cumulative binomial under the assumption of p = 0.5.
Sampling strategy for neutrality tests: The null hypothesis of most tests of neutrality assumes panmixia in addition to mutation-drift equilibrium. Deviations from panmixia, e.g., population subdivision, can thus lead to rejection of the null hypothesis even if neutrality holds. Since sequences in the present study were obtained by sampling within and between locations throughout each species' distribution area, we tried to minimize the effect of population subdivision in the neutrality tests. If no genetic differentiation was detected for a particular locus and species, the sample was considered to be panmictic, and all sequences were used in neutrality tests. Otherwise, a single randomly chosen sequence per location (constructed sample) was used, and all (or a subset of all) combinations were tested. This mixed sampling strategy uses all the information available whenever there is no evidence for population subdivision and minimizes the effect of subdivision on neutrality tests if subdivision is present. The use of a single sequence per deme conforms to the assumptions of a standard equilibrium model, where θ= 4Neμ and Ne is rescaled as a function of population structure (Wakeley and Aliacar 2001). To determine whether migration among demes is correlated with geographic distance, a Mantel (1967) test was used. Tests were performed using the number of nucleotide differences between samples and the corresponding geographic distance measured in kilometers. Geographic distances were obtained with Chris Michels' program (http://jan.ucc.nau.edu/cvm/latlongdist.html). The Mantel tests were performed with the zt version 1.0. program (Bonnet and Van de Peer 2002).
Multilocus tests of neutrality: Neutrality tests using multilocus data from a single species were performed following Hey (as detailed in the HKA program) for Tajima's (1989) DT, Fu's (1996) Fs, Fu and Li's (1993) DF and FF, and Fay and Wu's (2000) H tests. When, as in this example, loci are either not linked or loosely linked, then the history of each locus can be considered to be independent and forces affecting one particular locus should not affect other loci. Under strict neutrality and complete independence between loci, the effective population size (Ne) would be the only population parameter shared by all loci. The multilocus test statistic (m[DT], m[Fs], m[DF], m[FF], and m[H]) was obtained, in each case, by summing across loci the corresponding test statistic values for each locus and dividing by the total number of loci, which gives them all the same weight. The empirical distribution of the multilocus test statistic under neutrality was obtained from coalescent-based simulations. The simulations were run independently for each locus (with the corresponding sample size) and were conditioned on the number of segregating sites. For each replicate, the multilocus test statistic (e.g., m[DT]) was obtained by averaging the corresponding test statistic (e.g., DT) across loci.
Inferring ancestral variation: The multilocus sequence comparison of two closely related species allows detection of four classes of mutations (excluding sites with multiple hits, Smhits; Figure 1A), which can be grouped into three different categories (Wakeley and Hey 1997): (i) segregating sites where mutations are exclusive to one species (Sx1, Sx2), (ii) fixed differences between species (Sf), and (iii) shared polymorphisms (Ss). Polymorphisms in this latter category can either reflect ancestral polymorphisms or be the result of coincident mutations. The SITES (Hey and Wakeley 1997) software package was used to obtain the observed numbers of mutations in each class.
As depicted in Figure 1B, availability of the homologous sequences from an outgroup species allows partitioning of fixed differences into the two lineages of closely related species (Sf1, Sf2) and also detection of two new classes of mutations (Sf1x2, Sf2x1). These latter classes correspond to mutations that are fixed in one species but still segregate as polymorphisms in the other species. Unlike biallelic variation, multiple hits are grouped into a single category because of the difficulty of unambiguously establishing the direction of mutations. These two new classes of mutations and the shared polymorphisms class constitute, therefore, the third category of mutations, hereafter called ancestral mutations. The numbers of mutations in each class were obtained using the calc_mut program (available from the authors).
The hypergeometric distribution (Rozas and Aguadé 1993; Clark 1997) was used to obtain the probabilities that the observed numbers of ancestral mutations (Ss, Sf1x2, and Sf2x1) were due to recurrent mutation. Probabilities were obtained separately for each class of putatively ancestral mutations and for each locus, considering all positions and only silent positions. The same reasoning as in Clark (1997) was used to obtain expected numbers of shared mutations, with the following reasoning applied to the other two classes: a unique mutation can be assigned to class Sf1x2 (or Sf2x1; Figure 1B) if it were an ancestral mutation, while it could have two possible origins (one segregating mutation in species 2 and one fixed mutation in the outgroup species, or one segregating mutation in species 2 and a fixed mutation in species 1) if it had originated by recurrent mutation. Then, the number of recurrent mutations is the number expected by chance in the categories formed by Sx2 + Sf1x2 vs. Sfout + Sf1 + Sf1x2 given the total number of nucleotide sites in the locus. The probability of having the observed number of Sf1x2 mutations is then obtained using the hypergeometric distribution. Loci with less than five Sf1x2 (and Sf2x1) mutations were considered together.
Population parameters for two extant closely related species were estimated from the observed distribution of variation within and between species (Sx1, Sx2, Sf, Ss) under the simple isolation model of speciation (here named WH) described in Wang et al. (1997). This model postulates a split, at some time in the past, of a panmictic ancestral population into two descendant populations, with no subsequent gene flow between the descendant populations. Under the assumption that variation is neutral and that mutations occur according to the infinite-sites model, diversity in the ancestral (θA) and each descendant (θ1 and θ2) species, as well as the time from speciation (Tsp, measured in Ne generations of one of the descendant species), can be estimated.
Levels of DNA sequence polymorphism and divergence from A. thaliana: In both A. halleri and A. l. petraea, sequences of eight independent genes (Table 2) were obtained from one to two individuals from each of three to five populations (Tables 1 and 3). A subset of these genes was sequenced in samples from A. l. lyrata and A. thaliana (Table 1). Two sequences per individual were obtained for each locus, except in A. thaliana (see materials and methods). Samples consisted of individuals collected across each species' distribution area. In general, each location was represented by one individual, although for A. halleri two individuals from the same location were sequenced in some cases (Table 1). For each species, the length of the region analyzed for each gene varied between ∼500 and 2000 nucleotides; these numbers are, however, lower when all species are considered (Table 2). Figures S1–S8 (at http://www.genetics.org/supplemental/) summarize intra- and interspecific variation.
For each gene and species, nucleotide diversity estimates for total and silent variation are presented in Tables 3 and 4, while estimates of synonymous and nonsynonymous variation are given in Table 5. In A. halleri, A. l. petraea, and A. l. lyrata, estimates of variation within and among locations were obtained in addition to estimates based on all sequences (Table 3). Comparison of the averages of among-population variation between the three outcrossing species yielded similar results to those obtained using all sequences: (i) variation was higher in A. l. petraea than in A. halleri, and (ii) variation was lowest in A. l. lyrata (Table 3). Levels of polymorphism varied across loci, with DFR and GS being the overall most variable genes. Variation was lower at nonsynonymous sites than at synonymous sites both within and among species. In the three taxa, within-population variation was generally a small fraction of total variation. Indeed, significant genetic differentiation between locations was detected in four out of eight genes in A. halleri, in seven out of eight in A. l. petraea, and in all three in A. l. lyrata (Snn, Table 3).
Table 5 shows the variation at synonymous and nonsynonymous sites, as well as the synonymous/nonsynonymous ratio for polymorphism and divergence. Average levels of divergence (number of substitutions per site) between these species and A. thaliana were ∼0.16 for synonymous and 0.015 for nonsynonymous, approximately the same order of magnitude as estimates reported by Wright et al. (2003). The nonsynonymous/synonymous ratios were always less than unity, as expected for purifying selection at nonsynonymous positions. Nonsynonymous/synonymous ratios were generally larger for polymorphism than for divergence.
We compared nucleotide variation in six genes (CHI, CHS, F3H, FAH1, GS, and MAM-L) between A. thaliana (Kuittinen and Aguadé 2000; Aguadé 2001; this study) and the outbreeding species A. halleri and A. l. petraea. The average total species-wide estimates of variation in A. thaliana for the homologous regions surveyed in the outbreeding species were 0.0035 and 0.0061 for total and silent variation, respectively. These estimates were calculated among ecotypes, which are equivalent to the average estimates among locations presented here for the outcrossing species; hence these estimates are directly comparable across taxa. In A. halleri, the average estimates of nucleotide variation in the same six regions were 0.0048 and 0.0097 for total and silent variability, respectively, while they were 0.0096 and 0.0195 in A. l. petraea. We observe that the average levels of variability are ∼1.5–3 times lower in the inbreeding species A. thaliana than in the outcrossing species A. halleri and A. l. petraea. Thus, we clearly observe reduced variation in the inbreeder, a reduction that is concordant at least in sign with the difference in mating system. We obtained analytical results that could also explain the observed levels of variation in species with different population structure and mating system (appendix).
Detecting ancestral variation in A. halleri and A. lyrata: Table 6 presents the distribution of variable sites in the A. halleri/A. l. petraea and the A. l. lyrata/A. l. petraea pairs, using A. thaliana as the outgroup. In the A. halleri/A. l. petraea comparison, most genes harbored variants in the two new classes of putatively ancestral polymorphisms (Sf1x2, Sf2x1), while only three loci exhibited shared polymorphisms (Ss). The two newly proposed classes of mutations make an important contribution to the total number of ancestral mutations: 38 out of 71 mutations in the A. halleri/A. l. petraea comparison (Table 6A) and 31 out of 36 in the A. l. lyrata/A. l. petraea comparison (Table 6B). The possible difference in population size of these subspecies is also reflected in the number of segregating sites in each new category. Indeed, the total number of mutations fixed in the A. halleri branch that still segregate in A. l. petraea (24) is higher than those fixed in A. l. petraea that still segregate in A. halleri (14). This is consistent with A. halleri having a lower effective population size, since the coalescence time is shorter for A. halleri than for A. l. petraea. The same effect can be detected in the A. l. lyrata/A. l. petraea comparison (28 vs. 3; Table 6B).
The hypergeometric distribution (Rozas and Aguadé 1993; Clark 1997) was used to assess whether the number of putatively ancestral polymorphisms detected in the A. halleri/A. l. petraea comparison (Table 6A) could be the result of recurrent mutation. When all positions were considered, recurrent mutation was insufficient to explain the observed number of Sf1x2 mutations in all cases (Table 7A), while significant or nearly significant departures were also detected when considering only silent positions (Table 7A). Similar results were obtained for Sf2x1 mutations (Table 7A). Finally, recurrent mutation was also insufficient to explain the number of shared polymorphisms in each locus that had mutations of this class (Table 7).
In the A. l. lyrata/A. l. petraea comparison, the presence of ancestral mutations in the classes Sf1x2 and Ss (Table 6B) could not be explained by a recurrent mutation process. Indeed, significant or nearly significant departures from the null hypothesis were detected in most cases (Table 7B). The presence of only a single silent Sf2x1 mutation precluded performing the corresponding test.
Neighbor-joining trees (Saitou and Nei 1987) showed alleles from a given species clustering together except in three loci: F3H, GS, and DFR (not shown). Indeed, the pattern observed at the GS locus is suggestive of recent gene flow from A. halleri into A. l. petraea (Figure S7), with one of the two main haplotype classes present in A. l. petraea being nearly identical (across ∼1000 bp) to sequences present in A. halleri. Although these two species shared polymorphisms and had no fixed differences between them at the DFR locus, the distribution of polymorphic sites at DFR did not reveal interspecific haplotype sharing (Figure S4). There was no clearly identifiable region in either DFR or any of the other loci with ancestral mutations that we might have attributed to introgression. However, the heterogeneous patterns observed in GS and the other loci do not preclude the possibility of other, undetected introgressed variants.
Table 8 presents the population parameter estimates from the WH isolation model of speciation (Wakeley and Hey 1997). In the A. halleri/A. l. petraea comparison, estimates of population and speciation parameters were similar for most constructed combinations (results not shown). Diversity in the ancestral population (θA) is estimated to be 105-fold higher than that in extant populations, which would imply a 105-fold difference in effective population size between the ancestor and the descendant species (assuming mutation-drift equilibrium and homogeneous mutation rates). Therefore, the estimated time to speciation was rather low compared to estimates from nucleotide divergence (results not shown). The biological implausibility of the parameters estimated from the WH model points to some violation of the neutrality and/or complete isolation assumptions of the model in the A. halleri/A. l. petraea comparison.
The low number of segregating sites in A. l. lyrata probably affects the estimation of population and speciation parameters in the A. l. lyrata/A. l. petraea comparison (Table 8), and these estimates should therefore be interpreted with caution. Although estimates differed between the two constructed combinations, they suggest a lower effective population size for A. l. lyrata.
Neutrality tests: The implausible parameter estimates produced by the isolation model in the A. halleri/A. l. petraea species comparison may be caused by the violation of assumptions of the model. Natural selection, nonequilibrium demographic processes, or introgression may explain these results. We used neutrality tests for individual loci, as well as several multilocus tests, to test for deviations from a neutral equilibrium model. For individual loci, most tests were nonsignificant, as summarized by the average values of the test statistics over all combinations (Table 9). The only exception was the highly significant Fay and Wu's H statistic for CHS in A. halleri. Indeed, 13 of the 15 polymorphisms in this gene were due to ancestral variants present in a single highly divergent sequence (Figure S2). Also, pairwise MK (McDonald-Kreitman) tests were performed for all loci and species pairs. A marginally significant deviation was detected for CHS in the A. halleri/A. l. petraea comparison (P < 0.049) and for CHS in the A. halleri/A. thaliana comparison (P < 0.047). However, no significant deviation from neutral expectation was detected after Bonferroni correction (results not shown).
Despite the general lack of significance observed, in some tests the deviations from equilibrium neutral patterns were in the same direction (for each particular sample combination and species) in a majority of the genes (Table 9). That was the case for the H statistic, which was negative in seven out of eight genes in A. halleri and in six out of eight in A. l. petraea, and also for the Tajima's D and Fu and Li's D and F in A. l. lyrata. These trends prompted us to perform multilocus tests.
Polymorphism in A. halleri and in A. l. petraea was compared to divergence of each species relative to A. thaliana with the multilocus HKA test. The large number of combinations of constructed samples for all loci (162 × 322 = 2.62 × 105 for A. halleri and 326 × 16 = 1.72 × 1010 for A. l. petraea) precluded performing all possible tests. After performing multilocus analyses for several combinations, it was clear that test results in both A. halleri and A. l. petraea were consistent for most data subsamples (results not shown). Therefore, only results for one combination per species are shown (Figure 2). In the A. halleri/A. thaliana comparison, the HKA test rejected the null hypothesis of proportionality between polymorphism and divergence (P < 0.002), while the test statistic value was close to significance in the A. l. petraea/A. thaliana comparison (P < 0.073). In both cases, the largest contributions to the overall test statistic were due to DFR, GS, and MAM-L (Figure 2). Significant values can be explained by a larger variance in the polymorphism/divergence ratio than that expected under a neutral equilibrium model. This large variance might be attributable to selection on a subsample of the loci or to violation of assumptions of the model. The multilocus HKA test was not significant in the A. l. lyrata/A. thaliana comparison.
We employed multilocus analyses incorporating across-loci averages for each statistic and nonparametric sign tests (see materials and methods). As above, test results were similar for most combinations. In A. halleri, most single and multilocus tests gave nonsignificant results. Indeed, only the multilocus m[H] test for A. halleri yielded a significant result (Figure 3A), as well as the nonparametric sign test (eight H values under the median; P = 0.0039). A similar result was obtained in this species when all tests were performed for the combination that yielded the most conservative value for the m[H] statistic: only the tests using the H statistic were significant. The m[H] test was also significant when the CHS gene, which was highly significant by itself, was excluded. In A. l. petraea, only the nonparametric sign test for the H statistic was significant (seven of the eight H values under the median; P = 0.035). The m[H] value was also under the median, indicating a similar, but nonsignificant, deviation (Figure 3A). Equivalent results were obtained for Fay and Wu's H test considering recurrent variation (see materials and methods), except for the sign test in the species A. l. petraea, where half of possible combinations were nonsignificant. The significant results of the tests for A. halleri and A. l. petraea indicate deviations from the neutral equilibrium model at a multilocus level rather than for single loci. Significant values of the Fay and Wu statistic have been generally explained by positive directional selection, but they can also be the result of recent gene flow (Fay and Wu 2000). Introgression may be the best explanation for these results.
For A. lyrata ssp. lyrata, both data subsamples gave similar results, although m[DF] and m[FF] were significant in only one of the two possible combinations (P = 0.048 and P = 0.035). Figure 3B shows the individual and average multilocus tests for Tajima's D and Fu and Li's D statistics. The same negative pattern was found for all loci, which may indicate population expansion or other demographic processes.
Relationship between A. halleri and A. lyrata: Comparative analysis of nucleotide variation at eight loci in A. halleri and A. l. petraea and at three loci in A. l. petraea and A. l. lyrata can shed light on the process of speciation at different timescales. In general, species recently originated from a common ancestor should present few fixed differences, due to the long time required for mutations to accumulate and become fixed. Genetic variants shared by these species may represent polymorphisms present in their common ancestor that have been maintained either by chance or by some form of balancing selection. However, they also may result from introgression between species or from recurrent mutations.
Our comparative analysis of variation in the two subspecies of A. lyrata (A. l. petraea and A. l. lyrata) is based on only three genes, with very low levels of variation in A. l. lyrata. The observed data can generally be explained by a simple isolation model. The parameters estimated using the WH model indicated a significant reduction of variability in A. l. lyrata, suggesting a bottleneck during the colonization of the New World from Eurasia.
In the A. halleri/A. l. petraea comparison, recurrent mutation was insufficient to explain the observed number of putatively ancestral mutations, indicating that most of them are in fact the result of individual mutation events. The high number of mutations observed in the ancestral class could thus be due either to their persistence in isolated taxa since speciation or to an ancient divergence with some degree of subsequent gene flow between species. Results from the WH isolation model were biologically implausible, which suggests violation of some assumptions in this model, quite possibly neutrality or isolation of species. Also, gene trees of some loci showed that alleles from the same species were not clustering together. Specifically, the GS locus shared haplotypes between these two species, which may be explained by introgression.
By combining single and multilocus tests of neutrality we can distinguish between processes affecting individual genes and forces influencing the entire genome. The multilocus m[H] test showed a significant or nearly significant deviation from the simple neutral model in A. halleri and A. l. petraea, respectively, and significant values for the sign test in both species. Although multilocus tests indicate deviation from neutral equilibrium expectations for the complete set of loci analyzed, the negative value of the H statistic for each locus clearly shows an excess of recent mutations at high frequency at every locus. This generalized excess suggests a multilocus effect that could be the result of demographic factors, independent selective sweeps, or introgression. The sign of the test statistic, the sampling strategy used in the tests (see materials and methods), and the lack of correlation between geographic and genetic distance (results of Mantel tests not shown) exclude population subdivision as the main cause of the detected deviation. It is also highly unlikely that the multilocus deviation was due to positive selection since it would require independent selective sweeps at a majority of the loci studied. Population reduction could also explain a negative tendency of the Fay and Wu H values across the genome (results not shown). If this were the case, this process should be also observed in positive Tajima's D and Fu and Li's D and F values, which is not observed (Table 9). Therefore, we also exclude that hypothesis. Finally, recurrent mutation also cannot explain the observed data, as we included that process in coalescent simulations (see results). Thus, the most parsimonious explanation of the significant results of the multilocus H test in the closely related species A. halleri and A. l. petraea is reacquisition of ancestral mutations by introgression (Fay and Wu 2000).
Selection could also play an important role in the maintenance (or loss) of variation added by introgression. Indeed, the pattern of variation detected in A. halleri and A. l. petraea for some loci could be better explained by introgression (see above), despite that haplotype sharing was not generally observed. This observation, together with the detection in HKA tests of the same genes showing similar deviations from neutral expectations in both species (Figure 2), might be explained more easily by selection sweeping (or maintaining) the additional introgressed variation in some loci and not in others than by the stochasticity associated with genetic drift.
A hybridization event is biologically plausible, as interspecific crosses between A. halleri (pollen) and A. l. petraea have resulted in interfertile F1 individuals (Macnairet al. 1999). Hybrids formed in the laboratory have greatly reduced vigor relative to nonhybrids of both species (M. R. Macnair, personal communication). Thus, although interspecific hybrids have not been reported in nature, rare hybridization events may still occur on an evolutionary timescale.
The existence of gene flow between species alters the pattern of both within- and between-species variation. Indeed, in a simple scenario with two species 1 and 2 that possess mutations only in classes Sx1, Sx2, Sf1, and Sf2 (Figure 1), asymmetrical gene flow from species 1 into species 2 would result in some mutations switching to another class. Some mutations would switch from class Sx1 to either class Ss or Sx2, some from class Sf1 to class Sf1x2, and some from class Sf2 to Sx2. There would be, therefore, a decrease of exclusive variation in species 1 (where fixed and polymorphic mutations would switch to the ancestral mutation class), and an increase in species 2 (by addition of polymorphic mutations). Therefore, one asymmetrical gene flow event between two species with similar effective population sizes (and thus levels of variation) would result in increased variability in the recipient species, a reduction in the number of exclusive polymorphisms in the donor species, and reduction of the estimated time since isolation. On the other hand, reciprocal gene flow would result in more ancestral mutations and fewer fixed mutations in both species, which would increase the apparent level of variation in the ancestral population and reduce the estimated time since isolation. This pattern is consistent with the results of the WH model for the A. halleri/A. l. petraea comparison, supporting, therefore, the hypothesis of introgression occurring after the split of these species. The inclusion of outgroup sequences in our study has dramatically increased the number of ancestral mutations detected (Table 6) and therefore the information regarding the ancestral history of the two species. Speciation models would greatly benefit both from the addition of an out-group species and from considering migration subsequent to species divergence since this would improve the estimates of isolation time and levels of variation in the ancestral and extant species and may improve the power to detect introgression.
Levels of nucleotide variability in A. halleri and A. lyrata: Genetic differentiation between geographic locations can be suggestive of (although not equivalent to) population subdivision (Table 3, Snn tests). In our study, significant genetic differentiation between populations is observed more frequently in A. l. petraea than in A. halleri, indicating that the former species (with a discontinuous distribution extending throughout North and Central Europe) has a higher level of population structure than A. halleri (distributed across Central Europe).
Several factors including migration rate between demes and number of demes might influence a species' effective population size when there is population substructure and might consequently affect the species-wide level of variation (e.g., Whitlock and Barton 1997; Pannell and Charlesworth 1999; Pannell and Barrett 2001; Wakeley 2001; Ingvarsson 2002; Laporte and Charlesworth 2002). Additionally, asymmetrical gene flow between closely related species can cause a differential increase of variation in the two species and, thus, of their effective population size (see above). The average level of genetic variation among populations (total and silent variation in Table 3) is higher in A. l. petraea than in A. halleri. The higher level of variation detected in A. l. petraea than in A. halleri reflects differences in effective population size between the species that could be caused both by differences in their degree of population substructure (as is shown in a simple example in the appendix, Equation A4) and by differential levels of introgression.
The low level of variability detected in A. l. lyrata, relative to A. l. petraea, might be explained by a bottleneck accompanying the relatively recent colonization event from the Old World (Mitchell-Olds 2001). The generally negative values of most multilocus test statistics suggest that population expansion followed colonization. Low levels of variation relative to A. l. petraea have also been observed in six loci studied in two populations of A. l. lyrata (Wrightet al. 2003), in agreement with our data. On the other hand, a survey of variation in the Adh gene detected two main haplotypes in a single population of A. l. lyrata with no indication of population expansion (Savolainenet al. 2000). Although the differing patterns of variation among these studies could be due to differences in sampling strategy, they could also be explained if selection maintained variation in the Adh locus. Multilocus analyses of variation in large samples from multiple populations of A. l. lyrata are necessary to elucidate the demographic history of this subspecies.
Considerations regarding the levels of variation in the outbreeding species A. halleri and A. lyrata relative to the inbreeding species A. thaliana: Comparison of within-species levels of variation in outbreeding and inbreeding species may reveal the effect of mating system on intraspecific variation. If the assumptions of a Wright-Fisher model hold for the two species being compared (except for mating system in the selfing species), a twofold reduction in the effective population size is expected under complete selfing, and consequently the level of variation should be lower in the inbreeding than in the outbreeding species (Pollack 1987; Nordborg and Donnelly 1997). Indeed, Ingvarsson (2002) has shown that the ratio of within-population genetic diversity in selfing vs. outbreeding species is much lower in structured populations. In addition, the ratio of the metapopulation levels of genetic diversity in inbreeding vs. outbreeding species can also be heavily influenced by the structure of the populations. Indeed, total levels of variation could be quite similar in selfing and in outcrossing species if migration and effective population size are much lower in inbreeding species than in outcrossing species (Ingvarsson 2002).
We also developed a simple analytical expression of the expected ratio of species-wide variation in inbreeding vs. outbreeding species under the same hypothetical population structure in both species (appendix). In the absence of population extinction (appendix, Equation A3), the expected ratio is approximated by the expression (½M + 1)/(M + 1), where M is the population migration parameter 4Nm. A twofold reduction would thus be expected under high migration, with a milder reduction under low migration. On the other hand, the effective rate of recombination is greatly reduced in inbreeding species (Golding and Strobeck 1980). High levels of inbreeding are thus expected to increase the effects of genetic hitchhiking accompanying selective sweeps (Maynard Smith and Haigh 1974; Kaplanet al. 1989) and background selection (Charlesworthet al. 1993) and may greatly influence genetic variability in inbreeding species (Nordborget al. 1996). A higher than twofold reduction would thus be generally expected in the standing level of variation in the selfing species relative to the outbreeder.
We have not detected, however, the very drastic reduction of variation in A. thaliana that would be expected from differences in mating system, the effect of gene flow in the outbreeders' levels of variation, and the increase in hitchhiking and background selection effects resulting from effectively reduced recombination in the inbreeder. Indeed, A. thaliana has a worldwide distribution and exhibits a metapopulation structure (Sharbelet al. 2000; Mitchell-Olds 2001), while both A. halleri and A. l. petraea are restricted to Europe and with possibly different degrees of population subdivision. Under neutrality and a reasonable range of parameters (Ingvarsson 2002; appendix), the expected effective population size of the inbreeder could be similar to or higher than that of the outbreeder. Thus, differences in population structure can counteract the effect of mating system on variation, which could obscure the effects of selective processes such as genetic hitchhiking and background selection.
We thank J. Wakeley and W. Stephan for analytical suggestions, M. J. Clauss for helpful comments on the manuscript, H. Kuittinen and P. Saumitou-Laprade for providing several A. halleri and A. l. petraea DNA samples, and J. Kroymann for MAM-L amplification primers and unpublished A. thaliana MAM-L sequences. S.R.-O. and M.A. also thank G. Blasco for her excellent technical assistance and Serveis Científico-Tècnics, Universitat de Barcelona for sequencing facilities. B.S. and T.M.-O. thank D. Schnabelrauch and A. Figuth for outstanding sequencing services and M. Voigt and J. Fritsche for excellent technical assistance. This work was supported by the Bundesministerium für Bildung und Forschung, the Max-Planck Gesellschaft, European Union contract no. QLRT-2000-01097; by U.S. National Science Foundation grant DEB-9527725 to T.M.-O.; by grants PB97-0918 and BMC2001-2906 from Comisión Interdepartamental de Ciencia y Tecnología, Spain, by grant 2001SGR-101 from Comissió Interdepartamental de Recerca i Tecnologia, Catalonia, Spain, and by Departament d'Universitats, Recerca i Societat de la Informació, Catalonia, Spain (Distinció per la Promoció de la Recerca Universitària) to M.A.; and by grant BIO4-CT98-0299 from the European Community to T.M.-O. and M.A.
We are interested in comparing the effective population size (Ne) of two species that exhibit a metapopulation structure but differ with respect to mating system and/or in the degree of subdivision. If we define 2Ne as the expected time of coalescence between two sequences, we can calculate a particular Ne for each species under a simple metapopulation model when considering two autosomal sequences sampled from two individuals in two different demes. In the simple model used, where a diploid, hermaphrodite species is considered, a metapopulation consists of a number D of demes that remains constant over time because every recolonization event is followed by an extinction event. Each deme has an extinction/colonization rate, e, and a constant population size of N diploid individuals. An extinction/recolonization event is defined as a process that creates one population of size N in one generation by recolonization by a single seed. Each individual has a probability m of migrating to any other deme (i.e., this is an isotropic or symmetric and conservative migration model; Wakeley 2001). Migration and recolonization events are assumed to occur by seed dispersal. It is also assumed that generations are discrete. Finally, s is the probability that one individual reproduces by selfing.
According to where coalescence events occur, four states can be inferred from this model: a higher state where events occur between sequences in different demes, a second state between sequences from different individuals in the same deme, a third state between different sequences from the same individual, and the last and absorbent state with a single common sequence. Assuming that migration and extinction rates are reasonably small and N is large enough, the following matrix gives the approximate transition probabilities: (A1)
We are interested in obtaining the expected time of coalescence at the highest level (named ET1), because it can be assumed to be the expected time of coalescence (i.e., the effective population size) at the species level. An accurate simplification of ET1 is then (A2) where M = 4Nm, E = 4eN, and S = (2 – s)/2.
For two species with equal population structure but different mating systems (s = 1 vs. s = 0), the ratio of the effective population sizes at the highest level would be (A3)
Therefore, under the simple model used, the effective population size of a selfing species would be reduced in relation to an outbreeding species, but the reduction in the inbreeder is significant only when M is large and E is not. In that case, the reduction in the inbreeder approaches one-half only when M is large.
For two outbreeding species named 1 and 2 with the same total number of individuals (i.e., D1 · N1 = D2 · N2 = NT), no extinction/recolonization and the same m values, but a different number of demes and individuals per deme (i.e., N1 ≪ N2, D1 ≫ D2), the ratio would be (A4) where MT = 4NTm.
Classical predictions of the island model are then obtained. Under the simple model used, highly structured species (i.e., with many small demes) would have a higher effective population size, and therefore a higher level of variation, if both species have the same mutation rate, than less-structured species (i.e., with fewer, larger demes).
We can combine differences in mating system and population structure. Considering extinction/recolonization only in the highly structured completely selfing species (i.e., species 1), and assuming the same total number of individuals, the same m values, N1 ≪ N2 and D1 ≫ D2, the ratio would be (A5) where ET = 4NTe.
In that case, calculations performed under the conditions described above show that the inbreeding/outbreeding Ne ratio is highly dependent on the extinction/recolonization rate and that this ratio will be higher than unity except for very high extinction/recolonization rate values (data not shown). In the case of lower migration rates for the inbreeding species, the ratio would be even higher (data not shown).
Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. AJ619855–AJ619939 and AJ582819–AJ582910.
Communicating editor: O. Savolainen
- Received February 14, 2003.
- Accepted September 11, 2003.
- Copyright © 2004 by the Genetics Society of America