| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Genetics, Vol. 169, 1601-1615, March 2005, Copyright © 2005
doi:10.1534/genetics.104.033795
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
,3
* Department of Genetics and Evolution, Max-Planck Institute of Chemical Ecology, 07745 Jena, Germany
Max-Planck Institute of Plant Breeding Research, 50829 Köln, Germany
2 Corresponding author: Department of Genetics and Evolution, Max-Planck Institute of Chemical Ecology, Hans-Knöll-Str. 8, D-07745 Jena, Germany.
E-mail: schmid{at}ice.mpg.de
| ABSTRACT |
|---|
|
|
|---|
Therefore it is important to disentangle the effects of different evolutionary processes on sequence variation if one attempts to identify genes involved in adaptation. Theoretical analyses showed that the simultaneous analysis of genetic variation at multiple loci is a powerful approach to identifying genome-wide acting processes (e.g., WALL 1999). If common patterns of polymorphism are observed at different independent loci, they likely result from a genome-wide acting evolutionary process rather than from multiple independent processes acting locally on individual genes (GLINKA et al. 2003; HAMBLIN et al. 2004; AKEY et al. 2004; TENAILLON et al. 2004).
Over the past decade, Arabidopsis thaliana has become an important model organism for the analysis of genetic variation in plants (MITCHELL-OLDS 2001), and sequence variation at more than a dozen loci was surveyed to elucidate the role of selection. Two major patterns emerged from these studies. At some loci, sequence variation is characterized by a pattern of two or three distinct and sometimes highly divergent haplotypes (e.g., HANFSTINGL et al. 1994; KAWABE et al. 1997; CAICEDO et al. 1998; STAHL et al. 1999; KAWABE et al. 2000; HAUSER et al. 2001; OLSEN et al. 2002; TIAN et al. 2002; KROYMANN et al. 2003; CLAUSS and MITCHELL-OLDS 2004) and at other loci by an excess of rare polymorphisms (e.g., KAWABE and MIYASHITA 1999; PURUGGANAN and SUDDITH 1999; KUITTINEN and AGUADé 2000; HAGENBLAD and NORDBORG 2002; OLSEN et al. 2002). Tests of neutrality with both groups of loci frequently rejected the null hypothesis of neutral evolution. These results suggest that in A. thaliana numerous genes are subject to balancing selection (excess of intermediate-frequency polymorphisms or haplotypes) or selective sweeps (excess of rare polymorphisms) at or near the surveyed genes.
Although some of the analyzed genes were selected due to a putative role in adaptive evolution, the nonneutral patterns of genetic variation may also be influenced by the demographic history and other characteristics of this species and thus confound tests of neutrality. A. thaliana is a highly self-fertilizing species (99%; ABBOTT and GOMES 1989; BERGELSON et al. 1998). This level of inbreeding reduces effective population size and the effective rate of recombination (reviewed by CHARLESWORTH 2003). In comparison to outcrossing species, a lower nucleotide diversity is expected due to background selection (CHARLESWORTH et al. 1993) or hitchhiking with selective sweeps (MAYNARD SMITH and HAIGH 1974). Average levels of linkage disequilibrium are expected to be increased due to a low effective recombination rate (NORDBORG and DONNELLY 1997), resulting in a strong haplotype structure in samples obtained from a single deme (NORDBORG 2000). Such a haplotype structure may resemble patterns observed under balancing selection and confound neutrality tests such as Tajima's D (TAJIMA 1989). Furthermore, there is evidence of a large-scale genetic population structure, which reflects a history of glacial refugia and postglacial recolonization of the native species range (SHARBEL et al. 2000), suggesting that recent changes in both population size and population structure have played an important role in shaping the current structure of genetic variation in A. thaliana.
The purpose of this study is to investigate the effect of demographic history and self-fertilization on genome-wide patterns of sequence variation in A. thaliana and to test the hypothesis that a standard neutral model is a suitable null model for the identification of genes involved in adaptation. We analyzed several hundred short genomic regions (sequence-tagged sites, STS) in up to 12 accessions and fitted the observed patterns of variation to various demographic models using coalescent simulations and maximum-likelihood analysis. Most of the sequence data used for this study were taken from a previous survey aimed at large-scale discovery of single-nucleotide polymorphism (SNP) markers in A. thaliana (SCHMID et al. 2003). We reject a neutral model of sequence polymorphism, but did not obtain a different demographic model that performed significantly better. Our analysis indicates that both the selfing nature and the demographic history of A. thaliana have a significant effect on genome-wide sequence polymorphism and that the standard neutral model is not appropriate for tests of neutrality in this species. By using empirical distributions of descriptive statistics such as Tajima's D, we were able to identify new candidate loci that may have been targets of recent selection.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Sequencing, quality trimming, and annotation:
A total of 595 STS loci were chosen essentially randomly from the A. thaliana genome sequence and constitute protein-coding and noncoding regions (SCHMID et al. 2003). PCR, sequencing, and base calling were done as described in SCHMID et al. (2003). Heterozygous sequences were identified by tagging ambiguous base calls with the polyphred program (NICKERSON et al. 1997) and subsequent visual inspection of the corresponding trace files. STS loci with heterozygous sequences were excluded from further analysis (n = 27) because they may consist of paralogous sequences that were amplified with the same primer pair. To avoid the inclusion of additional nonallelic (i.e., paralogous) sequences every sequence was compared to the Col-0 genome sequence using the BLASTN program. If the sequences of a STS locus had best hits with different regions in the genome, the STS was excluded from further analysis (n = 26). After these filtering steps, the consensus sequences were aligned separately for every STS locus with ClustalW (THOMPSON et al. 1994) and the alignment was manually corrected with a sequence editor. Only STS loci with at least 100 alignment positions and a sequence from at least eight accessions were retained for further analysis. Coding regions of alignments were annotated using the Col-0 reference sequences and the MIPS annotation (version 11, July 2003; ftpmips.gsf.de/cress). Before the analysis, alignments were trimmed to exclude regions or sequences with missing or low-quality data. With the exception of the visual control of the alignment quality and problematic annotations, the whole assembly and annotation process was performed in an automated fashion and controlled by programs written in the Python programming language.
Population genetic analysis:
The standard population genetic analyses were carried out with the program DnaSP 3.99 (ROZAS and ROZAS 1999). Nucleotide diversity of a multiple alignment was calculated as
(or
T), the average pairwise nucleotide divergence (TAJIMA 1983; NEI 1987), and
W, the number of segregating sites (WATTERSON 1975). Nucleotide frequencies were analyzed by calculating the following test statistics: Tajima's D (TAJIMA 1989); Fu and Li's D*, F* (no outgroup) and D, F (with outgroup; FU and LI 1993); Fay and Wu's H (FAY and WU 2000); and a modification of Tajima's D-statistic, D/Dmin (SCHAEFFER 2002). The latter was calculated as D/Dmin = (
S/an)/|
min S/an| and
min = 2S/n, where S is the number of segregating sites and
. In an analogous manner, we also calculated H/Hmin for Fay and Wu's H-statistic. In this case,
. Under a given demographic model the expected values of these statistics are homogeneous for different loci with different numbers of segregating sites, which thus facilitates comparisons among loci (SCHAEFFER 2002). The test of MCDONALD and KREITMAN (1991) was calculated for STS with >80 aligned codons and with an outgroup sequence. Population genetic analyses based on haplotype structure were not performed, because this information was incomplete for many of the studied loci (i.e., internal columns of the alignment were masked as a consequence of the automated masking of low-quality sequences). Sequence divergence for silent, synonymous, and nonsynonymous variation was calculated using the correction of JUKES and CANTOR (1969), as implemented in DnaSP 3.99 (ROZAS and ROZAS 1999).
Effect of sequencing errors:
Tajima's D-, Fu and Li's D*-, and Fay and Wu's H-statistics are sensitive to rare polymorphisms and should be affected by sequencing errors. To investigate the effect of sequencing errors, we conservatively assumed that all errors result in singleton polymorphisms. The number of expected false-positive singleton polymorphisms (i.e., sequencing errors) was calculated by multiplying the total number of base pairs across all alignments with a given error rate of false base calls. From the set of observed singletons, the corresponding number was randomly selected and subsequently removed from the alignments. Then, Tajima's D, Fu and Li's D*, and Fay and Wu's H were calculated using these alignments. The last three steps were repeated 100 times per error rate, and the averages and standard deviations were calculated from individual simulations. Sequencing error rates investigated ranged from 107 to 102.
Although the observed mean of these statistics changes with sequencing errors, changes are minor over a wide range of error rates (Figure 1). We used the neighbor quality standard (NQS; ALTSHULER et al. 2000) for quality trimming. A phred quality score of 30 for the focal base call and a score of 20 for the five neighboring base calls on the left and right sides, respectively, were used as cutoff scores. With these parameters the expected sequencing error rate for single reads is 5 x 105 false base calls per base (ALTSHULER et al. 2000). However, since two-thirds of total bases were sequenced from both directions, the real error rate is likely to be much lower. On the basis of simulation results, we expect that sequencing errors do not significantly affect our analyses of polymorphism.
|
, was obtained by maximum-likelihood (ML) analysis using an analytical result from coalescence theory that infers the probability of
given the number of segregating sites, S, and sample size n at a locus (TAVARé 1984, Equation 9.5):
![]() |
We first calculated the probabilities of individual loci for a range of
per nucleotide ranging from a minimum of 0.0001 to a maximum of 0.10. To correct for differences in the alignment length,
was multiplied by the sequence length and then used to calculate the probabilities. The multilocus likelihood for every value of
was then calculated as the sum of the log probabilities across loci for a given value of
. To test the hypothesis that the data are better described by more than one value of
, likelihoods were calculated for two (M2), three (M3), and four (M4) different
values for the whole sample and individual values for every locus (M no. loci). We first sorted loci numerically according to locus-specific ML estimates of
. Then, for the M2 model, we used this sorted list to separate the loci into two different groups and searched for the best likelihood values in each group. The two exclusive groups with the best likelihood were then chosen. We also used this strategy to obtain
values for the M3 and M4 models, although a heuristic search was performed because not all possible groups of loci could be investigated. In this search, all loci were first grouped randomly for 500 iterations, and then 500 additional iterations were performed with the best groups of loci to estimate the likelihood. Likelihood-ratio tests (LRT) for comparisons between different models were performed. The distribution of the LRT statistic was obtained by coalescent simulations and not from a
2-distribution since it was difficult to ascertain the degrees of freedom. We simulated 1000 new samples using the simplest model (for example, comparing model M1 vs. M2, we simulated 1000 times the number of segregating sites for each locus given a
value per nucleotide estimated by the model M1). We then recalculated the maximum likelihood for each simulated sample and stored the LRT value of each iteration. Finally, the LRT value for the observed data was contrasted with the simulated LRT distribution.
Levels of variation at synonymous and in noncoding sites were calculated by ML. We also calculated the likelihood under the assumption that both groups of sites have the same level of polymorphism (one
is estimated) and that
values differ between the two groups of genes. The likelihoods were then compared in a LRT.
To compare the observed genome-wide distribution of descriptive statistics with the expected distribution under a standard neutral model, we performed coalescent simulations conditioning on the genome-wide estimate of
(using one or more
values) obtained by maximum-likelihood estimation (HUDSON 1990, 1993), using programs by S. RAMOS-ONSINS (unpublished data) and R. Hudson (HUDSON 2000). For multilocus analyses, conditioning on
is preferable to S (S. RAMOS-ONSINS, unpublished data). Simulations were carried out individually for each locus and repeated 10,000 times. We calculated two kinds of statistics:
divergenceobserved/2
(HUDSON et al. 1987), equal to eight measured N generations. In the case of B. drummondii, s/v = 1.0 and Tout
12. P-values were calculated as the probability that a neutral, simulated value of a given statistic was smaller or larger than the observed value.
Multilocus HKA tests (HUDSON et al. 1987) were calculated for silent positions using the silent segregating sites and the silent divergence value (corrected after JUKES and CANTOR 1969). HKA tests were calculated separately for loci with A. lyrata and B. drummondii as outgroup sequences. The significance of the HKA tests was obtained using a
2-distribution (HUDSON et al. 1987).
Alternative demographic models:
The population history of A. thaliana appears to be determined by separated populations during the Pleistocene and subsequent expansion into Central and Eastern Europe during the past 18,000 years (SHARBEL et al. 2000). For this reason, three alternative demographic models were evaluated that take the demographic history into account.
Logistic growth model:
This model estimates the ML parameters under logistic growth given the frequency spectrum of segregating sites (with or without outgroup). In the logistic growth model (e.g., FU 1997), the population size changes back in time as follows:
![]() |
, was fixed to
= 10/(t1 t0). Estimated model parameters include
, the start and end time points of the logistic growth phase, and the growth rate N0/N1 (setting N0 = 1 as the current population size). Likelihood estimates were calculated by generating 100200 coalescent trees for each set of parameters with no intralocus recombination. We then calculated the probability of the same number of mutations occurring with frequency i as observed in the data by using a Poisson distribution (J. WAKELEY, personal communication). The log-likelihood value was calculated for each locus and the sum of all loci was stored. The best parameters were searched for by using a grid of 10,000 parameter values. Also, Metropolis-Hastings Markov chain Monte Carlo (METROPOLIS et al. 1953; HASTINGS 1970; KUHNER et al. 1995) sampling was performed. We used the simplest algorithm for sampling parameters and accepted the new parameters in the chain if
![]() |
2-distribution might be too liberal. For this reason, we also calculated averages and variances across all loci for all summary statistics using the parameters estimated with the logistic growth model to evaluate whether the estimated parameters fit the observed data.
Refugia and bottleneck models:
The second demographic model assumes a single panmictic population in the distant past and then a subdivision into several refugia for a certain time period and a subsequent admixture into a single current population (Figure 2; HUDSON 2000; WALL 2000). The time (in generations) from the present to the past split into several small populations (refugia), the number of refugia, the relative population size of each refugium (considering the present population size as N0 = 1), and the time until refugia merged into a single population were taken as general parameters for all loci. Values of
and the relative contribution of each refugial population to the present population (freqr) were individual parameters for each locus. To be more conservative, the recombination parameter was set to zero, and, to simplify, the parameter freqr was considered equal for all loci. In the case of the bottleneck model, we used the same parameters as in the refugia model but the number of refugia was fixed to one. A ML analysis of the refugia model requires estimation of a large number of model parameters and is currently too expensive computationally, but we performed an exploratory analysis with a limited number of parameter combinations. In these analyses, we assumed a present effective population size of 2N = 106 and one generation per year. The time (from present to past) until refugia merged into a single population was fixed to cold periods at 104 generations ago (2 x 102 relative to N generations) and 2 x 104 (4 x 102), and the refugia were maintained for 103 or 5 x 103 generations (102) until a split into refugial populations. The relative population sizes in the refugia phase were set to 0.05, 0.01, 0.005, and 0.001 in relation to the present, and the number of refugia was set to 1 (bottleneck), 2, 5, and 10 refugia. The relative ancestral population sizes of these refugial populations (Nancestral/N0) were assumed to have been smaller than the current one and were fixed to 0.1, but 1 was also considered. To better adjust the model, we also tried additional parameter values in the vicinity of the best-fitting parameter combinations.
|
Correlations of genetic diversity with GC content at silent sites (GC3) and estimated local recombination rate were analyzed by linear, quadratic, and quantile regression analysis using the R statistical package (http://www.r-project.org). The latter method is suitable for testing whether the edges of distributions with polygonal shapes reflect a random pattern or a significant relationship between two parameters (SCHARF et al. 1998; KOENKER and HALLOCK 2001). We used the quantreg package of R and a quadratic model to compare the quadratic regression coefficients for 5095% quantiles and calculated the significance of observed regression coefficients by comparing 10,000 permutations of value pairs.
Data availability:
The sequences generated for this study were submitted to the STS section of GenBank under accession nos. CW672529, CW672721. Annotated alignments can be obtained from http://kiwi.ice.mpg.de/athapop.
| RESULTS |
|---|
|
|
|---|
Average levels of polymorphism are summarized in Table 1. The majority of loci are distributed close to the mean of the silent variation (
W = 0.006), and few loci (39) are highly polymorphic (
W > 0.02; Figure 3). The relationship between synonymous and nonsynonymous polymorphism in coding regions is shown in Figure 4A. Among the set of 153 STS, which cover protein-coding regions of at least 80 codons, 90 (59%) have a ratio of
n/
s < 1, suggesting purifying selection against deleterious amino acid replacement polymorphisms. Among the most polymorphic genes, five have a
n/
s ratio of close to 1. They are annotated as "hypothetical" or "putative" and may constitute redundant, nonfunctional, or incorrectly annotated genes. Three of these genes contain at least one allele with a premature stop codon or an out-of-frame insertion/deletion, suggesting that they are pseudogenes.
|
|
|
noncoding = 0.0091,
synonymous = 0.0088, ML = 414.33 398.41 = 812.74) and compared to a
silent value estimated by combining all sites (
silent = 0.0090, ML = 812.82), the diversity between the two types of sites was not significantly different (LRT = 0.148, P = 0.70). Therefore, we combined synonymous and noncoding sites in the following analyses.
Effect of recombination:
The levels of variation across the genome can be affected by different local recombination rates. Polymorphism levels do not differ between centromeric and noncentromeric regions (Table 2) and within noncentromeric regions there is no correlation between recombination rate and silent diversity (not shown). In regions of low and high recombination, polymorphism levels tend to be reduced (Figure 5), but linear (R2 = 0.04; P > 0.1), quadratic (R2 = 0.06; P > 0.1), and quantile regression coefficients (5095% quantiles; P > 0.05) were not significant. In A. thaliana, pairwise linkage disequilibrium can extend up to 250 kb (NORDBORG et al. 2002), suggesting that neighboring loci may have similar levels of polymorphism. This was confirmed by our observation of a higher proportion of loci with similar polymorphism levels among closely located (<250 kb) than among distant loci (using a small difference of 0.005 as cutoff; G-test, P = 0.039). Therefore, to be conservative, we used only loci that are separated by at least 250 kb for the analysis of demographic models (195 loci).
|
|
silent values. By comparing observed values with distributions obtained by coalescent simulations and testing for differences in an LRT (Table 3), we found that a model with two different estimates for silent
(M2;
1 = 0.0041,
2 = 0.0213) best explained the distribution of silent variation under a neutral model. These estimates were used in the following simulations.
|
|
|
Since two groups of loci with low and high levels of variation were observed, we tested whether distributions of summary statistics differ between them (Table 6). In both cases, the average of Tajima's D is negative and significantly different from the expectations of the neutral model. When variances are considered, only the highly polymorphic group differs from the neutral model. The observation of an increased variance of test statistics in the group of highly polymorphic loci may be interpreted as a footprint of selection or of other evolutionary forces (see below).
|
2 = 146.87 with 42 d.f., P < 0.0001; Figure 7) and for 20 loci with a B. drummondii outgroup sequence (
2 = 54.38 with 19 d.f., P < 0.0001). Six of 43 loci (14%) were mainly responsible for the high significance of the HKA test in the comparison between A. thaliana and A. lyrata (AtV23, TIGR3437, GOLM25, At3est48, AtV11, and TIGR1736), and 4 of 20 loci (20%) between A. thaliana vs. B. drummondii (GOLM66, TIGR1744, TIGR1518, and GOLM80). Most of these loci have higher levels of polymorphism and lower levels of divergence than expected and are included in the group of highly polymorphic genes (Tables 3 and 6). Significant multilocus HKA tests are not consistent with the neutral panmictic model and may result from selection. It should be noted that a deviation from the assumption of panmixia may increase the variance of the observed data and contribute to the significant test result (HUDSON et al. 1987; RAMOS-ONSINS et al. 2004).
|
Testing alternative demographic models:
Since a standard neutral model is not consistent with the observed data, we considered alternative demographic models. A. thaliana occurs frequently at disturbed sites, which have expanded over the last 6000 years with the spread of human agriculture; hence it seems plausible to consider a model of recent population expansion. Furthermore, we observed a significant negative average of Tajima's D, which may be interpreted as a footprint of recent population expansion (TAJIMA 1989). We first studied a logistic growth model with four parameters (see MATERIALS AND METHODS) and found little statistical support for this model, because a higher likelihood was obtained with a model of constant population size (195 STS; ML = 1557.3) than with logistic growth (ML = 1562 for an estimated growth rate of 0.74). A higher likelihood for the constant population size model suggests that population expansion alone is not sufficient for explaining observed patterns of polymorphism.
We also compared our data against parameter combinations that take into account population expansion, such as those estimated by INNAN and STEPHAN (2000) using a smaller number of loci (
= 0.10, N1/N0 = 0.57, Tstart = 0, and Tend = 0.6, measured in N generations). Using these parameters, we always found a decrease in the variance of Tajima's D relative to the observed variance (see Figure 6 and Table 5). The high variance exhibited in the observed distribution of Tajima's D also suggests that our data are not consistent with a simple population expansion model.
|
Identification of nonneutrally evolving outlier loci:
The genome-wide distribution of nucleotide variation is a mixture of distributions resulting from demographic processes and selection. For this reason, the high variance we observe in the distribution of the test statistics may be caused by "outlier loci" that evolve under positive or balancing selection. Our data set contains 28 STS loci that reject a standard null model in Tajima's D, Fu and Li's D*, Fay and Wu's H, and the HKA and McDonald-Kreitman tests of neutrality and thus may have been targets of recent selection (supplementary information at http://www.genetics.org/supplemental/). These outlier loci may be responsible for the high variance observed in the empirical distributions and the rejection of a neutral model. A better fit to a neutral demographic model may be observed if they are excluded from the analysis (LUIKART et al. 2003).
To evaluate this possibility, we identified all loci in the 2.5% tails on both sides of the empirical distributions of all statistics (see supplementary information). Thirteen loci (6.8%) were excluded from the set of 195 loci and 12 loci (27.9%) from the set with an outgroup sequence from A. lyrata. In the latter case, we also excluded loci with significant MK and HKA tests. The removal of outlier loci resulted in smaller variances, but they still differed significantly from expected variances (not shown). However, we obtained a better fit to the logistic growth model with the set of loci containing an outgroup sequence (n = 31,
= 0.011, N1/N0 = 0.235, Tstart = 0.0158, and Tend = 0.0817, measured in N generations; ML = 140.85 when compared to the constant size neutral model,
= 0.004, ML = 145.44). No better fit was observed with the larger data set.
We also evaluated the effect of outlier loci by removing loci with high estimates of
identified by the M2 model described above (Table 3). Using the remaining 128 loci (66%), we obtained a better fit to the logistic growth model (
= 0.0103, N1/N0 = 0.211, Tstart = 0.0180, and Tend = 0.0796, measured in N generations; ML = 585.01) than to the standard neutral model (
= 0.004, ML = 603.15). The parameter estimates for the logistic growth model are similar to the ones in the previous analysis. In addition, observed averages and variances of all test statistics were compared with simulated distributions using the estimated parameters of the logistic growth model (not shown). No test showed a significant difference between observed and expected distributions, suggesting that a logistic growth model is more consistent with the patterns of variation at the 128 loci than is a neutral model of constant population size. Thus, some of the 67 high-diversity loci may have been either targets of selection or affected by demographic processes not included in our models. In the latter case, the exclusion of outlier loci may have resulted in an artificial reduction of the variance and thus an improved fit to the demographic model.
| DISCUSSION |
|---|
|
|
|---|
Sequence diversity in A. thaliana has been analyzed previously at more than a dozen loci, but the relative roles of demography and selection on patterns of genetic variation have not been rigorously studied, except by INNAN and STEPHAN (2000). The average level of silent nucleotide diversity observed among all loci in our data (
W = 0.00896) is very similar to the mean diversity of 14 genes surveyed previously (0.009; SHEPARD and PURUGGANAN 2003). Furthermore, we find an excess of low-frequency polymorphisms as indicated by an average negative estimate of Tajima's D, which has also been noted in earlier studies (PURUGGANAN and SUDDITH 1999; KUITTINEN and AGUADé 2000; SHEPARD and PURUGGANAN 2003). Although previous studies focused on specific regions or protein-coding genes of particular interest, our data agree with levels and patterns of variation observed in these studies.
The large number of loci investigated in this study allowed us to search for possible causes of different levels of polymorphism among loci. A higher diversity at synonymous sites than at noncoding sites was also found in maize, Drosophila, and humans (TENAILLON et al. 2001), but the differences are more pronounced (1.5- to 2-fold) in these species than in our study (1.2-fold). Furthermore, synonymous diversity and GC3 were not correlated. The low level of codon usage bias of A. thaliana (DURET and MOUCHIROUD 1999) suggests that selection for optimal codon usage is not very strong and that synonymous polymorphisms are completely neutral, possibly because of a reduced effective population size due to the high selfing rate (BUSTAMANTE et al. 2002).
The local rate of recombination is correlated with nucleotide diversity in a number of plant (DVORáK et al. 1998; KRAFT et al. 1998; STEPHAN and LANGLEY 1998) and animal species (e.g., BEGUN and AQUADRO 1992; GLINKA et al. 2003). We did not find such a relationship in A. thaliana, suggesting that either differences in the effective rate of recombination are too small to have a measurable effect on nucleotide diversity or background or positive selection are not strong enough to cause the observed relationship between genetic diversity and recombination found in other species. Similarly, the local recombination rate is not correlated with the distribution of transposable elements (WRIGHT et al. 2003) and only weakly correlated with the frequency of tandemly repeated genes (ZHANG and GAUT 2003). Variation in recombination rates does not appear to be a strong force in structuring the genome of A. thaliana. On the other hand, the observation of similar levels of nucleotide diversity among neighboring genes (<250 kb) may result from a low effective recombination rate and extended regions of the genome then have similar evolutionary histories (NORDBORG and TAVARé 2002). This finding is supported by the observation of correlated polymorphism levels within a 40-kb genomic region around the CLAVATA2 locus (SHEPARD and PURUGGANAN 2003) and within a 170-kb region around the MAM locus of A. thaliana (HAUBOLD et al. 2002). Correlated patterns of polymorphism among physically linked loci that result from background or positive selection occurring in a particular genomic region may interfere with the analysis of the demographic processes. Therefore, to minimize the effect of selection on our analysis of demographic models, we included only physically distant loci.
Rejection of a neutral panmictic model of evolution:
Our data show a significant, genome-wide deviation from a standard mutation-drift model of evolution (Table 6), which may result from the absence of panmixia, temporary changes in population size and structure, or selection at independent loci. We first want to consider possible effects of population structure and population growth on our analyses. The accessions included in this survey are genetically distantly related, as indicated by a genealogy with long terminal branches (SCHMID et al. 2003). Every accession therefore is assumed to represent a single individual from different local demes, making it impossible to observe a deviation from the standard mutation-drift model at the level of the deme (e.g., WAKELEY 2004). Using Wakeley's terminology (WAKELEY 1999), our accessions represent the "collecting phase" of the coalescent of a metapopulation (i.e., the coalescing of lines from different demes) and not the "scattering phase" (coalescence events within individual demes). The collecting phase of a coalescent is equivalent to a single standard population (WAKELEY and ALIACAR 2001) and can be analyzed with the general coalescent as we have done in this study. This conclusion seems robust enough to be applicable to different metapopulation structures (WAKELEY 2001; WAKELEY and ALIACAR 2001).
Under this assumption, deviations at a genomic level from the neutral panmictic model might be a consequence of those events that affect the entire metapopulation structure, like temporary changes in population size (expansion, bottlenecks), or also subdivision of the metapopulation (i.e., subdivision of this species in isolated refugia for a certain time period). The significant excess of low-frequency polymorphisms could be explained by an expansion process, as has been suggested (PURUGGANAN and SUDDITH 1999; INNAN and STEPHAN 2000; KUITTINEN and AGUADé 2000), but ML estimates of an alternative logistic growth model were not different from the standard neutral model. Furthermore, the high variance observed in our empirical distributions is not expected under population expansion and a simple expansion model is not sufficient to explain the difference of observed data from a neutral model. We were able to obtain a better fit to a logistic growth than to a standard neutral model when highly polymorphic loci were excluded. However, this result needs to be interpreted with caution because we removed one-third of all loci from the analysis on the basis of a ML analysis of
values under a neutral model without knowing whether these loci are highly polymorphic due to selection or due to a neutral process such as a locally increased mutation rate or admixture from previously separated subpopulations representing glacial refugia (SHARBEL et al. 2000). To account for the latter possibility, a refugia model and bottleneck models were also considered. Although we investigated only a few biologically realistic parameter combinations, we could not detect a combination that was consistent with the observed data.
A second explanation for the observed deviation from a neutral model is selection. If selection occurred only at single loci, its effect on the distribution of summary statistics should be minor given the large number of loci analyzed. On the other hand, selective sweeps or balancing selection at many loci may contribute to the significant deviation of the mean and the variance of Tajima's D from the expectation of a neutral model. Such an explanation is not supported by the analysis of a subset of loci, for which we were able to obtain a sequence from one of outgroup species, because the observed distribution of Fay and Wu's H was not significantly different from the expectation of a standard neutral model. Thus, the negative averages of Tajima's D and Fu and Li's statistics do not seem to result from positive selection at or hitchhiking of multiple loci.
Alternatively, purifying selection against slightly deleterious mutations may have caused the excess of low-frequency polymorphisms. The efficacy of selection against deleterious mutations may be weak compared to that of other species (BUSTAMANTE et al. 2002), but it is nevertheless operating in A. thaliana, as indicated by the lower nonsynonymous than synonymous nucleotide diversity (Figure 4). The effect of purifying selection on the frequency spectrum is difficult to quantify, especially in noncoding regions, because little is known about their functional and evolutionary constraints. As discussed above, synonymous polymorphisms may not be exposed to purifying selection and thus contribute little to the deviation from a neutral model. Although it does not appear that positive or negative selection is solely responsible for the observed deviation from a neutral model, the observation of highly negative values of Fay and Wu's H at some loci and the high variance in the distribution of summary statistics may result from variable selection pressures at different loci.
It should be noted that we did not take the geographic origin of accessions used for this study into account. The excess of rare polymorphisms may also result from fixation of locally occurring polymorphisms by selection and thus represent geographic differentiation between demes of a self-fertilizing species (HEDRICK and HOLDEN 1979). Such an interpretation is consistent with a weak, but significant presence of a geographic population structure in the natural species range (SHARBEL et al. 2000). However, our current sample of 12 accessions is not large enough to address this question.
Modified tests of neutrality:
A frequent goal of surveys of sequence variation is to evaluate whether a particular gene evolves under selection. Furthermore, genome-wide analyses of genetic variation aim at identifying novel "adaptive trait genes" that were subject to positive or balancing selection (reviewed by LUIKART et al. 2003). In sequence surveys, the rejection of the null hypothesis of neutral evolution in one of the numerous available tests of neutrality is often taken as evidence that a gene evolves under selection (e.g., PURUGGANAN and SUDDITH 1998; OLSEN et al. 2002; KROYMANN et al. 2003; MAURICIO et al. 2003). According to our results, a standard neutral model should not be used as a null hypothesis for neutrality tests in A. thaliana because of the effects of demographic history on nucleotide diversity. Thus, if one attempts to test the hypothesis that a given gene has been the target of natural selection, the challenge consists of differentiating between the effects of demography and selection on genetic variation.
One approach to account for demography in tests of selection is to use a modified null model that incorporates the demographic history of a species and selection at independent loci. This allows the estimation of demographic parameters and of the likelihood that individual loci evolved under selection. Using our data, we could not formulate an alternative model for such a purpose, because we were not able to identify all the demographic and selective forces that have shaped observed variation. It will be a considerable challenge to develop such a model for a species with a complex demographic history given the large number of parameters involved. An alternative approach is to use empirical distributions of various descriptive statistics derived from randomly sequenced genomic loci and to identify the outlier loci in such statistics (BLACK et al. 2001; LUIKART et al. 2003). Outlier loci are defined as falling into the extreme tails of the empirical distribution and thus exhibit unusual patterns of variation. We have calculated the critical values for various descriptive statistics using the empirical distributions obtained from our data (Table 7) and they may be useful for neutrality tests in future sequence surveys of novel genes of interest. In this case, however, one would have to consider the effect of using different accessions and different sequence lengths on descriptive statistics of nucleotide diversity before comparing them to such a distribution (PLUZHNIKOV and DONNELLY 1996). A modification of this empirical approach is to use combinations of test statistics and to identify those genes that reject more than one neutrality test. In our data set, a small number of loci (e.g., AtV9, AtV11, AtV23, and GOLM80) fulfill this criterion. Candidate adaptive trait genes can be easily found by such an approach, but further investigations will be necessary, because the short genomic segments studied here are not sufficient to fully characterize patterns of polymorphism at a locus and thus to infer the role of selection. Despite these concerns, the use of empirical distributions appears to be a useful alternative to a model-based analysis to identify genes that may have been targets of selection because the demographic history is taken into account.
|
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| FOOTNOTES |
|---|
1 These authors contributed equally to this work. ![]()
3 Present address: Department of Biology, University of Bielefeld, 33594 Bielefeld, Germany. ![]()
| LITERATURE CITED |
|---|
|
|
|---|
ABBOTT, R. J., and M. F. GOMES, 1989 Population genetic structure and outcrossing rate of Arabidopsis thaliana. Heredity 42: 411418.
AKEY, J., M. EBERLE, M. RIEDER, C. CARLSON, M. SHRIVER et al., 2004 Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2: e286.[CrossRef][Medline]
ALTSHULER, D., V. POLLARA, C. C. W. VAN ETTEN, J. BALDWIN, L. LINTON et al., 2000 SNP map of the human genome generated by reduced representation sequencing. Nature 407: 513516.[CrossRef][Medline]
BEGUN, D., and C. AQUADRO, 1992 Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356: 519520.[CrossRef][Medline]
BERGELSON, J., C. PURRINGTON and G. WICHMANN, 1998 Promiscuity in transgenic plants. Nature 395: 25.[Medline]
BLACK, W., C. BAER, M. ANTOLIND and N. DUTEAU, 2001 Population genomics: genome-wide sampling of insect populations. Annu. Rev. Entomol. 46: 441469.[CrossRef][Medline]
BUSTAMANTE, C., R. NIELSEN, S. SAWYER, K. OLSEN, M. PURUGGANAN et al., 2002 The cost of inbreeding in Arabidopsis. Nature 416: 531534.[CrossRef][Medline]
CAICEDO, A. L., B. A. SCHAAL and B. N. KUNKEL, 1998 Diversity and molecular evolution of the RPS2 resistance gene in Arabidopsis thaliana. Proc. Natl. Acad. Sci. USA 96: 302306.
CHARLESWORTH, B., M. MORGAN and D. CHARLESWORTH, 1993 The effect of deleterious mutations on neutral molecular variation. Genetics 134: 12891303.[Abstract]
CHARLESWORTH, D., 2003 Effects of inbreeding on the genetic diversity of populations. Philos. Trans. R. Soc. Lond. B 358: 10511070.[CrossRef][Medline]
CLAUSS, M., and T. MITCHELL-OLDS, 2004 Functional divergence in tandemly duplicated Arabidopsis thaliana trypsin inhibitor genes. Genetics 166: 14191436.
COPENHAVER, G., K. NICKEL, T. KUROMORI, M. BENITO, S. KAUL et al., 1999 Genetic definition and sequence analysis of Arabidopsis centromeres. Science 286: 24682474.
DURET, L., and D. MOUCHIROUD, 1999 Expression pattern and, surprisingly, gene length shape codon usage in Caenorhabditis, Drosophila, and Arabidopsis. Proc. Natl. Acad. Sci. USA 96: 44824487.
DVORáK, J., M. LUO and Z. YANG, 1998 Restriction fragment length polymorphism and divergence in the genomic regions of high and low recombination in self-fertilizing and cross-fertilizing Aegilops species. Genetics 148: 423434.
FAY, J. C., and C.-I WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155: 14051413.
FU, Y.-X., 1997 Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147: 915925.[Abstract]
FU, Y.-X., and W.-H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693709.[Abstract]
GLINKA, S., L. OMETTO, S. MOUSSET, W. STEPHAN and D. DE LORENZO, 2003 Demography and natural selection have shaped genetic variation in Drosophila melanogaster: a multi-locus approach. Genetics 165: 12691278.
HAGENBLAD, J., and M. NORDBORG, 2002 Sequence variation and haplotype structure surrounding the flowering time locus FRI in Arabidopsis thaliana. Genetics 161: 289298.
HAMBLIN, M., S. MITCHELL, G. WHITE, J. ALLEGO, R. KUKATLA et al., 2004 Comparative population genetics of the Panicoid grasses: sequence polymorphism, linkage disequilibrium and selection in a diverse sample of Sorghum bicolor. Genetics 167: 471483.
HANFSTINGL, U., A. BERRY, E. E. KELLOG, J. T. COSTA, III, W. RüDIGER et al., 1994 Haplotypic divergence coupled with lack of diversity at the Arabidopsis thaliana alcohol dehydrogenase locus: roles for both balancing and directional selection. Genetics 138: 811828.[Abstract]
HASTINGS, W., 1970 Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 57109.
HAUBOLD, B., J. KROYMANN, A. RATZKA, T. MITCHELL-OLDS and T. WIEHE, 2002 Recombination and gene conversion in a 170-kb genomic region of Arabidopsis thaliana. Genetics 161: 12691278.
HAUPT, W., T. FISCHER, S. WINDERL, P. FRANSZ and R. TORRES-RUIZ, 2001 The CENTROMERE1 (CEN1) region of Arabidopsis thaliana: architecture and functional impact of chromatin. Plant J. 27: 285296.[CrossRef][Med