## Abstract

The simultaneous analysis of multiple genomic loci is a powerful approach to studying the effects of population history and natural selection on patterns of genetic variation of a species. By surveying nucleotide sequence polymorphism at 334 randomly distributed genomic regions in 12 accessions of *Arabidopsis thaliana*, we examined whether a standard neutral model of nucleotide sequence polymorphism is consistent with observed data. The average nucleotide diversity was 0.0071 for total sites and 0.0083 for silent sites. Although levels of diversity are variable among loci, no correlation with local recombination rate was observed, but polymorphism levels were correlated for physically linked loci (<250 kb). We found that observed distributions of Tajima's *D-* and *D*/*D*_{min}- and of Fu and Li's *D-*, *D**- and *F-*, *F**-statistics differed significantly from the expected distributions under a standard neutral model due to an excess of rare polymorphisms and high variances. Observed and expected distributions of Fay and Wu's *H* were not different, suggesting that demographic processes and not selection at multiple loci are responsible for the deviation from a neutral model. Maximum-likelihood comparisons of alternative demographic models like logistic population growth, glacial refugia, or past bottlenecks did not produce parameter estimates that were more consistent with observed patterns. However, exclusion of highly polymorphic “outlier loci” resulted in a fit to the logistic growth model. Various tests of neutrality revealed a set of candidate loci that may evolve under selection.

THE structure of genetic polymorphism in a genome is influenced by different evolutionary processes. They can be grouped into processes that affect the whole genome (historical population growth or geographic population structure) and into processes that act locally or are variable across the genome (mutation, recombination, and selection). A frequent goal of studies of DNA sequence variation is to elucidate whether a gene of interest has been the target of recent natural selection. This goal is achieved by comparing patterns of variation observed at a locus with the variation expected under a neutral model that assumes no significant fitness effect of the polymorphisms segregating at a locus. Such comparisons can be confounded by the demographic history of a population because demographic processes may lead to patterns of variation that are similar to those observed under selection. For example, if there is an excess of low-frequency polymorphisms at a locus, this could result from a recent population expansion, from purifying selection against deleterious polymorphisms, or from the recent selective fixation of an advantageous allele at this locus.

Therefore it is important to disentangle the effects of different evolutionary processes on sequence variation if one attempts to identify genes involved in adaptation. Theoretical analyses showed that the simultaneous analysis of genetic variation at multiple loci is a powerful approach to identifying genome-wide acting processes (*e.g.*, Wall 1999). If common patterns of polymorphism are observed at different independent loci, they likely result from a genome-wide acting evolutionary process rather than from multiple independent processes acting locally on individual genes (Glinka *et al.* 2003; Hamblin *et al.* 2004; Akey *et al.* 2004; Tenaillon *et al.* 2004).

Over the past decade, *Arabidopsis thaliana* has become an important model organism for the analysis of genetic variation in plants (Mitchell-Olds 2001), and sequence variation at more than a dozen loci was surveyed to elucidate the role of selection. Two major patterns emerged from these studies. At some loci, sequence variation is characterized by a pattern of two or three distinct and sometimes highly divergent haplotypes (*e.g.*, Hanfstingl *et al.* 1994; Kawabe *et al.* 1997; Caicedo *et al.* 1998; Stahl *et al.* 1999; Kawabe *et al.* 2000; Hauser *et al.* 2001; Olsen *et al.* 2002; Tian *et al.* 2002; Kroymann *et al.* 2003; Clauss and Mitchell-Olds 2004) and at other loci by an excess of rare polymorphisms (*e.g.*, Kawabe and Miyashita 1999; Purugganan and Suddith 1999; Kuittinen and Aguadé 2000; Hagenblad and Nordborg 2002; Olsen *et al.* 2002). Tests of neutrality with both groups of loci frequently rejected the null hypothesis of neutral evolution. These results suggest that in *A. thaliana* numerous genes are subject to balancing selection (excess of intermediate-frequency polymorphisms or haplotypes) or selective sweeps (excess of rare polymorphisms) at or near the surveyed genes.

Although some of the analyzed genes were selected due to a putative role in adaptive evolution, the nonneutral patterns of genetic variation may also be influenced by the demographic history and other characteristics of this species and thus confound tests of neutrality. *A. thaliana* is a highly self-fertilizing species (99%; Abbott and Gomes 1989; Bergelson *et al.* 1998). This level of inbreeding reduces effective population size and the effective rate of recombination (reviewed by Charlesworth 2003). In comparison to outcrossing species, a lower nucleotide diversity is expected due to background selection (Charlesworth *et al.* 1993) or hitchhiking with selective sweeps (Maynard Smith and Haigh 1974). Average levels of linkage disequilibrium are expected to be increased due to a low effective recombination rate (Nordborg and Donnelly 1997), resulting in a strong haplotype structure in samples obtained from a single deme (Nordborg 2000). Such a haplotype structure may resemble patterns observed under balancing selection and confound neutrality tests such as Tajima's *D* (Tajima 1989). Furthermore, there is evidence of a large-scale genetic population structure, which reflects a history of glacial refugia and postglacial recolonization of the native species range (Sharbel *et al.* 2000), suggesting that recent changes in both population size and population structure have played an important role in shaping the current structure of genetic variation in *A. thaliana*.

The purpose of this study is to investigate the effect of demographic history and self-fertilization on genome-wide patterns of sequence variation in *A. thaliana* and to test the hypothesis that a standard neutral model is a suitable null model for the identification of genes involved in adaptation. We analyzed several hundred short genomic regions (sequence-tagged sites, STS) in up to 12 accessions and fitted the observed patterns of variation to various demographic models using coalescent simulations and maximum-likelihood analysis. Most of the sequence data used for this study were taken from a previous survey aimed at large-scale discovery of single-nucleotide polymorphism (SNP) markers in *A. thaliana* (Schmid *et al.* 2003). We reject a neutral model of sequence polymorphism, but did not obtain a different demographic model that performed significantly better. Our analysis indicates that both the selfing nature and the demographic history of *A. thaliana* have a significant effect on genome-wide sequence polymorphism and that the standard neutral model is not appropriate for tests of neutrality in this species. By using empirical distributions of descriptive statistics such as Tajima's *D*, we were able to identify new candidate loci that may have been targets of recent selection.

## MATERIALS AND METHODS

### Plant material:

Twelve accessions from *A. thaliana* were included in this survey, which consist of five accessions previously used in genetic mapping (Col-0, Cvi-0, L*er*, Nd-0, and Ws-0) and an additional 7 accessions (Ei-2, CS22491, Gü-0, Lz-0, Wei-0, Ws-0, and Yo-0) with a high average genetic distance to other accessions (Sharbel *et al.* 2000). These lines are available from the stock centers. Single accessions each from the closely related species *A. lyrata* ssp. *lyrata* and from *Boechera drummondii* were used as outgroups. These two species have an evolutionary distance to *A. thaliana* of 5–8 and 10 million years, respectively (Koch *et al.* 2000).

### Sequencing, quality trimming, and annotation:

A total of 595 STS loci were chosen essentially randomly from the *A. thaliana* genome sequence and constitute protein-coding and noncoding regions (Schmid *et al.* 2003). PCR, sequencing, and base calling were done as described in Schmid *et al.* (2003). Heterozygous sequences were identified by tagging ambiguous base calls with the polyphred program (Nickerson *et al.* 1997) and subsequent visual inspection of the corresponding trace files. STS loci with heterozygous sequences were excluded from further analysis (*n* = 27) because they may consist of paralogous sequences that were amplified with the same primer pair. To avoid the inclusion of additional nonallelic (*i.e*., paralogous) sequences every sequence was compared to the Col-0 genome sequence using the BLASTN program. If the sequences of a STS locus had best hits with different regions in the genome, the STS was excluded from further analysis (*n* = 26). After these filtering steps, the consensus sequences were aligned separately for every STS locus with ClustalW (Thompson *et al.* 1994) and the alignment was manually corrected with a sequence editor. Only STS loci with at least 100 alignment positions and a sequence from at least eight accessions were retained for further analysis. Coding regions of alignments were annotated using the Col-0 reference sequences and the MIPS annotation (version 11, July 2003; ftpmips.gsf.de/cress). Before the analysis, alignments were trimmed to exclude regions or sequences with missing or low-quality data. With the exception of the visual control of the alignment quality and problematic annotations, the whole assembly and annotation process was performed in an automated fashion and controlled by programs written in the Python programming language.

### Population genetic analysis:

The standard population genetic analyses were carried out with the program DnaSP 3.99 (Rozas and Rozas 1999). Nucleotide diversity of a multiple alignment was calculated as π (or θ_{T}), the average pairwise nucleotide divergence (Tajima 1983; Nei 1987), and θ_{W}, the number of segregating sites (Watterson 1975). Nucleotide frequencies were analyzed by calculating the following test statistics: Tajima's *D* (Tajima 1989); Fu and Li's *D**, *F** (no outgroup) and *D*, *F* (with outgroup; Fu and Li 1993); Fay and Wu's *H* (Fay and Wu 2000); and a modification of Tajima's *D*-statistic, *D*/*D*_{min} (Schaeffer 2002). The latter was calculated as *D*/*D*_{min} = (π − *S*/*a _{n}*)/|π

_{min}−

*S*/

*a*| and

_{n}*π*

_{min}= 2

*S*/

*n*, where

*S*is the number of segregating sites and . In an analogous manner, we also calculated

*H*/

*H*

_{min}for Fay and Wu's

*H*-statistic. In this case, . Under a given demographic model the expected values of these statistics are homogeneous for different loci with different numbers of segregating sites, which thus facilitates comparisons among loci (Schaeffer 2002). The test of McDonald and Kreitman (1991) was calculated for STS with >80 aligned codons and with an outgroup sequence. Population genetic analyses based on haplotype structure were not performed, because this information was incomplete for many of the studied loci (

*i.e.*, internal columns of the alignment were masked as a consequence of the automated masking of low-quality sequences). Sequence divergence for silent, synonymous, and nonsynonymous variation was calculated using the correction of Jukes and Cantor (1969), as implemented in DnaSP 3.99 (Rozas and Rozas 1999).

### Effect of sequencing errors:

Tajima's *D-*, Fu and Li's *D**-, and Fay and Wu's *H*-statistics are sensitive to rare polymorphisms and should be affected by sequencing errors. To investigate the effect of sequencing errors, we conservatively assumed that all errors result in singleton polymorphisms. The number of expected false-positive singleton polymorphisms (*i.e.*, sequencing errors) was calculated by multiplying the total number of base pairs across all alignments with a given error rate of false base calls. From the set of observed singletons, the corresponding number was randomly selected and subsequently removed from the alignments. Then, Tajima's *D*, Fu and Li's *D**, and Fay and Wu's *H* were calculated using these alignments. The last three steps were repeated 100 times per error rate, and the averages and standard deviations were calculated from individual simulations. Sequencing error rates investigated ranged from 10^{−7} to 10^{−2}.

Although the observed mean of these statistics changes with sequencing errors, changes are minor over a wide range of error rates (Figure 1). We used the neighbor quality standard (NQS; Altshuler *et al.* 2000) for quality trimming. A phred quality score of 30 for the focal base call and a score of 20 for the five neighboring base calls on the left and right sides, respectively, were used as cutoff scores. With these parameters the expected sequencing error rate for single reads is 5 × 10^{−5} false base calls per base (Altshuler *et al.* 2000). However, since two-thirds of total bases were sequenced from both directions, the real error rate is likely to be much lower. On the basis of simulation results, we expect that sequencing errors do not significantly affect our analyses of polymorphism.

### Multilocus analysis:

An estimate of genome-wide nucleotide polymorphism, θ, was obtained by maximum-likelihood (ML) analysis using an analytical result from coalescence theory that infers the probability of θ given the number of segregating sites, *S*, and sample size *n* at a locus (Tavaré 1984, Equation 9.5):

We first calculated the probabilities of individual loci for a range of θ per nucleotide ranging from a minimum of 0.0001 to a maximum of 0.10. To correct for differences in the alignment length, θ was multiplied by the sequence length and then used to calculate the probabilities. The multilocus likelihood for every value of θ was then calculated as the sum of the log probabilities across loci for a given value of θ. To test the hypothesis that the data are better described by more than one value of θ, likelihoods were calculated for two (M2), three (M3), and four (M4) different θ values for the whole sample and individual values for every locus (M no. loci). We first sorted loci numerically according to locus-specific ML estimates of θ. Then, for the M2 model, we used this sorted list to separate the loci into two different groups and searched for the best likelihood values in each group. The two exclusive groups with the best likelihood were then chosen. We also used this strategy to obtain θ values for the M3 and M4 models, although a heuristic search was performed because not all possible groups of loci could be investigated. In this search, all loci were first grouped randomly for 500 iterations, and then 500 additional iterations were performed with the best groups of loci to estimate the likelihood. Likelihood-ratio tests (LRT) for comparisons between different models were performed. The distribution of the LRT statistic was obtained by coalescent simulations and not from a χ^{2}-distribution since it was difficult to ascertain the degrees of freedom. We simulated 1000 new samples using the simplest model (for example, comparing model M1 *vs.* M2, we simulated 1000 times the number of segregating sites for each locus given a θ value per nucleotide estimated by the model M1). We then recalculated the maximum likelihood for each simulated sample and stored the LRT value of each iteration. Finally, the LRT value for the observed data was contrasted with the simulated LRT distribution.

Levels of variation at synonymous and in noncoding sites were calculated by ML. We also calculated the likelihood under the assumption that both groups of sites have the same level of polymorphism (one θ is estimated) and that θ values differ between the two groups of genes. The likelihoods were then compared in a LRT.

To compare the observed genome-wide distribution of descriptive statistics with the expected distribution under a standard neutral model, we performed coalescent simulations conditioning on the genome-wide estimate of θ (using one or more θ values) obtained by maximum-likelihood estimation (Hudson 1990, 1993), using programs by S. Ramos-Onsins (unpublished data) and R. Hudson (Hudson 2000). For multilocus analyses, conditioning on θ is preferable to *S* (S. Ramos-Onsins, unpublished data). Simulations were carried out individually for each locus and repeated 10,000 times. We calculated two kinds of statistics:

The average and the population variance for a given statistic were calculated for all the loci combined in each iteration, as described by J. Hey in the HKA program (http://lifesci.rutgers.edu/heylab). For statistics that use an outgroup sequence (Fu and Li's

*D*and*F*and Fay and Wu's*H*), simulations were carried out by taking into account recurrent mutations at the same position, and we included the ratio of transition*vs.*transversion (*s*/*v*) and the time of divergence from the outgroup species. When*A. lyrata*was taken as the outgroup, we used*s*/*v*= 1.2 (observed) and the time to the ancestral species was calculated as*T*_{out}≃ divergence_{observed}/2θ (Hudson*et al.*1987), equal to eight measured*N*generations. In the case of*B. drummondii*,*s*/*v*= 1.0 and*T*_{out}≃ 12.*P*-values were calculated as the probability that a neutral, simulated value of a given statistic was smaller or larger than the observed value.Sign tests for a given statistic were performed by comparing the observed value with the median (

*i.e.*, 50% of the values) obtained by coalescent simulations and assigning a positive (negative) unit if the observed value was larger (smaller) than the median (Sokal and Rohlf 1995). Finally, the sum of all positive and negative values was compared to a binomial distribution having a probability of*P*= 0.5 to obtain the critical values of the test.

Multilocus HKA tests (Hudson *et al.* 1987) were calculated for silent positions using the silent segregating sites and the silent divergence value (corrected after Jukes and Cantor 1969). HKA tests were calculated separately for loci with *A. lyrata* and *B. drummondii* as outgroup sequences. The significance of the HKA tests was obtained using a χ^{2}-distribution (Hudson *et al.* 1987).

### Alternative demographic models:

The population history of *A. thaliana* appears to be determined by separated populations during the Pleistocene and subsequent expansion into Central and Eastern Europe during the past 18,000 years (Sharbel *et al.* 2000). For this reason, three alternative demographic models were evaluated that take the demographic history into account.

### Logistic growth model:

This model estimates the ML parameters under logistic growth given the frequency spectrum of segregating sites (with or without outgroup). In the logistic growth model (*e.g.*, Fu 1997), the population size changes back in time as follows: Here, *N _{t}* is the population size at time

*t*in the past (expressed in

*N*

_{0}generations),

*N*

_{0}is the population size at the present time

*t*

_{0}, and

*N*

_{1}is the population size at time

*t*

_{1}. The population growth rate, γ, was fixed to γ = 10/(

*t*

_{1}−

*t*

_{0}). Estimated model parameters include θ, the start and end time points of the logistic growth phase, and the growth rate

*N*

_{0}/

*N*

_{1}(setting

*N*

_{0}= 1 as the current population size). Likelihood estimates were calculated by generating 100–200 coalescent trees for each set of parameters with no intralocus recombination. We then calculated the probability of the same number of mutations occurring with frequency

*i*as observed in the data by using a Poisson distribution (J. Wakeley, personal communication). The log-likelihood value was calculated for each locus and the sum of all loci was stored. The best parameters were searched for by using a grid of 10,000 parameter values. Also, Metropolis-Hastings Markov chain Monte Carlo (Metropolis

*et al.*1953; Hastings 1970; Kuhner

*et al.*1995) sampling was performed. We used the simplest algorithm for sampling parameters and accepted the new parameters in the chain if where logran(0, 1) is the logarithm of a random number between 0 and 1, and log

*P*(

*D*|

*E*′)

*is the logarithm of the probability that the observed frequency spectra will have the new sampled parameters in the locus*

_{i}*i*, and log

*P*(

*D*|

*E*)

*is the same but given the previous parameters in the Markov chain. Otherwise, we rejected the new parameters and accepted the previous parameters for the current step in the chain. The range of displacement for each parameter and for each iteration was adjusted empirically. Five chains of 10*

_{i}^{6}iterations starting from distant parameter values were run first, and then a final chain was run to obtain the best parameters for this model. Using this approach, the likelihood values for the same parameter values varied depending on the number of coalescent simulations performed. Therefore, a LRT assuming a χ

^{2}-distribution might be too liberal. For this reason, we also calculated averages and variances across all loci for all summary statistics using the parameters estimated with the logistic growth model to evaluate whether the estimated parameters fit the observed data.

### Refugia and bottleneck models:

The second demographic model assumes a single panmictic population in the distant past and then a subdivision into several refugia for a certain time period and a subsequent admixture into a single current population (Figure 2; Hudson 2000; Wall 2000). The time (in generations) from the present to the past split into several small populations (refugia), the number of refugia, the relative population size of each refugium (considering the present population size as *N*_{0} = 1), and the time until refugia merged into a single population were taken as general parameters for all loci. Values of θ and the relative contribution of each refugial population to the present population (freqr) were individual parameters for each locus. To be more conservative, the recombination parameter was set to zero, and, to simplify, the parameter freqr was considered equal for all loci. In the case of the bottleneck model, we used the same parameters as in the refugia model but the number of refugia was fixed to one. A ML analysis of the refugia model requires estimation of a large number of model parameters and is currently too expensive computationally, but we performed an exploratory analysis with a limited number of parameter combinations. In these analyses, we assumed a present effective population size of 2*N* = 10^{6} and one generation per year. The time (from present to past) until refugia merged into a single population was fixed to cold periods at 10^{4} generations ago (2 × 10^{−2} relative to *N* generations) and 2 × 10^{4} (4 × 10^{−2}), and the refugia were maintained for 10^{3} or 5 × 10^{3} generations (10^{−2}) until a split into refugial populations. The relative population sizes in the refugia phase were set to 0.05, 0.01, 0.005, and 0.001 in relation to the present, and the number of refugia was set to 1 (bottleneck), 2, 5, and 10 refugia. The relative ancestral population sizes of these refugial populations (*N*_{ancestral}/*N*_{0}) were assumed to have been smaller than the current one and were fixed to 0.1, but 1 was also considered. To better adjust the model, we also tried additional parameter values in the vicinity of the best-fitting parameter combinations.

### Analysis of recombination rates and codon usage:

Local rates of recombination were obtained by comparing the physical and genetic distance of markers by polynomial regression as described by Zhang and Gaut (2003), who essentially applied the method of Kliman and Hey (1993). Centromeric regions were defined by taking the physical position of the genetically defined centromeres (Copenhaver *et al.* 1999) on the pseudochromosomes (obtained from MIPS, v270703) and then extension of these regions in both directions from the putative midpoint of the centromere by adding the estimated length of centromeres (Haupt *et al.* 2001).

Correlations of genetic diversity with GC content at silent sites (GC3) and estimated local recombination rate were analyzed by linear, quadratic, and quantile regression analysis using the R statistical package (http://www.r-project.org). The latter method is suitable for testing whether the edges of distributions with polygonal shapes reflect a random pattern or a significant relationship between two parameters (Scharf *et al.* 1998; Koenker and Hallock 2001). We used the quantreg package of R and a quadratic model to compare the quadratic regression coefficients for 50–95% quantiles and calculated the significance of observed regression coefficients by comparing 10,000 permutations of value pairs.

### Data availability:

The sequences generated for this study were submitted to the STS section of GenBank under accession nos. CW672529, CW672721. Annotated alignments can be obtained from http://kiwi.ice.mpg.de/athapop.

## RESULTS

### Summary of polymorphism and divergence:

Of 595 STS loci sequenced by Schmid *et al.* (2003), 334 were retained for further analysis after quality trimming. For the present study, we obtained 118 STS (20%) from *A. lyrata* and 82 STS (14%) from *B. drummondii*. Of these sequences, we included 31 from *A. lyrata*, 26 from *B. drummondii*, and 14 from both outgroup species in the analysis. Outgroup sequences were then available for 71 STS loci. A summary table and a physical map of STS loci in *A. thaliana* is provided as supplementary information at http://www.genetics.org/supplemental/.

Average levels of polymorphism are summarized in Table 1. The majority of loci are distributed close to the mean of the silent variation (θ_{W} = 0.006), and few loci (39) are highly polymorphic (θ_{W} > 0.02; Figure 3). The relationship between synonymous and nonsynonymous polymorphism in coding regions is shown in Figure 4A. Among the set of 153 STS, which cover protein-coding regions of at least 80 codons, 90 (59%) have a ratio of π_{n}/π_{s} < 1, suggesting purifying selection against deleterious amino acid replacement polymorphisms. Among the most polymorphic genes, five have a π_{n}/π_{s} ratio of close to 1. They are annotated as “hypothetical” or “putative” and may constitute redundant, nonfunctional, or incorrectly annotated genes. Three of these genes contain at least one allele with a premature stop codon or an out-of-frame insertion/deletion, suggesting that they are pseudogenes.

Levels of sequence variation were not correlated with codon usage measured as the GC content at silent sites (GC3; *R*^{2} = 0.0036, *P* > 0.5). Diversity at noncoding sites is 1.2-fold lower than that at coding synonymous sites and 7.6-fold lower at nonsynonymous than at synonymous sites of coding regions (Table 1). When noncoding and synonymous variation were estimated by ML (θ_{noncoding} = 0.0091, θ_{synonymous} = 0.0088, ML = −414.33 − 398.41 = −812.74) and compared to a θ_{silent} value estimated by combining all sites (θ_{silent} = 0.0090, ML = −812.82), the diversity between the two types of sites was not significantly different (LRT = 0.148, *P* = 0.70). Therefore, we combined synonymous and noncoding sites in the following analyses.

### Effect of recombination:

The levels of variation across the genome can be affected by different local recombination rates. Polymorphism levels do not differ between centromeric and noncentromeric regions (Table 2) and within noncentromeric regions there is no correlation between recombination rate and silent diversity (not shown). In regions of low and high recombination, polymorphism levels tend to be reduced (Figure 5), but linear (*R*^{2} = 0.04; *P* > 0.1), quadratic (*R*^{2} = 0.06; *P* > 0.1), and quantile regression coefficients (50–95% quantiles; *P* > 0.05) were not significant. In *A. thaliana*, pairwise linkage disequilibrium can extend up to 250 kb (Nordborg *et al.* 2002), suggesting that neighboring loci may have similar levels of polymorphism. This was confirmed by our observation of a higher proportion of loci with similar polymorphism levels among closely located (<250 kb) than among distant loci (using a small difference of 0.005 as cutoff; *G*-test, *P* = 0.039). Therefore, to be conservative, we used only loci that are separated by at least 250 kb for the analysis of demographic models (195 loci).

### Testing a neutral panmictic model of evolution:

We first asked whether the data are more consistent with a panmictic neutral model with 1, 2, 3, 4, or 195 different θ_{silent} values. By comparing observed values with distributions obtained by coalescent simulations and testing for differences in an LRT (Table 3), we found that a model with two different estimates for silent θ (M2; θ_{1} = 0.0041, θ_{2} = 0.0213) best explained the distribution of silent variation under a neutral model. These estimates were used in the following simulations.

Simulated distributions of Tajima's *D*, *D*/*D*_{min}, and Fu and Li's *D**- and *F**-test statistics were compared with the observed distributions (Table 4 and Figure 6). Since the results obtained with *D*, *D*/*D*_{min}, and Fu and Li's *F**- and *D**-statistics are very similar to those with Tajima's *D*, they are not mentioned further in the text but are shown in the tables. Simulated and observed distributions differed for most summary statistics, indicating that patterns of nucleotide polymorphism in our data are not consistent with a panmictic neutral model. There is an excess of low-frequency mutations as indicated by a negative average of Tajima's *D* and larger than expected variances of empirical distributions. It should be noted that these analyses are conservative because they assume no intralocus recombination (Tajima 1989).

For the analyses of Fu and Li's *D-* and *F-* and Fay and Wu's *H*-test statistics, which require an outgroup sequence, we used 43 STS loci with *A. lyrata* and 20 STS loci with *B. drummondii* as outgroups. These loci are unlinked (physical distance >250 kb). Averages and variances of Fu and Li's *D-* and *F*-statistics differed significantly, but Fay and Wu's *H*-statistic was concordant with the neutral panmictic model (Table 4 and Figure 6). Thus, the deviation from the neutral panmictic model is not a consequence of an excess of a high frequency of derived mutations.

Since two groups of loci with low and high levels of variation were observed, we tested whether distributions of summary statistics differ between them (Table 6). In both cases, the average of Tajima's *D* is negative and significantly different from the expectations of the neutral model. When variances are considered, only the highly polymorphic group differs from the neutral model. The observation of an increased variance of test statistics in the group of highly polymorphic loci may be interpreted as a footprint of selection or of other evolutionary forces (see below).

Multilocus HKA tests based on silent polymorphisms were computed for all loci with outgroup sequences. The HKA test results were highly significant for 43 loci with an *A. lyrata* outgroup sequence (χ^{2} = 146.87 with 42 d.f., *P* < 0.0001; Figure 7) and for 20 loci with a *B. drummondii* outgroup sequence (χ^{2} = 54.38 with 19 d.f., *P* < 0.0001). Six of 43 loci (14%) were mainly responsible for the high significance of the HKA test in the comparison between *A. thaliana* and *A. lyrata* (*AtV23*, *TIGR3437*, *GOLM25*, *At3est48*, *AtV11*, and *TIGR1736*), and 4 of 20 loci (20%) between *A. thaliana vs. B. drummondii* (*GOLM66*, *TIGR1744*, *TIGR1518*, and *GOLM80*). Most of these loci have higher levels of polymorphism and lower levels of divergence than expected and are included in the group of highly polymorphic genes (Tables 3 and 6). Significant multilocus HKA tests are not consistent with the neutral panmictic model and may result from selection. It should be noted that a deviation from the assumption of panmixia may increase the variance of the observed data and contribute to the significant test result (Hudson *et al.* 1987; Ramos-Onsins *et al.* 2004).

As a final test of the neutral model, we compared the levels of polymorphism and divergence between synonymous and nonsynonymous sites for loci with >80 codons, using the McDonald-Kreitman (MK) test (McDonald and Kreitman 1991). Individual MK tests were significant only in 3 of 63 loci (*AtIV20*, *AtV9*, and *GOLM29*). This result is expected under the assumption of a neutral panmictic model with a 5% rejection probability for each locus (3 rejections/63 total loci = 4.8%).

### Testing alternative demographic models:

Since a standard neutral model is not consistent with the observed data, we considered alternative demographic models. *A. thaliana* occurs frequently at disturbed sites, which have expanded over the last 6000 years with the spread of human agriculture; hence it seems plausible to consider a model of recent population expansion. Furthermore, we observed a significant negative average of Tajima's *D*, which may be interpreted as a footprint of recent population expansion (Tajima 1989). We first studied a logistic growth model with four parameters (see materials and methods) and found little statistical support for this model, because a higher likelihood was obtained with a model of constant population size (195 STS; ML = −1557.3) than with logistic growth (ML = −1562 for an estimated growth rate of 0.74). A higher likelihood for the constant population size model suggests that population expansion alone is not sufficient for explaining observed patterns of polymorphism.

We also compared our data against parameter combinations that take into account population expansion, such as those estimated by Innan and Stephan (2000) using a smaller number of loci (θ = 0.10, *N*_{1}/*N*_{0} = 0.57, *T*_{start} = 0, and *T*_{end} = 0.6, measured in *N* generations). Using these parameters, we always found a decrease in the variance of Tajima's *D* relative to the observed variance (see Figure 6 and Table 5). The high variance exhibited in the observed distribution of Tajima's *D* also suggests that our data are not consistent with a simple population expansion model.

During the Pleistocene, *A. thaliana* may have been restricted to several isolated refugial populations before expanding and merging into the widespread distribution seen in the present (Sharbel *et al.* 2000). To account for such a population structure, we considered the refugia and bottleneck models (materials and methods). Again, we were unsuccessful in explaining the observed data, although only a few combinations of parameters were considered (Table 5). For some parameter combinations, we obtained negative averages for Tajima's *D* and the Fu and Li statistics, but did not observe the high variances seen in the simulations of the neutral model, although they also can be a consequence of population subdivision (Ramos-Onsins *et al.* 2004). In no case did the simulated distributions correspond with the negative averages and high variances of Tajima's *D* in the observed data. The same results were obtained with the bottleneck model, which is a refugia model with a single refugium (Table 5). This suggests that a more complex model incorporating additional parameters is necessary to explain the observed data.

### Identification of nonneutrally evolving outlier loci:

The genome-wide distribution of nucleotide variation is a mixture of distributions resulting from demographic processes and selection. For this reason, the high variance we observe in the distribution of the test statistics may be caused by “outlier loci” that evolve under positive or balancing selection. Our data set contains 28 STS loci that reject a standard null model in Tajima's *D*, Fu and Li's *D**, Fay and Wu's *H*, and the HKA and McDonald-Kreitman tests of neutrality and thus may have been targets of recent selection (supplementary information at http://www.genetics.org/supplemental/). These outlier loci may be responsible for the high variance observed in the empirical distributions and the rejection of a neutral model. A better fit to a neutral demographic model may be observed if they are excluded from the analysis (Luikart *et al.* 2003).

To evaluate this possibility, we identified all loci in the 2.5% tails on both sides of the empirical distributions of all statistics (see supplementary information). Thirteen loci (6.8%) were excluded from the set of 195 loci and 12 loci (27.9%) from the set with an outgroup sequence from *A. lyrata*. In the latter case, we also excluded loci with significant MK and HKA tests. The removal of outlier loci resulted in smaller variances, but they still differed significantly from expected variances (not shown). However, we obtained a better fit to the logistic growth model with the set of loci containing an outgroup sequence (*n* = 31, θ = 0.011, *N*_{1}/*N*_{0} = 0.235, *T*_{start} = 0.0158, and *T*_{end} = 0.0817, measured in *N* generations; ML = −140.85 when compared to the constant size neutral model, θ = 0.004, ML = −145.44). No better fit was observed with the larger data set.

We also evaluated the effect of outlier loci by removing loci with high estimates of θ identified by the M2 model described above (Table 3). Using the remaining 128 loci (66%), we obtained a better fit to the logistic growth model (θ = 0.0103, *N*_{1}/*N*_{0} = 0.211, *T*_{start} = 0.0180, and *T*_{end} = 0.0796, measured in *N* generations; ML = −585.01) than to the standard neutral model (θ = 0.004, ML = −603.15). The parameter estimates for the logistic growth model are similar to the ones in the previous analysis. In addition, observed averages and variances of all test statistics were compared with simulated distributions using the estimated parameters of the logistic growth model (not shown). No test showed a significant difference between observed and expected distributions, suggesting that a logistic growth model is more consistent with the patterns of variation at the 128 loci than is a neutral model of constant population size. Thus, some of the 67 high-diversity loci may have been either targets of selection or affected by demographic processes not included in our models. In the latter case, the exclusion of outlier loci may have resulted in an artificial reduction of the variance and thus an improved fit to the demographic model.

## DISCUSSION

### Genetic diversity in *A. thaliana*:

The main goal of our study was to conduct a multilocus analysis of nucleotide polymorphism in *A. thaliana* to identify patterns of variation that are visible at the genome-wide level. An understanding of such patterns is important for the interpretation of genetic variation at individual loci of interest. We analyzed patterns of genetic variation using a sequencing survey of 334 loci from 8–12 accessions at randomly chosen regions of the *A. thaliana* genome and in the outgroup species *A. lyrata* ssp. *lyrata* and *B. drummondii*. We used summary statistics to describe the observed patterns of variation and performed ML analyses to estimate population parameters from the data.

Sequence diversity in *A. thaliana* has been analyzed previously at more than a dozen loci, but the relative roles of demography and selection on patterns of genetic variation have not been rigorously studied, except by Innan and Stephan (2000). The average level of silent nucleotide diversity observed among all loci in our data (θ_{W} = 0.00896) is very similar to the mean diversity of 14 genes surveyed previously (0.009; Shepard and Purugganan 2003). Furthermore, we find an excess of low-frequency polymorphisms as indicated by an average negative estimate of Tajima's *D*, which has also been noted in earlier studies (Purugganan and Suddith 1999; Kuittinen and Aguadé 2000; Shepard and Purugganan 2003). Although previous studies focused on specific regions or protein-coding genes of particular interest, our data agree with levels and patterns of variation observed in these studies.

The large number of loci investigated in this study allowed us to search for possible causes of different levels of polymorphism among loci. A higher diversity at synonymous sites than at noncoding sites was also found in maize, Drosophila, and humans (Tenaillon *et al.* 2001), but the differences are more pronounced (1.5- to 2-fold) in these species than in our study (1.2-fold). Furthermore, synonymous diversity and GC3 were not correlated. The low level of codon usage bias of *A. thaliana* (Duret and Mouchiroud 1999) suggests that selection for optimal codon usage is not very strong and that synonymous polymorphisms are completely neutral, possibly because of a reduced effective population size due to the high selfing rate (Bustamante *et al.* 2002).

The local rate of recombination is correlated with nucleotide diversity in a number of plant (Dvorák *et al.* 1998; Kraft *et al.* 1998; Stephan and Langley 1998) and animal species (*e.g.*, Begun and Aquadro 1992; Glinka *et al.* 2003). We did not find such a relationship in *A. thaliana*, suggesting that either differences in the effective rate of recombination are too small to have a measurable effect on nucleotide diversity or background or positive selection are not strong enough to cause the observed relationship between genetic diversity and recombination found in other species. Similarly, the local recombination rate is not correlated with the distribution of transposable elements (Wright *et al.* 2003) and only weakly correlated with the frequency of tandemly repeated genes (Zhang and Gaut 2003). Variation in recombination rates does not appear to be a strong force in structuring the genome of *A. thaliana*. On the other hand, the observation of similar levels of nucleotide diversity among neighboring genes (<250 kb) may result from a low effective recombination rate and extended regions of the genome then have similar evolutionary histories (Nordborg and Tavaré 2002). This finding is supported by the observation of correlated polymorphism levels within a 40-kb genomic region around the *CLAVATA2* locus (Shepard and Purugganan 2003) and within a 170-kb region around the *MAM* locus of *A. thaliana* (Haubold *et al.* 2002). Correlated patterns of polymorphism among physically linked loci that result from background or positive selection occurring in a particular genomic region may interfere with the analysis of the demographic processes. Therefore, to minimize the effect of selection on our analysis of demographic models, we included only physically distant loci.

### Rejection of a neutral panmictic model of evolution:

Our data show a significant, genome-wide deviation from a standard mutation-drift model of evolution (Table 6), which may result from the absence of panmixia, temporary changes in population size and structure, or selection at independent loci. We first want to consider possible effects of population structure and population growth on our analyses. The accessions included in this survey are genetically distantly related, as indicated by a genealogy with long terminal branches (Schmid *et al.* 2003). Every accession therefore is assumed to represent a single individual from different local demes, making it impossible to observe a deviation from the standard mutation-drift model at the level of the deme (*e.g.*, Wakeley 2004). Using Wakeley's terminology (Wakeley 1999), our accessions represent the “collecting phase” of the coalescent of a metapopulation (*i.e.*, the coalescing of lines from different demes) and not the “scattering phase” (coalescence events within individual demes). The collecting phase of a coalescent is equivalent to a single standard population (Wakeley and Aliacar 2001) and can be analyzed with the general coalescent as we have done in this study. This conclusion seems robust enough to be applicable to different metapopulation structures (Wakeley 2001; Wakeley and Aliacar 2001).

Under this assumption, deviations at a genomic level from the neutral panmictic model might be a consequence of those events that affect the entire metapopulation structure, like temporary changes in population size (expansion, bottlenecks), or also subdivision of the metapopulation (*i.e.*, subdivision of this species in isolated refugia for a certain time period). The significant excess of low-frequency polymorphisms could be explained by an expansion process, as has been suggested (Purugganan and Suddith 1999; Innan and Stephan 2000; Kuittinen and Aguadé 2000), but ML estimates of an alternative logistic growth model were not different from the standard neutral model. Furthermore, the high variance observed in our empirical distributions is not expected under population expansion and a simple expansion model is not sufficient to explain the difference of observed data from a neutral model. We were able to obtain a better fit to a logistic growth than to a standard neutral model when highly polymorphic loci were excluded. However, this result needs to be interpreted with caution because we removed one-third of all loci from the analysis on the basis of a ML analysis of θ values under a neutral model without knowing whether these loci are highly polymorphic due to selection or due to a neutral process such as a locally increased mutation rate or admixture from previously separated subpopulations representing glacial refugia (Sharbel *et al.* 2000). To account for the latter possibility, a refugia model and bottleneck models were also considered. Although we investigated only a few biologically realistic parameter combinations, we could not detect a combination that was consistent with the observed data.

A second explanation for the observed deviation from a neutral model is selection. If selection occurred only at single loci, its effect on the distribution of summary statistics should be minor given the large number of loci analyzed. On the other hand, selective sweeps or balancing selection at many loci may contribute to the significant deviation of the mean and the variance of Tajima's *D* from the expectation of a neutral model. Such an explanation is not supported by the analysis of a subset of loci, for which we were able to obtain a sequence from one of outgroup species, because the observed distribution of Fay and Wu's *H* was not significantly different from the expectation of a standard neutral model. Thus, the negative averages of Tajima's *D* and Fu and Li's statistics do not seem to result from positive selection at or hitchhiking of multiple loci.

Alternatively, purifying selection against slightly deleterious mutations may have caused the excess of low-frequency polymorphisms. The efficacy of selection against deleterious mutations may be weak compared to that of other species (Bustamante *et al.* 2002), but it is nevertheless operating in *A. thaliana*, as indicated by the lower nonsynonymous than synonymous nucleotide diversity (Figure 4). The effect of purifying selection on the frequency spectrum is difficult to quantify, especially in noncoding regions, because little is known about their functional and evolutionary constraints. As discussed above, synonymous polymorphisms may not be exposed to purifying selection and thus contribute little to the deviation from a neutral model. Although it does not appear that positive or negative selection is solely responsible for the observed deviation from a neutral model, the observation of highly negative values of Fay and Wu's *H* at some loci and the high variance in the distribution of summary statistics may result from variable selection pressures at different loci.

It should be noted that we did not take the geographic origin of accessions used for this study into account. The excess of rare polymorphisms may also result from fixation of locally occurring polymorphisms by selection and thus represent geographic differentiation between demes of a self-fertilizing species (Hedrick and Holden 1979). Such an interpretation is consistent with a weak, but significant presence of a geographic population structure in the natural species range (Sharbel *et al.* 2000). However, our current sample of 12 accessions is not large enough to address this question.

### Modified tests of neutrality:

A frequent goal of surveys of sequence variation is to evaluate whether a particular gene evolves under selection. Furthermore, genome-wide analyses of genetic variation aim at identifying novel “adaptive trait genes” that were subject to positive or balancing selection (reviewed by Luikart *et al.* 2003). In sequence surveys, the rejection of the null hypothesis of neutral evolution in one of the numerous available tests of neutrality is often taken as evidence that a gene evolves under selection (*e.g.*, Purugganan and Suddith 1998; Olsen *et al.* 2002; Kroymann *et al.* 2003; Mauricio *et al.* 2003). According to our results, a standard neutral model should not be used as a null hypothesis for neutrality tests in *A. thaliana* because of the effects of demographic history on nucleotide diversity. Thus, if one attempts to test the hypothesis that a given gene has been the target of natural selection, the challenge consists of differentiating between the effects of demography and selection on genetic variation.

One approach to account for demography in tests of selection is to use a modified null model that incorporates the demographic history of a species and selection at independent loci. This allows the estimation of demographic parameters and of the likelihood that individual loci evolved under selection. Using our data, we could not formulate an alternative model for such a purpose, because we were not able to identify all the demographic and selective forces that have shaped observed variation. It will be a considerable challenge to develop such a model for a species with a complex demographic history given the large number of parameters involved. An alternative approach is to use empirical distributions of various descriptive statistics derived from randomly sequenced genomic loci and to identify the outlier loci in such statistics (Black *et al.* 2001; Luikart *et al.* 2003). Outlier loci are defined as falling into the extreme tails of the empirical distribution and thus exhibit unusual patterns of variation. We have calculated the critical values for various descriptive statistics using the empirical distributions obtained from our data (Table 7) and they may be useful for neutrality tests in future sequence surveys of novel genes of interest. In this case, however, one would have to consider the effect of using different accessions and different sequence lengths on descriptive statistics of nucleotide diversity before comparing them to such a distribution (Pluzhnikov and Donnelly 1996). A modification of this empirical approach is to use combinations of test statistics and to identify those genes that reject more than one neutrality test. In our data set, a small number of loci (*e.g.*, *AtV9*, *AtV11*, *AtV23*, and *GOLM80*) fulfill this criterion. Candidate adaptive trait genes can be easily found by such an approach, but further investigations will be necessary, because the short genomic segments studied here are not sufficient to fully characterize patterns of polymorphism at a locus and thus to infer the role of selection. Despite these concerns, the use of empirical distributions appears to be a useful alternative to a model-based analysis to identify genes that may have been targets of selection because the demographic history is taken into account.

## Acknowledgments

We thank the Gesellschaft für wissenschaftliche Datenverarbeitung (GWDG) in Göttingen for allowing us to use their computing facilities. N. Spies and T. Heinze helped with the programming, and I. Schumacher provided advice on implementing the HKA test. We thank L. Zhang and B. Gaut for providing us with the estimates of recombination rates and M. Clauss, D. de Lorenzo, M. Hamblin, A. Lawton-Rauh, S. Schaeffer, and E. Wheeler for discussion and comments on the manuscript. This work was funded by the German Ministry of Science project grants to T.M.-0.(0312275C/4) and to B.W. (0312275D/7), by the Emmy-Noether program of the Deutsche Forschungsgemeinschaft to K.J.S. (Schm 1354/2-2), and by the Max-Planck Society.

## Footnotes

↵

^{1}These authors contributed equally to this work.↵

^{3}*Present address:*Department of Biology, University of Bielefeld, 33594 Bielefeld, Germany.Sequence data from this article have been deposited with the EMBL/GenBank Data Libraries under accession nos. CW672529, CW672721.

Communicating editor: S. W. Schaeffer

- Received July 21, 2004.
- Accepted November 22, 2004.

- Genetics Society of America