NATURAL selection affects the amount and pattern of DNA variation maintained in a population. However, it is very difficult to find evidence for natural selection from data of nucleotide sequence variation because there are many other factors affecting DNA polymorphism. One of the main factors is the demographic history of a population, including changes in population size and population structure.
To infer natural selection from DNA polymorphism data, Tajima's D test (Tajima 1989) is often used. Tajima's D statistic is based on the difference between two estimates of the amount of variation. One estimate is obtained from the number of segregating sites (Watterson 1975) and the other is based on the average number of pairwise differences (Nei and Li 1979; Tajima 1983). In a constantsize neutral equilibrium population, the expectation of Tajima's D is nearly zero because the expectations of both estimates are the same. When some kind of balancing selection is acting, Tajima's D tends to be positive. On the other hand, purifying selection can generate negative values of Tajima's D.
However, as discussed in Tajima (1993) and Charlesworth et al. (1993), changes of population size can also affect Tajima's D. In a population with decreasing size, the expectation of Tajima's D is positive, while a negative Tajima's D is predicted for a population with increasing size. Therefore, when the observed value of Tajima's D deviates significantly from 0, it is very difficult to know what the cause of such a deviation is. To understand the mechanism maintaining nucleotide variation in a population, it is important to distinguish these two factors, natural selection and changes of population size.
In this note, the effects of these two factors are successfully distinguished in the wild plant, Arabidopsis thaliana. A. thaliana is a highly selfing species with <1% outcrossing (Abbott and Gomes 1989). It is known that the A. thaliana natural population consists of a number of colonies. Sampled individuals from a colony are called ecotypes. There is almost no genetic variation within each colony (Todokoroet al. 1995; Bergelsonet al. 1998), suggesting that the members of each colony may share a very recent common ancestor and that migration among colonies is rare. Thus, intraspecific variation in A. thaliana depends largely on variation among ecotypes. Usually, population studies of this species are conducted for a sample of ecotypes.
It is believed that the A. thaliana population expanded recently, originating from an area of the Himalayas (Priceet al. 1994). Innan et al. (1996) investigated DNA sequence polymorphism of the Adh region and supported this idea because a number of rare variants with a negative Tajima's D value were observed. A lack of association between the genealogy and geographic origin of the sampled ecotypes is also consistent with a recent expansion of the A. thaliana population. Similar results were obtained for other nuclear gene regions (e.g., ChiA, Kawabeet al. 1997; and CAL, Purugganan and Suddith 1998). At this time, data for eight gene regions are available, of which six loci exhibit a negative Tajima's D value. If we exclude Rpm1, where strong overdominant selection may be acting (Stahlet al. 1999), the average of Tajima's D is −1.34 (SD = 0.90), supporting the idea of a recent rapid expansion of A. thaliana.
To further investigate the demographic history of A. thaliana, we introduce a coalescent model of a metapopulation with N colonies. We assume that the size of all colonies is the same and that there is no migration among colonies. We also assume that a colony appears and gets extinct at fixed rates and that the number of colonies is constant. Consider a case where n ecotypes are sampled. To study the genealogical relationship among the ecotypes, we apply the theory of coalescence processes (Griffiths 1980; Kingman 1982; Hudson 1983; Tajima 1983) in the following way. For a coalescent among n genes, let t_{i} (i = 1, 2, 3, … , n − 1) be the time during which i + 1 genes coalesce into i genes. Then, the distribution of t_{i} is given by
Next, we consider a metapopulation whose number of colonies is increasing with a constant rate r per g generations. Let N_{0} be the present number of colonies. Following Slatkin and Hudson (1991), the coalescent time (t_{i}) is given by
To estimate the demographic parameter N_{0}r, we analyzed the amplified fragment length polymorphism (AFLP) data of Miyashita et al. (1999). It is reasonable to use AFLP data for this purpose because the anonymous fragments detected in this analysis presumably cover the whole genome. In Miyashita et al. (1999), a total of 472 distinct fragments were observed for 38 ecotypes sampled from around the world. Their observation of a lack of linkage disequilibrium among the fragments suggests that almost all fragments follow independent genealogical histories.
The AFLP data by Miyashita et al. (1999) consist of 38 ecotypes, including an apparent outlier, ecotype Fl3. For Fl3, the observed number of fragments is about twice that of the remaining 37 ecotypes. Fl3 appears to be an allotetraploid hybrid between A. thaliana and one of its close relatives (N. T. Miyashita and A. Kawabe, personal communication). In the following analysis, Fl3 is therefore excluded. For the remaining 37 ecotypes, the average proportion of shared fragments,
Assuming that the number of A. thaliana colonies has increased exponentially, we can find an estimate of the demographic parameter, N_{0}r, by computer simulation using the leastsquares method. The procedure of simulating AFLPs in a constantsize population was developed by Innan et al. (1999) and Miyashita et al. (1999). The simulation method for an exponentially growing population is similar to that of a constantsize population. First, a random ancestral sequence (genome) with the length M bp is constructed. This ancestral sequence consists of four nucleotides, A, T, G, and C, at equal frequencies. Innan et al. (1999) demonstrated that the effect of GC content on the AFLP polymorphism is very small if intraspecific variation is considered. Next, the sequence is divided into m parts with the length M/m. This means that the whole genome is assumed to have m independent genealogies. Recombination rate, genome size, and the number of chromosomes are related to m (see below). m random genealogies for n sequences are constructed and branch lengths are determined by Equation 2. Following a given genealogy for each part, random mutations are placed on the ancestral sequence. The number of mutations on a branch follows a Poisson distribution with mean μTM/m, where μ is the mutation rate per site per g generations and T is the length of the branch measured in generations. Then, we have n descendant sequences, from which we obtain the AFLP fragments according to the primer sequences. Finally, the lengths of the detected fragments are scored for n descendants.
In our simulations, n = 37 is assumed. The mutational parameter, θ_{0} = 4N_{0}μ, is adjusted to produce the average proportion of pairwise shared bands,
Simulating 2000 replicates for a given value of N_{0}r, we obtained the expected frequency spectrum of AFLPs. N_{0}r was varied in the range of 0–1.5 with an interval of 0.1, and the expected distribution of the frequency spectrum was obtained for each N_{0}r. Using these expectations, the fit to the observed spectrum (Miyashitaet al. 1999) was investigated by the leastsquares method. To estimate N_{0}r, we fitted a fourthorder polynomial to the discrete data points. Using this procedure, the minimum value of the sum of squares was obtained for N_{0}r = 0.57. We also used χ^{2} to measure the goodness of fit instead of the sum of squares. In this case, the best fit was also found for N_{0}r = 0.57. Figure 1 shows both the observed spectrum of A. thaliana AFLPs and the expected one with N_{0}r = 0.57. The observed frequency spectrum is in excellent agreement with the expectation.
Using this estimate of N_{0}r = 0.57, we also investigated the properties of Tajima's D (Tajima 1989) and Fu and Li's D* tests (Fu and Li 1993) in an exponentially growing population. In our simulations, neutral genealogies without recombination (Hudson 1983, 1990; Tajima 1983, 1989) were used. Figure 2A shows the distribution of Tajima's D for n = 17. The filled squares represent the distribution of Tajima's D under a constantpopulationsize model (N_{0}r = 0) and the open squares represent that for an exponentially growing population with N_{0}r = 0.57. It is known that the distribution of Tajima's D under a constantpopulationsize model approximately follows the beta distribution with mean = 0 and variance = 1 (Tajima 1989). As shown in Figure 2A, the distribution with N_{0}r = 0.57 is skewed toward negative values. The mean is −0.3 and its variance is 0.55, which is smaller than that under a constantpopulationsize model. Accordingly, it is expected that the confidence limits of the distribution may be very different between these two models. Although the confidence limits for the negative values of Tajima's D are not much different from those of a constantsize model, those for the positive values are much smaller than those under a constantsize model when N_{0}r = 0.57. This implies that Tajima's D does not follow a beta distribution in growing populations. Similar results were obtained for Fu and Li's D* test (Figure 2B). The peak of the distribution for N_{0}r = 0.57 moves toward negative values and the distribution is very different from that in a constantsize model especially for positive values.
The P values for Tajima's D and Fu and Li's D* tests were reexamined for eight nuclear loci in A. thaliana, assuming N_{0}r = 0.57 (Table 1). As expected from Figure 2A, when Tajima's D is <−1.5, the P values with N_{0}r = 0.57 are not much different from those of a constantsize model. For positive Tajima's D, the P value decreases for N_{0}r = 0.57, suggesting that the power of Tajima's D with N_{0}r = 0.57 is stronger than that of a constantsize model. Similarly, when Fu and Li's D* is <−2.0, the P values with N_{0}r = 0.57 are not different from those of a constantsize model. If Fu and Li's D* is positive, the P value with N_{0}r = 0.57 is smaller than that of a constantsize model. In the case of Fu and Li's D* test for the CAL locus, the neutral hypothesis is rejected at the 5% level only when a growing population with N_{0}r = 0.57 is assumed. For two cases, i.e., Tajima's D for PI and Fu and Li's D* for Rpm1, the P value becomes <1% when N_{0}r is changed from 0 to 0.57.
In studies of particular gene regions, an excess of rare polymorphic sites was detected, resulting in significantly large negative values of Tajima's D and Fu and Li's D* (Kawabeet al. 1997; Purugganan and Suddith 1999). As possible causes, an expansion of the A. thaliana population and/or purifying selection against deleterious mutations were proposed. In this study, two neutrality tests, Tajima's D and Fu and Li's D*, were conducted without the effect of population expansion. It was shown that the negative values of both test statistics for AP3, ChiA, and PI were still significant (Table 1). Furthermore, Fu and Li's D* for CAL is significantly negative at the 5% level when N_{0}r = 0.57, although it is not significant in a constantsize model. This result suggests that purifying selection against deleterious mutations is acting in these regions. This is consistent with the observation of a relatively high level of singleton replacement polymorphism in A. thaliana (Kawabeet al. 1997; Purugganan and Suddith 1998, 1999).
In this study, we inferred the rate of population size expansion of A. thaliana from the frequency spectrum of AFLPs obtained by Miyashita et al. (1999). For N_{0}r = 0.57, the observed spectrum is in excellent agreement with the expected one (Figure 1). The reason for this may be that almost all fragments detected as AFLPs are independent of each other, which is consistent with the lack of linkage disequilibrium observed between microsatellite loci (Innanet al. 1997) and AFLP loci (Miyashitaet al. 1999). It may suggest that population surveys using AFLP analysis are an appropriate method to understand the demographic history of a population. Furthermore, using this estimate of population growth, we reevaluated the critical values of two of the most important neutrality tests, Tajima's D and Fu and Li's D*. Discrepancies between constantsize and exponentially growing populations were found for positive values of both test statistics.
Acknowledgments
We thank two reviewers, M. Aguadé and D. Carlini, for valuable comments. A. Kawabe, N. T. Miyashita, and E. A. Stahl provided data sets. This work was supported in part by National Science Foundation grant DEB989679 and by funds from the University of Rochester.
Footnotes

Communicating editor: M. Aguadé
 Received December 10, 1999.
 Accepted April 3, 2000.
 Copyright © 2000 by the Genetics Society of America