NATURAL selection affects the amount and pattern of DNA variation maintained in a population. However, it is very difficult to find evidence for natural selection from data of nucleotide sequence variation because there are many other factors affecting DNA polymorphism. One of the main factors is the demographic history of a population, including changes in population size and population structure.
To infer natural selection from DNA polymorphism data, Tajima's D test (Tajima 1989) is often used. Tajima's D statistic is based on the difference between two estimates of the amount of variation. One estimate is obtained from the number of segregating sites (Watterson 1975) and the other is based on the average number of pairwise differences (Nei and Li 1979; Tajima 1983). In a constant-size neutral equilibrium population, the expectation of Tajima's D is nearly zero because the expectations of both estimates are the same. When some kind of balancing selection is acting, Tajima's D tends to be positive. On the other hand, purifying selection can generate negative values of Tajima's D.
However, as discussed in Tajima (1993) and Charlesworth et al. (1993), changes of population size can also affect Tajima's D. In a population with decreasing size, the expectation of Tajima's D is positive, while a negative Tajima's D is predicted for a population with increasing size. Therefore, when the observed value of Tajima's D deviates significantly from 0, it is very difficult to know what the cause of such a deviation is. To understand the mechanism maintaining nucleotide variation in a population, it is important to distinguish these two factors, natural selection and changes of population size.
In this note, the effects of these two factors are successfully distinguished in the wild plant, Arabidopsis thaliana. A. thaliana is a highly selfing species with <1% outcrossing (Abbott and Gomes 1989). It is known that the A. thaliana natural population consists of a number of colonies. Sampled individuals from a colony are called ecotypes. There is almost no genetic variation within each colony (Todokoroet al. 1995; Bergelsonet al. 1998), suggesting that the members of each colony may share a very recent common ancestor and that migration among colonies is rare. Thus, intraspecific variation in A. thaliana depends largely on variation among ecotypes. Usually, population studies of this species are conducted for a sample of ecotypes.
It is believed that the A. thaliana population expanded recently, originating from an area of the Himalayas (Priceet al. 1994). Innan et al. (1996) investigated DNA sequence polymorphism of the Adh region and supported this idea because a number of rare variants with a negative Tajima's D value were observed. A lack of association between the genealogy and geographic origin of the sampled ecotypes is also consistent with a recent expansion of the A. thaliana population. Similar results were obtained for other nuclear gene regions (e.g., ChiA, Kawabeet al. 1997; and CAL, Purugganan and Suddith 1998). At this time, data for eight gene regions are available, of which six loci exhibit a negative Tajima's D value. If we exclude Rpm1, where strong overdominant selection may be acting (Stahlet al. 1999), the average of Tajima's D is −1.34 (SD = 0.90), supporting the idea of a recent rapid expansion of A. thaliana.
To further investigate the demographic history of A. thaliana, we introduce a coalescent model of a metapopulation with N colonies. We assume that the size of all colonies is the same and that there is no migration among colonies. We also assume that a colony appears and gets extinct at fixed rates and that the number of colonies is constant. Consider a case where n ecotypes are sampled. To study the genealogical relationship among the ecotypes, we apply the theory of coalescence processes (Griffiths 1980; Kingman 1982; Hudson 1983; Tajima 1983) in the following way. For a coalescent among n genes, let ti (i = 1, 2, 3, … , n − 1) be the time during which i + 1 genes coalesce into i genes. Then, the distribution of ti is given by
Next, we consider a metapopulation whose number of colonies is increasing with a constant rate r per g generations. Let N0 be the present number of colonies. Following Slatkin and Hudson (1991), the coalescent time (ti) is given by
To estimate the demographic parameter N0r, we analyzed the amplified fragment length polymorphism (AFLP) data of Miyashita et al. (1999). It is reasonable to use AFLP data for this purpose because the anonymous fragments detected in this analysis presumably cover the whole genome. In Miyashita et al. (1999), a total of 472 distinct fragments were observed for 38 ecotypes sampled from around the world. Their observation of a lack of linkage disequilibrium among the fragments suggests that almost all fragments follow independent genealogical histories.
The AFLP data by Miyashita et al. (1999) consist of 38 ecotypes, including an apparent outlier, ecotype Fl-3. For Fl-3, the observed number of fragments is about twice that of the remaining 37 ecotypes. Fl-3 appears to be an allotetraploid hybrid between A. thaliana and one of its close relatives (N. T. Miyashita and A. Kawabe, personal communication). In the following analysis, Fl-3 is therefore excluded. For the remaining 37 ecotypes, the average proportion of shared fragments,
Assuming that the number of A. thaliana colonies has increased exponentially, we can find an estimate of the demographic parameter, N0r, by computer simulation using the least-squares method. The procedure of simulating AFLPs in a constant-size population was developed by Innan et al. (1999) and Miyashita et al. (1999). The simulation method for an exponentially growing population is similar to that of a constant-size population. First, a random ancestral sequence (genome) with the length M bp is constructed. This ancestral sequence consists of four nucleotides, A, T, G, and C, at equal frequencies. Innan et al. (1999) demonstrated that the effect of GC content on the AFLP polymorphism is very small if intraspecific variation is considered. Next, the sequence is divided into m parts with the length M/m. This means that the whole genome is assumed to have m independent genealogies. Recombination rate, genome size, and the number of chromosomes are related to m (see below). m random genealogies for n sequences are constructed and branch lengths are determined by Equation 2. Following a given genealogy for each part, random mutations are placed on the ancestral sequence. The number of mutations on a branch follows a Poisson distribution with mean μTM/m, where μ is the mutation rate per site per g generations and T is the length of the branch measured in generations. Then, we have n descendant sequences, from which we obtain the AFLP fragments according to the primer sequences. Finally, the lengths of the detected fragments are scored for n descendants.
In our simulations, n = 37 is assumed. The mutational parameter, θ0 = 4N0μ, is adjusted to produce the average proportion of pairwise shared bands,
Simulating 2000 replicates for a given value of N0r, we obtained the expected frequency spectrum of AFLPs. N0r was varied in the range of 0–1.5 with an interval of 0.1, and the expected distribution of the frequency spectrum was obtained for each N0r. Using these expectations, the fit to the observed spectrum (Miyashitaet al. 1999) was investigated by the least-squares method. To estimate N0r, we fitted a fourth-order polynomial to the discrete data points. Using this procedure, the minimum value of the sum of squares was obtained for N0r = 0.57. We also used χ2 to measure the goodness of fit instead of the sum of squares. In this case, the best fit was also found for N0r = 0.57. Figure 1 shows both the observed spectrum of A. thaliana AFLPs and the expected one with N0r = 0.57. The observed frequency spectrum is in excellent agreement with the expectation.
The AFLP frequency spectrum in A. thaliana. Solid boxes show the observed spectrum and open boxes the expected one in an exponentially growing population with N0r = 0.57. N0r was estimated by fitting a fourth-order polynomial to the discrete data points. The minimum point of the curve was obtained for N0r = 0.57.
Using this estimate of N0r = 0.57, we also investigated the properties of Tajima's D (Tajima 1989) and Fu and Li's D* tests (Fu and Li 1993) in an exponentially growing population. In our simulations, neutral genealogies without recombination (Hudson 1983, 1990; Tajima 1983, 1989) were used. Figure 2A shows the distribution of Tajima's D for n = 17. The filled squares represent the distribution of Tajima's D under a constant-population-size model (N0r = 0) and the open squares represent that for an exponentially growing population with N0r = 0.57. It is known that the distribution of Tajima's D under a constant-population-size model approximately follows the beta distribution with mean = 0 and variance = 1 (Tajima 1989). As shown in Figure 2A, the distribution with N0r = 0.57 is skewed toward negative values. The mean is −0.3 and its variance is 0.55, which is smaller than that under a constant-population-size model. Accordingly, it is expected that the confidence limits of the distribution may be very different between these two models. Although the confidence limits for the negative values of Tajima's D are not much different from those of a constant-size model, those for the positive values are much smaller than those under a constant-size model when N0r = 0.57. This implies that Tajima's D does not follow a beta distribution in growing populations. Similar results were obtained for Fu and Li's D* test (Figure 2B). The peak of the distribution for N0r = 0.57 moves toward negative values and the distribution is very different from that in a constant-size model especially for positive values.
The P values for Tajima's D and Fu and Li's D* tests were reexamined for eight nuclear loci in A. thaliana, assuming N0r = 0.57 (Table 1). As expected from Figure 2A, when Tajima's D is <−1.5, the P values with N0r = 0.57 are not much different from those of a constant-size model. For positive Tajima's D, the P value decreases for N0r = 0.57, suggesting that the power of Tajima's D with N0r = 0.57 is stronger than that of a constant-size model. Similarly, when Fu and Li's D* is <−2.0, the P values with N0r = 0.57 are not different from those of a constant-size model. If Fu and Li's D* is positive, the P value with N0r = 0.57 is smaller than that of a constant-size model. In the case of Fu and Li's D* test for the CAL locus, the neutral hypothesis is rejected at the 5% level only when a growing population with N0r = 0.57 is assumed. For two cases, i.e., Tajima's D for PI and Fu and Li's D* for Rpm1, the P value becomes <1% when N0r is changed from 0 to 0.57.
The distributions of Tajima's D and Fu and Li's D* statistics. They were obtained by computer simulation, in which n = 17 was assumed. Since in half of the genes in Table 1, the sample size is 17, simulation results for n = 17 are shown. In the simulation for a constant-size population, 4N0μ is assumed to be 20 so that the expectation of the average number of pairwise differences is 20, which appears to be a typical value (see Table 1). In an exponentially growing population with N0r = 0.57, 4N0μ = 28.6 is assumed. The expectation of the average number of pairwise differences is also 20. (A) The solid squares show the distribution of Tajima's D in a constant-population-size model, and the open squares show the distribution in an exponentially growing population with N0r = 0.57. (B) The distribution of Fu and Li's D* in a constant-population-size model and that in an exponentially growing population with N0r = 0.57.
Summary of Tajima's D and Fu and Li's D* tests
In studies of particular gene regions, an excess of rare polymorphic sites was detected, resulting in significantly large negative values of Tajima's D and Fu and Li's D* (Kawabeet al. 1997; Purugganan and Suddith 1999). As possible causes, an expansion of the A. thaliana population and/or purifying selection against deleterious mutations were proposed. In this study, two neutrality tests, Tajima's D and Fu and Li's D*, were conducted without the effect of population expansion. It was shown that the negative values of both test statistics for AP3, ChiA, and PI were still significant (Table 1). Furthermore, Fu and Li's D* for CAL is significantly negative at the 5% level when N0r = 0.57, although it is not significant in a constant-size model. This result suggests that purifying selection against deleterious mutations is acting in these regions. This is consistent with the observation of a relatively high level of singleton replacement polymorphism in A. thaliana (Kawabeet al. 1997; Purugganan and Suddith 1998, 1999).
In this study, we inferred the rate of population size expansion of A. thaliana from the frequency spectrum of AFLPs obtained by Miyashita et al. (1999). For N0r = 0.57, the observed spectrum is in excellent agreement with the expected one (Figure 1). The reason for this may be that almost all fragments detected as AFLPs are independent of each other, which is consistent with the lack of linkage disequilibrium observed between microsatellite loci (Innanet al. 1997) and AFLP loci (Miyashitaet al. 1999). It may suggest that population surveys using AFLP analysis are an appropriate method to understand the demographic history of a population. Furthermore, using this estimate of population growth, we reevaluated the critical values of two of the most important neutrality tests, Tajima's D and Fu and Li's D*. Discrepancies between constant-size and exponentially growing populations were found for positive values of both test statistics.
Acknowledgments
We thank two reviewers, M. Aguadé and D. Carlini, for valuable comments. A. Kawabe, N. T. Miyashita, and E. A. Stahl provided data sets. This work was supported in part by National Science Foundation grant DEB-989679 and by funds from the University of Rochester.
Footnotes
-
Communicating editor: M. Aguadé
- Received December 10, 1999.
- Accepted April 3, 2000.
- Copyright © 2000 by the Genetics Society of America