Abstract
Estimates of the scaled selection coefficient, γ of Sawyer and Hartl, are shown to be remarkably robust to population subdivision. Estimates of mutation parameters and divergence times, in contrast, are very sensitive to subdivision. These results follow from an analysis of natural selection and genetic drift in the island model of subdivision in the limit of a very large number of subpopulations, or demes. In particular, a diffusion process is shown to hold for the average allele frequency among demes in which the level of subdivision sets the timescale of drift and selection and determines the dynamic equilibrium of allele frequencies among demes. This provides a framework for inference about mutation, selection, divergence, and migration when data are available from a number of unlinked nucleotide sites. The effects of subdivision on parameter estimates depend on the distribution of samples among demes. If samples are taken singly from different demes, the only effect of subdivision is in the rescaling of mutation and divergencetime parameters. If multiple samples are taken from one or more demes, high levels of withindeme relatedness lead to low levels of intraspecies polymorphism and increase the number of fixed differences between samples from two species. If subdivision is ignored, mutation parameters are underestimated and the species divergence time is overestimated, sometimes quite drastically. Estimates of the strength of selection are much less strongly affected and always in a conservative direction.
ONE of the primary goals of population genetics has been to measure and to understand the role of natural selection in shaping variation within and between species. Now that molecular technologies allow genetic variation to be assayed with relative ease, this goal seems within reach. A number of different approaches to studying selection have been proposed (Hudson and Kaplan 1988; Neuhauser and Krone 1997; Yang 1998; Donnellyet al. 2001; Slatkin and Bertorelle 2001), and a multitude of neutrality tests, reviewed by Nielsen (2001), can be applied if appropriate genetic data are gathered. This work considers Sawyer and Hartl’s (1992) method, which belongs to a class of methods that use overall levels of polymorphism and divergence at two or more categories of sites in samples of DNA sequences from a pair of species. Hudson et al. (1987) were the first to propose such a method, in which the categories were different loci, followed by McDonald and Kreitman (1991), who categorized sites within a locus as being either synonymous or nonsynonymous with respect to changes in the amino acid sequence of the protein product. Both methods assumed no intralocus recombination and allowed the hypothesis of strict selective neutrality to be tested. Shortly afterward, by assuming Kimura’s (1969) infinitesites mutation model, i.e., with free recombination between sites, Sawyer and Hartl (1992) showed that McDonaldKrietman test data could be used not only to test neutrality but also to estimate selection, mutation, and divergencetime parameters.
Nielsen (2001) pointed out that McDonaldKreitman and related tests, in which sites can be classified a priori, provide a very powerful framework for inferences about natural selection, in contrast to tests like Tajima’s (1989) and Fu and Li’s (1993), which measure deviations from the highly variable process of neutral coalescence. It is likely that McDonaldKreitman and related methods will become the mainstay of genomic analyses of the role of selection. In two recent works, modified McDonaldKreitman tests were applied to genomic data from Drosophila, suggesting that 45% of the amino acid differences between Drosophila simulans and D. yakuba resulted from positive selection (Smith and EyreWalker 2002) and that positive selection at a relatively small number of genes is responsible for the divergence of D. simulans and D. melanogaster (Fayet al. 2002). Bustamante et al. (2002) used a modified SawyerHartl method to show that Arabidopsis species have experienced a higher proportion of deleterious amino acid substitutions than Drosophila species, in which positive selection is common, and attributed the difference to high levels of inbreeding in Arabidopsis.
An obvious shortcoming of these methods is that they assume the species under study are panmictic, i.e., not geographically or otherwise subdivided. It is well known that this assumption is incorrect for many species (Slatkin 1985). When there is no intralocus recombination, McDonald and Kreitman (1991) point out that shared genealogical history should control for the effects of demography when sites can be categorized a priori. It is less clear that this should be the case when collections of unlinked sites are used to estimate selection, mutation, and divergencetime parameters as in Sawyer and Hartl’s (1992) method. It is possible that the effects of subdivision on the numbers of polymorphisms and fixed differences at synonymous and nonsynonymous sites could lead to errors in inferences. Therefore, the goal of this work is to extend the Poisson random field (PRF) theory of polymorphism and divergence developed by Sawyer and Hartl (1992) to include subdivided species. To do this, it is first shown that in the limit of a large number of subpopulations or demes allelefrequency dynamics at a single locus in a population with islandmodel migration (Wright 1931; Moran 1959; Maruyama 1970; Latter 1973) are governed by a diffusion process that has the same form as the usual WrightFisher diffusion, e.g., see Ewens (1979), but with a timescale different from that of the panmictic case. Then, the assumption of free recombination between sites allows the PRF model to be used to predict the patterns of variation in samples from a pair of islandmodel species.
The diffusion result is obtained using Theorem 3.3 in Ethier and Nagylaki (1980) and relies upon the fact that the process of migration and drift within subpopulations occurs on a much faster timescale than changes in allele frequency by drift and selection in the total population. The result thus depends on a stochastic equilibrium of allele frequencies within demes with respect to migration and drift, which is also described. This follows some recent work (Cherry and Wakeley 2003) in which simulations supported the existence of such a diffusion under the additional assumption that demes are very large and migration rates correspondingly small. The present analysis shows that this additional assumption is unneccessary. The assumption of infinite deme sizes and infinitesimal migration rates was also made in the recent coalescent work on neutral largenumberofdemes models (Wakeley 1998, 2001), and it is made below in The expected number of neutral segregating sites, when the forward and backward results are compared. Otherwise, here it is assumed that the demes are finite in size and the migration rates are unconstrained.
This work makes a connection between the PRF theory and work on the robustness of the coalescent process to population structure (Nordborg 1997; Möhle 1998), in particular for the case of geographic structure (Wakeley 1998, 2001). The two are related by showing that the effective size of the ancestral, coalescent process is the same as that of the forwardtime diffusion of allele frequencies and that the forward and backwardderived predictions for the expected number of segregating sites in a sample are the same under neutrality. We expect such connections between forward and backward approaches to exist, a fact that is well established in the case of panmictic populations (Ewens 1990; Möhle 2001). Like the forward (Nagylaki 1980) and backward (Notohara 1993) strongmigration limits, these results and those of Wakeley (1998, 2001) for the coalescent process are based on a “separation of timescales.” In this case, the fast processes are migration and drift within demes and the slow process is drift and possibly selection in the total population, which is mediated by migration. The effective size of the population is rescaled and patterns of genetic variation depend on how a sample is distributed among demes. In contrast, under the usual strongmigration limit, the only effect of structure is to rescale the effective size of the population (Nagylaki 1980; Notohara 1993; Nordborg 1997; Charlesworth 2001).
The main result presented here, besides the existence of the diffusion (9) below, is that, if mutations are introduced at a constant rate per generation and sites segregate independently of one another, the PRF results of Sawyer and Hartl (1992) can be applied, but with a correction that depends on how samples are taken among demes. If each sample is taken from a different deme, then Sawyer and Hartl’s (1992) results apply directly, but with slightly different mutation and divergencetime parameters. If some or all of the samples come from the same deme, the PRF results must be corrected for the effect of drift and migration within demes. Failure to recognize this can cause serious errors in the estimation of mutation rates and divergence times, but not, surprisingly, of selection coefficients.
THEORY
A population or species is assumed to be subdivided into D demes of equal size N. The organisms are assumed to be haploid, but the results will hold for diploid organisms if N is replaced with 2N, if selection is additive, and if migration is gametic. The island model of migration (Wright 1931; Moran 1959) is assumed: a fraction m of each deme is replaced by migrants every generation and all migrants are randomly sampled from a migrant pool to which all demes contribute equally. In each generation, migration occurs first, followed by selection, and then resampling (drift) within demes according to the WrightFisher model (Fisher 1930; Wright 1931). In the next two sections, two alleles are assumed to be segregating at a single locus, and Many independently segregating loci considers their introduction by mutation. The wildtype or nonmutant allele has relative fitness equal to 1, and the mutant allele has fitness 1 + s_{D}, where s_{D} ≥1. The next section establishes the diffusion approximation for the frequency of the mutant allele as D → ∞, but Ds_{D} remains finite. The migration rate can vary between 0 and 1 (0 < m ≤ 1) and N is assumed to be finite. This is in contrast to the usual assumption that Nm is finite as N goes to infinity.
Considering the number of mutants within each deme, it is apparent that there are exactly N + 1 kinds of demes. Each deme that begins a generation with i copies of the mutant will have mutant frequency
Limiting allele frequency dynamics at a single locus: Let
The nature of the diffusion approximation (9) below is that the migration and drift within demes equilibrate quickly in comparison to the rate of drift and selection in the total population. The results show that, to a sufficient order of approximation, demes can be considered to always be at a stochastic equilibrium ν_{j} (0 ≤ j ≤ N) with respect to migration and drift for a given x. The fraction of demes that have j copies of the mutant converges on
The form of Equation 5 was obtained by selecting parameters of a hypergeometric distribution that gave the same mean and variance of allele counts among demes as Equation 4, namely
appendix a shows that, in the limit as D goes to infinity, the change in x by drift and selection is so much slower than that by migration and drift within demes that the collection of demes is always at the equilibrium ν_{i}, which depends on N and m, and of course x. By Theorem 3.3 of Ethier and Nagylaki (1980), as D goes to infinity the above system reduces to a diffusion x(·) with generator
Cherry and Wakeley (2003) assumed (8) to hold and showed that simulations agreed well with the predictions of the implied diffusion process, such as the time to fixation or loss of the mutant type. Without giving a proof, we can guess that this diffusion should be given by the results of the section above and appendix a if N_{D} → ∞ when D → ∞ and lim_{D}_{→∞}2N_{D}m_{D} = M, so that F = 1/(M + 1), but with lim_{D}_{→∞}N_{D}/D = 0 (Ethier and Nagylaki 1980). Cherry and Wakeley (2003) also showed that the distribution of allele frequencies among demes in simulations with N = 100 and m = 0.01 (and D = 1000 and s_{D} = 0.001) conformed well to the predictions of Equation 8 in a particular generation when x was equal to 0.611. Further support for the existence of this largeD, largeN diffusion is given in The expected number of neutral segregating sites by comparing its predictions under neutrality to those of the corresponding coalescent model (Wakeley 1998). Otherwise, N is assumed here to be finite.
Many independently segregating loci: If we posit an infinite number of loci, i.e., nucleotide sites, which can sustain mutations and which each evolve according to the diffusion of the previous section independently, then the PRF results of Sawyer and Hartl (1992) hold for x. Because of the way time is measured in the diffusion, the appropriate mutation parameters are also scaled:
Rewriting Sawyer and Hartl’s (1992) Equations 13 and 14 in terms of the present notation gives
Assume that we have taken a random sample of n sequences from d different demes in one of the species, such that n_{1}, n_{2},..., n_{d} are the sample sizes from each deme. We can write in general that the expected number of sites that show i_{1}, i_{2},..., i_{d} copies of the mutant base in the sample (0 ≤ i_{k} ≤ n_{k}) is given by
Because we have assumed an infinite number of independently segregating sites with collective mutation rates given by (10), the PRF model (Sawyer and Hartl 1992) shows that S_{j}(i_{1},..., i_{d}) is Poisson distributed with expected value equal to (15). The numbers of sites segregating at various frequencies within each deme contain information about migration rates, and the numbers of sites segregating at various frequencies in the total population contain information about the selection coefficient. Note that (15) can also be used to compute the expected number of apparent fixed differences, i.e., polymorphisms where the entire sample has the mutant base, as required in Sawyer and Hartl’s (1992) analysis. This provides a framework for estimating selection coefficients (and migration rates) in the context of a subdivided population. As illustrated in results, we use Equations 11, 12, 13, 14 in conjunction with Equation 15 to obtain predictions about the numbers of fixedsynonymous, fixedreplacement, polymorphicsynonymous, and polymorphicreplacement sites in a sample from two species. Further, Equation 15 gives the joint frequencies among demes of segregating polymorphisms. In the panmictic case, Hartl et al. (1994), Akashi (1999), and Bustamante et al. (2001) showed that allele frequencies at polymorphic sites contain substantial information about selection.
RESULTS
The first result to note is that if each sample is taken from a different deme, the methods of Sawyer and Hartl (1992) can be applied wihout modification. It is necessary only to realize that the inferred mutation parameters and the divergence time are scaled in terms of ND/(1  F) generations instead of the usual ND generations. This result follows from the fact that each sample drawn in this way has probability x of showing the mutant base. That is, h(1x, 1) = x and h(0x, 1) = 1  x, and similarly for h^{*}(i_{k}x, n_{k}). Summing Equation 15, for each species, over all i_{1}, i_{2},..., i_{d} such that
Inferences from singledeme samples: At the opposite extreme, consider the case in which all samples are drawn from the same deme within each species. Note that we assume, as in Sawyer and Hartl (1992), that the two species are identical (here in terms of N, m, and γ). Let n_{1} and n_{2} denote the sample sizes from the two species. For this sample, the expected numbers of fixedsynonymous (K_{s}), fixedreplacement (K_{a}), polymorphicsynonymous (S_{s}), and polymorphicreplacement (S_{a}) sites are given by
Figure 2 plots the expected values of K_{s}, K_{a}, S_{s}, and S_{a} as functions of the migration rate when n_{1} = n_{2} = 10 and N = 100 and for three different values of γ: 2, 0, and 2. The results are as expected for singledeme samples. When m = 1, they are the same as in a panmictic population. As m decreases, samples from single demes tend to be closely related, so the numbers of polymorphisms will decrease and the numbers of (apparent) fixation events will increase. This is true regardless of whether γ is positive, zero, or negative, although the relative magnitudes of the four quantities depend strongly on γ. The curves for E(K_{s}) and E(S_{s}) are, of course, identical for all values of γ. The results that would be obtained by assuming lim_{N}_{→∞}2Nm = M and using Equations 8 and 17 would be similar to what is shown in Figure 2 if M were varied from 0.02 to 200.
To understand the effects of (islandmodel) population subdivision for the extreme case of singledeme samples, we can use the “data” of Figure 2 to fit the parameters of Sawyer and Hartl’s (1992) panmictic model. Figure 3 shows that estimates of γ are remarkably robust to subdivision, even in this case, where the effects of subdivision should be strongest. Again, if samples were taken singly from different demes, there would be no error in using the panmictic model. For singledeme samples there is some error when the migration rate is low, but even in the extreme case of m = 10^{4} (2Nm = 0.02) the estimates are off only by ∼25%. However, the level of error will be greater for larger samples (see discussion) and when the absolute value of γ is larger. An additional effect is that the error in estimating γ is conservative in that the bias is toward neutrality regardless of whether γ is positive or negative. Figure 4 shows the effect on the other parameters: t_{div}, θ_{s}, and θ_{a}. As should be expected from Figure 2, mutation rates are underestimated and the divergence time is overestimated when the migration rate is small. The error in estimating these other parameters is much more extreme than that for γ. In addition, there is a small effect of γ on estimates of θ_{a}.
The expected number of neutral segregating sites: Under neutrality, the results presented here agree with those found using a coalescent approach in Wakeley (1998), and later in Wakeley (1999, 2001), which were derived under the assumption that lim_{N}_{→∞}2Nm = M. We make the same assumption here and further assume that this occurs in such a way that the diffusion result still holds (see Limiting allele frequency dynamics at a single locus). Then we can use g(x_{k}x) and h^{*}(i_{k}x, n_{k}) in expression (15) to show that the expected number of synonymous segregating sites is equal to
Consider the number of segregating sites in a sample of n sequences, all from the same deme. From the coalescent approach we have
DISCUSSION
The results presented above can be understood in terms of a samplesize effect of subdivision, one that depends on how the sample is distributed among demes. In the limit of a large number of demes, the history of a sample under neutrality has two distinct phases: the scattering phase and the collecting phase described in Wakeley (1999). Although in this analysis incorporating selection was not phrased in these terms, it is clear from Figure 2 that the same effect is at work, namely, that a scattering phase, which is a stochastic sample size adjustment that begins with a sample of size n and ends with n′ lineages each in a separate deme, where 1 ≤ n′≤ n (Wakeley 1999), induces a downward samplesize adjustment to singledeme samples. In the case of large N and correspondingly small m, the scattering phase for a sample from a single deme is given by P[n′n] = S_{1}(n, n′)M^{n}^{′}/M_{(}_{n}_{)}, which appears in Equation 23. Figure 5 shows how the expected values of K_{s}, K_{a}, S_{s}, and S_{a} depend on n_{1} = n_{2} under panmixia with γ= 2. Thus, the values on the righthand side of Figure 5 are identical to those on the righthand side of Figure 2a. Although scales of the horizontal axes are not the same, the effect of smaller migration rate is qualitatively similar to that of smaller sample size. The reason that the values on the lefthand sides of the two panels are different is that the average value of n′ at the left in Figure 2a is equal to 1.06, which is considerably smaller than the practical lower limit of 2 in Figure 5. Instead, the values on the lefthand side of Figure 5 can be compared to those in Figure 2a for log_{10}(m) =2.67, or m = 0.00215, which (with N = 100) gives E[n′] ≈ 2.
This work shows that inferences about natural selection made from DNA polymorphism and divergence data are robust to population subdivision (Figure 3) as long as the migration rate is not too low. This is remarkable in view of the strong effects subdivision has on numbers of polymorphisms, shown in Figure 2, but is understandable in terms of the effect of subdivision on θ_{s}, θ_{a}, and t_{div}. Except for the weak dependence of θ_{a} estimates on γ (Figure 4), subdivision and migration act equally on selected and neutral variation. In both cases, fixation events are overestimated and polymorphisms underestimated when the migration rate is small. This causes mutation rates to be substantially underestimated and divergence times grossly overestimated if subdivision is ignored, but these effects compensate one another and allow relatively accurate estimates of selection even if subdivision is ignored. Often γ will be the focus of study, but if θ_{s}, θ_{a}, and t_{div} are also of interest, it would be useful to have a framework for simultaneous inferences about migration rates, selection coefficients, and these other parameters. The theory presented above is a first step toward this goal.
It is important to note that inferences about natural selection made from allele frequencies at polymorphic sites will be robust to subdivision only in the case of samples taken singly from different demes. Otherwise, the distribution of samples among demes will cause some frequency classes to be overrepresented, resulting in biased inferences. Even when all the samples are taken from the same deme, restricted migration can mimic the effect of positive γ on allele frequencies (Wakeley and Aliacar 2001). While allele frequencies at polymorphic sites provide an additional source of information about natural selection (Hartlet al. 1994), this illustrates that they are also greatly influenced by nonselective demographic factors; see also Nielsen (2001). In addition, allele frequency patterns are quite sensitive to levels of recombination (Bustamanteet al. 2001). Thus, it is especially important to account for subdivision when making inferences from allelefrequency data.
APPENDIX A
Again,
Now let
The derivation of Equations A5, A6, A7, A8, A9 follows from Equations A2, A3, A4. For Equation A5 we have
For Equation A6 we have
For equation A7 we have
In the derivations of (A8) and (A9) below I assume that the exact solution of (4) is sufficiently close to (5) that the latter can be used in place of the exact solution. More precisely, I assume that
For Equation A8 we have
For Equation A9 we have
This completes the derivation of (A5A9), showing that Theorem 3.3 in Ethier and Nagylaki (1980) can be applied and that the diffusion x(·) with generator (9) in the text holds as D goes to infinity.
APPENDIX B
Beginning with Equation 24, and then putting in g(xx) and ϕ_{s}(x), we have
Acknowledgments
I thank Dan Hartl, Thomas Nagylaki, Stanley Sawyer, and Clifford Taubes for helpful discussions of the work. I am also grateful to Sabin Lessard for seeing that deme sizes need not be large for the largenumberofdemes coalescent to hold. This work was supported by grants DEB9815367 and DEB0133760 from the National Science Foundation.
Footnotes

Communicating editor: N. Takahata
 Received August 6, 2002.
 Accepted October 14, 2002.
 Copyright © 2003 by the Genetics Society of America