Abstract
Population subdivision complicates analysis of molecular variation. Even if neutrality is assumed, three evolutionary forces need to be considered: migration, mutation, and drift. Simplification can be achieved by assuming that the process of migration among and drift within subpopulations is occurring fast compared to mutation and drift in the entire population. This allows a twostep approach in the analysis: (i) analysis of population subdivision and (ii) analysis of molecular variation in the migrant pool. We model population subdivision using an infinite island model, where we allow the migration/drift parameter 0398; to vary among populations. Thus, central and peripheral populations can be differentiated. For inference of 0398;, we use a coalescence approach, implemented via a Markov chain Monte Carlo (MCMC) integration method that allows estimation of allele frequencies in the migrant pool. The second step of this approach (analysis of molecular variation in the migrant pool) uses the estimated allele frequencies in the migrant pool for the study of molecular variation. We apply this method to a Drosophila ananassae sequence data set. We find little indication of isolation by distance, but large differences in the migration parameter among populations. The population as a whole seems to be expanding. A population from Bogor (Java, Indonesia) shows the highest variation and seems closest to the species center.
POPULATION subdivision is centrally important for evolution and affects estimation of all evolutionary parameters from natural and domestic populations. Even if just a neutral model is considered (i.e., selection is ignored), analysis of molecular variation in subdivided populations requires modeling of at least three forces: mutation, migration, and drift. Straightforward incorporation of these three forces into a comprehensive model leads to formidable complexity and necessitates computerintensive numerical integration schemes (Beerli and Felsenstein 1999, 2001). If the number of subpopulations (demes) is large, however, the process of migration among and drift within subpopulations (“scattering phase”) is relatively fast, while the process of coalescence in the entire population (“collecting phase”) is relatively slow (Wakeley 2001). Then, time is too short for mutations to happen in the scattering phase; instead, almost all mutations will occur in the collecting phase.
Because of its importance, many approaches to measuring population subdivision have been put forward. Among them, the infinite island model and the complementary Fstatistics approach (Wright 1931, 1969) are the earliest and most influential. Rousset (2001) and Excoffier (2001) introduce the different approaches, clarify their historic development, and discuss their relative merits. We therefore limit our introduction and discussion to what is relevant to our approach.
Since the seminal work of Wright (1931), population subdivision has often been modeled with the infinite island model. In randomly mating species, usually a single parameter describing the correlation of two random gametes within a population is estimated. This parameter was labeled F_{st} (Wright 1969) or Θ (Weir and Cockerham 1984). It is related to the migration parameter γ= 4N_{e}m (or 3N_{e}m for Xlinked variation) via the equation Θ= 1/(1 +γ) (see Table 1 for abbreviations of key mathematical symbols). In the original model, it is assumed that all populations share the same migration parameter γ. But, obviously, populations differ generally in their sizes or migration rates or both. Wakeley (2001) relaxed these assumptions and showed that, even if a rather general migration model is assumed, the scattering phase is short compared to the collecting phase as long as the number of demes is large.
In many species with subdivided populations, the central or original populations show higher variability than the peripheral populations. This is the case for the only slightly subdivided populations of humans (Rosenberget al. 2002), Drosophila melanogaster (David and Capy 1988), and D. simulans (Hamblin and Veuille 1999), which have their population centers in subSaharan Africa and spread over all continents. In many crop species, most variation among cultivars can be found where the species has originally been domesticated, e.g., for maize in Mexico (Piperno and Flannery 2001) and for potato in the Andes (Hawkes 1990).
D. ananassae exhibits more population structure than both D. melanogaster and D. simulans. It exists in many semiisolated populations around the equator, particularly in mainland Southeast Asia and on the islands of the Pacific Ocean (Tobari 1993). D. ananassae is thought to have originated in Southeastern Asia where most of its relatives occur (Dobzhansky and Dreyfus 1943; McEveyet al. 1987). Due to its extensive population structure, D. ananassae can be used to analyze the effect of population subdivision on genetic variation. Past molecular analyses of the effect of population subdivision on variation are limited to few loci and populations (Stephan 1989; Stephan and Langley 1989; Stephan and Mitchell 1992; Stephanet al. 1998; Chenet al. 2000). Herein, we use data from nine nuclear DNA fragments in eight populations of Southeast Asia and Australia to improve understanding of population subdivision and molecular genetic variation in this species.
Current methods for inferring population differentiation with only a single parameter cannot capture variability among populations. With coalescencebased approaches (Beerli and Felsenstein 1999, 2001; Wakeley 1999), however, this can be accomplished. These approaches generally are quite complex such that the powerful Markov chain Monte Carlo (MCMC) method or the related importancesampling method are used. Both these methods can be combined and approximate the full likelihood or posterior probability extremely well. With a Bayesian approach, additional complications can also be incorporated relatively easily, e.g., missing data, unreliable scoring of markers, and mutation. As implemented until now, however, the cost of this flexibility is an extraordinary computational load (e.g., Beerli and Felsenstein 1999, 2001; Wakeley 1999), even if data sets are only moderately sized. Hence, one would wish for a method fast enough to allow for analysis of large data sets with only moderate computational requirements, but flexible enough to allow for inference of different migration parameters for different populations, among other parameters. Then, a species differentiation into central, genetically variable and peripheral, genetically impoverished populations could be detected. Furthermore, the method should be able to deal with molecular sequence data.
Herein, we follow Wakeley (1998, 1999, 2001) and assume that the process of migration among and drift within subpopulations is relatively fast compared to the process of mutation and drift in the entire population. This permits the use of a twostep method for the analysis of molecular sequence variation in organisms with population subdivision. In the first step, we use the infinite island model to account for migration among and drift within subpopulations (scattering phase) to estimate the migration/drift parameter γ [or conversely the differentiation parameter Θ= 1/(γ+ 1)] for each population separately and to estimate contributions to the pool of migrants. In the second step, we analyze variation within the pool of migrants (collecting phase), using traditional statistics for molecular sequence variation (Watterson 1975; Tajima 1989). With these statistics, global expansion or shrinkage of the entire population over longer time spans can be detected as in the case of a single panmictic population.
Theory for analysis of molecular sequence data (e.g., Watterson 1975; Tajima 1983, 1989) is based on sample allele counts, i.e., integral frequencies. On the other hand, the usual Kallele infinite island model, which we employ for the scattering phase, is parameterized using allelic proportions in the migrants, i.e., “probabilities” of alleles. To overcome this mismatch (counts vs. proportions), we impute the number of alleles that reach the migrant pool by discounting those alleles that coalesce within the population. The allele frequencies in the migrant pool can then be subject to an analysis of molecular variation using the methods cited above or similar ones. In contrast to Wakeley (1999), our method is computationally simple and can handle almost arbitrarily large data sets. The cost of this simplicity is the use of the inexact Kallele model for the scattering phase. We evaluate this approximation with computer simulations. The method is applied to a data set of nine loci from eight D. ananassae populations.
MATERIALS AND METHODS
Data collection: Population samples: A total of 69 isofemale lines of D. ananassae were sampled from the following eight locations (sample size in parentheses): Kakadu (3) and Darwin (6), Australia; Bogor (10), Java, Indonesia; Mandalay (10), Burma; Kathmandu (10), Nepal; and Bhubaneswar (10), Puri (10), and Chennai (10), India (Figure 1). Details on the collections of these samples will be reported elsewhere.
Identification of DNA fragments: The data set consists of DNA sequence data from nine loci. Since DNA sequence information is limited in D. ananassae, we used genomic information from D. melanogaster to identify marker loci on the X chromosome of D. ananassae. (D. melanogaster is the only closely related species with a completely available genome sequence.) The forward and reverse primers for each fragment were designed (and cordially provided by David de Lorenzo) in adjacent exons of a randomly chosen gene to amplify the intron flanked by these exons. More than 200 fragments of the above specifications (sizes varying from 400 to 1000 bp in the D. melanogaster database) were identified and tested for amplification by PCR with genomic DNA from D. ananassae; <10% of the primer pairs actually amplified DNA fragments. For each of the successfully amplified fragments, the DNA sequence was aligned with the corresponding fragment of D. melanogaster. A fragment was used for the population genetic analysis if sequence homology to the exons of D. melanogaster at both ends of the fragment was observed and if the fragment length was 300600 bp. Eight such fragments were identified and used in this study. An intron fragment of the Om(1D) (Stephanet al. 1998) gene located in a normal recombination region of the X chromosome of D. ananassae was used as the ninth locus.
DNA sequencing: A single male from each isofemale line was used for genomic DNA preparation using the PUREGENE DNA isolation kit (Gentra Systems, Minneapolis) and then purified with EXOSAPIT (USB Corporation, Cleveland). Both DNA strands were sequenced on an automated DNA sequencer (Megabace1000; Amersham Biosciences, Buckinghamshire, UK). Sequences were edited with SeqMan and aligned with MegAlign (DNAStar, Madison, WI). Manual alignment was used when required. Insertiondeletion polymorphisms were removed from sequences after alignment and thus not considered for the analyses.
Analysis of migration and drift among populations: The main culprit for the slow computation speed of some coalescencebased approaches is the complex migration model used (Beerli and Felsenstein 1999, 2001). By contrast, the infinite island model allows for much easier computation. Furthermore, Wakeley (2001) could show that this model agrees with quite general migration models if the number of subpopulations is large. Hence, we adopt an infinite island model with haploid individuals as the basic migration model, but use all the flexibility it can offer; i.e., we allow for populationspecific migration parameters (Figure 2A). Following Weir and Cockerham (1984), we label this parameter Θ_{i}. For comparison, we also present simulation results using a splitting population model (Figure 2B). We define Θ_{i} as the probability that, looking back in time, two random alleles coalesce within the ith population before any of them migrates into the allele pool (Figure 3A). If applied to subdivided populations, the last part of this definition needs to be replaced by “before the split from the ancestral population” (Figure 3B).
The Weir and Cockerham (1984) approach is analogous to an ANOVA randomeffects model. With a randomeffects model, a hyperparameter—usually the variance of effects—is used to combine information from individual effects. Generally, only the variance of effects is interesting, while estimates of the individual effects are ignored. An example is the inference of the additive genetic variance in the standard quantitative genetics model, where the individual breeding values are usually ignored. Although we are interested in the individual migration parameters, we nevertheless follow the randomeffects model by assuming that the Θ_{i} are drawn from a distribution. The advantage is that even if each population is rather uninformative, quite accurate estimates of the means and variances of the Θ_{i} over populations can be obtained. We chose an independent βdistribution for each Θ_{i} with parameters α and β. The βdistribution is flexible and can assume shapes from a wide “U” to a narrow bell, yet is controlled by only two parameters. We then have
We note that the standard Weir and Cockerham model follows in the limit of α → ∞ and β → ∞ with α/(α+β) = const. =Θ.
For further development of the model, we assume that loci are unlinked and therefore conditionally independent of each other, yet all have the same Θ_{i} for each population. The latter assumption could be relaxed easily (Balding and Nichols 1995, 1997). Our method assumes that the time for coalescence within the population is so short that no new mutations occur (for justification of this assumption, see Wakeley 2001). We differentiate between the migrant pool and the I populations and assume a Kallele prior for allelic proportions in the migrant gene pool during the scattering phase. This is strictly justified only if the mutations in the collecting phase follow a Kallele model. With other mutation models (e.g., the infinite site model) Wakeley (1999, 1998) offers an exact method, where the distribution of noncoalescing migrants is enumerated. Unfortunately, this method is computationally cumbersome, such that only small data sets can be analyzed. The justification for our simpler method is that, for larger data sets, influence of the prior generally becomes negligible. (We could not detect a compromise in our computer simulations with rather moderately sized data sets; see below.)
The scattering phase is handled as follows. Conditional on the allelic proportions in the migrant pool and the hyperparameters α and β defined above, populations are independent of each other. Conditional on Θ_{i}, coalescence is independent among loci. Hence, consideration of a single locus in a single population suffices for illustration, as all other loci and populations are conditionally independent. It can be shown that the number of coalescences within allelic class k is independent of other allelic classes and depends only on the migration parameter γ_{i}, the allelic proportion ρ_{k}, and the current frequency in the sample n_{k} (see appendix): within allelic class k the decision between migration and coalescences can be made recursively. Suppose that there are currently n_{k} individuals in class k; then the probability of the next event being a migration event is γ_{i}ρ_{k}/(γ_{i}ρ_{k} + n_{k}  1) (appendix, Equation A6) and the probability of it being a coalescence event is its complement. Then n_{k} is reduced by one and the procedure repeated until only one is left that is a migrant by default.
Relevant parameters for calculation of allelic proportions in the migrant pool are the allele frequencies for each locus. These must be estimated from only those alleles within the I populations that originated directly from the migrant pool, whereas alleles that arose through sampling within populations (or, looking backward, through coalescence within a population) must be ignored. Otherwise, estimation of allelic proportions in the migrant pool is straightforward.
For integration, we employ a MCMC approach (Gelmanet al. 1995), where we alternate cyclically between updating the parameters: conditionally on the current Θ_{i} and the vector of inferred allelic proportions in the migrant pool {ρ_{1},..., ρ_{K}}, we update the coalescence history for each population and locus; conditionally on the coalescence histories for each population, we update the allelic proportions in the migrant pool {ρ_{1},..., ρ_{K}} for each locus; conditionally on the allelic proportions in the migrant pool {ρ_{1},..., ρ_{K}}, the observed allele frequencies {n_{1},..., n_{K}}, and the two hyperparameters α and β, we update Θ_{i} for each population; and, finally, conditionally on the Θ_{i}’s we update the hyperparameters α and β. This updating scheme converges relatively quickly such that, with reasonable starting conditions, good approximations to the posterior distribution may already be obtained after a “burnin” period of ∼1000 iterations.
For validation of our method and the computer program, we simulated data that were sampled from five populations distributed with varying Θ_{i} (Table 2). The model parameters (F_{st} or Θ_{i}) can be tuned such that probabilities of coalescence within populations are identical to first order for the infinite island model and the splitting population model. Parameters were five populations with Θ_{i} from 0.1 to 0.9 and 20 loci. With the infinite island model, we observed a bias toward lower values; with the splitting population model the bias reverses for higher Θ_{i}. Despite the bias, our estimator definitely improved on the Weir and Cockerham estimator.
Analysis of molecular variation within the migrant pool: The MCMC procedure described above provides, for each iteration, a sample of alleles that make it into the migrant pool. The collection of such samples over a run approximates the posterior frequencies of alleles in the migrant pool. Hence, we determine statistics describing molecular variation, the Watterson (1975) θ_{w} (not to be mistaken with the population differentiation parameter Θ above), π (Tajima 1983), and D (Tajima 1989), at each 100th cycle. We also determine the average over loci, the posterior variance, and the quantiles.
The variance of the estimators in the posterior distribution is only part of the total variance. In addition, there is the familiar variation of the estimates due to coalescence, drift, and sampling within the entire population (Tajima 1983, 1989). In the case of D, this variance is normalized to one under equilibrium assumptions. The two sampling processes, within populations and within the migrant pool, are assumed to be independent. Hence, the total variance is the sum of the two variances.
Strictly, the Bayesian approach we adopt does not allow for calculation of confidence intervals or significance. We follow conventional statistics, however, and declare results significant if the 0.95 posterior intervals do not overlap the expected value or if two 0.95 posterior intervals do not overlap.
For validation of the method, we performed computer simulations assuming both the infinite island and the splitting population model. Parameters were 10 populations, 20 loci with a finite site mutation model with 500 sites each, a mutation rate of 7.5 × 10^{6}, and a population size of 10^{4} in the migrant pool or the ancestral population (for the infinite island model and the splitting population model, respectively). Note that, because the model is a finite site model rather than the more usual infinite site model, the expectation of π is ∼0.0013 (and thus slightly <0.0015) and the expectation for Tajima’s (1989) D is slightly negative (∼ 0.1). From this ancestral population/gene pool, 10 subpopulations were sampled with (A) Θ_{i} = 0.1, (B) Θ_{i} = 0.2, and (C) Θ_{i} = 0.5, using both the infinite island model and the splitting population model.
In Table 3, we find that the estimated Θ_{i} is biased to low values in the infinite island model and, generally, toward high values in the splitting population model, similar to the simulations with unequal population sizes above. As expected, averages across populations of the estimators of population variation π and θ_{w} are lower than the true values in the ancestral population/gene pool and D is positive. This is expected as drift reduces molecular variation, affecting θ_{w} more than π. Our new estimators of π, θ_{w}, and D in the migrant pool are very close to the true values. All these results hold true for both the infinite island and the splitting population model. In fact, we find few differences between these two models.
DATA ANALYSIS
We analyzed a sample of eight D. ananassae populations from Asia and Australia (Figure 1), where nine loci were sequenced in up to 10 isofemale lines per population. Pairwise F_{st} values for the eight populations (Table 4) show values between 0.04 and 0.19. The two Australian populations (Kakadu and Darwin) are most closely related; otherwise there is little evidence for isolation by distance. In particular, the two Indian populations of Bhubaneswar and Puri have an intermediate F_{st} of 0.09, although they are separated by only 50 km. The Weir and Cockerham (1984) estimate for the whole data set is 0.09.
Descriptive statistics of molecular variation (π, θ_{w}, and D) are given in Table 5. Populations with the highest molecular variation are Kakadu (Australia) and Bogor (Indonesia) with θ_{w} on average ∼0.013. The Kakadu population has the smallest sample size and, accordingly, the highest variation among loci. Hence, the very high values for loci 1 and 2 may be statistical flukes. The overall high values for Bogor, on the other hand, seem quite trustworthy. The other Australian population (Darwin) and the Indian populations have smaller levels of molecular variation of θ_{w} ≈ 0.005. The Dstatistic is mostly neutral, but Kathmandu (Nepal), Mandalay (Burma), and Bogor (Indonesia) have increasingly negative D’s.
With our twostep analysis, we performed a single MCMC run, where 10^{5} iterations were sampled after a burnin period of 10^{4} iterations. For each population, the mean Θ_{i} and quantiles were calculated from the approximate posterior distribution (A5). Estimates of Θ_{i} vary a lot among populations (see Table 6). The Bogor population is the most variable and least differentiated from the migrant pool, with a Θ_{i} of ∼0.02, whereas, at the opposite end, the population from Chennai in Southern India has a Θ_{i} of ∼0.25. Thus, the Bogor population seems closest to the species center and the Chennai population most peripheral. The average Θ_{i} is ∼0.11, slightly higher than the estimate according to Weir and Cockerham (1984).
The molecular variation among migrants (Table 7) is higher than that in any of the populations. This is to be expected, as our method restores the variation that is reduced due to drift within populations. The large differences between π and θ_{w} are unexpected, though: on average, values for θ_{w} are much higher than those for π. Correspondingly, Tajima’s (1989) Dstatistic is very negative, on average 1.40. While only one individual Dvalue is statistically significantly negative, there is a clear trend toward negative values and the mean Dvalue is significantly negative.
DISCUSSION
Population subdivision is centrally important to evolution. Unfortunately, it complicates analysis of molecular variation. Herein, we follow Wakeley (1998, 2001) and assume a twostep model: the first step is modeling of migration among and drift within subpopulations (scattering phase), using the infinite island model; the second is analysis of mutation and drift in the entire population (collecting phase), using traditional statistics for molecular sequence variation (Watterson 1975; Tajima 1989).
Our method of modeling the process of migration among and drift within subpopulations with the infinite island model is similar to those developed by Balding and Nichols (1995, 1997) and Nicholson et al. (2002). But we allow each population to have its own specific differentiation parameter Θ_{i} and assume that the Θ_{i}’s are drawn from a common distribution. [This is similar to the randomeffects model of Weir and Cockerham (1984).] In contrast to the method developed by Wakeley (1999), our approach is exact for only a Kallele mutation model, but leads to relatively simple formulas for conditional probabilities of key parameters. We did not note any compromise in performance in computer simulations.Furthermore, our approach is fast (inference takes only minutes with quite large data sets on a personal computer) and can handle almost arbitrarily sized data sets.
We use a MCMC scheme, mainly Gibbs sampling, for integration. We validate the model with computer simulations. If the infinite island model is used for simulations, the statistical analysis performs very well as expected. But even if a different model with splitting populations instead of the infinite island model is used for simulations, estimates of Θ_{i} are quite accurate. Our analysis therefore seems quite robust to violations of assumptions as long as the population differentiation parameters Θ_{i} remain the same. In the second step of our analysis, we analyze the allele spectrum within the migrant population, using traditional estimates of population variation and deviation from mutationdrift equilibrium (Watterson 1975; Tajima 1989). As with the migration/drift parameters, mutation/drift parameters used for simulation are recovered in our analysis. The migration/drift process increases variability in the estimates of molecular variation, but does not appear to cause bias.
The process of estimating statistics of molecular variation in the migrant pool/ancestral population described above can be thought of as a way of removing the effect of drift within subpopulations. The statistics therefore can be used exactly like those from a sample of an undivided population. A negative D, for example, may indicate population expansion.
With this method, we analyzed molecular variation from eight populations of D. ananassae from South Asia, Southeast Asia, and Australia. In tropical and subtropical regions of the world, D. ananassae is one of the most common Drosophila species, especially in and around human habitations. Although populations are separated by major geographical barriers such as mountains and oceans, recurrent transportation by human activity may lead to gene exchange. In spite of this, however, earlier studies with molecular genetic markers detected significant population subdivision with a different method (Stephan 1989; Stephan and Langley 1989; Stephan and Mitchell 1992; Stephanet al. 1998; Chenet al. 2000). This was confirmed in this study: the average population differentiation parameter Θ was ∼0.1. Migration parameters vary a lot among populations from a very low value of 0.02 in Bogor (Java, Indonesia) to values an order of magnitude higher in some Australian and Indian populations. These estimates suggest that the Bogor population is close to the species center, while the Australian and Indian populations are peripheral. Furthermore, we observe little isolation by distance in this data set: the Indian populations of Puri and Bhubaneswar are separated by only 50 km, yet share as much genetic variation as two random populations. At ranges different from those covered in this data set, isolation by distance may still be observed in D. ananassae, and appropriate adjustments need to be made in the method.
As with population differentiation, levels of molecular variation are quite variable among populations: the Bogor population is more than twice as variable as some Indian populations. In light of the previous analyses, the Indian populations seem to show reduction of variation due to drift. This is consistent with values of Tajima’s (1989) Dstatistics observed in these populations: the Indian populations show neutral or slightly positive D, while the Bogor population is negative. In a substructured population, drift leads to positive D values. For consistency with the negative D of the Bogor population, a migrant pool with negative D needs to be postulated. This may be caused by rapid population expansion. The process of drift in peripheral populations may then push these negative values back toward neutral or slightly positive values. Thus, in some Indian populations, two processes pushing in opposite directions might cancel out and lead to neutral or slightly positive D. In our twostep analysis, the allele spectrum in the migrant pool has an even more negative D than the sample from Bogor. Simulation results indicate, however, that this may be partly an artifact: in expanding populations, estimates of the Dstatistic are increasingly negative with increasing sample size (data not shown).
From a biological point of view, our observation of negative values of the Dstatistic for the Bogor population is particularly interesting, as our results also suggest that this population is close to the species center of D. ananassae. This observation contradicts the conventional notion (assumption) that ancestral populations are in an approximate equilibrium. A similar case was recently reported by Hamblin and Veuille (1999): using a single, putatively neutral marker, these authors found negative (though not significant) Dvalues for a number of African populations of D. simulans.
APPENDIX
Consider a haploid infinite island model and assume that the diffusion approximation or the Moran model holds. Since we assume that loci are unlinked and therefore conditionally independent, consideration of one locus suffices. Assume that the locus has K alleles in the pool of migrants; some of these may be missing in a sample from a particular population. The allelic proportions in the pool of migrants {ρ_{1},..., ρ_{K}} are constant over time, but unknown and need to be estimated. The main parameter of interest is the migration parameter for each population, γ_{i}, or equivalently the probability of two random alleles to coalesce within the population before migration, Θ_{i} = 1/(γ_{i} + 1). Given the allelic proportions in the pool of migrants, estimation of γ_{i} or, equivalently, Θ_{i} is independent among populations. We thus leave out the indices i and l for population and locus, respectively. Data consist of allele frequencies in the sample, denoted by {n_{1},..., n_{K}}.
The likelihood: The distribution of the data {n_{1},..., n_{K}} given the allelic proportions in the migrant pool {ρ_{1},..., ρ_{K}} and the migration parameter γ is a Dirichletmultinomial distribution (see Balding and Nichols 1995, 1997, and references therein):
The coalescence history: For simulating the coalescence history backward in time, we consider an infinitesimally short time interval, δ. Two types of events may happen in this interval: a migration event and a coalescent event. With a coalescent event, two lineages with the same allele collapse into one. With a migration event, a lineage is substituted by another one by migration. Since migration is independent of the allelic type, no information on which allele was in the lineage before the migration event is available. Hence, the lineage is exchangeable with all other lineages in the population for which no data are available. The lineage is thus dropped from the analyses exactly as one is dropped through coalescence.
Subsequently we need the posterior distribution of the type of the next allele sampled given the previously sampled allelic types {n_{1},..., n_{K}} and the allelic proportions in the migrant pool {ρ_{1},..., ρ_{K}}. This distribution has been derived previously (Balding and Nichols 1995, 1997; Stephens and Donnelly 2000). For consistent notation, we introduce the characteristic vector {x_{1},..., x_{k}}, where x_{i} = 1 for the chosen allele and 0 for all others. In our notation, the posterior of the characteristic vector is
We index the sequence of coalescence or migration events with s. Obviously, the last event must be a migration event. Assume that the lineages remaining in the analysis after the sth coalescence or migration event are {n_{1}(s),..., n_{K}(s)}. Introduce another characteristic vector {y_{1}(s),..., y_{K}(s)} as {x_{1}(s),..., x_{K}(s)}. The probability of a lineage being reduced by one by migration to {n_{1}(s)  x_{1}(s),..., n_{k}(s)  x_{k}(s)} within the interval δ given {n_{1}(s),..., n_{k}(s)}, {ρ_{1},..., ρ_{K}}, and γ is
The probability of a lineage being reduced by one by coalescence to {n_{1}(s)  x_{1}(s),..., n_{K}(s)  x_{K}(s)} within the interval δ given {n_{1}(s),..., n_{K}(s)} and γ is
Summing the relevant terms, one realizes that the posterior probability of class k to be chosen for either a migration or a coalescence event is n_{k}/n. Thus, if we again use the characteristic vector {x_{1},..., x_{k}}, the probability of class k to be chosen is a Bernoulli distribution
The coalescence history can thus be inferred by simulating backward until no more lineages remain in the population. For each locus, this requires at most n_{i}  1 samples from a Bernoulli distribution from the ith population. Records of the allelic types of all migrants need to be kept for updating the estimates of allelic proportions within the migrant pool.
The migration parameter: Updating γ or Θ involves sampling from a nonstandard distribution. [Note that even though we currently do not index Θ, we refer to the populationspecific migration parameter and not to the Weir and Cockerham (1984) summary parameter.] This distribution is proportional to the product of the conditonal distribution of Θ given the hyperparameters α and β in formula (1) and the likelihood of formula (A1) multiplied over all loci. If a flat prior for γ is used, the prior distribution is not proper; i.e., no constant can be found such that it integrates to one. Reparameterizing Θ= 1/(1 +γ) assures a proper posterior distribution with flat priors. Introducing the index l for the loci, we then have
The hyperparameters: Updating the hyperparameters α and β involves sampling from a nonstandard distribution proportional to formula (1). Again we employ a MetropolisHastings step.
Allelic proportions in the migrant pool: To get a new estimate of the allele frequencies in the migrant pool {ρ_{1},..., ρ_{K}}, the sum over all populations of allelic frequencies in each allelic class that enter the migrant pool needs to be determined. Given ρ, the probability of this sum of alleles is multinomial. The new allele proportions {ρ_{1},..., ρ_{K}} can thus be sampled from the sum of allelic frequencies using a Dirichlet distribution, where a flat or other Dirichlet prior distribution may be used.
The sampling scheme: We alternate between (i) sampling the coalescence history recursively using formula (A6) conditional on the migrant allelic proportions ρ and the differentiation parameter Θ_{i} for each population and locus, (ii) sampling the allelic proportions {ρ_{1},..., ρ_{K}} in the migrant pool conditional on the coalescence histories in all populations using a Dirichlet distribution for each locus, (iii) sampling the population differentiation parameters Θ_{i} from the Dirichletmultinomial distribution conditional on the observed allele frequencies and the hyperparameters using (A7), and (iv) sampling the hyperparameters α and β conditional on the population differentiation parameters Θ_{i}. This updating scheme converges to the joint posterior distribution of the parameters. From the joint posterior all marginal posterior distributions can be obtained easily. Furthermore, the sample of individuals that reach the migrant pool is used for calculating the sequence variation statistics once per 100 cycles.
Acknowledgments
We thank two reviewers and the editor for their comments on the manuscript, our colleagues at the evolutionary biology group at LudwigMaximilians Universität for discussion and critical reading of the manuscript, David de Lorenzo for designing the primers we used, and the Deutsche Froschungsgemeinschaft (grant no. STE 325/41) for financial support.
Footnotes

Communicating editor: M. Veuille
 Received February 5, 2003.
 Accepted August 6, 2003.
 Copyright © 2003 by the Genetics Society of America