Bayesian Inference of Recent Migration Rates Using Multilocus Genotypes
Gregory A. Wilson, Bruce Rannala

Abstract

A new Bayesian method that uses individual multilocus genotypes to estimate rates of recent immigration (over the last several generations) among populations is presented. The method also estimates the posterior probability distributions of individual immigrant ancestries, population allele frequencies, population inbreeding coefficients, and other parameters of potential interest. The method is implemented in a computer program that relies on Markov chain Monte Carlo techniques to carry out the estimation of posterior probabilities. The program can be used with allozyme, microsatellite, RFLP, SNP, and other kinds of genotype data. We relax several assumptions of early methods for detecting recent immigrants, using genotype data; most significantly, we allow genotype frequencies to deviate from Hardy-Weinberg equilibrium proportions within populations. The program is demonstrated by applying it to two recently published microsatellite data sets for populations of the plant species Centaurea corymbosa and the gray wolf species Canis lupus. A computer simulation study suggests that the program can provide highly accurate estimates of migration rates and individual migrant ancestries, given sufficient genetic differentiation among populations and sufficient numbers of marker loci.

IN recent decades, indirect estimates of gene flow (reviewed in Slatkin and Barton 1989) have been widely used by biologists, first with allozyme data and more recently with restriction fragment length polymorphisms (RFLPs), DNA sequence data, microsatellite markers, and single-nucleotide polymorphisms (SNPs). Direct estimates of migration rates based on mark-recapture or other methods can be impractical for large populations that exchange small numbers of migrants because the expected number of recaptures is too low; indirect estimates of gene flow using genetic markers are often the only recourse. Commonly used indirect estimators of gene flow, such as 4Nem = 1/FST - 1, are derived on the basis of simplified models of population structure that assume constant population sizes, symmetrical migration (at constant rates), and population persistence for periods sufficient to achieve genetic equilibrium (Wright 1931, 1969).

The development of coalescent theory (Kingman 1982; reviewed by Tavare 1984), which traces the ancestral genealogy of a sample rather than modeling changes of gene frequencies in the population as a whole, has allowed less restrictive models to be used in developing indirect estimators of gene flow. The new methods accommodate recent population expansions, nonsymmetrical migration, and other complexities that are typical of real biological populations (Beerli and Felsenstein 1999, 2001; Vitalis and Couvet 2001). However, even coalescent-based methods currently assume that population demography has followed a relatively simple model of either constant size or deterministic expansion (with constant migration rates) for roughly the last 4Ne generations, which is the average time until the sampled chromosomes coalesce to a most recent common ancestor (Kingman 1982). For populations with large Ne or species in highly disturbed habitats, this assumption may be unreasonable.

Recently, nonequilibrium approaches have been proposed for identifying migrants (Rannala and Mountain 1997; Pritchardet al. 2000) or hybrids between species (Anderson and Thompson 2002) and assigning individuals of unknown population affinity to potential source populations using multilocus genotypes (Paetkauet al. 1995; Rannala and Mountain 1997; Cornuetet al. 1999; Dawson and Belkhir 2001; Gaggiottiet al. 2002). These methods extract information about recent migration (within the last few generations) from transient disequilibrium observed at individual multilocus genotypes of migrants or individuals recently descended from migrants. In comparison with indirect estimators of long-term gene flow, these methods make relatively few assumptions, but are informative only about recent patterns of migration. The two approaches (long-term gene flow and recent migration estimation) are complementary, providing information about migration on different timescales. Previous methods for inferring recent migration have focused on identifying individual migrants and their source populations (Paetkauet al. 1995; Rannala and Mountain 1997) or jointly identifying migrants and populations (Pritchardet al. 2000). Existing methods do not explicitly estimate migration rates among populations.

In this article, we develop a new Bayesian multilocus genotyping method for estimating rates of recent migration among populations. The method requires fewer assumptions than estimators of long-term gene flow and can be legitimately applied to nonstationary populations that are far from genetic equilibrium. Moreover, the newly proposed method relaxes a key assumption of previous nonequilibrium methods for assigning individuals to populations and identifying migrants—namely that genotypes are in Hardy-Weinberg equilibrium within populations. We allow arbitrary genotype frequency distributions within populations by incorporating a separate inbreeding coefficient for each population. The joint probability distribution of inbreeding coefficients is estimated from the data. Our method also allows for missing genotype data by using data augmentation techniques to integrate over possible genotypes for individuals.

THEORY

Data and model parameters: Consider a collection of I populations of a diploid species, with discrete nonoverlapping generations, and let m = {mlq} be the migration rates between populations, where mlq is the fraction of individuals in population q that are migrants from population l (m can also be treated as time dependent). Assume that some proportion of an individual’s alleles originate via a single migrant ancestor that arrived at the current (or a past) generation (this is justified for low migration rates, see appendix a). The individual itself may also be a migrant, in which case 100% of its genome is of migrant origin. Define M = {Mh}, where Mh is the source of migrant ancestry for individual h, and t = {th}, where th is the generation at which a migrant ancestor of individual h arrived (e.g., if th = 0 the individual has no migrant ancestry, if th = 1 the individual is itself a migrant, etc.). M and t are then unobserved variables describing the ancestry of each individual. To allow population genotype frequencies to deviate from Hardy-Weinberg equilibrium we define F = {Fl}, where Fl is the inbreeding coefficient for population l and -1 ≤ Fl ≤ 1. Let p = {pijl} be the population frequencies of marker alleles, where pijl is the frequency of allele i at locus j in population l.

Let X = {Xhj} be the multilocus genotypes observed at J marker loci in a random sample of n diploid individuals, where Xhj is the genotype of individual h at locus j, and let S = {Sh} identify the population source for each sampled individual, where Sh is the population that individual h was sampled from. The number of individuals sampled from the lth population is nl. The data (observations) are X and S. The joint (and marginal) posterior probability distributions of the remaining parameters M, t, p, m, and F are estimated numerically using Markov chain Monte Carlo (MCMC) methods (Gamerman 1997). The estimated posterior probabilities are used to make inferences about these parameters (including point estimates). The elements of m are of primary interest, but other parameters, such as M and t, may also be of interest (as in Rannala and Mountain 1997) and can be estimated similarly.

Likelihood: The likelihood of the data is the probability of the observed genotypes given the model parameters. This is Pr(XS;M,t,F,p)=h=1nj=1JPr(XhjSh;Mh,th,F,p), (1) where Pr(XhjSh;Mh,th,F,p)={Φ(Xhj,g)ifMh=Sh=gandth=0,0ifMhSh=gandth=0,Φ(Xhj,r)ifMh=r,Sh=g,andth=1,(112th2)Φ(Xhj,g)+(12th2)φ(Xhj,r,g)ifSh=g,Mh=r,andth>1,} and Φ(Xhj,r)={(1Fr)pijr2+FrpijrifXhj(1)=Xhj(2)=i,2(1Fr)pijrpkjrifXhj(1)=iandXhj(2)=kforik,} and ρ(Xhj,r,g)={pijrpijgifXhj(1)=Xhj(2)=i,pijrpkjg+pkjrpijgifXhj(1)=iandXhj(2)=korXhj(2)=iandXhj(1)=kforik,} where Xhj(1) denotes the allele present on the maternal chromosome, and Xhj(2) denotes the allele present on the paternal chromosome. Note that we define th = 0 if Mh = Sh (i.e., if the individual has no immigrant ancestry). The likelihood presented in Equation 1 involves a product of individual genotype probabilities across marker loci and individuals because it is assumed that individuals are randomly sampled and the markers are unlinked.

Prior distributions of parameters: To calculate the probability of observing M and t, given m, we assume that the populations are large enough that there is negligible genetic drift over two, or three, generations (for a justification, see appendix a). The expected proportion of migrants from population l that arrive in the present generation (the generation at which sampling is carried out) is then mlq and the expected proportion of individuals with one migrant ancestor from the previous generation of migration is 2mlq (see appendix a). We use only first- and second-generation migrants to estimate mlq in this article, but more distant migrant ancestries could also be used. The probability distribution of M and t, given m, follows a multinomial distribution, Pr(M,tm)=l=1Inl!(t=12qlI([2t1mlq]nlqtnlqt!))×l=1I(mllll0nnll0!), (2) where mll=1t=12ql2t1mlq, and nlqt=h=1n(Mh,th,Sh), and (Mh,th,Sh)={1ifMh=l,Sh=q,andth=t,0otherwise.} We use uninformative (uniform) Dirichlet prior densities for m and p subject to the constraints i=1kljpijl=1,for allj=1,2,,Jandl=1,2,,I, where klj is the total number of alleles at locus j in population l and q=1Imql=1,for alll=1,2,,I. We assume a uniform prior on the interval (-1, 1) for the population inbreeding coefficient of population l, Fl.

Posterior distributions of parameters: The joint posterior probability density of the model parameters, applying Bayes’ theorem, is f(m,M,t,F,p,X,S)=Pr(XS;M,t,F,p)×Pr(M,tm)fp(p)fm(m)fF(F)Pr(XS). (3) The denominator of Equation 3 above involves high-dimensional sums and integrals and it is not practical to evaluate it explicitly for samples of hundreds of individuals. Here, we use MCMC methods to estimate the joint posterior probability density of Equation 3. This requires only that it be possible for the numerator to be evaluated; this can be done using Equations 1 and 2 given above. MCMC can be carried out efficiently, even for large samples. Details of the MCMC algorithm are given in appendix b.

EXAMPLES

Application to data from the plant Centaurea corymbosa: The plant species Centaurea corymbosa is currently found in only six populations in southern France. In a study by Freville et al. (2001), 228 individuals (minimum population sample size of 20) from these six populations were genotyped at six microsatellite loci. This data set provides a useful test for our method, as the genetic differentiation between most populations is large, likely as a result of limited seed and pollen dispersal (Frevilleet al. 2001). While the geographical distances between the populations vary, all occur within a 3-km2 area. Observed pairwise FST values between populations ranged between 0.03 and 0.39 (mean FST = 0.23). An assignment test performed as described in Rannala and Mountain (1997) assigned 91.7% of the individuals to their source population and 7.4% to a neighboring population (Frevilleet al. 2001).

Figure 1.

—Log-posterior probabilities of the proposed states for the gray wolf and the C. corymbosa microsatellite data. Log-posterior probabilities were measured through 600,000 iterations of the MCMC program, sampled every 500 iterations.

To estimate the posterior probability distributions of parameters the MCMC was run for a total of 3 × 106 iterations, discarding the first 106 iterations as burn-in (intended to allow the chain to reach stationarity). Samples were collected every 2000 iterations to infer posterior probability distributions of parameters of interest, including the population allele frequencies, migrant proportions, and individual immigrant ancestries. Figure 1 shows the log posterior probability plotted against the iteration number for the C. corymbosa data for the first 600,000 iterations. The increase in log probability appears to plateau after only ∼500 iterations.

To further examine the convergence of the MCMC algorithm, the posterior probability density of each allele frequency at each locus in each population (grouped in intervals of 0.05) was compared for two independent runs with random initial parameter values, using either 2500 or 3 × 106 iterations. The results are shown in Figure 2, A and B. If the two chains have converged, the relationship between their posterior probabilities should be linear. The high degree of scatter in the plot of 2500 iterations illustrates that the chains have not yet converged (Figure 2A). With 3 × 106 iterations, the relationship is much more linear (Figure 2B). A similar plot of the posterior densities of the inbreeding coefficients in two runs of 3 × 106 iterations also indicates a strong correlation between posterior probabilities estimated from the two independent runs (Figure 3A).

Figure 2.

—Posterior probability densities of the allele frequencies generated from two separate runs of the program. The runs differed in initial random seed and initial values of m and F. (A and C) The relationship between these runs over the first 2500 iterations, before equilibrium has been reached. (B and D) The relationship between these runs after equilibrium has been reached. The latter runs consist of 3 × 106 iterations, a burn-in of 106, and a sampling period of 2000. Allele frequencies are grouped in 0.05 intervals.

The mean posterior probabilities of the immigration rates among populations for the C. corymbosa data are shown in Table 1. Most populations have low migrant proportions (when averaged over the posterior probabilities) with the exception of population E1, which appears to have a large expected proportion of migrants (m = 0.25) from population E2. There appears to be a source-sink relationship between the two populations because the expected proportion of migrants into population E2 from E1 is much smaller (m = 0.00). Figure 4, A and B, presents the posterior densities of the frequencies of two alleles in a population with either a low (Figure 4A) or a high migration rate (Figure 4B). Population sample sizes are nearly identical (38 and 40 individuals, respectively). Both the migration rate and the sample size affect the variance of the posterior probability distribution; higher migration rates and smaller sample sizes both increase the variance. In Figure 4A, the estimated 95% credible set of values for the allele frequency is (0.50, 0.80) while in Figure 4B it is (0.55, 0.95). Migration can also cause the mode of the posterior density of allele frequency to differ from the maximum-likelihood estimate of allele frequency that would be obtained by using the population sample directly and ignoring immigration as is done in many population assignment tests (e.g., Paetkauet al. 1995).

Figure 3.

—Posterior probability densities of inbreeding coefficients generated from two different runs of the program. Settings are as in Figure 2.

Another property of the populations that can be studied is the posterior probability distribution of the total numbers of nonimmigrants, first-generation immigrants, and second-generation immigrants. Figure 5, A-C, shows these posterior distributions for C. corymbosa population E1. The expected proportions of nonimmigrants and first-generation immigrants overlap, although the variance of the posterior distribution of the proportion of first-generation migrants is lower. The expected proportion of second-generation immigrants is about twice as high and the variance is also larger (this is likely due in part to the fact that assignments of second-generation immigrants are less certain than those of first generation). The 95% credible set for the proportion of first-generation migrants is (0.10, 0.45) vs. (0.30, 0.75) for second-generation migrants and (0.00, 0.55) for nonmigrants. The reason that the probability of the proportion of nonimmigrants being above 0.55 is negligible, while the migration rate into this population is ∼0.25 (Table 1), is outlined in appendix a. The prior predicts that the expected proportion of first-generation migrants should be m and the proportion of second-generation migrants should be 2m. As no higher orders of migrants are currently considered in our method, the average proportion of nonmigrants should be ∼1 - m - 2m under our model, or in this case, 0.25, which falls near the center of our 95% credible set.

View this table:
TABLE 1

Migration rates among C. corymbosa populations

Figure 4.

—Posterior probability density of a particular allele over all sampled iterations. (A) Allele 174 from locus 13D10 in population Pe. (B) Allele 163 at locus 13B7 in population E1. (C) The frequency distribution of allele 128, locus cxx140, Fort St. John population. (D) The distribution of allele 200, locus cxx204, Great Bear Lake population. The gray line represents the maximum-likelihood estimate for this allele when calculated from individuals sampled from this population. Settings for the MCMC chain are as in Figure 2.

Figure 5.

—Posterior probability distribution of the proportion of the individuals in a population assigned as nonimmigrants (0), first-generation migrants (1), and second-generation migrants (2) at each sampling iteration. E1 and the Southern Richardsons are populations of C. corymbosa and wolf, respectively.

Our method can also be used to study the migrant ancestry assignments of individuals, taking account of overall population migration rates and uncertain population allele frequencies. Figure 6A shows the posterior probabilities of nonimmigrant, first-, or second-generation immigrant ancestry for five individuals from population E1 and one individual from population E2. Individual 4-E1 is most likely to be a first-generation immigrant, individuals 11-E1 and 37-E1 are most likely to be second-generation immigrants, and individuals 22-E1 and 31-E1 are roughly equally likely to be either nonimmigrants or second-generation immigrants. Our method is able to identify second-generation immigrants with a high level of certainty due to the linkage disequilibrium observed in the multilocus genotypes of individuals whose parents have originated in different populations. Individual 1-E2 is most likely to be a nonimmigrant. Excluding population E1, in only 3 of 190 cases did an individual assign with probability >0.05 to a population other than the one it was sampled from, indicating very low levels of migration.

Figure 6.

—Posterior distribution for the assignment of individuals to ancestry states 0 (▪), 1 (Graphic), and 2 (□) for C. corymbosa and wolf. All individuals are from the populations examined in Figure 5, except the last C. corymbosa individual, which is from E2.

The posterior probability density of the population inbreeding coefficient, F, was concentrated near 0 for most populations, although the standard deviation was large in population E1, which had the greatest amount of immigration; in that case, the estimated mean of the posterior density was F = 0.027 but the standard deviation was 0.39. This is likely a result of the lack of information available to the method for estimating F, as most individuals in this population have high posterior probabilities of being first- or second-generation migrants. The remaining populations had much lower standard deviations (<0.08). The population Pe had significant posterior probability associated with relatively large positive values of F (mean of posterior density was F = 0.123 with a standard deviation of 0.05), suggesting potential local inbreeding effects.

Application to gray wolf data: In a study of population genetic structure of gray wolves, Canis lupus, in the Canadian Northwest, Carmichael et al. (2001) genotyped nine microsatellite loci in 491 individuals (minimum sample size of 9 individuals) from nine separate regions. This data set is a valuable test of our method, as the amount of differentiation between populations has a fairly wide range. Some populations are situated fairly close to one another, with no obvious physical barriers to gene flow between them (for example, the Tuktoyaktuk/Inuvik and Paulatuk populations, FST = 0.009), while others are separated by mountain ranges (Kluane National Park), the Arctic Ocean (Banks Island), or large geographic distances (Fort St. John). As such, these samples allow us to determine the effect of differences in genetic differentiation on our method’s ability to obtain reliable estimates of migration rates and individual immigrant ancestries.

To estimate the posterior probability distributions of the parameters the MCMC was run for a total of 3 × 106 iterations, discarding the first 106 iterations as burn-in. Samples were collected every 2000 iterations to infer posterior probability distributions of parameters. Figure 1B shows the log-posterior probability plotted against iteration number for the gray wolf data. The increase in log-probability appears to plateau after ∼10,000 iterations. Figure 2, C and D, shows the correlations (between two independent MCMC runs) of the posterior probability densities of each allele frequency, at each locus, in each population (grouped in intervals of 0.05). The high degree of scatter in the plot of 2500 iterations vs. the plot of 3 × 106 iterations (which is highly linear) once again illustrates that the chains have not yet converged at 2500 iterations but have the appearance of convergence after 3 × 106 iterations. A similar plot (Figure 3B) of the posterior densities of the inbreeding coefficients in two runs, each with 3 × 106 iterations, also indicates a strong correlation between posterior probabilities (suggesting the chains have converged).

The means (averaged over posterior probabilities) of the immigration rates between populations for the gray wolf data are shown in Table 2. Four of the populations appear quite isolated (Banks Island, Fort St. John, Kluane National Park, and Northern Richardson Mountains). The remaining five populations all have at least one major source of immigrants. There were some notably large mean migration rates between wolf populations. The mean migration rate from the Northern Richardson Mountains to the Southern Richardson Mountains was 0.22; from Tuk/Inuvik to Great Bear Lake, 0.14; from Tuk/Inuvik to Paulatuk, 0.21; and from the Southern Richardson Mountains to Tuk/Inuvik, 0.23. All of these populations are relatively close to one another, occurring on the mainland of the northern Yukon or the Northwest Territories. However, it is worth noting that most of these populations do not have symmetrical migration rates, suggesting that movement of animals between these regions is predominantly unidirectional. For example, while the mean migration rate from the Northern to the Southern Richardson Mountains populations was 0.22, the mean migration rate in the opposite direction was only 0.04. The mean migration rate from Banks Island to Victoria Island was also fairly large at 0.19 while the reverse rate was near zero (see Table 2). These islands are quite close to one another and are joined by ice during the winter months.

View this table:
TABLE 2

Migration rates among gray wolf populations

Figure 4, C and D, presents the posterior densities of the frequencies of two alleles in populations with either a low immigration rate and a larger sample size or a high immigration rate and a smaller sample size. In these examples, the sample sizes are quite different between the populations (e.g., 41 individuals for Fort St. John and 22 individuals for Great Bear Lake). Immigration causes the mode of the distribution to exceed the maximum-likelihood estimate by a considerable amount (Figure 4D) and the variance of the estimated posterior density of allele frequency is also much larger in the example with a smaller sample size and higher migration rate. In Figure 4C the estimated 95% credible set for the allele frequency is (0.35, 0.60) while in Figure 4D it is (0.10, 0.70).

Figure 5, D-F, shows the posterior probability distributions of the total proportions of nonimmigrants and first- and second-generation immigrants (from any population) for the Southern Richardson Mountains gray wolf population. The mode of the posterior proportion of nonmigrants is much lower than that for the posterior distribution of the proportion of either first- or second-generation migrants. Also, the mode of the posterior distribution of second-generation migrants is roughly twice that of first-generation migrants. The variance of the posterior distributions of first- and second-generation migrant proportions is much greater than that of the nonmigrant proportion. The 95% credible sets for the former are (0.20, 0.50) and (0.40, 0.70), respectively, vs. (0.00, 0.20) for the latter.

Figure 6B shows the posterior probabilities of nonimmigrant, first-, or second-generation immigrant ancestry for four individuals from the Southern Richardson Mountains population. Individual MP9205 is most likely to be a nonimmigrant. Individual MP9224 is most likely to be a first-generation immigrant, individual MP9219 a second-generation immigrant, and individual MP9220 is fairly evenly split between being a first- and second-generation immigrant. The posterior probability density of the population inbreeding coefficient, F, was concentrated near 0 for most populations, with the exception of two populations, Great Bear Lake and Northern Richardson Mountains, which had significant posterior probability associated with negative values of F. F was also approximately uniformly distributed between -1 and +1 in the Victoria Island population, likely because most of the individuals in this population were assigned as migrants.

SIMULATION STUDY

Simulation methods: To evaluate the statistical properties of the new method we simulated samples from populations exchanging migrants according to the Wright (1931) island model (at stationarity). The allele frequencies (assuming biallelic loci) in pairs of populations receiving migrants from a common source, with allele frequency qi at locus i, were simulated from the stationary probability density function (pdf) under the Wright island model. The simulated markers could be SNPs, for example, which are typically biallelic. The pdf of the allele frequency at locus i in population j is f(pij)=Γ(4Nm)Γ(4Nmqi)Γ(4Nm[1qi])pijNmqi1(1pij)4Nm(1qi)1. (4) The pdf of the allele frequencies at J unlinked loci in population i is f(pj) =Πif(pij), where the product is over the J loci and pj = {pij} is the vector of allele frequencies in population j. The alleles at each locus were therefore simulated as independent and identically distributed with common pdf given by Equation 4. A sample of n individuals was generated from each simulated population according to the multinomial sampling distribution of Equation 2. It was assumed that (recent) migration occurs between the two populations with rates m12 and m21. To reduce the number of parameters to be considered in our simulations, we assumed that m = m12 = m21 and qij = q for all i, j.

If an individual is a nonmigrant, the genotype is generated by assigning alleles according to the Hardy-Weinberg proportions, conditional on the simulated allele frequencies in the population from which the individual was sampled. A first-generation migrant similarly has its genotype assigned according to Hardy-Weinberg proportions, but conditional on the allele frequencies in the alternative population. A second-generation migrant has its genotype assigned by drawing an allele from each population, respectively, at each locus. To simplify the comparisons, we define the population allele frequencies in terms of FST by using the standard result for the expected FST at stationarity under the Wright model, FST = 1/(4Nm + 1), and solving for 4Nm in terms of FST to obtain 4Nm = 1/FST - 1. The right-hand side of this equation was substituted for 4Nm in Equation 4. The simulation results are therefore presented in terms of FST, m, q, and n. To evaluate the statistical performance of the estimator of migration rates under the simulations we focused on two statistics, the mean square error (MSE) and the bias (see Casella and Berger 1990). MSE is a function of both the bias and the variance of the estimator (MSE = bias2 + variance). A decrease in MSE therefore indicates an improvement in the estimator. To evaluate the statistical accuracy of migrant ancestry assignments we examined the proportion of migrants from each ancestral class (e.g., nonmigrants, first-generation migrants, and second-generation migrants) that were assigned to a given class with maximum posterior probability.

To examine the performance of the model under various conditions, different values were assigned to a number of parameters. The most common allele in a population (q) was assigned a value of either 0.5 or 0.9. The number of individuals sampled from each population (n) was either 20 or 100. Populations were separated by FST values of 0.01, 0.10, or 0.25. Migration rates between populations (m) were 0.01, 0.05, 0.10, or 0.20. Three different numbers of loci were simulated: 5, 10, and 20. The parameters listed above were used for simulations in all possible combinations, for a total of 144 parameter combinations. Each of these combinations was replicated 10 times. As each simulated data set contained two populations, data were generated for 20 simulated populations for each combination of parameter settings. The MCMC was run with the same settings (number of iterations, etc.) as in each of the examples. As the results with q = 0.5 were very similar to those obtained with q = 0.9, only the former are examined here.

Simulation results: The results of the simulation study are summarized in Figures 7, 8, 9, 10. Figure 7 shows the influence of the number of loci and the migration rate used for the simulations on MSE and bias of the estimated migration rate for a fixed degree of genetic differentiation (FST = 0.25). In the case of 5 loci (Figure 7D), the data have little influence on the estimates, by comparison with the influence of the prior. The prior specifies that m is uniform on the interval (0, 0.33) with mean 0.167. When the actual value of m exceeds the mean of the prior (e.g., when m = 0.2), the estimator has a negative bias. When the actual value of m is less than the mean of the prior (e.g., m ≤ 0.1) the estimator has a positive bias, as expected if the posterior is essentially similar to the prior. With 20 loci, the data have a greater influence than the prior and we see a smaller positive bias for all values of m considered (Figure 7B). In general, MSE decreases with an increase in the number of loci sampled (Figure 7, A and C) and with increasing sample size, although sample size appears less important in this case.

Figure 7.

—MSE and bias for the migration rate estimate from simulated data. The following parameters were used for data simulation: 5 (C and D) or 20 (A and B) loci, 20 (Graphic) or 100 (▪) individuals per population, and migration rates of 0.2, 0.1, 0.05, or 0.01, when FST = 0.25.

It is apparent from our simulation analyses that the effects of sampling either more individuals or more loci are correlated. With a small number of loci, increasing the sample size (from 20 to 100) has little effect on the bias or MSE of the estimated migration rate (Figure 8, A and B), but with a larger number of loci (20 loci), increased sample size dramatically reduces bias and MSE (Figure 8, C and D).

The migration rate and the level of genetic differentiation between populations also influence the mean (and variance) of the maximum posterior probabilities (i.e., the highest posterior probability assignment) of individual migrant ancestries. In the case of a high degree of genetic differentiation between populations (FST = 0.25) and 20 loci, the mean of the maximum posterior probability assignment (across sampled individuals) increases with decreasing migration rate and the variance of the maximum posterior probability (across individuals) decreases (Figure 9, A and B). In the case of low genetic differentiation between populations (FST = 0.01) and 5 loci the migration rate has little influence on the mean or variance of the maximum posterior probability assignments (Figure 9, C and D).

Figure 8.

—MSE and bias for the migration rate estimate from simulated data. The following parameters were used for data simulation: FST = 0.01 and 5 loci (A and B) or FST = 0.10 and 20 loci (C and D). Simulations were performed with either 20 (Graphic) or 100 (▪) individuals per population and migration rates of 0.2, 0.1, 0.05, or 0.01. MSE and bias for the prior (□) are also given.

Figure 10 examines the accuracy of the individual migrant ancestry assignments as a function of migration rate, sample size, and number of loci when populations with a high degree of genetic divergence (FST = 0.25) are considered. For each of the categories 0 (nonmigrant), 1 (first-generation migrant), or 2 (second-generation migrant), the total population of individuals actually belonging to that category is represented by the height of the histogram bar. Each histogram bar is then divided into three different shades, representing the proportion of individuals actually belonging to that category that are assigned to each of the three categories. If the assignments were perfectly accurate, each histogram bar would be filled with a single shade (corresponding to the migrant ancestry class represented by that histogram bar).

Figure 9.

—Mean and variance of the maximum posterior probability for each individual migrant ancestry from simulated data. The following parameters were used for data simulation: FST = 0.25 and 20 loci (A and B) or FST = 0.01 and 5 loci (C and D). Simulations were performed with either 20 (Graphic) or 100 (▪) individuals per population and migration rates of 0.2, 0.1, 0.05, or 0.01.

Of the four cases shown in Figure 10, the cases with either high migration rate (m = 0.2) and large samples of individuals (100) and loci (20) or low migration rate and small samples of individuals (20) and loci (5) provide the most accurate assignments (Figure 10, A and D). Decreasing the number of loci sampled from 20 to 5 has a large effect in decreasing the accuracy of assignments (Figure 10, A and B), but increasing the number of individuals sampled has only a modest effect on accuracy (Figure 10, B and C). Finally, decreasing the migration rate also has a large effect, improving the accuracy of the method even when only 5 loci and 20 individuals are sampled (Figure 10, C and D). At least part of the explanation for this trend is the fact that with lower migration rates population allele frequencies are more accurately estimated (due to the larger proportion of nonmigrants in the sample).

In conclusion, although it is impossible to generalize because of the enormous number of possible parameter combinations that can occur, our simulations suggest that with five or fewer loci and low migration rates very little information is available for inferring migration rates; increasing the number of individuals sampled has a modest effect in improving estimation except in certain cases, such as with low migration and a high degree of genetic differentiation among populations. A higher level of genetic differentiation among populations results in improved accuracy of estimated migration rates and migrant ancestry assignments. Migrant ancestries are most accurate when either a large number of loci and individuals are sampled or migration rates are low.

DISCUSSION

In this article, a new Bayesian method is presented for use with allozyme, microsatellite, RFLP, or SNP multilocus genotype data, which allows one to simultaneously infer recent migration rates, population allele frequencies, population inbreeding coefficients, individual migrant ancestries, and other parameters of potential interest. Our method should be of interest to ecologists assessing the relative importance of specific patterns of population dynamics in nature, the prevalence of male- (or female-) biased dispersal, the importance of geographic barriers to dispersal, and so on.

We have applied our method to two previously published microsatellite data sets for plants (C. corymbosa) and mammals (gray wolves) to illustrate its use. We have shown that for each of these data sets reasonably precise information about recent migration patterns can be extracted. In the case of the C. corymbosa data, a highly asymmetrical pattern of immigration in one pair of populations (E1 and E2) supports the existence of a source-sink population structure.

Another pattern observed in both example analyses is that a greater proportion of individuals in populations with ongoing migration have more distant migrant ancestry (e.g., second-generation vs. first-generation migrant ancestry). This is as expected under the low migration rate approximation presented in appendix a. If migrants beyond the first generation are ignored in an assignment test the result may be biased so that individuals with second-generation migrant ancestry are incorrectly assigned as first-generation migrants. It was also observed that estimated population allele frequencies could deviate considerably from maximum-likelihood estimates (observed proportions of alleles) in populations experiencing high rates of immigration; it is therefore important to simultaneously estimate individual migrant ancestries and population allele frequencies as we have done in this article. Failing to do so may increase the likelihood that an immigrant individual is incorrectly assigned as a nonimmigrant due to incorrect estimation of the allele frequencies within populations. This study therefore suggests that it may be preferable to estimate migration rates, migrant ancestries, and allele frequencies simultaneously in population assignment tests.

Figure 10.

—The proportion of individuals with a migrant ancestry of 0 (▪), 1 (Graphic), and 2 (□) (size of the vertical bar) who have their maximum posterior probability in each state (proportion of the bar shaded). Data were simulated from a population FST of 0.25. Simulations were performed with a migration rate of 0.2 (A-C) or 0.05 (D), 100 (A and B), or 20 (C and D) individuals, and 20 (A) or 5 (B-D) loci.

The results of our limited simulation study indicate that very accurate estimates of migration rates and individual migrant ancestries can be obtained when levels of genetic differentiation among populations are large, migration rates are low, and 20 or more loci are examined. If 5 or fewer loci are examined little information may be available, even if a large number of individuals are sampled. To explore the robustness of migration rate estimates and assignments for particular data sets, it may be advisable to carry out preliminary simulations to determine the expected accuracy of the method, given the observed level of genetic differentiation among populations. In our simulation study, we considered only diallelic loci and it is likely that accuracy may increase with increasing numbers of alleles (e.g., with microsatellite loci vs. SNPs).

There are a number of ways in which the approach presented here could be extended in the future. First, we have ignored preexisting patterns of genetic differentiation among populations; our population-specific inbreeding coefficients consider only identity by descent (IBD) of alleles (making up genotypes) within populations. One could take direct account of population structure by introducing additional F-statistics that describe the probabilities of IBD of alleles sampled from different populations (in the case of individuals with mixed migrant ancestry). This could improve performance because the allele frequencies in populations with low levels of differentiation are not independent and genotype sample information can therefore be effectively “combined” across populations (through the use of F-statistics) to provide improved estimates of allele frequencies (in the extreme case, imagine two populations with no differentiation; a sample from the first population can be used to estimate allele frequencies in the second).

Another extension of our approach could be to allow immigration rates to vary over time. Posterior probabilities under models with constant or variable immigration rates could then be compared, using predictive posterior probabilities (see Bernardo and Smith 2000) to test the hypothesis of constant immigration rates during the last few generations. This might potentially allow one to directly address the relationship between immigration and gene flow. Strictly speaking, gene flow involves both immigration and local reproduction. If the rates of migration in the current and previous generations are similar this suggests that there is no difference in breeding success between residents and migrants (gene flow equals immigration rate), etc.

A disadvantage of our method, as currently formulated, is that it allows only the proportions of immigrants in a population to be estimated; it does not allow one to estimate directly the total proportion of individuals that emigrate from a population or the proportion that emigrate from one particular population to another. For example, a small population may have a large proportion of the total individuals in the population migrating to a particular large population but the fraction of migrants detected in the large population will be low (because of the relative difference in the population sizes) and will provide no indication of the large proportion of actual emigration from the source population. One way to deal with this would be to estimate emigration rates that are corrected for the relative population sizes (if known). Alternatively, if temporal samples were collected, with replacement, information from unique individual genotypes (or mark-recapture tagging) could be combined with a method such as ours to jointly estimate population sizes and relative migration rates. Another assumption of the method is that all populations exchanging migrants have been sampled. The effect of this assumption may be important and a goal of future research should be to devise models that allow some degree of migration from “unobserved” populations for which no reference allele frequencies are available. This could likely be done using the flexible MCMC framework presented here.

To derive the prior probability distribution for individual immigrant ancestries, we have considered the distribution of migrant ancestries in a population in the limit of low migration rates and random mating between migrants and residents. Simulation studies are needed to determine the robustness of this approximation in the face of high migration rates and local inbreeding among migrant founders. Despite some outstanding issues of interpretation and reliability, methods for estimating recent migration rates using multilocus genotypes, such as we have presented here, should provide a useful (and complementary) alternative to existing methods, on the basis of diffusion approximations or coalescent theory, aimed at estimating historical migration rates under particular demographic scenarios. On balance, we are optimistic that new methods for inferring contemporary migration rates and gene flow will ultimately require fewer assumptions and will yield information that is highly relevant to conservation biologists, ecologists, human geneticists, and others dealing with practical problems involving recent (or ongoing) migration and admixture among study populations.

The program BayesAss, written in C, is available from our website at http://rannala.org.

Figure A1.

—Several possible patterns of immigrant ancestry that would each result in either all of an individual’s genes arising from an immigrant source (top of figure, ϕ = 1) or one-half of an individual’s genes arising from an immigrant source (bottom of figure, ϕ = ½). Immigrants are denoted by solid circles and nonimmigrants by open circles. The probability of each pattern, given a migration rate m and assuming random mating, is given below each part.

APPENDIX A

Expected migrant proportions: Define the probability (per generation) that an individual is a migrant as m and let ϕ be the expected fraction of alleles at a locus that is derived from migrants. By enumerating all possible patterns of ancestry (individual is a migrant, individual is a nonmigrant but both parents are migrants, etc.) that result in a given value of ϕ, we obtain Pr(ϕ=1)=m+(1m)m2+=m+O(m2).Pr(ϕ=12)=2m(1m)2+(42)m2(1m)5+=2m+O(m2).Pr(ϕ=14)=4m(1m)6+(82)m2(1m)13+=4m+O(m2), where the notation O(m2) denotes terms of order m2 and higher. The first term in each series is the probability of a single migrant ancestor: In the case of ϕ = 1 the individual is a migrant (at generation 1); in the case of ϕ = 1/2 the individual has a migrant parent (at generation 2); and in the case of ϕ = ¼ the individual has a migrant grandparent (at generation 3). Several possible ancestries leading to ϕ = 1 and ϕ = ½ are shown in Figure A1. The other terms allow possibilities such as two immigrant parents at generation 2 (in the case of ϕ = 1), etc.

The first term in each of the three possibilities listed above (migrant, migrant parent, and migrant grandparent) is a linear function of m, and the remaining terms are of order m2 and higher. If m is small the higher-order terms can be neglected and we need consider only possibilities involving a single migrant ancestor at some generation (this approximation is implicit in the method of Rannala and Mountain 1997). In the limit of small m, we expect a fraction m of individuals in the population to be first-generation migrants, a fraction 2m to have one migrant parent, a fraction 4m to have one migrant grandparent, and so on. Individuals with migrant ancestry beyond parents will have only one-quarter of their genome derived from migrant ancestors, on average, and for smaller numbers of loci such individuals will be statistically indistinguishable from nonmigrants; we have therefore chosen to use only the previous two generations of migrant ancestry to estimate m, although more distant generations could also be included with sufficient numbers of loci.

Constant allele frequencies: Assume that a Fisher-Wright population of constant size Ne receives migrants at rate m. The deterministic change in allele frequency in the population due to migration in each generation is Δp=_mΔp0, where Δp = p1 - p0 is the change in the population allele frequency in a single generation and Δp0 = p0 - pm is the difference in allele frequency between the population that is the migrant source and the population from which individuals are sampled. For a single diallelic locus, the measure of population differentiation, FST, is defined as FST=σp2p(1p) (see Wright 1969), where is the average allele frequency across populations and σp2 is the variance of allele frequencies across populations. We can write σp2 for our pair of populations as σp2=12[(pp0)2+(ppm)2],=12[(p0+pm2p0)2+(p0+pm2pm)2],=14Δp02. For a given value of FST, the difference Δp takes on its most extreme values when = ½ and the value of FST is then Δp02 . In that case, we can rewrite Δp as Δp=mFST. We now have an expression for the magnitude of the change of allele frequency (per generation) under migration pressure in a population that receives migrants from another population with a specified level of differentiation between the populations. If m < 0.05 and FST < 0.05 then Δp < 0.01 and the change of allele frequency over a few generations will be negligibly small. Similarly, twice the standard deviation of the allele frequency change due to drift will be 2σp=2p0(1p0)2Ne=2p0(1p0)Ne. The change in allele frequency under drift will be greatest when p0 = ½ and in that case 2σp=12Ne . If Ne > 5000 then 2 σp < 0.01 and the change of allele frequency over a few generations will be negligibly small. These values define boundaries beyond which the approximations underlying the proposed method will be well satisfied. In such cases, the resulting estimates should be accurate. The method may provide reasonable estimates for larger values of m and FST (or smaller Ne) as well but the specific range of applicability remains to be shown. Simulation studies are needed to evaluate the performance of the method under a range of conditions.

APPENDIX B

MCMC algorithm: The Metropolis-Hastings (MH) algorithm (Metropoliset al. 1953; Hastings 1970) was used to numerically calculate the posterior probability density of the parameters in our analyses. The basic idea is to construct a Markov chain with a stationary distribution that is the joint posterior distribution of the parameters to be estimated. This chain is simulated and samples from the chain are used to make inferences about joint or marginal posterior probabilities of parameters. The implementation of the MH algorithm used in our program has four steps at each iteration of the chain. At each step (outlined below) a particular set of parameters are potentially modified.

Modifying population migration rates: The matrix of population migration rates at iteration a, denoted as m[a], is modified to be m[a + 1] = m* with probability αm(mm[a])=min{1,Pr(M,tm)Pr(M,tm[a])}. The nominating function g(m*|m[a]) is as follows: Choose one of the I 2 elements of the migration matrix to be modified with uniform probability 1/I 2. The migration rates are constrained by our model such that mll=1ql(mlq+2mlq)=13qlmlqandmll0. It follows that 13qlmlq0,qlmlq13,mll(23,1). To maintain these constraints, we used the following proposal scheme. If element l, q is chosen (lq), the proposed value is mlq* = mlq[a] + z, where z is chosen on a uniform interval (-δm, +δm) with reflecting boundaries, where δm = max{0.10, 1 - mll}. If mlq* > ⅓ or mlq* < 0 then mlq* is reflected back onto the interval (0, ⅓) by an amount mlq[a] + z - ⅓ or - z - mlq[a]. The remaining elements jq of row l are adjusted so that they sum to 1 by using the transformation mlj=mlj(1mllmlq)jq,jlmlj,for alljq,jl. If element mll is chosen, the proposed value is mll* = mll[a] + z, where z is chosen on a uniform interval (-δm, +δm) with reflecting boundaries, where δm ≤ ⅓. If mll* > 1 or mll* < ⅔ then mlq* is reflected back onto the interval (⅔, 1) by an amount mlq[a] + z - 1 or 2/3 - z - mlq[a]. The remaining elements are adjusted to sum to 1 using the transformation mlj=mlj(1mll)jqmlj. We assumed a uniform Dirichlet prior for m and a uniform prior (on the integers 0, 1, 2) for ti so that the terms in the MH ratio involving the priors for m and t cancel. The nominating function g(m*|m[a]) described above is symmetrical so that these terms also cancel from the MH ratio.

Modifying individual migrant ancestries: The matrix of individual migrant ancestries at iteration a, denoted by the composite parameters M[a] and t[a], are modified to be M[a + 1] = M* and t[a + 1] = t* with probability αM,t(M,tM[a],t[a])=min{1,Pr(M,tm)Pr(XS;M,t,F,p)g(M[a],t[a]M,t)Pr(M[a],t[a]m)Pr(XS;M[a],t[a],F,p)g(M,tM[a],t[a])}, where g(M[a],t[a]M,t)g(M,tM[a],t[a])=nlqt+1nlqt. The nominating function g(M*, t*|M[a], t[a]) is as follows: Choose one of the n sampled individuals to have its migrant ancestry modified with uniform probability 1/n. There are 2I - 1 possible states for the migrant ancestry of the chosen individual; it can be a nonmigrant or a first- or second-generation migrant from one of the remaining I - 1 populations. The proposed change for an individual must be to one of the 2I - 2 states other than its present state and each possibility is assigned a uniform probability 1/(2I - 2).

Modifying population allele frequencies: The matrix of population allele frequencies at iteration a, denoted as p[a], is modified to be p[a + 1] = p* with probability αp(pp[a])=min{1,Pr(XS;M,t,F,p)Pr(XS;M,t,F,p[a])}. The nominating function g(p*|p[a]) is as follows: Choose one of the I populations with uniform probability 1/I, choose one of the J loci with uniform probability 1/J, and choose one of the klj alleles at locus j in population l with uniform probability 1/klj. If allele i at locus j in population l is chosen the proposed value is pijl* = pijl[a] + z, where z is chosen on a uniform interval (-δp, +δp) with reflecting boundaries and the remaining allele frequencies are adjusted so that the proposed allele frequencies sum to 1.

Modifying population inbreeding coefficients: The vector of population inbreeding coefficients at iteration a, denoted as F[a], is modified to be F[a + 1] = F* with probability αF(FF[a])=min{1,Pr(XS;M,t,F,p)Pr(XS;M,t,F[a],p)}. The nominating function g(F*|F[a]) is as follows: Choose one of the I populations with uniform probability 1/I. The proposed value is Fl* = Fl[a] + z, where z is chosen on a uniform interval (-δF, +δF) with reflecting boundaries such that Fl* remains on the interval (-1, +1).

Modifying genotypes with missing data: If X- is a submatrix of X = {X-, X+} containing the missing genotypes for each individual, the proposed genotypes at these loci at iteration a, denoted as X-[a], were modified to be X-[a + 1] = X-* with probability αX(XX[a])=min{1,Pr(X+,XS;M,t,F,p)Pr(X+,X[a]S;M,t,F,p)}. The nominating function g(X-*|X-[a]) is as follows: Choose any one of the LT = ΣLi loci with missing data with uniform probability 1/LT, where Li is the number of loci with missing data for individual i. Modify the locus to become genotype u, v with uniform probabilities 2/[kl(kl - 1)] if uv and 1kl2 if u = v where kl is the number of alleles (in all sampled populations) at locus l.

Acknowledgments

We thank Helene Freville and Isabelle Olivieri for providing us with their plant microsatellite data and Lindsey Carmichael and Curtis Strobeck for providing us with their wolf microsatellite data. We are grateful to the two anonymous reviewers. This research was supported by the National Institutes of Health grant HG01988 and Canadian Institutes of Health research grant MOP 44064 to B.R.

Footnotes

  • Communicating editor: J. Hey

  • Received May 22, 2002.
  • Accepted December 10, 2002.

LITERATURE CITED

View Abstract