## Abstract

We present a new multilocus genotype method that makes inferences about recent immigration rates and identifies the environmental factors that are more likely to explain observed gene flow patterns. It also estimates population-specific inbreeding coefficients, allele frequencies, and local population *F*_{ST}'s and performs individual assignments. We generate synthetic data sets to determine the region of the parameter space where our method is and is not able to provide accurate estimates. Our simulation study indicates that reliable results can be obtained when the global level of genetic differentiation (*F*_{ST}) is >1%, the number of loci is only 10, and sample sizes are of the order of 50 individuals per population. We illustrate our method by applying it to Pakistani human data, considering altitude and geographic distance as explanatory factors. Our results suggest that altitude explains better the genetic data than geographic distance. Additionally, they show that southern low-altitude populations have higher migration rates than northern high-altitude ones.

THE study of dispersal processes is an essential problem in ecology, population genetics, conservation, and management of wildlife. For this reason, the estimation of migration rates has been one of the most investigated problems in population biology. Migration parameters can be directly estimated using ecological approaches such as mark-release-recapture methods but they are not applicable to the study of large or extended metapopulations. In these cases, population genetics approaches provide a better alternative because the information contained in DNA can provide gene flow parameter estimates for different and complementary timescales. Methods based on coalescent theory provide long-term migration rates because they use the genealogical information contained in a sample of genes (*e.g.*, MIGRATE, Beerli and Felsenstein 2001). On the other hand, methods based on multilocus genotypes (*e.g.*, BAYESASS, Wilson and Rannala 2003) provide estimates of recent immigration rates by extracting the gametic disequilibrium signal generated by immigrant individuals or their descendants.

Besides simply estimating migration rates, it is very important to identify the biotic and/or abiotic factors that influence them. This can be done by first obtaining gene flow estimates and then searching for correlations between them and various environmental variables (*e.g.*, Giordano *et al.* 2007). Such an approach requires the use of summary statistics that do not take advantage of all the information contained in genetic data. An alternative approach is to implement the joint analysis of genetic and nongenetic data. Several current methods that combine both genetic and geographic data can be used to detect recent migrants (*e.g.*, GENELAND, Guillot *et al.* 2005; TESS, François *et al.* 2006) but they do not take into account other environmental factors.

In previous studies we presented methods that use genetic and environmental data to study colonization processes (Gaggiotti *et al.* 2002, 2004) and population genetic structure (Foll and Gaggiotti 2006). These approaches based on hierarchical Bayesian methods (*e.g.*, Gelman *et al.* 1995) estimate the probability that a given environmental factor influences the parameters of interest (*e.g.*, composition of colonizing groups or local population *F*_{ST}'s) because they explicitly model the relationship between them and the relevant ecological factors. In this article we present a new multilocus genotype method for inferring recent immigration rates and identifying the environmental factors that best explain observed gene flow patterns. We use a hierarchical Bayesian approach that introduces nongenetic data through the prior distribution of the migration rates. Following Wilson and Rannala's (2003) approach we implement the estimation of inbreeding coefficients to allow for departures from Hardy–Weinberg equilibrium within local populations. Finally, the method infers the population ancestry of individuals by assigning their alleles to populations from which they originated. We carry out a simulation study to identify the region of parameter space where the method is and is not able to provide accurate posterior estimates. We also illustrate our method with a real data example.

## DATA AND MODEL PARAMETERS

#### Inferring migration rates from genetic data:

The method is based on a population genetics model that differs from that used by Wilson and Rannala (2003). More specifically, instead of assuming that sampling takes place right after migration, we consider that this is done after reproduction and before migration. Let us consider a metapopulation of a diploid species with nonoverlapping generations that is subdivided into *I* demes that can exchange migrants. Let be the observed multilocus genotypes of *n* individuals scored at *L* marker loci, where denotes the genotype of individual *h* at locus *l*. We assume that *n _{i}* individuals were sampled from population

*i*and use the vector to identify the population

*S*where the individual

_{h}*h*was sampled from.

Population allele frequencies are given by a matrix composed of vectors that give the frequency of allele *a* at locus *l* for population *i*. Following Falush *et al.* (2003), we consider a model with correlated allele frequencies based on the approach introduced by Balding and Nichols (1995). Thus, we assume that before the last generation, the population was at migration–drift equilibrium so that allele frequencies in each population are determined by the global allele frequencies in the metapopulation as a whole, , and the degree of genetic differentiation between each local population and the overall metapopulation, , where θ_{i} = − 1. Finally, to allow departures from Hardy–Weinberg equilibrium, we introduce population-specific inbreeding coefficients , where *F _{i}* is the inbreeding coefficient for population

*i*. Thus, we consider two levels of inbreeding, one at the population level corresponding to

*F*

_{ST}and another one at the individual level, corresponding to

*F*

_{IS}.

Instead of focusing directly on individual migration rates, we consider the probability that genes in a deme originated in another one over the last generation. Thus, migration is described by a matrix , where *m _{ij}* is the probability that alleles in population

*i*came from population

*j*during the previous generation. The ancestral state of the individuals is described by a matrix , where is a two-element vector identifying the source demes (

*i*and

*j*) for the two alleles of individual

*h*. All possible ancestry states are considered: both alleles come from the deme where the individual was sampled, or both come from another deme, or they come from two different ones. Thus, migration rates for individuals are obtained as(1)where is the probability that individuals sampled from population

*i*belong to the ancestry class (

*j*,

*k*). Note that our approach estimates migration rates only over the last generation. Moreover, as opposed to Wilson and Rannala (2003) migration rates vary freely in the interval (0, 1) and do not have to be small.

The model parameters described above () are estimated from the genetic data using a Bayesian approach and Markov chain Monte Carlo (MCMC) techniques.

##### Likelihood:

The likelihood is the probability of the observed genotypes given model parameters and is constructed by defining the probability of observing the genotype of individual *h* at locus *l* in terms of the ancestry classes. We note these genotypes , where *X _{hlc}* is the allele observed at locus

*l*in chromosome

*c*= 1, 2 of individual

*h*. Thus, individual

*h*genotype likelihood at locus

*l*is given by(2)where(3)and(4)

The first case considered in Equation 2 corresponds to the scenario where both alleles originated in the same source population, in which case we need to take into account possible deviations from Hardy–Weinberg equilibrium (see Equation 3). The second case considers that the individual is the descendant of parents that come from two different source populations, in which case we need to take into account that there are two different ways of assigning the alleles to the parents.

If we assume that individuals were sampled at random and loci are unlinked, then the likelihood of the whole sample is obtained by multiplying across all loci and individuals,(5)This likelihood can be used as the basis for inference using a Bayesian approach.

#### Combining genetic and environmental data:

One can expect that migration patterns are influenced by environmental factors such as population densities, distances between local populations, etc. To identify which environmental factors have influenced gene flow we use Gaggiotti *et al.*'s (2004) approach. Let us suppose that our knowledge of the species under study leads us to think that *R* environmental factors may influence the migration process. We can then introduce their effect through the prior distribution of gene migration rates. More specifically, we focus on the ancestry of immigrant alleles by conditioning on not being a resident allele(6)and assume that the vector follows a Dirichlet distribution; *i.e*., , where are shape parameters for the Dirichlet distribution. Furthermore, we assume that each shape parameter ψ_{ij} follows a lognormal distribution; *i.e.*, for each pair of distinct populations *i* ≠ *j*(7)where the mean μ_{ij} is given by the generalized linear regression(8)where α_{r} denotes the effect of environmental factor *r* and α_{rs} denotes the effect of first-order interactions between factors *r* and *s*; these parameters are collected into a single vector . The sign and the magnitude of the α's tell us about the direction and the strength of the environmental factors. Finally, σ^{2} is the amount of variation that remains unexplained by the regression and is the observed value for factor *r*, which is hypothesized to influence migration between populations *i* and *j*. To reduce posterior correlation and to simplify prior elicitation and posterior interpretation process, explanatory factors are normalized before analysis so that they have zero mean and variance one.

By excluding different regression terms we can define different alternative models. We note, however, that as opposed to previous applications of this approach (*cf*. Gaggiotti *et al.* 2004; Foll and Gaggiotti 2006), the intercept α_{0} is included in all models because it takes into account the effect of factors that act at a geographic scale larger than that of the metapopulation under study (see discussion for more details).

##### Other priors:

We assume that there is no prior information on the shape of the other parameters and, therefore, adopt the vague priors that are given in the appendix. Note that in the particular case of the probability to observe nonmigrant genes (*i.e.*, *m _{qq}*), we adopt a uniform prior between 0 and 1 because, although some environmental factors may influence whether or not an individual decides to emigrate, our method is aimed at estimating immigration rates and, therefore, cannot take into account this possibility.

#### Posterior distribution:

The model is now expressed in terms of parameters and the corresponding posterior distribution is given by Bayes' rule:(9)The full model is represented by the directed acyclic graph (DAG) in Figure 1 .

The posterior distributions of parameters given in Equation 9 are estimated using MCMC methods that are described in the supplemental information.

#### Posterior model probabilities:

Besides estimating migration rates our method is aimed at identifying the environmental factors influencing gene flow. As we mentioned before, several alternative models can be obtained from the full regression (8) by canceling elements of the vector . Note that models that include first-order interactions between factors *r* and *s* are allowed only if both factors are included. Thus for model , the corresponding posterior distribution is given by(10)where is the parameter vector under model , is the corresponding regression vector, and denotes prior model probability. Posterior model probabilities are estimated using the reversible-jump (RJ)MCMC approach (Green 1995, detailed in supplemental information). Here we note only that one of the problems faced when estimating posterior model probabilities is that the prior for can have a large effect on the estimates. When very vague priors are used, more posterior weight is placed on the model with the fewest parameters. This is the well-known Jeffreys–Lindley paradox (Robert 1994). This problem was avoided by first running an MCMC of the full model with vague priors and then using the posterior estimates of the α's as informative priors for a new MCMC run.

## SIMULATION STUDY

We evaluated the sensitivity of our method by generating synthetic data under a particular scenario in which gene flow is influenced by two factors. We considered various levels of genetic differentiation and migration rates. We are interested in the ability of our algorithm to find the correct scenario and to provide accurate posterior estimates for migration rates (with corresponding fairly narrow highest posterior density intervals, HPDI).

#### Generation of synthetic data:

We simulated data following the inference model presented in Figure 1. We initially considered a scenario with *I* = 4 populations, each with local population sizes of *N _{i}* = 5000 individuals. The sample size per population was

*n*= 100 and we assumed that each sampled individual was scored for

_{i}*L*= 10 polymorphic loci, each with

*K*= 10 alleles.

##### Generating migration rates from environmental factors:

We consider two environmental factors that could be, for example, geographic distance (*G*_{1}) and population density (*G*_{2}). The pairwise geographic distances were generated using a standard normal distribution. The pairwise differences in population density were generated by first filling the top triangular matrix with values drawn from a standard normal and then filling the bottom triangular matrix with the opposite values. This procedure is equivalent to standardizing the observed pairwise differences in environmental factors before analysis.

To generate the migration matrix, we first chose the values for the diagonal elements (proportion of nonmigrant genes) and then we calculated the values for the nondiagonal elements (immigration rates) using the following procedure. We set α_{1} = −0.9, α_{2} = 1.1, and α_{12} = 0 (*i.e.*, no interaction effect) and calculated μ_{ij}'s using Equation 8. Assuming no deviation from the linear regression (*i.e.*, σ^{2} = 0), we set . Finally we computed the means of the Dirichlet distribution used for the migration rates (Equation A2) and rescaled them so that they added up to 1 − *m _{ii}*.

##### Genetic data:

To generate multilocus genotypes with a given level of genetic differentiation, we need to generate parametric allele frequencies. This task was performed using Balding and Nichols' (1997) sampling formula for *F*_{ST}. According to this formula, genes are sampled one by one during an iterative process. The probability that the next gene sampled is *a* after having sampled *n* genes of which *n _{a}* correspond to allele

*a*is given by(11)where

*p*is the global frequency of allele

_{a}*a*in the metapopulation.

To generate the parametric allele-frequency distribution of each locus in any given local population for a given *F*_{ST} value, we first sample an allele at random from the metapopulation allele-frequency distribution and then use Equation 11 to calculate the probability distribution for the type of the allele that will be sampled next. Using this distribution we obtain the next allele. This process is repeated iteratively until we obtain the 2*N _{i}* alleles present in local population

*i*. We used uniform allele frequencies for the metapopulation.

Large departures from the target *F*_{ST} value were avoided by using the following iterative process. We generated the local population's allele-frequency distributions and calculated the global and pairwise *F*_{ST}'s. If one or more of these values were not within 10% of the target value we discarded the allele frequencies and generated new ones. If the new allele frequencies satisfied the requirement, we generated the genotypes; otherwise we continued this iterative procedure until the constraint was satisfied. This procedure was used to control for the effect of genetic differentiation on the performance of the method.

Using the gene migration rates calculated above, we obtained the proportion of migrant individuals in each population using Equation 1. Multilocus genotypes for each local population were generated assuming Hardy–Weinberg equilibrium. Genotypes of nonmigrant individuals were obtained by drawing two alleles from the local population allele-frequency distribution. For the migrant individuals with both parents coming from the same local population, the genotype was obtained by drawing two alleles at random from the parents' source population. For migrant individuals with parents coming from different source populations, we sampled one allele from the source population of each parent. Finally, samples were generated by drawing *n _{i}* individuals from each local population, keeping track of the ancestry of both alleles for each sampled individual.

#### Implementation details:

Each MCMC was run for 1,030,000 iterations. The first 20,000 iterations consist of short pilot runs used to tune up the proposal distributions to obtain acceptance rates between 25 and 45%. The next 10,000 iterations were discarded as burn-in and the remaining observations were sampled every 100 iterations, giving a sample size of 10,000 for each analysis.

To take into account model uncertainty, parameters are estimated using Bayesian model-averaging methods. The only exception to this rule are the regression parameters, which are model specific and, therefore, were estimated using the subset of values corresponding to the model with the highest posterior probability. Finally, posterior model probabilities are obtained by observing the number of times the chain visits each alternative model.

Posterior estimates are based on the sample mean except for the deviation from the regression σ^{2}, which usually has a highly asymmetric posterior distribution. In this latter case we used the posterior mode, which was estimated using kernel density estimation.

We investigated the effect of varying three parameters: the level of genetic differentiation, *F*_{ST} = 〈0.01, 0.05, 0.10, 0.25〉; the proportion of nonmigrant alleles, *m _{ii}* = 〈0.7, 0.9〉; and the number of populations,

*I*= 〈4, 6〉.

For each parameter set we generated 10 independent genetic data sets as described above. The results we present below are averages across these 10 replicates. As a measure of accuracy we also present the relative mean square errors (RMSE).

#### Results:

We investigated the performance of our method to provide reliable estimates under different scenarios of migration and genetic differentiation and number of populations studied. We consider first the effects on model determination, and then we address the influence on migration rate estimates and finally on individual assignments.

When the immigration rate is high (*m _{ii}* = 0.7; see Table 1), estimates of posterior model probabilities are strongly influenced by the degree of genetic differentiation (

*F*

_{ST}). When differentiation is low (

*F*

_{ST}= 0.01), the method fails to identify the model used to generate the synthetic data. However, the correct model is identified when

*F*

_{ST}> 0.01, and, moreover, its posterior model probability increases steadily with increasing genetic differentiation. The estimation of regression parameters is also influenced by the magnitude of

*F*

_{ST}but to a lesser degree. The RMSE decreases with increasing genetic differentiation but the bias is largely unaffected. Thus, it is the accuracy of the estimates (as illustrated by the HPDIs) that is influenced by

*F*

_{ST}. The proportion of the variance that remains unexplained by the model, σ

^{2}, decreases as genetic differentiation increases.

Decreasing the immigration rate (*m _{ii}* = 0.90) has a detrimental effect on estimates (Table 2). Although the true model is correctly identified for

*F*

_{ST}> 0.01, its posterior probability is lower than that observed when

*m*= 0.70. Estimates of regression parameters are more biased and less accurate (wider HPDIs), leading to higher RMSEs. Also, the proportion of the variance that remains unexplained, σ

_{ii}^{2}, is larger. Note, however, that as was the case before, the quality of all estimates improves with increasing genetic differentiation.

Increasing the number of populations studied (*I* = 6) improves model determination (Table 3). More precisely, the posterior probability of the true model is strongly increased and the proportion of variance that remains unexplained decreases sharply (see Table 3 and last columns of Tables 1 and 2). However, the effect on the quality of the regression parameter estimates is somewhat decreased since the bias and the RMSE increase. Nevertheless, the width of the HPDIs decreases, indicating that the precision increases.

Estimates of gene migration rates improve with increasing genetic differentiation (Table 4). The bias decreases sharply between *F*_{ST} = 0.01 and 0.1 and then remains very low. Note that the only cases where the HPDI does not include the true value correspond to the case with the weakest genetic differentiation (*F*_{ST} = 0.01). When the number of nonmigrant genes decreases we observe the same pattern but in this case the number of estimates for which the HPDIs do not include the true value is smaller and corresponds only to the estimates of nonmigrant proportions (Table 5).

In terms of posterior individual assignments, increasing genetic differentiation improves the quality of the estimation (see the bottom three rows of Tables 1 and 2). That is, the proportion of individuals that are misassigned decreases while the average posterior assignment probability increases. Decreasing the proportion of migrant genes also improves the quality of assignments; the proportion of misassignments decreases and the average posterior probabilities with which individuals are assigned increase. The effect of varying the number of populations is very small, being somewhat more distinguishable when the proportion of nonmigrants is larger (see bottom three rows of Tables 1–3⇑).

We also investigated what is the effect of using explicative variables that are different from the ones used to generate the synthetic data. The results (Table 6) show that the highest posterior probability is assigned to the null model, which indicates that the method does not wrongly identify as important factors that are not responsible for the observed migration pattern.

It is also important to investigate the effect of the amount of data used for the estimation, which can be characterized by the sample sizes and number of loci scored. The effect of decreasing the sample size from 100 to 50 individuals per population does not have much of an effect on posterior model probabilities while estimates of regression parameters have a slightly larger bias and wider HPDIs leading to somewhat larger RMSEs (compare last column of Table 1 with Table 7). Migration rate estimates show no increase in bias but their HPDIs are larger (compare Tables 8 and 9). Finally, the quality of the assignments is barely influenced by a decrease in the sample sizes (compare Tables 1 and 7). The effect of increasing the number of loci scored from 10 to 20 does not have an effect on model determination, estimates of regression parameters, and migration rates when the level of genetic differentiation is moderate (*F*_{ST} = 0.1) (results not shown). The only result that changes is the proportion of individuals that are misassigned, which decreases from 0.002 to 0. We also carried out analysis of a scenario with *F*_{ST} = 0.05 and in this case, increasing the number of loci from 10 to 20 decreased the width of the HPDIs for migration rate estimates and improved the accuracy of individual assignments.

## APPLICATION TO REAL DATA

We use the human genome diversity cell line panel–Centre d'Etude du Polymorphisme Humain (HGDP–CEPH) presented by Cann *et al.* (2002) to illustrate how our method can be used to make inferences about the factors influencing migration patterns. In our example we selected a subset of eight populations, all from Pakistan (see Figure 2), corresponding to 200 individuals (25 per population). We grouped together the Balochi and Brahui samples because the STRUCTURE analyses carried out by Rosenberg *et al.* (2002) place them in the same genetic cluster (see their Figure 2). Also, instead of using all 377 loci we did a first screening using an improved version of Beaumont and Balding's (2004) method to identify outlier loci that could be influenced by selection. On the basis of this screening we selected a total of 247 loci that were used in the analysis.

The effect of distance is supposed to be one of the main factors in determining gene flow in many species, but other factors such as altitude can influence geographic isolation and, therefore, migration patterns. We use our method to evaluate the relative importance of these two factors. We obtained pairwise geographic distances from latitude and longitude coordinates and also calculated the difference in altitude between each focal population and all other populations. Cann *et al.* (2002) give the geographic coordinates of each population as sample intervals; thus we used the gravity center of the area for the calculation of geographic distances between populations. With two parameters we can define five alternative models, which are presented in Table 10.

As was the case for the simulation study we used short pilot runs to tune up the proposal distributions to achieve reasonable acceptance ratios. To ensure convergence we increased the burn-in to 10^{6} and the sample size to 20,000 and used a thinning interval of 50 iterations.

Some of the population-specific *F*_{ST} values are <0.01 (see Table 11), the level of genetic differentiation that our simulation study identified as problematic for the estimation of parameters. This example, therefore, provides us with an opportunity to illustrate the problems that may arise when our method (or any other MCMC-based method) is used in scenarios with weak genetic differentiation. In these situations, it is necessary to run many independent replicates and compare their results; in the present case we used 10 runs. In 6 of them, the most probable model included altitude only and in all cases there was a posterior probability of at least 50%. The second most probable model included both factors. However, in 4 other runs two other models, one including distance only and the other including both distance and altitude, gave similar high posterior probabilities while the model including altitude only was ranked third. Given these results, we followed Faubet *et al.* (2007) and chose the run with the lowest deviance for estimation purposes. The Bayesian deviance has been proposed as a measure of model fit by a number of authors (Faubet *et al.* 2007 and references therein) and in our specific case we considered the assignment component of the total deviance, . Table 10 presents these results. The model with the highest posterior probability includes altitude only and the second most probable model includes both altitude and distance. In this latter case, the regression coefficients for the effects of altitude and distance are both negative, indicating that, as expected, both factors decrease migration rates between populations. Note, however, that the former seems to have a stronger effect (*i.e.*, larger absolute value).

Table 12 presents the mode and HPDI of migration rates between populations. Although the sum of maximum posterior estimates does not necessarily add up to one, we used them as estimators because of the inherent asymmetry of migration rate posterior distributions. There are three populations that do not receive migrants (Burusho, Bu; Hazara, Ha; and Kalash, Ka) and they correspond to those located in high-altitude areas. Moreover, two of these populations (Bu and Ka) do not seem to send migrants either and a third one (Ha) seems to contribute very little to the gene pool of the Pathan. The population with the highest proportion of migrant genes is the Sindhi, which receives migrants mainly from Balochi/Brahui (Ba/Br). Three other populations have a somewhat lower proportion of migrant genes (Ba/Br; Makrani, Ma; and Pathan, Pa). In the case of Ba/Br, most of the genes come from Ma, and, conversely, most of the genes of Ma come from Ba/Br. Finally, the Pathan receive similar proportions of genes from Ba/Br, Ha, and Sindhi (Si). In general, there are frequent gene exchanges among southern populations while northern populations remain fairly isolated. The best explanation for this migration pattern is altitude differences with the most isolated populations being at high altitude and the least isolated ones at low altitude.

Finally, the mean and mode of inbreeding coefficient estimates are somewhat large when compared to *F*_{ST} estimates but this is not the case if we compare the lower bounds of the HPDIs (Table 11). Still, there are three local populations (Ba/Br, Ma, and Pa) for which the lower bound of *F*_{IS} HPDIs is >0.04 while that of *F*_{ST}'s is much lower. A potential explanation for this result could be that samples were taken from adult individuals and, therefore, the data set does not fit model assumptions concerning the moment at which sampling takes place. However, we do not have information concerning the age group involved in the sampling.

## DISCUSSION

We present a new method for the estimation of recent migration rates that also allows for making inferences about the factors that influence gene flow in subdivided populations. It focuses on the F_{1} descendants of migrant individuals and, therefore, estimates the probability that a given individual migrated during the previous generation. Our approach also estimates various other population-specific parameters such as local *F*_{ST}, inbreeding coefficients, and allele frequencies. The method requires data from codominant markers such as RFLPs, microsatellites, allozymes, and SNPs and environmental data specific to each local population. Note, however, that the modeling of dispersal barriers (mountains, roads, deforested areas, etc.) between pairs of populations can be introduced by considering landscape resistance measures usually used by landscape ecologists (see, *e.g.*, McRae 2006).

We generated synthetic data following the inference model described above to investigate the effect of varying levels of genetic differentiation, proportions of nonmigrant genes, and numbers of populations, loci, and individuals. The results of this simulation study indicate that the method can provide reliable estimates when global *F*_{ST} values are >1%, the number of loci is only 10, and sample sizes are of the order of 50 individuals per population. Additionally, the identification of the environmental factors influencing migration is easier when migration rates are high and the number of local populations considered increases. We did not investigate the effect of varying the degree of polymorphism (*i.e.*, the number of allelic classes) or the effect of unsampled populations. We expect that increasing polymorphism will increase accuracy while the effect of unobserved populations is more likely to decrease it depending on true migration rates between unsampled and sampled populations. Our simulation study could be extended to take into account these considerations. Additionally, it would be desirable to consider demographic scenarios that differ from the one assumed by the inference model to test the robustness of our method.

We applied our method to a previously published microsatellite human data set for which local *F*_{ST}'s are within the range of values that our simulation study identified as problematic for parameter estimation. As expected, we observed convergence problems for this application and followed the approach of Faubet *et al.* (2007) to minimize them (see previous section for a more detailed explanation). We found that altitude influences recent migration among Pakistani populations and that gene exchanges are more frequent in the south than in the north of Pakistan. Geographic distance seems to have little effect on migration, a result that can be explained by the limited geographic scale considered and the fact that even in poorly developed areas there are many means of transportation that facilitate movement of humans. On the other hand, altitude can represent an important barrier particularly in winter when populations at high altitude can remain isolated for long periods of time.

The estimation of migration rates has proved to be a very difficult task. Several methods exist for this purpose; some of them estimate long-term migration rates and are based on coalescent theory (*e.g.*, MIGRATE, Beerli and Felsenstein 2001) while others provide recent migration rate estimates and are based on multilocus genotype approaches (*e.g.*, BAYESASS, Wilson and Rannala 2003). All recent methods for estimating migration rates rely on MCMC approaches and require one to pay special attention to convergence issues (Faubet *et al.* 2007). This is particularly important when genetic differentiation among populations is weak. This caveat also applies to our method, and the human example we present illustrates how to deal with these problems.

Being a multilocus genotype approach, our method resembles in many respects BAYESASS. It is important to note, however, that this resemblance is only superficial because we do not assume the same sampling scheme and we allow for high migration rates. Indeed, as opposed to Wilson and Rannala (2003) we assume that sampling takes place after reproduction and before migration. This was done to avoid the low migration rate restriction underlying their method and to allow migration rates to vary between 0 and 1. More specifically, Wilson and Rannala's (2003) formulation provides estimates of migration rates restricted to the interval (0, ) and assumes that *m* is very small because to account for individuals with mixed ancestry (*i.e.*, individuals whose alleles come from two different populations) they need to consider individuals that arrived one generation before sampling takes place. Thus, they are forced to assume that at most half of an individual's alleles comes from another population. In our case, we do not have this restriction because after reproduction the alleles of a given individual can come from any population. Doing this, however, precludes us from distinguishing between first-generation and second-generation migrants. Nevertheless, we can consider cases where parents are migrants from two different populations while BAYESASS considers only a single migrant ancestor.

The information used by our estimation method is the gametic disequilibrium generated by migration, which increases as genetic differentiation among local populations increases. Indeed, limited migration is very effective in increasing differentiation of gamete types among the subpopulations by random genetic drift (Ohta 1982). The strength of this gametic disequilibrium can be measured through the genotype of migrant individuals (or descendants from recent migrants) or through the gamete haplotype frequencies. Clearly, the former corresponds to short-term migration while the latter corresponds to the effect of long-term migration. All this implies that if the long-term migration is very high, the signature left by recent migration events will be weak. In the case of our method, the simulation study indicates that reliable estimates can be obtained when the effective number of migrants is less than five (*i.e.*, *F*_{ST} ≥ 0.05). The gametic disequilibrium due to long-term migration can also lead to a deviation from the hypothesis of independence among loci used to derive the likelihood function. This is a problem shared by all the methods that estimate migration rates from multilocus genotype data. The potential biases that could be introduced due to this problem require a very detailed simulation study, using an individually based model that produces synthetic data that allow for the estimation of gametic disequilibrium.

Another improvement introduced in our method is the use of the F-model first proposed by Balding and Nichols (1995). This feature allows us to take into account the population admixture that may have taken place before the last generation of migration. Additionally, as pointed out by Falush *et al.* (2003), the implementation of this model permits identification of subtle population subdivisions and, therefore, improves the estimation of allele-frequency distributions when genetic differentiation is weak. This in turn improves the estimation of migration rates as shown by a pilot study comparing the performance of our method with and without the F-model (results not shown). All the improvements implemented by our method lead to good mixing properties of the MCMC and therefore minimize convergence problems. We stress, however, that users should always carefully check the convergence of the MCMC by running multiple analyses and comparing their results.

An important feature of our method is that besides simply estimating migration rates it also identifies the factors that influence them. We use the same approach as that first proposed by Gaggiotti *et al.* (2004), which consists of using a Dirichlet prior for the immigration rates and linking its shape parameters with the environmental data, using a generalized linear model. In the present case, however, we do not consider models without the constant factor (*i.e.*, the regression intercept). This was done because our experience with the application of this type of method (Gaggiotti *et al.* 2004; Foll and Gaggiotti 2006) indicates that models excluding this parameter almost always had null posterior probabilities. These results can be explained by the fact that the regression intercept captures the effects of factors that act at a larger geographic scale than that considered for the metapopulation under study. It also takes into account behavioral characteristics of the species under study that remain the same regardless of the environment. In fact, the regression intercept influences only the variance of immigration rates, which increases as α_{0} decreases. For example, we expect that the variance of the migration rate between two given populations will be larger for species that can disperse very long distances than for species with very poor dispersal abilities. In this case, then we expect to obtain estimates of the intercept that are smaller for the former.

In our approach we assumed that the probability of observing nonmigrant alleles in any given population is independent of environmental factors. The underlying rationale for this is that local environmental conditions will influence only emigration rates but do not have any effect on the immigration rates that are the focus of our estimation method. Ideally we would also like to estimate emigration rates. As Wilson and Rannala (2003) point out, this could be done if we know local population sizes or, alternatively, if we could develop a method that can make use of temporal samples. However, such approaches are likely to involve much more complex likelihood functions that will necessarily lead to a worsening of convergence problems that are typical of complex methods that use MCMC approaches.

The software that implements the method incorporates features that facilitate the interpretation of results. For example, it provides estimates of both means and modes, which allows the user to choose the best parameter estimator depending on the shape of the posterior distribution (which is also provided by the software). Indeed, when posterior distributions are asymmetric, posterior estimates based on the mode and on the mean are rather different and the former provides a better way of describing the results. Thus, users should always have a look at the shape of posterior distributions to choose appropriate estimators.

Bayesian methods such as the one we present here are powerful tools for the study of natural populations. Users, however, should keep in mind that their application requires some expertise on the computational methods underlying their implementation, particularly on MCMC approaches. These issues are discussed more in detail in Faubet *et al.* (2007) and also in the user manuals of several of the currently available methods. If these recommendations are followed, population biologists will be able to extract highly valuable information about the species under study.

## APPENDIX: PRIOR DISTRIBUTIONS FOR PARAMETERS

We take the following priors for each parameter discussed in the text.

#### Probability to observe nonmigrant genes:

We assume that nonmigrant proportions are not influenced by environmental factors and therefore use a uniform distribution:(A1)

#### Probability to observe migrant genes:

We use a Dirichlet prior for the rate of migrant genes contributed by local populations other than the focal one, ,(A2)where the 's are given by Equation 6.

#### Shape parameters for the Dirichlet prior:

As ψ_{ij}'s must be positive we use a log-normal distribution,(A3)where μ_{ij} is given by the regression (8).

#### Regression coefficients:

We use a normal distribution,(A4)where = 10.

#### Deviation from the regression:

We assume that σ^{2} follows an inverse-gamma distribution,(A5)where *a*_{τ}, *b*_{τ} = 1

#### Local population *F*_{ST}'s:

As θ_{i}'s must be positive we use a log-normal distribution,(A6)where ω = ξ = 1.

#### Metapopulation allele frequencies:

We use an uninformative Dirichlet prior,(A7)where *k _{l}* is the number of alleles observed at locus

*l*in the metapopulation and λ = 1.

#### Population allele frequencies:

We use a Dirichlet prior,(A8)

#### Population-specific inbreeding coefficients:

We use a uniform distribution,(A9)

#### Ancestry assignments:

Following Wilson and Rannala (2003), we use a multinomial prior,(A10)where *n _{ijk}* is the number of individuals sampled from population

*i*that belongs to ancestry class (

*j*,

*k*) and is given by Equation 1.

## Acknowledgments

Most of the computations presented in this article were performed on the cluster HealthPhy (CIMENT, Grenoble, France). We are grateful to Matthieu Foll for providing us the software to identify outlier loci in the human data set. We also thank Olivier Francois and two anonymous reviewers for their useful suggestions that helped to improve the manuscript. The software that implements the method is available for the three most popular operating systems at http://www-leca.ujf-grenoble.fr/logiciels.htm. This work was supported by the Fond National de la Science (grant ACI-Impbio-2004-42-ADGP). P.F. holds a Ph.D. studentship from the Ministère de la Recherche.

## Footnotes

Communicating editor: L. Excoffier

- Received September 29, 2007.
- Accepted December 20, 2007.

- Copyright © 2008 by the Genetics Society of America