Identifying the Environmental Factors That Determine the Genetic Structure of Populations
Matthieu Foll, Oscar Gaggiotti


The study of population genetic structure is a fundamental problem in population biology because it helps us obtain a deeper understanding of the evolutionary process. One of the issues most assiduously studied in this context is the assessment of the relative importance of environmental factors (geographic distance, language, temperature, altitude, etc.) on the genetic structure of populations. The most widely used method to address this question is the multivariate Mantel test, a nonparametric method that calculates a correlation coefficient between a dependent matrix of pairwise population genetic distances and one or more independent matrices of environmental differences. Here we present a hierarchical Bayesian method that estimates FST values for each local population and relates them to environmental factors using a generalized linear model. The method is demonstrated by applying it to two data sets, a data set for a population of the argan tree and a human data set comprising 51 populations distributed worldwide. We also carry out a simulation study to investigate the performance of the method and find that it can correctly identify the factors that play a role in the structuring of genetic diversity under a wide range of scenarios.

ONE of the fundamental problems in population genetics is the study of the nature of genetic differentiation that is found in real populations and, if possible, to identify the factors that are responsible for the observed spatial structuring of genetic diversity. A clear understanding of these issues is of fundamental importance for a wide range of applications that include, among others, the inference of population histories, biodiversity conservation, and the identification of disease genes and/or disease-resistant genes in humans and economically important species. There are many methods that estimate different measures of genetic differentiation among populations (Excoffier 2001; Rousset 2001) but there is a paucity of methods that allow us to identify factors that influence genetic structuring. The most commonly used method, the multivariate Mantel test (Smouse et al. 1986), is based on the calculation of genetic and environmental distance measures between every pair of populations.

Genetic structuring of neutral markers is a consequence of the amount of genetic drift to which each local population has been subjected, due to its local effective size and/or due to its overall degree of geographic/ecologic isolation. Thus, it seems appropriate to base the estimation on parameters and variables that are specific to each local population. However, the study of population genetic structuring is traditionally done using global measures such as FST or GST, which ignore differences in the strength of genetic drift across populations. Over a decade ago, Balding and Nichols (1995) proposed the use of population-specific FST's in the context of a migration–drift equilibrium model. They considered biallelic markers and modeled allele frequencies using a beta distribution with expectation p and variance p(1 − p)/(1 + θ) so that FST = 1/(1 + θ). Such a formulation enabled them to use a likelihood-based approach to estimate population-specific FST's. More recently, Nicholson et al. (2002) used a truncated normal distribution with mean p and variance cp(1 − p) instead of a beta because they were interested in a nonequilibrium fission model where subpopulations evolve in isolation after splitting from an ancestral population in which the allele frequency is p. FST and c are the same in the limit as c ↓ 0 but differ when p is close to 0 or 1 or c is large (Balding 2003). Marchini and Cardon (2002) compared the two formulations by fitting them to two human data sets and concluded that although Nicholson et al. parameterization fitted better the European data set, both formulations performed equally well with the global data set. Recently, both formulations were extended to consider multiallelic loci; Balding (2003) considered the migration–drift equilibrium model while Falush et al. (2003) considered the nonequilibrium fission model. Interestingly, both formulations lead to the same multinomial-Dirichlet distribution for the subpopulation allele frequencies, Math. The main difference between the two models resides in the interpretation given to FST. In the case of the migration–drift model, the FST's measure how divergent each local population is from the metapopulation as a whole, while in the case of the fission model they measure the degree of genetic differentiation between each descendant population and the ancestral population.

Using as a starting point the methodological advances described above, we have developed a novel hierarchical Bayesian method that represents a more informative complement to the Mantel test. The most important innovation of our method is that it introduces nongenetic (environmental) data via the prior distribution of the population-specific FST's, which are modeled using a generalized linear model. We can thus consider different alternative models, each including a different set of environmental variables, and estimate their posterior probabilities. Using these posterior probabilities we can make inferences about the factors that influence genetic structure. The method is easy to employ, intuitive, and of wide applicability. It can be used to test for the effect of specific biotic/abiotic factors and also to study population expansion processes. In this latter case, the geographic coordinates of the samples are used to define the prior distribution of FST's. We carry out a detailed simulation study that demonstrates that the method can correctly identify the biotic/abiotic factors that influence the genetic structure of populations and we present examples that illustrate how the method can be applied to study a wide range of scenarios.


As stated in the Introduction, the model we use is based on ideas put forward by Balding and Nichols (1995) and later extended by Nicholson et al. (2002), Balding (2003), and Falush et al. (2003). Balding and Nichols (1995) and Balding (2003) considered a migration–drift model while Nicholson et al. (2002) and Falush et al. (2003) considered a fission model. For the sake of simplicity we describe the details of our approach using the terminology of the fission model but it should be kept in mind that it also applies to island models. We consider a collection of J subpopulations that evolved in isolation after splitting from an ancestral population. The derived subpopulations may have been subject to different amounts of genetic drift and, therefore, their allele frequencies will show different degrees of differentiation from the ancestral allele frequency. The extent of differentiation between subpopulation j and the ancestral population is measured by FSTj and is the result of its demographic history.

We consider a set of I loci and let Ki be the number of alleles at the ith locus. Let Math denote the allele frequencies of the ancestral population at locus i, where pik is the frequency of the allele k at locus i Math. We use Math to denote the entire set of allele frequencies of the ancestral population and Math to denote the current allele frequencies at locus i for subpopulation j. Under these assumptions, the allele frequencies at locus i in subpopulation j follow a Dirichlet distribution with parameters Math,Math(1)where θj = 1/FSTj − 1. The parameters FSTj are very closely related to Wright's FST (Wright 1951) parameter and are interpreted as measures of the shared ancestry within each of the subpopulations (see Balding 2003 for a more detailed explanation).

Model parameters:

Our objective is to estimate the FST's by combining genetic and environmental data. For simplicity we focus on the derived parameters θj but the same results would be obtained by focusing directly on the FST's. We use a hierarchical Bayesian approach that introduces the genetic data through the likelihood function and the environmental data through the prior distribution of the θj's. We also estimate the allele frequencies of each subpopulation and that of the ancestral population since they are also unknowns. The data consist of allele counts obtained from samples of size nij (where the subscript i refers to locus and the subscript j to population). We use aijk to denote the number of alleles k observed at locus i in the sample from subpopulation j. Thus, Math. The full data set can be presented as a matrix Math, where Math is the allele count at locus i for subpopulation j. The observed allele frequencies, Math, can be considered as sampled from the true alleles frequencies Math and, therefore, can be described by the multinomial distribution (Holsinger 1999):Math(2)

Let us first construct the likelihood function for the allele count Math at locus i for subpopulation j. In principle, we could use as likelihood the multinomial distribution (Equation 2) and consider Equation 1 as a Bayesian prior. However, in our case, we can calculate exactly the marginal distribution of Math because the Dirichlet distribution is the conjugate prior of the multinomial. This allows us to eliminate the nuisance parameters Math that are not of immediate interest but are needed by the model. The marginal distribution is obtained by integrating out Math,Mathwhere, Math. The right-hand terms Math and Math are given by Equations 1 and 2, respectively. Thus, we obtain the multinomial-Dirichlet distribution:Math

This equation is basically the same as Equation 13 in Balding (2003). The only difference is that here we express it in more general terms by using subindexes to identify loci and populations. It is also equivalent to Equation 5 in Falush et al. (2003) and differs from the parameterizations of Balding and Nichols (1995) and Nicholson et al. (2002) in that they use, respectively, the beta or a truncated normal distribution instead of the Dirichlet because they are concerned with biallelic markers while we deal with multiallelic loci.

The likelihood is obtained by multiplying across all loci and populations:Math(3)

Since the allele frequencies in the ancestral population are unknown, we have to estimate them by introducing a noninformative Dirichlet prior, Math, into our Bayesian model.

Incorporation of nongenetic data:

We want to determine the influence of biotic/abiotic factors on the FST coefficients or equivalently on the θj's. For example, we often want to ascertain if geographic isolation (usually measured in terms of pairwise geographic distances but best described by a connectivity measure; see Moilanen and Nieminen 2002 for a review) and/or linguistic affinity have influenced the genetic structure of a species (Sokal et al. 1992; Cavalli-Sforza and Feldman 2003; Geffen et al. 2004; Hunley and Long 2005). Given that FST-values are bounded between 0 and 1, we could use the familiar logistic regression modelMath(4)where Math and Math are, respectively, the observed values of the first and the second environmental factors for population j and can be presented as elements of a matrix Math. Here we consider only two factors but the method can be easily extended to include many more factors. Although this is in principle a valid approach, in practice it would impose a very strong prior on the θj's. Thus, we chose instead to use a lognormal prior where the mean is a function of environmental variables (as in Gaggiotti et al. 2004),MathwithMath(5)where Math denotes a vector of model parameters to be estimated. The magnitude and sign of these parameters indicate the strength and direction of the effect of the corresponding biotic/abiotic factors. α3 corresponds to the interaction of the two factors. To improve computational efficiency and facilitate the posterior interpretation process, the factors Math should be normalized so as to have mean 0 and variance 1. The variance of the lognormal distribution, σ2, is a measure of model fit and has to be estimated too. To estimate this and all other parameters related to the effect of environmental factors, we use vague priors. More specifically, we take Math and Math.

With these likelihoods and priors, the full posterior distribution represented by the directed acyclic graph (DAG) in Figure 1 is given byMath(6)

Figure 1.—

The DAG of the model given in Equation 6. Square nodes denote known quantities (i.e., data) and circles represent parameters to be estimated. Lines between nodes represent direct stochastic relationships within the model. The variables within each node correspond to the different model parameters discussed in the text. A is the genetic data and G the environmental data used to explain the genetic differentiation. Formula is the matrix of ancestral allele frequencies, Formula is the vector of the genetic differentiation coefficient for each local population, Formula is the vector of regression coefficients that correspond to factors in G, and σ2 measures the deviation from the regression. The multinomial-Dirichlet distribution allows the exact calculation of the marginal likelihood to eliminate the nuisance parameters Formula.

Posterior model probabilities:

As well as parameter estimates, we are interested in determining the dependence (or otherwise) of the factors upon the genetic differentiation. Alternative models are obtained by adopting different forms for μj given in Equation 5. The simplest model might set μj = α0, corresponding to the assumption that neither factor has an effect upon the genetic differentiation. A suitable alternative might be Math, in which we assume that factor G(2) has some influence. Clearly these models can be expressed by setting the appropriate αl = 0 in Equation 4 with the obvious restriction that the interaction term can be included only if both G(1) and G(2) are included in the model. Thus with two environmental factors there are nine alternative models as shown in Table 1 for the human example (see below).

View this table:

Posterior model probabilities

To discriminate between the different models we let Math denote the vector of nonzero α-values under model M and then derive the following posterior distribution over both parameter and model space,Math(7)where p(M) denotes the prior model probability. In most cases, it makes sense to use a uniform prior where all models are equally likely, so this is the strategy we adopted for all the analyses that follow.


The estimation of model parameters is carried out using a combination of Markov chain Monte Carlo (MCMC) and reversible-jump MCMC (RJMCMC) (Green 1995) techniques that are described in appendix a. We evaluated the convergence of the method using the diagnostic tests implemented in the R BOA package (Bayesian Output Analysis Program version 1.1.5, Smith 2005). The tests indicated that a burn-in of 2000 iterations was enough to attain convergence and it has been implemented as part of the pilot-tuning process (see appendix a). We used a sample size of 10,000 and a thinning interval of 10 as suggested by an autocorrelation analysis. With these parameter values, the total length of the chain was 100,000 iterations. Posterior model probabilities are estimated from the number of times the algorithm visited each model. FST's are estimated using model-averaged posterior means and, therefore, take into account model uncertainty. The parameters of the linear model, Math and σ2, are specific to each model and, therefore, we report the estimates obtained for the model with the highest posterior probability. Additionally, instead of using the posterior means of σ2 we use its mode because its posterior density is highly asymmetric.


We evaluate the performance of the method using three different approaches to generate simulated data. The first approach uses the same statistical model assumed by our method (the inference model) and allows us to study the effect of the quality of the samples (number of loci, sample sizes) and different biological scenarios (number of populations, environmental factors considered, etc.) on the accuracy of the estimates obtained. The two other approaches allow us to explore the performance of the method under scenarios that deviate from the assumptions of the inference model. We used EasyPop (Balloux 2001) to generate data under a migration–drift–mutation model that incorporates the effect of isolation by distance and SPLATCHE (Currat et al. 2004) to generate data under a population fission where new populations are formed progressively from older populations.

Inference model:

The inference model used to derive our method applies to two different scenarios, an island model at migration–drift equilibrium and a simple population fission model. The same algorithm can be used to generate synthetic samples under these two scenarios and is described in appendix b. We chose a set of default values for the key parameters of the inference model (see Table 2) and first evaluated the accuracy of the method using the relative mean square error (RMSE),Mathwhere Math is its true value of the parameter being estimated and pk is its estimated value for replicate k. We analyzed 10 replicate data sets generated using the default values (Table 2). We considered samples of good quality (50 individuals and 20 loci) but within the range of those commonly seen in the literature. We also studied the effect of changing the value of one parameter at a time. We used data simulated under model 7 that include two factors and a constant term. The degree of uncertainty of the estimations is measured by the 95% highest probability density interval (HPDI, the smallest interval that contains 95% of the values).

View this table:

Default parameters


In general, all estimates have a small RMSE and narrow HPDI with the exception of σ2 (Table 3). Lower values of FST lead to slightly higher RMSEs. The poor estimation of σ2 is due to the uncertainty in the estimation of the α's, which cannot be separated from the deviation from the regression included in the model. As is shown below, the quality of the estimation for this parameter can be greatly improved by increasing the number of populations sampled.

View this table:

Accuracy of the estimates

Sample sizes and number of loci:

The quality of the estimates is good even with the lowest sample size and number of loci (Tables 4 and 5) and increases as they increase, especially for estimates of FST. The posterior probability of the true model (7) was always the highest, independently of sample size and number of loci but model determination did increase substantially as they increased. The second most probable model always had a probability at least three times lower than that of the true model. Note that it is easier to improve estimates (i.e., decrease HPDIs and increase posterior probability of the true model) by increasing the number of loci studied than by increasing sample sizes. In particular, increasing sample sizes beyond 50 does not improve estimates.

View this table:

The effect of sample sizes on quality of estimation

View this table:

The effect of the number of loci studied on the quality of the estimates

Number of populations:

Varying the number of populations changes the number of FST-values considered in the model; therefore, we evaluate only its effect on model determination and on estimates of regression parameters. With only 5 or 7 populations the method fails to identify the true model and provides low-quality estimates of regression parameters (Table 6). However, with ≥10 populations model determination greatly improves (the posterior probability for the true model is at least 0.80) and bias and uncertainty of all regression parameter estimates decrease pronouncedly. This is particularly the case for σ2-estimates, in which bias almost disappears with 20 populations. The results obtained depend largely on the values chosen for the coefficients of the regression (α0 = −2.4, α1 = −0.6, and α2 = 0.5). The effect of a factor on population genetic structure is proportional to the absolute value of the corresponding regression coefficient. Factors with a strong effect are more easily identified than factors with a small effect. Thus, with only five populations, the highest posterior probability corresponds to the model with the constant term only. With 7 populations, the method can also identify the effect of the first factor but not that of the second and, therefore, the posterior probability is highest for model 5. The effect of the second factor is identified only with ≥10 populations. Additionally, models that do not include the constant term are readily excluded because its effect is very strong and easily identified by the method. It is important to note that when the method fails to identify the true model, the results obtained give hints that they are not reliable. For instance, the posterior probability of models with either factor is nonnegligible (>15%) and the estimates of σ2 are very high (upper bound of the HPDI well over 1). Thus alert users of the method will not be misled to accept a false model; instead they will conclude that the results are inconclusive.

View this table:

The effect of the number of populations studied on the quality of the estimates

Model determination:

We simulated data under all models except equivalent ones (e.g., model 3 and model 5). The true model has always the highest probability. We note that this probability seems to increase with the number of parameters included in the true model. The probabilities of the true models are, respectively, 0.86, 0.82, 0.91, 0.70, 0.80, 0.93, and 1.00 for models 1, 2, 3, 6, 7, 8, and 9.

Overall the results of these simulations indicate that the method can accurately estimate the parameters and identify the true model if samples are of at least average quality.

Subdivided population model:

The first interpretation of our inference model (see above) considers a subdivided population at migration–drift equilibrium where the proportion of migrants can differ among local populations but where all migrant groups are drawn from the same migrant pool (see Balding 2003). Here we consider instead an isolation-by-distance scenario where the composition of the migrant group arriving at any local population depends on the distance between the focal population and each one of the other local populations. Synthetic data for this scenario can be generated using EASYPOP, a software that implements an individual-based model for the simulation of genetic data under different scenarios of population subdivision. We considered a scenario with 14 local populations, each consisting of 50 individuals (see Figure 2A). The probability of being a migrant was fixed to 0.05 and the migration rate between populations followed a negative exponential kernel with parameter r = 1/α. Thus, when Math the populations are totally isolated and as α increases, the effect of distance on migration rates decreases; here we take α = 0.75. We simulate 15 loci, each with six allelic states and with a mutation rate of 0.0001. Using these settings we generated 10 replicate data sets consistent with a scenario where population genetic structure is influenced only by the degree of geographic isolation of local populations, which is measured by their connectivity. This measure is typically used in metapopulation studies to describe the effect of the habitat matrix properties on the degree of isolation of local populations (e.g., Moilanen and Nieminen 2002). There are many connectivity measures to choose from, and some include geographic distance and habitat patch size and shape. Since we consider only the degree of geographic isolation we chose to use the connectivity of population j, Sj,Mathwhere dij is the distance between populations j and i, and β measures the effect of distance on migration probability.

Figure 2.—

(A) Map of the 14 populations used for the isolation-by-distance model simulated with EASYPOP. The connectivity of each population is described by the shaded scale. Populations with dark shading are isolated while ones with light shading are well connected to others. (B) Plot of ln[FST/(1 − FST)] against the connectivity for the isolation-by-distance simulated data. The equation of the regression line is ln[FST/(1 − FST)] = −1.12 − 0.50S, where S denotes the connectivity measure defined in the text.

The analyses of these data sets consider three alternative models, a null model that includes the constant regression term (model 1), another that includes only connectivity (model 2), and finally a model that includes both the constant term and the distance (model 3).

Our method correctly identifies model 3 as the most probable with a posterior probability of 0.90. The regression coefficients α0 and α1 are estimated, respectively, as −1.12 and −0.50. More isolated populations (e.g., 1, 2, or 8) have a higher FST than well-connected populations (e.g., 3, 4, or 5). The connectivity explains relatively well the genetic differentiation because the posterior mode of σ2 is 0.24 and there is little scattering of points around the regression line (Figure 2B). These results indicate that the method provides reliable results under an isolation-by-distance model applicable to a wide range of species.

Spatial population expansion model:

The second interpretation of our inference model considers a simple fission scenario where local populations evolve in isolation after splitting from an ancestral population. To evaluate the performance of the method when the true evolutionary model differs radically from this scenario we used SPLATCHE (Currat et al. 2004). More specifically, SPLATCHE simulates a population expansion from a single origin in a two-dimensional habitat (strict two-dimensional stepping-stone model) and generates genetic samples for geographic locations chosen by the user. We used the example of the human population expansion with origin in East Africa (see below) as a template for our simulated scenarios. We used a growth rate of 0.10, a carrying capacity of 100 for all demes, and a migration rate of 0.20. With these settings, the whole world is colonized after ∼4000 generations. We simulated 10 replicates of this scenario and for each of them we “collected” genetic data for 36 populations chosen uniformly on the map (Figure 3 shows the map of the world with the sampling locations). We calculated the distance from East Africa to each of the sampling locations using the shortest possible land-based route. The objective of this part of the study is to determine if our method can detect the genetic signal left by a recent spatial population expansion so we use distance from East Africa as a factor in our analysis. However, we also include a second factor, “insularity,” to illustrate the flexibility of our method. Clearly, genetic structure is influenced by physical barriers to migration such as water masses. Therefore, the second factor considered by the logistic regression model (4) takes values of either 0 for continental populations (e.g., those in mainland Europe, Africa, Asia, and America) or 1 for insular populations (those in England, Japan, Indonesia, New Zealand, Australia, and Greenland).

Figure 3.—

Map of the 36 populations (shaded circles) used in the simulation of the colonization of the world with SPLATCHE. The solid square indicates the origin of the expansion.

The results of this analysis clearly show that our method can accurately detect the effect of not only distance but also other physical barriers despite the fact that the synthetic data were generated using a demographic model that differs radically from the simple fission model. Figure 4 shows the plots of ln[FST/(1 − FST)] vs. distance and insularity. Data points correspond to mean values over the 10 replicated scenarios. The posterior probability of the model with the constant factor and both distance from East Africa and insularity is 0.97. Figure 4 shows the strong effect of distance on FST and it also indicates that insular populations (identified with triangles) would not fit very well the regression line without including the insularity factor. Here we estimate σ2 = 0.15 while in another analysis without the insularity factor we obtained σ2 = 0.25.

Figure 4.—

Plots of ln[FST/(1 − FST)] against the two factors, distance to East Africa and insularity, for the human simulated data. Triangles indicate insular populations and solid squares indicate African populations. The equation of the regression line is ln[FST/(1 − FST)] = −2.59 + 0.46d, where d denotes the distance to East Africa.


Argan tree:

One of the goals of conservation genetics is to identify populations for on-site conservation. This requires the characterization of the status of each local population in terms of its genetic diversity and its degree of differentiation with respect to all other populations and an understanding of the factors that are responsible for the observed genetic structuring. A good example of this type of study is presented by Petit et al. (1998), who analyzed the genetic structure of the argan tree of Morocco. Here we use their data to illustrate how our approach can be applied to study metapopulation scenarios with the objective of determining if the degree of isolation is influenced by geographic location and/or one other factor, in this case altitude. We note that this latter variable was not considered by Petit et al. (1998) and is included here only for the sake of illustration.


The argan tree data consist of samples from 12 populations that were analyzed for 12 polymorphic loci. Sample size varied between 20 and 50 individuals per population. Geographic isolation of each local population is modeled using two factors, connectivity and elevation. Connectivity was calculated using the geographic distance between populations (kindly provided by R. Petit) and the same connectivity measure, Sj, as in the simulation study (see above). The elevation data were obtained from Table 1 in Petit et al. (1998).


The highest probability model (P = 0.60) includes a constant and connectivity. This model dominates well all the others, and it seems clear that elevation does not influence genetic differentiation since models that include this factor have low posterior probability (between 0 and 0.07). The estimated σ2 is 0.27 (HPDI = [0.13; 0.82]), which indicates that distance explains well the genetic differentiation and that, although some other unknown environmental factor may also play a role, its effect is unlikely to be very strong. The estimate of α2 = −0.55, which is negative, indicating that genetic differentiation does indeed increase with increasing geographic isolation (see Figure 5).

Figure 5.—

Plots of ln[FST/(1 − FST)] against the connectivity for the argan tree data. The equation of the regression line is ln[FST/(1 − FST)] = −0.66 − 0.55S, where S denotes the connectivity measure defined in the text.


One of the most widely accepted theories on the origin of modern humans, the recent African origin (RAO) model, postulates that our species has evolved from a small East African population that had subsequently colonized the whole world. Numerous studies provide evidence supporting this model but less is known about the specific details of the demographic history of humans (see Excoffier 2002). For example, some studies have found that an important component of human diversity seems to have evolved outside Africa (Zhao et al. 2000; Yu et al. 2001), which has been interpreted as implying a migration back to Africa. Also, the strength of the bottlenecks that may have taken place in the colonization process and the magnitude of the subsequent population expansions remain controversial (see Excoffier 2002 for a review). Here we use the Human Genome Diversity Cell Line Panel–Centre d'Etude du Polymorphisme Humain (HGDP–CEPH) presented by Cann et al. (2002) to illustrate how our method can be used to make inferences about population expansion events such as those undergone by humans.


The data consist of 1056 individuals from 51 subpopulations, which were scored for 377 microsatellites. We carried out two analyses. The first one considers the effect of one factor, geographic distance from East Africa along colonization routes (as in Prugnolle et al. 2005), while the second one considers a model with two factors, longitude and latitude.


Estimated FST-values are presented in appendix c. If humans spread from East Africa and in the process underwent successive bottlenecks of small amplitude (as proposed by Prugnolle et al. 2005) we should expect a gradual increase in FST-values as the distance from the center of origin increases. Thus, we followed Prugnolle et al. (2005) and calculated the distance from East Africa along likely colonization routes for each one of the 51 populations and estimated the posterior probability of the three alternative models, with and without distance. The estimated posterior probability of the model with the constant factor and distance is 1. Figure 6 shows the strong effect of distance on θ but it also indicates that some of the African populations (identified by solid squares) do not fit very well the regression line. Indeed, repeating the estimation process without these populations improves the fit from σ2 = 0.27 to σ2 = 0.18. Several factors can explain the relative lack of fit of African populations (see discussion).

Figure 6.—

Plots of ln[FST/(1 − FST)] against the distance to East Africa for the human data. Solid squares indicate African populations. The equation of the regression line is ln[FST/(1 − FST)] = −3.01 + 0.56d, where d denotes the distance to East Africa.

Another way of uncovering the genetic signature left by the demographic history of humans is to carry out a second analysis that uses as factors latitude and longitude. The rationale for this is that under the RAO model there should be a clear effect of longitude on FST since the most important population movements have occurred along this axis (e.g., Piazza et al. 1981). We also expect an effect of latitude due to the combined effect of southward population movements (mainly in the Americas but also in Africa), climate, and smaller effective sizes of the southern populations included in the data set (the African and Amerindian populations are much smaller than the European and most Asian populations). Thus, latitude would take into account the increased effect of genetic drift due to all these factors.

To take into account the fact that the Americas were colonized from East Asia, we measured longitude on a scale between 0° and 360° from west to east and starting at the Greenwich meridian. With the two factors considered, there are nine alternative models but almost all the probability mass was allocated to model 7 (Table 1) that includes the constant term and both latitude and longitude. The only other model visited is model 9 with probability 0.02, which also includes the two factors. So it is clear that they both play a major role in explaining the observed genetic structure. In addition, the estimate of σ2 is 0.19 with a HPDI of [0.13; 0.29], yet another indication that these two factors explain well the observed genetic pattern. Note also that this estimate of σ2 is lower than that obtained above for the analysis that included the African populations and is almost identical to that obtained excluding these populations. Thus, latitude and longitude explain more of the variation than distance alone. The stronger effect of longitude is indicated by the larger absolute value of its regression coefficient (see Table 7) and the less pronounced scattering of points in the plot of ln[FST/(1 − FST)] against longitude than against latitude (Figure 7). The negative value of α2 indicates that colonization occurred mainly from west to east while the negative value of α1 indicates that the effect of genetic drift was stronger in the southern populations.

Figure 7.—

Plots of ln[FST/(1 − FST)] against the two factors: latitude and longitude for the human data.

View this table:

Human data regression


In this article we present a new Bayesian method to evaluate the effect that biotic and abiotic environmental factors (geographic distance, language, temperature, altitude, local population sizes, etc.) have on the genetic structure of populations. It estimates FST-values for each local population and relates them to environmental factors using a generalized linear model. The method requires genetic data from codominant markers (e.g., allozymes, microsatellites, or SNPs) and environmental data specific to each local population. The results of our simulation study indicate that the method can correctly identify the environmental factors that influence the structuring of genetic variability under a wide range of conditions. In particular, it can perform very well for moderate values of sample sizes (20) and number of loci (10) and it is very robust to deviations from the inference model assumed by our method. Thus, it should be applicable to a wide variety of species.

Our method provides many pieces of information that can help us better understand the demographic and evolutionary history of the species under study. Population-specific FST's allow us to establish how distinct a local population is in terms of its genetic composition; a high FST indicates that the allele frequency distribution of a local populations is fairly different from that of the metapopulation as a whole. The posterior model probabilities allow us to identify the biotic/abiotic factors controlling genetic structure and also explain why a given local population is or is not genetically distinct. The signs of the regression coefficient estimates indicate if the effect of a factor increases or decreases genetic differentiation and, when the most probable model includes more than one factor, their absolute value tells us which factor has a stronger effect.

The method can be used to study a wide range of problems, including human genetics, conservation biology, and species demographic history. In the field of human genetics it can be used to test for the importance of cultural factors as descriptors of the distribution of genetic variation in humans. This information is of fundamental importance for efforts to identify disease genes by association with marker loci. In the field of conservation biology, the identification of the factors that are important determinants of the genetic structure of natural populations can help design management strategies for best preserving genetic biodiversity. It can also help gather valuable information for designing and parameterizing simulation models aimed at predicting the effect of global climate change on natural populations. Its application to the study of population demographic history is well illustrated by the application of our method to the analysis of the HGDP–CEPH Human Genome Diversity Cell Line Panel (see below).

As opposed to the multivariate Mantel test, which uses a summary statistic (pairwise genetic distance), our method is model based and makes full use of the data (allele frequencies). This is a very important consideration when data sets are not very large, as is usually the case. Additionally, it avoids the problems faced by many frequentist approaches, most of which use randomization techniques to test for the significance of the relationships. In the particular case of the multivariate Mantel test, Raufaste and Rousset (2001) and Rousset (2002) have recently called into question the validity of the jackknife procedure when two or more environmental factors are included in the analysis. The limitations of the sequential testing procedures were discussed previously by Cheverud et al. (1989) and particularly by Legendre (2000), who described two additional randomization methods for carrying out the multivariate Mantel test. Legendre (2000) also provided recommendations on ways of deciding which randomization technique is more appropriate for the data at hand. Thus, the multivariate Mantel test can be used as a data exploration technique provided that all these considerations are taken into account. On the other hand, a more information-rich method, such as ours, is more suitable when the aim is to test specific hypotheses about the effect of environmental factors. It is also more appropriate for studying the influence of variables whose effects cannot be measured in terms of pairwise distance measures. For example, the effect of local population sizes on genetic structuring cannot be modeled using pairwise distances since the effect of genetic drift is determined by the Ne of each local population and not by their difference. Our approach can readily account for this effect if estimates of local population sizes are available. Also, as shown in our simulation study, it can be used to study the effect of physical barriers such as water masses by considering a binary factor to discriminate between, for example, continental and insular populations.

The generality of our method stems from the fact that the likelihood function we use arises as an approximation to two different demographic models. In the case of the fission model considered by Nicholson et al. (2002) and Falush et al. (2003), the population-specific FST's measure the degree of genetic differentiation between each descendant population and the ancestral population. In the case of the island model considered by Balding and Nichols (2005) and Balding (2003), the FST's measure how divergent each local population is from the metapopulation as a whole. In both cases, the estimation of individual FST's for each one of the descendant/local populations takes into account the fact that the strength of genetic drift differs among them, something that is overlooked by the traditional procedure of estimating a single FST-value for the whole set of demes.

We illustrate the application of our method using published data sets. The argan tree example approaches a migration–drift model and uses connectivity and altitude as factors that could explain the genetic structuring. There is no a priori reason to believe that altitude plays a role and our method confirms this expectation. Connectivity on the other hand has a strong effect. Although our example considers only abiotic factors, it is also possible to include biotic factors directly related to dispersal, for example, the abundance of a pollinator species.

The human example illustrates the application of our method to a scenario that approaches imperfectly that of a fission model. Our analysis of the effect of distance to East Africa on FST-values (Figure 6) is not entirely consistent with those obtained by Prugnolle et al. (2005), who considered its effect on heterozygosity. In our case, although the overall fit of the model is very good, three African populations (Biaka Pigmies, Mbuti Pygmies, and San) seem to deviate more from the FST-values predicted by distance. This distinct behavior of African populations has also been observed by Ramachandran et al. (2005), using regression methods that did not explicitly assume a specific evolutionary model (but see below). Several factors can explain this lack of fit. The first one to consider is a mismatch between the fission model assumed by the method and the real demographic history of humans. Indeed, the fission model assumes that all subpopulations evolved from an ancestral population while the RAO model considers a series of bottlenecks with new populations being formed progressively from the ones that left the ancestral population. To determine if this is the source of the problem, we used SPLATCHE (Currat et al. 2004) to generate synthetic data under the RAO model (see results). The simulations considered a population expansion from a single origin in East Africa in a two-dimensional habitat (strict two-dimensional stepping-stone model) and collected genetic data for 36 populations chosen uniformly on the map (cf. Figure 3). As we did in the analysis of the real data set, we use distance from East Africa as a factor but also added a second factor, insularity, to account for the barrier effect of water masses, which was expected to be very strong due to the way migration across water masses has to be modeled with SPLATCHE. The results (Figure 4) show that samples collected in Africa are no longer outliers, indicating that the mismatch between the inference model and the RAO model is unlikely to be responsible for the observed lack of fit.

Another potential explanation for the lack of fit of African populations is that they may have smaller effective sizes, a consequence of being rather small and isolated tribal groups. In their analysis, Prugnolle et al. (2005) used heterozygosity, a summary statistic that does not really take into account the information provided by very low-frequency alleles. Thus, it is less affected by genetic drift due to small population sizes and provides a better fit. The fit of the model might have been improved by including estimates of local population sizes but no such estimates were available. Finally, a third possible explanation is a potential ascertainment bias in the choice of the microsatellites of the database we used. Indeed, these markers were primarily ascertained on an European CEPH panel and may be biased toward loci that are more variable in European than in African populations (Ray et al. 2005). This would introduce a bias in the estimation of the ancestral allele frequency distribution, which will be closer to that in European populations. Prugnolle et al. (2005) used heterozygosity, which, as mentioned before, is less sensitive to the effect of low-frequency alleles and therefore is less influenced by ascertainment bias. We carried out simulations to determine to what extent ascertainment bias can influence the results given by our method. We considered the same scenario described above, i.e., 36 populations chosen uniformly on the world map, and used SPLATCHE to generate a data set with 100 loci. From these we generated an unbiased data set by selecting at random 20 loci and a biased data set by selecting 20 loci with the highest variability in a French sample. The effect of ascertainment bias can be studied using the averaged squared residuals (average squared difference between observed and fitted values) of the African samples, which in the case of the biased data set is almost twice as high as that of the unbiased one (0.073 vs. 0.042). This result suggests that ascertainment bias could explain at least part of the lack of fit of African samples. Another important effect of ascertainment bias is to skew the estimated allele frequencies of the ancestral population toward those of the population used as a reference for choosing the markers. Thus, the lowest FST for the biased data set value corresponds to that of the French sample, which in the unbiased data set ranked 16th. This can have a large effect on studies that attempt to narrow down the region of Africa that was the epicenter of the human population expansion. This problem was investigated by Ray et al. (2005). They carried out realistic simulations of the genetic diversity expected after an expansion of modern humans into the Old World from different possible areas and then compared the results to the same human data set we have analyzed. Their results point toward a North African origin for modern humans. To uncover the effect of ascertainment bias they generated 20,000 data sets under the assumption of an East African origin. They then selected 1000 simulations with the highest variability in the subset of European samples (a biased data set) and 1000 additional simulations selected at random (an unbiased data set). The results obtained with the biased samples indicated a North African origin while the unbiased samples correctly pointed toward an East African origin. To take into account the effect of ascertainment bias they used synthetic data sets that presented the same bias as the real human database. With this correction, the best fit between observed and simulated data was obtained for the scenario that considered an East African origin for modern humans. A better understanding of the demographic history of humans requires a more even geographic coverage of sampling efforts and less biased choice in the selection of the loci.

A statistical problem that arises when assessing the effect of geographic distance from East Africa on human genetic diversity/differentiation is that human populations have a shared history of drift and cannot be considered as independent. For example, Amerindian populations have all issued from the same bottleneck that took place at the time of colonization of the American continent and they all bear its genetic signal. This is akin to the problem that Felsenstein (1985) pointed out for studies that use the comparative method and is common to all the recent studies that have used a regression approach to evaluate the validity of the RAO model (e.g., Prugnolle et al. 2005; Ramachandran et al. 2005; Amos and Manica 2006). More precisely, human populations can be considered as independent for the purpose of statistical analysis only under a strict fission model, in which case the history of genetic drift in the different human populations will be independent. Otherwise, populations in the different continents or regions of the globe will all bear the signal of the bottleneck at the time of colonization of that particular continent/region. Thus, samples coming from the same geographical area can be considered, to at least some degree, as replicates and may inflate the statistical significance of the regression. Although we use a Bayesian approach, our method does not avoid this problem so we carried out simulations to try to evaluate to what extent this lack of independence can affect the results. The Amerindian populations are the most likely to influence the estimates of σ2, the variance of θ's, and the posterior model probabilities. Thus, we compared the results of three different scenarios that differ in the number of Amerindian samples. In the first scenario samples were collected from 36 populations chosen uniformly on the world map (the unbiased scenario described above); they comprised 4 populations from North America, 1 from Central America, and 4 others from South America. The second scenario considered 29 populations in total, with only 1 each in North, Central, and South America. Finally, we considered a third scenario with 44 sampled populations of which 4 were from North America, 1 was from Central America, and 12 were from South America. The results correspond to averages across 10 independent simulations of each scenario. They indicate that the number of Amerindian samples included has little effect on posterior model probabilities (0.97 for scenario 1, 0.93 for scenario 2, and 0.99 for scenario 3) and the estimates of σ2 (0.15, 0.18, and 0.13, respectively). Duplicated populations can also have an effect similar to that of ascertainment bias (i.e., skew ancestral allele frequencies toward those of European populations) if a large number of samples are from Europe and the Middle East. We explored this issue using simulations of a scenario where we added six European samples to the unbiased data set. In this case, the estimated posterior probability of model 7 is 1 and the estimate of σ2 is 0.17, indicating that the effect is rather weak. Additionally, the average FST of European populations for the data set with more European samples (FST = 0.059) is almost the same as that obtained for the unbiased one (FST = 0.061). Ideally one would like to eliminate all nonindependent samples as we did for the American continent but it is not clear how to identify the duplicated samples for large areas such as Asia, without knowing the history of the bottlenecks. In any case, the results of the simulations suggest that despite this problem our method seems capable of correctly identifying the effect of distance or other factors on genetic differentiation when the true model radically differs from the fission model.

The results of the application of our method to the study of human demographic history seem to suggest a rapid population expansion into the Middle East and Europe as indicated by the low FST-values obtained for all the populations in these regions. Indeed, under a rapid population expansion, the effect of genetic drift is greatly reduced, leading to low FST-values. This rapid expansion seems to have been followed by a slower expansion into East Asia, Australasia, and the Americas. There are clear longitudinal and latitudinal gradients in the degree of genetic differentiation of subpopulations (Figure 7), with values of FST increasing both toward the east and toward the south, but with a stronger effect of longitude. This is expected because most population movements occurred along a west–east axis. Equivalent results were obtained by the seminal work of Piazza et al. (1981) in a much earlier study that considered only aboriginal populations.

As opposed to the complexity of frequentist multivariate techniques such as principal component analysis or discriminant correspondence analysis, which provide results that are difficult to interpret, our Bayesian method provides results that are easy to interpret. It can be applied to the identification of the environmental factors that underlie observed spatial patterns of genetic diversity but it can also be used to study spatial population processes, such as range expansions, by simply introducing longitude and latitude as the explanatory variables. All these characteristics should make it a very valuable tool for addressing a wide range of problems.


The RJMCMC algorithm is used to estimate the posterior probability of the alternative models and their parameters. RJMCMC updates (Green 1995) provide a natural extension to the basic Metropolis–Hastings algorithm to allow moves that change the dimension of the state vector. The estimates of model probabilities are obtained by simply observing the number of times that the chain visits each distinct model. A very thorough sensitivity study has been made to study the convergence of the method and the accuracy of the estimates under different data set scenarios (sample sizes, number of loci and populations, etc.).

Updating p:

The frequencies are updated one locus at a time. In each case, the move is accepted with probability Math, whereMathThe elements of the vector Math are updated pairwise because they must sum to one. Thus, we select at random two different alleles m and n by choosing m uniformly from {1, … , Ki} and then choosing n uniformly from the remaining integers {1, … , Ki}\{m}. Math is the vector Math with Math and Math replaced by Math and Math. Note that for the elements to sum up to one, Math + Math = Math + Math. Thus, we first take Math = u, where Math and e is some increment value that can be pilot tuned to obtain reasonable acceptance rates. Finally, we set Math = Math + (MathMath).

Updating α:

The Math coefficients are updated one at a time. Mathis the vector Math with element αl replaced by α′l. The move is accepted with probability Math, whereMathWe take the proposal value α′l = αl + e, where Math. Since the proposal normal density is symmetric, Math and thus, both terms drop from the expression for B.

Updating θ:

The θj's are updated one at a time. Math is the vector Math with element θj replaced by θ′j. The move is accepted with probability Math, whereMathWe take the proposal value ln θ′j = ln θj + e, where Math. So we have Math and here the move is not symmetric because of the log transformation: Math.

Updating σ2:

We take the proposal value ln σ′2 = ln σ2 + e, where Math and the move is accepted with probability α(σ2, σ′2) = min(1, B), whereMathHere just like for Math we have q(σ′2, σ2)/q2, σ′2) = σ′22.

Between-model moves:

When moving between models, we add or delete one element to the Math-vector. We randomly select one of these parameters and if it is already in the model, we propose to delete it. If it is not in the current model, we propose to add it. We use the canonical jump function so that when moving to a space of higher dimension (model 5 to model 7, for example) we keep the current values and we propose the new value from a distribution q. The reverse move is achieved by just deleting the parameter no longer included. Suppose we propose to add α2 to the model and that we draw α2 from q. Then we accept this move with probability min(1, B), whereMathBecause we choose the new model uniformly, the ratio of prior model probability simplifies to 1. The Jacobian is one because of the canonical jump function used. To consider the reverse move, we simply accept the move deleting α2 with probability min(1, 1/B). All the other reversible-jump moves are similar. The problem with this proposal scheme is that we might propose to add α1 to a model without α3. To get around this problem, we simply reject any proposal that does not make sense (for example, α0, α1, α3). The proposal distribution q is a normal distribution with its mean and variance pilot tuned to improve convergence (see below).

Tuning proposal distribution:

Proposal distributions have to be adjusted to have acceptance rates between 0.25 and 0.45. Thus, we have to adjust e for the uniform proposal of allele frequencies and the variances σα, σθ, and σσ for the normal proposals of α, θ, and σ. If we propose values in a very wide interval, most moves will be rejected because we will often propose values in areas of low posterior probability. On the other hand, if we propose values very close to the current value, the move will be almost always accepted. These values are tuned on the basis of short pilot runs: we run 200 iterations, and for each parameter the proposal is adjusted to reduce or increase the acceptance rate. We make 10 such pilot runs before starting the sampling, which also play the role of a burn-in period. At the same time, we can choose the proposal distribution q for the reversible jump. Brooks et al. (2003) show that the best choice is to take Math to be the full conditional distribution of Math in model M′. Because we do not know this distribution, we use the pilot run to have an estimation of the mean mi and variance vi for all αl under the saturated model (which corresponds to model 9). Then we propose a new value for αl from Math that is generally close to the full conditional distribution.


The algorithm used in the simulations of the inference model is given below in pseudocode:

  • choose factors G(1) and G(1)

  • choose the influence of factors α and the deviation from the regression σ2

  • for each locus i in 1:I

    • choose the number Ki of possible allelic states at locus i

    • choose the alleles frequencies pi of the ancestral population at locus i

  • endfor

  • for each population j in 1:J

    • calculate μj = α0 + α1Gj(1) + α2Gj(2) + α3Gj(1)Gj(2)

    • draw θj from lognormal(μj, σ2)

    • for each locus i in 1:I

      • draw the alleles frequencies at locus i in population j

      • Math from Math

      • choose the total number nij of alleles at locus i in population j

      • draw the allele count aij at locus i in population j from

      • Math

    • endfor

  • endfor


PopulationFST95% HPDI
    Karitiana (Brazil)0.31[0.27, 0.35]
    Surui (Brazil)0.40[0.36, 0.45]
    Piapoco/others (Colombia)0.26[0.22, 0.30]
    Maya (Mexico)0.11[0.09, 0.12]
    Pima (Mexico)0.24[0.20, 0.27]
    Melanesian (Bougainville)0.16[0.14, 0.18]
    Papuan (New Guinea)0.14[0.12, 0.16]
East Asia0.07
    Cambodian (Cambodia)0.077[0.061, 0.094]
    Dai (China)0.072[0.055, 0.089]
    Daur (China)0.072[0.055, 0.089]
    Han (China)0.076[0.061, 0.092]
    Hezhen (China)0.091[0.071, 0.11]
    Lahu (China) 0.11 [0.092, 0.14]
    Miaozu (China)0.079[0.060, 0.097]
    Mongola (China)0.057[0.042, 0.071]
    Naxi (China)0.080[0.062, 0.099]
    Oroqen (China)0.076[0.059, 0.094]
    She (China) 0.10 [0.081, 0.13]
    Tu (China)0.061[0.046, 0.077]
    Tujia (China)0.086[0.067, 0.11]
    Uygur (China) 0.040 [0.028, 0.052]
    Xibo (China)0.056[0.041, 0.072]
    Yizu (China)0.069[0.053, 0.086]
    Japanese (Japan)0.069[0.059, 0.080]
    Balochi (Pakistan)0.034[0.027, 0.041]
    Brahui (Pakistan)0.032[0.026, 0.040]
    Burusho (Pakistan)0.027[0.021, 0.033]
    Hazara (Pakistan)0.023[0.018, 0.030]
    Kalash (Pakistan) 0.089 [0.075, 0.10]
    Makrani (Pakistan)0.023[0.018, 0.029]
    Pathan (Pakistan)0.026[0.020, 0.032]
    Sindhi (Pakistan)0.025[0.019, 0.031]
    Yakut (Pakistan)0.061[0.051, 0.073]
    Basque (France)0.040[0.032, 0.048]
    French (France various regions)0.034[0.027, 0.041]
    Bergamo (Italy)0.035[0.025, 0.045]
    Sardinian (Italy) 0.047 [0.038, 0.055]
    Tuscan (Italy)0.036[0.022, 0.050]
    Orcadian (Orkney Islands)0.037[0.027, 0.047]
    Russian (Russia)0.035[0.027, 0.042]
    Adygei (Russia Caucasus)0.030[0.022, 0.038]
    Druze (Israel Carmel)0.041[0.034, 0.047]
    Palestinian (Israel Central)0.026[0.022, 0.031]
    Bedouin (Israel Negev)0.030[0.025, 0.035]
    Mozabite (Algeria Mzab)0.033[0.027, 0.039]
Sub-Saharan Africa0.08
    Biaka Pygmies (Central Africa)0.083[0.073, 0.093]
    Mbuti Pygmies (Congo) 0.099 [0.085, 0.11]
    Bantu Speakers (Kenya)0.063[0.050, 0.074]
    San (Namibia) 0.12 [0.10, 0.15]
    Yoruba (Nigeria)0.069[0.059, 0.078]
    Mandenka (Senegal)0.058[0.049, 0.066]
  • Italics identify populations with a FST significatively different from others in the same geographic area.

  • Estimates of FST for the 51 human populations with 95% HPDIs


    We thank Peter Smouse, Robin Waples, and Ilkka Hanski for very useful comments on the manuscript and Remy Petit for the argan tree data. Daniel Falush and an anonymous reviewer provided detailed comments that helped to greatly improve the manuscript. This work was supported by the Fond National de la Science (grant ACI-IMPBio-2004-42-PGDA). M.F. holds a Ph.D. studentship from the Ministère de la Recherche. The software implementing the method is available at


    • Communicating editor: M. W. Feldman

    • Received April 14, 2006.
    • Accepted August 8, 2006.


    View Abstract