Abstract
We introduce a Bayesian method for estimating hidden population substructure using multilocus molecular markers and geographical information provided by the sampling design. The joint posterior distribution of the substructure and allele frequencies of the respective populations is available in an analytical form when the number of populations is small, whereas an approximation based on a Markov chain Monte Carlo simulation approach can be obtained for a moderate or large number of populations. Using the joint posterior distribution, posteriors can also be derived for any evolutionary population parameters, such as the traditional fixation indices. A major advantage compared to most earlier methods is that the number of populations is treated here as an unknown parameter. What is traditionally considered as two genetically distinct populations, either recently founded or connected by considerable gene flow, is here considered as one panmictic population with a certain probability based on marker data and prior information. Analyses of previously published data on the Moroccan argan tree (Argania spinosa) and of simulated data sets suggest that our method is capable of estimating a population substructure, while not artificially enforcing a substructure when it does not exist. The software (BAPS) used for the computations is freely available from http://www.rni.helsinki.fi/~mjs.
ONE of the inevitable consequences of genetic drift is that gene frequencies diverge between populations of a common origin when migration and mutation rates are low. In evolutionary science, a lot of effort has therefore been devoted to the development and empirical application of statistical methods for estimation of the degree of population differentiation using molecular marker data. A majority of studies have used statistical measures derived from Wright’s F-statistics (Wright 1951, 1965), while only recently, more sophisticated methods have been proposed; see, e.g., Holsinger (1999), Edwards and Beerli (2000), Kitada et al. (2000), Pritchard et al. (2000), and Dawson and Belkhir (2001).
Natural animal and plant populations typically have a nested substructure with respect to their hierarchical spatial pattern, such as sites within riverbeds, riverbeds within a river, or rivers within a river basin (Weir 1996). When sampling individuals from such hierarchical systems, one often follows the substructure (at least implicitly) by collecting the data groupwise from individuals sharing some low level of hierarchy. The traditional analyses have quantified this kind of nested genetic variation by using various statistical measures and conditioning on the fixed preassigned structure. Recently, approaches of Pritchard et al. (2000) and Dawson and Belkhir (2001) used Bayesian model-based clustering to assign individuals one at a time to unknown populations. Their main focus was on the situation where the information contained in the sampling design is not available or not imposed, although Pritchard et al. (2000) briefly considered also the other case. Here we introduce an approach that is conditioned on the geographical sampling information available about the preassigned groups of individuals. The partition among the groups is treated here as the parameter of main interest, such that all group combinations are considered a priori equally likely. The molecular marker data are then used for assessing which substructures are empirically plausible. The actual analysis is performed using a systematic Bayesian approach, where a Markov chain Monte Carlo (MCMC) estimation is used whenever the number of possible partitions is too large to be handled with exact calculations.
The posterior distribution of the population substructure and population-specific parameters also enables the estimation and uncertainty assessment for any related quantities that might be of interest, such as the F-statistics familiar to most evolutionary biologists. Our method is applicable to several types of codominant markers [e.g., allozymes, single-nucleotide polymorphisms (SNPs), and microsatellites], on the basis of assumptions of Hardy-Weinberg equilibrium (HWE) and linkage equilibrium between loci within each observed population. We also discuss possible extensions of the methodology to higher-dimensional hierarchies and an alternative way of handling the situation where the HWE assumption seems empirically unjustified.
The proposed Bayesian model is described in the following section, whereas the computational details are given in the appendix. Investigation of genetic separation among populations is considered thereafter. To illustrate the methodology we use the Moroccan argan tree (Argania spinosa) data from Petit et al. (1998). Results of sensitivity studies using simulated data are also presented, and finally, some possibilities for further extensions of the method are discussed.
BAYESIAN MODELING OF ALLELE FREQUENCIES IN A GEOGRAPHICALLY STRUCTURED POPULATION
We consider a sampling design where individuals are gathered from NP distinct populations on the basis of available prior knowledge concerning their geographical separation. Assume that genotypes are observed at NL independent (unlinked) marker loci, where at each locus j there are NA(j) possible alleles to be distinguished. To be adequate sources of information about population substructure, these markers should be neutral and their mutation frequency should be reasonably low. Furthermore, the unlinked genetic markers are assumed to be in HWE within each observed population.
Since the true underlying population substructure is unknown, the number of populations with differing allele frequencies is treated here as a parameter νP, having the range of reasonable values [1, NP], where the upper bound is directly given by the sampling design. At locus j, the unobserved probability of observing allele Ajk (allele frequency) in population i is represented by pijk [i = 1,..., νP; j = 1,..., NL; k = 1,..., NA(j)]. To simplify the notation, θ is used as a generic symbol jointly for the allele frequencies (θi for population i), and similarly n represents jointly the observed marker allele counts nijk. Missing alleles are simply ignored among observations, since they do not contribute in the model under HWE assumption. Note here that pijk depends on νP, and consequently, nijk may be a sum of several allele counts calculated from the original populations. The partition of the original populations can be represented by a NP × NP population structure parameter matrix S, with elements defined as
In a multinomial setting, a common choice as a prior π(θ|νP, S) for the allele frequencies (see Rannala and Mountain 1997; Holsinger 1999; Pritchardet al. 2000; Anderson and Thompson 2002) is the Dirichlet(λ) distribution with hyperparameter vector λ, where each element λk represents the prior mass on the allele k (at some arbitrary locus). As a reference assumption we prefer an invariant noninformative prior with λijk = 1/NA(j), which can be interpreted to relatively contain as much information as a likelihood with a single observation. This particular prior was also suggested in Anderson and Thompson (2002) in a related context. It is further assumed that π(S|νP)π(νP) is a uniform distribution in the finite space of distinct values of (νP, S). A strategy enabling joint estimation of the parameters (θ, νP, S) in model (1) is described in the appendix, and the given noninformative priors are used in all subsequently reported analyses of real and simulated data.
MEASURING OF GENETIC SEPARATION AMONG POPULATIONS
A wide diversity of evolutionary measures of population differentiation is available in the genetic literature (see Weir 1996; Nagylaki 1998; Tomiuket al. 1998; Yang 1998; Excoffier 2001; Rousset 2001). Simple statistical point estimates of such parameters can be obtained, but this requires conditioning on a known population structure. Quantification of the uncertainty about the estimates is much more tedious and resampling methods (like bootstrap) are often applied in the estimation of confidence intervals. However, resampling methods may provide biased estimates when based on hierarchical data sets (Petit and Pons 1998).
Given the posterior of the allele frequencies and population structure (θ, νP, S), it is possible to derive the posterior distribution also for any function of these parameters, such as the familiar F-statistic FST (in examples we have used the formula given in Nei 1977). Our approach enables Bayesian model-averaged estimation of evolutionary measures, by accounting for the uncertainty related to the unknown population structure. For a general discussion of Bayesian model averaging, see Ball (2001) and Sillanpää and Corander (2002). To aid in interpretation of the genetic marker data with respect to separation among populations, we emphasize the importance of studying the visual appearances of the posterior distributions of all parameters of interest. Using the posterior distribution of structure parameters, it is possible to give a measure of the uncertainty concerning whether any particular population pair among the original NP populations can in fact be regarded as samples from a single population. However, our model cannot readily be used to empirically verify whether one has collected individuals that are originally from several different populations within a single geographical region. The uncertainty can be presented as an NP × NP matrix with the (mr)th element defined as the posterior probability
EXAMPLE ANALYSES
Real data: To illustrate the proposed methodology, we used the Moroccan argan tree (A. spinosa) data from Petit et al. (1998), which has previously also been analyzed in Holsinger (1999). Due to implementation, Holsinger’s analysis was based on preprocessing of multiallelic data to a biallelic form, and therefore, his results are comparable to ours only under the same restriction.
The original data consist of allele measurements at 12 isozyme loci (two to five alleles) for 12 different populations with 20-50 individuals in each. We use the same abbreviated notation for the population names as Petit et al. (1998). For NP = 12 there are 4,213,597 possible partitions of the populations, so that the exact analysis may not in this case be considered feasible for a routine analysis. Nevertheless, we performed the exact analysis and the differences in pairwise probabilities (2) appeared to be negligible when compared to the MCMC approximation (based on a Markov chain of length 105 after a discarded “burn-in” period of 104 iterations). In all investigations reported subsequently we have used the MCMC approach with the same chain length for the burn-in. Mixing properties of the chains were monitored visually using various tools (e.g., cumulative occupancy plot; see Uimari and Sillanpää 2001), and our algorithm seems to perform well in this respect. Note that successive realizations of allele frequencies are independent, and consequently, values for quantities depending on allele frequencies (such as FST) do not have any autocorrelation (see the appendix).
Posterior probabilities of different groupings of population samples for the Argania spinosa data
Simulated data: In addition to the example analysis with real data, we applied our approach also to data sets that were simulated from population models with or without substructure. This enables investigation of whether one has a sufficient probability of detecting differences among allele frequencies while still maintaining a low probability of imposing a structure artificially, when such does not exist. From a theoretical point of view it is clear that the given Bayesian model will a priori support the simplest partition with no separation of populations, since the conditional distribution of the marker frequencies has then the smallest possible number of parameters.
We simulated data sets from distributions with 10 different alleles, some of which were considerably rare. In the first setting alleles were generated for a single locus with frequencies [0.3, 0.3, 0.2, 0.1, 0.05, 0.015, 0.015, 0.01, 0.005, 0.005]. Samples of 10, 20, and 50 diploid individuals from this single population were then randomly assigned into five different populations. An analogous setting with the same allele frequencies was also used to generate observations from five independent loci simultaneously. In the second scheme alleles were generated from two populations with different allele frequencies, one having the frequencies in the previous example and one with frequencies [0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]. The same sample sizes and numbers of loci (one and five) as in the first scheme were used. All sampling configurations were replicated 10,000 times and the posterior distributions were analytically calculated for each replicate.
—Kernel-estimated posterior distribution of FST for the Argania spinosa data.
—Cumulative probabilities of different values of νP at each MCMC iteration cycle in an early phase (≤11,000 cycles) of the chain.
Results, real data: From the posterior of structure S based on the real data (see Table 1), samples from populations Mijji (MI), Sidi Ifni (SI), and Tensif (TE) are all considered to originate from a single population with probability 0.999. Furthermore, given the abbreviations Argana (AR), Tizint’est (TT), and Ademine (AD), population samples in pairs (AR, TT) and (AD, AR) are considered to have equal origins with probabilities 0.874 and 0.093, respectively. All the remaining combinations of populations are estimated to have corresponding probability equal to zero. For comparison the posterior mean of FST equals 0.273 (95% credible interval being [0.251, 0.296]). Figure 1 shows the posterior density of FST and Figure 2 illustrates the rapid convergence of the particular chain with respect to νP in a form of cumulative occupancy probabilities. The posterior estimate of FST is rather distinct from the value obtained in Holsinger (1999), and therefore, we repeated our analysis in this respect, using a biallelic transform of the original data following Holsinger (1999). The resulting estimate is 0.172 (with 95% credible interval being [0.148, 0.196]), and the comparable estimate and credible interval given in Holsinger (1999) are 0.192 and [0.177, 0.206], respectively. The slightly lower value of our estimate is expected, since we are accounting for the equality of certain populations.
To investigate sensitivity and the effects of individual loci, we reanalyzed the data using only a single locus at a time. In Table 2, only counts of loci for which the pairwise posterior probabilities P(θm =θr|n) for populations (m and r) that exceed 0.75 are shown. It can be seen that most populations have concordant allele frequencies at many loci; however, concordant loci vary among the populations.
The estimated posterior means of Kullback-Leibler divergences are used in a three-dimensional multidimensional scaling plot of the populations (Figure 3) to visualize their distinction from each other. The estimated distances among populations MI, SI, and TE are equal to zero, and therefore, the population labels are overlapping in the plot. The populations Beni-Snassen (BS) and Oued Grou (OG) seem to locate far from the other populations, which is in concordance with the results of Petit et al. (1998). When Figure 3 is compared to the geographical map given in Petit et al. (1998), one can conclude that some genetic distances coincide relatively closely with the geographical distances, whereas some pairs of genetically similar populations are very distant from each other.
Results, simulated data: For the simulated data sets lacking population substructure, results are summarized in Figure 4. Histograms in the figure show the empirical distribution (over replications) in different settings for the posterior probability of the event that any two populations are equal. The panels correspond to the case with one locus only; for data sets with five loci the posterior probability was equal to unity for all replicates. The analysis illustrates clearly that our method will support merging of populations if the data do not provide enough evidence against the similarity hypothesis. Results for the configurations where the underlying structure consists of two distinct populations are presented in Figure 5, analogously to the previous example. As expected, the empirical power to detect the correct underlying structure increases with the sample size.
Counts of loci of the Argania spinosa data for which pairwise posterior probabilities of populations being equal exceed 0.75
DISCUSSION
We have presented a Bayesian method for estimating hidden population substructure using multilocus molecular markers. Underlying model assumptions concerning HWE and linkage equilibrium within the populations imply that each individual contributes two independent alleles to the likelihood at each locus. To check the validity of these assumptions, one may use, for instance, the methods introduced in Ayres and Balding (1998, 2001), respectively. When the populations are in significant departure from HWE, the data are effectively assumed to contain too much information about the allele frequencies, and consequently, the level of uncertainty concerning the parameters will become underestimated. One potential remedy for this is to parameterize the model using genotype frequencies instead of allele frequencies. Such a model avoids HWE assumption and allows for any form of dependence between alleles. This would also enable the use of commonly available dominant markers such as randomly amplified polymorphic DNAs and amplified fragment length polymorphisms in our analysis. For another Bayesian approach to the analysis of dominant markers, see Holsinger et al. (2002). In the genotype model with codominant markers it is possible to take into account missing allelic data through data augmentation (Schafer 2000). However, the genotype model may become infeasible when the data are scarce or when the number of alleles at different loci is high.
—A multidimensional scaling plot of the estimated posterior means of Kullback-Leibler divergences among Argania spinosa populations (MI, SI, and TE have zero distances).
—No substructure. Shown are empirical distributions (based on 10,000 replicates) of the posterior probability of the event that any two of five simulated populations are equal on the basis of single-locus data. All underlying populations have equal allele frequencies. Sample size (n) is indicated at each top left corner.
—Two underlying populations. Shown are empirical distributions (based on 10,000 replicates) of the posterior probability of the event that the two simulated populations are equal on the basis of single-locus (left) and fiveloci (right) data. The underlying populations have different allele frequencies. Sample size (n) is indicated at each top left corner.
When the aim of modeling the marker data is investigation of neutral evolution, one should bear in mind the assumption of a relatively slow rate of mutation of the alleles. In this respect conclusions with respect to differentiation are most well suited for allozymes and SNPs on low-mutating genome regions. One should be more careful concerning inferences about genetic drift when using microsatellite alleles, since they fluctuate more randomly over generations.
We have here concentrated on utilization of the geographical information available in a two-level hierarchy, since it corresponds to commonly used sampling designs. Occasionally, sampling designs may enable the use of information even from higher-dimensional hierarchies (typically, at three levels). Such designs can be taken into account by defining the hyperparameters in the prior as random coefficients depending on some parameter indexing the nested population substructure (cf. Holsinger 1999). Although the exact form of the posterior may then not be available even for a small number of populations, the MCMC approach presented here can be modified to handle more general settings, by suitably changing the mechanism of generating proposals.
The general Bayesian approach applied here is very flexible, and it would be valuable to incorporate information from phenotypes, different mutation models, spatial distances, and demographic parameters in the future. In conclusion, we have shown that the Bayesian model is a powerful tool for inference about the genetic population structure. However, as the simulation results with an underlying population structure illustrate, one cannot expect to obtain conclusive evidence for separation among populations when the numbers of sampled individuals and loci are small, unless the observed allele frequencies are considerably different. This feature represents common sense in statistical inference and protects against exaggerated interpretations concerning differences caused by random fluctuations in allele frequencies over generations.
Our analysis shows the favorable feature of combining information from several loci into a single probability model, as opposed to the simple averaging used in a traditional FST analysis. One special advantage of the proposed MCMC sampling scheme is that tuning problems related to the choice of proposal and prior distributions seem to be minimized. This reflects the positive effect of analytically integrating out relative allele frequency parameters from the posterior expression of the structure. A major advantage of the approach as a whole compared to most earlier methods is that the number of populations is treated here as an unknown parameter. Hence, we can avoid the labeling problems of populations that occur with high levels of gene flow. In other words, what is considered as two genetically distinct populations, either recently founded or connected by considerable gene flow, would be considered as one panmictic population with a certain probability in our approach.
APPENDIX: ESTIMATION OF MODEL PARAMETERS
To enable Bayesian inference jointly about parameters (θ, νP, S) in general, the standard Metropolis-Hastings MCMC algorithm (e.g., Gilkset al. 1996) is utilized to obtain an approximation to the posterior distribution. However, for a sufficiently small number of collected populations NP, the limited size of the parameter space enables the posterior distribution of (νP, S) to be calculated by complete enumeration, such that the marginal likelihood of a particular partition value (νP, S) is divided by the sum of marginal likelihoods over all possible partitions. Conditional on this distribution, one can generate a suitable number of independent posterior realizations of θ explicitly (see below).
The number of distinct values of S (i.e., partitions of the finite set {1,..., NP}) equals the sum
In typical applications the value NP is small enough (say, at most 30-50) so that the time required for the convergence of the MCMC approach is presumably acceptable for practical purposes. In Dawson and Belkhir (2001) partitioning at the individual level (as opposed to the population level used here) using MCMC was still performing well for 200 entities. However, invery complicated problems with a really large number of populations it may be sensible to approximate only the mode of the posterior distribution. In such cases it is preferable to use a formulation in terms of a combinatorial optimization problem, such as those solved by simulated annealing (Aarts and Korst 1989).
The posterior distribution of the population structure is proportional to the analytically calculated integral (see Rannala and Mountain 1997) according to
The acceptance ratio for the Metropolis-Hastings step, where current populations given in (νP, S) are split or merged to form a proposal
Given the previously specified priors, the full conditional distribution of θ is a product of Dirichlet distributions, given by
Acknowledgments
The authors thank two anonymous referees whose suggestions and comments significantly improved the original manuscript. This work was supported by research grant nos. 52457 and 47201 from the Academy of Finland and by the Centre of Population Genetic Analyses, University of Oulu, Finland.
Footnotes
-
Communicating editor: J. B. Walsh
- Received September 6, 2002.
- Accepted October 4, 2002.
- Copyright © 2003 by the Genetics Society of America