Bayesian Analysis of Genetic Differentiation Between Populations

Jukka Corander, Patrik Waldmann and Mikko J. Sillanpää
Genetics January 1, 2003 vol. 163 no. 1 367-374
Jukka Corander
* Rolf Nevanlinna Institute, FIN-00014, University of Helsinki, Helsinki, Finland
Patrik Waldmann
† Department of Biology, FIN-90014, University of Oulu, Oulu, Finland
Mikko J. Sillanpää
* Rolf Nevanlinna Institute, FIN-00014, University of Helsinki, Helsinki, Finland
  • For correspondence: mjs@rolf.helsinki.fi

Abstract

We introduce a Bayesian method for estimating hidden population substructure using multilocus molecular markers and geographical information provided by the sampling design. The joint posterior distribution of the substructure and allele frequencies of the respective populations is available in an analytical form when the number of populations is small, whereas an approximation based on a Markov chain Monte Carlo simulation approach can be obtained for a moderate or large number of populations. Using the joint posterior distribution, posteriors can also be derived for any evolutionary population parameters, such as the traditional fixation indices. A major advantage compared to most earlier methods is that the number of populations is treated here as an unknown parameter. What is traditionally considered as two genetically distinct populations, either recently founded or connected by considerable gene flow, is here considered as one panmictic population with a certain probability based on marker data and prior information. Analyses of previously published data on the Moroccan argan tree (Argania spinosa) and of simulated data sets suggest that our method is capable of estimating a population substructure, while not artificially enforcing a substructure when it does not exist. The software (BAPS) used for the computations is freely available from http://www.rni.helsinki.fi/~mjs.

One of the inevitable consequences of genetic drift is that gene frequencies diverge between populations of a common origin when migration and mutation rates are low. In evolutionary science, a lot of effort has therefore been devoted to the development and empirical application of statistical methods for estimation of the degree of population differentiation using molecular marker data. A majority of studies have used statistical measures derived from Wright’s F-statistics (Wright 1951, 1965), while only recently, more sophisticated methods have been proposed; see, e.g., Holsinger (1999), Edwards and Beerli (2000), Kitada et al. (2000), Pritchard et al. (2000), and Dawson and Belkhir (2001).

Natural animal and plant populations typically have a nested substructure with respect to their hierarchical spatial pattern, such as sites within riverbeds, riverbeds within a river, or rivers within a river basin (Weir 1996). When sampling individuals from such hierarchical systems, one often follows the substructure (at least implicitly) by collecting the data groupwise from individuals sharing some low level of hierarchy. The traditional analyses have quantified this kind of nested genetic variation by using various statistical measures and conditioning on the fixed preassigned structure. Recently, approaches of Pritchard et al. (2000) and Dawson and Belkhir (2001) used Bayesian model-based clustering to assign individuals one at a time to unknown populations. Their main focus was on the situation where the information contained in the sampling design is not available or not imposed, although Pritchard et al. (2000) briefly considered also the other case. Here we introduce an approach that is conditioned on the geographical sampling information available about the preassigned groups of individuals. The partition among the groups is treated here as the parameter of main interest, such that all group combinations are considered a priori equally likely. The molecular marker data are then used for assessing which substructures are empirically plausible. The actual analysis is performed using a systematic Bayesian approach, where a Markov chain Monte Carlo (MCMC) estimation is used whenever the number of possible partitions is too large to be handled with exact calculations.

The posterior distribution of the population substructure and population-specific parameters also enables the estimation and uncertainty assessment for any related quantities that might be of interest, such as the F-statistics familiar to most evolutionary biologists. Our method is applicable to several types of codominant markers [e.g., allozymes, single-nucleotide polymorphisms (SNPs), and microsatellites], on the basis of assumptions of Hardy-Weinberg equilibrium (HWE) and linkage equilibrium between loci within each observed population. We also discuss possible extensions of the methodology to higher-dimensional hierarchies and an alternative way of handling the situation where the HWE assumption seems empirically unjustified.

The proposed Bayesian model is described in the following section, whereas the computational details are given in the appendix. Investigation of genetic separation among populations is considered thereafter. To illustrate the methodology we use the Moroccan argan tree (Argania spinosa) data from Petit et al. (1998). Results of sensitivity studies using simulated data are also presented, and finally, some possibilities for further extensions of the method are discussed.

BAYESIAN MODELING OF ALLELE FREQUENCIES IN A GEOGRAPHICALLY STRUCTURED POPULATION

We consider a sampling design where individuals are gathered from NP distinct populations on the basis of available prior knowledge concerning their geographical separation. Assume that genotypes are observed at NL independent (unlinked) marker loci, where at each locus j there are NA(j) possible alleles to be distinguished. To be adequate sources of information about population substructure, these markers should be neutral and their mutation frequency should be reasonably low. Furthermore, the unlinked genetic markers are assumed to be in HWE within each observed population.

Since the true underlying population substructure is unknown, the number of populations with differing allele frequencies is treated here as a parameter νP, having the range of reasonable values [1, NP], where the upper bound is directly given by the sampling design. At locus j, the unobserved probability of observing allele Ajk (allele frequency) in population i is represented by pijk [i = 1,..., νP; j = 1,..., NL; k = 1,..., NA(j)]. To simplify the notation, θ is used as a generic symbol jointly for the allele frequencies (θi for population i), and similarly n represents jointly the observed marker allele counts nijk. Missing alleles are simply ignored among observations, since they do not contribute to the model under the HWE assumption. Note here that pijk depends on νP, and consequently, nijk may be a sum of several allele counts calculated from the original populations. The partition of the original populations can be represented by an NP × NP population structure parameter matrix S, with elements defined as

$$S_{mr}=\begin{cases}1, & \text{if } \theta_m=\theta_r,\\ 0, & \text{otherwise,}\end{cases}$$

where m and r take values in the range [1, NP]. The joint distribution of the observed marker allele counts and the model parameters is specified by

$$\pi(\theta,\nu_P,S,n)=\pi(n\mid\theta,\nu_P,S)\,\pi(\theta\mid\nu_P,S)\,\pi(S\mid\nu_P)\,\pi(\nu_P)\propto\prod_{i=1}^{\nu_P}\prod_{j=1}^{N_L}\prod_{k=1}^{N_{A(j)}}\left[p_{ijk}^{n_{ijk}}\,\pi(p_{ijk})\right]\pi(S\mid\nu_P)\,\pi(\nu_P), \tag{1}$$

where

$$\pi(n\mid\theta,\nu_P,S)\propto\prod_{i=1}^{\nu_P}\prod_{j=1}^{N_L}\prod_{k=1}^{N_{A(j)}}p_{ijk}^{n_{ijk}}$$

is the multinomial likelihood,

$$\pi(\theta\mid\nu_P,S)=\prod_{i=1}^{\nu_P}\prod_{j=1}^{N_L}\prod_{k=1}^{N_{A(j)}}\pi(p_{ijk})$$

is the prior density of θ, and π(S|νP)π(νP) is the joint prior of the structure parameters. When the allele frequencies of two populations are equal, their observed counts in n can be summed together in the likelihood. It is worth noting that under the assumptions of HWE and linkage equilibrium the above model arises naturally from the basic modeling principles of the Bayesian framework; see, e.g., Bernardo and Smith (1994).
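The relation between a partition, the matrix S, and the pooled allele counts can be illustrated with a short sketch. This is not code from BAPS; the function names, the label encoding of a partition, and the toy counts are assumptions made purely for illustration.

```python
def structure_matrix(labels):
    """Return the N_P x N_P 0/1 matrix S, with S[m][r] = 1 iff
    sampled populations m and r are assigned the same label."""
    n = len(labels)
    return [[1 if labels[m] == labels[r] else 0 for r in range(n)]
            for m in range(n)]


def pool_counts(counts, labels):
    """Sum allele counts over sampled populations that share a label.

    counts[m][j][k] is the count of allele k at locus j in sample m.
    Returns a dict mapping each label to its pooled counts[j][k].
    """
    pooled = {}
    for m, lab in enumerate(labels):
        if lab not in pooled:
            pooled[lab] = [list(locus) for locus in counts[m]]
        else:
            for j, locus in enumerate(counts[m]):
                for k, c in enumerate(locus):
                    pooled[lab][j][k] += c
    return pooled


# Hypothetical toy data: 3 sampled populations, 1 locus, 2 alleles.
counts = [[[4, 6]], [[5, 5]], [[9, 1]]]
labels = [0, 0, 1]          # samples 0 and 1 share allele frequencies
S = structure_matrix(labels)
pooled = pool_counts(counts, labels)
```

Here νP = 2 because only two distinct labels occur, and the counts of the first two samples enter the likelihood as a single summed vector.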

In a multinomial setting, a common choice as a prior π(θ|νP, S) for the allele frequencies (see Rannala and Mountain 1997; Holsinger 1999; Pritchard et al. 2000; Anderson and Thompson 2002) is the Dirichlet(λ) distribution with hyperparameter vector λ, where each element λk represents the prior mass on the allele k (at some arbitrary locus). As a reference assumption we prefer an invariant noninformative prior with λijk = 1/NA(j), which can be interpreted as containing, in relative terms, as much information as a likelihood with a single observation. This particular prior was also suggested in Anderson and Thompson (2002) in a related context. It is further assumed that π(S|νP)π(νP) is a uniform distribution in the finite space of distinct values of (νP, S). A strategy enabling joint estimation of the parameters (θ, νP, S) in model (1) is described in the appendix, and the given noninformative priors are used in all subsequently reported analyses of real and simulated data.

MEASURING OF GENETIC SEPARATION AMONG POPULATIONS

A wide diversity of evolutionary measures of population differentiation is available in the genetic literature (see Weir 1996; Nagylaki 1998; Tomiuk et al. 1998; Yang 1998; Excoffier 2001; Rousset 2001). Simple statistical point estimates of such parameters can be obtained, but this requires conditioning on a known population structure. Quantification of the uncertainty about the estimates is much more tedious and resampling methods (like bootstrap) are often applied in the estimation of confidence intervals. However, resampling methods may provide biased estimates when based on hierarchical data sets (Petit and Pons 1998).

Given the posterior of the allele frequencies and population structure (θ, νP, S), it is possible to derive the posterior distribution also for any function of these parameters, such as the familiar F-statistic FST (in examples we have used the formula given in Nei 1977). Our approach enables Bayesian model-averaged estimation of evolutionary measures, by accounting for the uncertainty related to the unknown population structure. For a general discussion of Bayesian model averaging, see Ball (2001) and Sillanpää and Corander (2002). To aid in interpretation of the genetic marker data with respect to separation among populations, we emphasize the importance of studying the visual appearances of the posterior distributions of all parameters of interest. Using the posterior distribution of structure parameters, it is possible to give a measure of the uncertainty concerning whether any particular population pair among the original NP populations can in fact be regarded as samples from a single population. However, our model cannot readily be used to empirically verify whether one has collected individuals that are originally from several different populations within a single geographical region. The uncertainty can be presented as an NP × NP matrix with the (m, r)th element defined as the posterior probability

$$P(\theta_m=\theta_r\mid n), \tag{2}$$

which can be calculated by summing the posterior probabilities of such partitions where the two populations (m and r) are merged together (see the appendix). However, when the amount of data increases, differences can be detected on a finer scale. Consequently, it may be that the posterior probability (2) approaches zero, although the allele distributions are rather close to each other in some metric.

In addition to the probability (2), one can technically measure the discrepancy between allele distributions of two populations over different loci by using the Kullback-Leibler divergence (Kullback and Leibler 1951; Kullback 1968; Anderson and Thompson 2002). However, as was pointed out to us by a referee, the evolutionary meaning of this quantity is not known and needs to be further investigated.
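As a rough illustration of the divergence mentioned above, the sketch below computes the Kullback-Leibler divergence between two sets of allele frequencies. The text does not state how loci are combined or whether the divergence is symmetrized; summing the directed divergence over loci is an assumption made here (it follows from treating loci as independent), and the function names are hypothetical.

```python
import math


def kl_locus(p, q):
    """Directed Kullback-Leibler divergence KL(p || q) at one locus.

    p and q are allele frequency vectors of equal length. Terms with
    p_k = 0 contribute zero; q_k must be positive wherever p_k > 0.
    """
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def kl_multilocus(P, Q):
    """Sum of per-locus divergences (assumes independent loci).

    P[j] and Q[j] are the allele frequency vectors at locus j.
    """
    return sum(kl_locus(p, q) for p, q in zip(P, Q))


# Hypothetical frequencies for two populations at two loci.
P = [[0.5, 0.5], [0.2, 0.3, 0.5]]
Q = [[0.25, 0.75], [0.2, 0.3, 0.5]]
d = kl_multilocus(P, Q)
```

Applying such a function to each posterior draw of θ and averaging would give a posterior mean divergence for a population pair, under the assumptions stated above.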

EXAMPLE ANALYSES

Real data: To illustrate the proposed methodology, we used the Moroccan argan tree (A. spinosa) data from Petit et al. (1998), which has previously also been analyzed in Holsinger (1999). Due to implementation, Holsinger’s analysis was based on preprocessing of multiallelic data to a biallelic form, and therefore, his results are comparable to ours only under the same restriction.

The original data consist of allele measurements at 12 isozyme loci (two to five alleles) for 12 different populations with 20-50 individuals in each. We use the same abbreviated notation for the population names as Petit et al. (1998). For NP = 12 there are 4,213,597 possible partitions of the populations, so that the exact analysis may not in this case be considered feasible for a routine analysis. Nevertheless, we performed the exact analysis and the differences in pairwise probabilities (2) appeared to be negligible when compared to the MCMC approximation (based on a Markov chain of length 10^5 after a discarded “burn-in” period of 10^4 iterations). In all investigations reported subsequently we have used the MCMC approach with the same chain length for the burn-in. Mixing properties of the chains were monitored visually using various tools (e.g., cumulative occupancy plot; see Uimari and Sillanpää 2001), and our algorithm seems to perform well in this respect. Note that successive realizations of allele frequencies are independent, and consequently, values for quantities depending on allele frequencies (such as FST) do not have any autocorrelation (see the appendix).
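A cumulative occupancy trace of the kind mentioned above can be computed from a stored chain of νP values. The following is only a sketch of one plausible reading (the running proportion of iterations spent at each value); the function name and the toy trace are illustrative assumptions, not the authors' monitoring code.

```python
def cumulative_occupancy(trace, values):
    """Running proportion of iterations spent at each value.

    trace  : sequence of sampled nu_P values, one per MCMC iteration
    values : the values to track (every element of trace must be listed)
    Returns a dict value -> list of running proportions, one per iteration.
    """
    counts = {v: 0 for v in values}
    out = {v: [] for v in values}
    for t, x in enumerate(trace, start=1):
        counts[x] += 1
        for v in values:
            out[v].append(counts[v] / t)
    return out


# Hypothetical short trace of nu_P values.
occ = cumulative_occupancy([1, 2, 2, 2], values=[1, 2])
```

Plotting each list against the iteration index gives curves that flatten out once the chain has settled, which is how such a plot is typically read.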

TABLE 1

Posterior probabilities of different groupings of population samples for the Argania spinosa data

Simulated data: In addition to the example analysis with real data, we applied our approach also to data sets that were simulated from population models with or without substructure. This enables investigation of whether one has a sufficient probability of detecting differences among allele frequencies while still maintaining a low probability of imposing a structure artificially, when such does not exist. From a theoretical point of view it is clear that the given Bayesian model will a priori support the simplest partition with no separation of populations, since the conditional distribution of the marker frequencies has then the smallest possible number of parameters.

We simulated data sets from distributions with 10 different alleles, some of which were considerably rare. In the first setting alleles were generated for a single locus with frequencies [0.3, 0.3, 0.2, 0.1, 0.05, 0.015, 0.015, 0.01, 0.005, 0.005]. Samples of 10, 20, and 50 diploid individuals from this single population were then randomly assigned into five different populations. An analogous setting with the same allele frequencies was also used to generate observations from five independent loci simultaneously. In the second scheme alleles were generated from two populations with different allele frequencies, one having the frequencies in the previous example and one with frequencies [0.15, 0.15, 0.15, 0.15, 0.1, 0.1, 0.05, 0.05, 0.05, 0.05]. The same sample sizes and numbers of loci (one and five) as in the first scheme were used. All sampling configurations were replicated 10,000 times and the posterior distributions were analytically calculated for each replicate.
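The first simulation scheme can be sketched as follows. The text leaves some details open, so this sketch assumes that each of the five groups receives the stated number of diploid individuals, that every individual contributes two independent allele draws per locus (HWE), and that all loci share the same frequencies; the function names and the seed are hypothetical.

```python
import random

# Allele frequencies of the single underlying population (from the text).
FREQS_A = [0.3, 0.3, 0.2, 0.1, 0.05, 0.015, 0.015, 0.01, 0.005, 0.005]


def simulate_counts(freqs, n_individuals, n_loci, rng):
    """Allele counts[locus][allele] for n diploid individuals under HWE."""
    counts = []
    for _ in range(n_loci):
        draws = rng.choices(range(len(freqs)), weights=freqs,
                            k=2 * n_individuals)
        locus = [0] * len(freqs)
        for allele in draws:
            locus[allele] += 1
        counts.append(locus)
    return counts


def simulate_groups(freqs, n_individuals, n_loci, n_groups, rng):
    """Independent samples for n_groups groups sharing one frequency vector."""
    return [simulate_counts(freqs, n_individuals, n_loci, rng)
            for _ in range(n_groups)]


rng = random.Random(1)  # arbitrary seed for reproducibility
groups = simulate_groups(FREQS_A, n_individuals=10, n_loci=5,
                         n_groups=5, rng=rng)
```

The second scheme would call `simulate_counts` with the second frequency vector for the groups belonging to the other population.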

Figure 1. Kernel-estimated posterior distribution of FST for the Argania spinosa data.

Figure 2. Cumulative probabilities of different values of νP at each MCMC iteration cycle in an early phase (≤11,000 cycles) of the chain.

Results, real data: From the posterior of structure S based on the real data (see Table 1), samples from populations Mijji (MI), Sidi Ifni (SI), and Tensif (TE) are all considered to originate from a single population with probability 0.999. Furthermore, given the abbreviations Argana (AR), Tizint’est (TT), and Ademine (AD), population samples in pairs (AR, TT) and (AD, AR) are considered to have equal origins with probabilities 0.874 and 0.093, respectively. All the remaining combinations of populations are estimated to have corresponding probability equal to zero. For comparison the posterior mean of FST equals 0.273 (95% credible interval being [0.251, 0.296]). Figure 1 shows the posterior density of FST and Figure 2 illustrates the rapid convergence of the particular chain with respect to νP in a form of cumulative occupancy probabilities. The posterior estimate of FST is rather distinct from the value obtained in Holsinger (1999), and therefore, we repeated our analysis in this respect, using a biallelic transform of the original data following Holsinger (1999). The resulting estimate is 0.172 (with 95% credible interval being [0.148, 0.196]), and the comparable estimate and credible interval given in Holsinger (1999) are 0.192 and [0.177, 0.206], respectively. The slightly lower value of our estimate is expected, since we are accounting for the equality of certain populations.

To investigate sensitivity and the effects of individual loci, we reanalyzed the data using only a single locus at a time. Table 2 shows, for each pair of populations (m and r), the count of loci for which the pairwise posterior probability P(θm = θr | n) exceeds 0.75. It can be seen that most populations have concordant allele frequencies at many loci; however, concordant loci vary among the populations.

The estimated posterior means of Kullback-Leibler divergences are used in a three-dimensional multidimensional scaling plot of the populations (Figure 3) to visualize their distinction from each other. The estimated distances among populations MI, SI, and TE are equal to zero, and therefore, the population labels are overlapping in the plot. The populations Beni-Snassen (BS) and Oued Grou (OG) seem to locate far from the other populations, which is in concordance with the results of Petit et al. (1998). When Figure 3 is compared to the geographical map given in Petit et al. (1998), one can conclude that some genetic distances coincide relatively closely with the geographical distances, whereas some pairs of genetically similar populations are very distant from each other.

Results, simulated data: For the simulated data sets lacking population substructure, results are summarized in Figure 4. Histograms in the figure show the empirical distribution (over replications) in different settings for the posterior probability of the event that any two populations are equal. The panels correspond to the case with one locus only; for data sets with five loci the posterior probability was equal to unity for all replicates. The analysis illustrates clearly that our method will support merging of populations if the data do not provide enough evidence against the similarity hypothesis. Results for the configurations where the underlying structure consists of two distinct populations are presented in Figure 5, analogously to the previous example. As expected, the empirical power to detect the correct underlying structure increases with the sample size.

TABLE 2

Counts of loci of the Argania spinosa data for which pairwise posterior probabilities of populations being equal exceed 0.75

DISCUSSION

We have presented a Bayesian method for estimating hidden population substructure using multilocus molecular markers. Underlying model assumptions concerning HWE and linkage equilibrium within the populations imply that each individual contributes two independent alleles to the likelihood at each locus. To check the validity of these assumptions, one may use, for instance, the methods introduced in Ayres and Balding (1998, 2001), respectively. When the populations are in significant departure from HWE, the data are effectively assumed to contain too much information about the allele frequencies, and consequently, the level of uncertainty concerning the parameters will become underestimated. One potential remedy for this is to parameterize the model using genotype frequencies instead of allele frequencies. Such a model avoids the HWE assumption and allows for any form of dependence between alleles. This would also enable the use of commonly available dominant markers such as randomly amplified polymorphic DNAs and amplified fragment length polymorphisms in our analysis. For another Bayesian approach to the analysis of dominant markers, see Holsinger et al. (2002). In the genotype model with codominant markers it is possible to take into account missing allelic data through data augmentation (Schafer 2000). However, the genotype model may become infeasible when the data are scarce or when the number of alleles at different loci is high.

Figure 3. A multidimensional scaling plot of the estimated posterior means of Kullback-Leibler divergences among Argania spinosa populations (MI, SI, and TE have zero distances).

Figure 4. No substructure. Shown are empirical distributions (based on 10,000 replicates) of the posterior probability of the event that any two of five simulated populations are equal on the basis of single-locus data. All underlying populations have equal allele frequencies. Sample size (n) is indicated at each top left corner.

Figure 5. Two underlying populations. Shown are empirical distributions (based on 10,000 replicates) of the posterior probability of the event that the two simulated populations are equal on the basis of single-locus (left) and five-loci (right) data. The underlying populations have different allele frequencies. Sample size (n) is indicated at each top left corner.

When the aim of modeling the marker data is investigation of neutral evolution, one should bear in mind the assumption of a relatively slow rate of mutation of the alleles. In this respect, conclusions about differentiation are best suited to allozymes and to SNPs in genome regions with low mutation rates. One should be more careful concerning inferences about genetic drift when using microsatellite alleles, since they fluctuate more randomly over generations.

We have here concentrated on utilization of the geographical information available in a two-level hierarchy, since it corresponds to commonly used sampling designs. Occasionally, sampling designs may enable the use of information even from higher-dimensional hierarchies (typically, at three levels). Such designs can be taken into account by defining the hyperparameters in the prior as random coefficients depending on some parameter indexing the nested population substructure (cf. Holsinger 1999). Although the exact form of the posterior may then not be available even for a small number of populations, the MCMC approach presented here can be modified to handle more general settings, by suitably changing the mechanism of generating proposals.

The general Bayesian approach applied here is very flexible, and it would be valuable to incorporate information from phenotypes, different mutation models, spatial distances, and demographic parameters in the future. In conclusion, we have shown that the Bayesian model is a powerful tool for inference about the genetic population structure. However, as the simulation results with an underlying population structure illustrate, one cannot expect to obtain conclusive evidence for separation among populations when the numbers of sampled individuals and loci are small, unless the observed allele frequencies are considerably different. This feature represents common sense in statistical inference and protects against exaggerated interpretations concerning differences caused by random fluctuations in allele frequencies over generations.

Our analysis shows the favorable feature of combining information from several loci into a single probability model, as opposed to the simple averaging used in a traditional FST analysis. One special advantage of the proposed MCMC sampling scheme is that tuning problems related to the choice of proposal and prior distributions seem to be minimized. This reflects the positive effect of analytically integrating out relative allele frequency parameters from the posterior expression of the structure. A major advantage of the approach as a whole compared to most earlier methods is that the number of populations is treated here as an unknown parameter. Hence, we can avoid the labeling problems of populations that occur with high levels of gene flow. In other words, what is considered as two genetically distinct populations, either recently founded or connected by considerable gene flow, would be considered as one panmictic population with a certain probability in our approach.

APPENDIX: ESTIMATION OF MODEL PARAMETERS

To enable Bayesian inference jointly about parameters (θ, νP, S) in general, the standard Metropolis-Hastings MCMC algorithm (e.g., Gilks et al. 1996) is utilized to obtain an approximation to the posterior distribution. However, for a sufficiently small number of collected populations NP, the limited size of the parameter space enables the posterior distribution of (νP, S) to be calculated by complete enumeration, such that the marginal likelihood of a particular partition value (νP, S) is divided by the sum of marginal likelihoods over all possible partitions. Conditional on this distribution, one can generate a suitable number of independent posterior realizations of θ explicitly (see below).

The number of distinct values of S (i.e., partitions of the finite set {1,..., NP}) equals the sum $\sum_{\nu_P=1}^{N_P}\sigma_{\nu_P}^{N_P}$, where $\sigma_{\nu_P}^{N_P}$ is the Stirling number of the second kind (see, e.g., Abramowitz and Stegun 1969). In routine applications, the number of distinct partitions is small enough to allow for enumerative calculations for at least NP ≤ 10 (where $\sum_{\nu_P=1}^{N_P}\sigma_{\nu_P}^{N_P}\le 115{,}975$). For larger NP we use an MCMC algorithm to generate samples from the posterior distribution of (θ, νP, S) with two distinct move types: (1) randomly split a population into two distinct parts or merge two different populations into a single one and (2) update allele frequencies θ with Gibbs sampling conditional on the data and the current value of parameters (νP, S). Since the model is specified at the population level, split and merge moves are restrictively proposed only within the range of populations that were present in the original sampling configuration.
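The partition counts quoted in the text can be checked with the standard recurrence for Stirling numbers of the second kind, S(n, k) = k S(n-1, k) + S(n-1, k-1). The sketch below is illustrative; the function names are not taken from the original software.

```python
def stirling2_row(n):
    """Stirling numbers of the second kind S(n, k) for k = 0..n."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            prev_k = row[k] if k < len(row) else 0  # S(m-1, k), zero if k = m
            new[k] = k * prev_k + row[k - 1]
        row = new
    return row


def n_partitions(n):
    """Number of partitions of an n-element set (sum of S(n, k) over k)."""
    return sum(stirling2_row(n))


count_10 = n_partitions(10)  # upper bound quoted for N_P <= 10
count_12 = n_partitions(12)  # number of partitions for the argan tree data
```

The two totals correspond to the values 115,975 (NP = 10) and 4,213,597 (NP = 12) given in the text.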

In typical applications the value NP is small enough (say, at most 30-50) so that the time required for the convergence of the MCMC approach is presumably acceptable for practical purposes. In Dawson and Belkhir (2001) partitioning at the individual level (as opposed to the population level used here) using MCMC was still performing well for 200 entities. However, in very complicated problems with a really large number of populations it may be sensible to approximate only the mode of the posterior distribution. In such cases it is preferable to use a formulation in terms of a combinatorial optimization problem, such as those solved by simulated annealing (Aarts and Korst 1989).

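The posterior distribution of the population structure is proportional to the analytically calculated integral (see Rannala and Mountain 1997) according to

$$\pi(\nu_P,S\mid n)\propto\int\prod_{i=1}^{\nu_P}\prod_{j=1}^{N_L}\prod_{k=1}^{N_{A(j)}}p_{ijk}^{\,n_{ijk}+\lambda_{ijk}-1}\,d\theta=\prod_{i=1}^{\nu_P}\prod_{j=1}^{N_L}\frac{\Gamma\!\left(\sum_k\lambda_{ijk}\right)}{\Gamma\!\left(\sum_k(\lambda_{ijk}+n_{ijk})\right)}\prod_{k=1}^{N_{A(j)}}\frac{\Gamma(\lambda_{ijk}+n_{ijk})}{\Gamma(\lambda_{ijk})}, \tag{A1}$$

where Γ(·) is the gamma function. The exponent includes the −1 of the Dirichlet(λ) prior density, whose normalizing constants give the factors Γ(∑k λijk) and 1/Γ(λijk).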
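For a small number of sampled populations, (A1) and the complete enumeration described earlier in this appendix can be sketched in a few lines. This is a minimal illustration, not the BAPS implementation: it assumes the uniform prior over partitions and the reference prior λijk = 1/NA(j) stated in the text, and all function names and the toy counts are hypothetical.

```python
import math


def log_marginal(pooled):
    """Log of (A1) for one partition; pooled[i][j][k] are pooled counts."""
    total = 0.0
    for pop in pooled:
        for locus in pop:
            lam = 1.0 / len(locus)          # reference prior 1 / N_A(j)
            lam_sum = lam * len(locus)
            total += math.lgamma(lam_sum) - math.lgamma(lam_sum + sum(locus))
            total += sum(math.lgamma(lam + c) - math.lgamma(lam) for c in locus)
    return total


def set_partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part


def pool(counts, block):
    """Sum counts[m][j][k] over the sampled populations m in block."""
    return [[sum(counts[m][j][k] for m in block)
             for k in range(len(counts[0][j]))]
            for j in range(len(counts[0]))]


def exact_posterior(counts):
    """Posterior probability of each partition under a uniform prior."""
    parts = list(set_partitions(list(range(len(counts)))))
    logs = [log_marginal([pool(counts, b) for b in part]) for part in parts]
    top = max(logs)
    weights = [math.exp(v - top) for v in logs]   # stabilised exponentials
    z = sum(weights)
    return [(part, w / z) for part, w in zip(parts, weights)]


def pairwise_equal_prob(posterior, m, r):
    """Probability (2): total mass of partitions that merge m and r."""
    return sum(p for part, p in posterior
               if any(m in b and r in b for b in part))


# Hypothetical toy data: 3 sampled populations, 1 locus, 2 alleles.
counts = [[[10, 10]], [[11, 9]], [[20, 0]]]
post = exact_posterior(counts)
p01 = pairwise_equal_prob(post, 0, 1)
p02 = pairwise_equal_prob(post, 0, 2)
```

In this toy case the first two samples have nearly identical counts, so most of the posterior mass falls on partitions that merge them, whereas merging either with the third sample receives very little support.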

The acceptance ratio for the Metropolis-Hastings step, where current populations given in (νP, S) are split or merged to form a proposal $(\nu_P^*,S^*)$, equals

$$\frac{\pi(\nu_P^*,S^*\mid n)}{\pi(\nu_P,S\mid n)}\times\frac{q\big((\nu_P,S)\mid(\nu_P^*,S^*)\big)}{q\big((\nu_P^*,S^*)\mid(\nu_P,S)\big)}, \tag{A2}$$

where q(·|·) is the conditional probability of proposing a population substructure from a given one, calculated explicitly at each iteration. The proposed structures are generated uniformly from the set of possible splits or mergings at a given configuration. The prior ratio of the structure parameters equals one for all possible values and cancels therefore from (A2).
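One way to realize the split/merge step with uniform proposals is sketched below. It enumerates all partitions reachable by a single split or merge, which is only practical for small NP; the representation of a partition as a frozenset of frozensets, the function names, and the flat `log_post` used in the example are assumptions for illustration. In practice `log_post` would return the logarithm of (A1).

```python
import math
import random
from itertools import combinations


def neighbours(part):
    """All partitions reachable from part by one merge or one split."""
    blocks = list(part)
    out = []
    for a, b in combinations(blocks, 2):              # merge two blocks
        out.append(frozenset((set(blocks) - {a, b}) | {a | b}))
    for blk in blocks:                                # split one block in two
        items = sorted(blk)
        first, rest = items[0], items[1:]
        for size in range(len(rest)):                 # never take all of rest
            for keep in combinations(rest, size):
                left = frozenset((first,) + keep)
                right = blk - left
                out.append(frozenset((set(blocks) - {blk}) | {left, right}))
    return out


def mh_step(part, log_post, rng):
    """One Metropolis-Hastings update of the partition (needs N_P >= 2)."""
    cur_nb = neighbours(part)
    prop = rng.choice(cur_nb)                         # uniform proposal
    prop_nb = neighbours(prop)
    # Log of (A2): q(current | proposal) = 1/len(prop_nb),
    #              q(proposal | current) = 1/len(cur_nb).
    log_ratio = (log_post(prop) - log_post(part)
                 + math.log(len(cur_nb)) - math.log(len(prop_nb)))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return prop
    return part


# Hypothetical usage with a flat (constant) log posterior.
rng = random.Random(0)
state = frozenset({frozenset({0, 1}), frozenset({2})})
for _ in range(100):
    state = mh_step(state, lambda part: 0.0, rng)
```

Because every merge can be undone by a split and vice versa, the reverse proposal probability is always positive, which the ratio in (A2) requires.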

Given the previously specified priors, the full conditional distribution of θ is a product of Dirichlet distributions, given by

$$\pi(\theta\mid\nu_P,S,n)\propto\prod_{i=1}^{\nu_P}\prod_{j=1}^{N_L}\prod_{k=1}^{N_{A(j)}}p_{ijk}^{\,n_{ijk}+\lambda_{ijk}-1}, \tag{A3}$$

that is, a Dirichlet distribution with parameters nijk + λijk at each locus of each population, from which values can be drawn explicitly. Note that the full conditional distribution given a specific partition remains unchanged during the simulation. In many analyses the used prior gives an equal mass to all alleles, although it would also be possible to incorporate knowledge from previous studies into λ. In the approach presented we prefer the theoretically derived reference choice of λijk = 1/NA(j), which was also used in Anderson and Thompson (2002; for a theoretical derivation see Bernardo and Smith 1994). As discussed in Anderson and Thompson (2002), other choices with larger λijk lead to a prior containing a substantial amount of information when the number of alleles is large. Only the suggested prior has the property of containing as much information as a likelihood with a single observation regardless of the number of alleles, which makes it a reasonable reference choice.
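Drawing θ from its full conditional can be sketched with the usual construction of a Dirichlet variate from independent gamma variates. The function names, the seed, and the toy counts are hypothetical; the prior λijk = 1/NA(j) is the reference choice stated in the text.

```python
import random


def sample_dirichlet(alpha, rng):
    """One Dirichlet(alpha) draw via normalised gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]


def sample_allele_freqs(pooled, rng):
    """One draw of theta given a partition.

    pooled[i][j][k] is the pooled count of allele k at locus j in
    population i; the result has the same nesting with frequencies p_ijk.
    """
    return [[sample_dirichlet([c + 1.0 / len(locus) for c in locus], rng)
             for locus in pop]
            for pop in pooled]


rng = random.Random(0)  # arbitrary seed for reproducibility
# Hypothetical pooled counts: one population, two loci (2 and 3 alleles).
theta = sample_allele_freqs([[[9, 11], [3, 0, 17]]], rng)
```

Since each draw depends only on the pooled counts of the current partition, successive draws of θ are independent given that partition, in line with the remark in the text about the absence of autocorrelation.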

Acknowledgments

The authors thank two anonymous referees whose suggestions and comments significantly improved the original manuscript. This work was supported by research grant nos. 52457 and 47201 from the Academy of Finland and by the Centre of Population Genetic Analyses, University of Oulu, Finland.

Footnotes

  • Communicating editor: J. B. Walsh

  • Received September 6, 2002.
  • Accepted October 4, 2002.
  • Copyright © 2003 by the Genetics Society of America

LITERATURE CITED

  1. Aarts E. H. L., Korst J., 1989 Simulated Annealing and Boltzmann Machines. Wiley, Chichester, UK.
  2. Abramowitz M., Stegun I. A. (Editors), 1969 Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York.
  3. Anderson E. C., Thompson E. A., 2002 A model-based method for identifying species hybrids using multilocus genetic data. Genetics 160: 1217–1229.
  4. Ayres K., Balding D. J., 1998 Measuring departures from Hardy-Weinberg: a Markov chain Monte Carlo method for estimating the inbreeding coefficient. Heredity 80: 769–777.
  5. Ayres K., Balding D. J., 2001 Measuring gametic disequilibrium from multilocus data. Genetics 157: 413–423.
  6. Ball R. D., 2001 Bayesian methods for quantitative trait loci mapping based on model selection: approximate analysis using the Bayesian information criterion. Genetics 159: 1351–1364.
  7. Bernardo J. M., Smith A. F. M., 1994 Bayesian Theory. Wiley, Chichester, UK.
  8. Dawson K. J., Belkhir K., 2001 A Bayesian approach to the identification of panmictic populations and the assignment of individuals. Genet. Res. 78: 59–77.
  9. Edwards S. V., Beerli P., 2000 Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 54: 1839–1854.
  10. Excoffier L., 2001 Analysis of population subdivision, pp. 271–308 in Handbook of Statistical Genetics, edited by Balding D., Bishop M., Cannings C. Wiley, Chichester, UK.
  11. Gilks W., Richardson S., Spiegelhalter D., 1996 Markov Chain Monte Carlo in Practice. Chapman & Hall, London.
  12. Holsinger K. E., 1999 Analysis of genetic diversity in geographically structured populations: a Bayesian perspective. Hereditas 130: 245–255.
  13. Holsinger K. E., Lewis P. O., Dey D. K., 2002 A Bayesian approach to inferring population structure from dominant markers. Mol. Ecol. 11: 1157–1164.
  14. Kitada S., Hayashi T., Kishino H., 2000 Empirical Bayes procedure for estimating genetic distance between populations and effective population size. Genetics 156: 2063–2079.
  15. Kullback S., 1968 Information Theory and Statistics. Wiley, New York.
  16. Kullback S., Leibler R. A., 1951 On information and sufficiency. Ann. Math. Stat. 22: 79–86.
  17. Nagylaki T., 1998 Fixation indices in subdivided populations. Genetics 148: 1325–1332.
  18. Nei M., 1977 F-statistics and analysis of gene diversity in subdivided populations. Ann. Hum. Genet. 41: 225–233.
  19. Petit R. J., Pons O., 1998 Bootstrap variance of diversity and differentiation estimators in a subdivided population. Heredity 80: 56–61.
  20. Petit R. J., El Mousadik A., Pons O., 1998 Identifying populations for conservation on the basis of genetic markers. Conserv. Biol. 12: 844–855.
  21. Pritchard J. K., Stephens M., Donnelly P., 2000 Inference of population structure using multilocus genotype data. Genetics 155: 945–959.
  22. Rannala B., Mountain J. L., 1997 Detecting immigration by using multilocus genotypes. Proc. Natl. Acad. Sci. USA 94: 9197–9201.
  23. Rousset F., 2001 Inferences from spatial population genetics, pp. 239–270 in Handbook of Statistical Genetics, edited by Balding D., Bishop M., Cannings C. Wiley, Chichester, UK.
  24. Schafer J. L., 2000 Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC Press, Boca Raton, FL.
  25. Sillanpää M. J., Corander J., 2002 Model choice in gene mapping: what and why. Trends Genet. 18: 301–307.
  26. Tomiuk J., Guldbrantsen B., Loeschcke V., 1998 Population differentiation through mutation and drift—a comparison of genetic identity measures. Genetica 102/103: 545–558.
  27. Uimari P., Sillanpää M. J., 2001 Bayesian oligogenic analysis of quantitative and qualitative traits in general pedigrees. Genet. Epidemiol. 21: 224–242.
  28. Weir B. S., 1996 Genetic Data Analysis II. Sinauer Associates, Sunderland, MA.
  29. Wright S., 1951 The genetical structure of populations. Ann. Eugen. 15: 323–354.
  30. Wright S., 1965 The interpretation of population structure by F-statistics with special regard to systems of mating. Evolution 19: 395–420.
  31. Yang R.-C., 1998 Estimating hierarchical F-statistics. Evolution 52: 950–956.

PUBLICATION INFORMATION

Volume 163 Issue 1, January 2003

Genetics: 163 (1)

ARTICLE CLASSIFICATION

INVESTIGATIONS

Bayesian Analysis of Genetic Differentiation Between Populations

Jukka Corander, Patrik Waldmann and Mikko J. Sillanpää
Genetics January 1, 2003 vol. 163 no. 1 367-374
Jukka Corander
* Rolf Nevanlinna Institute, FIN-00014, University of Helsinki, Helsinki, Finland
Patrik Waldmann
† Department of Biology, FIN-90014, University of Oulu, Oulu, Finland
Mikko J. Sillanpää
* Rolf Nevanlinna Institute, FIN-00014, University of Helsinki, Helsinki, Finland
  • For correspondence: mjs@rolf.helsinki.fi
