## Abstract

We describe a model-based clustering method for using multilocus genotype data to infer population structure and assign individuals to populations. We assume a model in which there are *K* populations (where *K* may be unknown), each of which is characterized by a set of allele frequencies at each locus. Individuals in the sample are assigned (probabilistically) to populations, or jointly to two or more populations if their genotypes indicate that they are admixed. Our model does not assume a particular mutation process, and it can be applied to most of the commonly used genetic markers, provided that they are not closely linked. Applications of our method include demonstrating the presence of population structure, assigning individuals to populations, studying hybrid zones, and identifying migrants and admixed individuals. We show that the method can produce highly accurate assignments using modest numbers of loci—*e.g.*, seven microsatellite loci in an example using genotype data from an endangered bird species. The software used for this article is available from http://www.stats.ox.ac.uk/~pritch/home.html.

IN applications of population genetics, it is often useful to classify individuals in a sample into populations. In one scenario, the investigator begins with a sample of individuals and wants to say something about the properties of populations. For example, in studies of human evolution, the population is often considered to be the unit of interest, and a great deal of work has focused on learning about the evolutionary relationships of modern populations (*e.g.*, Cavalli *et al.* 1994). In a second scenario, the investigator begins with a set of predefined populations and wishes to classify individuals of unknown origin. This type of problem arises in many contexts (reviewed by Davies*et al.* 1999). A standard approach involves sampling DNA from members of a number of potential source populations and using these samples to estimate allele frequencies in each population at a series of unlinked loci. Using the estimated allele frequencies, it is then possible to compute the likelihood that a given genotype originated in each population. Individuals of unknown origin can be assigned to populations according to these likelihoods Paetkau*et al.* 1995; Rannala and Mountain 1997).

In both situations described above, a crucial first step is to define a set of populations. The definition of populations is typically subjective, based, for example, on linguistic, cultural, or physical characters, as well as the geographic location of sampled individuals. This subjective approach is usually a sensible way of incorporating diverse types of information. However, it may be difficult to know whether a given assignment of individuals to populations based on these subjective criteria represents a natural assignment in genetic terms, and it would be useful to be able to confirm that subjective classifications are consistent with genetic information and hence appropriate for studying the questions of interest. Further, there are situations where one is interested in “cryptic” population structure—*i.e.*, population structure that is difficult to detect using visible characters, but may be significant in genetic terms. For example, when association mapping is used to find disease genes, the presence of undetected population structure can lead to spurious associations and thus invalidate standard tests (Ewens and Spielman 1995). The problem of cryptic population structure also arises in the context of DNA fingerprinting for forensics, where it is important to assess the degree of population structure to estimate the probability of false matches (Balding and Nichols 1994, 1995; Foreman*et al.* 1997; Roeder*et al.* 1998).

Pritchard and Rosenberg (1999) considered how genetic information might be used to detect the presence of cryptic population structure in the association mapping context. More generally, one would like to be able to identify the actual subpopulations and assign individuals (probabilistically) to these populations. In this article we use a Bayesian clustering approach to tackle this problem. We assume a model in which there are *K* populations (where *K* may be unknown), each of which is characterized by a set of allele frequencies at each locus. Our method attempts to assign individuals to populations on the basis of their genotypes, while simultaneously estimating population allele frequencies. The method can be applied to various types of markers [*e.g.*, microsatellites, restriction fragment length polymorphisms (RFLPs), or single nucleotide polymorphisms (SNPs)], but it assumes that the marker loci are unlinked and at linkage equilibrium with one another within populations. It also assumes Hardy-Weinberg equilibrium within populations. (We discuss these assumptions further in background on clustering methods and the discussion.)

Our approach is reminiscent of that taken by Smouse *et al.* (1990), who used the EM algorithm to learn about the contribution of different breeding populations to a sample of salmon collected in the open ocean. It is also closely related to the methods of Foreman *et al.* (1997) and Roeder *et al.* (1998), who were concerned with estimating the degree of cryptic population structure to assess the probability of obtaining a false match at DNA fingerprint loci. Consequently they focused on estimating the amount of genetic differentiation among the unobserved populations. In contrast, our primary interest lies in the assignment of individuals to populations. Our approach also differs in that it allows for the presence of admixed individuals in the sample, whose genetic makeup is drawn from more than one of the *K* populations.

In the next section we provide a brief description of clustering methods in general and describe some advantages of the model-based approach we take. The details of the models and algorithms used are given in models and methods. We illustrate our method with several examples in applications to data: both on simulated data and on sets of genotype data from an endangered bird species and from humans. incorporating population information describes how our method can be extended to incorporate geographic information into the inference process. This may be useful for testing whether particular individuals are migrants or to assist in classifying individuals of unknown origin (as in Rannala and Mountain 1997, for example). Background on the computational methods used in this article is provided in the appendix.

## BACKGROUND ON CLUSTERING METHODS

Consider a situation where we have genetic data from a sample of individuals, each of whom is assumed to have originated from a *single* unknown population (no admixture). Suppose we wish to cluster together individuals who are genetically similar, identify distinct clusters, and perhaps see how these clusters relate to geographical or phenotypic data on the individuals. There are broadly two types of clustering methods we might use:

*Distance-based methods.*These proceed by calculating a pairwise distance matrix, whose entries give the distance (suitably defined) between every pair of individuals. This matrix may then be represented using some convenient graphical representation (such as a tree or a multidimensional scaling plot) and clusters may be identified by eye.*Model-based methods.*These proceed by assuming that observations from each cluster are random draws from some parametric model. Inference for the parameters corresponding to each cluster is then done jointly with inference for the cluster membership of each individual, using standard statistical methods (for example, maximum-likelihood or Bayesian methods).

Distance-based methods are usually easy to apply and are often visually appealing. In the genetics literature, it has been common to adapt distance-based phylogenetic algorithms, such as neighbor-joining, to clustering multilocus genotype data (*e.g.*, Bowcock*et al.* 1994). However, these methods suffer from many disadvantages: the clusters identified may be heavily dependent on both the distance measure and graphical representation chosen; it is difficult to assess how confident we should be that the clusters obtained in this way are meaningful; and it is difficult to incorporate additional information such as the geographic sampling locations of individuals. Distance-based methods are thus more suited to exploratory data analysis than to fine statistical inference, and we have chosen to take a model-based approach here.

The first challenge when applying model-based methods is to specify a suitable model for observations from each cluster. To make our discussion more concrete we introduce very briefly some of our model and notation here; a fuller treatment is given later. Assume that each cluster (population) is modeled by a characteristic set of allele frequencies. Let *X* denote the genotypes of the sampled individuals, *Z* denote the (unknown) populations of origin of the individuals, and *P* denote the (unknown) allele frequencies in all populations. (Note that *X, Z*, and *P* actually represent multidimensional vectors.) Our main modeling assumptions are Hardy-Weinberg equilibrium within populations and complete linkage equilibrium between loci within populations. Under these assumptions each allele at each locus in each genotype is an independent draw from the appropriate frequency distribution, and this completely specifies the probability distribution Pr(*X*|*Z, P*) (given later in Equation 2). Loosely speaking, the idea here is that the model accounts for the presence of Hardy-Weinberg or linkage disequilibrium by introducing population structure and attempts to find population groupings that (as far as possible) are not in disequilibrium. While inference may depend heavily on these modeling assumptions, we feel that it is easier to assess the validity of explicit modeling assumptions than to compare the relative merits of more abstract quantities such as distance measures and graphical representations. In situations where these assumptions are deemed unreasonable then alternative models should be built.

Having specified our model, we must decide how to perform inference for the quantities of interest (*Z* and *P*). Here, we have chosen to adopt a Bayesian approach, by specifying models (priors) Pr(*Z*) and Pr(*P*), for both *Z* and *P.* The Bayesian approach provides a coherent framework for incorporating the inherent uncertainty of parameter estimates into the inference procedure and for evaluating the strength of evidence for the inferred clustering. It also eases the incorporation of various sorts of prior information that may be available, such as information about the geographic sampling location of individuals.

Having observed the genotypes, *X*, our knowledge about *Z* and *P* is then given by the posterior distribution
*is* possible to obtain an approximate sample (*Z*^{(1)}, *P*^{(1)}), (*Z*^{(2)}, *P*^{(2)}), … ,(*Z*^{(M)}, *P*^{(M)}) from Pr(*Z*, *P*|*X*) using Markov chain Monte Carlo (MCMC) methods described below (see Gilks*et al.* 1996b, for more general background). Inference for *Z* and *P* may then be based on summary statistics obtained from this sample (see *Inference for Z, P, and Q* below). A brief introduction to MCMC methods and Gibbs sampling may be found in the appendix.

## MODELS AND METHODS

We now provide a more detailed description of our modeling assumptions and the algorithms used to perform inference, beginning with the simpler case where each individual is assumed to have originated in a single population (no admixture).

**The model without admixture:** Suppose we genotype *N* diploid individuals at *L* loci. In the case without admixture, each individual is assumed to originate in one of *K* populations, each with its own characteristic set of allele frequencies. Let the vector *X* denote the observed genotypes, *Z* the (unknown) populations of origin of the individuals, and *P* the (unknown) allele frequencies in the populations. These vectors consist of the following elements,
*J _{1}* is the number of distinct alleles observed at locus

*l*, and these alleles are labeled 1, 2, … ,

*J*

_{l}.Given the population of origin of each individual, the genotypes are assumed to be generated by drawing alleles independently from the appropriate population frequency distributions,
*p*_{z(i)lj} is the frequency of allele *j* at locus *l* in the population of origin of individual *i.*)

Assume that before observing the genotypes we have no information about the population of origin of each individual and that the probability that individual *i* originated in population *k* is the same for all *k*,

We follow the suggestion of Balding and Nichols (1995) (see also Foreman*et al.* 1997 and Rannala and Mountain 1997) in using the Dirichlet distribution to model the allele frequencies at each locus within each population. The Dirichlet distribution _{1}, λ_{2}, … , λ* _{J}*) is a distribution on allele frequencies

*p*= (

*p*

_{1},

*p*

_{2}, … ,

*p*) with the property that these frequencies sum to 1. We use this distribution to specify the probability of a particular set of allele frequencies

_{J}*p*for population

_{kl·}*k*at locus

*l*,

*k,l.*The expected frequency of allele

*j*is proportional to λ

*, and the variance of this frequency decreases as the sum of the λ*

_{j}*increases. We take λ*

_{j}_{1}= λ

_{2}= ⋯ = λ

*= 1.0, which gives a uniform distribution on the allele frequencies; alternatives are discussed in the discussion.*

_{Jl}**MCMC algorithm (without admixture):** Equations 2, 3, and 4 define the quantities Pr(*X*|*Z, P*), Pr(*Z*), and Pr(*P*), respectively. By setting θ = (θ_{1}, θ_{2}) = (*Z, P*) and letting π(*Z, P*) = Pr(*Z, P*|*X*) we can use the approach outlined in *Algorithm A1* to construct a Markov chain with stationary distribution Pr(*Z, P*|*X*) as follows:

Algorithm 1: *Starting with initial drawing Z*^{(0)} *for Z (by drawing Z*^{(0)} *at random using* (3) *for example), iterate the following steps for m* = 1, 2, … .

*Step* 1. *Sample P*^{(m)} *from* Pr(*P*|*X, Z*^{(m−1)}).

*Step* 2. *Sample Z*^{(m)} *from* Pr(*Z*|*X, P*^{(m)}).

Informally, step 1 corresponds to estimating the allele frequencies for each population assuming that the population of origin of each individual is known; step 2 corresponds to estimating the population of origin of each individual, assuming that the population allele frequencies are known. For sufficiently large *m* and *c*, (*Z*^{(m)}, *P*^{(m)}), (*Z*^{(m+c)}, *P*^{(m+c)}) (*Z*^{(m+2c)}, *P*^{(m+2c)}), … will be approximately independent random samples from Pr(*Z, P*|*X*). The distributions required to perform each step are given in the appendix.

**The model with admixture:** We now expand our model to allow for admixed individuals by introducing a vector *Q* to denote the admixture proportions for each individual. The elements of *Q* are
*Z* to replace the assumption that each individual *i* originated in some unknown population *z*^{(i)} with the assumption that each observed allele copy

Our primary interest now lies in estimating *Q.* We proceed in a manner similar to the case without admixture, beginning by specifying a probability model for (*X, Z, P, Q*). Analogues of (2) and (3) are
*P* as before. To complete our model we need to specify a distribution for *Q*, which in general will depend on the type and amount of admixture we expect to see. Here we model the admixture proportions *i* using the Dirichlet distribution
*K* populations in equal proportions. For very small values of α (≪1), it models each individual as originating mostly from a single population, with each population being equally likely. As α → 0 this model becomes the same as our model without admixture (although the implementation of the MCMC algorithm is somewhat different). We allow α to range from 0.0 to 10.0 and attempt to learn about α from the data (specifically we put a uniform prior on α ϵ [0, 10] and use a Metropolis-Hastings update step to integrate out our uncertainty in α). This model may be considered suitable for situations where little is known about admixture; alternatives are discussed in the discussion.

**MCMC algorithm (with admixture):** The following algorithm may be used to sample from Pr(*Z, P, Q*|*X*).

Algorithm 2: *Starting with initial values Z*^{(0)} *for Z (by drawing Z*^{(0)} *at random using* (3) *for example), iterate the following steps for m* = 1, 2, … .

*Step* 1. *Sample P*^{(m)}, *Q*^{(m)} *from* Pr(*P, Q*|*X, Z*^{(m−1)}).

*Step* 2. *Sample Z*^{(m)} *from* Pr(*Z*|*X, P*^{(m)}, *Q*^{(m)}).

*Step* 3. *Update* α *using a Metropolis-Hastings step.*

Informally, step 1 corresponds to estimating the allele frequencies for each population and the admixture proportions of each individual, assuming that the population of origin of each allele copy in each individual is known; step 2 corresponds to estimating the population of origin of each allele copy, assuming that the population allele frequencies and the admixture proportions are known. As before, for sufficiently large *m* and *c*, (*Z*^{(m)}, *P*^{(m)}, *Q*^{(m)}), (*Z*^{(m+c)}, *P*^{(m+c)}, *Q*^{(m+c)}), (*Z*^{(m+2c)}, *P*^{(m+2c)}, *Q*^{(m+2c)}), … will be approximately independent random samples from Pr(*Z, P, Q*|*X*). The distributions required to perform each step are given in the appendix.

**Inference:** *Inference for Z, P, and Q:* We now discuss how the MCMC output can be used to perform inference on *Z, P*, and *Q.* For simplicity, we focus our attention on *Q*; inference for *Z* or *P* is similar.

Having obtained a sample *Q*^{(1)}, … , *Q*^{(M)} (using suitably large burn-in *m* and thinning interval *c*) from the posterior distribution of *Q* = (*q*_{1}, … , *q _{N}*) given

*X*using the MCMC method, it is desirable to summarize the information contained, perhaps by a point estimate of

*Q.*A seemingly obvious estimate is the posterior mean

*q*is (1/

_{i}*K*,1/

*K*, … , 1/

*K*) for all

*i*, whatever the value of

*X.*For example, suppose that there are just two populations and 10 individuals and that the genotypes of these individuals contain strong information that the first 5 are in one population and the second 5 are in the other population. Then either

*q*being (0.5, 0.5). This is essentially a problem of nonidentifiability caused by the symmetry of the model [see Stephens (2000b) for more discussion].

_{i}In general, if there are *K* populations then there will be *K*! sets of symmetric modes. Typically, MCMC schemes find it rather difficult to move between such modes, and the algorithms we describe will usually explore only one of the symmetric modes, even when run for a very large number of iterations. Fortunately this does not bother us greatly, since from the point of view of clustering all the symmetric modes are the same [compare the clusterings corresponding to (9) and (10)]. If our sampler explores only one symmetric mode then the sample means (8) will be very poor estimates of the posterior means for the *q _{i}*, but will be much better estimates of the

*modes*of the

*q*, which in this case turn out to be a much better summary of the information in the data. Ironically then, the poor mixing of the MCMC sampler between the symmetric modes gives the asymptotically useless estimator (8) some practical value. Where the MCMC sampler succeeds in moving between symmetric modes, or where it is desired to combine results from samples obtained using different starting points (which may involve combining results corresponding to different modes), more sophisticated methods [such as those described by Stephens (2000b)] may be required.

_{i}*Inference for the number of populations:* The problem of inferring the number of clusters, *K*, present in a data set is notoriously difficult. In the Bayesian paradigm the way to proceed is theoretically straightforward: place a prior distribution on *K* and base inference for *K* on the posterior distribution
*Q*, *Z*, and *P*, say) are relatively robust to these assumptions. Moreover, there are typically severe computational challenges in estimating Pr(*X*|*K*). We therefore describe an alternative approach, which is motivated by approximating (11) in an *ad hoc* and computationally convenient way.

Arguments given in the appendix (*Inference on K, the number of populations*) suggest estimating Pr(*X*|*K*) using
*X*|*K*) for each *K* and substitute these estimates into (11) to approximate the posterior distribution Pr(*K*|*X*).

In fact, the assumptions underlying (12) are dubious at best, and we do not claim (or believe) that our procedure provides a quantitatively accurate estimate of the posterior distribution of *K.* We see it merely as an *ad hoc* guide to which models are most consistent with the data, with the main justification being that it seems to give sensible answers in practice (see next section for examples). Notwithstanding this, for convenience we continue to refer to “estimating” Pr(*K*|*X*) and Pr(*X*|*K*).

## APPLICATIONS TO DATA

We now illustrate the performance of our method on both simulated data and real data (from an endangered bird species and from humans). The analyses make use of the methods described in *The model with admixture.*

**Simulated data:** To test the performance of the clustering method in cases where the “answers” are known, we simulated data from three population models, using standard coalescent techniques (Hudson 1990). We assumed that sampled individuals were genotyped at a series of unlinked microsatellite loci. Data were simulated under the following models.

Model 1: A single random-mating population of constant size.

Model 2: Two random-mating populations of constant effective population size 2N. These were assumed to have split from a single ancestral population, also of size 2*N* at a time *N* generations in the past, with no subsequent migration.

Model 3: Admixture of populations. Two discrete populations of equal size, related as in model 2, were fused to produce a single random-mating population. Samples were collected after two generations of random mating in the merged population. Thus, individuals have *i* grandparents from population 1, and 4 − *i* grandparents from population 2 with probability (*i* ϵ {0, 4}. All loci were simulated independently.

We present results from analyzing data sets simulated under each model. Data set 1 was simulated under model 1, with 5 microsatellite loci. Data sets 2A and 2B were simulated under model 2, with 5 and 15 microsatellite loci, respectively. Data set 3 was simulated under model 3, with 60 loci (preliminary analyses with fewer loci showed this to be a much harder problem than models 1 and 2). Microsatellite mutation was modeled by a simple stepwise mutation process, with the mutation parameter 4*N*μ set at 16.0 per locus (*i.e.*, the expected variance in repeat scores within populations was 8.0). We did not make use of the assumed mutation model in analyzing the simulated data.

Our analysis consists of two phases. First, we consider the issue of model choice—*i.e.*, how many populations are most appropriate for interpreting the data. Then, we examine the clustering of individuals for the inferred number of populations.

**Choice of K for simulated data:** For each model, we ran a series of independent runs of the Gibbs sampler for each value of

*K*(the number of populations) between 1 and 5. The results presented are based on runs of 10

^{6}iterations or more, following a burn-in period of at least 30,000 iterations. To choose the length of the burn-in period, we printed out log(Pr(

*X*|

*P*

^{(m)},

*Q*

^{(m)})), and several other summary statistics during the course of a series of trial runs, to estimate how long it took to reach (approximate) stationarity. To check for possible problems with mixing, we compared the estimates of

*P*(

*X*|

*K*) and other summary statistics obtained over several independent runs of the Gibbs sampler, starting from different initial points. In general, substantial differences between runs can indicate that either the runs should be longer to obtain more accurate estimates or that independent runs are getting stuck in different modes in the parameter space. (Here, we consider the

*K*! modes that arise from the nonidentifiability of the

*K*populations to be equivalent, since they arise from permuting the

*K*population labels.)

We found that in most cases we obtained consistent estimates of *P*(*X*|*K*) across independent runs. However, when analyzing data set 2A with *K* = 3, the Gibbs sampler found two different modes. This data set actually contains two populations, and when *K* is set to 3, one of the populations expands to fill two of the three clusters. It is somewhat arbitrary which of the two populations expands to fill the extra cluster: this leads to two modes of slightly different heights. The Gibbs sampler did not manage to move between the two modes in any of our runs.

In Table 1 we report estimates of the posterior probabilities of values of *K*, assuming a uniform prior on *K* between 1 and 5, obtained as described in *Inference for the number of populations.* We repeat the warning given there that these numbers should be regarded as rough guides to which models are consistent with the data, rather than accurate estimates of the posterior probabilities. In the case where we found two modes (data set 2A, *K* = 3), we present results based on the mode that gave the higher estimate of Pr(*X*|*K*).

With all four simulated data sets we were able to correctly infer whether or not there was population structure (*K* = 1 for data set 1 and *K* > 1 otherwise). In the case of data set 2A, which consisted of just 5 loci, there is not a clear estimate of *K*, as the posterior probability is consistent with both the correct value, *K* = 2, and also with *K* = 3 or 4. However, when the number of loci was increased to 15 (data set 2B), virtually all of the posterior probability was on the correct number of populations, *K* = 2.

Data set 3 was simulated under a more complicated model, where most individuals have mixed ancestry. In this case, the population was formed by admixture of two populations, so the “true” clustering is with *K* = 2,
and *Q* estimating the number of grandparents from each of the two original populations, for each individual. Intuitively it seems that another plausible clustering would be with *K* = 5, individuals being assigned to clusters according to how many grandparents they have from each population. In biological terms, the solution with *K* = 2 is more natural and is indeed the inferred value of *K* for this data set using our *ad hoc* guide [the estimated value of Pr(*X*|*K*) was higher for *K* = 5 than for *K* = 3, 4, or 6, but much lower than for *K* = 2]. However, this raises an important point: the inferred value of *K* may not always have a clear biological interpretation (an issue that we return to in the discussion).

*Clustering of simulated data:* Having considered the problem of estimating the number of populations, we now examine the performance of the clustering algorithm in assigning particular individuals to the appropriate populations. In the case where the populations are discrete, the clustering performs very well (Figure 1), even with just 5 loci (data set 2A), and essentially perfectly with 15 loci (data set 2B).

The case with admixture (Figure 2) appears to be more difficult, even using many more loci. However, the clustering algorithm did manage to identify the population structure appropriately and estimated the ancestry of individuals with reasonable accuracy. Part of the reason that this problem is difficult is that it is hard to estimate the original allele frequencies (before admixture) when almost all the individuals (7/8) are admixed. A more fundamental problem is that it is difficult to get accurate estimates of *q*^{(i)} for particular individuals because (as can be seen from the *y*-axis of Figure 2) for any given individual, the variance of how many of its alleles are *actually* derived from each population
can be substantial (for intermediate *q*). This property means that even if the allele frequencies were known, it would still be necessary to use a considerable number of loci to get accurate estimates of *q* for admixed individuals.

**Data from the Taita thrush:** We now present results from applying our method to genotype data from an endangered bird species, the Taita thrush, *Turdus helleri.* Individuals were sampled at four locations in southeast Kenya [Chawia (17 individuals), Ngangao (54), Mbololo (80), and Yale (4)]. Each individual was genotyped at seven microsatellite loci (Galbusera*et al.* 2000).

This data set is a useful test for our clustering method, because the geographic samples are likely to represent distinct populations. These locations represent fragments of indigenous cloud forest, separated from each other by human settlements and cultivated areas. Yale, which is a very small fragment, is quite close to Ngangao. Extensive data on ringed and radio-tagged birds over a 3-year period indicate low migration rates (Galbusera*et al.* 2000).

As discussed in background on clustering methods, it is currently common to use distance-based clustering methods to visualize genotype data of this kind. To permit a comparison between that type of approach and our own method, we begin by showing a neighbor-joining tree of the bird data (Figure 3). Inspection of the tree reveals that the Chawia and Mbololo individuals represent (somewhat) distinct clusters. Several individuals (marked by asterisks) appear to be classified with other groups. The four Yale individuals appear to fall within the Ngangao group [a view that is supported by summary statistics of divergence showing the Yale and Ngangao to be very closely related (Table 2)].

The tree illustrates several shortcomings of distance-based
clustering methods. First, it would not be possible (in this case) to identify the appropriate clusters if the labels were missing. Second, since the tree does not use a formal probability model, it is difficult to ask statistical questions about features of the tree, for example: Are the individuals marked with asterisks actually migrants, or are they simply misclassified by chance? Is there evidence of population structure *within* the Ngangao group (which appears from the tree to be quite diverse)?

We now apply our clustering method to these data.

**Choice of K, for Taita thrush data:** To choose an appropriate value of

*K*for modeling the data, we ran a series of independent runs of the Gibbs sampler at a range of values of

*K.*After running numerous medium-length runs to investigate the behavior of the Gibbs sampler (using the diagnostics described in

*Choice of K for simulated data*), we again chose to use a burn-in period of 30,000 iterations and to collect data for 10

^{6}iterations. We ran three to five independent simulations of this length for each

*K*between 1 and 5 and found that the independent runs produced highly consistent results. At

*K*= 5, a run of 10

^{6}steps takes ~70 min on our desktop machine.

Using the approach described in *Inference for the number of populations*, we estimated Pr(*X*|*K*) for *K* = 1, 2, … , 5 and corresponding values of Pr(*K*|*X*) for a uniform prior on *K* = 1, 2, … , 5. (In fact, this data set contains a lot of information about *K*, so that inference is relatively robust to choice of prior on *K*, and other priors, such as taking Pr(*K*) proportional to Poisson(1) for *K* > 0, would give virtually indistinguishable results.) From the estimates of Pr(*K*|*X*), shown in the last column of Table 3, it is clear that the models with *K* = 1 or 2 are completely insufficient to model the data and that the model with *K* = 3 is substantially better than models with larger *K.* Given these results, we now focus our subsequent analysis on the model with three populations.

**Clustering results for Taita thrush data:** Figure 4 shows a plot of the clustering results for the individuals in the sample, assuming that there are three populations (as inferred above). We did not use (and indeed, did not know) the sampling locations of individuals when
we obtained these results. Our clustering algorithm seems to have performed very well, with just a few individuals (labeled 1–4) falling somewhat outside the obvious clusters. All of the points in the extreme corners (some of which may be difficult to resolve on the picture) are correctly assigned. The four Yale individuals were assigned to the Ngangao cluster, consistent with the neighbor-joining tree and the (δμ)^{2} distances. We return to this data set in incorporating population information to consider the question of whether the individuals that seem not to cluster tightly with others sampled from the same location are the product of migration.

**Application to human data:** The next data set, taken from Jorde *et al.* (1995), includes data from 30 biallelic restriction site polymorphisms, genotyped in 72 Africans (Sotho, Tsonga, Nguni, Biaka and Mbuti Pygmies, and San) and 90 Europeans (British and French).

Application of our MCMC scheme with *K* = 2 indicates the presence of two very distinct clusters, corresponding to the Africans and Europeans in the sample (Figure 5). The model with *K* = 2 has vastly higher posterior probability than the model with *K* = 1.

Additional runs of the MCMC scheme with the models *K* = 3, 4, and 5 suggest that those models may be somewhat better than *K* = 2. This may reflect the presence of population structure within the continental groupings, although in this case the additional populations do not form discrete clusters and so are difficult to interpret.

Again it is interesting to contrast our clustering results with the neighbor-joining tree of these data (Figure 6). While our method finds it quite easy to separate the two continental groups into the correct clusters, it would not be possible to use the neighbor-joining tree to detect distinct clusters if the labels were not present. The data set of Jorde also contains a set of individuals of Asian origin (which are more closely related to Europeans than are Africans). Neither the neighbor-joining method nor our method differentiates between the Europeans and Asians with great accuracy using this data set.

## INCORPORATING POPULATION INFORMATION

The results presented so far have focused on testing how well our method works. We now turn our attention to some further applications of this method.

Our clustering results (Figure 4) confirm that the three main geographic groupings in the thrush data set (Chawia, Mbololo, and Ngangao) represent three genetically distinct populations. The geographic labels correspond very closely to the genetic clustering in all but a handful of cases (1–4 in Figure 4). Individual 2 is also identified as a possible outlier on the neighbor-joining tree (Figure 3). Given this, it is natural to ask whether these apparent outliers are immigrants (or descendants of recent immigrants) from other populations. For example, given the genetic data, how probable is it that individual 1 is actually an immigrant from Chawia?

To answer this sort of question, we need to extend our algorithm to incorporate the geographic labels. By doing this, we break the symmetry of the labels, and we can ask specifically whether a particular individual is a migrant from Chawia (say). In essence our approach (described more formally in the next section) is to assume that each individual originated, with high probability, in the geographical region in which it was sampled, but to allow some small probability that it is an immigrant (or has immigrant ancestry). Note that this model is also suitable for situations in which individuals are classified according to some characteristic other than sampling location (physical appearance, for example). “Immigrants” in this situation would be individuals whose genetic makeup suggests they were misclassified. Thus, while we speak of “immigrants” and “immigrant ancestry,” in some contexts these terms may relate to something other than changes in physical location.