## Abstract

A new approach to assigning individuals to populations using genetic data is described. Most existing methods work by maximizing Hardy–Weinberg and linkage equilibrium within populations, neither of which will apply for many demographic histories. By including a demographic model, within a likelihood framework based on coalescent theory, we can jointly study demographic history and population assignment. Genealogies and population assignments are sampled from a posterior distribution using a general isolation-with-migration model for multiple populations. A measure of partition distance between assignments facilitates not only the summary of a posterior sample of assignments, but also the estimation of the posterior density for the demographic history. It is shown that joint estimates of assignment and demographic history are possible, including estimation of population phylogeny for samples from three populations. The new method is compared to results of a widely used assignment method, using simulated and published empirical data sets.

THE assignment of individuals to populations is a common population genetic application (Paetkau *et al.* 1995; Rannala and Mountain 1997; Pritchard *et al.* 2000; Dawson and Belkhir 2001; Corander *et al.* 2003; Baudouin *et al.* 2004; Guillot *et al.* 2005; François *et al.* 2006; Pella and Masuda 2006; Wu *et al.* 2006; Huelsenbeck and Andolfatto 2007; Zhang 2008; Reeves and Richards 2009). Most methods assume random mating within populations and free recombination between loci to find population assignments that minimize the amount of departure from Hardy–Weinberg equilibrium (HWE) and linkage equilibrium (LE) within populations. The population assignment with the highest likelihood, or highest posterior probability under Bayesian approaches, is taken as the assignment estimate. Generally the true allele frequencies are not known so that methods must either make use of estimated allele frequencies (*e.g.*, Grant *et al.* 1980) or consider the range of possible allele frequencies (*e.g.*, Pritchard *et al.* 2000).

Methods based on minimizing departure from HWE and LE typically offer little accommodation for the different potential causes of differentiation between populations. They allow populations to differ in allele frequencies, but they do so without including explicit evolutionary models of demography and mutation that cause such differences (Waples and Gaggiotti 2006; Listman *et al.* 2007). For example, it is possible that a method based on departures from HWE and LE will provide better estimates when divergence arises under an island model than when a similar amount of divergence arises under a phylogenetic branching model or vice versa. To assess such dependencies methods must be run on data simulated under various kinds of demographic histories. Without demographic history as a part of the model implemented in a method, such simulation studies tend to treat the computer program as a “black box” that can reveal interactions between demography and assignment only when observed in operation under different kinds of simulated histories (Evanno *et al.* 2005; Waples and Gaggiotti 2006; Chen *et al.* 2007; Fogelqvist *et al.* 2010).

In fact, likelihood-based methods for studying demographic histories offer a direct path to studying population assignment (*e.g.*, Matz and Nielsen 2005; Nielsen and Matz 2006). For example, Kuhner *et al.* (1995) pioneered a Markov chain Monte Carlo (MCMC) approach (Metropolis *et al.* 1953; Hastings 1970) to sampling genealogies, which could be modified to include the population assignments of genes, at the terminal branches of genealogies, as free parameters. Here we take this general approach to study population assignment and demographic history by adapting an MCMC framework developed for the isolation-with-migration (IM) model for two or more populations (Hey and Nielsen 2007; Hey 2010b). The goals of this article are to describe a new method for estimating population assignment together with parameters of an IM model and to examine how well the method performs using simulated and real data and in comparison to another widely used assignment method that does not use an explicit divergence model

## Models and Methods

The multipopulation IM model includes a population tree that is an ultrametric binary rooted tree (of populations) with a labeled history (Edwards 1970), in which nodes of the tree are ordered in time (Hey 2010b). A genealogy, represented by *G*, is also a bifurcating ultrametric tree that lies within the population tree and that describes, for all of the sampled gene copies of a locus, the times and topology of common ancestry and the times and directions of migration events. Individual loci have no recombination, but all of the loci, denoted by *L*, are assumed to be unlinked so that their corresponding genealogies are independent of each other. Every node of a genealogy is given the corresponding population label within which it falls in the population tree (see Figure 1).

Hey and Nielsen (2007) used MCMC to sample a multilocus genealogy **G** and splitting time **t** from the posterior distribution,where *f*(**X**|**G**) is the likelihood of the data, π(**G**|**t**) and π(**t**) are prior distributions, and *f*(**X**) is the marginal likelihood. In the method of Hey and Nielsen (2007) the data, **X**, include gene copies from one or more loci that are already assigned to the populations from which they are sampled. We use *f* to denote likelihood and marginal-likelihood functions and π to designate prior and posterior distributions. It is possible to sample genealogies from the posterior distribution by having a closed-form expression for the prior distribution given bywhere **Θ** is a vector of demographic parameters pertaining to population sizes and migration rates. Values of **G** drawn from the posterior distribution are used to estimate the posterior density of **Θ** given the data **X**,(1)where Ψ is the set of all possible genealogies, □ is the expectation over samples from π(**G**, **t**|**X**), and π(**Θ**|**G**, **t**) is the distribution of demographic parameters given genealogies and splitting times. Estimates of **Θ** are obtained by maximizing (1) over the range of parameter values (as specified by the prior distribution of **Θ**) using the posterior sample of **G**,(2)where *J* is the size of a posterior sample, π(**G*** _{j}*|

**t**

*,*

_{j}**Θ**) = π(

**G**

*|*

_{j}**Θ**) is the probability of the

*j*th posterior sample of genealogy given demographic parameters (Kingman 1982; Felsenstein 1992), and π(

**Θ**|

**t**

*) is the prior of the demographic parameters. Typically in analyses of IM models the prior distribution on parameters is uniform over a specified range. In this case the posterior density is proportional to the likelihood over the range, permitting likelihood-ratio tests of nested demographic models (Hey and Nielsen 2007).*

_{j}### IM models with assignment

We consider data drawn from *n* individuals, with each individual having come from one of *K*_{max} sampled populations. The genetic data without information on which populations individuals came from is denoted by **Y**. The genotype for individual *i* at locus *l* is a single-valued vector *Y _{il}* for haploid data or a two-valued vector (

*Y*

_{il}_{1},

*Y*

_{il}_{2}) for diploid data and the genotype for individual

*i*,

*Y*, consists of

_{i}*L*such vectors. Assignment

**A**is a vector of size

*n*with element

*A*taking a value from 1 through

_{i}*K*

_{max}. The number of sampled populations,

*K*, can be less than

*K*

_{max}because

**A**can have zero individuals in some populations. The number of ways to assign

*n*individuals to

*K*populations is a Stirling number of the second kind (Bell 1934). All of the assignments are equally likely

*a priori*. The posterior distribution of genealogy, splitting times, and assignment given data from unassigned individuals is(3)where

*f*(

**Y**|

**G**) is the likelihood, π(

**G**|

**t**,

**A**), π(

**t**|

**A**), and π(

**A**) are priors, and

*f*(

**Y**) is the marginal likelihood. Note that because the likelihood

*f*(

**Y**|

**G**) depends only on the topology and branch lengths of the genealogy, it does not depend on

**A**.

#### MCMC with genealogies, splitting times, and assignment:

To estimate the posterior distribution (3) we employ an MCMC simulation in which we start with an initial value of (**G**^{(0)}, **t**^{(0)}, **A**^{(0)}) and replace current values (**G**, **t**, **A**) by new values (**G***, **t***, **A***) with acceptance probability(4)where *q* is a proposal density. An update of assignment requires relabeling branches on genealogies, which can entail changes to the structure of genealogies, including changes in coalescent times as well as migration times and directions thereof. Furthermore, when an individual is represented by multiple loci, an update of that individual’s assignment must be applied to the genealogies for all of the loci with genes sampled from that individual. Just as in the case of an update to a population splitting time, which applies to the genealogies for all of the loci in a study (Hey and Nielsen 2004), the acceptance rates of proposed assignment updates are expected to decrease as more loci are included in the study. Two update protocols for assignment were developed on the basis of previous methods for updating genealogies (Beerli and Felsenstein 1999; Nielsen and Matz 2006; Hey and Nielsen 2007). We first describe the approach of Nielsen and Matz followed by that of Beerli and Felsenstein.

#### Nielsen and Matz (2006) updating method:

Figure 1 shows a change of assignment on a genealogy in which five individuals are each represented by a single gene copy. Figure 1, A and B, shows a genealogy in a two-population model before and after gene copy 4 is reassigned from population 2 to 1. Prior to the reassignment migration events along the branch leading to the genes of the reassigned individual are erased, and new migration events are resimulated after the genes are reassigned (*i.e.*, moved to a different population). Note that each internal node on the genealogy has a population designation, which depends upon the time of that node, the assignment of descendant gene copies, and migration events on branches leading to the descendant gene copies. When a sampled gene is reassigned to another population, its new label may become incompatible with the population label at the base of its branch. In Figure 1A the labels at nodes C and 4 are both population 2, but after gene 4 is moved to population 1 in Figure 1B its location is incompatible with the population label 2 on node C. This incompatibility is resolved by simulating migration events conditioned on the requirement that population labels at internal nodes be compatible with the population labels of sampled gene copies. With this type of update the probability of the data given the genealogy remains unchanged; *i.e.*, *f*(**Y**|**G***) = *f*(**Y**|**G**) because the update changes neither the topology nor the branch lengths of a genealogy. With a uniform prior for the splitting time and assignment, the acceptance probability of Equation 4 becomes(5)

The Nielsen and Matz (2006) update is convenient because it does not require changing the tree topology and therefore does not require recalculation of the likelihood. However, this same feature, and the need to resimulate migration events to accommodate a fixed topology with a new assignment, leads to low acceptance rates when multiple loci must be updated simultaneously.

#### Beerli and Felsenstein (1999) updating method:

Beerli and Felsenstein (1999) devised an update in which a proposed genealogy is sampled from the prior distribution for genealogies, conditioned on the current values of the demographic parameters. We found that this method, adapted to accommodate assignment, performed better with data sets having multiple loci than did the Nielsen and Matz (2006) method and hereafter all results are based on the Beerli and Felsenstein (1999) method.

The Beerli and Felsenstein (1999) update begins by randomly selecting gene copies to be reassigned and proceeds by erasing the external branches for those gene copies, randomly selecting new population assignments, and simulating new branches. Beerli and Felsenstein called the new edge an “active lineage” and all the other lineages in the genealogy “inactive lineages”. The simulation of an active lineage can include adding migration events, and it ends in a coalescent event. When the active lineage coalesces at some time point, one of the inactive lineages is chosen as a new sister to the active lineage. To improve the acceptance rate of updates when the infinite sites mutation model (Kimura 1969) is used, we propose only genealogies that are compatible with the data under that model. When a finite sites mutation model is in use, the situation is more complex because selection of any inactive lineage as a sister to the active lineage will yield a tree that has a nonzero likelihood. We have not implemented the method for finite site mutation models.

The method of Beerli and Felsenstein requires the simulation of migration and coalescence using values of demographic parameters. However, unlike some MCMC-based genealogy samplers, the method of Hey and Nielsen (2007) does not include these parameters in the Markov chain, and so it was necessary to use values estimated from the current genealogy. Just as we estimate the posterior density for **Θ** using a larger sample of genealogies using (2), we can use the current genealogy *G _{j}* of a single locus to estimate

**Θ**by maximizing(6)We describe one way to estimate

**Θ**using a single genealogy in

*Appendix A*. The acceptance probability α

_{BF}(

**G**,

**t**,

**A**→

**G***,

**t***,

**A***) is given by(7)where

*f*(

**Y**|

**G***) ≠

*f*(

**Y**|

**G**), because of changes to the topology and branch lengths of the genealogy.

#### Implicit inference of the population tree:

In the absence of a population tree, assignment information is limited to which individuals occur together in the same groups, and the actual population labels attached to groups are interchangeable. However, with a population tree, populations are not interchangeable but rather vary in how related they are to other populations, in which case assignment becomes a question not only of grouping individuals but also of where groups of individuals fall on the population tree; see also O’Meara (2010). Figure 2 shows an example in which sampling assignment, for populations on a tree, requires that we sample both the assignment and the population tree. The posterior distribution of genealogy, population tree, and assignment is(8)where τ is a population tree, which is taken to include the vector of splitting times **t**.

In trials using simulated data in a three-population IM model, neither the Nielsen and Matz (2006) nor the Beerli and Felsenstein (1999) protocols provided sufficient mixing of the MCMC simulation. Therefore we developed a method of updating the population tree and applied it to the case of three sampled populations. The acceptance probability with this update is(9)where the likelihood ratio disappears because *f*(**Y**|**G***) and *f*(**Y**|**G**) cancel out. Figure 2 describes a population tree update, what we call the “three-point-turn”, in which a population branch first slides down along its sister branch to form a polytomy and then back up a different branch.

### Summarizing assignment values sampled from the Markov chain

Huelsenbeck and Andolfatto (2007) introduced a method of summarizing a set of assignment values that are sampled from a posterior distribution. We employ this approach and adapt it to quantify assignment uncertainty and to estimate demographics jointly with assignment. Following Huelsenbeck and Andolfatto (2007) we make use of the concepts of “partition distance” between two assignments and of “mean assignment”. We explain these concepts in some detail because of their importance for jointly estimating population assignment and demographic history.

#### Partition distance:

The partition distance is the minimum number of individuals that must be removed to make two assignments equivalent (Almudevar and Field 1999; Gusfield 2002; Konovalov *et al.* 2005). For two assignments, with populations represented, respectively, we begin by relabeling each assignment using the restricted growth function (RGF) (*e.g.*, Stanton and White 1986; Huelsenbeck and Andolfatto 2007). Under RGF indexing, individual assignments are numbered sequentially except that all individuals assigned to the same population are assigned the same label. For example, the RGF of assignment (3, 3, 1, 2) is (1, 1, 2, 3), where the first two individuals form a group, and each of the third and the fourth forms its own group. With the reindexed assignments **A**_{1} and **A**_{2}, let *d*(**A**_{1}, **A**_{2}) be the number of individuals whose population assignment in **A**_{1} is different from what it is in **A**_{2}, and let λ(**A**) be a relabeling in which the population labels in an assignment **A** are permuted. There are *k*! possible permutations of a set of *k* elements; *e.g.*, six permutations of the ordered set {1, 2, 3} include (1, 2, 3), (1, 3, 2), (2, 1, 3), (2, 3, 1), (3, 1, 2), and (3, 2, 1). Because labels are ordered in an assignment, a relabeling operation λ(1, 2, 3) = (3, 2, 1) means that λ(1) = 3, λ(2) = 2, and λ(3) = 1. For the assignment with the most populations represented, there are *s*! possible relabeled assignments, where *s* = max(*s*_{1}, *s*_{2}), and *s*_{1} and *s*_{2} are the numbers of distinct populations in **A**_{1} and **A**_{2}, respectively. For each possible relabeling of that assignment, we calculate *d*(**A**_{1}, **A**_{2}). The smallest of these values is the partition distance. For example, consider two assignments of (3, 1, 2) and (2, 2, 1). Their RGF values are (1, 2, 3) and (1, 1, 2), respectively. We choose the first one to relabel and fix the second. Among six possible assignments (1, 3, 2) differs from (1, 1, 2) in only one individual. The partition distance is 1.

When there are more than two populations, and they are connected by a population tree, not all labels are interchangeable. In this case we use the actual assignment samples without RGF conversion. For example, with three populations only two of six (3!) permutations are valid with a population tree of ((1, 2), 3) [*i.e.*, (1, 2, 3) and (2, 1, 3)] because only these two permutations place population 3 as the outgroup to the sister pair of 1 and 2. If a binary tree of populations has *K*_{s} pairs of sister present-day populations, then there are only permutations that provide equivalent labelings. All of these labelings fall equivalently onto the same population tree. An example is shown in Table 1 for five assignment samples for a problem with six individuals and a population tree in which populations 1 and 2 are the most closely related: ((1 ,2), 3). We identify the partition distance that is constrained by the population tree, as the “tree-constrained partition distance”.

#### Determining the mean assignment:

The method of mean assignment developed by Huelsenbeck and Andolfatto (2007) is to summarize a posterior sample of assignment values by minimizing the squared partition distance to the sampled assignment values. To extend the idea of mean assignment to the case where we are given a population tree constraint, we define the tree-constrained mean assignment as the assignment that minimizes the tree-constrained partition distance to all sampled assignments. The tree-constrained mean assignment minimizes the tree-constrained partition distance, rather than the square of it, as used by Huelsenbeck and Andolfatto, because squared tree-constrained partition distances can be inconsistent with the partition distance. For example, using the total sample of size 5 in Table 1, the sum of the squared tree-constrained distances of assignment A is larger than the one using assignment B although assignment A is closer to the five samples than is assignment B.

#### Quantifying assignment uncertainty:

The unit of a partition distance is the number of individuals, and as such it is fairly easy to interpret a given value. For example, consider a case of samples from two populations for which the true assignment is known. Then a distance between a sampled assignment and the true value that is half the number of individuals in the study would suggest that the assignment was no better than random with respect to the true value (*i.e.*, half the individuals are correctly assigned and half are not).

The partition distance also lends itself to quantifying each individual’s assignment uncertainty, which we define as the proportion of sampled assignment values in which that individual must be removed in the calculation of the distance between an assignment and the mean assignment. The more often an individual is removed in the procedure of computing assignment distance, the less confident we can be that the individual is assigned to the population in which it occurs in the mean assignment. Another useful measure of assignment uncertainty is the variance of a mean assignment, which is the mean squared partition distance between sampled assignment values and the mean assignment (Huelsenbeck and Andolfatto 2007). Similarly the standard deviation, in units of individuals, is the square root of the variance. This measure can serve as an indication of the number of individuals that are typically uncertain in their assignment.

### Joint estimation of assignment and demographic parameters

We developed two methods for jointly estimating assignment and population-specific demographic parameters. The first algorithm, joint demography and assignment (JDA), can be applied to an island model or a two-population model with a single splitting time. The second algorithm, joint demography and assignment with population tree (JDAP), can be applied when there are more than two populations on a population phylogeny. To understand these we first turn to the matter of “label switching”, in which the effective identity of populations changes with assignment (Stephens 2000). Consider a case of four individuals, with individuals 1 and 2 assigned to population A and individuals 3 and 4 assigned to population B. In a model with just two populations this assignment is equivalent to one in which individuals 3 and 4 are assigned to A and 1 and 2 are assigned to B. Failing to identify this equivalence will prevent the joint estimation of demographic parameters and assignment. Figure 3 shows a simple case with two gene copies in two populations. Note that in Figure 3, A and B are equivalent, but reversed with respect to population labels 1 and 2. If we simply use the two genealogies without any modification, population 1 is represented by gene 1 in Figure 3A and by gene 2 in Figure 3B. If gene 1 and gene 2 are sampled from two different populations, then we wish to estimate parameters that are associated with one population using genealogies in which that population contains either gene 1 or gene 2 but not a mixture of both. To accommodate these kinds of equivalencies, by following Stephens (2000) we relabel the genealogy in either Figure 3A or 3B by swapping the labels of populations 1 and 2 as well as the population labels for internal nodes C and D for this genealogy. The swapping operation on genes with labels 1 and 2 is denoted by ν(**A**) and the additional swapping of internal nodes is denoted by ν(**G**). Because the posterior distribution of assignment and genealogies is invariant to population label changes, under an island model and under a two-population IM model (see *Appendix B*), we can find the estimate and the list of permutations ν_{1}, … , ν* _{J}* that jointly maximizes (see Equations 1 and 2). This maximization is reminiscent of Algorithm 4.1 of Stephens (2000). Because this approach would be extremely slow due to the very large number of searches for the highest posterior probability, we use an approximation based on the mean assignment.

#### JDA algorithm under an island model:

Rather than maximizing the joint posterior density, for each of all possible permutations of sampled assignments we use the mean assignment and partition distance to approximate a list of permutations, which are then used to calculate the joint posterior density. The algorithm proceeds as follows:

Step 1: Find the mean assignment of a posterior sample of

*J*assignments (**A**_{1}, … ,**A**)._{J}Step 2: Identify the sequence of permutations {ν

}_{j}_{j}_{∈{1,…,}_{J}_{}}such that for each**A**is the smallest among the possible_{j}*K*! relabelings.Step 3: In Equation 2 replace π(

**Θ**|**G**,_{j}**t**) with π(**Θ**|ν(**G**),_{j}**t**) to estimate the demographic parameters.

Note that we use the same prior for all population sizes or migration rates for populations that are *a priori* interchangeable, which renders the prior of the relabeled genealogy π(ν(**G**)) equal to the original prior π(**G**) (see *Appendix B*).

#### JDAP algorithm:

When sampling assignments from the posterior distribution we are also implicitly sampling the population tree that is associated with each of the assignments (see Equations 8 and 9). Demographic parameters can be meaningful only under a particular population tree. If two population trees are different, the ancestral population in one of the two population trees can lose its meaning in the other tree. Because population assignment and population tree are confounded, we first find disjoint sets of a posterior sample of population assignment on the basis of its proximity to population trees implied by population assignment. For a population tree with a considerable size of population assignment, we estimate demographic parameters that are pertinent to the population tree. For example, assignment sample 4 in Table 1 has only individual 1 in population 3 while assignment samples 1, 3, and 5 have individuals 5 and 6 assigned to population 3. Because population 3 is the outgroup in the population tree in the model ((1, 2), 3), these two groups of sampled assignments imply different phylogenies for the actual populations from which these samples came. Similarly the two groups of samples invoke different meanings for the demographic parameters associated with the outgroup population 3. Because the implied phylogeny and the meaning of the demographic parameters change with sampled assignment values, we use an algorithm for estimating the posterior density in which genealogy and assignment samples are grouped on the basis of which population tree they are most strongly associated with. To find the most close population tree of a population assignment we use the measure of tree-constrained partition distance and tree-constrained mean assignment. For three populations there are three groups of samples corresponding to the three rooted population trees. The algorithm begins with a categorization and proceeds as follows:

Step 1: Find the mean assignment of a posterior sample of

*J*assignments (**A**_{1}, … ,**A**)._{J}Step 2: Identify the sequence of permutations {ν

}_{j}_{j}_{∈{1,…,}_{J}_{}}such that is the smallest among the*K*! possible relabelings for each**A**._{j}Step 3: Categorize each sampled assignment and genealogy,

**A**and_{j}**G**, to groups on the basis of the corresponding population trees by using the sequence of permutations {ν_{j}}_{j}_{j}_{∈{1,…,}_{J}_{}}.Step 4: That population tree that occurs most often in the categorized sample is considered to be that with the highest posterior probability. Only those assignments and genealogies that are associated with this tree are retained, and these are renumbered from 1 to

*J*_{c}, where*J*_{c}is the total number of the retained values of**A**and**G**. The demographic parameters of the population tree are estimated using just the categorized subsample of genealogies because other genealogies may be related to disparate population trees. The remainder of the JDAP algorithm is similar to the JDA algorithm, but is adapted to the case of an IM model with a population tree.Step 5: Find the tree-constrained mean assignment of the categorized subsample of assignments with the constraint of the best supported tree.

Step 6: Identify the sequence of permutations such that is the smallest among the possible relabelings for each

**A**._{j}Step 7: Maximize Equation 2, with π(

**Θ**|**G**,_{j}**t**) replaced by π(**Θ**|ν(_{j}**G**),_{j}**t**) to find the demographic parameter estimate that is associated with the population tree

We have not yet implemented the case where four or more populations are related by a binary species tree. In these cases the underlying topology of the population tree will also be variable.

#### Computer programs:

The IMa2 computer program (Hey 2010b) was modified to incorporate the methods described here, using a uniform prior on assignment. Data were simulated using a coalescent simulator program SIMDIV (Wang and Hey 2010) that generates data under IM models. These programs are available at http://genfaculty.rutgers.edu/hey/software. We also used the STRUCTURAMA program (Huelsenbeck and Andolfatto 2007) as a representative allele-based method. STRUCTURAMA implements a method that is similar to that developed by Pritchard *et al.* (2000), which is implemented in the STRUCTURE program. However STRUCTURAMA employs an analytic integration over allele frequencies, rather than Gibbs sampling of allele frequencies, and it does not require that the user specify the number of populations. STRUCTURAMA is also unique among population assignment programs in that it provides estimates of the mean assignment, which we use for comparison to the methods described here. We confirmed that for a fixed number of populations STRUCTURAMA uses a uniform prior on assignment by running the program for a fixed four-population model without data. The results revealed a uniform posterior probability of assignment, equal to the prior in the absence of data (results not shown).

## Data

Four groups of simulated data sets were used to assess how well the new methods perform and to compare results for the new methods with a method that does not include a demographic model. We also analyzed and compared results for two real data sets drawn from the literature.

### Simulated data

Simulated data sets were based on a two- or three-population IM model with parameters for sampled population and ancestral population sizes, splitting times and migration rates (Nielsen and Wakeley 2001; Hey and Nielsen 2004). Data were simulated under the infinite sites mutation model. MCMC conditions (in terms of burn-in duration, length of chain, heating scheme, and numbers of Metropolis-coupled chains) were determined on the basis of preliminary runs for representative simulated data sets to ensure samples of effectively independent genealogies and assignments. Results were compared to those obtained for the same data using the program STRUCTURAMA, which did not incorporate a mutation model or a demographic model. Because STRUCTURAMA required allelic data, it was necessary to convert simulated DNA sequences to alleles (*i.e.*, each distinct sequence was given an allele designation). This conversion of sequences to alleles necessarily incurred loss of information. Because the new method uses the infinite sites model and does not require that sequence data be reduced to alleles, we expected that it would do a better job inferring assignment than one that does not.

#### Simulation set 1—varying the number of loci:

Data sets were simulated with varying numbers of loci with demographic parameters as follows: θ_{1} = θ_{2} = θ_{A} = 1, *t* = 0.1, and *m*_{12} = *m*_{21} = 0.1. A single gene copy was sampled from each of 20 individuals for each population, and the number of loci, *L*, ranged from 1 to 10. For each value of *L*, 50 independent data sets were generated. Prior distributions for parameters were uniform from zero to a maximal value: 10 for population mutation rates, 1 for population migration rates, and 3 for splitting time.

#### Simulation set 2—varying population size, migration rates, and splitting time:

Several values were considered for each of the parameters of the two-population IM model: population mutation rate θ’s were set to 0.5, 1.0, or 2.0 (in each case all three populations had the same size); population migration rates, *m*_{12} × θ and *m*_{21} × θ in both directions were set to 0.0, 0.1, or 1.0; and scaled splitting time, *t*/θ, was set to 0.05, 0.10, or 0.20. Twenty gene copies were sampled from each population, with four loci per data set; and 50 data sets were simulated for each of the 27 sets of parameter values. Parameter priors were the same as in simulation 1 except for the maximum of splitting time *t* being set to 10.

#### Simulation set 3—joint estimation of assignment and demographics:

To assess the quality of joint estimates of assignment and demography, data sets were simulated under models with recent divergence and/or high gene flow. Such models yield assignments with high uncertainty and it is in these types of situations where we want to see how well demography can be estimated together with assignment. Two models were used, one with no migration and splitting time set to *t*/θ* _{A}* = 0.05 and a second with migration rates of 0.5 in both directions and a splitting time of

*t*/θ

_{A}= 0.10. For both models population size parameters were set to unequal values (θ

_{1}= 1, θ

_{2}= 3, and θ

_{A}= 2). Each data set consisted of 10 loci with 20 gene copies from each sampled population, and 50 independent data sets were generated for each set of parameter values.

#### Simulation set 4—implicit inference of population tree:

To assess how well sampled assignment values can reveal the population tree, we consider 12 three-population IM models. We assumed that all of the five populations (three present and two ancestral) share the same population size, with θ set to 1 in all of the simulations. The older splitting time was fixed to 0.5 in all of the 12 cases, while the most recent splitting time took on values of 0.1, 0.2, 0.3, and 0.4. Instead of just two migration rates (as is the case in a two-population IM model) a three-population model has eight migration rates (Hey 2010b). We considered three values for population migration rates (0, 0.1, and 1), in each case setting all of the migration rates to the same value. Upper bounds on the prior distributions were 3 for θ, 2 for *t*, and 2 for *m*. Each data set consisted of seven gene copies at each of four loci. We generated 30 replicates for each of the 12 parameter sets.

### Real data sets

Two data sets drawn from the literature are described below and listed in Table 2 with the options used for the IMa2 and STRUCTURAMA programs. Unlike the simulation studies we do not know with certainty that the actual species assignments, in the original reference for each empirical data set that was used, are correct; and we acknowledge that apparently “incorrect” or uncertain assignment estimates could be due to mistaken assignments in the original study. Most of the loci in the data sets were diploid with two gene copies per locus sampled per individual. For the purposes of this paper the loci were treated as haploid with one gene copy sampled from each individual at each locus.

#### Mouse data set:

Geraldes *et al.* (2008) studied the divergence of mouse species using eight loci. We focused on samples for *Mus domesticus* and *M. castaneus* with seven of the loci, excluding the mitochondrial control region locus because of lack of individual information for these species at this locus. The total number of individuals was 113: 55 from *M. domesticus* and 58 from *M. castaneus*. A two-population IM model with the infinite sites mutation model was used.

#### Chimpanzee data set:

Fischer *et al.* (2006) studied divergence among chimpanzee taxa, including West (*Pan troglodytes verus*), East (*P. t. schweinfurthii*), and Central African common chimpanzees (*P. t. troglodytes*) and bonobos (*P. paniscus*). The population tree of the four populations was estimated as ((Eastern, Central), Western), bonobo) (Becquet *et al.* 2007; Caswell *et al.* 2008). We used a partial data set of 10 loci under the infinite sites mutation model. The total number of individuals was 39: 9 from bonobos and 10 from each of the chimpanzee populations. Analyses were conducted under a four-population islands model as well as an IM tree model of the three populations of common chimpanzee. In the three-population analysis, without the bonobo, 1 of the 10 loci was not variable and so was excluded.

## Results

### Simulation study

#### Simulation set 1:

The effect of varying the number of loci is shown in Figure 4. With 20 individuals from each population the largest possible partition distance is 20. Although results for both methods show very little resolution for assignment with small numbers of loci, the trends are useful for seeing how the methods compare as the number of loci is increased. For both IMa2 and STRUCTURAMA the distances from the true value become smaller with more loci, and both programs start recovering the true assignment (*i.e.*, some data sets had a mean assignment equal to the true value) with data sets of seven loci. The two programs performed qualitatively similarly, with the median distances from the true value for IMa2 runs consistently less than those for STRUCTURAMA runs.

#### Simulation set 2:

Table 3 shows the effect of varying the parameters of a two-population IM model for a model with four loci. Assignment accuracy decreased for more recent divergence times, larger migration rates, and smaller present-day population sizes. To compare the assignment performance of the two methods we counted the number of data sets in which IMa2 inferred assignments better than, worse than, or as good as STRUCTURAMA. One method performed no better than the other in the cases of θ = 0.5 or 1.0 and those of *t*/θ = 0.05 or 0.10. IMa2 outperformed STRUCTURAMA in the cases of θ = 2 and those of *t*/θ = 0.2.

#### Simulation set 3:

Table 4 shows how well the method performed for the joint estimation of assignment and demography. For *t* = 0.05 and no migration, the distance between the mean and true assignments was 13.68, which was quite large for a total sample of 40 gene copies per locus. Under this model the estimates of *t* tended to be much lower when assignment was variable than when it was fixed. Also under this model, the upper 95% quantile for θ_{1} was much higher (near the upper bound for this parameter) when assignment was variable. The same patterns for *t* and θ_{1} were seen for the second model in which the splitting time was greater (*t* = 0.1) but there was substantial gene flow. Under this model the distance between the mean and true assignments was 6.64. In both of these models, regardless of whether assignment was fixed or variable, the estimated posterior distributions for migration rates were nearly flat and the population size estimates had a substantial bias, with estimated values tending to be between 65% and 85% of the true value. In short, the conditions of low divergence and modest data set size that made assignment difficult to estimate were also those in which it was difficult to estimate demographic parameters.

#### Simulation set 4:

Table 5 shows results for inferring the tree with three populations, using the three-point-turn update, for different migration rates and splitting times. As expected, the probabilities for the correct tree were highest for lower migration rates and when the time between the two splitting events was greater.

### Real data analysis

#### Mouse data set of Geraldes *et al.* (2008):

Both IMa2 and STRUCTURAMA estimated the true assignment correctly. The average squared distance of the assignment to posterior samples for IMa2 was 0.6. The average squared distance using STRUCTURAMA was considerably higher at 9, which was equivalent to a standard deviation of three individuals having assignment different from the mean assignment. In a model in which population assignment was an unknown, the estimated demographic parameter values were similar to those obtained in a model in which population assignment was fixed (Table 6). This demonstrated that assignment could be studied jointly with demography, but was not surprising in this case because population assignment was estimated correctly with low uncertainty. Population size of *M. domesticus* was estimated to be smaller than half the population size of *M. castaneus*. Their ancestor was estimated to be smaller in population size than *M. castaneus* and larger than *M. domesticus*. The migration rate was estimated to be nonzero in the direction of *M. domesticus* to *M. castaneus*. These patterns were very similar to those reported by Geraldes *et al.* (2008) for the full data set for these species.

#### Chimpanzee data of Fischer *et al.* (2006) under an island model of population structure:

In Table 7 the column labeled “IMa2” shows the mean assignment for these data under an island model for four populations. While most individuals were assigned to their reported subspecies, IMa2 assigned East African common chimpanzee individuals Alley and Sultana to the same population as all of the Central African chimpanzee individuals. The average squared distance from the mean assignment under IMa2 was 11, corresponding to a standard deviation of 3.31 individuals. Individual assignment uncertainties for the East and Central African chimpanzees were larger than those for the bonobo and West African populations, with bonobos having the smallest assignment uncertainties. Three chimpanzees that were of higher individual assignment uncertainty were Alley, Judy, and Sultana among which Alley and Sultana were incorrectly assigned. STRUCTURAMA was also run on the chimpanzee data, with a setting of a maximum of four populations. Interestingly only three populations were inferred, with all of the Central and East African individuals estimated to have come from a single population (Table 7). The average squared distance of the mean assignment was 3. Individual assignment uncertainties for bonobo and Central and West African populations were smaller than those for East African populations. The four chimpanzee populations had estimated divergence times that varied over approximately an order of magnitude (Won and Hey 2005; Becquet *et al.* 2007; Caswell *et al.* 2008; Hey 2010a). To see whether the inclusion of the most divergent populations might have contributed to the failure by STRUCTURAMA to resolve four populations, a two-population model with STRUCTURAMA was run with only the 20 chimpanzees of the Central and East African populations. In this case [Table 7, column ST(2)] two populations were inferred rather than a single population. Three East African chimpanzees, Alley, Judy, and Sultana, were assigned to Central African chimpanzees, and Chiquita and Clara from the Central African population were assigned to the East African population. The average squared distance of the assignment to the posterior sample was 32. Individual assignment uncertainties were rather large. For the case of a four-population island model the population size estimates obtained jointly with assignment were in three cases about one-third lower than the values estimated under a fixed true assignment (Table 8). Central African chimpanzee population size was estimated to be larger than that estimated from the fixed assignment presumably because IMa2 assigned a few more individuals of the East African population to the Central African population. However, the order of magnitude of population size estimates remained unchanged: Central African, East African, bonobo, and West African populations in decreasing order of estimated value.

#### Chimpanzee data set of Fischer *et al.* (2006) using IM with a population tree:

With data from three subspecies of common chimpanzees (*i.e.*, without using the Bonobo data) we used the JDAP algorithm to jointly estimate assignment and the population tree. The resulting mean assignment [see the IMa2(3) column in Table 7] was correct with the exceptions that it allotted Alley and Judy to the Central African population and Sultana to the East African population. These same individuals were also uncertain in their assignment under the island model. We also used the total posterior sample of assignments to find the tree-constrained mean assignment of which was the same as the mean assignment of except that Sultana along with Alley and Judy was misassigned to the Central African population (data not shown). Individual assignment uncertainties of West African chimpanzees were relatively small, and those of East and Central African chimpanzees were large. The true population tree was most supported by the posterior sample of assignments among the three population trees that were equally likely *a priori*. We used each of the three sets of genealogy samples to estimate demographic parameters for each corresponding population tree (Table 9). The two splitting events were farther apart in the true population tree than in the other wrong trees. East African population size was underestimated, and Central African population size was overestimated presumably because the three individuals, Alley, Judy, and Sultana, of the East African population were uncertain in their individual assignment.

## Discussion

Investigators in many contexts are often faced with the situation of having data that have multiple sources of variation. Depending on the state of theory in the particular field, it may be possible for these sources of variation to be modeled with parameters. But without a way to simultaneously consider the different causes of variation, analyses may need to be done in a piecemeal fashion. For example, an analysis may begin with the estimation of just one part of the full model and then proceed by plugging this estimate into a second analysis to address a different part of the model. In this way a series of conditional estimates can be obtained, all but the first of which are at added risk of error because of conditioning on previous estimates. In just this way it is common for investigators to first estimate population assignment, using a method that maximizes Hardy–Weinberg and linkage equilibria, and then use these estimates to conduct additional analyses (*e.g.*, Sacks *et al.* 2004; Coulon *et al.* 2006; Bergl and Vigilant 2007).

The methods described here permit investigators to avoid a series of separate but interdependent analyses and to jointly study population assignment and demographic history. In addition to estimated assignments and estimates of demographic parameters, investigators can ask how uncertainty in assignment varies across the demographic history that has been estimated. It is expected that recently formed populations may be estimated with less certainty; but by studying the age of population formation together with assignment, this interaction can be examined directly. Similarly populations that have been exchanging genes may give rise to data that are not easily assigned, but nevertheless it is possible to study assignment and gene exchange together.

The isolation-with-migration model is flexible with regard to mutation models and demographic histories, particularly when multiple sampled populations are included. However, the general approach that we describe, of including population assignment as part of the genealogy in an MCMC analysis, is one that could be adapted to other kinds of demographic models. Recently, Yang and Rannala (2010) described a method for jointly estimating phylogeny, assignment, and population size parameters. Their method assumes that there has been no migration, unlike the applications described here, but it can be applied to data sets with more populations and a larger phylogeny (with the aid of a user-supplied guide tree) than described here (Yang and Rannala 2010).

These methods are also flexible in that it is straightforward to include both data from individuals with unknown assignment and data from individuals whose population assignment status is known. In this way the general assignment problem can be seen to include both the classification of individuals to previously identified populations (a procedure that is in some contexts known as an assignment test) and the discovery of previously unidentified populations. At one extreme we might have a DNA barcoding problem, where an assignment of a single individual is to be determined against a backdrop of previously assigned data (Matz and Nielsen 2005; Nielsen and Matz 2006), and at the other extreme an investigator might have little previous knowledge of how many populations exist.

Our studies with simulated and real data show that it is possible to estimate assignment together with demographic history and that assignment estimates are improved when they are obtained under an IM model, relative to the case without. However, it is important to point out that the comparisons between IMa2 and STRUCTURAMA, while they represent a contrast of with and without a demographic model, also include the contrast of *with* and *without* a mutation model. STRUCTURAMA, like most other population assignment methods, uses an allelic model of variation, whereas the simulated and real data in the present studies were based on mutation models for DNA sequence variation. To run STRUCTURAMA, or any similar allele-based method, requires pruning the data so that each distinct DNA haplotype is reduced to an allelic representation. Thus we do not know how much of the difference in assignment estimation between STRUCTURAMA and IMa2 is due to the inclusion of a demographic model in the latter and how much is due to the inclusion of a mutation model.

### Assignment and phylogeny

In any demographic model with more than two populations, and that includes a phylogenetic component, the estimation of assignment entails the estimation of the population phylogeny. In this article we studied this issue in detail for the case of three-population models. This required that we extend the ideas of partition distance (Almudevar and Field 1999; Gusfield 2002; Konovalov *et al.* 2005) and mean assignment (Huelsenbeck and Andolfatto 2007) to tree-constrained partition distance and tree-constrained mean assignment. We are then able to use tree-constrained partition distances in the JDAP algorithm to jointly estimate assignment, demographic history, and population phylogeny. When we applied these methods to three subspecies of chimpanzees, they returned the true phylogenetic tree and simultaneously provided population assignments that were more accurate, and that came with less uncertainty, than those found using the STRUCTURAMA program.

### Limitations

A major limitation of the methods presented here is that they are computationally quite slow, particularly when compared to the speed of programs implementing allele-based methods. The run time for the two real data sets ranged from 40 to 129 hr, many times that required by STRUCTURAMA for the same data. When included as part of the Markov chain simulation, population assignment takes time to update and can significantly impede the mixing of the Markov chain. The mixing issue can be significantly mitigated by the addition of Metropolis-coupled chains, but the resulting analysis can be quite slow. It is for this reason that the analyses described here are for modestly sized data sets.

The assumptions of the IM models adapted here all stem from those adopted by Nielsen and Wakeley (2001) in their original MCMC method for this model. In particular these include the assumptions of selective neutrality, free recombination within loci, and zero recombination within loci. Allele-based methods that minimize Hardy–Weinberg also, at least implicitly, assume that selection has not been a factor causing a departure from Hardy–Weinberg. They do not require an assumption of no recombination within loci, because in an allelic context intragenic recombination is just an additional source of new alleles, and they can be extended to handle cases of restricted recombination between loci (Falush *et al.* 2003). Strasburg and Rieseberg (2010) examined the case of a failure of the assumption of zero recombination in IM analyses using the IMa program (Hey and Nielsen 2007) and found that the method was fairly robust to modest levels of recombination when data are reduced to nonrecombining blocks, as determined by the four-gamete test (Hudson and Kaplan 1985).

## Acknowledgments

We thank Yong Wang, Jongwoo Jung, Kasper Munch, and Rasmus Nielsen for helpful discussions and Armando Geraldes and Michael Nachman for the mouse data set. This research was supported by grants to J.H. from the National Institutes of Health (GM078204) and the National Science Foundation (DEB-0949561).

## Appendix A

### Genealogy Updating Using Estimates of the Population Mutation and Migration Parameters From the Current Genealogy

The probability of a genealogy, given parameter values under a two-population IM model, is given in Equation 15 of Hey and Nielsen (2007) and in (A1) of the supporting information of that article. This probability is a function of the numbers of a series of terms calculated from the genealogy, including coalescent events (*c*_{1}, *c*_{2}, *c*_{A}), total coalescent rates (*f*_{1}, *f*_{2}, *f*_{A}), numbers of migration events (*w*_{1}, *w*_{2}), and total migration rates (*g*_{1}, *g*_{2}), where the subscripts refer to the corresponding population (A refers to the ancestral population). By taking the derivative of this function with respect to a parameter, we find the individual maximum-likelihood estimates for each parameter associated with that genealogy. The estimate of θ_{1} is derived asand for the parameter *m*_{12} asA nonzero value of 1.0 is used for *m*_{min} because a value of zero violates the MCMC criterion that an update to the genealogy be reversible.

With values for the demographic parameters, we used expressions (3) and (7) of Beerli and Felsenstein (1999) to simulate the time to the next event in the genealogy. We choose an external branch of a gene tree and label the tip of the branch to other populations at random. We detach the branch from the genealogy. We divide the time duration from the present to the root of the genealogy by events including coalescent, migration, and population splits. For each time interval starting at the present we sample the next event and time using expression (3) of Beerli and Felsenstein (1999). If the time sampled is larger than the current time interval, we skip the current time interval and move to the next time interval. The simulation of events ends in a coalescent event. If the simulation does not complete even in the last time interval before the time at the root, we use expression (7) of Beerli and Felsenstein (1999) to sample the next event and time. One difference between the case of Beerli and Felsenstein (1999) and that of ours is that we need to consider population splitting events. In other respects the procedure follows Beerli and Felsenstein (1999).

## Appendix B

### Population Relabeling Preserves the Posterior Distribution of Genealogy and Splitting Time

We wish to show that permutations of assignment labels change neither the prior distribution nor the posterior distribution of a genealogy under a two-population IM model. The following derivation is based on Equation A1 in the supporting information to Hey and Nielsen (2007) that describes the terms in the probability of genealogy and population splitting time given model parameters π(*G*, *t*|**Θ**), where **Θ** = {θ_{1}, θ_{2}, θ_{A}, *m*_{1}, *m*_{2}}. For calculating the prior probability, all of the information in a genealogy in an IM model can be described as a vector with elements being numbers of coalescent events (*c*), total coalescent rates (*f*), numbers of migration events (*w*), and total migration rates (*g*) or *G* = (*c*_{1}, *f*_{1}, *w*_{1}, *g*_{1}, *c*_{2}, *f*_{2}, *w*_{2}, *g*_{2}, *c*_{A}, *f*_{A}) (Hey and Nielsen 2007) (see *Appendix A*). We relabel the genealogy by swapping the terms for populations 1 and 2,where ν is an operator that relabels the ordered list of numbers that represents the given genealogy. The probability of genealogy given model parameters π(*G*|Θ) may differ from that of the relabeled genealogy, π(ν(*G*)|Θ), given the same model parameters. However, if the prior probabilities for the parameters for population 1 are the same as those for population 2 [*i.e.*, π(θ_{1}) = π(θ_{2}) and π(*m*_{1}) = π(*m*_{2})], then the prior of the genealogy is equal to the prior of the relabeled genealogy:

Turning to the posterior density, we note that the likelihood does not depend on assignment, *f*(*X*|*G*) = *f*(*X*|ν(*G*)), and therefore the posterior distribution of the genealogy given the data is invariant to label permutation:Although we assume the two-population IM model, the same argument applies to the multipopulation island model. In the case of a three-population IM model there are three possible labeled tree topologies, but the symmetry applies only to the two sister populations. In general for trees with multiple populations we can relabel genealogies only via swapping for populations that are each other’s sisters.

## Footnotes

*Communicating editor: L. Excoffier*

- Received March 31, 2011.
- Accepted July 7, 2011.

- Copyright © 2011 by the Genetics Society of America