## Abstract

High-throughput genotyping and sequencing technologies can generate dense sets of genetic markers for large numbers of individuals. For most species, these data will contain many markers in linkage disequilibrium (LD). To utilize such data for population structure inference, we investigate the use of haplotypes constructed by combining the alleles at single-nucleotide polymorphisms (SNPs). We introduce a statistic derived from information theory, the *gain of informativeness for assignment* (GIA), which quantifies the additional information for assigning individuals to populations using haplotype data compared to using individual loci separately. Using a two-locus, two-allele model, we demonstrate that combining markers in linkage equilibrium into haplotypes always leads to nonpositive GIA, suggesting that combining the two markers is not advantageous for ancestry inference. However, for loci in LD, GIA is often positive, suggesting that assignment can be improved by combining markers into haplotypes. Using GIA as a criterion for combining markers into haplotypes, we demonstrate for simulated data a significant improvement in the assignment of individuals to candidate populations. For the many cases that we investigate, incorrect assignment was reduced by between 26% and 97% using haplotype data. For empirical data from French and German individuals, the number of incorrectly assigned individuals can, for example, be decreased by 73% using haplotypes. Our results can be useful for challenging population structure and assignment problems, in particular for studies where large-scale population–genomic data are available.

Structure of populations and the assignment of individuals to populations have attracted considerable attention in population genetics, conservation biology, and ecology (Pritchard *et al.* 2000; Beaumont 2004; Manel *et al.* 2005; Platt *et al.* 2010). Since the introduction of Wright’s *F*_{ST} (Wright 1921, 1943), numerous studies of population structure have been conducted for a multitude of species, using a variety of genetic or phenotypic markers. The recent development of high-throughput genotyping and sequencing technologies has resulted in a substantial increase in studies of population structure that are based on a large number of markers (*e.g.*, Jakobsson *et al.* 2008; Platt *et al.* 2010; Vonholdt *et al.* 2010). At the same time, powerful clustering methods have been developed to infer population structure on the basis of multilocus genetic data (*e.g.*, Pritchard *et al.* 2000; Dawson and Belkhir 2001; Corander *et al.* 2003; François *et al.* 2006; Huelsenbeck and Andolfatto 2007; Alexander *et al.* 2009).

For most species, individuals rarely reproduce at random and this can create genetically differentiated subgroups within a population or species. Geographic barriers such as mountains, rivers, and oceans can furthermore hinder random mating, thereby causing populations to be structured (Hale *et al.* 2001; Rosenberg *et al.* 2005). In humans, cultural differences, such as language or religious beliefs, may play an additional role in shaping structure among individuals (Cavalli-Sforza and Feldman 2003; Behar *et al.* 2010; Bryc *et al.* 2010). Large efforts have been made to characterize population structure, both at the global level (*e.g.*, Rosenberg *et al.* 2002; Jakobsson *et al.* 2008; Li *et al.* 2008) and at smaller scales (*e.g.*, Rosenberg *et al.* 2006; Wang *et al.* 2007; Friedlaender *et al.* 2008; Novembre *et al.* 2008; Segurel *et al.* 2008; Reich *et al.* 2009; Tishkoff *et al.* 2009). Although population structure can give important information on the demographic history of a species and may lead to better understanding of evolutionary processes, population structure may also complicate certain investigations. For example, cryptic population structure can lead to false positives in association studies (Marchini *et al.* 2004). Another problem may arise in forensics: if a suspect originates from a population that is genetically differentiated from the reference population, the difference in allele frequencies may lead to incorrect conclusions about matching DNA evidence to a suspect (Balding and Nichols 1994; Weir 1996; Aitken and Taroni 2004).

Assignment methods, in contrast to clustering methods, use prior knowledge about candidate groups in addition to genetic data to assign individuals of unknown origin to groups (Paetkau *et al.* 1995; Manel *et al.* 2005). These methods have been extensively used for conservation management (see, *e.g.*, Wasser *et al.* 2004; Gaskin *et al.* 2009) and parentage analysis (see, *e.g.*, Nielsen *et al.* 2001). Methods that focus on finding potential hybrids of particular types (*e.g.*, first-generation offspring and backcrosses) have also been developed (Anderson and Thompson 2002) and used for identifying hybrids between closely related species (Adams *et al.* 2007).

High-throughput sequencing and genotyping methods have generated dense sets of single-nucleotide polymorphisms (SNPs) for large samples of individuals for several organisms. Linkage disequilibrium (LD) is strong for many SNPs in these dense sets (for most species), and these SNPs are therefore not independent markers. To overcome the problem of LD, some studies prune the set of SNPs before inferring population structure (*e.g.*, Novembre *et al.* 2008; Bryc *et al.* 2010) and some studies analyze subsets of markers and combine the results for different subsets (Jakobsson *et al.* 2008). These approaches to overcoming the problem caused by closely linked markers do not take full advantage of all the information provided by the large number of SNPs. Instead, it may be possible to combine SNPs into haplotypes, which may integrate extra information about ancestry, potentially from recombination events that should in principle harbor information about ancestry similar to mutation events. A previous study utilized haplotypes for revealing population structure, and its results point to somewhat different inferences of population structure from SNPs and from haplotypes (Jakobsson *et al.* 2008). Using simulations, Morin *et al.* (2009) demonstrated greater power of population structure inference using haplotypes in many, but not all, cases. However, it is unclear whether, and under which conditions, haplotypes can be more powerful than single SNPs for inferring population structure or assigning individuals to populations.

In this article, we first investigate whether haplotype data can increase the statistical power of assigning individuals to populations compared to SNP data. Second, using a newly developed statistic, the *gain of informativeness for assignment* (GIA), we characterize under which circumstances it may be advantageous to use haplotypes compared to using SNPs for ancestry inference. Third, we demonstrate by simulations and by using empirical SNP data from Europeans that assignment of individuals significantly improves through combining SNPs into haplotypes guided by GIA.

## Theory

We define a “haplotype locus” as the combination of more than one SNP locus. The SNP loci in a haplotype locus are not required to be consecutive along the chromosome. We define a “haplotype allele” as a particular combination of alleles at the SNP loci constituting the haplotype locus. For instance, for a haplotype locus formed by *x* SNPs, 2^{*x*} distinct alleles can exist, but the number of observed haplotype alleles is typically much smaller than 2^{*x*} if *x* is reasonably large. In addition, the number of distinct haplotype alleles is upwardly bounded by the sample size.

To develop a statistic that quantifies under which circumstances it is advantageous for ancestry inference to combine markers into haplotype loci, we start by considering a model of two multiallelic loci denoted locus *A* and locus *B*. The combination of the two loci into a haplotype locus is denoted locus *H*, and the possible haplotype alleles are the combinations of alleles from locus *A* and locus *B* (see Figure 1 for notation). Note that this model can be generalized to handle any number of markers by recursively merging two loci into one multiallelic haplotype locus. Loci *A* and *B* may be in LD, which can, for example, be quantified with the *D* statistic (Lewontin and Kojima 1960). We consider *K* randomly mating populations and we assume that the allele frequencies at each locus in each population are known.

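Rosenberg *et al.* (2003) derived a criterion on the basis of information theory to evaluate the efficiency of a marker for assigning individuals to one of *K* populations. This criterion, the *informativeness for assignment* (IA), can be computed for bi- or multiallelic loci, such as SNPs, microsatellites, or haplotype loci,

$$\mathrm{IA} = \sum_{j=1}^{N}\left(-p_{j}\log p_{j} + \sum_{i=1}^{K}\frac{p_{ij}}{K}\log p_{ij}\right), \tag{1}$$

where *N* is the number of alleles for the locus, *K* is the number of populations, $p_{ij}$ is the frequency of allele *j* in population *i*, and $p_{j}$ is the average frequency of allele *j* across populations,

$$p_{j} = \frac{1}{K}\sum_{i=1}^{K} p_{ij}.$$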
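As a minimal sketch of how IA can be evaluated numerically, the Python function below assumes the Rosenberg *et al.* (2003) form of the statistic with equally weighted populations and the convention 0 log 0 = 0. The function name and the layout of the frequency matrix (one row per population, one column per allele) are illustrative choices, not part of the original study.

```python
import numpy as np

def informativeness_for_assignment(freqs):
    """IA for one locus.

    freqs: array-like of shape (K, N); row i holds the allele
    frequencies of population i and is assumed to sum to 1.
    """
    p = np.asarray(freqs, dtype=float)
    K = p.shape[0]
    p_bar = p.mean(axis=0)  # average frequency of each allele across populations

    def xlogx(x):
        # x * log(x), using log(1) = 0 wherever x == 0 so that 0 * log(0) = 0
        return x * np.log(np.where(x > 0, x, 1.0))

    return float(-xlogx(p_bar).sum() + xlogx(p).sum() / K)

# Identical frequencies in both populations: the locus carries no information.
ia_identical = informativeness_for_assignment([[0.3, 0.7], [0.3, 0.7]])
# One private allele per population: IA reaches its upper bound, log K.
ia_private = informativeness_for_assignment([[1.0, 0.0], [0.0, 1.0]])
```

The two toy inputs bracket the range of the statistic for *K* = 2: identical frequencies give IA = 0, and fully private alleles give IA = log 2.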

Using the IA statistic, we define the GIA as

$$\mathrm{GIA} = \mathrm{IA}(H) - \left(\mathrm{IA}(A) + \mathrm{IA}(B)\right), \tag{2}$$

where IA(*H*) is the informativeness for assignment of the haplotype locus and IA(*A*) and IA(*B*) are the informativeness of locus *A* and locus *B*, respectively. Since IA is nonnegative and bounded upward by log *K*, GIA is restricted to [−2 log *K*, log *K*].

By comparing the information content about ancestry of the haplotype to the sum of the information content of each marker, GIA is specifically designed to answer the question of whether two markers can improve the power of assigning individuals to candidate populations by combining the markers into a haplotype locus. As can be seen from Equations 1 and 2, to compute GIA, we need to know the allele frequencies of the two loci and the allele frequencies of the haplotype locus. When addressing assignment problems, phased data from candidate populations can typically be used to estimate the SNP and haplotype allele frequencies, followed by the use of GIA to determine which loci to combine into haplotype loci for optimal power. Guided by this information, individuals of unknown origin could then be assigned to candidate populations on the basis of haplotype data (see *Results* for explicit examples of this procedure).

GIA is not a simple function of the allele frequencies and the haplotype allele frequencies. For example, the sign of GIA cannot be determined by a simple rule of thumb based on allele frequencies. However, for the special case of biallelic markers, we can show that when two loci are in linkage equilibrium, GIA ≤ 0. To arrive at that result, we note that because the loci are biallelic, only the frequencies of one allele for each locus are needed to characterize GIA. Recall also that *D* can be defined as the difference between the frequency of a haplotype allele and the product of the frequencies of its constitutive alleles so that haplotype allele frequencies in Equation 2 can be replaced by *D* and allele frequencies (*e.g.*, *x*_{11} = *a*_{1}*b*_{1} + *D*).
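The parameterization of haplotype allele frequencies by *D* can be illustrated with a short Python sketch. It assumes that GIA is the IA of the haplotype locus minus the sum of the IAs of the two SNPs, with IA in the equal-weight form of Rosenberg *et al.* (2003). The function names `ia` and `gia_biallelic` and the example frequencies are hypothetical and are chosen only for illustration.

```python
import numpy as np

def ia(freqs):
    """IA for a (K populations x N alleles) frequency matrix, equal weights."""
    p = np.asarray(freqs, dtype=float)
    xlogx = lambda x: x * np.log(np.where(x > 0, x, 1.0))  # 0 * log(0) = 0
    return float(-xlogx(p.mean(axis=0)).sum() + xlogx(p).sum() / p.shape[0])

def gia_biallelic(a, b, D):
    """GIA for two biallelic loci.

    a[i], b[i]: frequency of the first allele at locus A and locus B
    in population i; D[i]: linkage disequilibrium in population i.
    """
    a, b, D = (np.asarray(v, dtype=float) for v in (a, b, D))
    # Haplotype allele frequencies, e.g. x11 = a1 * b1 + D.
    hap = np.column_stack([
        a * b + D,
        a * (1 - b) - D,
        (1 - a) * b - D,
        (1 - a) * (1 - b) + D,
    ])
    if (hap < -1e-12).any():
        raise ValueError("D is outside the admissible range")
    hap = np.clip(hap, 0.0, None)  # remove tiny negative rounding errors
    locus_a = np.column_stack([a, 1 - a])
    locus_b = np.column_stack([b, 1 - b])
    return ia(hap) - (ia(locus_a) + ia(locus_b))

# Linkage equilibrium in both populations, both loci informative.
gia_le = gia_biallelic([0.2, 0.6], [0.3, 0.7], [0.0, 0.0])
# Linkage equilibrium, locus B uninformative.
gia_le_flat = gia_biallelic([0.2, 0.6], [0.5, 0.5], [0.0, 0.0])
# Same allele frequencies, but with D = 0.1 in both populations.
gia_ld = gia_biallelic([0.2, 0.6], [0.5, 0.5], [0.1, 0.1])
```

For these particular inputs the linkage-equilibrium cases give a nonpositive GIA, whereas introducing LD while keeping locus *B* uninformative gives a positive GIA.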

**Theorem. ***Let A and B be two biallelic loci and H be their associated haplotype locus. Consider K randomly mating populations. For population i*, *let* *and* *be the allele frequencies at locus A and locus B*, *respectively. Then*, *for all the frequency distributions of the alleles*,*with equality if and only if*

A proof of the Theorem is given in the *Appendix*. This Theorem demonstrates that when locus *A* and locus *B* are in linkage equilibrium within all populations, the haplotype locus *H* provides less information (or the same amount) for assigning individuals to populations than locus *A* and locus *B* provide when used separately. Intuitively, since there is no correlation between the allele frequencies at locus *A* and the allele frequencies at locus *B*, we expect the combination of alleles into haplotype alleles to arise randomly within each population.
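One way to see why linkage equilibrium cannot help is sketched below. This is offered as an informal argument, not as the proof given in the *Appendix*. It assumes that IA is read as the mutual information $I(Q; X)$ between a population label $Q$, drawn uniformly from the *K* populations, and the allele $X$ carried at a locus; the symbols $Q$, $X$, and $I(\cdot\,;\cdot)$ are introduced here for illustration only.

$$
\begin{aligned}
I(Q; A, B) &= H(A, B) - H(A, B \mid Q) \\
&= \bigl[H(A) + H(B) - I(A; B)\bigr] - \bigl[H(A \mid Q) + H(B \mid Q) - I(A; B \mid Q)\bigr] \\
&= I(Q; A) + I(Q; B) + I(A; B \mid Q) - I(A; B),
\end{aligned}
$$

so that

$$
\mathrm{GIA} = I(Q; A, B) - I(Q; A) - I(Q; B) = I(A; B \mid Q) - I(A; B).
$$

Here $I(A; B \mid Q)$ measures the association between the two loci within populations and $I(A; B)$ measures their association in the pooled, equally weighted mixture of populations. Under linkage equilibrium within every population, $I(A; B \mid Q) = 0$, and therefore $\mathrm{GIA} = -I(A; B) \le 0$, because mutual information is nonnegative.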

### GIA for two populations

We study Equation 2 for the two-population case (*K* = 2) and for two biallelic markers. To reduce the complexity of the problem, we assume that the level of LD is dominated by linkage of the two markers and that the two populations have similar demographic histories, so that *D*_{1} = *D*_{2} = *D*. Five parameters characterize our problem: *D*, *D* and the range of the allele frequencies at locus *A* and locus *B*; constraints are summarized in Table 1. As an example, we study the behavior of GIA as a function of *D* = 0.1 and different fixed values of *D* = 0.1 for the entire range of possible values of *B* is uninformative on its own [IA(*B*) = 0] since it has identical allele frequencies in both populations. GIA is nonnegative for all possible values of *A* has only two alleles, whereas the haplotype locus can have up to four different alleles, increasing the possibility for the haplotype alleles to uniquely characterize populations, which makes the assignment of individuals easier.

Figure 2, B and C, show that the sign and magnitude of GIA vary depending on the values of the allele frequencies at locus *A*. The borders of the surfaces are defined by the constraints on *i.e.*, private for one population. There are two interesting points on the surfaces, the leftmost tip and the rightmost tip. Although they share the same property of being the only cases where two haplotype alleles are private, the rightmost tip yields the maximum GIA whereas the leftmost tip yields a negative GIA. The absolute difference *A*) and therefore a smaller GIA than for the rightmost tip. Nevertheless, they are both local maxima, which is caused by the often substantial informativeness of private alleles.

We also investigate the behavior of GIA as a function of *D* when all the allele frequencies are fixed and GIA is therefore completely determined by IA(*H*). Figure 3 shows four examples of GIA as functions of *D*, across the range of possible values of *D*, for different values of *D* = 0, GIA ≤ 0 (consistent with the Theorem). For *D*. This example is similar to the example in Figure 2A, for which locus B was also uninformative.

The sign and the magnitude of GIA vary as a function of *D* for fixed allele frequencies of locus *A* and locus *B*. GIA can be positive for the entire range of *D* (Figure 3A), negative for the entire range (Figure 3D), or change sign depending on *D* (Figure 3, B and C). The range of *D* is defined by the constraint that all haplotype allele frequencies must be nonnegative. The two extreme values for each case in Figure 3 correspond to one of the eight haplotype allele frequencies (four haplotype allele frequencies in each population) being equal to zero in one population, which means that the corresponding haplotype allele is private to the other population.
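The admissible range of *D* follows directly from requiring the four haplotype allele frequencies to be nonnegative in each population. The Python sketch below assumes a common *D* in all populations, as in the two-population analysis above; the function names and the example frequencies are illustrative.

```python
def d_range(a, b):
    """Admissible D in one population with allele frequencies a (locus A) and b (locus B).

    The bounds come from x11 = a*b + D >= 0, x22 = (1-a)*(1-b) + D >= 0,
    x12 = a*(1-b) - D >= 0 and x21 = (1-a)*b - D >= 0.
    """
    lower = max(-a * b, -(1 - a) * (1 - b))
    upper = min(a * (1 - b), (1 - a) * b)
    return lower, upper

def shared_d_range(a_freqs, b_freqs):
    """Range of a common D that is admissible in every population."""
    ranges = [d_range(a, b) for a, b in zip(a_freqs, b_freqs)]
    return max(lo for lo, _ in ranges), min(hi for _, hi in ranges)

# Two populations with a = (0.2, 0.6) at locus A and b = (0.5, 0.5) at locus B.
d_min, d_max = shared_d_range((0.2, 0.6), (0.5, 0.5))
```

At either end of the returned interval, one haplotype allele frequency is zero in the population that sets the bound, which is the private-allele situation described above.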

In summary, although there are a number of predictable behaviors of GIA—such as that GIA ≤ 0 when markers are in linkage equilibrium and that GIA is often large for cases where private alleles exist—GIA is not a trivial function of LD or allele frequencies.

## Results

### Comparing GIA and performance of assignment

To assess how haplotype loci that are constructed on the basis of GIA perform for assigning individuals to populations, we evaluate assignment in a two-population case for a wide range of allele frequencies and levels of linkage disequilibrium. We investigate a case of 200 haploid individuals, 100 individuals from each population, where each individual is assumed to be typed for 40 pairs of SNPs. We generate a discrete set of haploid gene copies (for a pair of SNPs) for each population that satisfies a particular choice of allele frequencies and levels of LD (see Table 2). This set of gene copies is randomly permuted to generate a set of 40 pairs of SNPs, which ensures that the pairs of SNPs are independent of each other (conditional on the allele frequencies). This procedure guarantees that all the SNP pairs have the same allele frequencies for SNP *A*, SNP *B*, and the *A–B* haplotype locus and consequently the same level of LD between the two SNPs. Note that within a population, most of the LD in the sample is a result of the linkage between the two SNPs in each pair.

For these population-genetic data, we use the software STRUCTURE (Pritchard *et al.* 2000; Falush *et al.* 2003) to assign the 200 haploid individuals to two clusters (no-admixture model, burn-in period of 20,000 iterations followed by 5,000 iterations from which estimates were obtained), using either the 80 SNPs or the 40 haplotype loci obtained by combining each pair of SNPs into one haplotype locus. From the STRUCTURE result, the mean incorrect assignment proportion (MIAP) is computed, which is the average proportion of individuals that are assigned to the incorrect population. For a given set of allele frequencies, we generate 100 different replicate samples using the data-randomization procedure described above, assign individuals to populations, and compute the average (across replicates) of MIAP. For comparison, *F*_{ST} values for the SNP pairs, as well as *F*_{ST} values for the haplotype loci, are computed. Like IA, *F*_{ST} relies on information about allele frequencies.
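The averaging over replicates can be sketched in Python as follows. The sketch assumes two clusters whose labels are arbitrary (as in unsupervised clustering), so the proportion of incorrect assignments in a replicate is taken as the smaller of the two possible label matchings. This label-matching rule and the function names are assumptions made for illustration and are not taken from the STRUCTURE output format.

```python
import numpy as np

def incorrect_assignment_proportion(true_pop, inferred_cluster):
    """Proportion of misassigned individuals for K = 2, allowing label switching."""
    true_pop = np.asarray(true_pop)
    inferred_cluster = np.asarray(inferred_cluster)
    mismatch = float(np.mean(true_pop != inferred_cluster))
    return min(mismatch, 1.0 - mismatch)

def miap(replicates):
    """Average the incorrect-assignment proportion over (truth, inferred) replicates."""
    return float(np.mean([incorrect_assignment_proportion(t, c) for t, c in replicates]))

truth = np.array([0] * 100 + [1] * 100)

# Replicate 1: 10 of 200 individuals are placed in the wrong cluster.
rep1 = truth.copy()
rep1[:10] = 1 - rep1[:10]

# Replicate 2: cluster labels are swapped and 30 individuals are misplaced.
rep2 = 1 - truth
rep2[:30] = 1 - rep2[:30]

miap_value = miap([(truth, rep1), (truth, rep2)])
```

With the two toy replicates above, the per-replicate proportions are 0.05 and 0.15, so the average is 0.10.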

Table 2 shows the performance of the assignment based on the 80 SNPs and based on the 40 haplotype loci for various choices of allele frequencies and levels of LD. In most cases when GIA is positive, the MIAP values are lower for the haplotype loci than for the SNPs. Similarly, when GIA is negative, the MIAP values are in most cases lower for the SNPs than for the haplotype loci. For the choices of allele frequencies and levels of LD in Table 2, Figure 4 shows the difference between the MIAP based on SNPs and the MIAP based on haplotype loci (*i.e.*, improved assignment due to haplotype loci) as a function of GIA (Figure 4A), the mean (across populations) of |*D*| (Figure 4B), *r*^{2} (Figure 4C), and the difference in *F*_{ST} between the 40 haplotype loci and the 80 SNPs (Figure 4D). The improved assignment due to using haplotype loci is positively correlated with GIA (Pearson’s ρ = 0.748, *P* = 4 × 10^{−5}), whereas its correlations with the mean |*D*| and with *r*^{2} are nonsignificant (*P* = 0.16 and ρ = −0.302, *P* = 0.18, respectively). The improved assignment is correlated neither with *F*_{ST} for haplotype loci nor with *F*_{ST} for SNPs (ρ = −0.037, *P* = 0.87 and ρ = 0.401, *P* = 0.06, respectively), but it is positively correlated with the difference between *F*_{ST} for haplotype loci and *F*_{ST} for SNPs (ρ = 0.790, *P* = 7 × 10^{−6}). GIA and the difference in *F*_{ST} values appear to be good indicators of how assignment can be improved by combining SNPs into haplotype loci. The outlier observed far from the regression line in Figure 4A corresponds to the 10th entry in Table 2. For this set of allele frequencies, 40 pairs of SNPs are enough to obtain a very accurate assignment (MIAP close to 0) and there is not much room for improvement when combining the SNPs into haplotype loci. GIA and the difference in *F*_{ST} values are correlated (ρ = 0.792), suggesting that the two statistics contain similar information even though GIA is based on a measure of information whereas *F*_{ST} measures differentiation. Indeed, if the differentiation between the two populations is easier to capture when considering haplotype loci compared to considering SNPs separately, we would expect that assignment also improves for haplotype data compared to SNP data.

### Improving assignment using GIA—a simulation study

For empirical population genetic data, allele frequencies and levels of LD vary extensively among loci. GIA is defined for multiallelic markers and can be used for assessing the usefulness of combining not only pairs of SNPs, but also haplotype loci themselves. Thus, GIA can be used for large numbers of SNPs. To demonstrate the utility of GIA, we compare the results of the assignment of 200 haploid individuals originating from two populations and based on 1000 SNPs using different strategies of dealing with the SNPs, *e.g.*, by pruning the SNPs or combining them into haplotype loci. We simulate the 200 haploid individuals with the software ms (Hudson 2002) from a two-island model with migration rate *m* (migrants per generation) and an effective population size of 1000. Each haploid individual represents a DNA fragment of 4.2 Mb with a total scaled recombination rate of ρ = 4*Nr* = 150 or ρ = 4*Nr* = 1500 (where *N* is the population size and *r* is the recombination rate per generation for the entire fragment). We repeat the simulation 100 times for a given migration rate and a given recombination rate. For each sample, we assign the 200 individuals using STRUCTURE on the basis of seven different treatments of the SNPs:

a. Using all 1000 SNPs.

b. Using a subset of the SNPs obtained by pruning. We prune the set of SNPs with the program PLINK (Purcell *et al.* 2007) to remove SNPs that are in high LD (rejection threshold of *r*^{2} = 0.1, windows of 20 SNPs, and shifts of 5 SNPs).

c. Combining the SNPs into haplotype loci with a greedy algorithm that recursively combines the pair of loci that has the greatest GIA among all the pairwise comparisons of loci until no remaining pair of loci has a positive GIA. We refer to this strategy as MaxGIA.

d. Using a set of randomly formed haplotype loci with a haplotype length distribution matching the haplotype length distribution of the set in c. We call this strategy RandomHaplotypes.

e. Using the set of SNPs and haplotype loci obtained with the following algorithm: starting at the first SNP, if GIA is positive between SNP 1 and SNP 2, combine them into a haplotype. Compute GIA for the SNP 1–SNP 2 haplotype and SNP 3, and combine them into a haplotype if GIA is positive. Repeat this process until a SNP *s* is found for which the haplotype locus and SNP *s* have a nonpositive GIA. Repeat the process starting from SNP *s*. We refer to this strategy as NeighborGIA.

f. Using a set of haplotype loci formed by neighboring SNPs obtained by randomly permuting the breakpoints of the haplotype loci set in e, so that the haplotype length distribution is the same as in e. We call this strategy RandomNeighbor.

g. Combining the SNPs into haplotype loci with a greedy algorithm that recursively combines the pair of loci that has the greatest δ = *F*_{ST}(*H*) − *F*_{ST}(M1, M2) among all the pairwise comparisons of loci until no remaining pair of loci has a positive δ. *F*_{ST}(*H*) denotes *F*_{ST} for a haplotype locus, and *F*_{ST}(M1, M2) denotes *F*_{ST} computed for the two markers constituting the haplotype loci. We refer to this strategy as Max*F*_{ST}.
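The greedy MaxGIA strategy described in the list above can be sketched in Python as follows. This is an illustrative implementation under stated assumptions and not the code used in the study: allele frequencies are plug-in estimates from phased gene copies with known population labels, IA takes the equal-weight form of Rosenberg *et al.* (2003), and a small tolerance `TOL` (our choice) guards against merging on floating-point noise. All function names are hypothetical.

```python
import itertools
import numpy as np

TOL = 1e-12  # gains at or below this value are treated as nonpositive

def ia_from_sample(alleles, pops):
    """Plug-in IA estimate from one allele label per gene copy."""
    pop_ids = np.unique(pops)
    _, codes = np.unique(alleles, return_inverse=True)
    n_alleles = codes.max() + 1
    freqs = np.array([
        np.bincount(codes[pops == pop], minlength=n_alleles) / np.sum(pops == pop)
        for pop in pop_ids
    ])
    xlogx = lambda x: x * np.log(np.where(x > 0, x, 1.0))  # 0 * log(0) = 0
    return float(-xlogx(freqs.mean(axis=0)).sum() + xlogx(freqs).sum() / len(pop_ids))

def combine(x, y):
    """Recode each distinct (x, y) pair of alleles as one haplotype allele."""
    _, codes = np.unique(np.column_stack([x, y]), axis=0, return_inverse=True)
    return codes.ravel()

def max_gia(haplotypes, pops):
    """Greedily merge the pair of loci with the largest positive GIA.

    haplotypes: (gene copies x SNPs) array of 0/1 alleles (phased data).
    Returns a sorted list of tuples of SNP indices, one tuple per final locus.
    """
    haplotypes, pops = np.asarray(haplotypes), np.asarray(pops)
    loci = {(j,): haplotypes[:, j] for j in range(haplotypes.shape[1])}
    info = {key: ia_from_sample(col, pops) for key, col in loci.items()}
    while len(loci) > 1:
        best = (TOL, None, None)
        for k1, k2 in itertools.combinations(list(loci), 2):
            merged = combine(loci[k1], loci[k2])
            gain = ia_from_sample(merged, pops) - (info[k1] + info[k2])
            if gain > best[0]:
                best = (gain, (k1, k2), merged)
        _, pair, merged = best
        if pair is None:  # no remaining pair has a positive GIA
            break
        new_key = tuple(sorted(pair[0] + pair[1]))
        for key in pair:
            del loci[key], info[key]
        loci[new_key] = merged
        info[new_key] = ia_from_sample(merged, pops)
    return sorted(loci)

# Toy data: SNPs 0 and 1 are individually uninformative but jointly diagnostic;
# SNP 2 is independent of both and has the same frequency in the two populations.
toy = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1],
                [0, 1, 0], [0, 1, 1], [1, 0, 0], [1, 0, 1]])
toy_pops = np.array([0, 0, 0, 0, 1, 1, 1, 1])
toy_loci = max_gia(toy, toy_pops)
```

On the toy data, SNPs 0 and 1 are merged into one haplotype locus and SNP 2 is left on its own. Each round evaluates every pair of loci, so the cost grows quadratically with the number of loci; restricting the search to short windows of SNPs keeps this manageable.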

For each sample, migration rate, and strategy, we record the performance of assigning individuals to populations that is obtained from STRUCTURE (with the same settings as above). Figure 5 shows MIAP for the different strategies (no combination, pruning, MaxGIA, RandomHaplotypes, NeighborGIA, RandomNeighbor, and Max*F*_{ST}) for a range of migration rates *m* and scaled recombination rates of ρ = 150 and ρ = 1500. The GIA- and the *F*_{ST}-based strategies require some knowledge about allele frequencies for the considered markers, including the haplotype loci formed in the iterative processes. In the context of an assignment problem, this information can be obtained from phased data for candidate populations. In this simulation study, we estimate the allele frequencies directly from the sample and use our knowledge of the individuals’ true ancestry. Thus, improvement based on the GIA or the *F*_{ST} strategy is to some degree magnified by the fact that we are using information about the individuals’ true ancestry to compute the allele frequencies. However, the NeighborGIA strategy uses the same information as the MaxGIA and Max*F*_{ST} strategies, and the improvement obtained for the MaxGIA and Max*F*_{ST} strategies cannot be explained solely by using information about the individuals’ ancestry.

For both recombination rates, the MaxGIA and Max*F*_{ST} strategies for combining SNPs show the fewest incorrect assignments, but recombination rate has a strong impact on the accuracy of the assignment. For the high-recombination case (ρ = 1500), the markers are less correlated and the set of markers carries more information about ancestry than the markers in the low-recombination case. Furthermore, as expected, when the migration rate increases (and *F*_{ST} decreases), MIAP also increases for all seven strategies. However, for the high-recombination case and a migration rate of 0.01, the MaxGIA and Max*F*_{ST} strategies can uncover the structure with (on average) <2% incorrect assignment compared to 37% using the full set or the pruned set of SNPs (Figure 5B). Combining neighboring SNPs that have positive GIA also improves the assignment, but to a lesser extent than the MaxGIA strategy. For both choices of recombination rates, the strategies that combine SNPs into haplotypes in a random manner (RandomHaplotypes and RandomNeighbor) result in poor assignment. Thus, the improved assignment for MaxGIA, and to some degree NeighborGIA, compared to the pruning or no combination strategies is likely to be the result of using GIA as a criterion for combining SNPs into haplotypes and not just a result of randomly combining SNPs into haplotypes. However, for ρ = 1500, the strategy RandomNeighbor, which consists of randomly combining neighboring SNPs, increases the accuracy of the assignment compared to the pruning or no combination strategies. Finally, we note that the accuracy of the assignment for the pruned set of SNPs is similar to that of the assignment based on the full set of SNPs, suggesting that the removed SNPs contained redundant information about ancestry.

In the case of 1% migrants per generation (*m* = 0.01, the greatest migration rate that we investigate), the distribution of MIAP for the 100 replicates varies depending on the strategy for treating the SNP data. Six distributions of MIAP (based on different treatments of the SNPs) for the low-recombination case (ρ = 150) are shown in Figure 6 and the corresponding distributions of MIAP for the high-recombination case (ρ = 1500) are shown in Figure 7. For ρ = 150, the distribution of MIAP based on the MaxGIA strategy is spread over a range of values compared to the results of the other strategies, which are skewed toward 0.5, the expected value of MIAP for random assignment of individuals to populations (but note that this expected value may be slightly smaller for finite population sizes and unlabeled populations). So, as also shown by the mean MIAP in Figure 5A, MaxGIA is the most accurate strategy, but there are also cases of poor assignment using this strategy. If we increase the recombination rate, all six distributions of MIAP move away from 0.5, except for RandomHaplotypes. The distributions of MIAP for RandomNeighbor, pruning, or no combination strategies are similar and have large variances. The distribution of MIAP for the MaxGIA strategy is skewed toward 0, demonstrating superior assignment accuracy compared to the other strategies.

To get an idea of how many SNPs make up the haplotype loci that are constructed using the MaxGIA strategy, we compute the distribution of the number of SNPs in haplotype loci for four different migration rates and for two different recombination rates (Figure 8). All the length distributions show a clear mode, and the value of the mode appears to increase with increasing migration rate. This observation suggests that when it becomes more difficult to assign individuals to populations because of higher migration rate, longer haplotype loci may increase the accuracy of the assignment. For the low-recombination case (ρ = 150), there is also a second mode at one single SNP (for all but the lowest migration rate), showing that many SNPs are not combined with other SNPs for these cases. In general, however, the recombination rate appears to have little impact on the length distribution of the majority of haplotype loci.

### Improving assignment using GIA—POPRES data

To investigate whether haplotype loci can improve ancestry inference for empirical population genetic data, we use SNP-chip data from the POPRES panel that contain some 1385 individuals from Europe (Nelson *et al.* 2008), who have been genotyped for some 500,000 SNPs. We phased all individuals using fastPHASE (Scheet and Stephens 2006), version 1.4 (“haplotype clusters” set to 20 and 20 runs of the EM algorithm), which generated “best guess” estimates of the phase of each of the two haploid copies for each individual.

We conduct a cross-validation study for the 89 French and 70 German individuals (one German outlier individual was removed) in the POPRES collection (Nelson *et al.* 2008) and focus on the phased data of 105,341 SNPs on chromosomes 1, 2 and 3 (*F*_{ST} = 0.00068). To construct a training set, 45 French individuals and 35 German individuals were randomly sampled, and the remaining 44 French and 35 German individuals make up the validation set. Each chromosome is divided into windows of 10 SNPs and using the MaxGIA strategy, we build a set of haplotype loci using estimated allele frequencies from the training set of individuals for each 10 SNP-window. This set contains 54,762 haplotype loci and the configuration of SNPs is known so that we can combine the SNPs in the validation set to make up the same haplotype loci. We perform the assignment of the individuals in the validation set using STRUCTURE and using principal component analysis (PCA), for either the entire set of SNPs or the set of haplotype loci. For STRUCTURE, we compute the fraction of the validation individuals that are misclassified using the training individuals as known populations (supervised clustering), as well as the fraction of misclassified individuals in the training set alone (based on unsupervised clustering).

There was no obvious clustering of individuals in the training set using either SNPs or haplotype loci (50% correctly classified individuals for both types of data). Assigning individuals in the validation set also performs poorly for both haplotype loci (51% correctly classified individuals) and SNPs (61% correctly classified individuals). However, PCA based on the haplotype data differentiates the individuals in both the training set and the validation set (Figure 9, C and D), and validation individuals can be assigned to populations with high accuracy (87.3%) in contrast to using SNPs (53.2% correctly assigned individuals in the validation set; Figure 9, A and B), corresponding to a 73% reduction of incorrectly assigned individuals using haplotypes. If we instead use data from all chromosomes, the fraction of incorrectly assigned (validation) individuals is reduced by 33% for haplotypes compared to SNPs. To perform the PCA, the haplotype data are transformed to a matrix of haplotype alleles *vs.* individuals where entries in the matrix denote 0, 1, or 2 copies of a haplotype allele in a particular individual. For both the training set and the validation set, the first component of such PCA based on haplotypes reveals a clear clustering of the individuals, according to French or German origin. The assignment of the validation individuals to candidate populations is determined by the smallest distance along PC1 to the mean coordinate of either the French training set or the German training set.
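The PCA-based assignment rule can be sketched in Python as follows. The sketch assumes that the PCA is computed jointly on training and validation individuals from the centered allele-count matrix, and that each validation individual is assigned to the group whose training mean is closest along PC1. Whether the original analysis computed the PCA jointly is not stated here, so this is an assumption, and all function names and the toy data are hypothetical.

```python
import numpy as np

def allele_count_matrix(hap_copy1, hap_copy2):
    """Individuals x haplotype-alleles matrix of 0/1/2 copy counts for one haplotype locus."""
    alleles = sorted(set(hap_copy1) | set(hap_copy2))
    index = {allele: j for j, allele in enumerate(alleles)}
    counts = np.zeros((len(hap_copy1), len(alleles)))
    for i, (h1, h2) in enumerate(zip(hap_copy1, hap_copy2)):
        counts[i, index[h1]] += 1
        counts[i, index[h2]] += 1
    return counts

def pc1_scores(counts):
    """Coordinates of each individual on the first principal component."""
    centered = counts - counts.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, 0] * s[0]

def assign_by_pc1(scores, train_idx, train_labels, test_idx):
    """Assign each test individual to the group with the nearest training mean on PC1."""
    train_labels = np.asarray(train_labels)
    groups = np.unique(train_labels)
    train_scores = scores[train_idx]
    means = np.array([train_scores[train_labels == g].mean() for g in groups])
    nearest = np.abs(scores[test_idx][:, None] - means[None, :]).argmin(axis=1)
    return groups[nearest]

# Toy example: one haplotype locus with two alleles in six diploid individuals.
copy1 = ["AC", "AC", "GT", "GT", "AC", "GT"]
copy2 = ["AC", "AC", "GT", "GT", "AC", "GT"]
counts = allele_count_matrix(copy1, copy2)
scores = pc1_scores(counts)
predicted = assign_by_pc1(scores, [0, 1, 2, 3], ["F", "F", "G", "G"], [4, 5])
```

With real data, the count matrices of all haplotype loci would be concatenated column-wise before the PCA. The 73% reduction quoted above follows from the two accuracies: the error rate drops from 46.8% (SNPs) to 12.7% (haplotypes), and (46.8 − 12.7)/46.8 ≈ 0.73.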

To investigate a more challenging and realistic application, we assign 209 individuals from Switzerland (84 Swiss-German and 125 Swiss-French), using a training set of 89 French and 70 German individuals from the POPRES data. The level of differentiation among groups is low, for example, *F*_{ST} = 0.00012 for Swiss-French *vs.* Swiss-German, *F*_{ST} = 0.00028 for French *vs.* Swiss-French, *F*_{ST} = 0.00022 for German *vs.* Swiss-German, *F*_{ST} = 0.00034 for French *vs.* Swiss-German, and *F*_{ST} = 0.00047 for German *vs.* Swiss-French. We use the same procedure and the same 105,341 SNPs as for the cross-validation study above, and the haplotype loci (in total 50,268) are constructed using the MaxGIA strategy for 10-SNP windows based on all the French and German individuals. Using SNPs, the Swiss-French and the Swiss-German individuals are assigned to candidate populations only barely better than at random (54.5% correctly classified individuals, Figure 10A). Using haplotypes only slightly improves the assignment (58.4% correctly classified individuals), corresponding to 7% fewer misclassified individuals compared to using SNPs (Figure 10B). If we instead conduct a cross-validation study of the Swiss-French and the Swiss-German individuals (similar to the study above for the French and the German individuals), the number of incorrectly assigned individuals can be reduced by 28.6% by using haplotypes instead of SNPs. Finally, we note that the assignment strategy based on the first PC is rather crude, and there is additional information about population assignment in the remaining PCs that may improve the assignment accuracy further.

## Discussion

As genotyping technologies improve, population-genetic data sets contain increasing numbers of markers. For example, millions of SNPs have been typed for hundreds of humans (International HapMap 3 Consortium 2010). This development leads to an increase in marker density and substantial levels of LD between many markers. In this study, we focus on how to use dense sets of SNPs for assigning individuals of unknown origin to candidate populations. The idea is to incorporate information from recombination events by combining SNPs into haplotype loci. We describe a new statistic, the gain of informativeness for assignment from haplotype data, as a decision criterion for combining SNPs into haplotype loci. GIA compares the informativeness for assignment contained in a haplotype locus with the sum of the informativeness for assignment contained in each constitutive locus forming the haplotype locus. If the data consist of genotype data from diploids, a phasing step is needed to infer the phase of the two chromosomes in each individual before GIA can be used to construct a set of haplotype loci. We show that combining SNPs into haplotype loci using GIA improves the accuracy of assigning individuals to populations, whereas a strategy of randomly combining SNPs into haplotype loci leads to less efficient assignment. This result demonstrates that not all haplotypes improve assignment and that combining markers sometimes results in poorer assignment, which may appear surprising since haplotype loci are multiallelic and should therefore be more informative about ancestry (compared, for example, with the use of microsatellites in forensics). However, if we consider the extreme situation where all SNPs are combined into one haplotype locus, most individuals would have (two) unique haplotype alleles and the information on ancestry would be nearly zero. There may be an optimum number of SNPs to include in haplotype loci, but this value will depend on both SNP density and levels of LD, which both vary across the genome. The observed modes for the distribution of number of SNPs in haplotype loci (Figure 8) give an indication of the optimum for the particular cases that we investigate.
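The comparison that GIA makes can be illustrated with a small sketch. It assumes that "informativeness for assignment" is the *I*_n statistic of Rosenberg *et al.* (2003), computed from population allele frequencies with equal weight given to each population, and that the haplotype frequencies are known (that is, phasing has already been done). The function names and the two toy frequency tables are hypothetical.

```python
import numpy as np

def informativeness(freqs):
    """I_n for one locus (assumed here to be the statistic of Rosenberg et al. 2003).

    freqs has shape (K, J): K populations, J alleles, each row summing to 1.
    """
    freqs = np.asarray(freqs, dtype=float)
    mean = freqs.mean(axis=0)

    def plogp(p):
        # Convention 0 log 0 = 0.
        out = np.zeros_like(p)
        nonzero = p > 0
        out[nonzero] = p[nonzero] * np.log(p[nonzero])
        return out

    # Sum over alleles of: -mean*log(mean) + (1/K) * sum_i p_ij*log(p_ij).
    return float(-plogp(mean).sum() + plogp(freqs).mean(axis=0).sum())

def gia(hap_freqs):
    """Informativeness of the haplotype locus minus that of its two loci.

    hap_freqs has shape (K, U, V): frequency in population i of the
    haplotype carrying allele u at locus A and allele v at locus B.
    """
    hap = np.asarray(hap_freqs, dtype=float)
    k = hap.shape[0]
    i_h = informativeness(hap.reshape(k, -1))
    i_a = informativeness(hap.sum(axis=2))  # allele frequencies at locus A
    i_b = informativeness(hap.sum(axis=1))  # allele frequencies at locus B
    return i_h - (i_a + i_b)

# Toy case 1: linkage equilibrium within each of two populations.
le_case = np.array([
    np.outer([0.2, 0.8], [0.3, 0.7]),
    np.outer([0.6, 0.4], [0.7, 0.3]),
])

# Toy case 2: strong LD of opposite sign in the two populations, with
# identical allele frequencies (0.5) at both SNPs in both populations.
ld_case = np.array([
    [[0.5, 0.0], [0.0, 0.5]],
    [[0.0, 0.5], [0.5, 0.0]],
])

print(gia(le_case))  # negative: combining the two SNPs does not help
print(gia(ld_case))  # positive: only the haplotype separates the populations
```

In the second toy case each SNP on its own carries no information about ancestry, so the whole informativeness of the haplotype locus counts as gain.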

We use simulations based on a two-island model with continuous migration between the populations and empirical data from the POPRES panel (Nelson *et al.* 2008) to investigate how different strategies can improve assignment of individuals to populations. Similar to many empirical population studies, the simulated data may contain recent migrants from one population to the other. In our setup, an individual is considered to be incorrectly assigned when it is not assigned to the population it was sampled from, regardless of whether the individual was a very recent migrant or not. This means that among the individuals deemed incorrectly assigned, there may be a proportion of recent migrants who are justifiably assigned to the population of their recent ancestry (which is not the population they were sampled from). We may therefore expect a small fraction of incorrectly assigned individuals regardless of the assignment approach, but this phenomenon will have little effect on our simulation study. Indeed, for a migration rate *m* = 0.01 and a sample size of 200, we expect 2 individuals to be first-generation migrants in the sample, with a variance of 2, but this number is too small to explain the high number of incorrectly assigned individuals using, for example, the entire set of SNPs or the pruned set of SNPs (Figures 5–7).
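The expected number of first-generation migrants quoted above follows from a simple binomial argument, sketched here under the assumption that each of the *n* sampled individuals is a first-generation migrant independently with probability *m*. The function names are illustrative only.

```python
from math import comb

def migrant_count_moments(n, m):
    """Mean and variance of a Binomial(n, m) number of migrants."""
    return n * m, n * m * (1.0 - m)

def prob_at_least(k, n, m):
    """Probability of observing at least k migrants among n individuals."""
    return 1.0 - sum(comb(n, j) * m**j * (1.0 - m) ** (n - j) for j in range(k))

mean, variance = migrant_count_moments(200, 0.01)
print(mean, variance)                # about 2 and 1.98
print(prob_at_least(10, 200, 0.01))  # chance of 10 or more migrants in the sample
```

Under this assumption, even 10 first-generation migrants in a sample of 200 would be a rare event, which is consistent with the statement that recent migrants cannot account for the high number of incorrectly assigned individuals.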

GIA is well adapted for assignment problems where individuals or segments of genomes are assigned to one of several candidate populations for which we have estimates of allele frequencies for the SNPs and for the haplotype loci. In particular, a recursive greedy algorithm was found to improve assignment substantially. Interestingly, assignment based on the same greedy algorithm, but using the gain in *F*_{ST} (the difference between haplotype-based *F*_{ST} and single-marker-based *F*_{ST}) instead of GIA to determine which markers to combine, also performs much better than assignment based on single SNPs (Figure 5). This observation suggests that it is the guided combination of SNPs into haplotypes that leads to the improved assignment and not a particular property of GIA, although GIA is a useful tool for determining which SNPs to combine.

For population structure problems, GIA cannot be used directly because it requires some knowledge about the allele frequencies within the populations, but it could potentially be integrated into MCMC algorithms for estimating population structure, where the algorithms involve a step of partitioning individuals, such as in BAPS (Corander *et al.* 2003, 2004), TESS (Chen *et al.* 2007; Durand *et al.* 2009), or STRUCTURE (Pritchard *et al.* 2000; Falush *et al.* 2003). Briefly, for a particular proposed partition, allele frequencies can be estimated from the partitioned sample, and GIA can be computed and used to improve the inference of population structure.

We have demonstrated that haplotypes contain additional information about population structure and that using haplotypes instead of single SNPs can improve assignment of individuals to populations. The GIA statistic determines when it is possible to improve the assignment of individuals to populations by combining markers into haplotypes and it can be used as a tool for population structure inference methods to capitalize on dense sets of genetic markers.

## Acknowledgments

We thank M. Blum, P. Sjödin, C. Schlebusch, and two anonymous reviewers for helpful discussions and comments on the manuscript and N. Duforet-Frebourg for technical assistance. The POPRES data were obtained from dbGaP (accession no. phs000145.v1.p1). Financial support was provided by the Swedish Research Council and the Swedish Research Council Formas.

## Appendix

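We rewrite Equation 2. Denote the frequency of allele *u* at locus *A* in population *i* by $p^{(i)}_{u\cdot}$, the frequency of allele *v* at locus *B* in population *i* by $p^{(i)}_{\cdot v}$, and the frequency of allele *uv* of the haplotype locus, formed by allele *u* at locus *A* and allele *v* at locus *B*, in population *i* by $p^{(i)}_{uv}$. Writing $\bar{p}$ for the mean of a frequency over the *K* populations, we obtain

$$
\mathrm{GIA} = \sum_{u=1}^{U}\sum_{v=1}^{V}\Bigl(-\bar{p}_{uv}\log\bar{p}_{uv} + \frac{1}{K}\sum_{i=1}^{K}p^{(i)}_{uv}\log p^{(i)}_{uv}\Bigr)
- \sum_{u=1}^{U}\Bigl(-\bar{p}_{u\cdot}\log\bar{p}_{u\cdot} + \frac{1}{K}\sum_{i=1}^{K}p^{(i)}_{u\cdot}\log p^{(i)}_{u\cdot}\Bigr)
- \sum_{v=1}^{V}\Bigl(-\bar{p}_{\cdot v}\log\bar{p}_{\cdot v} + \frac{1}{K}\sum_{i=1}^{K}p^{(i)}_{\cdot v}\log p^{(i)}_{\cdot v}\Bigr),
$$

with *U* and *V* denoting the number of alleles at locus *A* and locus *B*, respectively, and using the convention of 0 log 0 = 0.

**Theorem.** *Let A and B be two biallelic loci and H be their associated haplotype locus. Consider K randomly mating populations. For population i, let* $p_i$ *and* $q_i$ *be the frequencies of the minor allele at locus A and locus B, respectively. Then, for all the frequency distributions of the alleles,*

$$
D_i = 0 \ \text{for all } i \quad\Longrightarrow\quad \mathrm{GIA} \le 0,
$$

*with equality if and only if*

$$
\frac{1}{K}\sum_{i=1}^{K} p_i q_i = \Bigl(\frac{1}{K}\sum_{i=1}^{K} p_i\Bigr)\Bigl(\frac{1}{K}\sum_{i=1}^{K} q_i\Bigr).
$$

*Proof of Theorem.* Consider Equation 3 with two biallelic loci (*U* = 2 and *V* = 2). When $D_i = 0$, the four haplotype frequencies in population *i* are $p_i q_i$, $p_i(1-q_i)$, $(1-p_i)q_i$, and $(1-p_i)(1-q_i)$, and the within-population terms of GIA cancel because each haplotype frequency is the product of the corresponding allele frequencies. Let α, β, γ, and δ = 1 − α − β − γ denote the means of these four haplotype frequencies over the *K* populations. GIA is then a function *f* of α, β, and γ,

$$
f(\alpha,\beta,\gamma) = -\alpha\log\alpha - \beta\log\beta - \gamma\log\gamma - \delta\log\delta
+ (\alpha+\beta)\log(\alpha+\beta) + (\gamma+\delta)\log(\gamma+\delta)
+ (\alpha+\gamma)\log(\alpha+\gamma) + (\beta+\delta)\log(\beta+\delta).
$$

*f* is twofold differentiable on the open space *S* = {α > 0, β > 0, γ > 0 | α + β + γ < 1} and we look for the set of points where the gradient of *f* is equal to zero; in other words, we are looking for the critical points of *f*. The first partial derivatives of *f* are

$$
\frac{\partial f}{\partial\alpha} = \log\frac{\delta(\alpha+\beta)(\alpha+\gamma)}{\alpha(\gamma+\delta)(\beta+\delta)},\qquad
\frac{\partial f}{\partial\beta} = \log\frac{\delta(\alpha+\beta)}{\beta(\gamma+\delta)},\qquad
\frac{\partial f}{\partial\gamma} = \log\frac{\delta(\alpha+\gamma)}{\gamma(\beta+\delta)}.
$$

The first partial derivatives of *f* are all equal to zero if and only if αδ = βγ. The nature of the critical points can be investigated by looking at the Hessian matrix ℋ. We can show that for αδ = βγ, ℋ can be written as

$$
\mathcal{H} = -\frac{1}{\alpha\delta}\,X^{T}X,
$$

with *X* the row vector (α − δ, α + γ, α + β) and $X^{T}$ its transposed vector. ℋ is thus negative and the critical points defined by αδ = βγ are maxima of *f*. Since the equation αδ = βγ defines a continuous surface in the open space *S* on which *f* reaches a maximum, the value of *f* on this surface is constant: *f* = 0. Hence *f* ≤ 0 on *S*, with equality if and only if αδ = βγ, which is equivalent to α = (α + β)(α + γ), the equality condition of the Theorem. Finally, *f* is extendable by continuity on the border of *S*: terms of the form 0 log 0 appear there, but all those terms are equal to zero. This achieves the proof of the Theorem.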
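The steps of the proof can be checked numerically. The sketch below assumes a specific form for *f*: the entropy of the pooled haplotype frequencies (α, β, γ, δ) minus the entropies of the two pooled marginal allele distributions, which is what GIA reduces to under linkage equilibrium if informativeness is measured by *I*_n. That form, the function names, and the test point are assumptions of this example and are not taken verbatim from the Appendix. The code checks that *f* vanishes with zero gradient at a point where αδ = βγ, that the finite-difference Hessian there matches −*X*^T*X*/(αδ) with *X* = (α − δ, α + γ, α + β), and that *f* ≤ 0 at random points of *S*.

```python
import numpy as np

def f(alpha, beta, gamma):
    """Assumed form of f: H(haplotypes) - H(locus A) - H(locus B), pooled frequencies."""
    delta = 1.0 - alpha - beta - gamma
    hap = np.array([alpha, beta, gamma, delta])
    marg = np.array([alpha + beta, gamma + delta, alpha + gamma, beta + delta])
    return float(-(hap * np.log(hap)).sum() + (marg * np.log(marg)).sum())

def gradient_fd(point, h=1e-4):
    """Central finite-difference gradient of f."""
    point = np.asarray(point, dtype=float)
    grad = np.zeros(3)
    for i in range(3):
        step = np.zeros(3)
        step[i] = h
        grad[i] = (f(*(point + step)) - f(*(point - step))) / (2 * h)
    return grad

def hessian_fd(point, h=1e-3):
    """Central finite-difference Hessian of f."""
    point = np.asarray(point, dtype=float)
    hess = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            ei = np.zeros(3)
            ej = np.zeros(3)
            ei[i] = h
            ej[j] = h
            hess[i, j] = (f(*(point + ei + ej)) - f(*(point + ei - ej))
                          - f(*(point - ei + ej)) + f(*(point - ei - ej))) / (4 * h * h)
    return hess

# A point on the surface alpha*delta = beta*gamma: independent loci with
# pooled allele frequencies a = 0.3 at locus A and b = 0.6 at locus B.
a, b = 0.3, 0.6
critical = np.array([a * b, a * (1 - b), (1 - a) * b])
alpha, beta, gamma = critical
delta = 1.0 - critical.sum()

x = np.array([alpha - delta, alpha + gamma, alpha + beta])
hessian_formula = -np.outer(x, x) / (alpha * delta)

# Random interior points of S (all four haplotype frequencies positive).
rng = np.random.default_rng(0)
samples = rng.dirichlet(np.ones(4), size=1000)
largest_value = max(f(s[0], s[1], s[2]) for s in samples)

print(f(*critical), gradient_fd(critical))
print(np.abs(hessian_fd(critical) - hessian_formula).max())
print(largest_value)
```

The Hessian has rank 1 on the critical surface, which reflects the fact that *f* stays equal to zero when moving along that surface.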

- Received May 26, 2011.
- Accepted August 11, 2011.

- Copyright © 2012 by the Genetics Society of America