## Abstract

The detection of molecular signatures of selection is one of the major concerns of modern population genetics. A widely used strategy in this context is to compare samples from several populations and to look for genomic regions with outstanding genetic differentiation between these populations. Genetic differentiation is generally based on allele frequency differences between populations, which are measured by *F*_{ST} or related statistics. Here we introduce a new statistic, denoted hapFLK, which focuses instead on the differences of haplotype frequencies between populations. In contrast to most existing statistics, hapFLK accounts for the hierarchical structure of the sampled populations. Using computer simulations, we show that each of these two features—the use of haplotype information and of the hierarchical structure of populations—significantly improves the detection power of selected loci and that combining them in the hapFLK statistic provides even greater power. We also show that hapFLK is robust with respect to bottlenecks and migration and improves over existing approaches in many situations. Finally, we apply hapFLK to a set of six sheep breeds from Northern Europe and identify seven regions under selection, which include already reported regions but also several new ones. We propose a method to help identifying the population(s) under selection in a detected region, which reveals that in many of these regions selection most likely occurred in more than one population. Furthermore, several of the detected regions correspond to incomplete sweeps, where the favorable haplotype is only at intermediate frequency in the population(s) under selection.

- rningselective sweeps
- haplotype
- linkage disequilibrium
*F*_{ST}- genome scan
- sheep
- statistical genetics
- structured populations
- population genetics

THE detection of molecular signatures of selection is one of the major concerns of modern population genetics. It provides insight on the mechanisms leading to population divergence and differentiation. It has become crucial in biomedical sciences, where it can help to identify genes related to disease resistance (Tishkoff *et al.* 2001; Barreiro *et al.* 2008; Albrechtsen *et al.* 2010; Fumagalli *et al.* 2010; Cagliani *et al.* 2011), adaptation to climate (Lao *et al.* 2007; Sturm 2009; Rees and Harding 2012), or altitude (Bigham *et al.* 2010; Simonson *et al.* 2010). In livestock species, where artificial selection has been carried out by humans since domestication, it contributes to map traits of agronomical interest, for instance, related to milk (Hayes *et al.* 2009) or meat (Kijas *et al.* 2012) production.

Efficiency of methods for detecting selection varies with the considered selection timescale (Sabeti *et al.* 2006). For the detection of selection within species (the ecological scale of time), methods can be classified into three groups: methods based on (i) the high frequency of derived alleles and other consequences of hitchhiking within population (Kim and Stephan 2002; Kim and Nielsen 2004; Nielsen *et al.* 2005; Boitard *et al.* 2009), (ii) the length and structure of haplotypes, measured by extended haplotype homozygosity (EHH) or EHH-derived statistics (Sabeti *et al.* 2002; Voight *et al.* 2006), and (iii) the genetic differentiation between populations, measured by *F*_{ST} or related statistics (Lewontin and Krakauer 1973; Beaumont and Balding 2004; Foll and Gaggiotti 2008; Riebler *et al.* 2008;Gautier *et al.* 2009; Bonhomme *et al.* 2010). Methods of the latter kind, which we focus on, are particularly suited to the study of species that are structured in well-defined populations, such as most domesticated species. In contrast to methods based on the frequency spectrum (i) or the excess of long haplotypes (ii), they can detect a wider range of selection scenarios, including selection on standing variation or incomplete sweeps, albeit up to a given extent (Innan and Kim 2008; Yi *et al.* 2010).

The most widely used statistic with which to detect loci with outstanding genetic differentiation between populations is the *F*_{ST} statistic (Barreiro *et al.* 2008; Myles *et al.* 2008). The general application of the *F*_{ST}-based scan for selection is to identify outliers in the empirical distribution of the statistics computed genome-wide. One major concern with this approach is that it implicitly assumes that populations have the same effective size and derived independently from the same ancestral population. If this hypothesis does not hold, which is often the case, genome scans based on raw *F*_{ST} can suffer from bias and false positives, an effect that is similar to the well-known effects of cryptic structure in genome-wide association studies (Price *et al.* 2010). To cope with this problem several methods have been proposed to account for unequal population sizes (Beaumont and Balding 2004; Foll and Gaggiotti 2008; Riebler *et al.* 2008; Gautier *et al.* 2009); however, few solutions have been proposed to deal with the hierarchical structure of populations (Excoffier *et al.* 2009). Among them Bonhomme *et al.* (2010) proposed an extension of the classical Lewontin and Krakauer (LK) test (Lewontin and Krakauer 1973), where the hierarchical population structure is captured through a kinship matrix, which is used to model the covariance matrix of the population allele frequencies. A similar covariance matrix was also introduced in a related context to account for the correlation structure arising from population geography (Coop *et al.* 2010).

All *F*_{ST}-based approaches discussed above are single marker tests; *i.e.*, markers are analyzed independently from each other. As dense genotyping data and sequencing data are now common in population genetics, accounting for correlations between adjacent markers has become necessary. Furthermore, haplotype structure contains useful information for the detection of selected loci, as demonstrated by the within-population methods mentioned above (class ii). Several strategies for combining the use of multiple populations and of haplotype information have thus been proposed recently. These include the development of EHH-related statistics for the comparison of pairs of populations (Sabeti *et al.* 2007; Tang *et al.* 2007), the introduction of dependence between SNPs (single nucleotide polymorphisms) in *F*_{ST}-based approaches through autoregressive processes (Guo *et al.* 2009; Gompert and Buerkle 2011), or the computation of *F*_{ST} using local haplotype clusters that are considered as alleles (Browning and Weir 2010). However, none of these approaches accounts for the possibility that populations are hierarchically structured.

We present here an haplotype-based method for the detection of positive selection from multiple population data. This new statistic, hapFLK, builds upon the original FLK statistic (Bonhomme *et al.* 2010). As FLK, it incorporates hierarchical structure of populations, but the test is extended to account for the haplotype structure in the sample. For this, it uses a multipoint linkage disequilibrium model (Scheet and Stephens 2006) that regroups individual chromosomes into local haplotype clusters. The principle is to exploit this clustering model to compute “haplotype frequencies,” which are then used to measure differentiation between populations. The idea of using localized haplotype clusters to study genetic data on multiple populations has been proposed before (Jakobsson *et al.* 2008; Browning and Weir 2010). Browning and Weir (2010) showed that using haplotype clusters rather than SNPs allowed circumvention, to some extent, of the problems arising from SNP ascertainment bias. They also showed that two genome regions known to have been under strong positive selection in particular human populations exhibited large population-specific haplotype-based *F*_{ST}. Jakobsson *et al.* (2008) showed by using fastPHASE that there was a predominance of a single cluster haplotype in the HapMap population of Utah residents with ancestry from northern and western Europe (CEU population) in the region of the LCT gene and interpreted this signal as a recent selective sweep.

In this article, we examined in detail the ability of statistics based on population differentiation at the haplotype level to capture selection signals. Using computer simulations, we study the power and robustness of our new haplotype-based method for different selection and sampling scenarios and compare it to single-marker [*F*_{ST} and FLK (Bonhomme *et al.* 2010)] and haplotype-based approaches (XP–EHH; Sabeti *et al.* 2007). To illustrate the interest of this approach, we provide a practical example on a set of six sheep breeds for which dense genotyping data have been recently released by the Sheep HapMap Project (International Sheep Genomics Consortium 2012). In this context, we propose a new strategy for the detection of outliers loci in genome scans for selection and describe a method for the identification of the populations that have experienced selection at a detected region.

## Methods

### Test statistics

*F*_{ST} and FLK tests for SNPs:

Consider a set of *n* populations that evolved without migration from an ancestral population and a set of *L* SNPs in these populations. For a given SNP, let *p* = (*p*_{1},…,*p _{i}*,…,

*p*)′ be the vector of the reference allele frequency in all populations. Denoting

_{n}*p*s,

_{i}*F*

_{ST}at this SNP is given by

*F*

_{ST}quantifies the genetic differentiation between populations and is commonly used to detect loci under selection. Loci with outstanding high (resp. low) values of

*F*

_{ST}can be declared as targets of positive (resp. balancing) selection.

However, if the sampled populations have unequal effective sizes or/and are hierarchically structured, genome scans based on raw *F*_{ST} values can bias inference. For instance, a given allele frequency difference between two closely related populations should provide more evidence for selection than the same difference between two distantly related populations. To account for these drift and covariance effects when detecting loci under selection, Bonhomme *et al.* (2010) introduced the FLK statistic*p*_{0} is the allele frequency in the ancestral population and Var(*p*) is the expected covariance matrix of vector *p*, which they modeled as* _{ii}* is the expected inbreeding coefficient in population

*i*and ℱ

*is the expected inbreeding coefficient in the ancestral population common to populations*

_{ij}*i*and

*j*. The entries of the kinship matrix ℱ represent the amount of drift accumulated on the different branches of the population tree. They can be derived as a function of the divergence times and the effective population sizes along the population tree, as described in Supporting Information, File S1.

In practice, these demographic parameters are unknown and ℱ must be estimated from genome-wide data. Here, it is done as follows: first, pairwise Reynolds’ distances (Reynolds *et al.* 1983) between populations (including an outgroup) are computed for each SNP and averaged over the genome. Then, a phylogenetic tree is fitted from these distances using the neighbor-joining algorithm. The branch lengths of this tree are finally combined to compute ℱ entries. More details on this procedure can be found in Bonhomme *et al.* (2010). Given the estimation of ℱ, the unbiased estimator of *p*_{0} is obtained as*T*_{FLK}.

Under the assumption that all populations diverged simultaneously from the same ancestral population (star-like evolution) and with the same population size, ℱ is equal to *F*_{ST} over all SNPs and *I _{n}* is the identity matrix of size

*n*. In this case,

*T*

_{FLK}is equivalent to the LK statistic (Lewontin and Krakauer 1973):

#### FLK test for multiallelic markers:

Considering haplotypes as multiallelic markers, an extension of the FLK statistic in the case where each locus presents more than two alleles is required. Letting *A* be the number of alleles at a given locus, the allele frequency vector becomesand a multiallelic version of the *T*_{FLK} statistic is provided by*P*_{0} = (*p*_{10}, …, *p _{A}*

_{0})′ contains the allele frequencies of the

*A*alleles in the ancestral population. Var(

*P*) is written

*P*) corresponds to the biallelic covariance matrix for one of he

*A*alleles, while the extra-diagonal blocks arise from the covariance terms between different alleles. Similar to the biallelic case,

*P*

_{0}is estimated by

*P*) is inverted using the Moore–Penrose generalized inverse.

#### FLK test for haplotypes:

The Scheet and Stephens (2006) model summarizes local haplotype diversity in a sample through a reduction of dimension by clustering similar haplotypes together. These clusters can then be considered as alleles to compute the haplotype version of *T*_{FLK} statistic. Let *i* at marker ℓ. In the hidden markov model of Scheet and Stephens (2006), *g*, it provides for each individual *i*, marker ℓ, and cluster *k* the posterior probabilities *j* are obtained by averaging the probabilities of the *n _{j}* individuals of this population,

*i.e.*,(5)Considering clusters as alleles and population-averaged probabilities as population frequencies, the allele frequency vector of a marker ℓ isFor each marker ℓ, the multiallelic statistic

*T*

_{FLK}is computed according to Equation 3, with a small modification in the derivation of Var(

*P*). Clusters that are fitted in the present population cannot exactly be considered as real alleles that already existed in the ancestral population, as assumed by the original

*T*

_{FLK}statistic. Moreover, the generalized inverse of ℬ

_{0}⊗ ℱ was found numerically unstable for small

*P*

_{a}_{0}values, which are very common when the number of alleles is large. Consequently, the ℬ

_{0}matrix is replaced by the identity matrix

*I*in Equation 4, leading to the statistic

_{A}_{0}(Figure S1).

For the model of Scheet and Stephens (2006), parameter estimates and cluster membership probabilities are obtained using an expectation maximization (EM) algorithm. Because this algorithm converges to a local maximum, it is useful to run it several times from different starting points. Applying the model to haplotype phasing, Guan and Stephens (2008) and Scheet and Stephens (2006) observed that averaging the results from these different runs was more efficient than keeping the maximum-likelihood run, which may be due to the fact that different runs are optimal in different genomic regions. Following their strategy, we averaged the statistics obtained using Equation 6 from different EM iterations to finally obtain the haplotype extension of FLK. We denote this extension hapFLK.

The haplotype extension of the *F*_{ST} test, denoted hap*F*_{ST} in the simulation study, was obtained by replacing ℱ by *I _{n}* in Equation 6, therefore ignoring the hierarchical structure of populations.

#### Software and computational considerations:

Software implementing the hapFLK calculations is available at https://forge-dga.jouy.inra.fr/projects/hapflk. hapFLK comes with an increased computational cost compared to *F*_{ST} and FLK arising from the need to estimate the LD model on the data. The computational cost of the LD model used here, applied on unphased genotype data, is in *o*(*K*^{2}*IL*) with *K* the number of clusters, *I* the number of individuals, and *L* the number of loci on a chromosome. As an example, fitting the model on sheep chromosome 1 with 5284 SNPs for 40 clusters and 278 individuals takes about 1 hr on a single processor. In our implementation of the Scheet and Stephens (2006) model, we perform computations in a parallel fashion allowing the decrease of computational costs on multiprocessor computers.

### Simulations

To evaluate the performance of hapFLK and compare it to that of other tests, we performed a set of simulations mimicking the data obtained from dense SNP genotyping or full sequencing of samples from multiple populations. In particular, we designed our simulation to match the data produced within the Sheep HapMap Project (International Sheep Genomics Consortium 2012) (analyzed below), in terms of population divergence and SNP density.

#### Scenarios with constant size and no migration:

Two scenarios were simulated, one with two populations and the other one with four populations (Figure 1). The two-population scenario was designed to be a subtree of the four-population scenario, which allows comparison of the detection power obtained by testing the four populations jointly, with the power obtained by testing all possible pairs of populations.

The ancestral population was simulated using using *ms* (Hudson 2002), with mutation rate *μ* = 10^{−8}, recombination rate *c* = 10^{−8} (1 cM/Mb), and region length *L* = 5 Mb. The effective population size and the number of simulated haplotypes were *N*_{e} = 1000 and *n*_{h} = 4000 for the two-population case, and *N*_{e} = 2000 and *n*_{h} = 8000 for the four-population case. The generated haplotypes had ∼200 SNPs/Mb. The first two populations (top branches in Figure 1) were created independently by sampling half of the individuals from the founder population. A forward evolution of the populations after their initial divergence was then simulated with the *simuPOP* Python library (Peng and Kimmal 2005), under the Wright–Fisher model. During forward simulations, recombination was allowed but mutation was not.

Simulations were performed with and without selection. For scenarios with selection, selection occurred at a single locus, in the red branch shown in Figure 1. The selected locus was chosen as the closest to the center of the simulated region, among the SNPs with minor allele frequency equal to a predefined value (0.1, 0.05, 0.10, 0.20, or 0.30). The less-frequent allele of this SNP was given fitness 1 + *s*, with selection intensity

At the end of each simulation replicate 50 individuals were sampled from each of the final populations, and SNPs with a minor allele frequency (MAF) >*T*_{hapFLK} were computed at each SNP, assuming that the kinship matrix ℱ was known. The estimation of ℱ is very accurate for evolution scenarios with constant population size and no migration (see Bonhomme *et al.* 2010 and Figure S2). Parameters used for running the test were *K* = 5 (number of clusters) and em = 5 (number of EM runs) for the two-population scenario and *K* = 20 and em = 5 for the four-population scenario. These values were chosen for maximizing the detection power. Greater values did not improve this power, but increased computation time. For the two-population scenario, the XP–EHH statistic (Sabeti *et al.* 2007) was also computed at each SNP, using software obtained from http://hgdp.uchicago.edu/Software/.

Power of the tests was computed as follows. Ten thousand data sets were simulated under the null (neutrality) and 3000 were simulated under the alternative (selection) hypotheses, for each scenario considered. In simulations under selection, only replicates where the final frequency of the selected allele was >60% were kept. For each replicate and statistic *S*, the maximum value *S*^{max} over the 5-Mb region was recorded. This provides the distribution of *S*^{max} under the null and the alternative hypotheses. The power of a test with statistic *S*, for a given type I error *α*, is the proportion of simulations under selection for which *S*^{max} > *q _{α}*, where

*q*is the (1–

_{α}*α*)th quantile of the null distribution of

*S*

^{max}.

#### Scenarios with bottlenecks or migrations:

To study the robustness of the approach, more complex demographic events were investigated through three scenarios. They derived from the two-population scenario described above, with the following modifications:

A bottleneck in a single population: the effective size in this population was set to

*N*_{e}= 100 in the first five generations following the split and to*N*_{e}= 1852 in later generations.Asymmetric migration: at generation 51, population 1 sent 10% of migrants to population 2.

Symmetric migration: at generation 51, population 1 sent 10% of migrants to population 2 and recieved 10% of migrants from population 2.

In terms of expected drift at a single SNP, these scenario are equivalent to the constant size scenario (see SI section 1.1 for a proof). Hence, they can be used to evaluate the influence of the underlying demographic model on hapFLK, while conditioning on a fixed value of ℱ. To ensure that the ℱ matrix used in hapFLK fits the one that would be estimated from real data, 100 artificial whole genome data sets were created for each of the scenarios i–iii and used to estimate ℱ. Each artificial whole genome data set was created by simulating 500 independent genome segments of 5 Mb.

Robustness of hapFLK and XP–EHH were evaluated by comparing quantiles of each statistic obtained under bottleneck or migration demography with those obtained under a constant size evolution. No selection was applied in these simulations.

Evaluation of the detection power of hapFLK and XP–EHH under bottleneck (or migration) with selection was performed as described above; *i.e.*, distributions obtained under neutrality provided quantiles used to calibrate type I error. Because scenarios i and ii are asymmetric, each one provided two different simulation scenarios under selection, one with selection in population 1 and one with selection in population 2.

### Sheep data analysis

A whole genome scan for selection in sheep was performed using the genotype data from the Sheep HapMap Project (available at http://sheephapmap.org/download.php). The sheep HapMap data set includes 2819 animals from 74 breeds, collected in such a way that it represents most of the worldwide genetic diversity in the sheep. Genotypes at 48703 autosomal SNPs (after quality filtering) are available for these animals. Focus was placed on the North European group, all populations with <20 individuals being removed. Populations resulting from a recent admixture were also excluded because they are not compatible with the population tree model assumed for our test. Finally, the following populations were included in the analysis (sample size in parentheses): Galway (49), Scottish Texel (80), New Zealand Texel (24), German Texel (46), Irish Suffolk (55), and New Zealand Romney (24). The Soay breed was used as an outgroup for computing the ℱ matrix.

#### Parameters of the hapFLK analysis:

To determine the number of clusters to be used in the fastphase model, the cross-validation procedure of fastPHASE, which indicated an optimal number of 45 clusters, was used. As the computational cost increases quadratically with the number of clusters, and as the genome scans performed on one single chromosome for 40 and 45 clusters provided very similar results, 40 clusters were used for the rest of the analysis. A sensitivity analysis indicated that on this data set 45 EM runs were required to obtain a stable estimate of hapFLK.

#### Computation of *P*-values:

In contrast to the simulated data sets, real data do not provide null distribution allowing computation of *P*-values from the hapFLK statistics. Also, due to ascertainment bias in the SNP panel, we believe that performing neutral simulations based on an estimation of ℱ is not a good strategy for this particular data set (see the *Discussion* for more details). *P*-values were thus estimated using an empirical approach (described below) exploiting the fact that selected regions, at least those that can be captured with hapFLK, affect a small portion of the genome.

The genome-wide distribution of hapFLK appeared to be bimodal, with a large proportion of values showing a good fit to a normal distribution and a small proportion of extremely high values (Figure S3). Consequently, *P*-values were estimated as follows, First, robust estimators of the mean and variance of hapFLK were obtained, to reduce the influence of outliers. For this estimation the rlm function of the package MASS (Venables and Ripley 2002) in *R* was used. hapFLK values were then standardized using these estimates and corresponding *P*-values were computed from a standard normal distribution. The resulting distribution of *P*-values across the genome was found to be close to uniform for large *P*-values, consistent with a good fit to the normal distribution apart from the outliers that exhibit small *P*-values. Using the approach of Storey and Tibshirani (2003), the FDR estimated when calling significant hypotheses with *P* < 10^{−3} was 5%.

#### Pinpointing the selected population:

Similar to all *F*_{ST}-related tests, hapFLK detects genomic regions in which genetic data are globally not consistent with a neutral evolution, but does not directly indicate where selection occurred in the population tree. To investigate this question, branch lengths of the population tree were reestimated for each significant region, using SNPs exceeding the significance threshold. The principle was to fit (using ordinary least squares) the branch lengths to the local values of Reynolds genetic distances. For each branch the *P*-value for the null hypothesis of no difference between the lengths estimated from data in the region and in the whole genome was computed. We did this local tree estimation using either SNP or haplotype clusters frequencies. Details on the procedure are provided in File S1, section 1.3.

## Results

### Simulation results

We performed a set of simulations to evaluate and compare the performance of hapFLK and other tests (see *Methods* for more details). To present the results of these simulations, we begin with scenarios that fit the assumptions of our model: a population tree without migration and with constant size within each branch. We then move to more complex demographic scenarios, which are expected to be less favorable to our test.

#### Interest of using haplotypes over SNPs:

We first simulated data from two populations of the same effective size (Figure 1, left). In this setting, the structure-aware tests (FLK and hapFLK) are equivalent to their unaware counterparts (*F*_{ST} and hap*F*_{ST} resp.).

In simulations mimicking dense genotyping data, the use of haplotype information (hapFLK) provides more detection power than the use of single SNP tests (*F*_{ST}) (Figure 2). This holds for both hard sweeps (*p*_{0} = 0.01) and soft sweeps detection (*p*_{0} up to 0.3). XP–EHH, which also makes use of haplotype information, has more power than *F*_{ST} but less than hapFLK for hard sweeps detection. The decrease in power for soft sweeps is also more pronounced for XP–EHH (Figure 2), which is expected because XP–EHH is designed to detect the rise in frequency of one single haplotype.

Focusing on hapFLK, we further studied the evolution of the detection power as a function of the initial and final frequencies of the selected allele (Figure S5). Although soft sweeps are obviously harder to detect, there is still reasonable power to detect such events with hapFLK. For example, when the initial frequency is 20% and the final frequency is 90%, the detection power is >75%, for a type I error rate of 1%. When selection acts on mutations at initial low frequency, the detection power is relatively high (around 60%) even for incomplete sweeps with a final frequency of 50–60%.

We also compared FLK, hapFLK, and XP–EHH in simulations mimicking data arising from full sequencing or imputation from a sequenced reference panel. This increase in marker density results in a greater power for all tests (Figure S6). In this setting, FLK is the most powerful. This comes from the fact that the selected SNP, where the allele frequency difference between populations is expected to be the largest, is always included in the sample in this simulation setting. In contrast, the selected SNP itself is often missing when analyzing genotyping data, and information concerning this SNP is then better captured by haplotypes than by single neighboring SNPs. All results below were obtained on simulations mimicking dense genotyping data.

#### Hierarchical structure of populations:

We then considered a four-population sample, where populations are hierarchically structured (Figure 1, right). This allows to compare hapFLK with related tests accounting for population structure only (FLK), haplotype information only (hap*F*_{ST}), or none of these features (*F*_{ST}). As expected, the least powerful approach in this scenario is the classical *F*_{ST}. The gain in power provided by using a haplotype-based approach is of similar size to that provided by accounting for population structure. Finally, combining the two within a single statistic (hapFLK) results in an even greater power gain (Figure 3). This result holds for initial frequencies >1% although the difference between haplotype and single SNP tests tend to decrease with increasing initial frequencies (Figure S4).

A classical approach for selection scans based on more than two populations is to test pairs of populations. It is, for instance, the only possible option for selection scans based on XP–EHH. To evaluate the interest of this pairwise strategy, we compared the detection power obtained by applying hapFLK on pairs of populations or on the four populations jointly and found that testing all pairs of populations is always less powerful (Figure S7). Since XP–EHH also has less detection power than hapFLK in the two-population scenario, we can expect that applying hapFLK using the four populations jointly will be much more efficient than applying XP–EHH on pairs of populations.

#### Robustness and power of hapFLK in complex demographic scenarios:

The model underlying hapFLK is that of pure drift evolution, with constant population size in each branch of a population tree with no admixture. These assumptions are made (i) when estimating the population covariance matrix ℱ and (ii) when assuming allele frequency differences (either SNP or haplotype) are due only to ℱ. We studied the robustness of hapFLK in presence of admixture or bottleneck events by examining separately their consequences on (i) the estimation of the ℱ matrix and (ii) the distribution of the hapFLK statistic. For this, we simulated the evolution of two populations with a bottleneck in one of the populations, migration from one population to the other, or migrations between both populations (see *Methods* for details).

The estimation of the ℱ matrix is slightly affected by demographic events (Figure S2). When one of the populations has experienced a severe bottleneck (reduction in size by a factor 10), the estimated branch length for this population is increased by 10%. In the presence of migrations between populations, the two branches remain of the same length but the Reynolds genetic distance between the two populations is smaller than it should be (5% smaller in the one way migration case and 10% smaller in the two-way migration case).

Using this information we were able to perform simulations under pure drift evolution or bottleneck/migration evolution that led to the same *estimated* ℱ matrix. As hapFLK is conditioned on this estimate, this approach allows evaluation of the effect of demographic events on the statistic, while integrating out their effect on ℱ. We found that the distribution of hapFLK was not greatly affected by deviations from pure drift evolution, on par with XP–EHH (Figure S8). Overall, these results show that while the estimate of ℱ can be affected by deviation from the evolution model, and therefore coefficients in ℱ must not be interpreted too literally, the distribution of hapFLK conditioned on this estimate is robust. In addition, the power of hapFLK is only slightly reduced under migration scenarios and unchanged under a bottleneck scenario (Figure S9).

### Application to the sheep Hapmap data set

To provide an insight into the advantages and issues of using hapFLK on real data, we provide an example of application to a subset of the data from the Sheep HapMap Project. In sheep populations, drift accumulates rapidly, due to their small effective size, typically a few hundred individuals (International Sheep Genomics Consortium 2012). As little power is expected from analyses based on genetic differentiation if populations are too distant, we focused on a group of relatively closely related breeds from Northern European origin. Six populations are included in this group, whose population tree is shown in Figure 5 (top left).

The genome scan performed with FLK provides little evidence for any sweep in these data, with *P*-values of the order of 10^{−4}, a hardly convincing figure, seen only on chromosomes 2 and 14. This is in great contrast (Figure 4) to the genome scan with hapFLK, which identifies seven genome-wide significant regions (Table 1), consistent with the additional power provided by hapFLK on simulated data sets. For each of these regions, we identified the population(s) under selection by reestimating the local population trees and comparing it to the tree estimated from whole genome data (see *Methods* for more details).

Figure 5 shows local trees for the two largest signals, on chromosome 2 and 14 (local trees for the other significant regions are provided in Figure S10).

The most significant selection signature (region 1 in Table 1) corresponds to a 17-Mb region in chromosome 2. Selection occurred in the three Texel breeds, most likely acting on the myostatin gene *GDF*-8, which is located in the middle of the region. Texel sheep carries a mutation in this gene, which contributes to muscle hypertrophy (Clop *et al.* 2006), a strongly selected trait in these populations. Although the mutation was discovered in Belgian Texels, our results imply that it must be present in these other Texel populations. SNPs within this region are almost fixed in the three Texel populations (Figure 6), indicating a hard sweep signal. However, even in this “easy” case, using haplotype information makes the detection signal more interpretable: while *FLK* exhibits only moderate *P*-value decrease in the region, from which no clear conclusion concerning the selected site position can be drawn, hapFLK provides a continuous and strong signal covering the whole region and almost centered on the selected site. The local tree exhibits a large increase in branch length in the branch ancestral to the three Texel populations and reduced branch length between Texel populations (Figure 5). This is consistent with a shared selection event predating the split between populations. Finally, the example of region 1 also illustrates that our test can detect selection signatures that are shared by several populations, which we did not formally test in the simulations. In contrast, to detect this region with a *F*_{ST} genome scan, based on single SNP tests, International Sheep Genomics Consortium (2012) had to group the Texel breeds and test them against all other populations.

In contrast to the selection signature around *GDF*-8, the second most significant region (region 5, on chromosome 14) shows no evidence of a hard sweep (Figure 4) and cannot be identified using the single marker FLK test. The local tree (Figure 5) computed using SNP data exhibits slightly increased branch lengths, whereas the local tree computed using haplotype clusters presents very strong evidence for selection in two breeds: the New Zealand Texel and the New Zealand Rommey, together with reduced haplotype diversity (Figure S14). These two breeds are not historically closely related (Figure 5, top left), but both have been imported in New Zealand (in 1843 and 1991, respectively). The selection signature could thus be due to a common recent selection pressure on the two breeds in the last decades. This would be consistent with the relatively modest frequency of the selected clusters and the fact that these selected clusters are different in the two breeds, suggesting that selection started on different haplotype backgrounds. One possible underlying trait associated with this selection signal is resistance to nematode-like parasites, an important disease affecting sheep in New Zealand. Two studies (Hacariz *et al.* 2009; Matika *et al.* 2011) found evidence for association between genetic polymorphism and parasite resistance related traits in this region of the genome in Texel breeds. Matika *et al.* (2011) also found these polymorphisms associated with muscle depth. While the functional basis of these two effects is still unclear (pleiotropy, linkage disequilibrium with growth factors), it is possible that animal fitness in this region is related to multilocus haplotypes rather than to single SNPs.

We point the reader interested in details for all significant regions in Table 1 to the supporting information. In particular, allele and haplotype cluster frequencies are provided in Figure S11, Figure S12, Figure S13, Figure S14, Figure S15, and Figure S16 and local trees in Figure S10. An alternative approach for pinpointing the selected population(s) is also described (section 1.2, File S1) and applied to these regions (section 2.3, File S1; Figure S17, Figure S18, Figure S19, Figure S10, Figure S20, and Figure S21).

## Discussion

### Haplotype *vs.* single marker differentiation tests

For the analysis of dense genotyping data, where the selected site itself is generally not observed, we show that using haplotypes rather than single SNPs greatly improves the detection power of selection signatures. An intermediate approach between single SNP and haplotype-based tests consists in gathering multiple consecutive loci within sliding windows into a single *windowed* statistic (*e.g.*, Browning and Weir 2010; Weir *et al.* 2005). However, in our simulation study, we found this approach to be less powerful at detecting selection (Figure S23). In the case of sequencing data, we found that a single SNP test was more powerful than hapFLK, consistent with previous results of Innan and Kim (2008), who found in a similar setting that an haplotype-based *F*_{ST} was less powerful than a single locus one. However, our simulations involved a single selected site and in many real situations, selection will act rather effectively on multilocus haplotypes (Pritchard *et al.* 2010), due, for instance, to recurrent mutations affecting the same gene, or to polygenic selection. We expect haplotype-based tests to be more powerful in such situations, which according to us justifies their use also for the analysis of sequencing data. In the particular case of low coverage resequencing, which is becoming a common experimental design in population genetics, this analysis will have to account for the additional uncertainity in genotype estimation, but we believe this can easily be tackled by the clustering algorithm used for hapFLK.

### Different strategies for the inclusion of haplotype information in differentiation

To extend the single marker FLK and *F*_{ST} tests into haplotype based tests, we estimate local haplotype clusters from genotype data and consider these estimated clusters as alleles. Using a multipoint model for linkage disequilibrium (LD) in this context has several advantages. First, haplotypes are generally unknown and must be inferred from genotypes, which typically relies on a model for LD such as Scheet and Stephens (2006). Using directly the model parameters as we do has the advantage of allowing us to average over the uncertainty in the distribution of possible haplotypes rather than using a best guess that is known to include errors (Marchini *et al.* 2006). On a more practical side, hapFLK can be computed on unphased genotype data that are common in population genetics studies. Second, because the model of Scheet and Stephens (2006) is a hidden Markov model, it naturally accounts for variation in LD patterns along the chromosome and alleviates the need to use windowing approaches, which have notorious difficulties accounting for this variation. Finally, several similar haplotypes may be associated to the same selected allele, and treating them independently should affect the detection power of the tests. In the Scheet and Stephens model, similar haplotypes are clustered together and will be considered as a single allele.

Other haplotype-clustering models, for instance, Beagle (Browning 2006), could certainly be used for constructing hapFLK. For example, the pattern of haplotype frequencies around the *LCT* gene in human populations was studied using either fastPHASE (Jakobsson *et al.* 2008) or Beagle (Browning and Weir 2010), and a strong evidence for selection in Europe was observed in both cases. However, to go beyond these observations and build a formal statistical test for selection, it is important to realize that the distribution of hapFLK (or hap*F*_{ST}) depends on the number of clusters used to model haplotype diversity. This number is fixed in fastPHASE but variable along the genome in Beagle. As this variation might be due to natural selection, but also to other effects such as variations in recombination or mutation rate, further studies would be required to evaluate the influence of using different clustering algorithms on the detection power.

Another important feature of hapFLK is its ability to account for the hierarchical structure of the sampled populations, arising from their evolutionary history within the species. FLK was already shown to be more powerful than the *F*_{ST} test in many simulated scenarios (Bonhomme *et al.* 2010). It was also compared to the Bayesian differentiation test of Foll and Gaggiotti (2008) in one simulated scenario with hierarchically structured populations and again provided more detection power. Consequently, we expect that hapFLK will also perform better than other haplotype-based differentiation tests (Guo *et al.* 2009; Browning and Weir 2010; Gompert and Buerkle 2011) for hierarchically structured populations.

To build tests that account for both the differentiation between populations and haplotype structure, all methods discussed above propose including haplotype information into single-marker differentiation tests. Another popular strategy, developed in the XP–EHH (Sabeti *et al.* 2007) and Rsb (Tang *et al.* 2007) statistics, is to compute a statistic quantifying the excess of long haplotypes within each population and to contrast this statistic among pairs of populations. Simulating a two-population sample, we found that XP–EHH and hapFLK had relatively similar power for hard sweep detection. However, one important difference was that hapFLK maintained some power for soft sweep detection, in contrast to XP–EHH. Interestingly, the combination of XP–EHH, FLK, and hapFLK allows a slight increase of power in the two-population simulations (File S1, section 1.5 and Figure S22), indicating that these different statistics do not capture exactly the same patterns in the data.

When more than two populations are sampled, comparing only pairs of populations raises a multiple testing issue leading to a significant decrease in power (Figure S7). Besides, computing a single test at the meta-population level seems more appropriate for several reasons. First, the signals we detected in sheep suggest that favorable alleles are often positively selected in several populations, either closely (region 1) or distantly related (region 5). Second, our ability to detect loci under selection depends on our ability to estimate the allele frequencies in this common ancestral population, which is clearly improved when using all populations simultaneously. One potential difficulty arising from our meta-population approach is the identification of the population(s) under selection, which is more difficult than when comparing pairs of populations. We proposed addressing this question using a local reestimation of the population tree, as illustrated in the sheep Hapmap data analysis. An alternative approach, which is based on a spectral decomposition of hapFLK, is also described in the supporting information and applied to the sheep data.

### Robustness of hapFLK and computation of *P*-values in a general situation

In many genome scans for selection, all loci above a given empirical quantile of the test statistic are considered as potential targets of selection. However, this so-called outlier approach does not allow control of the false-positive rate and can be inefficient in many situations (Teshima *et al.* 2006). To overcome this limitation and quantify the statistical significance of selection signatures, one must describe the expected distribution of the test statistic under neutral evolution, which depends on the demographic history of the sampled populations. In the case of hapFLK, this neutral distribution could be estimated by (i) fitting the kinship matrix ℱ from genome-wide SNP data and (ii) simulating neutral samples conditional on ℱ, using a simple model with no migration and constant population size along each branch of the population tree. This approach avoids estimation of a full demographic model for the sampled populations and was found to be robust to bottlenecks or to intermediate levels of migration/admixture (Figure S8). For the analysis of samples involving stronger departures from the hierarchical population model assumed in this study (for instance, with hybrid populations), the expected covariance matrix of allele frequencies could also be modeled using relaxed hypotheses. The strategies used in Bayenv (Coop *et al.* 2010) or TreeMix (Pickrell and Pritchard 2012) could, for instance, be adapted to the application of hapFLK.

However, in many situations (*e.g.*, in the sheep HapMap data), the neutral distribution of hapFLK is not only affected by demography, but also by SNP ascertainment bias. Simulating the ascertainment process is in general difficult: for example, in the sheep data it involves animals from a large panel of worldwide populations (International Sheep Genomics Consortium 2012). For single SNP tests such as FLK, this ascertainment issue can be circumvented by estimating a neutral distribution for several bins of the allele frequency in the ancestral population (Bonhomme *et al.* 2010), because we can assume that the only effect of SNP ascertainment is to bias the allele frequency distribution. But this strategy is not applicable to haplotype-based tests, for which the effect of SNP ascertainment is more complex. We consequently proposed a more empirical approach, in which the null distribution of hapFLK is directly estimated from the data using an estimator that is robust to outlier values. This empirical approach might be useful in future genome scans for selection, even if they are based on different test statistics than hapFLK, but its validity will depend on each particular data set and needs to be checked carefully by looking at the *P*-value distribution (see *Methods* for more details).

The most significant selection signatures detected in sheep using hapFLK exhibit extremely small *P*-values (down to 10^{−13}), while the smallest *P*-values obtained with FLK for the same data set were of order 10^{−4}. This difference of magnitude might be artificially inflated by the fact that we compute hapFLK *P*-values using a normal distribution, and FLK *P*-values using a chi-square distribution. However, we note that the choice of these distributions is supported by the data. Besides, we found that FLK *P*-values in simulated samples with selection using a chi-square distribution can go down to at least 10^{−11} (data not shown). We thus believe that the *P*-value difference observed in sheep reflects the fact that hapFLK is much more powerful than FLK, especially for SNP data where ascertainment bias leads to remove SNPs with extreme allele frequencies.

### Soft or incomplete sweeps

While genome scans for selection have historically focused on hard sweeps, several recent studies have pointed out the importance of soft sweeps in the evolution of populations (Pritchard *et al.* 2010; Hernandez *et al.* 2011) and described the genomic signature of these selection scenarios (Hermisson and Pennings 2005). We tested hapFLK for initial frequencies of the favorable allele up to 30% and found that reasonable power could be achieved also in this situation. The detection of incomplete sweeps is another important issue, which has not been much tackled in the literature. Detecting selected alleles at intermediate frequency is almost impossible with methods based on the allele frequency spectrum and very difficult with EHH- or *F*_{ST}-based existing approaches. In contrast, hapFLK is quite powerful in the case of incomplete sweeps, and several of the selection signatures detected in the sheep HapMap data correspond to intermediate frequencies of the selected haplotype (see Figure S11, Figure S15, and Figure S16).

Few hard sweeps were actually detected in the sheep data, although they are easier to detect than soft sweeps. This might be due to the short divergence time between these populations (a few hundred generations), which would limit the rise in frequency of favorable alleles. On the other hand, artificial selection has been associated with strong selection intensities, especially in the last decades, which should compensate for the short evolution time. One alternative explanation could be the variation of the selection intensity or direction over time, due to changes in agronomical objectives (*e.g.*, in the sheep from wool to meat production) or importations of animals in a new environment (*e.g.*, in the Texel and Romney breeds from Europe to New Zealand). The small number of hard sweeps can also be explained by the fact that artificial selection on quantitative traits is in general polygenic.

As a final and general remark on all methods aiming at discovering positive selection, selective constraints in functional and nonfunctional regions are probably more complex than what is usually simulated (with purifying and background selection, polygenic selection, balancing selection, etc). Definitely more research effort needs to be done on these aspects.

### Conclusions

Overall, our study demonstrates that using haplotype information in *F*_{ST}-based tests for selection greatly increases their detection power. Consistent with several recent other studies (Excoffier *et al.* 2009; Bonhomme *et al.* 2010; Coop *et al.* 2010), it also confirms the importance of analyzing multiple populations jointly, while accounting for the hierarchical structure of these populations. The new hapFLK statistic, which combines these two features, can detect a wide range of selection events, including soft sweeps, incomplete sweeps, sweeps occurring in several populations, and selection acting directly on haplotypes.

## Acknowledgments

We thank Michael Blum and Lucia Spangenberg for useful comments on the manuscript and Carole Moreno for earlier access to the sheep HapMap data. We thank the reviewers for their helpful comments, in particular the suggestion to construct local population trees in regions under selection. The ovine SNP50 HapMap data set used for the analyses described was provided by the International Sheep Genomics Consortium (ISGC) and obtained from http://www.sheephapmap.org in agreement with the ISGC Terms of Access. The simulations and data analysis were performed on the computer cluster of the bioinformatics platform Toulouse Midi-Pyrenees.

## Footnotes

*Communicating editor: J. Hermisson*

- Received October 26, 2012.
- Accepted December 14, 2012.

- Copyright © 2013 by the Genetics Society of America