## Abstract

The recent advent of high-throughput sequencing and genotyping technologies makes it possible to produce, easily and cost effectively, large amounts of detailed data on the genotype composition of populations. Detecting locus-specific effects may help identify those genes that have been, or are currently, targeted by natural selection. How best to identify these selected regions, loci, or single nucleotides remains a challenging issue. Here, we introduce a new model-based method, called SelEstim, to distinguish putative selected polymorphisms from the background of neutral (or nearly neutral) ones and to estimate the intensity of selection at the former. The underlying population genetic model is a diffusion approximation for the distribution of allele frequency in a population subdivided into a number of demes that exchange migrants. We use a Markov chain Monte Carlo algorithm for sampling from the joint posterior distribution of the model parameters, in a hierarchical Bayesian framework. We present evidence from stochastic simulations, which demonstrates the good power of SelEstim to identify loci targeted by selection and to estimate the strength of selection acting on these loci, within each deme. We also reanalyze a subset of SNP data from the Stanford HGDP–CEPH Human Genome Diversity Cell Line Panel to illustrate the performance of SelEstim on real data. In agreement with previous studies, our analyses point to a very strong signal of positive selection upstream of the *LCT* gene, which encodes for the enzyme lactase–phlorizin hydrolase and is associated with adult-type hypolactasia. The geographical distribution of the strength of positive selection across the Old World matches the interpolated map of lactase persistence phenotype frequencies, with the strongest selection coefficients in Europe and in the Indus Valley.

IN the new era of population genomics, surveys of genetic polymorphism (“genome scans”) offer the opportunity to distinguish locus-specific from genome-wide effects at many loci (Black *et al.* 2001). Reliable inference of demography and phylogenetic history depends on being able to identify putative neutral regions of the genome, which are assumed to be influenced by genome-wide effects only (Ross *et al.* 1999). Conversely, detecting locus-specific effects may help identify those genes that have been, or still are, targeted by natural selection (Luikart *et al.* 2003). Such genes may be involved, for example, in the adaptation to new environments or in the arms race with pathogens (Nielsen 2005). The applications for population genomic analyses therefore cover a wide range of disciplines. Although some progress has been made (Nielsen 2001), the problem of how best to identify regions, loci, or single nucleotides that have been, or are currently, targets of selection remains challenging.

Tests of selective neutrality have been developed for samples drawn from single populations. Most of them are based on the comparison of some summary statistics of the site-frequency spectrum (*i.e.*, the observed distribution of gene frequencies) to their expected distribution from diffusion theory under an infinitely many sites mutation model (Bustamante *et al.* 2001; Payseur *et al.* 2002; Nielsen *et al.* 2005b; Williamson *et al.* 2005). Accounting for different classes of markers (*e.g.*, selected and neutral) is achieved by a Poisson random field (PRF) approximation (Sawyer and Hartl 1992), which assumes independent mutation and selection parameters across sites (see, *e.g.*, Kim and Stephan 2002; Bustamante *et al.* 2003). In particular, Bustamante *et al.* (2003) developed a hierarchical PRF model that allows the estimation of selection coefficients at a set of DNA polymorphisms sampled in a single population (see also Nielsen *et al.* 2005a). Williamson *et al.* (2005) used a similar approach to infer selection in a nonequilibrium demographic model. Yet, they assume *a priori* which mutations are selectively neutral and which are not. The putatively neutral class of markers is then used to infer demographic parameters and, given these estimates, inferences regarding selection are performed on the other class of markers.

Other tests of selective neutrality are based on haplotype structure. Focusing on haplotypes at a locus of interest (referred to as “core haplotypes”), Sabeti *et al.* (2002) analyzed the decay of gene identity as a function of distance from the core, as measured by the extended haplotype homozygosity (EHH). Core haplotypes that have both a high population frequency and a high EHH are evidence of recent positive selection. Deriving the expected distribution of the EHH requires making strong assumptions about the underlying population history, however, which makes it difficult to evaluate the significance of observed values. Several extensions have therefore been proposed and adapted to genome-wide scans of single nucleotide polymorphism (SNP) data (reviewed in Gautier and Vitalis 2012), based on the empirical distribution of EHH-like statistics either for single populations (Voight *et al.* 2006) or for pairs of differentiated populations (Tang *et al.* 2007). Such approaches therefore rely on the assumption that most SNPs behave neutrally, so that the observed distribution of the statistics provides a proxy to the null distribution.

When markers are genotyped across multiple populations, it has been advocated that signatures of natural selection may simply be identified in the extreme tails of the empirical distribution of *F*_{ST} estimates (Goldstein and Chikhi 2002). The rationale is that loci that are involved in adaptation to local environmental conditions are expected to show unusually high levels of differentiation among populations. Conversely, loci that are under balancing selection are expected to show unusually low levels of differentiation. These model-free approaches are highly computationally efficient and, therefore, have early been applied to large data sets of tens to hundreds of thousands of SNPs, such as the Perlegen (Hinds *et al.* 2005) and the HapMap (International HapMap Consortium 2003, 2005) data in humans (see, *e.g.*, Akey *et al.* 2002; Weir *et al.* 2005; Barreiro *et al.* 2008). Model-free approaches implicitly assume that most of the markers analyzed are selectively neutral, however, and the choice of a decision criterion to characterize “outlier loci” is fairly subjective. These methods are intended to be immune to arbitrary assumptions about the (unknown) demographic history of the sample. Dependence upon the unknown demography (including the geographic and historical relationship among populations) was indeed a severe criticism of Lewontin and Krakauer’s model-based test of selective neutrality (Lewontin and Krakauer 1973), which relies on the sampling distribution of the parameter *F*_{ST} (Robertson 1975; Nei and Maryuyama 1975).

Refinements of this controversial test (see, *e.g.*, Beaumont and Nichols 1996) were based on the properties of gene genealogies in structured populations (Beaumont 2005), which, in many cases, tend toward a simple structure (Nordborg 1997; Wakeley 1999). In this simple structure, the recent genealogy of genes in each local population can be separated from the ancestral genealogy of the whole metapopulation (Wakeley 2004). If this separation-of-timescale approximation holds, the gene frequency distribution in each local population may be approximated as a Dirichlet-multinomial distribution (Balding and Nichols 1995; Balding 2003). Otherwise, it may be necessary to model demography explicitly (Nielsen *et al.* 2009), to restrict the analysis to pairwise comparisons (Vitalis *et al.* 2001), or to account for the covariance matrix of the population allele frequencies (Bonhomme *et al.* 2010; Coop *et al.* 2010; Frichot *et al.* 2013; Günther and Coop 2013; Guillot *et al.* 2014). Several likelihood-based approaches that take advantage of the Dirichlet-multinomial distribution of gene frequencies have been proposed, generally within a Bayesian framework. In particular, Beaumont and Balding (2004) proposed decomposing *F*_{ST} into locus-, population-, and locus-by-population components in a hierarchical model. Their model offered a new way to formulate the problem of inferring which loci are targets of selection. This framework was further extended by Riebler *et al.* (2008), Foll and Gaggiotti (2008), Guo *et al.* (2009), and Gompert and Buerkle (2011). Yet, these approaches (as most *F*_{ST}-based methods aimed at looking for locus-specific effects on *F*_{ST} estimates; see Beaumont and Nichols 1996) are typically not designed to identify population-specific selection. Furthermore, many species have a hierarchical structure, where a subset of populations share a recent ancestry, or exchange more migrants, relative to the full species range. For such complex structures, extensions of the Beaumont and Nichols (1996) test or the classical Lewontin and Krakauer (1973) test have been proposed (see Excoffier *et al.* 2009; Bonhomme *et al.* 2010, respectively). As an alternative to models based on the Dirichlet-multinomial distribution of gene frequencies, Coop *et al.* (2010) developed an extension to the truncated Gaussian model proposed by Nicholson *et al.* (2002), which accounts for the pattern of covariance in allele frequencies between populations (see also Frichot *et al.* 2013; Günther and Coop 2013; Guillot *et al.* 2014). The resulting multivariate normal distribution of gene frequencies allows testing a linear effect of some environmental variable on the allele frequency at some loci (Hancock *et al.* 2010, 2011; Frichot *et al.* 2013).

A major limitation of the methods based on comparisons among populations, however, is that they do not quantify selection. Rather, they are constructed as tests of departure from selective neutrality (Gautier *et al.* 2010). One exception, though, lies in Bazin *et al.* (2010), where the effective migration rate at a marker locus is expressed as a function of the selection coefficient at a positively selected locus and the recombination rate between the two (see Petry 1983). While neutrality is a convenient null hypothesis, a proper interpretation of the observed patterns of variability, in particular the extent to which the neutral model is applicable, requires methods that rely on nonneutral models (see, *e.g.*, Donnelly *et al.* 2001). Furthermore, proper tests of selection should provide estimates of the parameters of interest, *i.e.*, the strength and the type of selection acting on segregating polymorphisms.

Here, we provide a new method, called SelEstim, to distinguish neutral from selected polymorphisms and estimate the intensity of selection at the latter. Our model accounts explicitly for positive selection, and we suppose that all marker loci in the data set are responding to selection, to some extent. The method is based on a diffusion approximation for the distribution of allele frequency in a population subdivided in a number of demes that exchange migrants (*i.e.*, an island model; see Wright 1931). The framework for statistical inference from this model consists in a hierarchical Bayesian model (see Gelman *et al.* 2004). We use a componentwise Markov chain Monte Carlo (MCMC) algorithm to sample from the joint posterior distribution of the model parameters. We then evaluate the performance of SelEstim, by means of stochastic simulations. Last, we reanalyze a subset of SNP data from the Centre d'Etude du Polymorphisme Humain (CEPH) Human Genome Diversity Panel (HGDP) (Cann *et al.* 2003) to illustrate the applicability of SelEstim on real data.

## Model

### Assumptions

We consider an infinite island model where the *i*th deme consists of *N _{i}* diploid individuals and receives immigrants from the whole population at rate

*m*. We define the scaled migration parameter in the

_{i}*i*th deme as

*M*≡ 4

_{i}*N*. We consider biallelic markers,

_{i}m_{i}*i.e.*, that only two alleles (denoted by

*A*and

*a*) may occur at a given locus. We denote by

*p*the frequency of allele

_{ij}*A*in deme

*i*at locus

*j*and by

*π*the frequency of allele

_{j}*A*at the

*j*th locus in the whole population. Since we consider that the population as a whole is made of an infinite number of islands,

*π*gives the frequency of allele

_{j}*A*in the pool of migrant individuals. The following notations are used hereafter: the vector of allele frequencies in deme

*i*at locus

*j*is

**p**

*≡ (*

_{ij}*p*, 1 −

_{ij}*p*), and the vector of allele frequencies at locus

_{ij}*j*among migrants is

**π**

*≡ (*

_{j}*π*, 1 −

_{j}*π*). We consider a simple genic model of selection where, at each locus, the allele

_{j}*A*provides a selective advantage. The homozygote individuals

*AA*and the heterozygotes

*Aa*have a relative increase of fitness of 1 +

*s*and 1 +

_{ij}*s*/2, respectively, as compared to the

_{ij}*aa*homozygotes. We define the scaled coefficient of selection in deme

*i*at locus

*j*as

*σ*≡ 2

_{ij}*N*. We define the indicator variable

_{i}s_{ij}*κ*, which takes the value

_{ij}*κ*= 0 if allele

_{ij}*A*is selected for, and

*κ*= 1 if allele

_{ij}*a*is selected for. Therefore, the frequency of the selected allele in deme

*i*at locus

*j*reads

The data consist of individuals collected in a set of *n*_{d} demes and genotyped at *L* loci. We denote by *n _{ij}* the total number of genes sampled in the

*i*th deme at the

*j*th locus, out of which

*x*have allelic state

_{ij}*A*. The vector of allele counts in deme

*i*at locus

*j*therefore reads

**n**

*≡ (*

_{ij}*x*,

_{ij}*n*−

_{ij}*x*).

_{ij}Given the frequencies *p _{ij}* of allele

*A*, the conditional distribution of allele counts

**n**

*in population*

_{ij}*i*at locus

*j*is binomial: (1)In the limit of large deme size, as

*N*→ ∞, and assuming that selection and random genetic drift are of comparable strength (

_{i}*i.e.*, that

*M*and

_{i}*σ*have a finite limit as

_{ij}*N*→ ∞), the distribution of the

_{i}**p**

*may be approximated by the stationary density of a diffusion process, which has the form (2)(Wright 1949; Barton and Turelli 1987; Ethier and Nagylaki 1988; Bürger 2000). In Equation 2,*

_{ij}*C*is the constant that ensures that the distribution integrates to 1. This constant can be evaluated as (3)where

_{1}

*F*

_{1}(

*a*;

*b*;

*z*) is the confluent hypergeometric, or Kummer’s, function (see,

*e.g.*, Abramowitz and Stegun 1965, p. 504), and .

Given the model specified in Equations 1 and 2, we are interested in evaluating the parameters of interest , **π** ≡ (*π*_{1}, …, *π _{j}*, …,

*π*), , and , from the observed allele counts

_{L}**n**over all sampled demes and loci. The directed acyclic graph (DAG) for this hierarchical-Bayesian model is shown in Figure 1.

We assume a Bernoulli prior distribution for the parameters *κ _{ij}*,

*i.e.*,

*κ*∼ Ber(0.5), and a uniform prior for the

_{ij}**π**

*’s, that is*

_{j}**π**

*∼ Beta(1, 1). We further assume a log-uniform prior for the*

_{j}*M*’s with support from 0.001 to 10,000;

_{i}*i.e.*, the priors of the

*M*’s are uniform in log scale: log(

_{i}*M*) ∼ u(log(10

_{i}^{−3}), log(10

^{4})). The prior distributions for the selection coefficients

*σ*(at each locus, in each deme) are modeled hierarchically (see,

_{ij}*e.g.*, Gelman

*et al.*2004, pp. 124–125). In particular, we assume that

*σ*has an exponential prior distribution that depends upon the locus-specific hyperparameter

_{ij}*δ*, which represents the average effect of selection at locus

_{j}*j*(over all demes). We further assume that this hyperparameter

*δ*has an exponential prior distribution

_{j}*f*(

*δ*|

_{j}*λ*) ∼ exp(

*λ*

^{−1}) that depends, in turn, upon the hyperparameter

*λ*, which represents the genome-wide effect of selection over all demes and loci. Last, we assume that the prior distribution of

*λ*is

*f*(

*λ*) ∼ exp(Λ

^{−1}), with Λ = 1.0, in what follows. Assuming independence of allele frequencies among loci and populations, the posterior distribution of the parameters

*f*(

**M**,

**π**,

**σ**,

**κ**,

**δ**,

*λ*|

**n**),

*i.e.*, the conditional distribution of the parameters

**M**,

**π**,

**κ**,

**σ**,

**δ**, and

*λ*given the data

**n**, depends upon the prior distributions of the parameters and the data as

Although in Equation 4 the likelihood from Equation 1 can be integrated analytically over the distribution of unknown population frequencies given by Equations 2 and 3, we found that it increases the computational burden significantly. This is so because additional gamma and confluent hypergeometric functions are then introduced in the likelihood function ℒ′(*M _{i}*,

**π**

*,*

_{j}*κ*,

_{ij}*σ*,

_{ij}**n**

*).*

_{ij}We implemented a single-component Metropolis–Hastings (or Metropolis within Gibbs) algorithm (see, *e.g.*, Ntzoufras 2009) to sample from the joint posterior distribution of *f*(**M**, **π**, **κ**, **σ**, **δ**, *λ*| **n**), which is specified by Equation 4. In practice, we therefore update one parameter at each time, iteratively, as detailed in Supporting Information, File S1. The proposal distributions for each of the **M**, **π**, **κ**, **σ**, **δ**, and *λ* parameters are adjusted by means of 25 short pilot runs of 1000 iterations to obtain acceptance rates between 0.25 and 0.40 (see, *e.g.*, Gilks *et al.* 1996).

The software package implementing the model described here, called SelEstim, is available at http://www1.montpellier.inra.fr/CBGP/software/selestim. All the postprocessing statistical analyses were performed using the R software environment for statistical computing, v. 3.0.1 (R Core Team 2013).

### Identifying loci under selection from the MCMC outputs

Because the model assumes that each and every locus in a data set is selected to a certain extent, we are particularly interested in the posterior densities of the locus-specific hyperparameters *δ _{j}*: we expect the density to be shifted toward zero for neutral markers and to positive values for (presumably) selected loci (see Figure 2). Yet, given the hierarchical structure of our model, it would not be sufficient to simply test whether, at a particular locus, the posterior distribution of

*δ*departs from zero. This approach would neglect the genome-wide effects of selection. Since we assume in our model that the

_{j}*δ*’s are drawn independently from a common hyperdistribution with parameter

_{j}*λ*(which represents the genome-wide effect of selection), it is indeed more appropriate to compare the posterior distributions of the locus-specific coefficients of selection with the “centering” distribution derived from the hyperdistribution of the genome-wide effect of selection. This centering distribution is a good choice, rather than, for example, a point mass at zero, because it leads to consistent estimates (the hyperprior will shrink toward zero with more data). Furthermore, as becomes apparent, it provides some, albeit very weak, power to identify loci under balancing selection.

Following Guo *et al.* (2009), we consider the following steps to detect outlier loci in a data set: (i) approximate the posterior distributions of the locus-specific selection hyperparameters (*δ _{j}*); (ii) compare each of these distributions to a “centering” distribution derived from the hyperdistribution with parameter

*λ*that describes the among-locus variation in the locus-specific effect of selection; and last (iii) measure the mean of the posterior distributions of

*σ*for outlier loci over sampled populations to summarize the distribution of selection effects across sampling locations.

_{ij}Our preliminary analyses suggested that the posterior distribution of the parameters *δ _{j}* is unimodal. Furthermore, its support is on [0, ∞), by definition of

*δ*. For subsequent analyses, we approximate the posterior distributions of the

_{j}*δ*’s by a gamma distribution with the same mean and variance as estimated from the MCMC sample. In doing so, we assume that slight discrepancies in higher-order moments will not affect the comparison of these distributions with their centering distribution. In the following, we therefore consider that the posterior distribution of

_{j}*δ*is Γ(

_{j}*k*

_{0},

*θ*

_{0}) with and , where and are the mean and the variance, respectively, of the posterior distribution of

*δ*, as estimated from the MCMC outputs. Since we assumed an exponential prior distribution for the hyperparameters

_{j}*δ*,

_{j}*i.e.*,

*f*(

*δ*) ∼ exp(

_{j}*λ*

^{−1}), or equivalently

*f*(

*δ*) ∼ Γ(1,

_{j}*λ*), we defined the centering distribution of

*δ*as Γ(1,

_{j}*θ*

_{1}), where is the posterior mean of

*λ*, as estimated from the MCMC outputs. We expect that the stronger the intensity of selection at locus

*j*, the larger the departure of the posterior distribution of

*δ*from the centering distribution. We use the Kullback–Leibler divergence (KLD) to measure the distance of the posterior distribution of

_{j}*δ*from the centering distribution. The KLD of a distribution with density

_{j}*f*(

*x*) from a distribution with density

*g*(

*x*) is defined as (5)With a little algebra, one can show that the KLD of a gamma distribution with shape and scale parameters (

*k*

_{0},

*θ*

_{0}) from a gamma distribution with shape and scale parameters (1,

*θ*

_{1}) is given by (6)where is the digamma function. Note that as an alternative to these computations, one might use other KLD estimators, such as those based on nearest neighbors (see,

*e.g.*, Pérez-Cruz 2008).

To provide a decision criterion for discriminating between neutral and selected markers, we calibrate the KLD using simulations from a predictive distribution based on the observed data set. The motivation here is to generate a set of loci equivalent to those that we observe in their levels of diversity and genetic variation. Note that generating a full posterior predictive distribution for KLD is unhelpful here because extreme *P*-values would indicate poor model fit rather than give evidence of selection *per se*. A key requirement in our calibration is that we generate a distribution of KLD under a null model in which the *δ _{j}* for each locus are drawn from their centering distribution. For this reason we cannot use a strictly neutral model (

*i.e.*, the beta-binomial parameterized by

*M*). A further assumption that we make, for otherwise the approach would be computationally intractable, is that the loci are exchangeable. Thus the KLD computed for each locus in our data set is compared to the simulated distribution of KLD among all the loci generated from the predictive distribution. As described in File S2, we parameterize our predictive distribution using the estimated posterior means for

_{i}*M*,

_{i}*π*,

_{j}*κ*, and

_{ij}*λ*. We show below that, although the method is somewhat conservative, the calibration based on these posterior means is generally good. Thus, in practice, for each data set and each analysis, we use the algorithm detailed in File S2 to generate pseudo-observed data (POD). We then analyze that POD, using the same MCMC parameters (number and length of pilot runs, burn-in, chain length, etc.) as for the analysis of the original data set. The KLD values computed for each simulated locus are then combined to obtain an empirical distribution. The quantiles of this empirical distribution are computed and are used to calibrate the KLD observed for each locus in the original data:

*e.g.*, the 99% quantile of the KLD distribution from the POD analysis provides a 1% threshold KLD value, which is then used as a decision criterion to discriminate between selection and neutrality.

## Materials and Methods

### Simulation study

We evaluated the performance of the method by simulating artificial data sets for fixed parameter values. The simulations were performed according to an island model with 50 demes, each consisting of *N* = 250 diploid individuals. Following Beaumont and Balding (2004), we simulated allele counts data from a Wright–Fisher model with migration and selection.

Initialization was achieved by means of a Pólya urn scheme simulation of the coalescent (Donnelly and Tavaré 1995) with scaled mutation parameter *θ* ≡ 4*Nμ* = 1. This amounts to considering selection acting on standing variation and makes this simulation model similar in spirit to the models considered by Innan and Kim (2004) and Przeworsky *et al.* (2005). At each generation (generations were discrete and nonoverlapping), each individual produced a random number of offspring drawn from a Poisson distribution with mean 100. Mutations then occurred at rate 2 × 10^{−5}. Dispersal of the (diploid) offspring then occurred at rate *m*, with dispersing individuals reaching necessarily a distinct deme. Selection of the offspring surviving to adulthood was then achieved, according to the scheme detailed below. A number *N* of adults was drawn from among the offspring, except if the number of offspring in a deme was <*N*, in which case all offspring survived. This life cycle was repeated for 25,000 generations. Samples were then taken, but only if the minimum allele frequency (the frequency of the least frequent allele) was >0.01. All loci were considered as independent, so that each multilocus data set was made of independent realizations of that process.

To account for the possibility of positive selection to local environmental conditions, the demes were arbitrarily provided with attributes (blue, red, or uncolored), which were assigned at random, independently for each selected locus. For positively selected loci, one allele *B* was considered as advantageous in a blue deme (and neutral in a red deme), while the other allele *R* was considered as advantageous in a red deme (and neutral in a blue deme). Both alleles were considered as neutral in uncolored demes. Therefore, *BB* homozygotes had fitness (1 + *s*) in blue demes and 1 in red and uncolored demes; *RR* homozygotes had fitness (1 + *s*) in red demes and 1 in blue and uncolored demes; *BR* heterozygotes had fitness 1 + *s*/2 in red and blue demes and 1 in uncolored demes. For loci under balancing selection, only the heterozygote genotypes were selected for in the blue and red demes, with relative fitness (1 + *s*). Homozygote genotypes were neutral (relative fitness 1) in all demes, as were the heterozygote genotypes in uncolored demes.

A total of 18 data sets were generated using the Wright–Fisher model described above (see Table 1). For each locus, 50 diploid individuals (100 genes) per deme were sampled. The details that distinguish the different data sets are given in Table 1. For example, for data sets 1–9, the samples were taken in 6 demes: 2 blue demes, 2 red demes, and 2 uncolored demes, and each simulated data set consisted in 10,000 SNPs, with 8000 neutral markers, 1000 positively selected loci, and 1000 loci under balancing selection. For data sets with selected loci (data sets 1–11), we assumed that 30% of all demes were blue demes, 30% were red demes, and 40% were uncolored demes. For data sets with neutral markers only (data sets 12–18), 50,000 SNPs were simulated and all demes were uncolored. Table 1 further gives the combinations of *F*_{ST} and *σ* values used for the simulations.

An additional set of four simulations was performed to test for the robustness to departure from the island model assumptions. To that end, a classical island model, a hierarchical island model (see, *e.g.*, Excoffier *et al.* 2009), a stepping-stone model in one dimension, and a pure drift model (whereby nine populations diverge sequentially) were simulated (see Figure S1). The sample characteristics (number of individuals, number of sampled demes) were the same for all simulations, and the model parameters were tuned to achieve an overall realized *F*_{ST} of ≈ 0.24.

For each of these 22 data sets the MCMC algorithm was run to sample from the joint posterior distribution of the model parameters. For each Markov chain, 100,000 updating steps were completed after 25 short pilot runs of 1000 iteration and a burn-in of 25,000 steps. Samples were collected for all the model parameters every 25 steps (thinning) to reduce autocorrelations, yielding 4000 observations.

The data sets 1–11 (see Table 1) were analyzed using BayeScan v. 2.1 (Foll and Gaggiotti 2008) with default option values (except for the MCMC parameters, see below). BayeScan is based on the Dirichlet-multinomial model for allele frequencies in an island model of population structure. At each locus, the distribution of allele frequency in each subpopulation depends on the allele frequency in the common pool of migrants and a subpopulation-specific parameter. In BayeScan, as in Beaumont and Balding’s model (Beaumont and Balding 2004), the logit of is decomposed additively into a locus-specific component (*α _{i}*) shared by all populations, and a population-specific component (

*β*) shared by all loci. Significantly positive (resp., negative) values of

_{j}*α*are taken as evidence of positive (resp., balancing) selection. BayeScan is based on a reversible-jump MCMC algorithm, which estimates the posterior probabilities of two alternative models, a purely neutral one (

_{i}*α*= 0), and one including selection (

_{i}*α*≠ 0). Each Markov chain was run for 100,000 updating steps, after 25 short pilot runs of 1000 iteration each and a burn-in of 25,000 steps. Samples were collected every 25 steps (thinning), yielding 4000 observations. For each output, we computed the Bayes factor (BF) for the model including selection (

_{i}*α*≠ 0), assuming prior odds of 10 for the neutral model (

_{i}*α*= 0).

_{i}The comparison of the relative efficiency of the two methods was achieved by means of receiver operating characteristic (ROC) analysis (see, *e.g.*, Fawcett 2006, for further information), using the R software environment for statistical computing, v. 3.0.1 (R Core Team 2013).

### Application to human data

We applied SelEstim on the Stanford HGDP–CEPH Human Genome Diversity Cell Line Panel (Cann *et al.* 2003) SNP genotyping data, which consist of genotypes at more than 650,000 SNPs (ftp://ftp.cephb.fr/hgdp_supp1). Because we were interested, for illustrative purpose, in measuring the genetic signature of selection in the lactase gene, we considered only the 53,765 SNPs mapping to chromosome 2 (HSA2) and incorporated the genotyping data of the two SNPs reported to be tightly associated with lactase persistence (−13910C → T and −22018G → A) as published by Bersaglieri *et al.* (2004). All the populations with <15 genotyped individuals were discarded from the data set. Furthermore, we removed seven populations from Oceania and Southern America, as well as three populations from Sub-Saharan Africa (the Biaka Pygmies, Mbuti Pygmies, and the Mandenka) that were absent from Bersaglieri *et al.*’s data set (Bersaglieri *et al*. 2004). The two Bantu populations (from Kenya and South Africa) were merged, as in Bersaglieri *et al.* (2004). Finally, we discarded all the SNPs with a minimum allele frequency (MAF) below 0.01 across the populations retained. The final data set consisted in 52,633 SNPs characterized in 23 populations from Africa and Eurasia.

## Results

### Evaluating the performance of **SelEstim:**

To asses the KLD calibration procedure, we first evaluate the method using simulated data sets made of neutral markers only (data sets 12–18; see Table 1). Figure 3A shows the false-positive rate, *i.e.*, the proportion of markers that are incorrectly classified as under selection, as a function of various KLD thresholds. For each data set, KLD thresholds based on the quantiles of the KLD distributions from the POD analyses were computed and used as a decision criterion for discriminating between neutral and selected markers. Figure 3B represents the false-positive rate as a function of the quantile probability (comprised between 0 and 1). Figure 3B shows that, for the data sets considered here, the false-positive rate at any KLD threshold is always less than the corresponding quantile probability. This suggests that our calibration procedure is “conservative,” at least for the range of *F*_{ST} values considered here (ranging from 0.01 to 0.25).

Figure 4 shows the performance of the method on data set 5 (see Table 1), which corresponds to *F*_{ST} = 0.10 and *σ* ≡ 2*Ns* = 25. Figure S2, Figure S3, Figure S4, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, Figure S10, and Figure S11 provide the same outputs for simulated data sets 1–4 and 6–11. Figure 4A shows that the distribution of KLD measures for positively selected loci departs from that of the neutral markers and the loci under balancing selection. This is essentially true for the data sets for which *F*_{ST} ≥ 0.05 and *σ* ≥ 25, as can be seen from Figure S2, Figure S3, Figure S4, Figure S5, Figure S6, Figure S7, Figure S8, and Figure S9. Not surprisingly, large KLD values correspond to large *F*_{ST} estimates (Figure 4B). This is so because for positively selected loci, one allele is selected for in blue populations and the other in red populations, which tends to exacerbate differentiation. A close examination of Figure 4C further shows that the false-positive rate (the proportion of neutral markers that exhibit a signature of selection) at any KLD threshold is always less than the corresponding KLD quantile probability (as in Figure 3B). This strengthens the notion that the KLD thresholds should not be interpreted as frequentist *P*-values. All else being equal, increasing the number of sampled demes improves the discrimination of neutral and positively selected markers (compare Figure 4, Figure S10, and Figure S11). Last, Figure 4 shows that SelEstim has some weak statistical power to identify loci under balancing selection. This arises because markers with very low divergence will tend to have posteriors for *δ* that are pushed further toward zero in comparison with the centering distribution and, hence, will also have a higher KLD. Yet, although the KLD for loci under balancing selection is indeed slightly higher, on average, than for neutral markers (Figure 4, Figure S2, Figure S3, Figure S4, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, Figure S10 and Figure S11), it remains very low. This result is not surprising, though, since the selection scheme considered in our model of inference accounts only for positive genic selection. Furthermore, previous simulation studies have also shown that, in the absence of an explicit model of selection, similar methods generally lack power to detect balancing selection (Beaumont and Balding 2004; Foll and Gaggiotti 2008; Riebler *et al.* 2008).

### Comparison with **BayeScan:**

Figure 4D shows the relationship between BayeScan BF and the KLD for each and every locus from data set 5. According to Jeffreys’ rule (Jeffreys 1961; Kass and Raftery 1995), a BF >10 (log_{10}(BF) ≥ 1) provides “strong” evidence for selection, and for a BF >100 (log_{10}(BF) ≥ 2) the evidence is “decisive.” However, it should be noted that, by definition, the BF depends on the prior odds for the neutral model. For this set of simulated data, there is a good agreement between BayeScan BF criterion and the KLD, since markers with high KLD show decisive evidence of selection from BayeScan BF criterion. For BayeScan analyses, whenever the posterior probability of the model including selection (*α _{i}* ≠ 0) was equal to 1, we arbitrarily defined the log

_{10}(posterior odd) as log

_{10}(3999.5/4000) − log

_{10}(0.5/4000) to account for the chain length (4000 iterations). By construction, the maximum value that the BF can take (log

_{10}(BF) = 4.857; see Figure 4D) therefore depends on the MCMC length and the prior odds for the neutral model (here equal to 10).

The ROC analysis (Figure 5) further shows a slight power gain of SelEstim over BayeScan (Foll and Gaggiotti 2008). In the ROC analysis, the proportion of false positives and true positives is computed for each possible value of the threshold that is used to classify a locus under selection (see, *e.g.*, Fawcett 2006, for further information). For SelEstim, the classifying variable was the KLD of the posterior distribution of the locus-specific coefficient of selection *δ _{j}* from its centering distribution, while in the case of BayeScan it was the Bayes factor. The ROC analysis yields a monotonic curve with no positives (true or false) at one end and all positives at the other. If a method has no classification power, the curve should be linear with slope 1, and the area under the ROC curve (AUC) should be 0.5. If a method has perfect classification power, the curve should perfectly superimpose to the left-hand and top sides of the unit square, and the AUC should be 1. Considering positively selected loci first, the area under the ROC curve for SelEstim is slightly larger, and closer to 1, than that obtained for BayeScan (see Figure 5). In particular SelEstim appears to have progressively improved relative performance with decreasing values of

*σ*. As for loci under balancing selection, SelEstim seems slightly better than BayeScan based on the ROC analysis, although both methods lack statistical power in these sets of simulated data.

The analysis of data sets 1–11 (see Table 1) took 12,175 sec on average per data set (SD = 6921) with BayeScan and 12,633 sec (SD = 6149) with SelEstim. SelEstim is therefore 3.85% slower than BayeScan, based on the same MCMC parameters (number and length of pilot runs, burn-in, chain length, etc.) and using the same number of processor cores. Note, however, that the KLD calibration procedure of SelEstim comes at the cost of up to a doubled computing time, due to the additional analysis of the POD.

### Robustness to model misspecification

We analyzed simulated data departing from the island model assumptions (see Figure S1). The rate of false positives detected by SelEstim at a given threshold was higher for data simulated from a hierarchical island model or a stepping-stone model, as compared to data simulated from a nonhierarchical model with the same overall *F*_{ST} (Figure S12). However, for all scenarios considered, the false-positive rate at any KLD threshold was less than or nearly equal to the corresponding quantile probability (Figure S12F), which suggests that our calibration procedure is “conservative,” even for strong departures from the island model assumptions. Furthermore, the rates of false positives for these scenarios were much lower as compared to BayeScan analyses of the same data (Figure S13).

### Inference of selection coefficients

For data set 5 (see Table 1), we examined the distributions of the posterior means of the parameters *κ _{ij}* (which indicate which allele is selected for), for the 1000 positively selected loci. Here, from the hypotheses of our simulation model,

*κ*= 0 indicates that the blue allele is selected for, and

_{ij}*κ*= 1 indicates that the red allele is selected for. Figure 6A shows the distributions of the posterior means of

_{ij}*κ*in each sampled deme. Consistent with our expectation, it is apparent from Figure 6A that the posterior means of

_{ij}*κ*in demes 1 and 2 (blue demes) are shifted toward zero and that the posterior means of

_{ij}*κ*in demes 3 and 4 (red demes) are shifted toward one. Alleles of the right color are therefore selected for in the right deme. It is also reassuring to see that in demes 5 and 6 (uncolored demes), the posterior means of

_{ij}*κ*are centered around 0.5, which is consistent with the fact that neither allele should be selected for in these demes.

_{ij}For the same 1000 positively selected loci, we further examined the posterior means of the scaled coefficients of selection *σ _{ij}* ≡ 2

*N*, conditionally on

_{i}s_{ij}*κ*. By doing so, we estimate the coefficient of selection associated with the allele being effectively targeted by selection. Figure 6B shows that the posterior means of

_{ij}*f*(

*σ*|

_{ij}*κ*= 0) in blue demes and the posterior means of

_{ij}*f*(

*σ*|

_{ij}*κ*= 1) in red demes are very close to the simulated values (

_{ij}*σ*≡ 2

*Ns*=25 in data set 5; see Table 1). By contrast, in uncolored demes, the posterior means of

*σ*that were not conditioned upon

_{ij}*κ*, are much lower and closer to the prior distribution of the hyperparameter

_{ij}*λ*(which represents the genome-wide effect of selection over all demes and loci). Figure S14, Figure S15, Figure S16, Figure S17, and Figure S18 reproduce the same outputs as in Figure 6 for data sets 1–4 and 6–11. The posterior means of the scaled coefficients of selection

*σ*conditionally on

_{ij}*κ*are very close to the simulated values, for

_{ij}*F*

_{ST}≥ 0.05 and

*σ*≥ 25 (data sets 2–3, 5–6, and 8–9) and

*F*

_{ST}= 0.2 and

*σ*= 10 (data set 7). All else being equal, increasing the number of sampled demes improves the estimation of the scaled coefficients of selection

*σ*conditionally on

_{ij}*κ*(compare Figures 6 and Figure S18).

_{ij}Last, we examined the distributions of the posterior means of *κ _{ij}* for the 8000 neutral markers in data set 5. Figure 7A shows that the posterior means of

*κ*, which do not depend on the color of the sampled demes, are all centered around 0.5. This result is consistent with the fact that neither allele should be selected for in these demes. Furthermore, the distributions of the posterior means of

_{ij}*κ*for neutral markers are narrower, as compared to the posterior means of

_{ij}*κ*for selected loci in uncolored demes (see Figure 6A). The distributions of the posterior means of

_{ij}*κ*for neutral markers are therefore closer to their prior distribution. Yet, the distributions are still wider than expected from the prior distribution, since the mean over 4000 independent samples (which corresponds to the MCMC length) from a Ber(0.5) distribution should be approximately normally distributed with mean 0.5 and standard deviation 0.008. This extra variance may be due to the hierarchical structure of the model, which produces a correlation between the parameters. In addition, Figure S19A shows that, in the absence of selection (data set 16; see Table 1), the distribution of the posterior means of

_{ij}*κ*is centered around 0.5 and narrower as compared to data sets that include positively selected loci (compare with Figure 7A). Therefore, the higher variance of the distributions of

_{ij}*κ*for selected loci in uncolored demes (as compared to neutral markers) certainly stems from the influence of selection occurring for the same loci in blue and red demes, through the prior on the locus-specific hyperparameter

_{ij}*δ*. The posterior means of

_{j}*σ*for neutral markers (unconditionally upon

_{ij}*κ*) are very low and close to the posterior mean of the hyperparameter

_{ij}*λ*. In the absence of selection (data set 16), the posterior means of

*σ*for neutral markers (unconditionally upon

_{ij}*κ*) are also very low and not different from the prior mean of the hyperparameter

_{ij}*λ*(Figure S19B).

Implicitly, Figure 6 and Figure 7 demonstrate that SelEstim is able to give accurate measures of the scaled coefficient of selection at one locus in different demes and therefore to provide evidence of local adaptation. This paves the way for the inference of the distribution of selection strength across populations in a landscape as is illustrated in the next section.

Last, we analyzed a set of 11 simulations using the same parameters as for data set 5 (see Table 1), but varying the proportion of selected loci from 10 to 5000 of 10,000 markers (hence, from 0.1 to 50%). Interestingly, we found a strong correlation between the posterior mean of the genome-wide coefficient of selection *λ* and the number of positively selected loci (see Figure S20). However, as the number of positively selected loci increases, the performance of SelEstim weakens (>20% of selected markers), although less markedly than BayeScan (see Figure S21).

### Application to human data

We ran three independent MCMC analyses on a subset of the Stanford HGDP–CEPH Human Genome Diversity Cell Line Panel (Cann *et al.* 2003) SNP Genotyping Data. The data consisted in 52,631 SNPs from the HGDP–CEPH data, and two SNPs (−13910C → T and −22018G → A) known to be tightly associated with lactase persistence (Bersaglieri *et al.* 2004), genotyped in 23 populations from Africa and Eurasia. After 25 pilot runs of 1000 iterations, each MCMC was run for 100,000 updating steps, after a burn-in period of 25,000 steps. Samples were collected from the Markov chains for all the model parameters every 25 steps (thinning) to reduce autocorrelations, yielding 4000 samples for each parameter.

Convergence was assessed by computing the multivariate extension of Gelman–Rubin’s diagnostic (Brooks and Gelman 1998) on the three independent Markov chains. Gelman–Rubin’s diagnostic is based on the computation of the ratio of the pooled-chains variance over the within-chain variance and was calculated using the coda package, v. 0.16-1, (Plummer *et al.* 2006) as implemented for R (R Core Team 2013). Gelman–Rubin’s diagnostic was equal to 1.09 for the hyperparameter *λ* and to 1.07 for the parameters *M _{i}*, which indicates that the chains converge satisfactorily to the target distribution. The following analyses are based on the outputs from one of the three Markov chains.

To identify the genomic regions showing the strongest signatures of selection in the HGDP–CEPH data, 1-Mb windows were constructed for each marker by including all markers that were ≤500 kb from that marker. The average number of markers per window was ∼248. Outstanding regions were then defined as the windows containing at least three SNPs above the critical KLD value of 3.924 (corresponding to the 99.9% quantile of the KLD distribution from the POD analysis). Figure 8A shows the distribution of the KLD for each SNP along HSA2. The two SNPs that are tightly associated with lactase persistence (−13910C → T and −22018G → A) are highlighted. These two SNPs have the two largest KLD values. Furthermore, the nine SNPs with the largest KLD values were located 3.7 kb and 1.0 Mb upstream of the *LCT* gene, at <805.2 kb from −13910C → T and <813.4 kb from −22018G → A. Figure 8B represents the distribution of the posterior means of the locus-specific selection parameter *δ _{j}*, along HSA2. This figure therefore represents the variation of the strength of selection along the chromosome and depicts a very strong signal of positive selection in the vicinity of the

*LCT*gene (located from base pair 136,545,414 to 136,594,749), which encodes for the enzyme lactase–phlorizin hydrolase and is associated with adult-type hypolactasia. In addition to the

*LCT*region, we found four other outstanding genomic regions, which are indicated by arrows in Figure 8B. The closest gene from each region was determined from the UCSC Genome Browser (http://genome.ucsc.edu/), using the Genome Reference Consortium GRCh37 assembly (hg19). Table 2 provides the list of these genes, along with their functions.

Figure 9 shows the distribution of the scaled coefficients of selection *σ _{ij}* (conditionally on

*κ*indicating allele −13910C → T to be targeted by selection) across African and Eurasian populations. The map from Figure 9 was extrapolated by kriging using the R package fields (Fields Development Core Team 2006), v. 6.8. It is obvious from Figure 9 that the intensity of selection is very strong in Europe and around the Indus valley and attains similar levels in both geographic regions.

_{ij}## Discussion

We developed a hierarchical Bayesian model that considers explicitly the effect of genic selection on the distribution of allele frequencies at SNP loci. SelEstim extends previous methods based on the Dirichlet-multinomial distribution of allele frequencies (which reduces to the beta-binomial distribution for SNP data) that arises as the diffusion approximation of genetic drift in the migration-drift equilibrium island model (see, *e.g.*, Beaumont and Balding 2004; Riebler *et al.* 2008; Foll and Gaggiotti 2008; Guo *et al.* 2009; Gautier *et al.* 2010). The beta-binomial model has been argued to be robust to the vagaries of demographic history (Beaumont and Nichols 1996; Beaumont 2005) because in many situations, the genealogy of genes in a metapopulation divides into a scattering phase, which represents the recent genealogy of each deme, and a collecting phase, which represents the ancestral genealogy of the whole metapopulation (Nordborg 1997; Wakeley 2004). With this separation of timescales, there are no mutations in the scattering phase and the distribution of gene frequencies in each deme depends upon the frequencies in the pool of migrants (*i.e.*, the collecting phase) and the deme-specific *F*_{ST}. However, the assumption that each deme receives migrants from a unique migrant pool may not hold if populations share a history of successive divergences (Gaggiotti and Foll 2010). In that case indeed, gene frequencies may be correlated among closely related populations, which violates the assumption that populations are independent (Robertson 1975; Excoffier *et al.* 2009; Bonhomme *et al.* 2010; Gompert and Buerkle 2011). SelEstim should therefore be used with caution on populations that are known to be hierarchically structured (but see Figure S12). To conclude on a more positive note, we would argue that although violations from the island model assumptions certainly inflate the overall variance of the *M _{i}* parameters, it should not generate artificially correlated signals across closely linked SNPs as observed,

*e.g.*, in Figure 8. From a practical point of view, using sliding windows to identify genomic regions of interest may therefore constitute a valuable approach (see,

*e.g.*, Gautier

*et al.*2009). Accounting explicitly for the correlation of gene frequencies across populations due to shared history and gene flow might be achieved by considering the multivariate generalization of the Gaussian approximation of the gene frequency distribution (Coop

*et al.*2010; Günther and Coop 2013), which was recently extended to infer population splits and mixtures (Pickrell and Pritchard 2012). A Gaussian approximation for the distribution of allele frequencies (as suggested by Nicholson

*et al.*2002) can be justified whenever the deterministic equilibrium is located away from the boundaries (fixation of any one of the alleles). However, it is a poor approximation when the deterministic equilibrium is close to one of the boundaries (Barton and Rouhani 1987; Gautier and Vitalis 2013).

Our model introduces two major improvements over the methods based on the Dirichlet-multinomial or the beta-binomial distribution of gene frequencies. First, instead of being conceived as tests of departure from a neutral model (see, *e.g.*, Gautier *et al.* 2010), SelEstim incorporates an explicit selection model, which allows selection strength to be inferred among a set of markers. Second, SelEstim provides the distribution of selection strength across populations, which allows identifying the local population(s) where selection is acting. This is so because the hierarchical structure of our model improves the estimation of locus- and population-specific coefficients of selection (*σ _{ij}*) by borrowing strength across multiple populations.

### Detecting selection among a set of markers

At a given SNP, *j*, the locus- and population-specific parameters of selection *σ _{ij}* depend upon a locus-specific hyperparameter

*δ*that gives the population-wide effect of selection at a particular locus. It is therefore natural to use the posterior distribution of the hyperparameters

_{j}*δ*as a means to discriminate between neutral and selected markers. We indeed expect the posterior density of

_{j}*δ*to be shifted toward zero if the

_{j}*j*th marker is neutral and toward positive values if the

*j*th marker is targeted by selection. To operate this classification, it would have been possible to follow Beaumont and Balding (2004) and adopt a simple informal criterion assuming that

*δ*is significantly different from zero at some critical level

_{j}*P*whenever its equal-tailed 100(1 −

*P*)% credible interval excludes zero. Yet, this approach would neglect the genome-wide effect of selection, which in our model is driven by the hyperparameter

*λ*. We therefore proposed comparing the posterior distributions of the locus-specific coefficients of selection

*δ*with the centering distribution derived from the hyperdistribution with parameter

_{j}*λ*. To that end, we used the KLD to measure the divergence between these two distributions. We calibrated this measure using simulations from a predictive distribution based on the observed data set. We found that the false-positive rate at any KLD threshold is always less than the corresponding quantile probability (Figure 3, Figure 4, Figure S1, Figure S2, Figure S3, Figure S4, Figure S5, Figure S6, Figure S7, Figure S8, Figure S9, and Figure S10), which suggests that such calibration is conservative, even for strong departures from the island model assumptions (Figure S12F).

McCulloch (1989), Peng and Dey (1995), and Guo *et al.* (2009) proposed an alternative calibration, based on the following argument: consider flipping a “fair” coin with equal probability 0.5 for heads and tails *vs.* flipping a biased coin with probability *ν* for heads. Then the KLD between these two Bernoulli distributions denoted by Ber(0.5) and Ber(*ν*), respectively, can be computed and used as a threshold for any value of *ν*. In our case, however, we must divide *ν* in half, since we look for only unusually large KLD values. Therefore, the KLD threshold can be defined as KLD[Ber(0.5), Ber(*ν*)] = −log[*ν*(2 − *ν*)]/2. For example, the KLD between two Bernoulli distributions corresponding to flipping a fair coin and a biased coin that gives heads or tails with probability 0.05 (resp. 0.01) is equal to 1.164 (resp. 1.959). However, we found that this calibration procedure was overly conservative (see Table S1 and Table S2).

Last, we used ROC analyses, as in Riebler *et al.* (2008), to compare our model with BayeScan (Foll and Gaggiotti 2008). We found that SelEstim performed slightly better than BayeScan (Figure 5). Since BayeScan was shown to outperform Beaumont and Balding’s approach (Beaumont and Balding 2004), as well as some other popular moment-based methods using dominant markers (Pérez-Figueroa *et al.* 2010), we may therefore conclude that SelEstim represents an appreciable improvement to the population genomicist’s toolbox.

As an alternative to the KLD, it could have been possible to implement Bayesian model selection using, *e.g.*, Bayes factors to discriminate between neutral and selected loci. Foll and Gaggiotti (2008) proposed using reversible-jump MCMC to estimate, for each marker, the posterior probabilities of two alternative models: a purely neutral one and one including selection. Riebler *et al.* (2008) also proposed an elegant reparameterization of Beaumont and Balding’s model (Beaumont and Balding 2004), by introducing a Bernoulli-distributed auxiliary variable to indicate whether a locus is targeted by selection. This parameterization was later shown to facilitate the computation of Bayes factors (Gautier *et al.* 2009). Both approaches (reversible-jump MCMC and auxiliary variable) are actually straightforward to implement in our hierarchical-Bayesian model (not shown). Nevertheless, the KLD has the practical advantage that, provided the MCMC sampler converges and a large enough sample is drawn from the *δ _{j}* and the

*λ*posterior distributions, its computation does not depend upon the length of the MCMC. As an illustration example, we ran a BayeScan analysis of the 52,633 SNPs from the HGDP–CEPH data, using the same MCMC parameters (number and length of pilot runs, burn-in, chain length, etc.) as with SelEstim, assuming prior odds of 1000 for the neutral model (Figure S22A). It is clear from this figure that a substantial number of markers for which the Kullback–Leibler divergence provides no evidence of selection have log

_{10}(BF) ≥ 2 (Figure S22, B and C). This number increases with decreasing prior odds (not shown). Furthermore, since the maximum value that the BF can take is bounded by the MCMC length, we may observe a “saturation” effect with many of the outliers sharing the same evidence of selection (see Figure S22A). From a practical point of view, this may prevent the visual identification of genomic regions potentially targeted by selection (as,

*e.g.*, in Figure 8), unless very long MCMC are performed (to achieve an effective sample size of the order of the number of markers). In addition, we found that the outputs of BayeScan vary with the prior odds, which depend on the user’s prior belief for the proportion of presumably neutral SNPs. Our results therefore argue in favor of using KLD in empirical studies since it allows ranking the SNPs in order of the divergence between locus-specific and genome-wide selection strength, which indicates the degree of evidence that a locus is under selection. However, we concur with Coop

*et al.*(2010) that making statements about the statistical significance of outlier loci might be hazardous. In particular, we refrained from defining

*P*-values from KLD measures.

### Inferring selection strength across populations

In an early analysis of population differentiation using the HapMap data set, Weir *et al.* (2005) already showed the utility of estimating population-specific *F*_{ST} values (Weir and Hill 2002). In particular, concentrating their analyses on chromosome 2, they did not find any outstanding peak of population average *F*_{ST} around the *LCT* gene, although there was a clear elevation of the population-specific *F*_{ST} values for Caucasians of European descent and European Americans. Yet, in Weir *et al.*’s study (Weir *et al.* 2005), the characterization of “exceptional regions” was based on the greatest difference between population-specific *F*_{ST} values (averaged over 5-Mb windows) being larger than 3 SD, which does not provide a definitive statistical criterion to decide which loci are outliers of the empirical, genome-wide distribution of *F*_{ST}.

Like Beaumont and Balding (2004), Riebler *et al.* (2008), Foll and Gaggiotti (2008), and Guo *et al.* (2009), who considered population-specific effects on *F*_{ST}, we considered in our model that the distribution of allele frequency depends upon population-specific parameters (*M _{i}*). Since we defined a parameter that indicates which allele is selected for, the selected allele need not to be the same in all the sampled demes. Furthermore, the strength of selection need not to be the same in all demes. SelEstim therefore accounts for situations where selection is acting in some populations, but not all, possibly in opposite direction (with alternative alleles being selected for in different environments). It is therefore particularly relevant to detect the signatures of local adaptation in subdivided populations.

Not surprisingly, we found that SelEstim has weak statistical power to identify loci under balancing selection (see Figure 5). Since our genic selection model allows for only positive selection, this was somewhat expected, but by using a centering distribution we are able, in principle, to identify loci with support for unusually low values of *δ*. Beaumont and Balding (2004) concluded from simulations that their method could not easily identify loci under balancing selection, even for very strong selection. Although Foll and Gaggiotti (2008) showed that microsatellites could be used to detect balancing selection, especially with data sets containing a large number of sampled populations, they needed 10 populations with SNPs to achieve the same rate of detection (Foll and Gaggiotti 2008). In principle, however, it should be possible to scan for SNPs targeted by balancing selection using a modified version of our model, in particular Equation 2, that would account for overdominance with population-specific selection pressures (see, *e.g.*, Equation 13.60 in Wright 1969, p. 371). This strategy could be valuable for improving statistical power to identify loci under balancing selection.

Because our model accounts explicitly for positive selection, it cannot only be used to detect the genomic signatures of selection, but also to measure the strength of selection along the genome. As mentioned above, contrary to previous approaches that approximated selection as a locus-specific effect in a logistic regression model (Beaumont and Balding 2004) or a reduction in effective migration rate (see, *e.g.*, Bazin *et al.* 2010), we introduced explicitly a scaled coefficient of selection *σ _{ij}* ≡ 2

*N*for locus

_{i}s_{ij}*j*in deme

*i*, where

*s*represents the relative gain in fitness brought by a positively selected allele. We found that the posterior means of the scaled coefficients of selection

_{ij}*σ*(conditionally on

_{ij}*κ*) were close to the simulated value for positively selected loci, although slightly overestimated (Figure 6, Figure S14, Figure S15, Figure S16, Figure S17, and Figure S18). We also found that the variation of

_{ij}*σ*across populations with different selection regimes was remarkably well inferred, with selected loci exhibiting large coefficients of selection in the colored demes and small coefficients of selection in uncolored demes (Figure 6, Figure S14, Figure S15, Figure S16, Figure S17, and Figure S18).

_{ij}The strong correlation between the posterior mean of the genome-wide coefficient of selection *λ* and the number of positively selected loci (Figure S20) would tend to suggest that the parameter *λ* provides some information about the extent of selection acting on the genome. This must be nuanced, however, at least for two reasons. First, as the number of positively selected loci increases, the performance of SelEstim weakens (see Figure S21). Second, we have observed that the parameter *λ* also depends on demography, and particularly on departures from the island model assumptions (not shown). This identifiability problem therefore prevents the comparison of *λ* estimates (to infer the overall effect of selection) across species with different population structures. We note that this identifiability problem is somewhat avoided in BayeScan with the Gaussian prior (zero mean and standard deviation of 1) put on the locus-specific component *α _{i}* (which, therefore, provides no information whatsoever on the extent of selection acting on the genome).

### Application example at the *LCT* gene

To illustrate how the inference of selection strength may provide new insights into the characterization of local adaptation, we investigated the well-studied and clear-cut example of the evolution of lactase persistence in humans (see Gerbault *et al.* 2011, for a review). The region around the *LCT* gene that allows lactose tolerance to persist into adulthood is indeed a very-well-known example of selection in humans (Sabeti *et al.* 2006). The first causative polymorphism described was the −13910C → T mutation (Enattah *et al.* 2002), which lies in the *cis*-acting regulatory element located in the 13th intron of a neighboring gene, *MCM6*. Although this single mutation of purported western Eurasian origin accounts for much of observed lactase persistence outside Africa, multiple independent mutations in the same region upstream of the *LCT* gene have been associated with this trait in pastoralists from Saudi Arabia (Enattah *et al.* 2008) and Africa (Tishkoff *et al.* 2007). The lactase persistence allele at the *LCT* locus lies on a haplotype that is common in Europeans but that extends largely undisrupted for >1 Mb, much farther than is typical for an allele of that frequency (Bersaglieri *et al.* 2004). More recently, Romero *et al.* (2012) found that the −13910C → T mutation also explains a substantial proportion of lactase persistence in the Indian subcontinent. Most interestingly, they showed that the −13910C → T mutation in India is identical by descent to the European allele and is associated with the same extended haplotype in both populations, which strongly suggests that the origin of the −13910C → T mutation is shared in Europe and India. These results are consistent with the high levels of present-day milk consumption in India and with archeological and genetic evidence for the independent domestication of cattle in the Indus valley ca. 7000 years ago (Romero *et al.* 2012).

In agreement with these studies, our analyses pointed to a very strong signal of positive selection between 3.7 kb and 1.0 Mb upstream of the *LCT* gene (Figure 8). The strongest evidence of selection (in terms of KLD) was found for the two SNPs that are tightly associated with lactase persistence (−13910C → T and −22018G → A). Building on the fact that our model is able to give accurate measures of the scaled coefficient of selection at each locus in different demes, we further examined the distribution of the strength of positive selection at the −13910C → T SNP across the 23 populations analyzed. We found the strongest selection coefficients in Europe and in the Indus Valley (Figure 9), which matches the interpolated map of lactase persistence phenotype frequencies in the Old World (Itan *et al.* 2010). More precisely, we found that the coefficients of selection (*σ* ≡ 2*Ns*) at the −13910T allele ranged from 8.73 (Sardinians) to 101.66 (Orcadians) in Europe and from 4.94 (Kalash) to 77.50 (Balochi) in Central/South Asia. There have been previous attempts to measure the strength of selection acting at the *LCT* gene, although most of them relied on strong assumptions on the demographic and adaptive history of the studied populations. For example, Aoki (1986) predicted that a selection coefficient *s* > 5% would be necessary to explain the observed allele frequency of the −13910T allele, assuming that this mutation appeared 6000 years ago in a population of effective size 500, which would give *σ* ≡ 2*Ns* = 50. Bersaglieri *et al.* (2004) estimated the coefficient of selection *s* to be 1–15% for a new mutation arising in a population of effective size of 500–5000. More recently, Tishkoff *et al.* (2007) estimated selection intensity by matching simulated data under a coalescent framework to the observed centimorgan span and the observed frequency of the allele targeted by selection. They found extremely recent and strong positive selection in many African populations (*σ* ≡ 2*Ns* ranging from 800 to 1940 assuming an effective population size *N* of 10,000). Modeling a geographical structuring of selection pressure by latitude, Gerbault *et al.* (2009) found selection coefficients in the range between 0.8 and 1.8% (Gerbault *et al.* 2011), also assuming a carrying capacity of 10,000. However, assuming an effective population size *N* of 10,000 may largely overestimate *σ* ≡ 2*Ns* (see Tenesa *et al.* 2007, for more accurate estimates of effective size based on measures of linkage disequilibrium). Last, using a spatially explicit model and approximate Bayesian computation (Beaumont *et al.* 2002), Itan *et al.* (2009) estimated coefficients of selection to lie in the range of 5.2–15.9%. The difficulty in comparing these values is that strong hypotheses about the effective population size need to be made. It is clear from the stationary density of the diffusion process in Equation 2 that the two parameters *s* and *N* are not identifiable. Estimating *s* therefore requires informative priors on *N*. Furthermore, the population size considered in our model is the local effective size of a deme, not the effective size of the total population. Therefore, considering the scaled coefficient of selection (*σ* ≡ 2*Ns*) might be more appropriate for interpreting the variation of the strength of selection exerted at different loci or at one locus in different populations.

### Perspectives

SelEstim provides a new tool with which to detect signatures of selection from genome-wide scan studies and, perhaps most importantly, to infer the intensity of selection across loci and populations. However, as most *F*_{ST}-based methods aimed at looking for locus-specific effects on *F*_{ST} estimates, SelEstim assumes that molecular markers are independent from each other. There are few exceptions, though. For example Guo *et al.* (2009) introduced a conditional autoregressive model to incorporate the local correlation among SNPs. Gompert and Buerkle (2011) proposed an extension of the models developed by Beaumont and Balding (2004), Riebler *et al.* (2008), and Foll and Gaggiotti (2008), which incorporates genetic distances among haplotypes (*φ*-statistics; see Excoffier *et al.* 1992) in measures of genetic differentiation. More recently, Fariello *et al.* (2013) developed a haplotype-based method, which uses a multipoint linkage disequilibrium model (Scheet and Stephens 2006) that regroups individual chromosomes into local haplotype clusters. The reconstructed haplotypes are then used to measure differentiation between populations (see also Browning and Weir 2010). Handling conditional dependencies of markers along the genome would therefore be an essential step forward in future developments of SelEstim. In the meantime, we recommend potential users to view this method as a first step toward identify genomic regions of interest, which should then be characterized more specifically in further studies.

## Acknowledgments

We acknowledge two anonymous reviewers for helpful comments, which led to a significantly improved version of this manuscript. We also warmly thank Pierre Pudlo and Christian P. Robert for insightful discussions on KLD calibration. This work was supported by a Biotechnology and Biological Sciences Research Council grant (BBS/B/12776) awarded to M.A.B. and K.J.D. M.A.B. was supported by a Natural Research Council Advanced Fellowship (NER/J/S/2001/00792). R.V. was supported by the French National Research Agency (ANR) programme NUTGENEVOL 07-BLAN-0064. Both R.V. and M.G. were supported by the ANR programme EMILE 09-BLAN-0145-01 and by the Institut National de la Recherche Agronomique, through the support of the starting group “IG-GiPop” (2010–2012).

## Footnotes

*Communicating editor: W. Stephan*

- Received May 8, 2013.
- Accepted December 9, 2013.

- Copyright © 2014 by the Genetics Society of America