## Abstract

We address the problem of finding evidence of natural selection from genetic data, accounting for the confounding effects of demographic history. In the absence of natural selection, gene genealogies should all be sampled from the same underlying distribution, often approximated by a coalescent model. Selection at a particular locus will lead to a modified genealogy, and this motivates a number of recent approaches for detecting the effects of natural selection in the genome as “outliers” under some models. The demographic history of a population affects the sampling distribution of genealogies, and therefore the observed genotypes and the classification of outliers. Since we cannot see genealogies directly, we have to infer them from the observed data under some model of mutation and demography. Thus the accuracy of an outlier-based approach depends to a greater or a lesser extent on the uncertainty about the demographic and mutational model. A natural modeling framework for this type of problem is provided by Bayesian hierarchical models, in which parameters, such as mutation rates and selection coefficients, are allowed to vary across loci. It has proved quite difficult computationally to implement fully probabilistic genealogical models with complex demographies, and this has motivated the development of approximations such as approximate Bayesian computation (ABC). In ABC the data are compressed into summary statistics, and computation of the likelihood function is replaced by simulation of data under the model. In a hierarchical setting one may be interested both in hyperparameters and parameters, and there may be very many of the latter—for example, in a genetic model, these may be parameters describing each of many loci or populations. This poses a problem for ABC in that one then requires summary statistics for each locus, which, if used naively, leads to a consequent difficulty in conditional density estimation. We develop a general method for applying ABC to Bayesian hierarchical models, and we apply it to detect microsatellite loci influenced by local selection. We demonstrate using receiver operating characteristic (ROC) analysis that this approach has comparable performance to a full-likelihood method and outperforms it when mutation rates are variable across loci.

THE study of the effects of natural selection at the genomic level has the potential to uncover hidden aspects of the causal pathways that relate genotype to phenotype and the environment (Sabeti *et al*. 2007). A challenge for any such research program is to distinguish signals of selection from those of a myriad other processes (McVean and Spencer 2006), particularly those related to the demographic history of the population. The study of individual candidate loci or regions of the genome, in isolation, and without regard to the (generally unknown) demographic history of the population is unlikely to be fruitful because selection can generally be mimicked by demographic processes (Teshima *et al*. 2006), and, indeed, this forms the basis of many methods of simulating loci under selection (Spencer and Coop 2004). As a consequence most recent studies concentrate on large-scale surveys of genomic regions, looking for genes that are discrepant (Teshima *et al*. 2006). Within this framework there are two broad strands. One set of approaches is based around the idea of a “selective sweep” in which an allele increases in frequency, as a result of either a single novel mutation or a change in environment, leading to reduced diversity at linked sites (Kaplan *et al*. 1989). Another modeling framework is centered around the idea of “local selection” (Charlesworth *et al.* 1997) in which alternative alleles are favored in different environments. Unlike the selective-sweep scenario where the time of onset of the sweep is an important parameter, the local selection framework is essentially ahistorical: the allele frequencies within a deme are typically modeled by assuming migration–selection–drift balance (Wright 1931; Petry 1983).

It is unclear at this stage which of the two forms of selection are most common. Certainly, the selective sweep scenario is commonly studied and this is unsurprising because the two most intensely surveyed species, humans and *Drosophila melanogaster*, both have demographic histories, the invasion of novel environments, that are conducive to selective sweeps. The model of local selection envisions a rather static view of the world, whereas the commonly held perception is of constantly changing environments and population restructuring. As always, probably the truth lies in between these two extremes, and the aim of this study is to continue the development of methods for detecting local selection, while recognizing the utility of the selective sweep paradigm under many evolutionary scenarios.

Current methods of detecting local selection from gene-frequency information tend to be based around *F*_{ST}, which has a variety of interpretations (Weir and Hill 2002; Balding 2003). Here, it is defined to be the probability that two gene copies share a common ancestor within the deme in which they are sampled without either of their lineages migrating (Crow and Kimura 1970, p. 105; Vitalis *et al*. 2001). Methods based on *F*_{ST} have a long history (Cavalli-Sforza 1966; Lewontin and Krakauer 1973). The early methods were based on moment estimators of *F*_{ST}, as in the above two studies and also, for example, in Beaumont and Nichols (1996), Vitalis *et al*. (2001), and Weir *et al*. (2005). More recently, likelihood-based approaches have been developed (Beaumont and Balding 2004; Riebler *et al*. 2008; Foll and Gaggiotti 2008; Guo *et al*. 2009). These latter approaches are based on a theory for the sampling distribution of genes in the infinite-island or continent-island model of structured populations. The same distribution, multinomial Dirichlet in form (a.k.a. Pólya distribution or Dirichlet compound multinomial), can be derived either from the diffusion theory of Sewall Wright (Wright 1931; Rannala and Hartigan 1996) or from coalescent theory (Balding and Nichols 1994). The key insight that lies behind the use the multinomial-Dirichlet distribution in the detection of local selection is the following result. Marker loci, linked with recombination rate *r* to loci in which locally deleterious alleles segregate with selection coefficient *s*, have an effectively reduced migration rate, *m*, approximated in Petry (1983) as *m*′ = *m* × *r*/(*s* + *r*) (Barton and Bengtsson 1986; Charlesworth *et al*. 1997). Under the structured coalescent with constant deme size and migration rates, *F*_{ST} = 1/(1 + 2*Nm*), and hence under local selection there is expected to be heterogeneity in the estimates of *F*_{ST} among loci.

The multinomial-Dirichlet framework has the advantage of having a simple likelihood function that is rapidly computed. If the mutation rate is low enough and the number of demes high enough, then we can justify this approach by the “many demes” approximation of Wakeley (1998). Often this may not be an adequate approximation, and the method then has the disadvantage that it assumes a simplified demographic history and cannot easily take into account recombination and mutational processes. Given that likelihoods in more general frameworks are computationally intractable for large numbers of loci and recombination, it is tempting to consider using a likelihood-free approach (Pritchard *et al.* 1999; Beaumont *et al*. 2002; Marjoram *et al*. 2003; Becquet and Przeworski 2007). Typically these methods require that summary statistics are computed in a large number of Monte Carlo simulations and some match is made between simulated and observed summary statistics. A problem arises in that information on whether there is selection comes from considering all the loci jointly, but to decide whether a specific locus is under selection we also need information on that particular locus. Thus a naive approach, given *L* loci, would be to have *L* sets of summary statistics. This could lead to thousands of summary statistics for an analysis. The probability of getting a close match for all *L* simulated loci will be vanishingly small, and consequently such an approach is unlikely to succeed.

In this article we develop a general method for efficiently computing solutions in hierarchical Bayesian models using a likelihood-free approach. We formulate a hierarchical Bayesian model for identifying loci that are subject to local selection and apply our technique, which is relatively efficient and easy to parallelize on a computing cluster. We demonstrate through the use of extensive comparisons that the method approaches the accuracy of the likelihood-based method of Beaumont and Balding (2004) in situations where the assumptions of the latter hold and exceeds it when there is variability in mutation rate among genetic markers. We then apply the method to microsatellite data from chimpanzees.

## A HIERARCHICAL APPROACH TO LIKELIHOOD-FREE INFERENCE

The likelihood-free approach implemented in this study uses the regression-based method of conditional density estimation introduced in Beaumont *et al*. (2002). The approximate Bayesian computation (ABC) technique is currently undergoing quite widespread development, and a number of different approaches have been advocated. While recognizing that these recent developments should supersede the method of Beaumont *et al*. (2002), we justify the use of the regression method on the grounds that (a) as shown later, it performs well in a comparison with a full-likelihood method and (b) it has been widely used and its advantages and pitfalls are well understood. However, we note that the algorithms described in this article are particularly amenable to sequential ABC approaches (Sisson *et al*. 2007; Beaumont *et al*. 2009; Toni *et al*. 2009) and improved conditional density estimation (Blum and François 2009).

Briefly, we assume that we have measured a *d*-dimensional vector of summary statistics *S*(*x*) from a data set. Here we make the distinction between the observed data set *x* and the random variable, *X*, generated by simulation. We have *N* random draws of a (scalar) parameter Φ_{i} (*i* = 1, …, *N*) and corresponding summary statistics *S*(*X _{i}*) (

*i*= 1, …,

*N*) simulated from the joint distribution of parameters and summary statistics

*P*(

*S*(

*X*), Φ). (The model may have any number of parameters, which can be considered jointly, but the regression adjustment described here is applied to one parameter at a time.) We scale

*S*(

*x*) and

*S*(

*X*) so that each summary statistic in

*S*(·) has unit variance. We assume a linear model in whichwhere the

*e*

_{i}are drawn from a distribution common to all

*X*, with a mean of zero. We use least squares to minimize(1)where, assuming the model above,with Epanechnikov kernel(2)Given the estimates and , we can approximate posterior densities by using the assumption (above) that the distribution of errors is constant in the region where

_{i}*K*

_{ε}(‖

*S*(

*X*) –

_{i}*S*(

*x*)‖) is positive and hence adjust the parameter values as(3)(Beaumont

*et al*. 2002). The posterior density for Φ can be approximated using some density estimation method, and in this article the local-likelihood method of Loader (1996) is used, implemented in Locfit under R, weighting the points with

*K*

_{ε}(‖

*S*(

*X*) –

_{i}*S*(

*x*)‖) as above. It should be noted that the “tolerance” of the method, as discussed in this article, is not measured directly in terms of the Epanechnikov bandwidth ϵ, but in terms of

*P*

_{ε}, the proportion of simulated points where ‖

*S*(

*X*) –

_{i}*S*(

*x*)‖ ≤ ε.

In the context of the ABC algorithm above the choice of summary statistics and the choice of metric (implicitly Euclidean, in the example above through the use of the Epanechnikov kernel) are intertwined. Ideally one would choose summary statistics that are of low dimension and are also *Bayes sufficient* (Kolmogorov 1942). That is, we want the summary statistics *S*(*x*) to satisfy the condition(4)at all points ω in the parameter space, for all priors *P*(ω) (so that we are free to choose whatever prior we want). In practice, such statistics are rarely available. Many approaches to ABC (Pritchard *et al*. 1999; Marjoram *et al*. 2003; Sisson *et al*. 2007) are based on the idea of “rejection” (of observations falling outside a small acceptance region centered on the observed data), giving *P*(Φ | ρ(*S*(*X*), *S*(*x*)) < = ε) for some metric ρ(·). Thus, particularly for high-dimensional *S*(·), consideration should be given as much to the metric as to the summary statistics. Methods that place more emphasis on conditional density estimation (Beaumont *et al*. 2002; Blum and François 2009) aim to estimate *P*(Φ | *S*(*X*) = *S*(*x*)) more precisely. A goal of such methods is to estimate the density using a larger proportion, possibly all, of the simulated points (Blum and François 2009).

Application of the ABC method to the situation addressed in the present study has a number of difficulties. We wish to make inferences on the demographic history and also on individual loci. This is a problem that is suited to a hierarchical Bayesian approach, and the main contribution of this study is to devise a method for performing hierarchical Bayesian analysis in the likelihood-free framework. In simple models the parameters for each locus are assumed to be identical, and if a likelihood function is available, it is simply multiplied across loci. By contrast, taking mutation rate as an example, in a hierarchical model the *L* loci each have their own parameter. At one extreme, identical to the simple case above, if the variability in mutation rate among loci is zero, then, in the terminology of hierarchical models, strength is “borrowed” completely between the loci, and each locus has an identical posterior distribution for mutation rate, and this is the same as the posterior distribution for the hyperparameter specifying the mean of the prior for each locus (the prior for each locus, having, in this case, zero variance). This verbal description is made clearer in the examples below. At the other extreme, the mutation rates at each locus are inferred independently—they have independent posterior distributions, and their prior has a high variance. More typically, the situation is intermediate.

In a hierarchical model one may be interested only in the posterior distribution of the hyperparameters (What is the mean mutation rate among loci? Is there evidence of nonzero variance in mutation rate among loci?). It is possible to compute summary statistics that are invariant to the ordering of loci such as means, variances, and so forth. We refer to these as *symmetric* summary statistics. This is a typical use of the ABC method for data with many loci (*e.g.*, Pritchard *et al*. 1999 and subsequent likelihood-free articles). The use of means and variances of summary statistics among loci for the ABC analysis allows straightforward inference of the hyperparameters.

By contrast, the focus of the present study is to make inferences on locus-specific parameters, as well as inferring the hyperparameters. This leads to difficulty because one needs summary statistics for each locus. The problem of a plethora of summary statistics has been noted in the Introduction. More fundamental is that, in the absence of missing data, the loci simulated under the model are exchangeable (their ordering or labeling is irrelevant to the likelihood). Thus there is no preferred ordering of the sample loci when compared with those generated by simulation. This problem is intrinsic to any hierarchically structured model and has been encountered before in an ABC setting by Hickerson *et al*. (2006) and Hickerson and Meyer (2008), in which the exchangeable units were taxa (rather than loci, as here). Since the ordering is arbitrary, a naive scheme would simply be to match the summary statistics of the first simulated locus with the those in the first data locus (given an arbitrarily chosen order) and so forth. Although correct in principle, such an approach would be hopelessly inefficient in practice in situations with many loci. Since the ordering is arbitrary, one might find a permutation of the simulated loci that gives the closest match. However, again, with many loci such a procedure is likely to be highly computer intensive, and, without exhaustive searching, not guaranteed to find the optimal match. The method proposed by Hickerson *et al*. (2006) was to rank the taxa by one of the key summary statistics, which makes the problem computationally tractable. However, there is then the problem of which summary statistic to use, and if the statistics are not strongly correlated it may not be very efficient. Similar issues have also been encountered in Sousa *et al*. (2009).

Here our approach is to make use of locus-specific summary statistics together with symmetric summary statistics (those that are invariant to locus ordering) in a computationally efficient way, which we now describe. Suppose that we have a hierarchical model in which there are *L* loci. For the sake of example we concentrate on loci, but the argument can apply to populations or other repeated units. Each locus has a vector of observations (*X _{i}*) and (unobserved) parameter vectors κ

_{i}and λ

_{i}(

*i*= 1, … ,

*L*). Here, we treat λ

_{i}as a parameter of interest and κ

_{i}as a nuisance parameter. We make this distinction for ease of exposition: it is not fundamental to the treatment below. We assume the vector λ

_{i}is of relatively low dimension, while κ

_{i}may be of high dimension. Let κ = (κ

_{1}, … , κ

_{L}) and λ = (λ

_{1}, … , λ

_{L}). The likelihood function for our model is(5)where

*X*=(

*X*

_{1}, … ,

*X*). We assume that, conditional on the hyperparameter α, the priors for each locus are independent, and so(6)Thus the joint prior density

_{L}*P*(α, κ, λ) is(7)with a prior (hyperprior)

*P*(α). Because of conditional independence, it is straightforward to show (appendix) that the joint posterior density can be factorized as(8)or, marginal to the nuisance parameter κ,(9)Focusing out attention on a single locus

*i*, the hyperparameter α and the locus-specific parameter λ

_{i}have the joint density(10)

This factorization suggests that we need to use two distinct types of summary statistics in our approximate Bayesian computation: *symmetric* summary statistics, which are functions of all the loci together (*e.g.*, means, higher moments, …), *S*(*X*) = *S*(*X*_{1}, … , *X _{L}*); and unit-specific summary statistics,

*U*(

*X*). Rather than insisting that the complete set of summary statistics is Bayes sufficient (see Equation 4), we can now make do with the weaker requirement that

_{i}*S*(

*X*) and

*U*(

*X*) satisfy(11)at all points (α, λ

_{i}_{i}) for the chosen prior (or family of priors). We want this to hold exactly or at least as an adequate approximation. In the terminology of

*marginal sufficiency*introduced by Raiffa and Schlaifer (1961, 2000, p. 35) (see also Basu 1977), the factorization (11) tells us that the summary statistic

*S*(

*X*) is marginally sufficient for the parameter α and that the summary statistic (

*S*(

*X*),

*U*(

*X*)) is marginally sufficient for the locus-specific parameter λ

_{i}_{i}. These points motivate two algorithms.

#### Algorithm 1:

For

*k*= 1 to*k*=*N*iterations:Sample (

*A*,_{k}*K*, Λ_{k}_{k}) from the prior*P*(κ, λ | α)*P*(α).Simulate data

*X*(at_{k}*L*loci) from*P*(*X*|_{k}*K*, Λ_{k}_{k}).For locus

*i*= 1 to*i*=*L*compute*U*(*X*)._{k, i}Compute

*S*(*X*)._{k}

Use ABC to condition on

*S*(*X*) =*S*(*x*) (approximately) to obtain a sample of observations*A** from*P*(α |*S*(*x*)) (marginal to κ, λ).For locus

*i*= 1 to*i*=*L*:

Use ABC to condition on *S*(*X*) = *S*(*x*) and *U*(*X _{i}*) =

*U*(

*x*) (approximately) to obtain a sample of observations Λ

_{i}_{i}* from

*P*(λ

_{i}, |

*S*(

*x*),

*U*(

*x*)) (marginal to α, κ).

_{i}Providing the summary statistics are sufficient, and in the limit that the ABC tolerance , this algorithm should sample from the posterior distribution (9) above without additional approximation. There is, however, a practical problem of computer storage associated with this algorithm. If there are *u* summary statistics in *U*(*X _{i}*), we would need to store

*NLu*items. For example, with 10

^{3}loci, 10 summary statistics per locus, 10

^{6}iterations, and 8 bytes per number, we would have 80 Gb of storage as a binary file or in computer memory—much larger, if stored as text files. Thus, although the algorithm may work well with smaller problems there is a generic problem in scaling up.

The second algorithm is similar to sequential ABC algorithms (Sisson *et al*. 2007; Beaumont *et al*. 2009) in which the problem is attacked in two bites.

#### Algorithm 2:

Step 1. For *k* = 1 to *k* = *N* iterations:

Sample (

*A*,_{k}*K*, Λ_{k}_{k}) from the prior*P*(κ, λ | α)*P*(α).Simulate data

*X*(at_{k}*L*loci) from*P*(*X*|_{k}*K*, Λ_{k}_{k}).Compute

*S*(*X*)._{k}

Condition on *S*(*X*) = *S*(*x*) using ABC, to obtain a sample of observations *A** from

Step 2. For locus *i* = 1 to *i* = *L*:

For *k* = 1 to *k* = *N* iterations:

Sample from

*P*(α|*S*(*x*)) ≈*P*(α|*x*) by resampling from the observations*A** generated in step 1.Sample from the conditional prior

*P*(κ_{i}, λ_{i}|*A***)._{k}Simulate data

*X*(at locus_{k,i}*i*only) from .Compute

*U*(*X*)._{k,i}

Condition on *U*(*X _{i}*) =

*U*(

*x*) using ABC, to obtain a sample of observations (

_{i}*A****, Λ

_{i}***) from an

*approximation*to

*P*(λ

_{i}|

*x*, α)

_{i}*P*(α |

*x*).

Note that in step 2 above, if sample sizes are identical at each locus (no missing data), then it is necessary to iterate only for one locus, rather than for locus *i* = 1 to *i* = *L*, because the distribution is the same. The advantage of Algorithm 2 over Algorithm 1 is that it scales easily with increasing numbers of loci. The amount of storage is 1/*L* less than Algorithm 1. The time cost of Algorithm 2 is potentially twice as high, but for, *e.g.*, simulated data or data with equal sample size at each locus it is of the same order as that of Algorithm 1. With a computing cluster of many nodes, the overall execution time may be quite low because step 2 in Algorithm 2 can be performed independently for each locus. An additional advantage is that in the second round of simulation the hyperparameter α is already sampled from an approximation to the posterior distribution, and therefore, as with sequential methods (Sisson *et al*. 2007; Beaumont *et al*. 2009; Toni *et al*. 2009), there is a potential for increased precision in our approximation to the posterior distribution of λ_{i}, ameliorating that apparent inefficiency of having a second round of simulation. However, a key point to note is that Algorithm 2, in contrast to Algorithm 1, involves an approximation that is in addition to that arising from the use of summary statistics that do not satisfy the marginal sufficiency conditions in (11) and nonzero tolerance ε.

To simplify the explanation of this additional approximation, we assume that we are performing ABC on complete data and that, by whatever means, we can sample α from the true posterior distribution. Then in the two-step algorithm, after step 1, we have a sample from(marginal to κ_{i}), where is the random variable corresponding to the data simulated in the second round. Using ABC we then condition on = *x _{i}*. This gives us a sample of observations (

*A****, Λ

_{i}***) fromwhich is not the same as the desired posterior density

*P*(λ

_{i}, α |

*X*=

*x*).

By contrast, consider a modification of the two-step algorithm, where we sample from *P*(α | *X*_{–i} = *x*_{–i}) at step 1 [instead of *P*(α | *X* = *x*)]. (The subscript –*i* indicates all the data except that from locus *i*.) Now we have a sample fromIf we condition on , we obtain a sample of observations (*A****, Λ_{i}***) frombecauseWhen the number of loci *L* is large, we then expect that any one locus *i* will make an almost negligible contribution to the information about the hyperparameter α, so thatTherefore, in this case our two-step algorithm should differ very little from the modified algorithm that can be demonstrated to provide samples from the correct posterior distribution (with ABC error).

## APPLICATION TO GENE FREQUENCY DATA

#### The model:

The primary aim of this study was to model local selection and compare the results of the ABC algorithm with the Bayesian method of Beaumont and Balding (2004), which uses an explicit multinomial-Dirichlet function for the likelihood. We wished to investigate the relative efficiency of both methods, using receiver operating characteristic (ROC) analysis. In one case microsatellite data are simulated with low variation in mutation rate among loci, and in the other it is high. It is expected that the multinomial-Dirichlet likelihood will behave poorly in the latter case because it assumes that all genetic variation is ancestral (*i.e.*, it arises in the “collecting phase” of Wakeley 1998). To keep the models similar, we assume an island model. The multinomial Dirichlet arises under an infinite-island or continent-island case (Balding and Nichols 1994; Rannala and Hartigan 1996), but it is pragmatically easier for the ABC analysis to assume a finite number of demes equal to the number of samples. Unlike Beaumont and Balding (2004) we consider only a model in which positive local selection is modeled.

Variation in mutation rate and migration rate is modeled in a hierarchical Bayesian framework, similar in conception to that described in Storz and Beaumont (2002). We assume that there are *D* demes. The scaled mutation rate at the *i*th locus is θ_{i} = 2*N*μ_{i}, where *N* is the haploid effective size of the deme and μ_{i} is the mutation rate at the *i*th locus. The scaled mutation rate, θ_{i}, has a prior that is a log_{10}-normal distribution with (on a log_{10} scale) mean μ_{θ} and standard deviation σ_{θ}. We use a Gaussian hyperprior for μ_{θ} and a truncated Gaussian for σ_{θ} (Table 1). Note that we do not use an inverse gamma for the σ_{θ}, following the recommendation of Gelman (2006). Variation among loci in migration rate is modeled in a somewhat different way. The principal idea is that there is an indicator *Z _{i}* that takes the value 0 if the

*i*th locus is ‘neutral’ and 1 if it is subject to local selection. The prior for this is Bernoulli with probability ρ

_{Z}that the locus is under selection—

*i.e.*, the prior expected number of loci under selection is

*L*ρ

_{Z}. The hyperprior for ρ

_{Z}is beta with parameters given in Table 1. Using the approximation of Petry (1983) that local selection acts to reduce the apparent migration rate, we assume that the

*i*th locus and the

*j*th deme have scaled migration rate

*M*= 2

_{ij}*Nm*, whereThe neutral migration rate varies among demes with a log

_{ij}_{10}-normal prior having (on a log

_{10}scale) mean μ

_{M}and standard deviation σ

_{M}, with Gaussian hyperprior (Table 1). Note that since we have a constant θ across demes, we implicitly assume that variation in

*M*across demes is through

_{n}*m*and

*N*is constant. For local directional selection we assume that

*𝒮*has a prior given by a scaled beta distribution with density . Thus, for a locus under local directional selection the prior migration rate has a maximum equal to the neutral migration rate, but is more heavily weighted toward lower values. The directed acyclic graph (DAG) for this model is given in Figure 1.

_{ij}In all our examples below, the point values chosen for the parameters of the hyperpriors, *a*_{1}, … , *a*_{6}, *b*_{1}, … , *b*_{6}, are given in Table 1.

The likelihood function for our model has the form(12)where here we implicitly marginalize over the nuisance parameter, κ. Here *X* = (*X*_{1}, …, *X _{L}*) with

*X*= (

_{i}*X*

_{i1}, …,

*X*), and . The locus-specific parameters are λ

_{iD}_{i}= (θ

_{i},

*M*

_{i1}, …,

*M*,

_{iD}*Z*). The joint prior

_{i}*P*(λ, α) factorizes as in (7). The factor

*P*(α) (the hyperprior) is now of the form(13)and each factor

*P*(λ

_{i}| α), of the prior density, is of the form(14)Each factor

*P*(

*X*| λ

_{i}_{i}) of the likelihood function is of the form(15)

For a model of this form, with this choice of prior, the marginal posterior density *P*(α, λ | *X*) has a factorization of the form (9), in which α and λ_{i} are replaced by the parameters of our genetic model, as specified above. Hence our model is amenable to the use of Algorithms 1 and 2.

#### Summary statistics:

The main aim of the model is to characterize the level of genetic differentiation between populations and differences among loci in their levels of differentiation and genetic variability. The choice of summary statistics has then been based on earlier work relating the expected value of summary statistics to demographic parameters and also to work that has used summary statistics of differentiation to identify loci that are potentially under selection (Beaumont and Nichols 1996; Vitalis *et al*. 2001; Excoffier *et al*. 2009). The strategy has been to compute a set of locus-specific summary statistics *U*(*X _{i}*), and then, for the symmetric summary statistics

*S*(

*X*), the means and other moments of these statistics over loci have been computed.

#### Locus-specific summary statistics:

For each locus we computed the following:

The observed probability of nonidentity in state of gene copies between populations,

*H*_{B}, computed as in Beaumont and Nichols (1996), based on the estimator of Weir and Cockerham (1984).The Weir and Cockerham (1984) estimator of

*F*_{ST}, computed as in Beaumont and Nichols (1996).The logarithm of the variance in microsatellite length between populations. The variance in microsatellite length between populations is in Rousset (1996).

The statistic of Rousset (1996), modified from Slatkin (1995), computed as , where is the within-population variance in length, averaged over populations, without weighting for differences in sample size.

The variance in the Weir and Cockerham (1984) estimator of

*F*_{ST}estimated for individual alleles (microsatellite lengths). In this case, a locus with*K*alleles was converted into_{i}*K*biallelic loci with allele frequencies comprising those of the target allele and all the others combined._{i}In a

*K*×*D*table of presence/absence of an allele (microsatellite length) in a population, the proportion of pairwise comparisons between populations in which an allele is observed in at least one of the populations, averaged over alleles. This summary statistic has no previous theoretical basis, but was observed to reduce the mean square error of parameter estimates in simulation tests.The variance of the within-population Weir and Cockerham estimator of

*F*_{ST}(Weir and Hill 2002), computed as in Vitalis*et al*. (2001).The variance of within-population computed analogously [

*i.e.*, as , where is computed for each population rather than averaged].

#### Symmetric summary statistics:

To infer hyperparameters we computed 60 symmetric summary statistics *S*(*X*), invariant to locus ordering. These included the mean, variance, skew, and kurtosis over loci of the 8 summary statistics above, giving 4 · 8 = 32 summary statistics, and then the covariance over loci of all 28 pairs of summary statistics.

#### Transformation of symmetric summary statistics:

Previous studies have suggested the use in ABC of transformations, including rotations of the summary statistics (Fagundes *et al*. 2007; Wegmann *et al*. 2009). Because a large number of summary statistics were used, we considered the use of orthogonal transformations of the data to reduce dimensionality. There appear to be two main issues. First, with a large number of summary statistics, many of which are uninformative, a large amount of “noise” is introduced into the computation of distance of simulated data from the observations. Essentially, summary statistics that are unaffected by the parameter values should be weighted out of the distance calculation (Hamilton *et al*. 2005) or not chosen at all (Joyce and Marjoram 2008). Second, there may be a problem of collinearity and resulting instability of the regression once many summary statistics are introduced.

The use of partial least squares (PLS) in an ABC context has been suggested by Wegmann *et al*. (2009). With PLS the orthogonal axes are ordered by decreasing covariance with the independent variable, and it is often used in calibration problems (Gemperline 2007). In our two-step procedure, we need to sample parameters from the joint posterior distribution of hyperparameters, which creates a difficulty because standard PLS assumes a univariate independent variable. A modification of the PLS algorithm exists (PLS-2) for use with a multivariate independent variable. However, we have chosen to use principal component analysis (PCA), also commonly used in calibration and typically producing similar results (Mevik and Wehrens 2007), which orders the axes by decreasing variance. A potential disadvantage of PCA is that axes with small eigenvalues may still have high correlation with the independent variable (here the parameter of interest). To take into account possible correlations between eigenvalues and independent variables, at least marginally, we have defined the following procedure:

The summary statistics sampled from the prior predictive distribution were scaled to have unit variance and rotated (using the R package *Prcomp*).

A Box–Cox transformation was then applied to the resulting eigenvectors.

These were then standardized once more to have unit variance and centered to have zero mean.

The Euclidean distance between these points and the target was computed.

On the basis of the 5% closest points, for the *i*th component and *j*th parameter value, the squared correlation coefficient was computed. The components were ranked by the proportion . The set of ranked components in which was retained, for each parameter *j*.

The union was formed over all parameters of the above sets.

The 30 components with the highest eigenvalue were then retained.

The regression-based ABC method was then applied (as outlined in Equations 1–3) with *P*_{ε} = 0.02.

No claim is made that the above procedure is optimal, and it was obtained through trial and error, on the basis of simulated data with known parameter values. A particular feature of the approach is that there appears to be reduced sensitivity to the addition or removal of summary statistics. The locus-specific summary statistics were used in ABC regression without rotation or further transformation.

#### The algorithm:

Our inference procedure is divided into two steps. We initially approximate the posterior distribution of the higher-level parameters using *S*(*X*), and we then approximate the posterior distribution for locus-specific parameters using *U*(*X*), as outlined in the following ABC algorithm, based on algorithm 2 above:

Compute symmetric summary statistics from the data.

Sample the following:

ρ

_{Z}, μ_{M}, σ_{M}, μ_{θ}, σ_{θ};θ

_{i},*Z*,_{i}*𝒩*;_{j}*M*._{ij}

Run a coalescent simulation of an island model (described in Beaumont and Nichols 1996; Beaumont and Balding 2004) to obtain data sets

*X*._{ij}From the simulated data sets, compute the symmetric summary statistics from the

*X*in the same way as for step 1 above._{ij}Return to step 2 until

*n*sets of summary statistics are obtained.Perform regression ABC (as outlined in the preceding section) to obtain

*P*_{ε}*n*samples from the posterior distribution, where*P*_{ε}is the proportion of points accepted.For each locus

*i*:Compute locus-specific summary statistics from the data for this locus.

In the following order:

Sample with replacement from the

*P*_{ε}*n*samples generated at step 6, ρ_{Z}, μ_{M}, σ_{M}, μ_{θ}, σ_{θ}.Sample θ

_{i},*Z*._{i}Sample for

*j*= 1, …,*D*.Sample

*M*for_{ij}*j*= 1, …,*D*.Sample

*X*for_{ij}*j*= 1, …,*D*.Compute the locus-specific summary statistics as for step 7a above.

Return to step 7b until

*n*sets of summary statistics are obtained.

Perform ABC one locus at a time (this time measuring locus-specific summary statistics).

## PERFORMANCE

To examine the performance of the ABC approach we simulated groups of 100 data sets for 15 different combinations of parameters (scenarios), chosen to vary widely under the assumed prior (Table 1). An archive (suitable only for running on a cluster) containing source code, scripts, and input files for repeating and checking the results presented here is available at http://www.rubic.reading.ac.uk/∼mab/stuff/ABCsims.zip. These scenarios included selection coefficients of 0.02 and 0.1. We used ROC analysis, implemented in the ROCR package (Sing *et al.* 2005), as in Riebler *et al*. (2008), to compare the ABC method with BayesFst (Beaumont and Balding 2004). In the case of the ABC method the classifying variable is the posterior probability of locus *i* being under selection, *P*(*Z _{i}* = 1 |

*U*(

*X*),

_{i}*S*(

*X*)), while in the case of BayesFst it is a Bayesian

*P*-value (Beaumont and Balding 2004). Specifically, for the case here, the

*P*-value we use is the posterior probability that the locus effect, α, is less than or equal to zero and hence is a one-tailed

*P*-value, for consistency with the ABC model, rather than two tailed as in Beaumont and Balding (2004). We then compute 1 –

*P*-value so that values close to 1 indicate selection. In the ROC analysis (see,

*e.g.*, Fawcett 2006 for further information) we determine the proportion of false positives and true positives for each value of the threshold that is used to determine whether the classifying variable indicates a locus under selection. This yields a monotonic curve with no positives (true or false) at one end and all positives at the other. If a method has no classification power, the curve should be linear with slope 1, and the area under the ROC curve (AUC) should be 0.5. If a method has perfect classification power, the AUC should be 1.

We simulated data sets using the program that was used to simulate data sets under selection in Beaumont and Balding (2004). This simulates an island model and allows a certain proportion of loci to have alleles that are under selection: either locally positively selected or under balancing selection. We simulated scenarios with six demes (as in Beaumont and Balding 2004) and 100 independent loci and with 100 gene copies taken from each deme. In all simulations the migration rate varied among demes with individual population *F*_{ST}'s drawn from a beta distribution (see Table 2 legend). This leads to an approximately Gaussian distribution of , as assumed in the model. We tested 15 scenarios (Table 2). Each scenario consisted of 100 replicates (*i.e.*, the total number of simulated loci in Table 2 is 150,000). The data sets were analyzed with the ABC algorithm described above and compared with BayesFst. In the ABC analysis 500,000 iterations were used for both the genome-wide parameter estimation *P*(α | *S*(*X*)) and the locus-specific parameter estimation *P*(λ_{i} | *U*(*X _{i}*),

*S*(

*X*)). For the rejection step, we used the 2% nearest points.

An illustration of the application of the method is given in Figures 2 and 3, which are based on one of the data sets generated for the ROC analysis (scenario 15 in Table 2). Figure 2 shows the posterior distribution of genome-wide parameters and Figure 3 shows the posterior probability *P*(*Z _{i}* = 1 |

*U*(

*X*),

_{i}*S*(

*X*)) for each locus. In this example it can be seen that the loci that were simulated to be under selection generally have a higher posterior probability to be under selection, and the posterior mode of the number of loci inferred to be under selection, , is close to the true number of 5, and unsurprisingly ρ

_{Z}has a mode of ∼0.05. The demographic parameters are inferred somewhat less well in this example and reflect the influence of the chosen prior. The scaled mutation rate is well estimated, but the inferred value of the scaled migration rate is generally rather too low and weighted toward the prior. The posterior distribution for the variance in mutation rate is broad and tends to follow the prior. The estimated variance among demes in migration rate is rather low and strongly influenced by the prior. The goodness of fit of the model can be examined by seeing how well the symmetric summary statistics

*S*(

*X*) computed from the data fit within the prior predictive distribution (see also Ratmann

*et al*. 2009). Since a principal components rotation is used it is relatively straightforward to visualize the fit of the model by plotting the distribution along each axis. An example, using

*x–y*plots of a selection of axes, is given in supporting information, Figure S1. Unsurprisingly, since the data are simulated from the same model used in the analysis, there is a very good fit.

Overall, in the ROC analysis of the 15 scenarios (Table 2), the performance of the ABC method is quite competitive with BayesFst for both *s* = 0.02 and *s* = 0.1. Although the ABC method often has a slightly lower AUC, the difference is marginal and of the order of the confidence interval. However, in the two scenarios in which there is variability in mutation rate there is superior performance of the ABC method, well beyond the confidence limits of the AUC estimates. Representative numbers, corresponding to rows of Table 2, are given in Figures 4 and 5. The confidence limits are not plotted because they lie close to the estimates. The difference in performance for variable mutation rate arises because the multinomial-Dirichlet model of BayesFst assumes the mutation rate to have a negligible effect on variance in gene frequencies between demes. Thus, when the mutation rate is variable, it contributes to additional variance in gene frequencies between demes, which in BayesFst is attributed to local selection.

Figure 6 shows that, at least for this scenario for the ABC method, the precision, which is 1 – (false discovery rate), initially increases rapidly with the posterior probability that the locus is under selection (the “cutoff”) and then smooths off with a false discovery rate of <20% after a posterior probability of ∼0.2. By contrast, the BayesFst classifier shows the inverse behavior—it is most sensitive to change in the classification threshold nearer 1. This difference is not surprising, given the different nature of the methods, but suggests that, with the ABC model, posterior probabilities >0.2 are potentially “interesting.” It should be noted that unlike, for example, Riebler *et al*. (2008), who used a uniform prior, we have explicitly chosen a prior that gives most weight to *P*(*Z _{i}* = 0).

## AN EXAMPLE APPLICATION

We analyzed microsatellite data obtained from a survey of chimpanzee populations from western and central Africa, published by Becquet *et al*. (2007). These data consist of frequencies sampled in 84 chimpanzees that have been genotyped at 309 microsatellite loci. The study by Becquet *et al*. used clustering methods and identified “western,” “central,” and “eastern” groups. Of these, we used 64 individuals that had precise designations of location (rather than inferred genetically), giving sample sizes, respectively, of 41, 16, and 7. The gene frequencies were then analyzed using BayesFst and our ABC method (Figures 7 and 8).

The goodness of fit of the ABC simulation can be analyzed by comparing the observed summary statistics to the prior predictive distribution (Figure S2). In this case the data are often on the outer edges of the prior predictive distribution in some projections, but are not markedly outlying. The marginal posterior distributions obtained for the hyperparameters (Figure 7) indicate that gene flow is very low, in line with the conclusions of Becquet *et al*. (2007). This is a scenario in which it is expected that differences in mutation rate among microsatellites will have a major impact on estimates of *F*_{ST}. This is indeed observed: the BayesFst analysis yields a large number of positives (Figure 8), which, on the basis of the ROC analysis and theoretical expectations, are likely to be mainly erroneous. By contrast, the ABC analysis suggests that there are possibly two interesting loci, with posterior probabilities >0.2. This conclusion is based on the results from the simulated data sets above (see Figure 6 and related text for rationale). The estimates of posterior probabilities in the ABC analysis shown in Figure 8 have generally low standard errors (on the logit scale), which indicates a reasonable goodness of fit. If the real data are outliers under the model, then the regression step in ABC is an extrapolation, and estimates tend to have very large standard errors. The microsatellites identified by the ABC analysis are GATA81B01 and ATA28C05. The former has not been mapped in *Pan troglodytes*, but is located on the sixth chromosome in *H. sapiens*. The latter has been mapped on the X chromosome of *P. troglodytes* and its nearest ORF is LOC739998 of unknown function.

## DISCUSSION

The main contribution of this article has been to demonstrate how one can apply ABC-based models to complex scenarios where the number of summary statistics necessarily scales with the number of parameters in the model. By treating these cases within the hierarchical Bayesian framework, we show how it is possible to deal with quite complicated problems in a computationally feasible way.

We have introduced two algorithms. Both are based on the idea that two types of summary statistics are computed from the data: symmetric summary statistics *S*(*X*) used to infer the hyperparameters and those that are unit specific, *U*(*X _{i}*), used to infer parameters. Although, in our treatment, the

*S*(

*X*) are simple functions of the

*U*(

*X*), it should be noted that there is no necessity for consistency in, or any formal relationship between, the summary statistics that are used for inferring the hyperparameters and those for inferring the parameters. This distinction is essentially irrelevant providing that the posterior distribution of α is sufficiently accurately approximated. Algorithm 1 is simpler and has the theoretical advantage of sampling from the correct posterior distribution in the limit of zero tolerance and sufficient statistics. However, it suffers from quite significant problems of storage. This may not be an issue in the longer term as computing resources become more extensive. At the present time, storage is certainly an issue to consider when the number of units (loci, individuals, etc.) is >100. Algorithm 2 avoids this storage problem. However, this algorithm involves an approximation (additional to the use of summary statistics in place of the complete data). In Algorithm 2, the second round of simulation will improve the precision in estimates of because it samples α from

_{i}*P*(α|

*S*(

*X*)) rather than from

*P*(α). Therefore it may be possible to accept a high proportion of simulated observations, while using a relatively small tolerance. Looking at the problem from the perspective of importance sampling (as in Beaumont

*et al*. 2009; Toni

*et al*. 2009), it is inviting to consider the weight necessary to correct the error in Algorithm 2. It is straightforward to show that the weight is inversely proportional toThat is, if each observation

*k*in step 2 (v) of Algorithm 2 is given a weight that is inversely proportional to the marginal likelihood

*P*(

*X*|

_{i}*A*), the resulting weighted sample will be drawn from the correct distribution. Unfortunately the quantity

_{k}*P*(

*X*| α) is not in general easy to compute (otherwise there would be no need to recourse to ABC!). Our main argument in favor of Algorithm 2 is that the approximation will be very slight when the number of units (loci) is large, and scenarios when the number of units is low can be handled by Algorithm 2. The modification, 2a, to Algorithm 2, which is exact, would also be infeasible, requiring separate simulations of step 1 for each locus. Experiments (not shown) with toy simulations based on a beta-binomial model suggest that even with 2 units the approximation in Algorithm 2 is good. With the beta-binomial the ABC can be simulated exactly, the weight above can be computed, and Algorithms 1, 2, and 2a can be easily performed and compared.

_{i}One potential criticism of the comparison between our ABC approach and that of Beaumont and Balding (2004) is that one uses a model-choice framework and the other is based on Bayesian *P*-values. Thus it might be argued that we have confounded an intrinsic advantage of the model-choice framework with good performance of ABC. However, with low, nonvariable mutation rates there appears to be relatively little difference in performance of the various approaches to detecting selection that are based on differences in gene frequency. For example, Beaumont and Balding (2004) showed that the difference in performance of the moment-based method of Beaumont and Nichols (1996) was relatively slight. Riebler *et al*. (2008), who reformulated the model of Beaumont and Balding (2004) into an explicit model-choice framework, demonstrated by means of ROC analysis only a small improvement. Small improvements are also found in Foll and Gaggiotti (2008) and Guo *et al*. (2009). Therefore we argue that the similar performance of BayesFst and the ABC approach with low, nonvariable mutation rates and the better performance of the ABC method with high and variable mutation rates are not biased by an intrinsic superiority of the model-choice framework.

An additional criticism of our model is that we have not included the ability to detect balancing selection, which is present in the methods of Beaumont and Balding (2004), Foll and Gaggiotti (2008), and Riebler *et al*. (2008). Although it would be straightforward to implement, it was not an aim of this study. It is unlikely that by failing to implement a balancing selection component, we have thereby artificially increased the power of the ABC approach in comparison with the multinomial-Dirichlet model. Since the signal of local selection is increased variance in allele frequencies among demes, these would not be placed in a balancing selection category anyway. We note that the attempt to use low *F*_{ST} as a signal of balancing selection is logically somewhat problematic. If a locus is truly under balancing selection, it is unlikely that the selection coefficients will be identical in each population. Thus we might typically expect the selection coefficients to vary among populations so the equilibria should vary among populations. For populations with relatively high migration rates it is conceivable that loci under balancing selection may have elevated *F*_{ST} relative to neutral expectation.

By assuming that the scaled mutation rate θ is the same in all demes (while allowing for varying scaled migration rate, *𝒩 _{j}*), we tacitly assumed constant effective size

*N*in each deme. This may be considered somewhat unrealistic, and a future improvement to the model would allow for variation in deme size. This would be preferable to variable θ because then one could include covariance between

*𝒩*and θ through shared

*N*. Variability in effective size over time could also be considered. Such improvements may reduce the discrepancies observed in the fit of the model to the chimpanzee data (Figure S2). An advantage of the explicit model-based approach advocated here is that it is relatively easy to examine model discrepancy (see Ratmann

*et al*. 2009 for detailed discussion).

In addition to the modeling of potential candidates of balancing selection, further improvements to our demographic model could include, within the island model framework, the number of demes as a parameter to be inferred. This is potentially important when considering the effects of mutation on gene frequencies. For example, in the case of an infinite-allele, finite-island model with *F*_{ST} defined as in Rousset (1996) we have(for small *m* and μ). A locus with a higher mutation rate is therefore expected to have reduced *F*_{ST} but the strength of the effect depends on the deme size, *N*. Information about the mutation rate is provided by the metapopulation heterozygosity, *H*_{T}, which depends *both* on the deme size *and* on the number of demes becausewhereTherefore we expect that very highly heterozygous loci will have reduced *F*_{ST}, potentially leading to false positives for balancing selection (Beaumont 2008; Excoffier *et al*. 2009), but the amount that *F*_{ST} is reduced for a given level of heterozygosity depends on the number of demes in the metapopulation. If the number of demes is large, but they have small size, then an elevated mutation rate may have little effect on *F*_{ST}.

Further extensions of the model may include more general migration matrices, and range expansion, to allow for isolation-by-distance effects [necessary to model human demography (Prugnolle *et al*. 2005)]. It would also be necessary to consider more general mutation models to allow analysis of sequence data. Much of this could be achieved by the use of general-purpose packages (Rambaut and Grassly 1997; Hudson 2002; Laval and Excoffier 2004). To demonstrate the utility of the approach we have applied it to the problem of detecting loci under selection. It is important, however, to emphasize that not all problems can be handled as straightforwardly by a Bayesian hierarchical model, for example, when conditional independence cannot be assumed. There are other areas of application of our ABC method, including population assignment in a more realistic genealogical setting. Its use in fields outside population genetics can also be envisaged.

## APPENDIX: FACTORIZATION OF THE POSTERIOR DENSITY

In this appendix, we derive the factorization (8), and hence (9), under assumptions that are slightly more general than those set out in (6) and hence (5). In fact we continue to assume that the joint prior *P*(κ, λ, α) factorizes as in (6), but we assume that the likelihood function *P*(*X*|κ, λ, α) for our model has the factorization (12). Note that here, α is also a parameter of the model. This formulation covers the special case where the parameter λ_{i} is simply a function of κ_{i} and α.

From the factorization (12) of the likelihood function, and the factorization (6) of the prior density, it follows that the joint density *P*(α, κ, λ, *X*) has the factorization(A1)The marginal density *P*(α, *X*) is therefore(A2)where(A3)Dividing (A1) by (A2) we have(A4)Substituting the right-hand side of (A4) into the factorization(A5)we obtain the factorizations (8) of the posterior density *P*(α, κ, λ | *X*).

## Acknowledgments

We are grateful for the constructive critique of two anonymous referees. This work was supported by Biotechnology and Biological Sciences Research Council (BBSRC) grant BBS/B/12776 to M.B. and K.D. and Engineering and Physical Sciences Research Council grant EP/C533550/1 to M.B. E.B. acknowledges grant ANR 07-BDIV-003 (Emerfundis project) for further support. Rothamsted Research receives grant-aided support from the BBSRC.

## Footnotes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.109.112391/DC1.

↵1

*Present address:*Centre de Biologie et de Gestion des Populations (CBGP), Campus International de Baillarguet, CS 30 016, 34988 Montferrier/Lez Cedex, France.Communicating editor: R. Nielsen

- Received November 24, 2009.
- Accepted March 29, 2010.

- Copyright © 2010 by the Genetics Society of America