## Abstract

The distribution of fitness effects (DFE) of new mutations is of fundamental importance in evolutionary genetics. Recently, methods have been developed for inferring the DFE that use information from the allele frequency distributions of putatively neutral and selected nucleotide polymorphic variants in a population sample. Here, we extend an existing maximum-likelihood method that estimates the DFE under the assumption that mutational effects are unconditionally deleterious, by including a fraction of positively selected mutations. We allow one or more classes of positive selection coefficients in the model and estimate both the fraction of mutations that are advantageous and the strength of selection acting on them. We show by simulations that the method is capable of recovering the parameters of the DFE under a range of conditions. We apply the method to two data sets on multiple protein-coding genes from African populations of *Drosophila melanogaster*. We use a probabilistic reconstruction of the ancestral states of the polymorphic sites to distinguish between derived and ancestral states at polymorphic nucleotide sites. In both data sets, we see a significant improvement in the fit when a category of positively selected amino acid mutations is included, but no further improvement if additional categories are added. We estimate that between 1% and 2% of new nonsynonymous mutations in *D. melanogaster* are positively selected, with a scaled selection coefficient representing the product of the effective population size, *N*_{e}, and the strength of selection on heterozygous carriers of ∼2.5.

THE increasing availability of large, genome-wide data sets on DNA sequence variability within populations has stimulated the development of statistical population genetic methods for fitting models of the evolutionary forces affecting sequence evolution and variability and estimating the parameters of the models, especially the strength of positive and purifying selection (reviewed by Eyre-Walker and Keightley 2007; Wright and Andolfatto 2008; Sella *et al.* 2009; Charlesworth 2011). These methods have been applied to both noncoding and coding sequences and within coding sequences to selection on codon usage at synonymous sites (Bulmer 1991; Akashi 1995; Zeng and Charlesworth 2009; Sharp *et al.* 2010) and nonsynonymous sites (Sawyer *et al.* 1987; Bustamante *et al.* 2002; Piganeau and Eyre-Walker 2003; Eyre-Walker *et al.* 2006; Loewe *et al.* 2006; Keightley and Eyre-Walker 2007; Boyko *et al.* 2008).

There is general agreement that nonsynonymous mutations are usually subject to the strongest selection pressures relative to other types of single-nucleotide mutation, and much attention has been given to the following two questions, which can in principle be answered by large-scale studies of within-species variation and between-species divergence. What is the nature of the distribution of selection coefficients against newly arising nonsynonymous mutations that have deleterious effects on fitness? What is the fraction (α) of nonsynonymous differences between two related species that have been driven to fixation by positive selection, as opposed to neutral or slightly deleterious mutations that were fixed by random genetic drift? While these questions are far from being completely answered, evidence from a variety of organisms and methods suggests that there is a wide distribution of selection coefficients against nonsynonymous mutations, with the bulk of variants found segregating in populations being only weakly deleterious (Sawyer *et al.* 1987; Bustamante *et al.* 2002; Piganeau and Eyre-Walker 2003; Eyre-Walker *et al.* 2006; Loewe *et al.* 2006; Keightley and Eyre-Walker 2007; Boyko *et al.* 2008; Haddrill *et al.* 2010). In several species of *Drosophila*, mice, bacteria, and some plants, there is evidence that α is of the order of 50%, whereas in hominids and some plants it is apparently much lower (Smith and Eyre-Walker 2002; Charlesworth and Eyre-Walker 2006; Welch 2006; Andolfatto 2007; Shapiro *et al.* 2007; Bachtrog 2008; Boyko *et al.* 2008; Eyre-Walker and Keightley 2009; Strasburg *et al.* 2009, 2011; Gossmann *et al.* 2010; Halligan *et al.* 2010; Haddrill *et al.* 2010; Ingvarsson 2010; Jensen and Bachtrog 2010; Slotte *et al.* 2010).

The rate of fixation of advantageous nonsynonymous mutations is proportional to the product of the proportion of new nonsynonymous mutations that are selectively advantageous (*p*_{a}) and their rate of fixation once they arise in the population, assuming that the rate of adaptive evolution is limited by the supply of new mutations (Ohta and Kimura 1971). If the population size is large, the rate of fixation of new nonsynonymous mutations in a randomly mating population is proportional to the product of the effective population size, *N*_{e}, and their mean selective advantage in the heterozygous state, *s*_{a/2} (Ohta and Kimura 1971). For a given rate of fixation of neutral or slightly deleterious nonsynonymous mutations, α is thus controlled by the product *p*_{a}*N*_{e}*s*_{a}. In several recent analyses *p*_{a} and *N*_{e}*s*_{a} have been estimated separately by several different methods, using polymorphism and divergence data sets from *Drosophila*. Very disparate estimates have been obtained by these methods, ranging from a very low frequency of adaptively favorable mutations with relatively strong selective advantages (Eyre-Walker 2006; Li and Stephan 2006; Macpherson *et al.* 2007; Jensen *et al.* 2008; Jensen 2009) to a relatively high frequency with a very small mean selective advantage (Sawyer *et al.* 2003; Andolfatto 2007) and a mixture of strongly and weakly selected advantageous mutations (Sattath *et al.* 2011).

In this article, we develop a new method for dealing with this problem, in which we extend the maximum-likelihood estimation procedure of Keightley and Eyre-Walker (2007) and Eyre-Walker and Keightley (2009) for estimating the distribution of deleterious selection coefficients and α. The extended method allows the inclusion of contributions from advantageous mutations to nonsynonymous diversity, which permits *p*_{a} and *N*_{e}*s*_{a} to be estimated simultaneously with the other parameters. A similar method was developed by Boyko *et al.* (2008), but they did not explore the performance of their method in any depth. However, they did note that it was very difficult to disentangle the rate and strength of advantageous mutation when applied to data from hominids. We apply our method to the data sets of Shapiro *et al.* (2007) and Callahan *et al.* (2011) on within-species variability and between-species divergence for protein-coding genes in *Drosophila* *melanogaster* and find evidence that *p*_{a} is of the order of 1.5% and *N*_{e}*s*_{a} ∼ 5.

Our new method requires assignment of the allele corresponding to the ancestral state at each site. Using parsimony, the ancestral state would correspond to the allele found in the closest outgroup species, in this case *D. simulans*. However, this does not take into account the possibility of a substitution in the *D. simulans* lineage, nor does it consider rate variation among sites (where some sites are more constrained than others). Thus, a probabilistic approach using two different outgroups was used, in which the first step is to estimate the properties of the substitution process (the substitution rate matrix and the degree of rate variation) as well as the distances between the three species. The second step is to estimate the probability distribution for each ancestral nucleotide at each site under a probabilistic substitution model. This takes into account potential substitutions in any of the outgroups, which depend on the evolutionary distances between the species. The influence of the more distant outgroup (here *D. yakuba*) is smaller than that of the closer outgroup (*D. simulans*), but it is a strong indicator of a site being either more constrained (if the states of the two outgroups coincide) or more variable (if the outgroup states disagree).

## Materials and Methods

### Data

We analyzed two sets of *D. melanogaster* polymorphism data. The African subset of the *D. melanogaster* protein-coding gene sequences described by Shapiro *et al.* (2007) was kindly provided by Joshua Shapiro and consists of 15 *D. melanogaster* alleles (11 originating from Zimbabwe and 2 each originating from Botswana and Zambia) along with 1 outgroup allele from *D. simulans*.

The sequences of an additional outgroup, *D. yakuba*, were obtained through the UCSC Genome Browser (Kent *et al.* 2002) by aligning the *D. melanogaster* “base sequences” to the reference genome (version dm3) and then using the pairwise genome alignment (version vsDroYak2) to map each nucleotide to the orthologous nucleotide in *D. yakuba*. Of the 397 protein-coding loci, 5 had to be discarded due to ambiguous mapping to the reference genome. The remaining 392 loci were analyzed and yielded 181,415 zerofold sites that were used as nonsynonymous sites and 42,113 fourfold sites that were used as synonymous sites. A second data set, described by Callahan *et al.* (2011), was kindly provided by Peter Andolfatto and consists of 24 *D. melanogaster* alleles originating from Zimbabwe, along with *D. simulans* and *D. yakuba* outgroup alleles. The 213 protein-coding loci provide a total of 80,809 nonsynonymous (zerofold) and 19,574 synonymous (fourfold) sites. In the case of Shapiro *et al.* (2007), the maximum-likelihood (ML) analysis was simultaneously applied to sites where all 15 alleles were present and to sites with up to 5 missing alleles. For the Callahan *et al.* (2011) data set, we allowed up to 8 missing alleles from the 24 sequenced. Sites with >2 segregating alleles were excluded from the analysis.

### Model

We assume that all nucleotide sites are in linkage equilibrium and that up to two variants can segregate at a site. We assume that there is a class of sites at which mutations are exclusively neutral (“neutral sites”) and a class of sites at which both advantageous and deleterious mutations can occur (“selected sites”). We assume that there are *n*_{a} classes of advantageous mutations. We assume intermediate dominance and independence among sites. The fitness effect of class *i* (*i* = 1 … *n*_{a} ) is *f*(*s*_{d}), with scale and shape parameters *a* and *b*, respectively, and *s*_{d} is the fitness difference between the wild-type and mutant homozygotes. The fraction of advantageous mutations is *p*_{a} of mutations are deleterious.

### Obtaining the unfolded site frequency spectrum

The unfolded distribution of the number of copies of the derived allele in a sample of *n*_{T} alleles from a population (the unfolded site frequency spectrum, SFS) is a vector **p**(). Let **p**(sel) and **p**(neut) denote the vectors for selected and neutral sites, respectively. If we are dealing with simulated data, the ancestral allele is known, and thus the number of derived alleles can be determined directly. However, with real sequence data from extant species the ancestral state is unknown. If we have a close outgroup species and assume parsimony, the ancestral allele corresponds to the outgroup allele in most cases. However, substitutions between the ingroup and outgroup species, followed by mutations that cause polymorphism, can lead to a misinterpretation of low-frequency alleles as high-frequency alleles or vice versa. This can be corrected by computing the probabilities for the possible ancestral states. We therefore require a model for the substitution process between the outgroup and the ancestral sequence of the focal species, which can then be used to estimate a corrected SFS.

### Estimating the substitution parameters

The substitution process between the three species (*D. melanogaster*, *D. simulans*, and *D. yakuba*) is modeled as a Markov process under the general time-reversible (GTR) model (Tavaré 1986), assuming rate variation among sites with a proportion of invariant sites and *N*_{R} equally probable categories of rates whose means, *r _{i}*, follow a gamma distribution (Yang 1994). The substitution parameters were estimated separately for synonymous and nonsynonymous sites for each data set. All sites of a given type within a data set were concatenated to one large alignment and then analyzed using Phyml (Guindon

*et al.*2010) under the GTR model. In addition to the substitution rate matrix and the branch lengths leading to the three species, the proportion of invariant sites and the shape parameter

*a*of the gamma distribution of the site rates were also estimated (we use

*a*to denote the shape parameters to avoid confusion with α, which is introduced below as the fraction of adaptive substitutions).

### Probabilistic computation of the unfolded site frequency spectrum

For each site with a pair of segregating alleles in *D. melanogaster* (with states *x* and *y*), the probability distribution of the ancestral state *A* depends on the corresponding states *o _{s}* and

*o*of the outgroups (of

_{y}*D. simulans*and

*D. yakuba*, respectively) and on the substitution process θ (which describes the substitution rate matrix

**Q**, the branch lengths

*t*

_{mel},

*t*

_{sim}, and

*t*

_{yak}leading to the three species, and the shape parameter

*a*). Given these parameters, the likelihood of the ancestral state being

*x*is defined as follows:

*L*(

*T*), is the likelihood of the tree

*T*relating the three characters

*x*,

*o*, and

_{s}*o*, while the second term, Pr(

_{y}*S*= {

*x*,

*y*}), is the probability that the ancestral state

*x*generated the two observed alleles

*x*and

*y*, which depends on the substitution rates

**Q**and is given by Hernandez

*et al.*(2007):

Under a model of rate variation among sites with *N*_{R} discrete rate categories of equal probability with mean rates *r _{i}*, the likelihood of a tree relating the characters

*x*,

*o*, and

_{s}*o*is defined (Yang 1994) as

_{y}Finally, for a given rate *r*, the likelihood of a tree relating three species is obtained by summing over the possible states *Y* of the unknown internal node,*L*(*a* → *b*|*t*, **Q**) being the likelihood of character *a* being substituted by *b* after time *t* under a Markov model defined by rate matrix **Q**. This corresponds to the index (*a*, *b*) of the probability matrix **P**(*t*), given by

Considering only the possibilities of *A* being either in state *x* or in state *y*, the probabilities of the two possible ancestral states are obtained by normalizing the likelihoods:

The SFS is then calculated such that for each site with *m* alleles of type *x* and *n*_{T} *− m* alleles of type *y*, the corresponding allele frequency *p*()* _{m}* is increased by Pr{

*A*=

*x*} and

*A*=

*y*}.

### Calculation of the population allele frequency distribution

We used the methods described by Keightley and Eyre-Walker (2007) to compute the expected distribution of the frequency of a new mutant allele subject to selection in a finite diploid population, while incorporating a step change from an equilibrium population of size *N*_{1} to a population of size *N*_{2} at a time *t* generations in the past, under the assumption of unidirectional mutation. This involves calculating the frequency distribution of segregating sites for an equilibrium model, assuming a population of size *N*_{1}, and then applying transition matrix iteration for *t* generations in a population of size *N*_{2} to calculate the net numbers of segregating sites that are at different frequencies, conditioned on a mutation having occurred at each site (from ancestral to derived) at each possible generation in the past. In evaluating the likelihood of the data, we fix *N*_{1} at 100 and estimate *N*_{2} and *t* as parameters of the model. We have shown previously that this simple two-epoch demographic model allows the recovery of the parameters of the distribution of fitness effects (DFE) with little bias, even if the true demographic scenario is substantially more complex (Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009). The vector **v′**(*s*) contains the relative numbers of new mutations segregating at frequencies between 1/(2*N*_{2}) and (2*N*_{2} – 1)/(2*N*_{2}) for a selection coefficient *s*, at the time of sampling from the population. Let the sum of these relative numbers be

We define the frequencies at which neutrally evolving sites (*s* = 0) are in the ancestral and derived states at the time of sampling as *f*_{0} and *f*_{2}* _{N}*, respectively, which are estimated as parameters of the model. In our previous analysis (Keightley and Eyre-Walker 2007), the parameter

*f*

_{2}

*was not included, since ancestral and fixed alleles were not distinguished from one another.*

_{N}The likelihood function uses an allele frequency probability vector for neutral sites, **v**(0), which has elements as follows:

For the class of sites that are subject to selection, we use transition matrix methods to calculate *N*_{2}) to (2*N*_{2} – 1)/(2*N*_{2}), by integrating over the distribution of deleterious mutational effects, *f*(*s*_{d}), and including advantageous mutations with *n*_{a} classes of selective effects *p*_{a} and 1 – *p*_{a}, respectively. This vector is used to calculate a probability vector of frequencies of segregating and fixed mutations, **v**(*s*), which is used in the likelihood calculations. Elements of the vector of numbers of segregating selected mutations (elements 1 . . . 2*N*_{2} –1) are scaled in an identical manner to those for neutral mutations, to satisfy the requirement that Σ*v*′(0)* _{i}*/Σ

*v*′(

*s*)

*= Σ*

_{i}*v*(0)

*/Σ*

_{i}*v*(

*s*)

*. For segregating selected mutations, the elements of*

_{i}**v**(

*s*) are therefore scaled by the conditional number of neutral segregating mutations:

Let *u*(*N*_{a}, *s*)/*u*(*N*_{a}, 0) be the ratio of the fixation probability for a selected mutation of fitness effect *s* to the fixation probability for a neutral mutation, in a population of effective and census size *N*_{a} (Fisher 1930). Then multiplying this by *f*_{2}* _{N}* gives the fraction of sites with selection coefficient

*s*that have become fixed for the derived allele over the whole time since the split from the ancestral species. The expected frequency of selected sites that are fixed for the derived allele, averaged over the contributions of mutations with different selection coefficients (including all classes of advantageous mutations and deleterious mutations), is therefore

*s*. Under the assumption that neutral and selected-site divergence is dominated by fixations that occurred prior to any recent change in population size (the signature of which manifests itself in a departure of the neutral site SFS from its neutral expectation), it is appropriate to assume that

*N*

_{a}is the ancestral population size,

*N*

_{1}, in Equation 9. However, if there has been a change in population size many generations ago (

*i.e.*,

*t*≫

*N*

_{2}), the SFS may contain essentially no information from which to estimate

*N*

_{1}

*s*. We therefore apply an approximation, such that

*N*

_{a}is a weighted average of

*N*

_{1}and

*N*

_{2}, as described by Eyre-Walker and Keightley (2009, Equation 1). If the population size change was very recent, then

*N*

_{a}→

*N*

_{1}; if the size change was ancient, then

*N*

_{a}→

*N*

_{2}.

Finally, the frequency of ancestral selected alleles is

### Parameter inference by ML

The function for the likelihood of the site frequency spectra data was similar to that described by Keightley and Eyre-Walker (2007), with some simplifications. Let **p**(sel) and **p**(neut) be SFS vectors for selected and neutral sites, respectively, whose elements are the numbers of sites having a number of derived alleles from 0 to *n*_{T}, where *n*_{T} is the number of alleles in the sample. We assume that the observed SFSs are binomial samples from the allele frequency distributions **v**(*s*). We calculate **v**(*s*) and **v**(0) as functions of the model parameters (*i.e.*, for a single class of advantageous mutations: *a*, *b*, *N*_{2}, *t*, *f*_{0}, *f*_{2}* _{N}*). For the selected sites, the log likelihood is

*b*(

*i*|

*n*

_{T},

*q*) is the binomial probability of observing

*i*derived alleles in a sample of

*n*

_{T}alleles, if the expected derived allele frequency is

*q*. Note that the summations in Equation 11 are to

*n*

_{T}and 2

*N*

_{2}rather than to

*n*

_{T}– 1 and 2

*N*

_{2}– 1, respectively, as in Keightley and Eyre-Walker (2007), because in that article we did not distinguish between sites fixed for ancestral and derived alleles, as is the case here. For neutral sites, the log likelihood (log

*L*

_{neut}) was calculated using Equation 11, replacing

**p**(sel) with

**p**(neut) and

**v**(

*s*) with

**v**(0). The overall log likelihood was log

*L*

_{sel}+ log

*L*

_{neut}. Log likelihood was maximized using the simplex algorithm (Nelder and Mead 1965; Press

*et al.*1992), as described in Keightley and Eyre-Walker (2007), and convergence was checked by starting the simplex multiple times with random starting values.

The fraction of substitutions driven to fixation by adaptive evolution (α) can be estimated from the relation

Similarly, the rate of adaptive substitution scaled by the rate of neutral substitution is

### Simulations

To assess the accuracy of estimation of *p*_{a} and *s*_{a} and potential bias, we analyzed simulated data sets generated by sampling from expected gene frequency vectors computed by the transition matrix method (see Keightley and Eyre-Walker 2007 for details). Various data sets of 15 alleles (the number of alleles available in the smaller of the two *D*. *melanogaster* data sets that we subsequently analyze) were simulated with between 1250 and 50,000 synonymous and nonsynonymous bases that are subject to mutation during the simulation process. We simulated a single class of advantageous mutational effects, with different combinations of values for *p*_{a} and *s*_{a}, together with a gamma distribution of negative fitness effects with shape parameter *b* = 0.5, a mean fitness effect for deleterious mutations *E*(*s*_{d}) = −0.1, and a constant population size of *N* = 100. The ML estimation procedure was then used to estimate the parameters as described above. The simulated parameter values were used to initialize the procedure to minimize the estimation time and to avoid converging to incorrect local minima. For each data set with these combinations of number of sites, *s*_{a} and *p*_{a}, 1000 simulation replicates were performed. We found that the distribution of parameter estimates is highly skewed, so we present the median and the 25% and 75% quantiles to describe the distribution of estimates.

To investigate the effects of linkage on parameter estimation, another set of simulations was performed using the SFS_code software (Hernandez 2008). For each run, 1.2 × 10^{7} nucleotide sites were simulated, divided into unlinked loci of 30, 300, 3000, or 30,000 nucleotides within which linkage was complete. The simulations were performed using 100 ancestral individuals with a speciation event immediately after a burn-in phase, resulting in two populations of 100 individuals, which each evolved independently for 4000 generations. For each combination of parameter values, 100 runs of the simulation for each of the four locus lengths were performed.

## Results

### Simulations with unlinked loci

Simulations were performed to investigate the conditions under which the parameters modeling positively selected mutations can be estimated with confidence and when there is likely to be bias. In the model presented here, adaptive mutations are described by two parameters, the fraction of a single class of positively selected mutations, *p*_{a}, and their selection strength, *s*_{a}.

Figure 1 shows the estimation accuracy of these parameters as a function of the number of sites subject to mutation, for simulated values of *p*_{a} = 0.015 and *s*_{a} = 0.1 (giving *Ns*_{a} = 10), which are similar to the values estimated from the analysis of *D. melanogaster* data sets (see below). The plot suggests that the median of the estimates for both parameters can be estimated accurately, even for small numbers of sites. However, if the number of sites is small (<10,000 sites), the variances of the estimates become very high, and a single estimate can be an over- or underestimate by a factor of ≥10. For larger numbers of sites, however (≥25,000 nucleotides), the variance of the estimates becomes much smaller and allows for estimation of the parameters with high confidence. Remarkably, it appears that the product of the two parameters can be estimated accurately and with low variance, even when only a small number of sites is available. This is illustrated in Figure 2, which shows the correspondence between the estimated values of *p*_{a} and *s*_{a}. Each point corresponds to the estimates obtained from one run of the simulation with 25,000 sites. It can be seen that most points are very close to the L-shaped curve, which would be expected if the product of the two parameters was constant at 0.015 × 0.1. The parameters of the gamma distribution of negatively selected mutations [*i.e.*, the shape parameter *b* and the mean effect *E*(*s*)] are estimated with high precision: for ≥10,000 sites simulated, mean estimates are at most 1% different from expectation, and even for as few as 2500 sites, the estimates deviate by <10% from the simulated values.

The accuracy of parameter estimation is shown as a function of the selection strength *s*_{a} in Figure 3, again on the basis of simulations using 25,000 sites. This suggests that the accuracy of estimation of both *p*_{a} and *s*_{a} increases as the simulated value of *s*_{a} is increased. Interestingly, this effect is stronger for *p*_{a}. A likely reason for this behavior is that, if adaptive mutations are strongly selected, the resulting signal in the SFS (increased amounts of high-frequency nonsynonymous alleles) is easier to distinguish from random noise.

### Linkage effects

In a second set of simulations, the effect of linkage on parameter estimation was investigated. We used the program SFS_code (Hernandez 2008) to generate data in which the genome was split into varying numbers of unlinked loci within which linkage was complete. The results suggest that for many short loci (30 or 300 nucleotides), the simulated parameters can be estimated with little bias (Figure 4). For unlinked loci of longer length (*e.g.*, 3000 nucleotides), *p*_{a} tends to become overestimated and *s*_{a} underestimated. The underestimation of *s*_{a} may be due to two effects. First, there may be Hill–Robertson interference undermining selection on advantageous mutations linked to other advantageous mutations and to deleterious mutations. Second, genetic hitchhiking can drag neutral genetic variants to high frequency (Fay and Wu 2000). These high-frequency–derived mutations will give the appearance in the SFS of additional slightly advantageous mutations, and this will lead to overestimation of *p*_{a} and consequent underestimation of *s*_{a}. They may also distort the neutral SFS, leading to problems in correctly inferring the true demography. For loci of length <1000 bases, however, the extent of the bias observed is relatively modest; the complete linkage within loci assumed here is in any case likely to greatly exaggerate the effect of linkage compared with the typical situation for a *Drosophila* or mammalian gene (see McVean and Charlesworth 2000; Kaiser and Charlesworth 2009).

### Analysis of two *D. melanogaster* data sets

We applied the ML method to polymorphism data sets of *D. melanogaster* protein-coding genes of Shapiro *et al.* (2007), consisting of 15 alleles from African flies, and of Callahan *et al.* (2011), consisting of 24 alleles sampled from Zimbabwe. The results from the two data sets are very similar (Tables 1 and 2). Under a constant population size model, the inclusion of a single class of positively selected mutations greatly increases the log likelihood (*i.e.*, by 48 and 212 log-likelihood units, respectively, for the Shapiro *et al.* and Callahan *et al.* data sets). This model includes only two additional parameters (*s*_{a} and *p*_{a}); thus there is clearly a better fit to the data than that of the deleterious mutations-only model. If population size change is allowed, including adaptive mutations also significantly increases the log likelihood (by 65 and 148 log-likelihood units, respectively).

The best-fitting models indicate changes in population size. Interestingly, for the data set of Callahan *et al.* (2011), a relatively large, quite recent increase (5.5-fold with adaptive mutations) is inferred, whereas for the African data set of Shapiro *et al.* (2007), a 50% decrease is found, which could be indicative of population admixture. This contrasts with the 20-fold increase in population size reported by Keightley and Eyre-Walker (2007) from an analysis of a similar data set, but using a folded SFS and assuming that no adaptive mutations contribute to polymorphism. This is presumably a consequence of the presence of high-frequency alleles that are not distinguished from low-frequency alleles in the folded SFS. However, if there are no adaptive mutations, simulation results indicate that similar results are obtained by analyzing either the folded or the unfolded SFS (Keightley and Eyre-Walker 2007). In addition, including selection on synonymous sites, which we have ignored here, has been shown to remove the signature of population expansion in the Zimbabwe population of *D. melanogaster* (Zeng and Charlesworth 2009), so that it is likely that ignoring population expansion could be justified. For both data sets, there were only negligible increases in log likelihood if two classes of advantageous mutational effects are included.

Under the best-fitting model (with population size change), the estimated percentage of sites under positive selection is *p*_{a} = 0.96%, with a mean selection strength *N*_{e}*s*_{a} = 4.5 for the Shapiro *et al.* data set. For the Callahan *et al.* data set, the results suggest slightly higher frequencies of adaptive mutations with stronger effects, *i.e.*, *p*_{a} = 1.8% and *N*_{e}*s*_{a} = 5.7. Under the constant population size model, estimates are similar (*p*_{a} = 0.88% and *N*_{e}*s*_{a} = 4.4 for Shapiro *et al.* and *p*_{a} = 2.1% and *N*_{e}*s*_{a} = 4.9 for Callahan *et al.*). The variances of the adaptive mutation parameter estimates are relatively high, but they covary strongly. This is shown in plots of the log-likelihood landscape as a function of *p*_{a} and *s*_{a} around their ML estimates (Figure 5). The parameter estimates corresponding to a difference of 2 log-likelihood units from the maximum-likelihood estimates suggest that *p*_{a} could be between 0.3% and 2.1% and *N*_{e}*s*_{a} could be between 2.3 and 13.5. A similar analysis of the Callahan *et al.* data set provides ∼95% confidence intervals for *p*_{a} of 1.1–2.5% and for *N*_{e}*s*_{a} of 4.0–9.1, which are narrower than those obtained for the Shapiro *et al.* data set. This can possibly be explained by the higher number of alleles in this data set (24 as opposed to 15). Although the support limits are fairly wide for both parameters, the likelihood landscape clearly follows a “1/*x*” curve, indicating the high interdependence of the two parameters that we have observed in the simulations (see Figure 2).

Under the best-fitting model, estimates of the proportion of adaptive substitutions from Equation 12 are α = 0.74 (for the Shapiro *et al.* data set) and α = 0.95 (for the Callahan *et al.* data set). These are higher than the estimates obtained using the unfolded SFS of 0.52 for the Shapiro *et al.* data set (Eyre-Walker and Keightley 2009) and 0.84 for the Callahan *et al.* data set. The increases in α are probably due to the fact that Eyre-Walker and Keightley (2009) assumed that advantageous mutations are strongly selected and contributed little to amino acid polymorphism. In the current method, advantageous mutations can contribute to polymorphism, which reduces the estimate of the contribution from effectively neutral mutations. As a consequence, advantageous mutations contribute proportionally more amino acid substitutions. A similar pattern was observed by Boyko *et al.* (2008).

## Discussion

Our study was motivated by two factors. First, we wished to estimate the rate and fitness effects of advantageous mutations, since both these parameters are fundamental for our understanding of evolutionary adaptation. Second, whole-genome sequencing of multiple individuals sampled from natural populations promises to produce data that will make estimation of these parameters more tractable. Observations of excesses of amino acid substitutions over predictions based on standing polymorphism have provided evidence of widespread adaptive protein evolution in many species (Smith and Eyre-Walker 2002; Charlesworth and Eyre-Walker 2006; Welch 2006; Andolfatto 2007; Shapiro *et al.* 2007; Bachtrog 2008; Strasburg *et al.* 2009, 2011; Haddrill *et al.* 2010; Halligan *et al.* 2010; Ingvarsson 2010; Slotte *et al.* 2010). Under an additive model, the rate of adaptive substitution is largely determined by the product of the mutation rate to beneficial alleles and their average selection coefficient (*s*_{a}), since the fixation probability of a new advantageous mutation is proportional to *s*_{a}. This relationship constrains estimates of the rate and strength of adaptive evolution and makes inference strongly depend on the information used in the analysis. Sawyer *et al.* (2003) fitted a model to divergence and diversity data, in which a proportion of selected sites were assumed to be under strong negative selection, with other sites taking the strength of selection from a normal distribution. Under this model they inferred that amino acid substitutions are overwhelmingly a consequence of positive selection, but that the strength of selection on advantageous mutations is very weak. Andolfatto (2007) used an approach based on a comparison of nucleotide divergence and diversity and inferred a relatively high proportion of moderately beneficial amino acid mutations in *D. melanogaster* (*i.e.*, *N*_{e}*s*_{a} somewhat above the nearly neutral range, implying *s*_{a} of the order of 10^{−5}). In contrast, Eyre-Walker (2006) and Macpherson *et al.* (2007) have inferred that the strength of selection acting upon advantageous mutations is strong (*N*_{e}*s*_{a} of the order of ≥100). Eyre-Walker (2006) assumed that the correlation between nucleotide diversity and recombination rate was due to selective sweeps, whereas Macpherson *et al.* (2007) inferred the strength of selection from an analysis of genome-wide heterogeneity in diversity levels in *Drosophila*. Recently, Sattath *et al.* (2011) suggested that these conflicting results can be resolved. By analyzing the pattern of diversity around synonymous and nonsynonymous substitutions, they inferred that two classes of effects of adaptive mutations, with effects of ∼0.5% and ∼0.01%, best explain polymorphism and divergence data in *Drosophila*. Under this model the class of larger-effect mutations is responsible for most selective sweeps. The discrepancy between these results principally arises because the different approaches consider different scales over which an adaptive fixation event is expected to reduce nucleotide diversity in the genome (Sella *et al.* 2009).

In the simulations, we first evaluated the performance of our inference procedure using data generated under the same model as employed in the analysis. We inferred that a modest amount of data (≤10,000 sites) are needed to estimate the product of the proportion of adaptive mutations (*p*_{a}) and *s*_{a}, which is closely related to the proportion of adaptive substitutions (α, Equation 12). However, the two parameters are strongly negatively correlated, such that data can typically be explained nearly as well by high values of *s*_{a} and low values of *p*_{a} and vice versa. The reason for this is that the ratio α/(1 – α) is equal to *p*_{a}λ_{a}/(1 – *p*_{a} )λ_{d} ≈ *p*_{a}λ_{a} /λ_{d}, where λ_{a} and λ_{d} are the fixation probabilities of advantageous and deleterious mutations relative to the neutral value, respectively. Provided that α and λ_{d} are accurately estimated, *p*_{a}λ_{a} is thus strongly determined, and λ_{a} is proportional to *s*_{a}. Obtaining accurate estimates of *p*_{a} and *s*_{a} separately requires of the order of ≥10^{5} sites. Furthermore, parameters for weakly selected advantageous mutations are difficult to disentangle from parameters for mildly deleterious mutations, even in very large data sets. Whereas estimating the proportion or relative rate of adaptive substitutions (α or ω_{a}, respectively), which are functions of the product of *p*_{a} and *s*_{a}, largely depends on the frequency of sites that are fixed for the derived allele, separately estimating *p*_{a} and *s*_{a} depends on the presence of high-frequency polymorphisms (*i.e.*, adaptive mutations that are on their way to fixation). Alleles at these frequencies are expected to be uncommon; hence there is the need for genome-wide scale data for accurate inference.

We then investigated the effect of linkage on parameter inference. This is expected to potentially lead to biased parameter estimates, because selection on advantageous mutations will tend to change the frequencies of blocks of linked sites and can lead to excesses of high-frequency neutral and deleterious polymorphisms over neutral expectation (Fay and Wu 2000). We employed the SFS_code software developed by Hernandez (2008) and were able to make qualitative predictions about the nature of biases expected. Consistent with previous results on the effects of linkage on estimates of α (Eyre-Walker and Keightley 2009), we found that, provided that linkage is not too tight, the product of *s*_{a} and *p*_{a} is reasonably unbiased (Figure 4). However, we also found that linkage tends to lead to underestimation of *s*_{a} and overestimation of *p*_{a}. Presumably, increased linkage increases the effect of Hill–Robertson interference, so the effectiveness of selection on individual positively selected alleles is reduced, and hence estimates of *s*_{a} are decreased. Furthermore, hitchhiking generates high-frequency alleles that can distort both the neutral and the selected SFSs. Due to the limitations of the model and computing time, we are unable to make a quantitative prediction of the amount of bias expected for a real data set (for example, the *Drosophila* data set that we have analyzed here). To do so would require a more comprehensive simulation of an entire chromosome, with realistic distributions of sites subject to selection and amounts of recombination (at least on a scale *N*_{e}*r*, where *r* is the recombination rate between sites). However, it is likely that the scenarios investigated in Figure 4 represent much tighter linkage than is realistic for *Drosophila*, since we simulated loci with no intragenic recombination, and so will overestimate the probable effects of linkage in a natural population of flies.

Our approach has several other limitations, some of which that are intrinsic to the information used in the analysis and some that might be overcome with further work. We have implicitly assumed that variation is maintained under mutation, selection, and drift balance and disregarded the possibility that other processes, such as migration and balancing selection, maintain variation. We have fitted multiple categories of *s*_{a}, but in principle a distribution of *s*_{a} could be fitted to the model. However, it is doubtful whether the information would be sufficient to estimate parameters of a distribution in practice. We have fitted a gamma distribution of negatively selected mutational effects. Although this distribution can take a wide variety of shapes, it is always unimodal and may fail to model more complex distributions adequately. However, if we fit a distribution with *n* discrete bins, we have shown that this model is capable of adequately fitting complex distributions (*e.g.*, mixtures of gamma and beta distributions), the only limitation being the amount of data available (A. Kousathanas and P. D. Keightley, unpublished data). We have used the relatively simple demographic model of a step change in population size. However, selection parameters are recovered with little bias, even if the true scenario is substantially more complex (Keightley and Eyre-Walker 2007; Eyre-Walker and Keightley 2009). In our analysis, we have assumed that synonymous sites evolve neutrally, an assumption that is violated for this data set (Zeng and Charlesworth 2009). We have also assumed that the mutation rates for synonymous and nonsynonymous sites are equal, an assumption that is also violated, since they differ significantly in GC content. These departures from the model assumptions will presumably partially cancel each other, since the higher GC content of synonymous sites implies that they have a higher mutation rate than nonsynonymous sites. However, the net effect is uncertain.

Our analysis of two extensive data sets of *D. melanogaster* protein-coding gene sequences of Shapiro *et al.* (2007) and Callahan *et al.* (2011) revealed clear differences in the fit among the different models. The best-fitting model included adaptive mutations and a modest change in recent effective population size. Whether or not population size change was included in the model had little influence on estimates of adaptive mutation parameters. We obtained similar results for the two data sets, suggesting that ∼1.5% of amino acid mutations are adaptive with an average selection strength *N*_{e}*s*_{a} ∼ 5. The inclusion of adaptive mutations led to very significant increases in likelihood (reported in Tables 1 and 2) and resulted in a substantially better fit. Figure 6 shows the fit of the inferred SFSs to the true SFSs for selected and neutral sites for the data set from Shapiro *et al.* (2007). The two models that include adaptive mutations (Figure 6, green and purple bars) give the best fit to the selected SFSs, and only the model that additionally includes population size change (Figure 6, green bars) also fits the neutral SFS. It should be noted, however, that parameter inference requires substantial amounts of data. The simulations together with the likelihood plot for the real data (which is indicative of the variances, Figure 5) clearly show that the product *p*_{a}*s*_{a} can be estimated quite accurately from small data sets, but disentangling *p*_{a} from *s*_{a} can be done with high confidence only from larger data sets. The consistency among the two *D. melanogaster* data sets used here indicates that they are sufficiently large to get reasonably accurate estimates. But larger data sets from whole-genome sequencing of samples of individuals from natural populations will improve the accuracy of the estimates and shed more light on the role of adaptive mutations in molecular evolution.

### Software availability

The method is available via P. D. Keightley’s website, http://homepages.ed.ac.uk/eang33/. Source files for the project have been deposited in sourceforge http://sourceforge.net/.

## Acknowledgments

We thank Joshua Shapiro and Peter Andolfatto for providing *Drosophila* polymorphism data and Kai Zeng for providing the alignments to *D. yakuba*. A.S. was supported by a fellowship for prospective researchers from the Swiss National Science Foundation. P.D.K. acknowledges support from grants from the Biotechnology and Biological Sciences Research Council (BBSRC) of the United Kingdom and the Wellcome Trust. B.C. acknowledges support from grants from the BBSRC.

- Received June 16, 2011.
- Accepted September 24, 2011.

- Copyright © 2011 by the Genetics Society of America