## Abstract

We explore the effects of demography and linkage on a maximum-likelihood (ML) method for estimating selection and mutation parameters in a reversible mutation model. This method assumes free recombination between sites and a randomly mating population of constant size and uses information from both polymorphic and monomorphic sites in the sample. Two likelihood-ratio test statistics were constructed under this ML framework: LRT_{γ} for detecting selection and LRT_{κ} for detecting mutational bias. By carrying out extensive simulations, we obtain the following results. When mutations are neutral and population size is constant, LRT_{γ} and LRT_{κ} follow a chi-square distribution with 1 d.f. regardless of the level of linkage, as long as the mutation rate is not very high. In addition, LRT_{γ} and LRT_{κ} are relatively insensitive to demographic effects and selection at linked sites. We find that the ML estimators of the selection and mutation parameters are usually approximately unbiased and that LRT_{κ} usually has good power to detect mutational bias. Finally, with a recombination rate that is typical for Drosophila, LRT_{γ} has good power to detect weak selection acting on synonymous sites. These results suggest that the method should be useful under many different circumstances.

THE strength of natural selection, the rate of mutation, and the intensity of genetic drift are key players in evolution, determining patterns of diversity within species and patterns of divergence between species (Kimura 1983; Gillespie 1991). Quantifying the relative contributions of these forces to DNA and protein sequences is a central topic in modern evolutionary research (Li 1997; Charlesworth and Charlesworth 2010). At the population genetics level, they are often studied using either the infinite sites model (Kimura 1971) or the reversible mutation model (Wright 1931, 1949). The standard version of both models allows only two variants at each nucleotide site in the genome. The infinite sites model assumes that mutation is unidirectional, from the ancestral state at a fixed site to the derived states, with no mutational events allowed at segregating sites; the reversible mutation model allows bidirectional mutation between the two alleles, which can occur even at segregating sites. The latter model is arguably more realistic. Nonetheless, these two models are intrinsically related to each other, with the infinite sites model approximating the reversible mutation model when the product of the effective population size and the mutation rate is sufficiently small (Kondrashov 1995; Desai and Plotkin 2008; see also supporting information, File S1).

The infinite sites assumption has formed the basis of many methods for estimating mutational and/or selection parameters from polymorphism data collected within a species (Sawyer and Hartl 1992; Akashi and Schaeffer 1997; Bustamante *et al.* 2002; Sawyer *et al.* 2003, 2007; Eyre-Walker *et al.* 2006; Keightley and Eyre-Walker 2007; Boyko *et al.* 2008). We refer to this class of methods as Poisson random field (PRF) methods (Sawyer and Hartl 1992). In most but not all applications, they require the polarization of ancestral and derived states at each polymorphic site using an outgroup species. Because modeling linkage between sites is theoretically and computationally extremely challenging (Griffiths and Marjoram 1996; Neuhauser and Krone 1997; Stumpf and McVean 2003), linkage equilibrium between segregating sites is usually assumed. Simulations have shown that, when this assumption is violated, the PRF methods suffer from high false positive rates with respect to detecting selection (see Table 2 in Bustamante *et al.* 2001). To solve this problem, a method that requires estimation of recombination rates from the data and determination of significance using computer simulations has been developed (Zhu and Bustamante 2005).

The reversible mutation (RM) model has typically been used to study weak selection acting on synonymous codons (*i.e.*, codon usage bias), usually assuming that the scaled mutation rate is so low that the infinite sites assumption is approximately valid (Li 1987; Bulmer 1991; McVean and Charlesworth 1999). But this class of methods does not necessarily require polarization of mutations. Instead, it considers several predefined allelic classes (typically two; see Zeng 2010 for a multiallele version of the model), such as preferred and unpreferred codons at synonymous sites (Li 1987; Bulmer 1991; McVean and Charlesworth 1999) or AT *vs.* GC base pairs in noncoding regions (Galtier *et al.* 2006; Haddrill and Charlesworth 2008; Zeng and Charlesworth 2010). Under this framework, several maximum-likelihood (ML) inference methods have recently been proposed (Maside *et al.* 2004; Cutter and Charlesworth 2006; Galtier *et al.* 2006; Zeng and Charlesworth 2009; Zeng 2010); we refer to these methods as the RM methods. These methods also make the assumption of no linkage disequilibrium.

Given that the PRF methods are sensitive to the violation of the free-recombination assumption and that they are closely related to the RM methods, it is natural to ask whether the RM methods also tend to give false positive results in the presence of linkage. The answer to this question has important implications for the validity of the inferences that selection is acting on synonymous sites, as found in studies of Caenorhabditis (Cutter and Charlesworth 2006; Cutter 2008) and Drosophila (Maside *et al.* 2004; Bartolomé *et al.* 2005; Comeron and Guthrie 2005; Zeng and Charlesworth 2009, 2010), and that GC base pairs are favored over AT base pairs by selection or biased gene conversion in noncoding sequences (Galtier *et al.* 2006; Haddrill and Charlesworth 2008; Zeng and Charlesworth 2010).

In addition to linkage, another issue that needs consideration is demography. In their simplest form, both the PRF methods and the RM methods assume that the population is randomly mating and that its size is constant over time. These assumptions are often unrealistic. It is known that population structure or changes of population size can cause false inferences of selection from PRF methods, even if evolution is strictly neutral (Zhu and Bustamante 2005); new methods have been constructed to address this problem (Williamson *et al.* 2005; Keightley and Eyre-Walker 2007; Boyko *et al.* 2008; Gutenkunst *et al.* 2009). However, it is unclear to what extent the RM methods produce false inferences of selection when the underlying demographic history of the species is ignored, especially as RM methods that incorporate population size changes have only recently been developed (Zeng and Charlesworth 2009, 2010).

Finally, and more specifically relevant to the study of codon usage bias, it is important to know whether reliable inferences of selection can be obtained when applying the RM methods to synonymous sites that are embedded in a background of nonsynonymous sites that are potentially under stronger selection. It is known that the interaction between linked selected sites, known as Hill–Robertson interference (HRI) (Hill and Robertson 1966; Felsenstein 1974; McVean and Charlesworth 2000; Comeron and Kreitman 2002; Comeron *et al.* 2008; Kaiser and Charlesworth 2009; Seger *et al.* 2010), can reduce the effectiveness of selection and distort allele frequency spectra at linked sites. We thus need to know whether HRI among sites subject to reversible mutation and selection can cause the RM methods to produce false positive results.

In this study, we focus on the method of Zeng and Charlesworth (2009), a version of the RM method that is flexible enough to include the effects of demographic changes. To address the questions raised above, we generated random samples from simulated populations under various models and analyzed these samples using this method. When specific underlying assumptions are violated, we ask (1) whether selection and mutation parameters can be accurately estimated and (2) whether the tests for detecting selection and mutational bias tend to be conservative or give false positive results.

## METHODS

#### The method of Zeng and Charlesworth (2009):

This method is based on the RM model (Wright 1949; Li 1987; Bulmer 1991; McVean and Charlesworth 1999). In its simplest version, it assumes a randomly mating Wright–Fisher population of *N* diploid individuals (see Table 1 for notation). Two alleles, *A*_{0} and *A*_{1}, are possible at an autosomal nucleotide site. The mutation rate from *A*_{0} to *A*_{1} is κ*u*, and that in the reverse direction is *u*; mutation is said to be biased when κ ≠ 1. The fitnesses of the three genotypes, *A*_{0}*A*_{0}, *A*_{0}*A*_{1}, and *A*_{1}*A*_{1}, are 1, 1 – *s*, and 1 – 2*s* (genic selection). Zeng and Charlesworth (2009) showed that the equilibrium distribution of the frequency of *A*_{1} in the population can be obtained by numerically solving a linear system jointly determined by *N*, *u*, κ, and *s*.

For a sample of *n* chromosomes, each composed of *L* unlinked sites and taken randomly from the population, we can count the number of sites where *A*_{1} is represented *i* times, denoted as *d _{i}* (0 ≤

*i*≤

*n*). We often call

**D**= (

*d*

_{0},

*d*

_{1},…,

*d*) the (sample) allele frequency spectrum (

_{n}*L*= ). Standard diffusion theory suggests that, under weak evolutionary forces, sampling properties can be characterized by two compound parameters: θ = 4

*Nu*and γ = 4

*Ns*(Ewens 2004, Chap. 5). Thus, under the free recombination assumption, we can write down the likelihood function of the data,

*L*(θ, κ, γ |

**D**) (Zeng and Charlesworth 2009). Maximum-likelihood estimates of the parameters, denoted as , , and , can then be obtained numerically using the simplex algorithm (Press

*et al.*1992).

To detect selection, Zeng and Charlesworth (2009) proposed the use of the following likelihood-ratio test statistic:(1a)

For detecting mutational bias, the following statistic was proposed:(1b)

Under the free recombination assumption, the distributions of these two statistics should converge to a chi-square distribution with 1 d.f., denoted as .

#### Proper statistical tests:

A statistical test is referred to as a proper test if the observed rejection rate is close to the nominal significance level, which is 5% throughout this study. Statistically, we can use the Kolmogorov–Smirnov (KS) test (as implemented in the R software package) to determine whether the distribution of the *P*-values obtained by applying a test of interest to the simulated data (see below) is uniform on [0, 1], which should be the case if the test is proper.

#### Tajima's *D*:

The skewness of an observed frequency spectrum was measured by Tajima's *D* (Tajima 1989b). To see whether the observed Tajima's *D* values are close to the theoretical minimum, we obtained values of the relative Tajima's *D* (Schaeffer 2002) whose minimum value is −1.

To determine the level of significance for the *D* value calculated from a sample with *S* segregating sites, we first used the coalescent process to generate 15,000 random samples, each with *S* segregating sites, assuming constant population size and complete linkage (Hudson 1990); a *D* value was obtained for each of these samples. The observed *D* value was regarded as significant if it was smaller (or larger) than the 5th (or 95th) percentile of the simulated distribution (Wall and Hudson 2001).

#### Forward simulation algorithms:

We used a forward simulation algorithm to generate random samples. We modeled a haploid Wright–Fisher population (Ewens 2004, p. 20), *e.g.*, with 1000 chromosomes, corresponding to a diploid population size of *N* = 500. A small population size was used due to computational constraints. Nonetheless, the fact that sample properties are determined by the compound parameters such as θ and γ under weak evolutionary forces (Ewens 2004, Chap. 5) provides the theoretical justification for “scaling down” the population size, provided that the θ- and γ-values for the true population size are preserved. This rescaling method has been widely used in population genetics and has been shown to be highly effective (McVean and Charlesworth 2000; Tachida 2000; Comeron and Kreitman 2002; Kaiser and Charlesworth 2009). Haploids were used to avoid the complications of extreme associative overdominance that can arise with strong selection in a small population (Charlesworth and Charlesworth 1997; Palsson 2002).

Each chromosome had *L* sites, each with two possible variants, *A*_{0} and *A*_{1}. Let *L*_{0} and *L*_{1} be the numbers of *A*_{0} and *A*_{1} sites in a given chromosome (*L*_{0} + *L*_{1} = *L*). It was assumed that in this chromosome, in a given generation, a Poisson number of *A*_{0} sites with mean κ*uL*_{0} mutated to *A*_{1} and a Poisson number of *A*_{1} sites with mean *uL*_{1} mutated to *A*_{0}. In models with nonneutral evolution, fitnesses were recalculated after adding new mutations to all the chromosomes, assuming multiplicative fitness over selected sites,(2)where *w _{i}* = 1 −

*s*if

_{i}*A*

_{1}is present at the

*i*th selected site and

*w*= 1 if

_{i}*A*

_{0}is present at this site.

In the absence of recombination, the next generation was formed by sampling with replacement from the current generation, with the chance of sampling a particular individual being proportional to its fitness. When recombination was present, to generate a new individual, two parental chromosomes were first chosen, then recombinants were constructed, and finally one of the two recombinants was randomly chosen to be retained in the new generation.

Recombination was assumed to be caused by either crossing over or gene conversion, or both. For each pair of parental chromosomes, the numbers of crossing over and gene conversion events were drawn from Poisson distributions with mean values of *cL* and *gL*, respectively. These events were randomly placed onto the chromosome. For a gene conversion event, *T* sites downstream of the initiation point were converted, where *T* was a geometric random variable with mean *E*(*T*). For purposes of comparison with real populations, we used values of *cL* and *gL* such that their products with *N* are comparable with the corresponding products with estimates of effective population size *N*_{e} for Drosophila populations, as described by Kaiser and Charlesworth (2009).

#### Neutral demographic models:

Complete linkage between sites was assumed in all neutral demographic models. To examine the effects of changes of population size, we assumed that a population of *N*_{1} diploids was initially at statistical equilibrium. At time zero, the population size changed instantly to *N*_{2} and stayed constant afterward. The time since the change in population size is measured by τ = *t*/(2*N*_{2}), where *t* is the number of generations since the change. Similarly, a bottleneck model (Thornton and Andolfatto 2006) was also examined.

To model population structure, we used Wright's symmetric island model with *d* demes (reviewed in Charlesworth and Charlesworth 2010, pp. 317–320). Each deme had 2*N* = 500 chromosomes. Let *m* be the probability that an individual is a migrant. Then the equilibrium value of the widely used measure of genetic isolation between demes, *F*_{ST}, is given by(3)

Different values of *m* and *d* were used; all haplotypes were sampled from one deme (Table 5), which is likely to maximize the distortion of variant frequency spectra caused by population structure (De and Durrett 2007). The effects of other sampling schemes were examined in Table S3.

#### Models of codon structure and selection at nonsynonymous sites:

To model codon structure, we assumed that, along the whole length of each chromosome, a pair of selected nonsynonymous sites was followed by a synonymous site (Kaiser and Charlesworth 2009). At the beginning of each replicate simulation, the selection intensity for each nonsynonymous site (*i.e.*, γ_{i}) was sampled from a log-normal distribution with shape and location parameters of 3.022 and 0.0368, respectively, and was assumed to be constant throughout this replicate. The shape and location parameters correspond to the exponentials of the standard deviation and mean of ln(*s*) (Loewe and Charlesworth 2006). With *N* = 500, this distribution gives a harmonic mean selection coefficient, *s*_{h}, such that *Ns*_{h} = 10. This corresponds approximately to the mean selection coefficient for nonsynonymous mutations that are segregating in a typical Drosophila population (Loewe and Charlesworth 2006; Loewe *et al.* 2006). The log-normal distribution was used because it has been shown to provide good fit to the data obtained in population genetic surveys (Loewe and Charlesworth 2006; Loewe *et al.* 2006; Boyko *et al.* 2008). Furthermore, a recent simulation study showed that the above model can accurately predict the level of synonymous diversities observed for Drosophila dot and neo-*Y* chromosomes (Kaiser and Charlesworth 2009). Synonymous sites may be either neutrally evolving or subject to weak selection. In the latter case, it was assumed that all synonymous sites were subject to the same selection intensity.

#### Burn-in time:

Previous simulation studies showed that the time required to reach statistical equilibrium is of the order of 1/*u* generations under the reversible mutation model (McVean and Charlesworth 2000; Tachida 2000; Comeron and Kreitman 2002). In this study, for each parameter combination, we inspected the amount of time needed to reach equilibrium. A burn-in period of at least 5/*u* generations was implemented before samples were taken.

#### Some implementation details:

The simulation algorithms were written in the Java programming language. A chromosome was represented by an array of 64-bit-long integers (computer words), with the state of a given bit representing the state of a nucleotide site. To improve efficiency, the object-oriented features of the Java language were exploited to reuse the integer arrays as much as possible to avoid copying.

#### Predicting the effects of background selection:

In the model with codon structure and selection on nonsynonymous sites, the effective population size, *N*_{e}, as estimated from the equilibrium diversity at linked, neutrally evolving sites, is reduced by selection at linked sites (Kaiser and Charlesworth 2009). Under the assumption that selected sites are close to deterministic mutation–selection balance (background selection; Charlesworth *et al.* 1993), the expected reduction in neutral diversity can be calculated as follows. We define *B* as *N*_{e}/*N*, where *N* is the effective population size in the absence of selection at linked sites. According to Nordborg *et al.* (1996), we have(4)where *r _{i}* is the recombination rate between the focal neutral site and the

*i*th selected site, as given by Equation 3 of Loewe and Charlesworth (2007), and

*s*is the selection coefficient against carriers of a deleterious nonsynonymous mutation at this site. For a region of length

_{i}*L*, we calculated

*B*only for the synonymous site in the center and ignored the spatial pattern for

*B*reported previously (Loewe and Charlesworth 2007). To take into account the log-normal distribution of

*s*, we sampled an

_{i}*s*value for each nonsynonymous site and calculated

_{i}*B*. Then we repeated the sampling and calculation 1000 times and used the mean value of

*B*as the predicted reduction in

*N*

_{e}.

## RESULTS

#### The effects of linkage on the method of Zeng and Charlesworth under neutrality in a constant-size population:

It is known that, under the infinite sites model, linkage between sites does not change the shape of the site frequency spectrum, but merely increases the variance (Hudson 1983; Bustamante *et al.* 2001). In Figure S1, we show that the same conclusion holds under the reversible mutation model considered here.

The maximum-likelihood inference method of Zeng and Charlesworth (2009) assumes free recombination between sites and uses both polymorphic and monomorphic sites to infer the parameters of the model: θ, κ, and γ. The extra variance induced by linkage raises concerns about the reliability of this method. To address this issue, we used our inference method to analyze random samples generated by the forward simulation algorithm (see methods), assuming neutrality and a constant population size, but with various levels of linkage between sites. Reassuringly, the results in Table 2 (see also Table S1) suggest that the inference method is approximately unbiased, regardless of the level of linkage. This is consistent with a previous theoretical study (Wiuf 2006), which showed that likelihood methods neglecting linkage provide consistent estimators under a wide range of neutral population genetic scenarios.

Zeng and Charlesworth (2009) constructed two likelihood-ratio tests, one to detect the signal of selection (γ = 0 *vs.* γ ≠ 0), denoted by LRT_{γ}, and the other to detect the presence of mutational bias (κ = 1 *vs.* κ ≠ 1), denoted by LRT_{κ}. Standard statistical theory suggests that, if data are generated under their respective null models, the *P*-values of these two tests should follow a uniform distribution on the interval [0, 1]; they are then regarded as proper tests. We used the KS test to determine whether LRT_{γ} and LRT_{κ} are proper tests in the presence of linkage between sites (see methods). Intriguingly, both LRT_{γ} and LRT_{κ} are proper tests over a wide range of parameter combinations (Table 2; see also Table S1). In fact, an analysis using the quantile–quantile plot (Figure S2) suggests that, even with complete linkage, the distribution of these two test statistics conforms to a chi-square distribution with 1 d.f. (). These results suggest that LRT_{γ} and LRT_{κ} are likely to be proper statistical tests even when the data do not conform to the assumption of free recombination. Furthermore, LRT_{κ} seems to have good power to detect the presence of mutational biases (>97% in Table 2; see also Table S1).

The observation that linkage has only a limited influence on LRT_{γ} and LRT_{κ} is counterintuitive, especially when considering the dramatic increase in variance due to tight linkage (Figure S1). In Figure 1A, we plot summary statistics for the distributions of , , and as functions of the rate of crossing over, for the case where θ = 0.02, κ = 1, and *L* = 10 kb (see Figure S3 for cases with *L* = 1 or 100 kb). Two features are of note. First, as the above results imply, all observed distributions of and , regardless of the level of linkage, are indistinguishable from those obtained under the free recombination assumption. On the other hand, the distribution of becomes more variable as the level of linkage between sites increases (*i.e*., as ρ = 4*Nr* decreases). Thus, the results in Figure 1A suggest that the extra variance in the frequency spectrum induced by linkage between sites manifests itself mainly in the increased variability of the distribution of . Additional simulations also suggest that, for the parameter values we have considered, with ρ ≥ 0.05, the distribution of becomes indistinguishable from that observed under free recombination (results not shown).

For LRT_{γ}, some exceptions to the above conclusions have been found. For example, when θ = 0.05, κ = 2, and *L* = 10 kb, the KS test suggests that the distribution of the *P*-values deviates from uniformity (*P* = 2.58 × 10^{−13}), and, at a significance level of 5%, LRT_{γ} rejects neutrality in 14.8% of the random samples. As above, we plot the distributions of , , and as functions of the rate of crossing over (Figure 1B with *L* = 10 kb; see also Figure S3). As in the case with a lower mutation rate in Figure 1A, the variance of increases sharply as ρ approaches zero. However, with a high mutation rate, the variances of and also increase as ρ decreases, but to a much lesser extent than that of . In fact, using a two-sample Kolmogorov–Smirnov test, the distributions of and under ρ = 0 are significantly different from those observed under ρ = ∞ (the *P*-values for the two tests are 1.8 × 10^{−6} and 7.2 × 10^{−6}, respectively). When a moderate level of crossing over is introduced, the effects of linkage on the distributions of and vanish. For example, with ρ = 0.001, the two-sample KS test suggests that the two distributions are indistinguishable from those observed under ρ = ∞, and LRT_{γ} rejects 5.8% of the random samples.

Additional simulations suggest that, with complete linkage, a high mutation rate, and κ ≠ 1, the false positive rate of LRT_{γ} seems to be positively correlated with *L*, but crossing over tends to be effective in reducing false rejections (Table S1 and Figure S3). In general, we find that LRT_{γ} tends to be too liberal when these three conditions are met simultaneously: (1) κθ > 0.05, (2) κ > 1, and (3) ρ < 0.001 (Table S1). However, this situation has limited relevance to most eukaryote species, whose scaled per-site mutation rate is usually of the order of a few percent (see Figure 1.10 of Charlesworth and Charlesworth 2010).

#### Neutral models with population size changes or population structure:

Many methods for detecting selection assume that population size is constant over time. It is well known that violations of this assumption can make methods that rely on the infinite sites model become counterconservative (Simonsen *et al.* 1995; Jensen *et al.* 2005; Nielsen 2005; Zhu and Bustamante 2005; Zeng *et al.* 2006). In this section, we investigate the joint influence of linkage and demography on LRT_{γ} and LRT_{κ}. In the simulations, we assumed neutral evolution and complete linkage between sites.

First, we modeled a 10-fold population expansion (*i.e.*, *N*_{2}/*N*_{1} = 10; see methods). It is known theoretically that an abrupt increase in population size causes an excess of rare mutations in the sample and, consequently, negative Tajima's *D* values (Tajima 1989a; Slatkin and Hudson 1991). This effect renders many tests (*e.g.*, Tajima's *D*) counterconservative with respect to detecting purifying selection (*e.g.*, Simonsen *et al.* 1995; Zeng *et al.* 2006). The results in Table 3 suggest that LRT_{γ} and LRT_{κ} tend to have slightly elevated rejection rates (up to ∼11%; see Table S2 for more data) under this demographic model, although these two tests seem to be more conservative than Tajima's *D*. In most cases, the KS test reveals that the distribution of *P*-values deviates significantly from the uniform distribution. Interestingly, the method seems to produce roughly unbiased estimates of γ and κ, with the γ-estimator behaving better than the κ-estimator.

Extensive empirical investigations suggest that the following statistically *ad hoc* approach is effective in solving the problem of counterconservativeness: instead of using , we can perform LRT_{γ} and LRT_{κ} assuming that they follow a chi-square distribution with 2 d.f. (); we refer to these two *ad hoc* tests as LRT_{γ2} and LRT_{κ2}, respectively. Using data simulated under various models, we find that LRT_{γ2} and LRT_{κ2} are more conservative than LRT_{γ} and LRT_{κ} (see Figure S4 for more details). For example, when the data were generated under the expansion model with τ = 0.04 and κ = 1, LRT_{γ2} rejects 0.6, 2.2, and 5.4% of the samples at significance levels of 1, 5, and 10%, respectively; LRT_{κ2} has similar properties (Table 3). When generating data using the same expansion model but with κ = 2, at a significance level of 5% the original LRT_{κ} rejects 46.8% of the samples, whereas the *ad hoc* LRT_{κ2} rejects 31.2% (Table 3). Note that, because type I error is uncontrolled for LRT_{κ}, we cannot equate the 46.8% rejection frequency to power. On the contrary, LRT_{κ2}, although conservative, has reasonable power to detect mutational bias in the presence of recent population expansion (see Table S2 for more data).

Next, we simulated a 10-fold population size reduction (*i.e.*, *N*_{2}/*N*_{1} = 0.1). In contrast to the case of population expansion, a reduction in population size usually generates an excess of variants with intermediate frequencies and, consequently, positive Tajima's *D* values (Fu 1996). Under the assumed model, the rejection rates using Tajima's *D* can be as high as 40% (Table 3). For LRT_{γ} and LRT_{κ}, the KS test indicates that the distributions of these two test statistics usually do not follow a -distribution. Nonetheless, all the observed rejection rates are below the 5% nominal significance level. Hence, LRT_{γ} and LRT_{κ} tend to be conservative tests following a reduction in population size, so it is unnecessary to perform the even more conservative LRT_{γ2} and LRT_{κ2}. When mutational bias was included, LRT_{κ} has ∼80% power to detect its presence, whereas LRT_{κ2} has ∼68% power. Again, we obtain approximately unbiased estimates of γ and κ.

The third case we consider is a bottleneck model, which has been used to describe the evolution of non-African *Drosophila melanogaster* populations (Thornton and Andolfatto 2006). This model contains a brief bottleneck period, when the population size is reduced to 2.9% of the prebottleneck value, and is likely to induce false positive results with respect to detecting selection (*e.g.*, Barton 1998). In fact, Tajima's *D* becomes liberal, with up to 41.4% of the samples being rejected (Table 4). Nonetheless, the bottleneck event has only mild effects on LRT_{γ} and LRT_{κ}, and the KS test usually does not detect significant deviation from uniformity (Table 4). As under the expansion model, LRT_{γ2} and LRT_{κ2} tend to be very conservative. The mean value of is close to zero, but estimates of κ tend to be upwardly biased, especially when the samples were taken shortly after the recovery of population size.

Finally, we consider the effects of population structure. For simplicity, we adopted a finite island model in the simulations (see methods). The results are given in Table 5. First, data were generated under a two-deme model. Here, relatively low levels of geographic isolation (as measured by *F*_{ST}) were assumed. None of the three tests, Tajima's *D*, LRT_{γ}, and LRT_{κ}, is significantly affected. LRT_{κ} has only moderate power (∼50%) to detect mutational bias. Next, we modeled a case with a higher level of isolation and a larger number of demes (five demes in Table 5). Under this model, Tajima's *D* becomes counterconservative, rejecting up to 22% of the samples. Encouragingly, LRT_{γ} and LRT_{κ} seem to be proper tests, and LRT_{κ} seems have reasonable power to detect mutational bias. Note that these results are based on a sampling scheme where all chromosomes are taken from one deme. Nonetheless we have obtained similar conclusions using other sampling schemes (*e.g.*, with one chromosome taken from one deme and all the rest from another deme; Table S3).

In summary, we suggest (i) that LRT_{γ} and LRT_{κ} are relatively insensitive to the neutral demographic models we have examined, (ii) that they tend to produce approximately unbiased estimates of γ and κ, (iii) that LRT_{κ} seems to have reasonable power to detect mutational bias, and (iv) that the two *ad hoc* tests, LRT_{γ2} and LRT_{κ2}, tend to be conservative and rarely produce false positive results even when LRT_{γ} and LRT_{κ} become too liberal. The observation that the mean of is usually close to zero, even though the frequency spectrum is distorted, is counterintuitive. In File S1, we present a numerical example based on the population expansion model, which may shed light on these seemingly paradoxical observations.

#### The effects of selection at nonsynonymous sites on linked, neutrally evolving synonymous sites:

The original purpose of the method of Zeng and Charlesworth (2009) was to estimate the parameters governing the evolution of synonymous sites. In this regard, the models considered in the previous sections are unrealistic, since they assume neutrality and neglect the fact that synonymous sites are linked to nonsynonymous sites in the same gene, which are under natural selection (Loewe and Charlesworth 2007). To address this issue, we conducted simulations taking into account codon structure and selection at nonsynonymous sites, but retaining the assumption of neutrality at synonymous sites (see methods). In particular, we assume that mutation is reversible at all sites and κθ = 0.02, which should be realistic for *D. melanogaster* (assuming *N*_{e} ≈ 10^{6} and a mean per site mutation rate of 5 × 10^{−9}) (Kreitman 1983; Keightley *et al.* 2009). We sampled the fitness effects of the selected sites from a log-normal distribution, which has been shown to provide a good fit to the data obtained from population genetic surveys (Loewe and Charlesworth 2006; Boyko *et al.* 2008; Kaiser and Charlesworth 2009). The statistics obtained on synonymous sites were then analyzed to examine the performance of the method of Zeng and Charlesworth (2009).

In a previous simulation study, Kaiser and Charlesworth (2009) concluded (i) that, with a large number of tightly linked nonsynonymous sites, the frequency spectrum at linked, neutrally evolving synonymous sites can be seriously distorted, showing extreme negative Tajima's *D* values (see also Seger *et al.* 2010), and (ii) that gene conversion alone is ineffective in removing HRI. Here, we substantiate these conclusions by including mutational bias in the simulation and by providing distributions for the quantities of interest (Figure 2).

We highlight several observations. First, incorporating mutational bias into the simulation has little impact on the frequency spectrum at synonymous sites (Figure S5). Second, the ineffectiveness of gene conversion in removing HRI is clearly demonstrated by the fact that the observed distribution of Tajima's *D* under gene conversion overlaps substantially with that observed under complete linkage (Figure 2), although gene conversion was assumed to have occurred at a rate that is realistic for *D. melanogaster* (Hilliker *et al.* 1994; Loewe and Charlesworth 2007). In contrast, with a rate of crossing over of ρ = 0.04/bp, which should also be realistic for *D. melanogaster* [assuming *N*_{e} ≈ 10^{6} and a mean per-site rate of crossing over of 10^{−8} (Hey and Kliman 2002)], the distribution of Tajima's *D* is much closer to the neutral expectation of zero. As noted previously, the effects of selection at linked sites on the frequency spectrum at synonymous sites become important when there are a large number of linked selected sites. In fact, with >20 kb of tightly linked sites, the 97.5th percentile of the distribution of Tajima's *D* barely overlaps with the neutral expectation of zero (Figure 2).

To examine the details of the frequency spectra for derived alleles, we used the forward simulation algorithm to obtain the unique genealogy that relates all extant haplotypes in the absence of recombination. For each segregating site in the sample, we determined ancestral and derived alleles by comparing the extant sequences with the ancestral sequence at the most recent common ancestor for this sample. In Figure S6, the spectrum observed at synonymous sites is shown and compared to the neutral expectation. We observe an excess of low-frequency variants and a deficit of intermediate- and high-frequency ones. This is in line with the prediction under the standard background selection theory with weak selection (Charlesworth *et al.* 1995; Fu 1997; Gordo *et al.* 2002).

In the absence of recombination, Tajima's *D* is counterconservative when applied to synonymous sites, with observed rejection rates up to 74% (Table 6 and Table S4). For LRT_{γ} and LRT_{κ}, their *P*-values are no longer uniformly distributed (Table 6), and the observed rejection rates are slightly elevated (<14%; Table S4). As in the case of population expansion, LRT_{γ2} and LRT_{κ2} tend to be conservative, with all observed rejection rates <5% (Table 6, Table S4, and Figure S7). Note that, because type I error is uncontrolled, power is undefined for LRT_{κ} in the case of no recombination. On the other hand, using LRT_{κ2}, we still have some power to detect mutational bias: 15.2% for *L* = 10 kb, 36.6% for *L* = 50 kb, and 50.4% for *L* = 100 kb (Table 6).

Additional simulations suggest that a relatively high level of crossing over is necessary to reduce the effects of selection at linked sites on LRT_{γ} and LRT_{κ}. For example, with *L* = 10 kb, ρ > 0.01 is necessary (Table 6). Note that, since the simulation model does not include any noncoding regions, the rate of crossing over needed to alleviate the effects of selection at linked sites is likely to be overestimated. Nonetheless caution should be exercised when applying LRT_{γ} and LRT_{κ} to genomic regions where crossing over is relatively infrequent; for these regions, LRT_{γ2} and LRT_{κ2} are probably more reliable.

The mean ML estimates for γ tend to be unbiased, whereas those for κ seem slightly upwardly biased when recombination is infrequent (Table 6). In all parameter combinations, the mean -values are severalfold lower than the input values (Table 6), and cases with larger *L* values tend to have smaller -values. This reflects the well-known fact that selection at linked sites reduces the level of neutral diversity, reflecting a reduction in *N*_{e} (McVean and Charlesworth 2000; Comeron *et al.* 2008; Kaiser and Charlesworth 2009).

#### The power to detect selection at synonymous sites in the presence of linked nonsynonymous sites:

In this section, we investigate whether LRT_{γ} has any power to detect natural selection at synonymous sites when there is strong HRI caused by linked nonsynonymous sites. To this end, we introduced weak selection on synonymous sites into the simulation model described in the previous section. Here, the selection coefficient, *s*, against a deleterious mutation at a synonymous site was assumed to be constant across the entire simulated region. The region size was set to 50 kb and the input scaled intensity of selection at synonymous sites was γ = 2 (Table 7); very similar results were obtained when the size was 100 kb (results not shown). HRI tends to reduce *N*_{e} as measured by equilibrium diversity at neutral sites, especially when there are a large number of linked selected sites (McVean and Charlesworth 2000; Comeron *et al.* 2008; Kaiser and Charlesworth 2009). As a consequence, the weak selective differences between the two variants at synonymous sites may be invisible to natural selection, due to their very small *N*_{e}*s* (Stephan *et al.* 1999).

In the case of complete linkage, with the selection and mutation parameters for nonsynonymous sites used here, the observed diversity at neutral synonymous sites is 0.0015, ∼13-fold lower than the input value of 0.02 (Table 6). This is consistent with observed levels of diversity on the Drosophila dot chromosome (Charlesworth *et al.* 2010). Hence, for synonymous sites under selection, 4*N*_{e}*s* predicted from the *N*_{e} value corresponding to the neutral diversity level is equal to ≈ 0.15. The mean value of is 0.088 (Table 7), somewhat lower than this predicted value (see discussion). Not surprisingly, with such a small 4*N*_{e}*s*, neither LRT_{γ} nor LRT_{γ2} has any power to detect selection (Table 7). Introducing gene conversion does not significantly increase the power of the two tests. These results imply that it will be hard to detect selection on synonymous sites in parts of the genome with extremely reduced recombination (*e.g.*, the dot chromosome in Drosophila), consistent with the evidence that selection on codon usage is greatly reduced in these regions (reviewed by Charlesworth *et al.* 2010).

The power of LRT_{γ} starts to increase as the rate of crossing over becomes higher (Table 7). When ρ = 0.02, about half of the typical value of 0.04 for *D. melanogaster*, LRT_{γ} achieves virtually 100% power in detecting selection. Nonetheless, it should be noted that, for cases with ρ < 0.01, the power of LRT_{γ} should be interpreted with caution because of the problem of inflated type I error rates reported in the previous section. Encouragingly, the more conservative LRT_{γ2} has only slightly reduced power compared to LRT_{γ} (*e.g.*, 93.6% *vs.* 97.4% for ρ = 0.02). Further, LRT_{γ2} can be applied to regions with reduced recombination, where LRT_{γ} is counterconservative (*e.g.*, when ρ = 0.008).

In all cases, the estimates of κ are approximately unbiased (Table 7). On the other hand, with higher levels of recombination the mean values of and becomes closer to the input values of 0.02 and 2, although even under the highest level of recombination (ρ = 0.04 and ϕ = 0.028) the mean estimates of these two parameters are still lower than the input values, suggesting that the effects of selection at linked sites have not been completely removed. This reduction is due to the action of background selection (Charlesworth *et al.* 1993; Hudson and Kaplan 1994). In fact, using the theory developed by Nordborg *et al.* (1996) and Loewe and Charlesworth (2007), for ρ = 0.02 and 0.04, the predicted effect on *N*_{e} at synonymous sites due to selection at linked nonsynonymous sites is *N*_{e}/*N* = 0.32 and 0.54, respectively (see Equation 4); the expected 4*N*_{e}*s* values for linked weakly selected sites are therefore 0.64 and 1.07, which are very close to the mean values of given in Table 7, in agreement with the results of Stephan *et al.* (1999) for one strongly selected and one weakly selected site. However, when recombination rates are very low, the background selection formula tends to overestimate the reduction in *N*_{e} and γ (Table 7) for reasons given in the discussion.

If the above conclusions are correct, then the previous estimate of γ = 1.03 (Zeng and Charlesworth 2009), obtained by analyzing twofold degenerate synonymous codons located in normally recombining regions in the *D. melanogaster* genome (*c* > 1.15 cM/Mb; *i.e.*, ρ > 0.04), is likely to be close to the actual intensity of selection acting on these sites.

## DISCUSSION

#### Properties of the methods based on the reversible mutation model:

In a series of articles (Cutter and Charlesworth 2006; Zeng and Charlesworth 2009, 2010; Zeng 2010), we proposed several methods for inferring the selection and mutation parameters using the reversible mutation model, which are especially suitable for detecting selection on synonymous mutations or on GC *vs.* AT variants in noncoding sequences. These methods have the following features: (1) They do not require the use of outgroup sequences to infer ancestral states and are thus free from the potential problem of false inference caused by misidentification of ancestral states (Baudry and Depaulis 2003; Hernandez *et al.* 2007); (2) they allow simultaneous estimation of selection and mutational bias parameters from population genetic data; and (3) they can be extended to account for the effects of past changes of population size.

We have further shown here that the method of Zeng and Charlesworth (2009), a representative of the class of methods cited above, is fairly insensitive to the violation of the free recombination assumption. Additionally, this method seems to be relatively insensitive to demographic changes, population structure, and selection at linked sites, compared to the commonly used Tajima's *D* and related tests for detecting skews in the site frequency spectrum (reviewed by Charlesworth and Charlesworth 2010, pp. 287–291). In cases where our likelihood-ratio tests for detecting selection and mutational bias are counterconservative, we find that the *ad hoc* approach of performing the tests with a chi-square distribution with 2 d.f. can effectively solve the problem of inflated type I error rates, while retaining reasonable power to detect selection and mutational bias (Tables 3 and 6).

We have paid particular attention to the effect of selection at linked sites on the behavior of neutral or weakly selected sites. Consistent with earlier work (Kaiser and Charlesworth 2009; Seger *et al.* 2010), we found that selection on nonsynonymous sites causes a large skew in the site frequency spectrum at neutral sites when the recombination rate is very low, with very negative values of Tajima's *D*, many of which approach the minimal possible value (Figure 2). This pattern could easily be mistaken for the effect of a selective sweep, as discussed by Kaiser and Charlesworth (2009).

In addition, as mentioned in the description of the results shown in Tables 6 and 7, when sites subject to weak purifying selection, such as synonymous or noncoding sites, are embedded in a large low-recombination region containing many strongly selected sites, the estimated scaled intensity of selection, γ = 4*N*_{e}*s*, for the weakly selected sites (Table 7) is reduced by considerably more than that predicted from the reduction in *N*_{e} caused by the strongly selected (nonsynonymous) sites, as estimated from diversity at neutral sites (see the last section of results). This is probably caused by the additional HRI among the weakly selected sites themselves (McVean and Charlesworth 2000; Comeron and Kreitman 2002; Comeron *et al.* 2008).

For recombination rates that are typical of normally recombining regions of the Drosophila or human genomes, in the presence of the strongly selected sites, there is still a reduction in *N*_{e} for neutral synonymous sites and in γ for weakly selected sites, which is well predicted by the background selection equation (Equation 4). But with low levels of recombination, this equation tends to overestimate the reduction in *N*_{e} and γ (Table 7). This phenomenon was noted previously (Kaiser and Charlesworth 2009) and reflects HRI between the nonsynonymous sites in regions with very low recombination rates, which weakens the effective strength of selection acting on them. As a result, the deterministic mutation–selection model on which standard background selection theory is based does not apply.

It should be noted that we have not attempted to construct a realistic model of genome structure for normally recombining genomic regions; the results shown in Tables 6 and 7 make no allowance for noncoding sequences either within or between genes, which are common in eukaryotes and can potentially serve as spacers reducing HRI (Comeron and Kreitman 2002), although there is evidence that noncoding regions are themselves under selection (*e.g.*, Andolfatto 2005; Haddrill *et al.* 2005). The effects of HRI that we have described here are probably overestimates of the true effects for genes in normally recombining regions. Further work is needed to determine the properties of more realistic models.

#### Differences between the PRF methods and the RM methods:

The key difference between the PRF methods and the RM methods is that with PRF alleles are typically classified as ancestral and derived using outgroup sequences, whereas with RM alleles are assigned to several predetermined classes (typically two; see Zeng 2010 for a multiallele model) that are potentially selectively different. Examples of the latter include preferred and unpreferred synonymous codons (Li 1987; Bulmer 1991; McVean and Charlesworth 1999) or AT and GC in noncoding regions (Galtier *et al.* 2006; Haddrill and Charlesworth 2008; Zeng and Charlesworth 2010). The need to define selectively different allele classes makes it hard to apply the RM approach to such problems as the distribution of fitness effects of nonsynonymous mutations (Eyre-Walker *et al.* 2006; Keightley and Eyre-Walker 2007; Boyko *et al.* 2008). Further work is needed to determine whether the desirable properties of the RM methods can be used for this purpose.

A question that is raised by our results is the fact that close linkage does not seem to produce a high variance in the estimates of γ and κ from the RM method (Figure 1), in contrast to the well-known effect of close linkage on the variance of estimates of θ (Hudson 1983, 1990). The intuitive reason for this is that the variance of the estimate of θ reflects the large stochastic variation in the size of gene genealogies; linked sites have a high covariance of tree sizes, which inflates the variance of the mean tree size across sites compared with that of a set of independent genealogies (Pluzhnikov and Donnelly 1996; McVean 2002). The associated variability in the shapes of trees will also generate more variability in the frequencies of derived *vs.* ancestral variants for sets of closely linked sites compared with independent sites, resulting in a higher frequency of false positive results of tests for selection from PRF methods (Bustamante *et al.* 2001; Zhu and Bustamante 2005; Desai and Plotkin 2008).

In contrast, the estimate of γ from the RM approach comes from the frequencies of favored *vs.* disfavored variants, such that a skew toward higher frequencies of favored variants indicates the action of selection or biased gene conversion. A high-frequency, putatively favorable variant could be either derived or ancestral, so that the chance of observing it is relatively independent of tree size or shape. Similarly, estimates of κ come mainly from the relative frequencies of sites that are fixed for the different alleles in the model (Zeng and Charlesworth 2009; Zeng 2010), which is also little affected by tree size or shape. Similar considerations may apply to the effects of population structure and population size changes. Overall, therefore, there are reasons to expect that the RM approach is more robust than the PRF method. However, it is worth emphasizing that, due to computational constraints, the results presented in this study are based on a limited number of combinations of parameter values. Therefore, they should not be overinterpreted.

## Acknowledgments

We thank Andrew Rambaut for valuable discussions on various computer algorithms. We also thank Noah Rosenberg and two anonymous reviewers for helpful comments. This study made use of the resources provided by the Edinburgh Compute and Data Facility (ECDF) (http://www.ecdf.ed.ac.uk/). The ECDF is partially supported by the e-Science Data, Information, and Knowledge Transformation (eDIKT) initiative (http://www.edikt.org.uk). K.Z. acknowledges support from a Biomedical Personal Research Fellowship awarded by the Royal Society of Edinburgh and the Caledonian Research Foundation.

## Footnotes

Supporting information is available online at http://www.genetics.org/cgi/content/full/genetics.110.122150/DC1.

Communicating editor: N. A. Rosenberg

- Received August 13, 2010.
- Accepted September 27, 2010.

- Copyright © 2010 by the Genetics Society of America