## Abstract

Two maximum-likelihood methods are proposed for detecting recent, strongly positive selection and for localizing the target of selection along a recombining chromosome. The methods utilize the compact mutation frequency spectrum at multiple neutral loci that are partially linked to the selected site. Using simulated data, we show that the power of the tests lies between 80 and 98% in most cases, and the false positive rate could be as low as ∼10% when the number of sampled marker loci is sufficiently large (≥20). The confidence interval around the estimated position of selection is reasonably narrow. The methods are applied to X chromosome data of *Drosophila melanogaster* from a European and an African population. Evidence of selection was found for both populations (including a selective sweep that was shared between both populations).

RECENT positive selection can be detected because it leaves footprints in the genome around the sites of selection. For instance, positive selection leads to a reduction of genetic diversity around the selected locus due to genetic hitchhiking (Maynard Smith and Haigh 1974), an excess of rare variants (Fu and Li 1993), and an excess of mutations at high frequency (Fay and Wu 2000). On the basis of these effects, several efforts have been undertaken to detect recent positive selection (Fay and Wu 2000), and some methods have been developed to estimate the parameters of simple models of genetic hitchhiking (Kim and Stephan 2002; Przeworski 2003).

Here two exact-likelihood methods for detecting strongly positive selection and estimating the position of the selected site along a recombining chromosome are proposed, both of which are based on the combined effects of a local reduction of genetic variation and an excess of rare mutations. These methods go beyond the recently proposed composite likelihood-ratio tests of Kim and Stephan (2002) and Kim and Nielsen (2004) that treated each polymorphic site independently. Our approaches also differ from the *ad hoc* method of Sabeti *et al*. (2002), who analyzed the decay of haplotype structure around a selected locus.

The identification of genes contributing to the adaptation of local populations is of great biological interest (Harr *et al*. 2002; Storz *et al*. 2004). Thus our primary goal is to map these genes to a reasonably small DNA segment by detecting selective sweeps in the genome. We do not focus on the estimation of the other parameters of the hitchhiking model, *i.e*., the selection strength (α=2*Ns*) and the time of the hitchhiking event in the past (τ), where *s* is the selection coefficient and *N* the effective population size. Instead, to make the methods practicable, we opted to assign values to these two parameters (these values may be obtained by different methods). Extensive simulations show that the estimation of the position of the selected locus is unbiased when the selected site is at the center of the region, and the tests are robust even when the true and assigned values of the hitchhiking model differ to some extent.

## METHODS

#### Coalescent simulation:

Extensive coalescent simulations are needed to construct the ancestral recombination graph for large DNA segments (Hudson 1990; Griffiths and Marjoram 1997; Li and Fu 1998; Wiuf and Hein 1999). Let us consider *m* neutral loci such that there is no recombination within a locus; thus each locus is treated as a point in the ancestral recombination graph. This assumption is reasonable because large DNA segments are considered such that the distance between marker loci and the selected site is large relative to the length of the loci. This procedure makes the simulation for large DNA segments (in the order of hundreds of kilobases) practicable since it is not needed to trace each nucleotide site.

The positions of these neutral loci were determined by two different ways. First, they were distributed randomly within a certain region. This strategy is called locus position strategy 1 (LPS1). Second, denote the region by [0, *w*). It is divided into *m* equally large segments, and there is only one neutral locus per segment. The position of the locus in the *i*th segment is distributed uniformly over [(*i* − 1)*w*/*m*, *iw*/*m*); thus it is also independent of the positions of the other neutral loci. The positions of loci tend to be uniformly distributed. This strategy is denoted as locus position strategy 2 (LPS2). The assigned selection strength and the time of the hitchhiking event in the past are and , respectively, where *ŝ* is the assigned selection coefficient. To plot the results obtained from the simulated data in physical distance rather than genetic distance, we assume that the recombination rate is 1 cM/Mb in the DNA segment under study (which is appropriate for Drosophila), and the population has a constant effective size *N*.

Denote the present time (when a population is sampled) as zero. Then, looking backward in time, *t* represents the time in units of 2*N* generations before present. The ancestral recombination and coalescence history is divided into three phases, the first neutral phase, the selective phase, and the second neutral phase (Braverman *et al*. 1995). Assuming the fixation time of the selected allele is *t*_{s}, the first neutral phase is [0, τ), and the selective phase is , and the second neutral phase is , where τ is the time of the fixation event in the past.

The selective phase is the period when a beneficial mutation that causes a hitchhiking effect is on the way to fixation. The beneficial allele *B* has a genic selective advantage *s* over the parent allele *b*. The allele frequency of *B*, which is denoted as *x*, may be assumed to change deterministically from 1−ψ to ψ if the population size is large and selection is strong; *e.g*., is large (typically ; Kaplan *et al*. 1989). Then *x* at time *t* is given by(1)(Stephan *et al*. 1992), where , which is the length of the selective phase. We use ψ = 1/2*N* for the simulations. The coalescent and recombination probabilities follow previous work (Braverman *et al*. 1995; Kim and Stephan 2002).

#### Likelihood method:

The mutation rate of the *k*th neutral locus is per generation, and . The number of sampled chromosomes is *n*, where *n* ≥ 5. Let denote the number of mutations that are on *i* chromosomes. For example, is the number of mutations that are observed on one chromosome, and ξ_{2} is the number of mutations that occur on two chromosomes. Furthermore, we have . Then the compact mutation frequency spectrum over *m* loci is defined as represents the high-frequency mutations when sample size is small. In this approach, some information, like “branching information” indicating where the mutations happened and which size they are, has been partially discarded. The strategy has advantages, in particular when recombination is considered. In practice, most randomly sampled genealogies are not consistent with the sequence data when the sample size is sufficiently large, and the failure rate can be >99.99%. An example is shown in Figure 1.

Using the compact mutation frequency spectrum, however, allows us to sample genealogies effectively. The probability is 1 that a genealogy has at least one internal branch with size ≥3 when *n* ≥ 5. Then, a uniform sampling strategy can be used, and each random genealogy is consistent with the data. This sampling strategy is different from both importance sampling (Griffiths and Tavaré 1994) and the Markov chain Monte Carlo method (Kuhner *et al*. 1995).

Then, following Felsenstein and his colleagues (Felsenstein 1992; Kuhner *et al*. 1995), the probability that **D** is observed given the position of selected site (*M*) is(2)where , and is the genealogy for the *k*th locus. We also mention here that *M* could be a set of parameters, for example, the position of selected site, the strength of positive selection, and the time of fixation of the favored allele. In this study, we generally denote *H* discrete candidate positions of the selected site asThen we need to compute the likelihood function for a given value of *M _{i}*, to find the value of

*M*that maximizes

_{i}*L*.

Since it is impossible to obtain an analytical expression for the likelihood function, a simulation approach is proposed. Equation 2 requires a summation over a huge number of topologies, and each topology has an infinite number of possible branch lengths. Therefore, rather than sampling all genealogies, we consider a large random sample of **G**. The approach is possible and efficient because each simulated **G** is consistent with the compact mutation frequency spectrum over *m* loci (**D**) when . Since *P*(**G**|*M*) is determined in the simulation process (conditioned on the parameter set *M*), an estimate of *L* can be obtained by the following procedure:

Simulate genealogies (topology without mutation) for

*m*loci conditioned on the position of selected site (*M*), and .Compute the value of aswhere is given by the Poisson probability,with and as the length of the branches with size

*i*, and . The length of the branches is scaled such that 1 unit represents 2*N*generations.Repeat steps 1 and 2

*K*times. Then .

Obviously, the accuracy of the estimation is improved by using large values of *K*. In the following, this procedure is denoted by L1. In addition, we propose a similar procedure, L2, as follows. The *K* simulations conditioned on *M*, given and , are used to calculate the average branch length , where *i* = 1, 2, *X*, and *j* = 1, 2,…, *m*. Then these average lengths are used to calculate according to a minor modification of step 2.

The composite-likelihood method of Kim and Stephan (2002) (henceforth called KS) can be compared with L1 and L2. However, a minor revision is needed because the KS method is designed for continuous sequences under the infinite-site model. In this study, it is assumed that a locus in the KS model is composed of 300 nucleotide sites, and each nucleotide site within the locus has the same recombinational distance to the selected site, and there is no recombination within the loci.

The likelihood-ratio test (LRT) is a statistical test of the goodness-of-fit between two models. Neutrality can be seen as a special case of hitchhiking, namely that the selected site is far away from the considered region such that there is no hitchhiking effect observed within the region. Hence, . Therefore, one parameter is restricted in the neutral model on the basis of the hitchhiking assumption, and thus these two models are hierarchically nested. Then, we have , and this LRT statistic approximately follows a chi-square distribution with 1 d.f., where *L*_{neutrality} can be estimated by procedures similar to L1 and L2, and *L*_{max} is the maximum-likelihood value under the hitchhiking model calculated by the method L1 or L2.

## RESULTS

The parameters α or τ can be estimated by the methods of Kim and Stephan (2002) and Przeworski (2003) or simply assigned by the following procedure. For a fixed value of the recombination rate, the length of the chromosomal region affected by a single hitchhiking event depends primarily on the strength of selection (Figure 2). A large region is affected when selection is strong, and thus the assigned strength of selection should be adjusted according to the length of the region when the selection strength is unknown. Assume that a selected site is at the center of the region and a neutral locus is at the edge of window. Let *h* be the expected relative heterozygosity after a single hitchhiking event (*i.e*., the ratio of expected heterozygosity under hitchhiking to that under neutrality), which is given by Equation 19 of Stephan *et al*. (1992) or Equation 3 of Kim and Stephan (2000). Then is the selection strength that makes the expected relative heterozygosity equal to the chosen *h*. We recommend to choose the size of the region such that 0.7 ≤ *h* ≤ 0.95. The effect of hitchhiking will be erased when τ increases. Therefore, it is expected that we have low power to detect these events if they happened some time ago. Thus, is recommended and used in this study.

is depicted for a single simulated data set in Figure 3. The compact mutation frequency spectra were recorded at 10 loci, the positions of which were chosen randomly according to the LPS1 method. The selected site was at 100 kb when simulating the polymorphic data set. To estimate the position of the selected site by L1, it was assumed that selection must have happened within the 200-kb region, and the discrete candidate positions of the selected site were placed uniformly within the region. In this case, the space between two neighboring candidate positions is 5 kb. Discrete positions were used here because of the limit of computation power. *L*_{1} can be calculated by the method L1 when assuming that selection happened at the first candidate position. Thus, *L*_{1}, *L*_{2},…, can be calculated. *L*_{22} is the global maximum-likelihood value in this example, and the corresponding position is 105 kb. *L*_{neutral} can be obtained by a similar procedure based on simulations of the standard neutral model. Since the likelihood-ratio test rejected the standard neutral model in this example, selection was correctly detected. When the data created under the standard neutral model are considered, the neutral model could be falsely rejected by the likelihood-ratio test, which means a false positive.

Let us consider a certain region and assume that selection occurred at the center of the region. Then, what is the standard deviation of estimated position of the selected site given randomly selected samples and loci? The standard deviation of the estimated target position of selection is given in Table 1 and Figure 4 when α and τ are known (namely, and ). The estimated position of the selected site is unbiased in all situations (results not shown). It is obvious that the more neutral marker loci are surveyed, the more precise the estimation becomes. Generally, the KS method (Kim and Stephan 2002) performs similarly or slightly better than L1 and L2 when the number of neutral loci is small (*m* < 10). L1 has a larger standard deviation than L2 and KS when the number of loci is large (*m* > 10). L2 behaves best when the number of loci is not small (*m* > 5) due to the lowest standard deviation. Overall, the power of both tests is so high that most of simulated hitchhiking events have been detected correctly. The false positive rate decreases with an increasing number of loci. It should be important to lower the false positive rate unless it is known that positive selection has occurred within the region considered or unless positive selection events happen very frequently. We suggest that the number of neutral loci should be 10–20 or more in the region considered.

If the selected site lies at the center of the region, the estimated position of the selected site is always unbiased even if the distribution of the estimated position is uniform. Therefore, we also considered the cases that the target of selection is located at the edge of the region (Table 2). Generally, L1 and L2 are less biased than the KS method, and the more neutral marker loci are surveyed, the less biased the estimation becomes. Moreover, when the selected site lies at the edge of the region rather than at the center of the region (Table 1), standard deviation is larger and the power smaller.

Usually the strength of selection and the time of the hitchhiking event in the past are unknown. Table 3 gives the effect of the difference between the assigned or estimated parameter values ( and ) and the true values (α and τ). The results suggest that the proposed methods can reveal most selection events (>82%) if the true strength of selection is equal to or greater than the assigned value () and selection happened very recently (τ < 0.15). The methods will fail if and τ ≥ 0.5.

This suggests that a minimum strength of selection is detectable given the data. A large number of loci is required to obtain a low false positive rate, for example, 5%, and thus the size of the region cannot be reduced indefinitely. Therefore, and the minimum detectable value of α cannot be too small.

Furthermore, it is possible to study the power or the probability of detecting a selection event given that the beneficial allele (with the selection strength α) fixed at time *t*, where *t* is uniformly distributed between [*t*_{0}, *t*_{1}]. When α is known and the assigned selection strength is , the probability is given by(3)where , the power given the beneficial allele fixed at the specified time *t*, can be obtained by simulation. When α is unknown but , we have because we have empirically (Table 3).

Next we consider different ways to choose the neutral loci (Figure 5). The LPS2 method generates less standard deviation than LPS1 in all comparisons. This is because the positions of neutral loci chosen according to LPS2 are more likely to be equally distributed than those of LPS1, and thus the former contains more information on the spatial distribution of polymorphisms. Thus, to increase the chance of detecting the hitchhiking event, it is recommended that, if possible, the marker loci should be equally or nearly equally distributed along the chromosome or within candidate regions.

In practice, the sequencing load is often a limiting factor. The more loci are chosen, the less base pairs per locus can be sequenced, and vice versa. Figure 6 displays the effect of the different choices. All comparisons show clearly that the first strategy (solid box) is better than the second one (open box). Therefore, we recommend that the priority should be put on the number of loci. By increasing the sequenced length per locus, more segregating sites are expected, so a more precise estimation of the level of local polymorphism can be obtained. However, Figure 6 shows that obtaining more information on the spatial distribution of polymorphisms along the whole region is more important than getting more precise estimates of local levels of polymorphism.

Finally, we consider the case that the priority is given to the number of sampled chromosomes (*n*) rather than to the number of loci (*m*) (Figure 7). The priority given to the number of loci is the better strategy when the number of loci is not very large. This difference disappears when the number of loci gets large.

## APPLICATION

The proposed likelihood methods were applied to a region on the X chromosome of *D. melanogaster*. The analyzed region extends over 660 kb. The region was selected due to the dense distribution of marker loci. This region contains 19 fragments that are nearly uniformly distributed over the region (Figure 8). The data for seven of these loci are from Glinka *et al*. (2003); the remaining data were kindly supplied by Lino Ometto and Sascha Glinka. On the basis of the published data (Glinka *et al*. 2003), the mean of θ across the X chromosome in the European and African populations is 0.0044 and 0.0127 per site, respectively, so that the value of θ for each fragment is obtained as the mean value per site times the length of the fragment (excluding insertions and deletions). The estimated recombination rate is 3.8 cM/Mb (Glinka *et al*. 2003). The ancestral status of each polymorphic nucleotide site was determined by comparison with the *D. simulans* sequence (Glinka *et al*. 2003).

To perform our analyses, the selection coefficient (*ŝ*) we assigned was 0.053, so that *h*, the relative heterozygosity, was 0.92 at the edge of the region. Moreover, we used .

In the European population, the likelihood-ratio test via L1 and L2 rejected the standard neutral model in favor of the selective sweep model at the 5% level. It suggests that the observed polymorphism in the 19 partially linked fragments can be explained better by the hitchhiking model with the assigned *ŝ*- and -values than by the standard neutral model. In the African population, only the likelihood-ratio test via L2 rejected the standard neutral model.

In the European population, the positions of selected site estimated by L1 and L2 differ slightly. Both of them are between fragments 195 and 196 (the positions are 18.5 and 5.1 kb away from fragment 195, respectively). The 95% confidence regions of the positions estimated by L1 and L2 are shown in Figure 8. The 10,000 simulated data sets also show that the power of the tests is 96.3 and 97.5%, and the false positive rate is 11.3 and 15.7% for L1 and L2, respectively.

In the African population, the position of the selected site estimated by L2 is between fragments 196 and 197 (14.3 kb away from fragment 197), which is rather close to the positions estimated by L1 and L2 in the European population. The simulated data sets also show that the power of the test is high (97.7%) and the false positive rate low (7.9%) for L2. The 95% confidence region of the position estimated by L2 is shown in Figure 8. It overlaps with those of the European population.

## DISCUSSION

#### Method:

In this study, two exact-likelihood methods are proposed. The methods are generally based on coalescent simulations using the ancestral recombination graph. Furthermore, the compact mutation frequency spectrum is used to compute the likelihood.

In the proposed methods, L2 shows a lower standard deviation in the estimated position of selection than L1 in some cases because of the difference in dealing with branch lengths. There is a large variance of branch length in individual simulations in L1, while in L2 the variance is reduced by averaging the branch lengths. Furthermore, to understand this difference, let us consider the genealogies under hitchhiking for two loci with two sampled chromosomes. In some cases, there is no recombination between the two loci, so that the two branches will coalesce. And thus the distance between loci is not counted efficiently by the likelihood in L1. However, by averaging the branch length among simulations, a shorter average branch length will be observed in the locus that is closer to the selected site, and thus the distance between loci is counted by the likelihood in L2.

However, L1 also shows a lower standard deviation of the estimated position of selection than L2 in some cases. This is because information on correlations among loci is discarded in L2. Thus, both L1 and L2 are recommended in practice. For a given genomic region, the estimated positions of the selected site by L1 and L2 could be different. In such a case, the variance of estimated position of the selected site can be obtained, and the method generating a lower variance should be used.

Unlike the Bayesian approach (Przeworski 2003), the proposed likelihood methods do not incorporate prior information about the model parameters. If we expect that beneficial mutations occur preferentially in coding and control regions, the Bayesian approach may be helpful.

#### A global sweep in *D. melanogaster*:

In application, the two proposed methods were applied to 19 partially linked loci on the X chromosome of *D. melanogaster*. The hitchhiking model was accepted for the assigned parameters, and the position of the selected site (estimated by the various methods) falls into a 54-kb region. The divergence between *D. melanogaster* and *D. simulans* in this region is at the same level as the average divergence between the two species over the whole X chromosome (data not presented). Thus the deficiency of polymorphism cannot be explained by a relatively low mutation rate. Given the fact that the hitchhiking model was accepted in both the European and African populations, we suggest that a recent hitchhiking event may have occurred in the ancient African population. As the confidence intervals of the estimated positions overlap in the African and European populations, this selective sweep may have continued in the European population driven by the same selected allele. The sweep in Africa is likely to be older as levels of nucleotide diversity are higher than those in Europe. Thus we may have a global selected sweep. Alternatively, two independent local hitchhiking events have to be postulated, one in the European population and another in the African population. These alternatives may be distinguished by mapping the positions of the selected sites more precisely on the basis of more densely spaced marker loci. For the time being, a global sweep is the more parsimonious explanation.

In this study, we did not consider the effect of demography, such as population size bottlenecks and expansions. However, our approach can be extended to analyze populations with variable size. Finally, we note that the methods can also be used to estimate α and τ after minor revision, but it would consume much more computing time.

## Acknowledgments

We thank David De Lorenzo, Sascha Glinka, and Lino Ometto for providing unpublished data and Joachim Hermisson and Sylvain Mousset for helpful discussions. This work was funded by the VolkswagenStiftung (grant I/78815).

## Footnotes

Communicating editor: D. Rand

- Received February 1, 2005.
- Accepted June 9, 2005.

- Copyright © 2005 by the Genetics Society of America