# A Composite-Likelihood Method for Detecting Incomplete Selective Sweep from Population Genomic Data

^{*}Interdisciplinary Program of EcoCreative, Ewha Womans University, Seoul, Korea 120-750

^{†}Department of Life Science, Ewha Womans University, Seoul, Korea 120-750

- ^{1}Corresponding author: Department of Life Science, Ewha Womans University, Ewhayeodae-gil 52, Seodaemun-gu, Seoul, Korea 120-750. E-mail: yuseob{at}ewha.ac.kr

## Abstract

Adaptive evolution occurs as beneficial mutations arise and then increase in frequency by positive natural selection. How, when, and where in the genome such evolutionary events occur is a fundamental question in evolutionary biology. It is possible to detect ongoing positive selection or an incomplete selective sweep in species with sexual reproduction because, when a beneficial mutation is on the way to fixation, homologous chromosomes in the population are divided into two groups: one carrying the beneficial allele with very low polymorphism at nearby linked loci and the other carrying the ancestral allele with a normal pattern of sequence variation. Previous studies developed long-range haplotype tests to capture this difference between two groups as the signal of an incomplete selective sweep. In this study, we propose a composite-likelihood-ratio (CLR) test for detecting incomplete selective sweeps based on the joint sampling probabilities for allele frequencies of two groups as a function of strength of selection and recombination rate. Tested against simulated data, this method yielded statistical power and accuracy in parameter estimation that are higher than the *iHS* test and comparable to the more recently developed *nS*_{L} test. This procedure was also applied to African *Drosophila melanogaster* population genomic data to detect candidate genes under ongoing positive selection. Upon visual inspection of sequence polymorphism, candidates detected by our CLR method exhibited clear haplotype structures predicted under incomplete selective sweeps. Our results suggest that different methods capture different aspects of genetic information regarding incomplete sweeps and thus are partially complementary to each other.

- positive selection
- selective sweep
- composite likelihood
- polymorphism

Positive natural selection is one of the most fundamental driving forces for biological evolution. However, it is known that mutations conferring higher relative fitness to carriers, or beneficial mutations, do not occur frequently at a given gene or genomic region of interest in most natural populations of plants and animals. Even if a beneficial allele is currently under strong directional selection, its direct identification at the sequence level is not easy since the allele frequency change is likely to be too slow to follow over time in typical population genetic surveys unless the generation time is very short and a large number of serially sampled sequences are available. Therefore, it is extremely difficult to directly follow the random occurrence of beneficial mutations and their spread under selective environments in nature. For this reason, the investigation depends heavily on detecting the signature of past episodes of positive selection, whether the beneficial mutation is already fixed in the population or still on the way to fixation (*i.e.*, ongoing selection for a mutation that occurred in the past but is still segregating in the population), from the present-day patterns of within- and between-species genetic variation (reviewed in Nielsen 2005; Sabeti *et al.* 2006; Akey 2009; Stephan 2010). Such signatures of positive selection provide information for reconstructing evolutionary events that happened in the population’s history. In addition, signals of positive selection imply functional importance of the loci and thus can be used to identify genetic variation that contributes to phenotypic diversity or annotate the genome functionally (Biswas and Akey 2006).

One of the basic methods for detecting positive selection is to search for the distinct pattern of within-species genetic variation left by a “selective sweep.” A selective sweep occurs when a new advantageous mutation increases in frequency quickly in the population and results in a great reduction in variation, a temporary increase in linkage disequilibrium, and a skew in allele frequency distribution in the nearby region of a recombining chromosome (Maynard Smith and Haigh 1974; Kaplan *et al.* 1989; Fay and Wu 2000; Kim and Nielsen 2004). A selective sweep may be “complete” when the advantageous mutation goes to fixation and all local variation is removed except those that escaped the sweep by recombination. This type of selective sweep has drawn much attention and a number of statistical tests, mostly based on summary statistics such as Tajima’s *D*, Fu and Li’s *D* and *F*, and Fay and Wu’s *H* test, were proposed to detect mainly complete positive selection from sequences sampled shortly after the fixation of a beneficial mutation (Tajima 1989; Fu and Li 1993; Fay and Wu 2000). More advanced statistical tests based on composite likelihood were also proposed (Kim and Stephan 2002; Meiklejohn *et al.* 2004; Nielsen *et al.* 2005).

Hudson *et al.* (1994) first observed evidence of an ongoing selective sweep—a subgroup of sampled sequences harboring very low variation due to linkage to the putative beneficial allele that reached an intermediate frequency—at the *Sod* locus in *Drosophila melanogaster*. However, as the availability of population genomic data was limited and discovering rare episodes of recent selective sweeps was considered very difficult in natural populations, capturing such “incomplete” or ongoing selective sweeps must have been considered even more difficult. Therefore, theoretical work mainly focused on inferring selective sweeps that were already completed in the past (Kaplan *et al.* 1989; Barton 1998; Fay and Wu 2000; Kim and Stephan 2002; Przeworski 2002). However, Sabeti *et al.* (2002), in one of the first large-scale population genomic surveys for detecting recent positive selection, showed that the human genome harbors a number of loci with clear signatures of incomplete selective sweeps. Since then, detecting this type of selective sweep soon became an important topic in both empirical and theoretical population genetics (Quesada *et al.* 2003; Meiklejohn *et al.* 2004; Sabeti *et al.* 2006; Saunders *et al.* 2006; Voight *et al.* 2006).

Sabeti *et al.* (2002) introduced a long-range haplotype test based on extended haplotype homozygosity (EHH) that quantifies the residual association between an allele at the core locus and its genetic background (*i.e.*, the linked haplotype at the time of the allele’s mutational origin). Under neutrality, a haplotype associated with an allele at higher frequency extends to a shorter distance, thus yielding smaller EHH, since the allele is older (Toomajian *et al.* 2003). A significantly large EHH for a given allele frequency at the focal locus then suggests the hitchhiking effect driven by positive selection. If the ancestral *vs.* derived alleles of a polymorphic site can be distinguished, positive selection is expected to generate a much larger EHH for the derived allele than that for the ancestral allele. This is the rationale of the *iHS* statistic in Voight *et al.* (2006) that is now routinely used in population genomic studies. Recently, Ferrer-Admetlla *et al.* (2014) proposed a new statistic, *nS*_{L}, that is similar to *iHS* but is robust to recombination rate variation and exhibits improved power to detect sweeps.

The success and popularity of discovering incomplete sweeps may be attributable to unique haplotype structures that can be relatively easily and reliably captured by a rather simple test statistic such as *iHS*. If the local mutation rate fluctuates, it may create a random region of severely reduced variation that might be taken as a candidate for a complete selective sweep (Kim and Stephan 2002). With an incomplete sweep, the pattern of polymorphism in the haplotype block containing the ancestral allele of the focal locus reflects genetic variation that existed before the start of the selective sweep. Then, this haplotype block is effectively a negative control for the selective sweep that would alleviate the problem of local fluctuation in mutation rate. In the case of local adaptation, the inclusion of sequences from a neighboring deme, where positive selection did not take place, into analysis was shown to increase the statistical power of detecting positive selection (Innan and Kim 2008). The ancestral haplotype block in an incomplete sweep is expected to play a similar role in increasing statistical power to detect selection to that of the neighboring deme for complete sweeps.

The analysis of incomplete selective sweeps therefore provides a great opportunity for understanding positive natural selection in nature. However, as the current methods are not built on an explicit model of selection, information regarding the process of selection underlying the incomplete sweeps was limited. In this study, we obtain an approximate formula for sampling probabilities in a model of an incomplete selective sweep and then build a composite-likelihood-ratio (CLR) test for formal hypothesis testing and parameter estimation. Previously, Meiklejohn *et al.* (2004) proposed a CLR test for detecting an incomplete sweep by extending the sampling probabilities under complete selective sweeps of Kim and Stephan (2002) into cases where the final frequency of the beneficial allele in the population is less than one. However, in this approach the probability of sampling a neutral variant from the entire set of samples was obtained without explicitly specifying the polymorphic site causing an incomplete sweep or the joint configuration of polymorphism in the neutral and the putative selected loci. While a key parameter in their composite-likelihood ratio is the final frequency, β, of a beneficial mutation in the population, the frequency spectrum of the total data contains only a limited amount of information, yielding a very broad peak of the composite-likelihood ratio over the parameter space. Therefore, the joint estimation of β and the location/strength of selection was not accurate and the statistical power to detect selection was much lower compared to that of the *iHS* test. To overcome this difficulty, this study uses an approach to take each single-nucleotide polymorphism (SNP) in data as a putative locus under selection, essentially identical to the *iHS* method above. 
Namely, the derived alleles at all SNPs are tested to find whether they increased to the current frequencies by strong directional selection, by jointly analyzing the pattern of linked polymorphism surrounding the derived allele of each SNP and that surrounding the ancestral allele. This test is aimed at detecting selection in large-scale population genomic data generated by next-generation sequencing (NGS) methods, which inevitably contain occasional low-quality or missing base calls. The composite-likelihood approach can be straightforwardly applied to such data with missing information. By applying this method to simulated data and a population genomic data set in *D. melanogaster*, we demonstrate that this approach improves our ability to detect clear signatures of incomplete selective sweeps.

## Materials and Methods

### Sampling probability under an incomplete selective sweep

We aim to detect the signature of an incomplete selective sweep in which a beneficial allele originating from a single event of a point mutation (thus a hard selective sweep) reaches an intermediate frequency in a population. Consider multisite polymorphism observed in the alignment of *n* randomly sampled homologous chromosomes (Figure 1). It is assumed that neutral alleles segregate at these polymorphic sites, except one under selection (denoted the “*S* locus”) with *n*_{1} copies of the beneficial allele and *n*_{2} (= *n* – *n*_{1}) copies of the ancestral allele. The strength of selection for the beneficial allele is given by α = 2*Ns*, where *N* is the number of diploid individuals in the population and *s* is the selection coefficient [the relative fitness of the beneficial over the ancestral allele is 1 + *s*, assuming codominance (*h* = 0.5)]. At a neutral site that is *d* nucleotides away from the *S* locus, let *k*_{1} (*k*_{2}) be the count of the derived allele in the subsample of *n*_{1} (*n*_{2}) chromosomes carrying the beneficial (ancestral) allele. If *d* is small enough to generate the hitchhiking effect of the beneficial allele, an increased or decreased frequency of the derived neutral allele due to hitchhiking is reflected by *k*_{1}/*n*_{1}, while its frequency before hitchhiking is estimated by *k*_{2}/*n*_{2}, assuming that the frequency of the neutral allele among chromosomes carrying the ancestral allele at the *S* locus does not change during the sweep (see *Appendix*). Therefore, the hypothesis of an incomplete sweep acting on the (putative) *S* locus predicts a very distinct joint probability distribution of *k*_{1} and *k*_{2}, compared to an alternative (*i.e.*, neutral) hypothesis (Figure 2). Our goal is to build a parametric test based on this joint sampling probability for detecting an incomplete selective sweep (*i.e.*, identifying the *S* locus in DNA sequence polymorphism).

We obtained two approximate solutions to such a joint sampling probability by modifying the equivalent solutions for complete sweeps, one from Nielsen *et al.* (2005) and the other from Etheridge *et al.* (2006) (*Appendix*). The corresponding probability under the null hypothesis (no selection) can also be obtained. The primary parameter that determines both approximations is the ratio *r*/*s*, where *r* = *r*_{n}*d* (*r*_{n} = recombination rate per nucleotide per generation) is the recombination rate between the *S* locus and the neutral site, and the scaled recombination rate is *R* = 4*Nr*. In comparison against simulated data generated by *msms* (Ewing and Hermisson 2010) under the model of an incomplete sweep, the approximation based on Etheridge *et al.* (2006) matches the sampling probability much better than the one based on Nielsen *et al.* (2005) for small recombination rates (Supporting Information, Figure S1). However, the Etheridge-based approximation is not applicable for larger recombination rates (Etheridge *et al.* 2006).

### CLR test

Let *x* be the position of the putative *S* locus, which is assumed to be one of the polymorphic sites in the sequence alignment and thus partitions the data into subsamples of *n*_{1} and *n*_{2} chromosomes carrying the derived and ancestral alleles at the locus, respectively. *n*_{1} is also denoted as *n*_{1}(*x*) to emphasize that the position *x* uniquely determines the derived allele frequency of the putative *S* locus. This partition by the *S* locus also determines the counts of the derived neutral allele at nucleotide site *i* in the two subsamples. Then, for a given *x*, a maximum-composite-likelihood estimate of the strength of selection is obtained as the value of α (if *R* per site is externally given) that maximizes the CLR (Equation 1), where *L*_{IS} (Equation 2) and *L*_{N} (Equation 3) are composite likelihoods under the hypotheses of incomplete selective sweep and neutrality, respectively. In the following, unless stated otherwise we use only the approximation based on Nielsen *et al.* (2005) for Equation 2 despite its error for small *r*/*s*. The impact of this error on the performance of our likelihood test is addressed below. Unless stated otherwise, multiplication above is done across all sites in the data, including monomorphic sites. It is also possible to multiply probabilities over polymorphic sites only (analogous to *L*_{2} and *L*_{4} in Kim and Nielsen 2004), which leads to a composite-likelihood test for detecting selection based on the joint allele frequency spectrum only but not on the patterning of polymorphic sites along the sequence.

It is straightforward to calculate the CLR given in Equation 1 in the presence of missing values in sequence data, for example due to low-quality base calls that are common in NGS data sets. If missing base calls are made at a site on chromosomes in the sample, the sampling probability for this site is calculated after *n*_{1} and/or *n*_{2} are reduced accordingly. If base calls are missing at the core SNP (the putative *S* locus), entire chromosomes carrying the missing base calls are excluded from the calculation of composite likelihoods.
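The partition of the sample by the core SNP and the handling of missing base calls can be sketched as follows. This is a minimal illustration and not the code used in this study: the function name `joint_counts` and the encoding of haplotypes as strings of `'0'` (ancestral), `'1'` (derived), and `'N'` (missing) are assumptions made for the example.

```python
def joint_counts(haplotypes, core, site):
    """Return (n1, k1, n2, k2) for one neutral site, given a core SNP.

    haplotypes: list of equal-length strings over '0', '1', 'N' (assumed encoding).
    core, site: column indices of the putative S locus and the neutral site.
    """
    n1 = k1 = n2 = k2 = 0
    for hap in haplotypes:
        core_allele, site_allele = hap[core], hap[site]
        if core_allele == "N":
            # Missing call at the core SNP: exclude the whole chromosome.
            continue
        if site_allele == "N":
            # Missing call at this site: the subsample size shrinks for this site only.
            continue
        if core_allele == "1":
            n1 += 1
            k1 += site_allele == "1"
        else:
            n2 += 1
            k2 += site_allele == "1"
    return n1, k1, n2, k2


sample = ["11", "10", "01", "00", "N1", "1N"]
print(joint_counts(sample, core=0, site=1))  # (2, 1, 2, 1)
```

The counts returned for each site are the inputs that a per-site sampling probability would take before the product over sites is formed.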

Next, the maximum-composite-likelihood estimate of the locus under selection is obtained by calculating the maximum CLR for all polymorphic sites in a given chromosomal region and then identifying the site that yields the largest value. This procedure also implies that a test statistic for hypothesis testing would be given by this largest value, *T*_{0} (Equation 4), and we may reject the null hypothesis (neutrality) if *T*_{0} is larger than a certain cutoff value. The null distribution of *T*_{0} is determined by applying the above calculation to polymorphic sites in a large number of data sets simulated under the neutral model. Fixing the number of polymorphic sites (-s option in *ms*) or fixing the scaled mutation rate (-t option) while conditioning on a similar number of polymorphic sites in simulated data led to almost identical distributions (data not shown). However, it was observed that the maximum CLR for a given focal site is negatively correlated with the derived allele frequency at that site, most likely because a derived allele with smaller allele frequency originated more recently and is thus associated with a longer extended haplotype. If not corrected, this will bias the estimated locus of selection to be a polymorphic site with a lower frequency of the derived (putatively beneficial) allele. A solution to this problem would be to transform the maximum CLR to remove its correlation with the derived allele frequency. We tried and evaluated various forms of the normalized test statistic. The following procedure yielded the best performance of parameter estimation (see below). We first obtain the mode and the 1 − *ε* quantile of the distribution of the maximum CLR from polymorphic sites whose derived allele frequency is *f* in simulated neutral data sets. Using these two values, we then define a normalized statistic, *T*_{1} (Equation 5). The estimated location of the *S* locus is the value of *x* that achieves the maximum in Equation 5. This also leads to the final estimate of the strength of selection for a given set of sequences.
For a given *ε*, the null distribution of *T*_{1} is obtained by applying the above procedure to a large number of sequence samples, with an equivalent number of polymorphic sites and a scaled recombination rate, that are generated by neutral simulation. Unless stated otherwise, we reject the null hypothesis of neutrality (no selection) with significance level *P* = 0.001, which resulted in an optimal range of statistical powers with varying parameter values chosen below.

### Analysis of *D. melanogaster* population genomic data

We used 22 primary core genomes in the Rwanda (RG) sample of *D. melanogaster* described in Pool *et al.* (2012), available for download from the DPGP2 project (http://www.dpgp.org/). As the violation of the random sample from unrelated individuals may generate spurious occurrence of long-range haplotype homozygosity, we removed identical-by-descent (IBD) tracts detected by Pool *et al.* (2012): in the sample of 22 sequences, if any pair of chromosome segments are IBD, we treated one of them as a missing observation, replacing it by a sequence of “N” characters. Then, we extracted a phased table of polymorphic sites with their physical locations. Next, the ancestral and derived alleles were inferred using the syntenic assembly of *D. melanogaster* and *D. simulans* (available at www.dpgp.org) and designating the allele observed in *simulans* as ancestral or the table of ancestral allele probability for polymorphic sites calculated for DPGP1 RAL sequences (Chan *et al.* 2012). This procedure could assign the ancestral/derived states for ∼85% of polymorphic sites obtained above. The remaining polymorphic sites were not included in input files. We also excluded from analysis the telomeric and centromeric regions of each chromosome arm with low recombination rates: from the midpoint of a chromosome arm we moved toward the telomere and toward the centromere until the points over which the mean genetic distance per megabase first becomes <1 cM, using the best-fitting equations for crossing-over rates on 100-kb windows obtained by Comeron *et al.* (2012).
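The two filtering steps described above, masking one member of each IBD pair and trimming low-recombination ends of a chromosome arm, can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used here: the function names, the tuple layout of IBD tracts, and the use of per-window rates (rather than the fitted equations of Comeron *et al.* 2012) are all hypothetical.

```python
def mask_ibd(sequences, ibd_tracts):
    """Replace one member of each IBD pair by 'N' over the shared tract.

    ibd_tracts: list of (kept_index, masked_index, start, end), end exclusive
    (an assumed layout for this sketch).
    """
    masked = [list(seq) for seq in sequences]
    for _kept, dropped, start, end in ibd_tracts:
        end = min(end, len(masked[dropped]))  # do not run past the sequence end
        for pos in range(start, end):
            masked[dropped][pos] = "N"
    return ["".join(seq) for seq in masked]


def high_recombination_span(rates_cm_per_mb, threshold=1.0):
    """Walk outward from the middle window of a chromosome arm and stop on
    each side just before the rate first drops below the threshold.

    Returns (first_window, last_window), both inclusive.
    """
    mid = len(rates_cm_per_mb) // 2
    left = mid
    while left > 0 and rates_cm_per_mb[left - 1] >= threshold:
        left -= 1
    right = mid
    while right < len(rates_cm_per_mb) - 1 and rates_cm_per_mb[right + 1] >= threshold:
        right += 1
    return left, right


print(mask_ibd(["ACGTAC", "ACGTAC"], [(0, 1, 1, 4)]))  # ['ACGTAC', 'ANNNAC']
print(high_recombination_span([0.2, 0.8, 1.5, 2.0, 3.0, 2.5, 0.9, 1.2, 0.1]))  # (2, 5)
```

In the second example the window with rate 1.2 is excluded even though it exceeds the threshold, because the walk stops at the first window below 1 cM/Mb on that side.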

Composite likelihood is calculated by taking a SNP as the putative *S* locus (“core SNP”): thus the sample is partitioned into *n*_{1} and *n*_{2} sequences as described above. The sample frequency of the focal derived allele is therefore *f* = *n*_{1}/(*n*_{1} + *n*_{2}). Note that, as this SNP may contain missing values (N in data) and the corresponding chromosomes are excluded from calculating the composite likelihood, *n*_{1} + *n*_{2} can be <22. For computational convenience, we assumed scaled recombination rate 4*Nr*_{n} = 0.012 per site in the calculation of likelihood for all chromosome arms. As sampling probability under selection is primarily a function of *r*/*s* but only slightly modified by α alone, a deviation of actual recombination rate from the above assumption would lead to a corresponding error in the estimate of α, without affecting the location and value of the maximum-composite-likelihood ratio. Local fluctuation in the scaled mutation rate, θ, was also ignored: we estimated mean θ for each chromosome arm and used it in the calculation of likelihoods for any region within the chromosome arm. Incorrect assumptions of θ were shown to affect minimally the performance of our test (see below).

Joint sampling probabilities were obtained using the approximation proposed in Nielsen *et al.* (2005), *i.e.*, assuming that the ancestral pattern of polymorphism at the time of the beneficial mutation follows either standard neutral equilibrium (test option A) or the currently observed genome-wide empirical frequency spectrum (test option B) (*Appendix*). The significance of the CLR, maximized with respect to α and then normalized for the derived allele frequency, is assessed as described for *T*_{1} above, however, using the site-wise null distribution of CLR obtained from individual polymorphic sites (5 × 10^{5} SNPs) generated by *msms* under neutrality, with parameters adjusted to match sample size, mean recombination rate, and the mean density of polymorphic sites to those of *Drosophila* genome data. Namely, multiple-test correction, as implemented above by the null distribution of the local maximum of test statistic (*T*_{1}) in a window of defined sequence length, is not performed here. Therefore, a *P*-value determined this way cannot be compared to that used for analyzing simulated incomplete sweeps above. We consider sites that yield large normalized CLR, corresponding to *P* < 0.001, as candidate loci under selection. This level of significance is rather arbitrary. However, rather questionable candidates of incomplete sweeps (with unclear haplotype structure upon visual inspection; see *Results* below) are already detected at this level and, therefore, a less stringent level will likely increase the number of such loci.

### Haplotype homozygosity tests

We applied two haplotype homozygosity tests, *iHS* (Voight *et al.* 2006) and *nS*_{L} (Ferrer-Admetlla *et al.* 2014), to detect incomplete sweeps in simulated data as well as *D. melanogaster* data. For the analysis of simulated data, unstandardized *iHS* (log[*iHH*_{A}/*iHH*_{D}]) was calculated for individual polymorphic sites according to Voight *et al.* (2006), using the rehh R package (Gautier and Vitalis 2012) (http://cran.r-project.org/web/packages/rehh/index.html), and unstandardized *nS*_{L} was calculated through the program provided by Ferrer-Admetlla *et al.* (2014) at http://cteg.berkeley.edu/~nielsen/. Using the same set of simulated neutral samples as used above for CLR analysis, the 1 − *ε* quantile and mode of the distribution of the unstandardized *iHS* were obtained for each derived allele frequency and these values were used to define standardized *iHS* by applying the procedure of obtaining *T*_{1} by Equation 5. The test statistic for detecting an incomplete selective sweep in a replicate of a 100-kb sequence sample is therefore the most negative standardized *iHS* among sites, and the procedure of obtaining the null distribution and assessing the significance of this statistic is identical to that of *T*_{1} by the CLR method. The same normalization procedure was applied for the *nS*_{L} statistic. We also tried the standardization procedure based on the assumption of normal distribution described in Voight *et al.* (2006) for both statistics and found that our standardization procedure leads to slightly increased statistical power (data not shown).

For the genomic scan of *D. melanogaster* data below, we first obtained standardized *iHS* and associated *P*-values for individual polymorphic sites in the data according to the procedure described by Voight *et al.* (2006), as performed by the rehh package. In the calculation of *iHH*_{A} and *iHH*_{D} by this package, haplotype homozygosity for sequences does not extend from the core SNP if missing base calls are encountered in a subset of the sequences. Namely, missing bases (N) are treated as an allele distinct from A, C, G, or T. We found that this frequently generates very small *iHH*_{A} and thus erroneously very negative *iHS* (*i.e.*, false detection of selection). To correct this problem, we wrote our own code that calculates *iHH*_{A} and *iHH*_{D} while skipping positions of missing bases in extending haplotype homozygosity. We also used the full data including all polymorphic sites rather than the input data used above for the CLR test, in which ∼15% of polymorphic sites were excluded as their ancestral/derived alleles could not be determined. Excluding these sites caused frequent false positives since the excluded sites are often clustered, which makes the region falsely appear monomorphic and thus inflates *iHH*_{D}. These corrections led to detection of clearer signatures of incomplete sweeps (upon visual inspection of haplotype structures). For the *nS*_{L} statistic, since we find it less sensitive to missing data than *iHS*, we used the same input data as used for the CLR test and then performed standardization according to Ferrer-Admetlla *et al.* (2014).
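A calculation of unstandardized *iHS* that tolerates missing bases can be sketched as below. This is not the code written for this study. It assumes one reading of "skipping positions of missing bases", namely that an N on either chromosome of a pair never breaks haplotype identity. The function names, the `'0'`/`'1'`/`'N'` encoding, and the truncation of the integral where EHH falls below 0.05 are assumptions of the sketch.

```python
import math
from itertools import combinations


def ehh(haps, core, allele, site):
    """Fraction of carrier pairs identical from the core SNP to `site`,
    ignoring positions where either chromosome has a missing base."""
    carriers = [h for h in haps if h[core] == allele]
    pairs = list(combinations(carriers, 2))
    if not pairs:
        return 0.0
    lo, hi = sorted((core, site))
    identical = 0
    for a, b in pairs:
        if all(x == y or "N" in (x, y) for x, y in zip(a[lo:hi + 1], b[lo:hi + 1])):
            identical += 1
    return identical / len(pairs)


def ihh(haps, positions, core, allele, cutoff=0.05):
    """Trapezoidal integral of EHH over physical position on both sides of the core."""
    total = 0.0
    for step in (-1, 1):
        prev_site, prev_ehh = core, 1.0
        site = core + step
        while 0 <= site < len(positions):
            cur = ehh(haps, core, allele, site)
            if cur < cutoff:
                break
            total += 0.5 * (prev_ehh + cur) * abs(positions[site] - positions[prev_site])
            prev_site, prev_ehh = site, cur
            site += step
    return total


def unstandardized_ihs(haps, positions, core):
    """log(iHH_A / iHH_D); negative values indicate longer derived haplotypes."""
    ihh_a = ihh(haps, positions, core, "0")
    ihh_d = ihh(haps, positions, core, "1")
    if ihh_a <= 0 or ihh_d <= 0:
        raise ValueError("iHH is zero for one allele; iHS is undefined at this site")
    return math.log(ihh_a / ihh_d)


haps = ["111", "111", "111", "000", "100", "001"]  # core SNP is the middle column
print(unstandardized_ihs(haps, positions=[0, 100, 200], core=1))  # log(2/3), about -0.405
```

In the example the three derived-allele carriers are identical across the region, whereas only one of three ancestral pairs stays identical on each side, so the statistic is negative.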

### Simulated data under different demographic assumptions

To explore the robustness of the CLR test to demographic assumptions, we generated neutral data sets, using *msms* (Ewing and Hermisson 2010) under three different scenarios: population bottleneck, exponential population growth, and population subdivision. All data sets were generated with equal sequence length (100 kb) and number of polymorphic sites (3000). For the bottleneck model, we simulated a population bottleneck lasting from 0.4*N* to 0.2*N* generations in the past with different severities *c* = 0.4, 0.2, and 0.1 (*c* = *N*_{b}/*N*, where *N*_{b} is the population size during the bottleneck). In the case of the exponential growth model, populations start growing exponentially from a population size of 0.4*N*, with three different growth rates *g* = 10, 100, and 500. For the population subdivision model, we simulated a two-island model with symmetric, constant migration rates *M* = 0.1, 1, 10 and then drew all sequences for each sample from one island. We also varied the recombination rate [*R* (per 10 kb) = 4000, 6000, 8000, 10,000, 12,000] in each model to study the effect of altered linkage disequilibrium on the null distribution. For each parameter set, at least 1000 replicates with a sample size of 20 chromosomes were obtained.

### Codes and scripts

All source code developed here for analyzing simulated and actual data is available upon request. Command line scripts for simulations performed above using *ms* and *msms* are provided in File S1.

## Results

### Statistical power of the composite-likelihood test

To evaluate the performance of the composite-likelihood method described above, we applied it to simulated data sets generated by *msms* (Ewing and Hermisson 2010) under the model of incomplete selective sweeps. In simulation, the beneficial allele of the *S* locus, located in the middle of the 100-kb sequence, reaches frequency β = 0.5 in the population and a sample of 20 sequences is generated. Then, test statistic *T*_{1} in Equation 5 was determined after the maximum CLRs were calculated over all SNPs in the sample with derived allele frequency ≥3 and ≤17. The null hypothesis of neutrality was rejected if *T*_{1} > the 99.9th percentile of the null distribution, which was obtained from data sets simulated under the model of neutral equilibrium with the same sequence length and recombination rate. This cutoff value (*P* = 0.001) of *T*_{1} is a function of *ε*. When various values of *ε* (0.0006, 0.001, 0.0016, 0.002, and 0.003) were tried, the statistical power fluctuated moderately (∼5%) while *ε* = 0.002 resulted in the best performance in parameter estimation (the largest proportion of replicates in which the correct site, at position 50 kb, yielded the largest *T*_{1}). We thus use *ε* = 0.002 in the following analyses.
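The rejection rule and the resulting power estimate can be sketched as follows, assuming the values of *T*_{1} from neutral and sweep simulations are already available as plain lists. The function names and the convention for picking the empirical cutoff are assumptions of this sketch, not details taken from the analysis above.

```python
def empirical_cutoff(null_values, p=0.001):
    """Largest null value such that at most a fraction p of the null
    replicates lie strictly above it."""
    ordered = sorted(null_values)
    n_above = int(p * len(ordered) + 1e-9)  # small tolerance guards against float rounding
    return ordered[len(ordered) - n_above - 1]


def power(test_values, cutoff):
    """Proportion of sweep replicates whose statistic exceeds the cutoff."""
    return sum(t > cutoff for t in test_values) / len(test_values)


null_t1 = list(range(1, 1001))            # stand-in for T1 from 1000 neutral replicates
cutoff = empirical_cutoff(null_t1, p=0.001)
print(cutoff)                              # 999
print(power([998, 1000, 1001, 5], cutoff)) # 0.5
```

With few neutral replicates the cutoff is simply the largest null value, so a significance level of 0.001 is only meaningful when well over 1000 neutral replicates are simulated.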

The statistical power of the test increased as the final frequency of the beneficial mutation, β, in the population increased: with *R* = 2000, α = 2000, and β increasing from 0.3 to 0.7 by 0.1, the statistical powers were 0.45, 0.71, 0.87, 0.95, and 0.98, respectively. This test performed better with larger β presumably because, as a larger proportion of individuals (thus sequences in a sample) are affected by selection, the pattern of polymorphism becomes more distinct from that expected under neutrality, and also because a component of the sampling probability was obtained from the solution for a complete selective sweep. For a fixed value of β, the statistical power increased with increasing strength of selection, as expected (Table 1).

This performance of the composite-likelihood test was compared to that of long-range haplotype methods that use *iHS* and *nS*_{L} statistics (Voight *et al.* 2006; Ferrer-Admetlla *et al.* 2014). Instead of using the normal distribution-based standardization of *iHS* and *nS*_{L}, we applied the normalization procedures that were used to obtain *T*_{1} above (see *Materials and Methods*), which made it possible to directly compare the performance of the CLR, *iHS*, and *nS*_{L} methods. In all parameter sets tested, the statistical power of our composite-likelihood method is higher than that of *iHS* but only slightly better than that of the *nS*_{L} method [Table 1; note that results here were obtained assuming that the correct scaled recombination rate of the sequence is available (see below)]. Interestingly, there are a relatively large number of simulated incomplete sweeps detected by either CLR or *nS*_{L} only, particularly with weaker strength of selection (Figure 3), suggesting that the CLR and *nS*_{L} methods capture different aspects of data as signatures of incomplete sweeps and thus are largely complementary to each other.

### Effect of recombination rate and linkage disequilibrium

The above result is based on the null distribution of the test statistic obtained from neutral simulations that used recombination rates identical to those used in the simulation of incomplete sweeps. However, in practice, the correct rate of recombination, scaled or unscaled, for a given genomic region may not be available. This turns out to be a serious problem for our CLR method, as we found that the null distribution of the likelihood ratio is highly sensitive to the scaled recombination rate (Figure 4). It appears that, with decreasing recombination rate, linkage disequilibrium (LD) between adjacent polymorphic sites increases and this inflates the likelihood of an incomplete sweep (*L*_{IS}) relative to that of neutral evolution (*L*_{N}). Therefore, one approach to control the false-positive rate of detection might be to adjust the recombination rate during neutral simulation until the average LD among sites matches that of data under examination (see below for the case of demographic complications). As an incomplete selective sweep generates a high level of LD around the *S* locus (Stephan *et al.* 2006), to generate samples with an equivalent level of LD [measured by the average ρ^{2} over all pairs of sites, where ρ is the normalized LD as measured by correlation of allele frequencies between two loci (Hill and Robertson 1968)] the recombination rate needs to be greatly reduced in the neutral simulation. When we obtained the null distribution of *T*_{1} from such low-recombination simulation, the statistical power of our CLR method decreased dramatically (Table 1). In contrast, the null distribution of *nS*_{L} was affected minimally by recombination rate variation (data not shown), as it was originally proposed to cope with uncertainty in recombination rates (Ferrer-Admetlla *et al.* 2014).
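The LD summary used for matching neutral simulations to data, the average squared correlation between pairs of sites, can be sketched as below. The function names and the `'0'`/`'1'` haplotype encoding are assumptions of this example. Pairs that include a monomorphic site are skipped because the correlation is undefined for them.

```python
from itertools import combinations


def r_squared(haps, i, j):
    """Squared correlation of alleles at sites i and j (Hill and Robertson 1968).
    Returns None when either site is monomorphic in the sample."""
    n = len(haps)
    p_i = sum(h[i] == "1" for h in haps) / n
    p_j = sum(h[j] == "1" for h in haps) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haps) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:
        return None
    d = p_ij - p_i * p_j
    return d * d / denom


def mean_r_squared(haps):
    """Average r^2 over all pairs of polymorphic sites."""
    values = []
    for i, j in combinations(range(len(haps[0])), 2):
        r2 = r_squared(haps, i, j)
        if r2 is not None:
            values.append(r2)
    return sum(values) / len(values) if values else float("nan")


print(mean_r_squared(["11", "11", "00", "00"]))  # 1.0 (complete association)
print(mean_r_squared(["11", "10", "01", "00"]))  # 0.0 (no association)
```

In a calibration loop, the recombination rate of the neutral simulation would be lowered until this average matches the value computed from the data under examination.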

### Inferring the strength of selection and the position of the *S* locus

Because sequences are randomly sampled from a population, the copy number of the beneficial allele in a sample, *n*_{B} = *n*_{1}(50,000), is a binomial random variable: with β = 0.5, *n*_{B} < 3 or >17 in <0.5% of replicates, in which case it is impossible to detect the true locus under selection. In the other replicates, the exact locus is detected if the maximum *T*_{1} is obtained at the correct site (*i.e.*, the estimated position is 50,000). Compiling results from all replicates (regardless of whether the correct site is inferred or not), we find that the estimate of the strength of selection is unbiased, although the variance of the estimate is large (Table 1). More than half of the replicates yielded a position estimate within ∼1–3 kb of the target of selection. The proportion of replicates in which the exact site is inferred ranges from 9 to 16%, with more accurate estimates occurring at higher recombination rates. If the sample frequency of the beneficial allele matches the population frequency (0.5), this proportion increases significantly (Table 1).
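The binomial statement above can be checked directly. The sketch below assumes a sample size of *n* = 20, which is implied by the symmetric thresholds (fewer than 3 or more than 17 copies) but is not stated in this section; the function name is hypothetical.

```python
from math import comb


def prob_extreme_count(n, beta, low, high):
    """P(n_B < low or n_B > high) for n_B ~ Binomial(n, beta)."""
    pmf = [comb(n, k) * beta**k * (1 - beta) ** (n - k) for k in range(n + 1)]
    return sum(p for k, p in enumerate(pmf) if k < low or k > high)


# n = 20 is an assumption (see lead-in); beta = 0.5 is taken from the text.
tail = prob_extreme_count(n=20, beta=0.5, low=3, high=17)
print(f"{tail:.6f}")
```

Under these assumptions the tail probability is well below the 0.5% bound quoted in the text.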

In the *iHS* and *nS*_{L} methods, the estimated location of the *S* locus is given as the polymorphic site from which the most negative normalized statistic is obtained. Applied to the same sets of simulated data, the position estimate by *iHS* was less accurate than that by either CLR or *nS*_{L} (Table 1). The exact position of the *S* locus was correctly inferred about three times more often by CLR than by *iHS* but roughly as often as by *nS*_{L}. CLR also yielded the smallest mean deviation of the estimate from the true location. Overall, the accuracies of estimates are similar between the CLR and *nS*_{L} methods. Surprisingly, however, the three methods are weakly correlated with respect to estimating the exact location of selection (Figure 3): for example, applied to 10,000 replicates of simulation with α = 4000, the CLR and *nS*_{L} methods detected the correct site under selection in 1390 and 1255 replicates, respectively, but in only 252 replicates was the correct site detected by both methods. Again, this result suggests that the CLR and *iHS*/*nS*_{L} methods capture slightly different information in multisite polymorphism to detect incomplete sweeps and estimate the position of the putative *S* locus. When we define a new estimate as the average of those by the CLR and *nS*_{L} methods, its mean deviation from the correct site in kilobases is 2.74, 2.42, and 3.25 for α = 1000, 2000, and 4000, respectively (with *R* = 4000), which is smaller than the deviation obtained by an individual method (Table 1). Therefore, small improvements in the accuracy of position estimates are made by combining the two methods.

### Modification of composite likelihoods

So far, the sampling probability based on the approximation by Nielsen *et al.* (2005) was used for obtaining composite likelihoods. For small recombination rates, we may replace it with a more accurate approximation based on Etheridge *et al.* (2006). This, however, did not lead to a significant change in the profile of the CLR (Figure S2). We also examined the effect of not including monomorphic sites in the data. When the CLR is calculated by multiplying joint sampling probabilities over only polymorphic sites in the data, it leads to lower statistical power to detect selection and larger errors in estimating the strength and position of selection than when the multiplication is done over all sites (Table 1). This result suggests that not only the (joint) frequency spectrum of polymorphism but also the spatial distribution or density of polymorphic sites contains information regarding incomplete selective sweeps.

### Effect of complex demography

Next, to evaluate the robustness of the CLR method to complex demography and population structure, we examined how the null distribution of the test statistic (*T*_{1}) changes if it is obtained from data sets simulated under the models of population bottleneck, expansion, and subdivision (see *Materials and Methods*). In each model, parameters were chosen to produce a significant deviation of the frequency spectrum from that under neutral equilibrium. The number of polymorphic sites (3000) per sample was held constant across models and parameters. First, with a population bottleneck that lasted from 0.4*N* to 0.2*N* generations ago, decreasing the size of the bottlenecked population (*c* = *N*_{B}/*N* decreasing from 0.2 to 0.1 to 0.05) dramatically shifted the distribution of the CLR upward (Figure S3A). This shift appears to be explained by a reduction in scaled (population-level) recombination rate due to the bottleneck, which leads to increased LD: when the recombination rate was increased to reduce LD (quantified by mean pairwise ρ^{2}), the distribution of the CLR shifted back downward (Figure S4A). With matching LD, distributions obtained under the bottleneck (4*Nr*_{n} = 0.1; mean ρ^{2} = 0.0543) and under the standard neutral model (a constant-sized panmictic population; 4*Nr*_{n} = 0.04; mean ρ^{2} = 0.0543) are very similar (Figure S4A). However, the right tail of the distribution under the bottleneck is still slightly heavier than that under neutral equilibrium.

Similarly, the null distribution of the CLR shifts upward due to rapid exponential growth of population size (*g* > 100) in the expansion model and limited migration (*M* < 1) in the subdivision model (Figure S3, B and C). Again, by increasing the recombination rate in the simulation, thus reducing the average level of LD among sites, these distributions are shifted downward. Similar distributions of the CLR (right tails) are obtained from simulations under the standard and complex demography if the levels of LD match (Figure S4). These results suggest that, in the analysis of a genomic region for which the underlying population demography and/or the correct recombination rate is not known, the false-positive rate of detecting incomplete sweeps by CLR can be greatly reduced, if not completely eliminated, by generating samples with matching LD by standard neutral simulation.
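The LD-matching idea can be sketched as a one-dimensional search: adjust the recombination rate used in the neutral simulation until the simulated mean ρ^{2} equals the value observed in the data. The sketch below is a minimal bisection on a log scale, assuming that mean ρ^{2} decreases monotonically with the recombination rate. The names `match_ld` and `toy_mean_r2` are hypothetical, and the toy function merely stands in for a real neutral simulator, whose output would be stochastic and would need to be averaged over many replicates.

```python
import math


def match_ld(target_r2, mean_r2_at, lo=1e-4, hi=10.0, tol=1e-4, max_iter=100):
    """Find the recombination rate whose mean r^2 matches target_r2.

    mean_r2_at: callable mapping a recombination rate to mean pairwise r^2,
                assumed to decrease as the rate increases.
    """
    for _ in range(max_iter):
        mid = math.sqrt(lo * hi)  # midpoint on a log scale
        r2 = mean_r2_at(mid)
        if abs(r2 - target_r2) <= tol * target_r2:
            return mid
        if r2 > target_r2:
            lo = mid  # too much LD: recombination rate must be higher
        else:
            hi = mid  # too little LD: recombination rate must be lower
    return math.sqrt(lo * hi)


def toy_mean_r2(rate):
    # Toy stand-in for "simulate neutral samples at this rate and average r^2".
    return 1.0 / (1.0 + 200.0 * rate)


# 0.0543 is the matched mean rho^2 quoted in the text; the resulting rate is
# meaningful only for the toy function above.
matched_rate = match_ld(0.0543, toy_mean_r2)
print(matched_rate, toy_mean_r2(matched_rate))
```

The null distribution of *T*_{1} would then be generated from neutral simulations run at the matched rate.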

Results above were obtained by calculating the likelihood of incomplete sweeps, assuming the standard neutrality at the time of beneficial mutation [*f*_{0}(*p*) = θ/*p*; test option A]. We can replace *f*_{0}(*p*) with the empirical distribution of the derived allele frequency observed in the simulations of these demographic models (test option B). The latter option is essentially the approach by Nielsen *et al.* (2005) to minimize the confounding effect of complex demography in detecting the signature of selection. However, it had little effect in correcting the null distribution and did not prevent the inflation of the CLR with increasing LD between segregating sites (Figure S3).

### Application to *D. melanogaster* genomic data

The composite-likelihood method described above was applied to population genomic data of *D. melanogaster* to detect incomplete sweeps. We used 22 haploid genome sequences from Rwanda (the RG sample) described in Pool *et al.* (2012). As the species’ ancestral range is known to lie within southern and eastern Africa, the RG sample is likely to satisfy the assumption of equilibrium demography (constant-sized random-mating population before the start of the sweep in our model) better than any other available genomic data sets in *D. melanogaster*. However, when we examined the genome-wide distribution of derived allele frequency, a slight but clear deviation (excess of rare alleles) from the standard neutrality was observed (Figure S5). This is likely due to nonequilibrium demography (mild population bottleneck and recent population growth) that may have affected the RG sample (Pool *et al.* 2012) but might also be due to errors in base calling and ancestral/derived state inference.

A genome scan was conducted by sequentially taking all polymorphic sites in the data with derived allele frequencies satisfying 0.35 < *f* < 0.8 as core SNPs and calculating composite likelihoods. We observed clear clustering of SNPs yielding a large CLR (Figure 5 for chromosome arm 2R), corresponding to *P* < 0.001 (see *Materials and Methods*), scattered over the five major chromosome arms. We consider each cluster as a footprint of a single episode of an incomplete selective sweep. (Other scattered and isolated sites that yield *P* < 0.001 but do not form clusters were not considered.) A SNP with the largest CLR within a cluster (the “peak”) is therefore a candidate position of ongoing selection. There are 42 clusters in total, using test option A, and we identified an annotated gene in FlyBase (version FB2014_03) containing or closest to the peak in each cluster (Table 2). Test options A and B generated very similar profiles of the CLR along the chromosome (Figure S6) and thus led to the detection of almost identical sets of candidate loci in each chromosome arm. When clusters are ranked according to *T*_{1} within each chromosome arm, ranks by options A and B are strongly correlated (Table 2). Upon visual inspection of aligned and sorted sequences, we observed clear segregating patterns of SNPs indicative of incomplete sweeps—far fewer polymorphisms and high linkage disequilibrium among chromosomes containing the derived allele compared to those containing the ancestral allele at the core SNP—at the majority of these candidate loci (Figure 5 and Figure S7).
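The scan logic described above (select core SNPs by derived allele frequency, group significant SNPs into clusters, take the SNP with the largest CLR in each cluster as the peak) can be sketched as follows. The clustering rule is not specified in this section, so the `max_gap` parameter, the function names, and the toy values are assumptions for illustration only.

```python
def core_snps(snps, low=0.35, high=0.8):
    """Keep SNPs whose derived allele frequency f satisfies low < f < high.

    snps: list of (position, derived_allele_frequency) tuples.
    """
    return [(pos, f) for pos, f in snps if low < f < high]


def cluster_significant(scores, max_gap):
    """Group significant SNPs into clusters and report the peak of each.

    scores:  list of (position, clr) for SNPs passing the significance cutoff.
    max_gap: hypothetical maximum distance (bp) between neighboring SNPs
             of the same cluster; not a value taken from the study.
    """
    clusters, current = [], []
    for pos, clr in sorted(scores):
        if current and pos - current[-1][0] > max_gap:
            clusters.append(current)
            current = []
        current.append((pos, clr))
    if current:
        clusters.append(current)

    result = []
    for members in clusters:
        if len(members) < 2:  # isolated significant sites are not considered
            continue
        peak_pos, peak_clr = max(members, key=lambda m: m[1])
        result.append({"start": members[0][0], "end": members[-1][0],
                       "peak": peak_pos, "peak_clr": peak_clr})
    return result


toy_scores = [(1000, 5.0), (1500, 9.0), (1800, 7.0),
              (50000, 6.0), (90000, 8.0), (90400, 8.5)]
toy_clusters = cluster_significant(toy_scores, max_gap=2000)
print(toy_clusters)
```

Each reported peak would then be mapped to the annotated gene containing or closest to it.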

The calculations above were performed using a uniform value of scaled mutation rate, θ, for each chromosome arm. To examine whether local variation in θ has an effect on the accuracy of inference, we performed the CLR test with the local value of θ calculated from a 10-kb window surrounding each core SNP. This procedure yielded almost the same profile of composite likelihood along the chromosome and the same list of selection candidates (data not shown), presumably because the ratio of composite likelihoods depends weakly on θ: change in θ appears to affect *L*_{IS} and *L*_{N} in Equation 1 to a similar degree.

Patterns similar to the outcome of incomplete selective sweeps may arise by a complete selective sweep: at an appropriate recombination distance from the position of the beneficial mutation that reached fixation, low variation and high frequency of derived alleles would be observed among chromosomes whose linkages to beneficial mutation were not broken by recombination. However, a normal level of variation will be observed among chromosomes that recombined away from the beneficial mutation. We therefore checked whether our candidate regions of incomplete selective sweeps overlap with those of complete selective sweeps in the RG sample detected by Pool *et al.* (2012) (343 regions listed in their table S13). Seventeen of our 42 clusters overlap with the candidate regions of complete sweeps (Table 2).

Next, we calculated *iHS* and *nS*_{L} statistics for the same data set for which CLR was obtained above. Even though corrections were made to address the complexity of data (missing base calls and incomplete inference of ancestral/derived alleles; see *Materials and Methods*), many sites yielding very negative *iHS* appear to be false positives because clear haplotype structures predicted under incomplete sweeps are not observed at those loci (Figure S8). On the other hand, sites yielding very negative *nS*_{L} are associated with a much clearer haplotype pattern. However, there are still cases of very unclear haplotype patterns detected by *nS*_{L} (Figure S8). We could identify clusters of negative *iHS* and those of negative *nS*_{L}, similar to clusters of large CLR above. However, the overall pattern of clustering for negative *iHS* or *nS*_{L} is not clear, whereas very distinct clusters of large CLR were observed (Figure 5). Many sites generated large negative *iHS* or *nS*_{L} by themselves without belonging to any cluster and we did not consider them as candidate loci under selection. We found that these isolated occurrences of large negative *iHS*/*nS*_{L} and other sites with large negative *iHS*/*nS*_{L} but without clear haplotype structure of incomplete sweep are associated with unusually small *iHH*_{A}. Namely, stochastic fluctuation in haplotype structure surrounding the ancestral allele appears to frequently generate false-positive signatures of selection captured by *iHS* or *nS*_{L}.

To examine whether the CLR, *iHS*, and *nS*_{L} methods detect common candidate loci under selection, we adjusted the *P*-value cutoff of *iHS* or *nS*_{L} for each chromosome arm so that the number of *iHS* or *nS*_{L} clusters matches that of the CLR in the same chromosome arm (Table S1 and Table S2). If a CLR cluster and an *iHS* or *nS*_{L} cluster are within 50 kb of each other, they are defined as overlapping candidates of selection. Of 25 CLR clusters that do not overlap with candidates of complete sweeps, 13 overlap with *iHS* clusters (Table 2). Ten of those 13 *iHS* clusters are also *nS*_{L} clusters, reflecting a very high level of overlap between the *iHS* and *nS*_{L} methods. There is only one case in which a CLR peak coincides with an *nS*_{L} peak but not with an *iHS* peak (excluding those overlapping with complete sweeps). Therefore, less than half of the CLR peaks (11 of 25) were also detected by the *nS*_{L} method. Visual inspection of haplotype structures indicates that candidate loci detected by all three methods tend to exhibit a much clearer pattern of incomplete sweeps than others (Figure S7). However, there are also loci detected by the CLR method only but with clear haplotype patterns (for example, near position 5,770,000 in 2R; Figure 5). We also identified a few peaks of negative *nS*_{L} with clear haplotype patterns not overlapping with CLR or *iHS* peaks (for example, near position 3,886,000 in 2R; Figure 5). However, such cases are exceptional: if an *nS*_{L} peak does not overlap with the CLR or *iHS* peaks, it is more likely to show unclear than clear haplotype patterns (Figure S8).
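The 50-kb overlap rule can be written compactly. The sketch below treats each cluster as a (start, end) interval and measures the gap between interval edges; whether the study measured distance between cluster edges or between peaks is not stated in this section, so that choice, the function names, and the toy coordinates are assumptions.

```python
def interval_gap(a, b):
    """Distance in bp between two (start, end) intervals; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def overlapping_clusters(clr_clusters, other_clusters, max_distance=50_000):
    """Return CLR clusters lying within max_distance of any cluster in other_clusters."""
    return [
        c for c in clr_clusters
        if any(interval_gap(c, o) <= max_distance for o in other_clusters)
    ]


# Toy coordinates (hypothetical), not clusters reported in Table 2.
clr = [(1_000_000, 1_020_000), (5_770_000, 5_780_000)]
other = [(1_060_000, 1_070_000), (3_886_000, 3_890_000)]
shared = overlapping_clusters(clr, other)
print(shared)
```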

## Discussion

We developed a composite-likelihood method for detecting incomplete selective sweeps and inferring the location and strength of positive selection from DNA sequence polymorphism. As this method is built on analytic approximations to sampling probabilities under an explicit model of the evolutionary process, hypothesis testing and parameter estimation can be performed systematically, for example, allowing the estimation of the strength of selection. This approach also has the potential to be extended to incorporate more complex scenarios of incomplete sweeps if the sampling probabilities can be obtained as functions of additional parameters. On the other hand, statistical methods aiming to capture the extended haplotype, such as the *iHS* and *nS*_{L} tests (Voight *et al.* 2006; Ferrer-Admetlla *et al.* 2014), have the advantage of requiring fewer assumptions about the evolutionary process to be inferred (*i.e.*, how directional selection occurs) and are also easier to implement and interpret. We thus compared the performance of our CLR method and the extended haplotype methods, using both simulated and actual sequence data.

Analysis of simulated data showed that our CLR approach achieves statistical power and accuracy in estimating the location of selection similar to those of the *nS*_{L} method (Table 1), but only under the assumption that the true scaled recombination rate of the genomic region is known when generating the null distribution by neutral simulation. If an erroneously low estimate of the scaled recombination rate is used for a genomic region under test, which is likely if an incomplete selective sweep left polymorphism with long-range LD, the statistical power to detect the sweep is greatly reduced because the cutoff value in the null distribution becomes larger. Such a large sensitivity of the CLR to the recombination rate (the level of linkage disequilibrium) is a major problem that needs to be addressed in future improvement of our approach. However, if the local recombination rate or map distance is well estimated in advance over a large genomic region (much larger than typical sizes of sweep-affected areas), the scaled recombination rate at a particular locus might be correctly inferred from observed polymorphism in the neighboring regions, given that LD over a large region is much less affected by local fluctuation, for example by selection. Namely, generating the null distribution with neutral simulation that yields the observed level of LD in the data under test, as we suggested to correct for the effect of an unknown recombination rate, might be an unnecessarily conservative test if the observed LD is definitely unusual (*i.e.*, higher) compared to that in neighboring regions.

A related problem due to the sensitivity of our statistic to the level of linkage disequilibrium is the increased chance of detecting false-positive incomplete sweeps in the presence of nonstandard demography (Figure S4). Because various demographic processes can inflate the level of LD throughout the genome, which upwardly shifts the distribution of *T*_{1} in the absence of selection, obtaining the null distribution under the assumption of the standard neutral model can lead to erroneous detections of sweeps. Again, if the nature of (complex) demography affecting the data is not known, the false-positive detection might be controlled by the null distribution from simulated samples under the standard neutral model but adjusted to exhibit the level of LD observed in the data.

A more important result in the comparison between the CLR and *iHS*/*nS*_{L} tests is that their performances are rather complementary to each other, as their outcomes are not so strongly correlated, especially for weak selection (α = 1000; Figure 3). It is probably because the two methods are designed to detect slightly different footprints of incomplete selective sweeps. Our method primarily captures joint frequency spectra at linked neutral loci for the two subsamples divided according to the *S* locus (Figure 2), whereas the *iHS* and *nS*_{L} methods target the extended haplotype homozygosity, although these two signatures are obviously closely related through the reduction of polymorphism surrounding the putative beneficial allele.

As it was not as feasible to evaluate statistical significance of CLR tests by generating appropriate null distributions for a large number of genomic regions in *D. melanogaster*, we applied the CLR and *iHS*/*nS*_{L} methods as outlier detection approaches. We evaluated the relative performance of the three methods by obtaining similar numbers of outliers (candidate loci) for each chromosome arm and visually inspecting haplotype structures surrounding the putative sites under selection. In general, the clearest haplotype patterns of incomplete selective sweeps were obtained when the loci were detected by all three methods. Candidates detected only by our CLR method exhibited relatively clean patterns compared to those detected by the *iHS* or *nS*_{L} method (Figure 5, Figure S7, and Figure S8). Again this can be attributed to the gain of additional information from DNA sequence polymorphism in the CLR approach. Visual inspection also suggests that many false positives are detected by *iHS* because extended homozygosity surrounding the ancestral allele of the core SNP can be randomly reduced to very small values. Namely, while *iHH*_{D} captures the hitchhiking effect of the beneficial allele, stochastic fluctuation of *iHH*_{A} greatly increases the variance of *iHH*_{A}/*iHH*_{D}. In addition, if a small number, say *n*′, of sequences containing the derived allele of focal SNP are highly homozygous (*e.g.*, hidden identity by descent) by chance while the other *n*_{1} − *n*′ sequences are heterozygous at the normal level, it can lead to a very large *iHH*_{D}. Our approach is not affected by such problems, as our CLR does not simply depend on differences in the levels of variation between the two subsamples of data but compares neutral *vs.* selective scenarios as potential explanations for the subdivided pattern of polymorphism. 
The stochastic fluctuation of SNP density in the ancestral block appears to be less of a problem for *nS*_{L} than for *iHS*, given that much clearer haplotype structures are detected by *nS*_{L} than by *iHS*, probably because it does not use genetic map distance but the number of intervening SNPs for measuring the size of the extended haplotype.

As population genomic data are obtained predominantly by NGS platforms, missing or low-quality base calls in data may greatly affect the performance of evolutionary inferences from DNA sequence polymorphism. It is straightforward to calculate sampling probability under both neutral and selective hypotheses given the configuration of missing bases at each site in the data. Therefore, our CLR approach can be applied to data with an arbitrary frequency of missing bases without systematic problems. On the other hand, it is not clear how to handle missing bases in quantifying the extended homozygosity for the *iHS* or *nS*_{L} test. We skipped the site containing a missing base in calculating the extension of homozygosity for a pair of sequences because clear haplotype structure of an incomplete sweep could not be identified otherwise. It is not clear how this procedure would affect the performance of the *iHS* test.

In conclusion, we proposed a composite-likelihood method for detecting incomplete selective sweeps and demonstrated that it achieves improvements in parameter estimation and ability to capture clear haplotype patterns compatible with incomplete sweeps compared to long-range haplotype tests. Although it has the disadvantage of not being robust to uncertainty in scaled recombination rates and complex demography, our composite-likelihood ratio provides information that is not captured by an advanced haplotype-based method using *nS*_{L}. We thus recommend that both CLR and *nS*_{L} be used together to maximize the chance of detecting true targets of selection. As incomplete selective sweeps, compared to complete sweeps, provide excellent opportunities to estimate the strength and location of selection owing to the presence of ancestral polymorphism in the data, these methods will contribute to broadening our understanding of adaptive evolution in nature. In the framework of the likelihood-ratio test, we may conceive extensions of this approach to study further details of incomplete selective sweeps beyond simple confirmation of positive selection and basic parameter estimation. For example, a recent analysis predicted that many beneficial mutations are likely to stall at intermediate frequencies due to heterozygote advantage (Sellis *et al.* 2011). If this process generates sampling probabilities distinct from those left by simple directional selection with incomplete dominance, we may detect it under the current framework of the composite-likelihood test.

## Acknowledgments

This research was supported by the Global Top5 Grant of Ewha Womans University 2013 and the National Research Foundation of Korea grant 2012R1A1A2004932 (to Y.K.).

## Appendix

### Derivation of the joint sampling probabilities

We consider a constant-sized population of *N* diploid individuals that reproduce in discrete generations according to the Wright–Fisher model, thus equivalent to a population of 2*N* haploid individuals. Assume that mutation to a beneficial allele occurred at position *x* of a chromosome at time *T* = τ (generations counted backward in time) in the past. At the time of sampling (*T* = 0), this mutant allele reaches an intermediate frequency β in the population. A random sample of *n* chromosomes is assumed to contain *n*_{1} and *n*_{2} = *n* − *n*_{1} copies of the beneficial and the ancestral allele, respectively, that define the corresponding partition of the sample into two subsamples as illustrated in Figure 1. Let *k*_{1} and *k*_{2} be the counts of the derived allele in the respective subsamples at a neutrally evolving site at position *x* − *d* or *x* + *d*. The probability of observing *k*_{1} and *k*_{2} jointly is given by Equation A1, an integral over the frequency *p* of the derived allele at the time of the beneficial mutation (*T* = τ), in which *f*_{0}(*p*) is the probability density of that frequency. One sampling term inside the integral, for the ancestral background, is the probability of sampling *k*_{2} derived alleles in a sample of *n*_{2} chromosomes in a neutrally evolving population in which the frequency of the allele drifted for τ generations starting from *p*. During the course of a selective sweep, the deterministic change of the linked neutral allele frequency among chromosomes carrying the ancestral allele at the *S* locus (frequency *p*_{A} in the “ancestral background”) is predicted to be small (Stephan *et al.* 1992; Meiklejohn *et al.* 2004). A moderate deterministic change in *p*_{A} occurs while β < 0.8, the range to which our method applies (Figure S9). We, however, ignore this change. We also ignore the change of allele frequency by genetic drift in the ancestral background and obtain Equation A2. Namely, we assume that the subsample of *n*_{2} chromosomes effectively captures the ancestral polymorphism at the time of beneficial mutation.

Next, the other sampling term, for the beneficial background, is the probability of observing *k*_{1} copies of the derived allele at distance *d* from the beneficial mutation in the subsample of *n*_{1} sequences that carry the beneficial allele. Strictly, this probability must be a function of the frequency of the beneficial allele at the time of sampling. However, as the frequency of the neutral allele among chromosomes carrying the beneficial allele (*i.e.*, in the “beneficial background”) is known to change drastically only at the early stage of hitchhiking, when the frequency of the beneficial allele is low, and then to change little until the fixation of the beneficial allele (Stephan *et al.* 1992), we approximate it by the sampling probability for the case of a complete selective sweep. We multiply these two sampling probabilities inside the integral of Equation A1, assuming that the frequency of linked neutral alleles in the beneficial background is distributed independently of possible stochastic change in allele frequency in the ancestral background and that chromosomes are sampled independently in the two genetic backgrounds. In reality, the “migration” of lineages by recombination during the selective sweep may cause correlated stochastic changes of allele frequencies in the two backgrounds. However, we ignore such complications, as the stochastic fluctuation of *p* in the ancestral background by genetic drift is ignored in the first place (see above).

Nielsen *et al.* (2005) and Etheridge *et al.* (2006) provided approximate solutions that allow the derivation of the above sampling probability as a function of neutral allele frequency, *p*, at the time of the beneficial mutation. Using a star-like genealogy approximation, Nielsen *et al.* (2005) obtained the probability of observing *k*_{1} derived alleles at the neutral locus in the sample of *n*_{1} chromosomes after a selective sweep (Equation A3). This probability combines two terms: the probability that *k* of *n* distinct ancestral lineages at *T* = τ carry the derived mutant allele, and the probability that *k* of *n* lineages at *T* = 0 escape the sweep by recombining away from the beneficial allele, the latter being determined by the escape probability per lineage.

Substituting Equations A2 and A3 into Equation A1, we obtain Equation A4, which depends on the probability of having *k* derived alleles in a sample of *n* chromosomes at time τ. This probability can be given by assuming that the population at this time is under neutral equilibrium, or by the proportion of polymorphic sites with *k* derived alleles in the data, namely assuming that the distribution of the derived allele frequency at time τ is identical to that observed at present. The latter approach of using the empirical frequency spectrum was suggested by Nielsen *et al.* (2005) to correct for nonequilibrium demography. These two approximations are the bases of CLR test options A and B, respectively.
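For the neutral-equilibrium option, a standard result is that binomial sampling of *n* chromosomes from a population whose derived allele frequency has density *f*_{0}(*p*) = θ/*p* yields an expected count of θ/*k* sites with *k* derived alleles. The sketch below checks this numerically with a midpoint rule; the function name and the parameter values are assumptions for illustration, and the code is not part of the original derivation.

```python
from math import comb


def neutral_sfs_entry(n, k, theta, steps=20_000):
    """Integrate Binomial(k; n, p) * theta / p over p in (0, 1) by the midpoint rule."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        p = (i + 0.5) * width
        # Binomial(k; n, p) * theta / p, with one power of p cancelled analytically.
        total += comb(n, k) * p ** (k - 1) * (1 - p) ** (n - k) * theta
    return total * width


n, theta = 10, 2.0  # hypothetical values chosen only for this check
sfs = {k: neutral_sfs_entry(n, k, theta) for k in range(1, n)}
for k, value in sfs.items():
    print(k, round(value, 6), round(theta / k, 6))
```

Test option B would replace these θ/*k* values with the proportions of polymorphic sites having *k* derived alleles in the data.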

Alternatively, we may derive the sampling probability from the work of Etheridge *et al.* (2006), which showed that *n* lineages at a linked neutral locus sampled at the time of a beneficial allele’s fixation are divided into three parts: *l* late recombinants, *e* early recombinants, and *n* − *l* − *e* nonrecombinants. Given the selection coefficient *s* and recombination rate *r*, the joint distribution of *l* and *e*, *P*(*l*, *e*), follows equation 2.7 of Etheridge *et al.* (2006). However, this result in terms of genealogical structure needs to be translated into sampling probability by considering the transmission of mutant alleles along the lineages. The probability of sampling *k* derived alleles can be obtained separately in the following four cases.

First, consider the case in which the beneficial allele appears on a chromosome carrying the derived allele at the neutral locus and the ancestor of early recombinants carries the ancestral allele. The sample then contains at least *n* − *l* − *e* derived alleles and at least *e* ancestral alleles. In addition, assume that among the *l* late recombinants there are *l*_{d} derived alleles and *l* − *l*_{d} ancestral alleles. Then, the total number of derived alleles in the sample is *k* = *n* – *e* – (*l* – *l*_{d}). Since *l*_{d} = *l* – (*n* – *e* – *k*), the probability for this case is given by Equation A5, where *p* is the initial frequency of the derived allele before hitchhiking. The case in which the ancestor of early recombinants instead carries the derived allele gives Equation A6. Next, the beneficial mutation is assumed to appear on a chromosome carrying the ancestral allele at the neutral locus. The probabilities that there are *k* derived alleles in the sample if the ancestor of early recombinants carries the ancestral and the derived allele are given by Equations A7 and A8, respectively. Since these cases are mutually exclusive, the final sampling probability for a complete selective sweep is obtained after the above probabilities are weighted accordingly (Equation A9). Using Equations A2 and A9, Equation A1 is now turned into our second approximation.
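The four-case bookkeeping can be illustrated with a short sketch. It assumes that the chromosome on which the beneficial mutation arose, the single ancestor of the early recombinants, and each late recombinant independently carry the derived allele with probability *p*; it also takes the joint distribution *P*(*l*, *e*) as a user-supplied table instead of evaluating equation 2.7 of Etheridge *et al.* (2006). These modeling choices, the function name, and the toy table are assumptions, so the code should be read as one way the cases might be combined, not as a transcription of Equations A5 to A9.

```python
from math import comb


def sweep_sampling_prob(k, n, p, joint_le):
    """Probability of k derived alleles among n lineages after a complete sweep.

    joint_le: dict mapping (l, e) to P(l, e), with l late recombinants and
              e early recombinants (a hypothetical stand-in table).
    """
    total = 0.0
    for (l, e), weight in joint_le.items():
        nonrec = n - l - e  # lineages that never recombined off the sweeping haplotype
        # (probability of the case, derived alleles fixed by the case)
        cases = [
            (p * (1 - p), nonrec),        # sweep on derived, early ancestor ancestral
            (p * p, nonrec + e),          # sweep on derived, early ancestor derived
            ((1 - p) * (1 - p), 0),       # sweep on ancestral, early ancestor ancestral
            ((1 - p) * p, e),             # sweep on ancestral, early ancestor derived
        ]
        for case_prob, fixed in cases:
            l_d = k - fixed  # derived alleles that must come from late recombinants
            if 0 <= l_d <= l:
                total += (weight * case_prob
                          * comb(l, l_d) * p**l_d * (1 - p) ** (l - l_d))
    return total


# Toy P(l, e) table (hypothetical values summing to 1).
toy_le = {(0, 0): 0.6, (1, 0): 0.2, (0, 2): 0.1, (1, 2): 0.1}
probs = [sweep_sampling_prob(k, n=5, p=0.3, joint_le=toy_le) for k in range(6)]
print(probs, sum(probs))
```

As a sanity check, when no lineage recombines the sketch returns *p* for *k* = *n* and 1 − *p* for *k* = 0, as expected for a sweep that carries a single neutral allele to fixation in the sample.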

## Footnotes

*Communicating editor: R. Nielsen*

Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.175380/-/DC1.

- Received February 5, 2015.
- Accepted April 23, 2015.

- Copyright © 2015 by the Genetics Society of America