# Clear: Composition of Likelihoods for Evolve and Resequence Experiments

^{*}Department of Electrical and Computer Engineering, University of California, San Diego, California 92093^{†}Institut für Populationsgenetik, Veterinärmedizinische Universität Wien, 1210 Vienna, Austria^{‡}Department of Computer Science and Engineering, University of California, San Diego, California 92093

- 1Corresponding author: EBU3 Room 4250, Department of Electrical and Computer Engineering, University of California, San Diego, 9500 Gilman Dr., La Jolla, CA 92093. E-mail: airanmehr{at}gmail.com

## Abstract

The advent of next generation sequencing technologies has made whole-genome and whole-population sampling possible, even for eukaryotes with large genomes. With this development, experimental evolution studies can be designed to observe molecular evolution “in action” via evolve-and-resequence (E&R) experiments. Among other applications, E&R studies can be used to locate the genes and variants responsible for genetic adaptation. Most existing literature on time-series data analysis often assumes large population size, accurate allele frequency estimates, or wide time spans. These assumptions do not hold in many E&R studies. In this article, we propose a method—composition of likelihoods for evolve-and-resequence experiments (Clear)—to identify signatures of selection in small population E&R experiments. Clear takes whole-genome sequences of pools of individuals as input, and properly addresses heterogeneous ascertainment bias resulting from uneven coverage. Clear also provides unbiased estimates of model parameters, including population size, selection strength, and dominance, while being computationally efficient. Extensive simulations show that Clear achieves higher power in detecting and localizing selection over a wide range of parameters, and is robust to variation of coverage. We applied the Clear statistic to multiple E&R experiments, including data from a study of adaptation of *Drosophila melanogaster* to alternating temperatures and a study of outcrossing yeast populations, and identified multiple regions under selection with genome-wide significance.

- experimental evolution
- selection
- genetic drift
- time-series data
- hidden Markov model
- Wright–Fisher process

NATURAL selection is a key force in evolution, and a mechanism by which populations can adapt to external “selection” pressure. Examples of adaptation abound in the natural world (Fan *et al.* 2016), including classic examples like lactose tolerance in Northern Europeans (Bersaglieri *et al.* 2004), human adaptation to high altitudes (Simonson *et al.* 2010; Yi *et al.* 2010), but also drug resistance in pests (Daborn *et al.* 2001), HIV (Feder *et al.* 2016), cancer (Gottesman 2002; Zahreddine and Borden 2013), malarial parasite (Nair *et al.* 2007; Ariey *et al.* 2014), and others (Spellberg *et al.* 2008). In these examples, understanding the genetic basis of adaptation can provide valuable information, underscoring the importance of the problem.

Experimental evolution refers to the study of the evolutionary processes of a model organism in a controlled (Hegreness *et al.* 2006; Bollback and Huelsenbeck 2007; Barrick *et al.* 2009; Lang *et al.* 2011; Orozco-ter Wengel *et al.* 2012; Lang *et al.* 2013; Oz *et al.* 2014) or natural (Barrett *et al.* 2008; Reid *et al.* 2011; Denef and Banfield 2012; Winters *et al.* 2012; Daniels *et al.* 2013; Maldarelli *et al.* 2013; Bergland *et al.* 2014) environment. Recent advances in whole-genome sequencing have enabled us to sequence populations at a reasonable cost, even for large genomes. Perhaps more important for experimental evolution studies, we can now evolve and resequence (E&R) multiple replicates of a population to obtain *longitudinal time-series data*, to investigate the dynamics of evolution at the molecular level. Although constraints such as small sizes, limited timescales, and oversimplified laboratory environments may limit the interpretation of E&R results, these studies are increasingly being used to test a wide range of hypotheses (Kawecki *et al.* 2012) and have been shown to be more predictive than static data analysis (Sawyer and Hartl 1992; Boyko *et al.* 2008; Desai and Plotkin 2008). In particular, longitudinal E&R data are being used to estimate model parameters including population size (Pollak 1983; Waples 1989; Williamson and Slatkin 1999; Wang 2001; Terhorst *et al.* 2015; Jónás *et al.* 2016), strength of selection (Bollback *et al.* 2008; Illingworth and Mustonen 2011; Illingworth *et al.* 2012; Malaspinas *et al.* 2012; Mathieson and McVean 2013; Steinrücken *et al.* 2014; Terhorst *et al.* 2015), allele age (Malaspinas *et al.* 2012), recombination rate (Terhorst *et al.* 2015), mutation rate (Barrick and Lenski 2013; Terhorst *et al.* 2015), quantitative trait loci (Baldwin-Brown *et al.* 2014), and for tests of neutrality hypotheses (Burke *et al.* 2010; Bergland *et al.* 2014; Feder *et al.* 2014; Terhorst *et al.* 2015).

While many E&R study designs are being used (Barrick and Lenski 2013; Schlötterer *et al.* 2015), we restrict our attention to the adaptive evolution due to standing variation in fixed-size populations. This regime has been considered earlier, typically with *Drosophila melanogaster* as the model organism of choice, to identify adaptive genes in longevity and aging (600 generations) (Burke *et al.* 2010; Remolina *et al.* 2012), courtship song (100 generations) (Turner *et al.* 2011), hypoxia tolerance (200 generations) (Zhou *et al.* 2011), adaptation to new laboratory environments (59 generations) (Orozco-ter Wengel *et al.* 2012; Franssen *et al.* 2015), egg size (40 generations) (Jha *et al.* 2015), C-virus resistance (20 generations) (Martins *et al.* 2014), and dark-fly (49 generations) (Izutsu *et al.* 2015).

The task of identifying selection signatures can be addressed at different levels of specificity. At the coarsest level, identification could simply refer to deciding whether some genomic region (or a gene) is under selection or not. In the following, we refer to this task as *detection*. In contrast, the task of *site identification* corresponds to the process of finding the favored mutation/allele at the nucleotide level. Finally, *estimation of model parameters*, such as strength of selection and dominance at the site, can provide a comprehensive description of the selection process.

In the effort to analyze E&R selection experiments, many authors chose to adapt existing tests that were originally used for static data, pairwise comparisons (two time points), and single replicates to perform a null scan. For instance, Zhou *et al.* (2011) used the ratio of the estimated population size of case and control populations to compute a test statistic for each genomic region. Burke *et al.* (2010) applied the Fisher exact test to the last observation of data on case and control populations. Orozco-ter Wengel *et al.* (2012) used the Cochran–Mantel–Haenszel (CMH) test (Agresti and Kateri 2011) to detect SNPs whose read counts change consistently across all replicates of two time-point data. Turner *et al.* (2011) proposed the diffstat statistic to test whether the change in allele frequencies of two populations deviate from the distribution of change in allele frequencies of two drifting populations. Bergland *et al.* (2014) calculated to populations throughout time to signify their differentiation from ancestral (two time-point data) as well as geographically different populations. Jha *et al.* (2015) computed a test statistic of generalized linear-mixed model directly from read counts.

Alternatively, *direct* methods have been developed to analyze time-series data by taking a likelihood approach, and estimating population genetics parameters. Bollback *et al.* (2008) proposed a hidden Markov model (HMM) to estimate the selection coefficient *s* and population size by using a diffusion approximation to the Wright–Fisher process. Steinrücken *et al.* (2014) proposed a general diploid selection model which takes into account dominance of the favored allele and approximates likelihood analytically. Recently, Schraiber *et al.* (2016) proposed a Bayesian framework to estimate parameters using Markov chain Monte Carlo sampling. Mathieson and McVean (2013) adopted HMMs to structured populations and estimated parameters using an expectation maximization procedure on a discretized allele frequency. Feder *et al.* (2014) modeled increments in allele frequency with a Brownian motion process, proposed as the frequency increment test (FIT). More recently, Topa *et al.* (2015) proposed a Gaussian process (GP) for modeling single-locus, time-series, whole-genome sequencing of pools of individuals (pool-seq) data. Terhorst *et al.* (2015) extended GP to compute joint likelihood of multiple loci under null and alternative hypotheses. Finally, Levy *et al.* (2015) proposed a Bayesian model to handle sequencing, amplification, and growth noise in a large population of barcoded lineages.

Among the methods specifically designed for time-series data, many make assumptions which may not hold in E&R studies. One common assumption is that the underlying population size is large, so it is reasonable to model dynamics of allele frequencies using continuous-state models (Bollback *et al.* 2008; Feder *et al.* 2014; Terhorst *et al.* 2015). Second, many existing methods were originally designed to process the wider time spans seen in ancient DNA studies, an assumption that does not hold for E&R experiments (Steinrücken *et al.* 2014; Schraiber *et al.* 2016). Finally, many E&R analysis tools assume that allele frequencies in the input data are unbiased (*e.g.*, Bollback *et al.* 2008), which may not be valid for shotgun sequencing experiments.

Here, we consider an HMM, similar to Williamson and Slatkin (1999) and Bollback *et al.* (2008) but under a “small-population-size” regime. Specifically, we use a discrete state (frequency) model. We show that for small population sizes, discrete models can compute likelihood exactly, which improves statistical performance, especially for short time-span experiments. Additionally, we add another level of sampling noise to the traditional HMM model, allowing for heterogeneous ascertainment bias due to uneven coverage among variants. We show that for a wide range of parameters, Clear provides higher power for detecting selection, estimates model parameters consistently, and localizes the favored allele more accurately compared to the state-of-the-art methods, while being computationally efficient.

## Materials and Methods

Consider a panmictic diploid population with fixed size of *N* individuals. Let be the frequencies of the derived allele at generations for a given variant, where at generations samples of *n* individuals are chosen for pooled sequencing. The experiment is replicated *R* times. We denote allele frequencies of the *R* replicates by the set To identify the genes and variants that are responding to selection pressure, we use the following procedure:

Estimating population size: The procedure starts by estimating the effective population size, under the assumption that much of the genome is evolving neutrally.

Estimating selection parameters: For each polymorphic site, selection and dominance parameters are estimated so as to maximize the likelihood of the time-series data, given

Computing likelihood statistics: For each variant, a log-odds ratio of the likelihood of selection model to the likelihood of neutral evolution/drift model is computed. Likelihood ratios in a genomic region are combined to compute the Clear statistic for the region.

Hypothesis testing: An empirical null distribution of the Clear statistic is calculated using genome-wide drift simulations, and used to compute

*P*-values and thresholds for a specified false discovery rate (FDR). We perform single-locus hypothesis testing within selected regions to identify significant variants and report genes that intersect with the selected variants.

These steps are described in detail below.

### Estimating population size

Methods for estimating population sizes from temporal neutral evolution data have been developed (Williamson and Slatkin 1999; Anderson *et al.* 2000; Bollback *et al.* 2008; Terhorst *et al.* 2015; Jónás *et al.* 2016). Here, we aim to extend these models to explicitly model the sampling noise that arise in pool-seq data. Specifically, we model the variation in sequence coverage over different locations, and the noise due to sequencing only a subset of the individuals in the population. In addition, many existing methods (Bollback *et al.* 2008; Feder *et al.* 2014; Terhorst *et al.* 2015; Topa *et al.* 2015) are designed for large populations, and model frequency as a continuous quantity. We observed that using Brownian motion to model frequency drift may be inadequate for small populations, low starting frequencies, and sparse sampling (in time); factors that are common in experimental evolution (see *Results*, Figure 3, A–C, and Figure 2). To this end, we model the Wright–Fisher Markov process for generating pool-seq data (Supplemental Material, Figure S1 in File S3) via a discrete HMM (Figure 1B). We start by computing a likelihood function for the population size given neutral pool-seq data.

#### Likelihood for the neutral model:

We model the allele frequency counts as being sampled from a binomial distribution. Specifically,where *π* is the global distribution of allele frequencies in the base population. Note that *π* depends on the demographic history of the founder lines and can be estimated from the site-frequency spectrum (see Figure S2 in File S3) of the initial population. For notational convenience, henceforth we omit the dependence of likelihoods to the parameter *π*.

To estimate frequency after *τ* transitions, it is enough to specify the transition matrix where denotes probability of change in allele frequency from to in *τ* generations:(1)(2)Furthermore, in an E&R experiment, individuals are randomly selected for sequencing. The sampled allele frequencies, are also binomially distributed:(3)We introduce the sampling matrix *Y*, where stores the probability that the sample allele frequency is given that the true allele frequency is

We denote the pool-seq data for that variant as where represent the coverage and the read count of the derived allele, respectively. Let be the sequencing coverage at different generations. Then, the observed data are sampled according to(4)The emission probability for an observed tuple is(5)For let denote the probability of emitting and reaching state *j* at Then, can be computed using the forward procedure (Durbin *et al.* 1998):(6)where The joint likelihood of the observed data from *R* independent observations is given by(7)where The graphical model and the generative process for which data are being generated is depicted in Figure 1B and Figure S1 in File S3, respectively.

Finally, the last step is to compute an estimate that maximizes the likelihood of all *M* variants in the whole genome. Let denote the time-series data of the *i*th variant in replicate *r*. Then,

### Estimating selection parameters

#### Likelihood for the selection model:

Assume that the site is evolving under selection constraints and where *s* and *h* denote selection strength and dominance parameters, respectively. By definition, the relative fitness values of genotypes 0|0, 0|1, and 1|1 are given by and Then, the frequency at time (one generation ahead), can be estimated using:(9)The machinery for computing likelihood of the selection parameters is identical to that of population size, except for transition matrices. Hence, here we only describe the definition transition matrix of the selection model. Let denote the probability of transition from to in *τ* generations, then (see Ewens 2004, p. 24, equations 1.58–1.59):(10)(11)The maximum likelihood estimates are given by(12)Using grid search, we first estimate *N* (Equation 8), and subsequently, we estimate parameters *s* and *h* (Equation 12, Figure S3 in File S3). By broadcasting and vectorizing the grid search operations across all variants, the genome scan on millions of polymorphisms can be done in a significantly smaller time than iterating a numerical optimization routine for each variant (see *Results* and Figure 6).

### Empirical likelihood-ratio statistics

The likelihood-ratio statistic for testing directional selection, to be computed for each variant, is given by(13)where Similarly, we can define a test statistic for testing if selection is dominant by(14)While extending the single-locus Wright–Fisher model to multiple linked loci can improve the power of the model (Terhorst *et al.* 2015), it is computationally and statistically expensive to compute exact likelihood. In addition, computing linked-loci joint likelihood requires haplotype resolved data, which pool-seq does not provide. Here, similar to Nielsen *et al.* (2005), we calculate the *composite-likelihood-ratio* score for a genomic region,(15)where *L* is a collection of segregating sites and is the likelihood-ratio score based for each variant in *L*. The optimal value of the hyper-parameter *L* depends upon a number of factors, including initial frequency of the favored allele, recombination rates, linkage of the favored allele to neighboring variants, population size, coverage, and time since the onset of selection (duration of the experiment). In File S3, we provide a heuristic to compute a reasonable value of *L*, based on experimental data.

We work with a normalized value of given by(16)where and are the mean and standard deviation of values in a large region We found different chromosomes have different distributions of values, and therefore decided to use single chromosomes as

### Hypothesis testing

#### Single-locus tests:

Under neutrality, log-likelihood ratios can be approximated by distribution (Williams and Williams 2001), and *P*-values can be computed directly. However, Feder *et al.* (2014) showed that when the number of independent samples (replicates) is small, is a crude approximation to the true null distribution and results in more false positives. Following their suggestion, we first compute the empirical null distribution using simulations with the estimated population size (see Figure S1 in File S3). The empirical null distribution of statistic *H* is used to compute *P*-values as the fraction of null values that exceed the test score. Finally, we use the method of Storey and Tibshirani (2003) to control for FDR in multiple testing.

#### Composite likelihood tests:

Similar to single-locus tests, we compute the null distribution of the statistic using whole-genome simulations with the estimated population size, and subsequently compute the FDR. The simulations for generating the null distribution of are described next.

### Simulations

We use the same simulation procedure for two purposes. First, we use the simulations to test the power of Clear against other methods in small genomic windows. Second, we use them to generate the distribution of null values for the statistic to compute empirical *P*-values. We mainly chose parameters that are relevant to *D. melanogaster* experimental evolution (Kofler and Schlötterer 2013). See also Figure 1A for illustration.

Creating initial founder line haplotypes: Using msms (Ewing and Hermisson 2010), we created neutral populations for

*F*founding haplotypes with command $./msms 〈*F*〉 1 −*t*〈2*μWN*〉 −_{o}*r*〈2*rWN*〉 〈_{o}*W*〉, where is the number of founder lines, is the effective founder population size, is the recombination rate, and is the mutation rate. The window size*W*is used to compute and We chose*W*= 50 kbp for simulating individual windows for performance evaluations, and*W*= 20 Mbp for simulating*D. melanogaster*chromosomes for*P*-value computations.Creating initial diploid population: An initial set of haplotypes was created from step 1, and duplicated to create

*F*homozygous diploid individuals to simulate generation of inbred lines.*N*diploid individuals were generated by sampling with replacement from the*F*individuals.Forward simulation: We used forward simulations for evolving populations under selection. We also consider selection regimes in which the favored allele is chosen from standing variation (not

*de novo*mutations). Given initial diploid population, position of the site under selection, selection strength*s*, number of replicates recombination rate and sampling times simuPop (Peng and Kimmel 2005) was used to perform the forward simulation and compute allele frequencies for all of the*R*replicates. For hard sweep (respectively, soft sweep) simulations we randomly chose a site with an initial frequency of (respectively, to be the favored allele. For generating the null distribution with drift for*P*-value computations, we used this procedure withSequencing simulation: Given allele frequency trajectories we sampled depth of each site in each replicate identically and independently from Poisson (

*λ*), where is the coverage for the experiment. Once depth*d*is drawn for the site with frequency*ν*, the number of reads*c*carrying the derived allele are sampled according to binomial For experiments with finite depth the tuple is the input data for each site.

### Data availability

The source code and running scripts for Clear are publicly available at https://github.com/airanmehr/clear.

*D. melanogaster* data was originally published (Orozco-ter Wengel *et al.* 2012; Franssen *et al.* 2015). The data set of the *D. melanogaster* study, until generation 37, is obtained from the Dryad digital repository (http://datadryad.org) under accession DOI: 10.5061/dryad.60k68. Generation 59 of the *D. melanogaster* study is accessed from the European Sequence Read Archive (http://www.ebi.ac.uk/ena/) under the project accession number PRJEB6340. The data set containing the experimental evolution of yeast populations (Burke *et al.* 2014) is downloaded from http://wfitch.bio.uci.edu/∼tdlong/PapersRawData/BurkeYeast.gz (last accessed January 24, 2017). University of California, Santa Cruz browser tracks for *D. melanogaster* and yeast data analysis are found in File 1 and File 2, respectively.

## Results

### Modeling allele frequency trajectories in small populations

We first tested the goodness of fit of the discrete *vs.* Brownian motion (a continuous-state model) in modeling allele frequency trajectories, under general E&R parameters. For this purpose, we conducted 100 K simulations with two time samples where is the parameter controlling the density of sampling in time. In addition, we repeated simulations for different values of starting frequency (*i.e.*, hard and soft sweep) and selection strength (*i.e.*, neutral and selection). Then, given initial frequency we computed the expected distribution of the frequency of the next sample under two models to make a comparison. Figure 2, A–F, shows that Brownian motion (continuous model) is inadequate when is far from 0.5, or when sampling times are sparse If the favored allele arises from standing variation in a neutral population, it is unlikely to have a frequency close to 0.5, and the starting frequencies are usually much smaller (see Figure S2 in File S3). Moreover, in typical *D. melanogaster* experiments, for example, sampling is sparse. Often, the experiment is designed so that (Zhou *et al.* 2011; Orozco-ter Wengel *et al.* 2012; Kofler and Schlötterer 2013; Franssen *et al.* 2015).

In contrast to the Brownian motion approximation, discrete Markov chain predictions (Equation 11) are highly consistent with empirical data for a wide range of simulation parameters (Figure 2, A–M). Moreover, the discrete Markov chain can be modified to model the case when the the allele is under selection.

### Detection power

We compared the performance of Clear against other methods for detecting selection. For each method we calculated detection power as the percentage of true positives identified with a false-positive rate For each configuration (specified with values for selection coefficient *s*, starting allele frequency and coverage *λ*), the power of each method is evaluated over 2000 distinct simulations, half of which modeled neutral evolution and the rest modeled positive selection.

We compared the power of Clear with GP (Terhorst *et al.* 2015), FIT (Feder *et al.* 2014), and CMH (Agresti and Kateri 2011) statistics. FIT and GP convert read counts to allele frequencies prior to computing the test statistic. Clear shows the highest power in all cases and the power stays relatively high even for low coverage (Figure 3 and Table S1 in File S3). In particular, the difference in performance of Clear with other methods is pronounced when starting frequency is low. The advantage of Clear stems from the fact that the favored allele with a low starting frequency might be missed by low coverage sequencing. In this case, incorporating the signal from linked sites becomes increasingly important. We note that methods using only two time points, such as CMH, do relatively well for high selection values and high coverage. However, the use of time-series data can increase detection power in low coverage experiments or when the starting frequency is low. Moreover, time-series data provide means for estimating selection parameters *s* and *h* (see below). Finally, as Clear is robust to change of coverage, our results (Figure 3, B and C) suggest that taking many samples with lower coverage is preferable to sparse sampling with higher coverage. For comparison purposes, we also tested Clear using the single-locus statistic For the most part, Clear showed an improvement over other methods even with or showed similar performance. The performance improved with higher *L*.

### Site identification

In general, localizing the favored variant using pool-seq data is a nontrivial task due to extensive linkage disequilibrium (LD) (Tobler *et al.* 2014). To measure performance, we sorted variants by their *H* scores and computed rank of the favored allele for each method. For each setting of and *s*, we conducted 1000 simulations and computed the rank of the favored mutation in each simulation. The cumulative distribution of the rank of the favored allele in 1000 simulations for each setting (Figure 4) shows that Clear outperforms other statistics.

An interesting observation is revisiting the contrast between site identification and detection (Long *et al.* 2013; Tobler *et al.* 2014). When selection strength is high, detection is easier (Figure 3, A–F), but site identification is harder, due to the high LD between flanking variants and the favored allele (Figure 4, A–F). Moreover, site identification becomes more difficult whenever the initial frequency of the favored allele is low; *i.e.*, at the onset of selection, LD between a favored allele and its nearby variants is high. For example, when coverage and selection coefficient the detection power is 75% for hard sweep, but 100% for soft sweep (Figure 3, B–E). In contrast, the favored site was ranked as the top in 14% of hard sweep cases, compared to 95% of soft sweep simulations.

### Estimating parameters

Clear estimates effective population size and selection parameters, and , as a byproduct of the hypothesis testing. We computed bias of selection fitness and dominance for Clear and GP for 1000 simulations in each setting. The distribution of the error (bias) for 100× coverage is presented in Figure 5 for different configurations. Figures S4 and S5 in File S3 provide the distribution of estimation errors for 30× and 300× coverage, respectively. For hard sweep, Clear provides estimates of *s* with lower variance of bias (Figure 5A and Figure S6 in File S3). In soft sweep, GP and Clear both provide unbiased estimates of *s* with low variance (Figure 5B). Figure 5, C and D, shows that Clear provides unbiased estimates of *h* as well when and We also tested if Clear provides unbiased estimates of *N* by estimating population size on 1000 simulations when As shown in Figure 7A and Figure S9, A–C, in File S3 maximum likelihood is attained at the true value of the parameter.

### Running time

As Clear does not compute exact likelihood of a region (*i.e.*, does not explicitly model linkage between sites), the complexity of scanning a genome is linear in the number of polymorphisms. Calculating the score of each variant requires an computation for However, most of the operations are can be vectorized for all replicates to make the effective running time for each variant faster. We conducted 1000 simulations and measured running times for computing site statistics *H*, FIT, CMH, and GP with different numbers of linked loci. Our analysis reveals (Figure 6) that Clear is orders of magnitude faster than GP, and comparable to FIT. While slower than CMH on the time per variant, the actual running times are comparable after vectorization and broadcasting over variants (see below).

These times can have a practical consequence. For instance, to run GP in the single-locus mode on the entire pool-seq data of the *D. melanogaster* genome from a small sample (≈1.6-M variant sites), it would take 1444 CPU hours (≈1 CPU month). In contrast, after vectorizing and broadcasting operations for all variant operations using the numba package, Clear took 75 min to perform a scan, including precomputation; while the fastest method, CMH, took 17 min.

## Analysis of Real Data

#### Analysis of a *D. melanogaster* adaptation to alternating temperatures:

We applied Clear to the data from a study of the adaptation of *D. melanogaster* to alternating temperatures (Orozco-ter Wengel *et al.* 2012; Franssen *et al.* 2015), where three replicate samples were chosen from a population of *D. melanogaster* for 59 generations under alternating 12-hr cycles of hot-stressful (28°) and nonstressful (18°) temperatures, and sequenced. In this data set, sequencing coverage is different across replicates and generations (see figure S2 of Terhorst *et al.* 2015) which makes variant depths highly heterogeneous (Figure S10 in File S3).

We first filtered out heterochromatic, centromeric, and telomeric regions (Fiston-Lavier *et al.* 2010), and those variants that have a collective coverage of >1500 in all 13 populations: three replicates at the base population, two replicates at generation 15, one replicate at generation 23, one replicate at generation 27, three replicates at generation 37, and three replicates at generation 59. After filtering, we ended up with 1,605,714 variants.

Next, we estimated genome-wide population size (Figure 7B and Figure S9E in File S3), which is consistent with previous studies (Orozco-ter Wengel *et al.* 2012; Jónás *et al.* 2016). The likelihood curves of Clear are sharper around the optimum compared to that of the method of Bollback *et al.* (2008) (see figure S1 in Orozco-ter Wengel *et al.* 2012). Also, chromosomes *3*L and *3*R appear to have a smaller population size, respectively. Others have made similar observations on this data. In particular, Jónás *et al.* (2016) showed that the chromosome-wise population size varies even more when it is computed for each replicate separately (see table 1 in Jónás *et al.* 2016). For instance, is 131 for chromosome *3*R replicate 1, while it is 328 for chromosome *X* replicate 2.

While it would be ideal to compute the Clear statistic for each replicate and chromosome separately, computing empirical *P*-values and significant regions become computationally intensive as the empirical null distribution of each replicate and each chromosome needs to be computed. Hence, we use a single genome-wide estimate in all analyses, but we normalize statistic separately for each chromosome.

We used a heuristic calculation (see File S3) to choose the sliding window size *L* as the distance where the LD between the favored mutation and a site away remains strong. For *D. melanogaster* parameters, we obtained We computed the normalized test statistic on sliding windows of size of 30 kbp and step size of 5 kbp over the genome (see Figure 8A).

The empirical null distribution of was estimated by creating 100 whole-genome simulations (400 K statistic values), as described in *Material and Methods*. Then, the *P*-value of the test statistic in each region in the experimental data was calculated as the fraction of the null statistic values that are greater than or equal to the test statistic(see Figure S11 in File S3). After correcting for multiple testing, we identified five contiguous intervals (Figure 8) satisfying and covering polymorphic sites. We further performed single-locus hypothesis testing on the sites to identify 174 individual variants with an (Figure 8B).

The final set of 174 variants fall within 32 genes (Table S3 in File S3), including many serine inhibitory proteases (serpins) and other genes involved in endocytosis. Recycling of synaptic vesicles is seen to be blocked at high temperature in temperature-sensitive *Drosophila* mutants (Kosaka and Ikeda 1983). This is also supported by gene ontology (GO)-enrichment analysis, where a single GO term “inhibition of proteolysis” is found to be enriched (corrected *P*-value = 0.0041). To test for dominant selection, we computed the *D* statistic on simulated neutral and experimental data, and computed *P*-values accordingly. After correcting for multiple testing, 96 variants were discovered with an (Figure S12 in File S3).

### Analysis of outcrossing yeast populations

We also applied Clear to 12 replicate samples of outcrossing yeast populations (Burke *et al.* 2014), where samples were taken at generations We observed a significant variation in the genome-wide, site-frequency spectrum of certain populations over different time points for some replicates (Figure S13 in File S3). The variation does not have an easily identifiable cause. Therefore, we focused analysis on seven replicates with a genome-wide, site-frequency spectrum over the time range (Figure S14 in File S3).

We estimated population size to be haplotypes (Figure 7C and Figure S9F in File S3), and computed the and *H* statistics accordingly. To compute *P*-values, we created 1-M single-locus neutral simulations according to the experimental data’s initial frequency and coverage. By setting the FDR cutoff to 0.05, only 18 and 16 variants show significant signal for directional and dominant selection, respectively (Figure 9 S12 in File S3). Selected variants for directional selection are clustered in two regions, which match two of the five regions (regions C and E in figure 2A in Burke *et al.* 2014) identified by Burke *et al.* (2014) in their preliminary analysis. UCSC browser tracks for analysis of D. *melanogaster* and Yeast datasets are available as File S1 and File S2, respectively.

## Discussion

We developed a computational tool, Clear, that can detect regions and variants under selection E&R experiments. Using extensive simulations, we show that Clear outperforms existing methods in detecting selection, locating the favored allele, and estimating model parameters. Also, while being computationally efficient, Clear provides a means for estimating populations size and hypothesis testing.

Many factors; such as small population size, finite coverage, LD, finite sampling for sequencing, duration of the experiment, and the small number of replicates; can limit the power of tools for analyzing E&R. Here, by discrete modeling, Clear estimates population size and provides unbiased estimates of *s* and *h*. It adjusts for heterogeneous coverage of pool-seq data, and exploits the presence of linkage within a region to compute a composite likelihood ratio statistic.

It should be noted that, even though we described Clear for small fixed-size populations, the statistic can be adjusted for other scenarios, including changing population sizes when the demography is known. For large populations, transitions can be computed on sparse data structures, as for large *N* the transition matrices become increasingly sparse. Alternatively, frequencies can be binned to reduce dimensionality.

The comparison of hard- and soft-sweep scenarios showed that the initial frequency of the favored allele can have a nontrivial effect on the statistical power for identifying selection. Interestingly, while it is easier to detect a region undergoing strong selection, it is harder to locate the favored allele in that region.

There are many directions to improve the analyses presented here. In particular, we plan to focus our attention on other organisms with more complex life cycles, experiments with variable population size, and longer sampling-time spans. As E&R experiments continue to grow, deeper insights into adaptation will go hand in hand with improved computational analysis.

## Acknowledgments

The authors thank the anonymous reviewers whose comments substantially improved the quality and clarity of this manuscript. A.I., A.A., and V.B. were supported by grants from the National Institutes of Health (1R01 GM-114362) and National Science Foundation (DBI-1458557 and IIS-1318386). C.S. is supported by the European Research Council grant ArchAdapt. V.B. is a cofounder, has an equity interest, and receives income from Digital Proteomics, LLC (DP). The terms of this arrangement have been reviewed and approved by the University of California, San Diego in accordance with its conflict of interest policies. DP was not involved in the research presented here.

## Footnotes

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.116.197566/-/DC1.

*Communicating editor: Y. S. Song*

- Received November 3, 2016.
- Accepted March 31, 2017.

- Copyright © 2017 by the Genetics Society of America

Available freely online through the author-supported open access option.