## Abstract

Recent advances in DNA sequencing techniques have made it possible to monitor genomes in great detail over time. This improvement provides an opportunity for us to study natural selection based on time serial samples of genomes while accounting for genetic recombination effect and local linkage information. Such time series genomic data allow for more accurate estimation of population genetic parameters and hypothesis testing on the recent action of natural selection. In this work, we develop a novel Bayesian statistical framework for inferring natural selection at a pair of linked loci by capitalising on the temporal aspect of DNA data with the additional flexibility of modeling the sampled chromosomes that contain unknown alleles. Our approach is built on a hidden Markov model where the underlying process is a two-locus Wright-Fisher diffusion with selection, which enables us to explicitly model genetic recombination and local linkage. The posterior probability distribution for selection coefficients is computed by applying the particle marginal Metropolis-Hastings algorithm, which allows us to efficiently calculate the likelihood. We evaluate the performance of our Bayesian inference procedure through extensive simulations, showing that our approach can deliver accurate estimates of selection coefficients, and the addition of genetic recombination and local linkage brings about significant improvement in the inference of natural selection. We also illustrate the utility of our method on real data with an application to ancient DNA data associated with white spotting patterns in horses.

- natural selection
- linked loci
- Wright-Fisher diffusion
- hidden Markov model
- particle marginal Metropolis-Hastings

NATURAL selection is a fundamental evolutionary process that maintains function and drives adaptation, thereby altering patterns of diversity at the genetic level. Methods for detecting and quantifying natural selection have important applications such as identifying the genetic basis of diseases and understanding the molecular basis of adaptation. There has been a long line of theoretical and experimental research devoted to modeling and inferring natural selection, and the vast majority of earlier analyses are based on allele frequency data obtained at a single time point that requires unrealistic assumptions of ancestral demography and selective regimes (see Bank *et al.* 2014, for a review). With advances in DNA sequencing technologies, an ever-increasing amount of allele frequency data sampled at multiple time points are becoming available. Such time series genetic data can arise from experimental evolution (*e.g.*, Burke *et al.* 2010; Orozco-terWengel *et al.* 2012; Lang *et al.* 2013; Wiser *et al.* 2013), viral/phage populations (*e.g.*, Wichman *et al.* 1999, 2005; Holder and Bull 2001; Bollback and Huelsenbeck 2007), or ancient DNA (aDNA) (*e.g.*, Hummel *et al.* 2005; Ludwig *et al.* 2009; Orlando *et al.* 2013; Mathieson *et al.* 2015). Temporally spaced samples provide much more valuable information regarding natural selection since expected changes in allele frequencies over time are closely related to the strength of natural selection acting on the population. One can therefore expect time series allele frequency data to improve our ability to estimate selection coefficients and test hypotheses regarding natural selection.

There has been a growing literature on the statistical inference of natural selection from time series data of allele frequencies over the past decade (*e.g.*, Bollback *et al.* 2008; Malaspinas *et al.* 2012; Mathieson and McVean 2013; Feder *et al.* 2014; Foll *et al.* 2014, 2015; Lacerda and Seoighe 2014; Steinrücken *et al.* 2014; Terhorst *et al.*, 2015; Ferrer-Admetlla *et al.* 2016; Schraiber *et al.* 2016; Shim *et al.* 2016; He *et al.* 2019; Paris *et al.* 2019), reviewed in Bank *et al.* (2014) and Malaspinas (2016). A common method to analyzing time series allele frequency data are based on the hidden Markov model (HMM) framework of Williamson and Slatkin (1999), where the underlying population is assumed to evolve under the Wright-Fisher model introduced by Fisher (1922) and Wright (1931), and the observations are modeled through independent binomial sampling from the underlying population at each given time point (see Tataru *et al.* 2017, for a review of the statistical inference in the Wright-Fisher model using allele frequency data). However, such approaches can become computationally infeasible for large populations because they require a prohibitively large amount of computation and storage for the calculation of the likelihood. Most existing HMM-based methods are therefore built on either the diffusion approximation of the Wright-Fisher model (*e.g.*, Bollback *et al.* 2008; Malaspinas *et al.* 2012; Steinrücken *et al.* 2014; Ferrer-Admetlla *et al.* 2016; Schraiber *et al.* 2016; He *et al.* 2019) or the moment-based approximation of the Wright-Fisher model (*e.g.*, Feder *et al.* 2014; Lacerda and Seoighe 2014; Terhorst *et al.* 2015; Paris *et al.* 2019). Such approximations enable efficient integration over all possible allele frequency trajectories of the underlying population, thereby allowing the likelihood computation to be completed in a reasonable amount of time.

The recent advent of high-throughput sequencing technologies has made it possible to monitor genomes in great detail over time. This provides an opportunity for detecting and estimating natural selection at multiple linked loci from time series data of allele frequencies while taking the process of genetic recombination and the information of local linkage into account. Properly accounting for genetic recombination and local linkage can be expected to provide more precise estimates for the selection coefficient and more accurate hypothesis testing on the recent action of natural selection since genetic recombination may either reinforce or oppose changes in allele frequencies caused by natural selection according to the levels of linkage disequilibrium (He *et al.* 2020). However, with the exception of Terhorst *et al.* (2015), all existing methods built on the Wright-Fisher model for inferring natural selection from time series allele frequency data are limited to either a single locus (*e.g.*, Bollback *et al.* 2008; Malaspinas *et al.* 2012; Steinrücken *et al.* 2014; Schraiber *et al.* 2016; He *et al.* 2019) or multiple independent loci (*e.g.*, Foll *et al.* 2014, 2015; Ferrer-Admetlla *et al.* 2016; Shim *et al.* 2016; Paris *et al.* 2019), *i.e.*, genetic recombination effect and local linkage information are ignored in these approaches. The exception among these methods, Terhorst *et al.* (2015), extended a moment-based approximation of the Wright-Fisher model introduced by Feder *et al.* (2014) to the case of multiple linked loci with an application to the pooled sequencing (Pool-Seq) data from evolve-and-resequence (E&R) experiments, where the allele frequency transition between two consecutive sampling time points is modeled deterministically, with added Gaussian noise.

In the present work, we propose a novel HMM-based method for Bayesian inference of natural selection at two linked loci from time series data of allele frequencies while accounting for the process of genetic recombination, thereby incorporating the information on local linkage. Our key innovation is that a diffusion approximation to the Wright-Fisher model of the stochastic evolutionary dynamics under natural selection at two linked loci is used as the hidden Markov process to characterize the changes in the haplotype frequencies of the underlying population over time, which enables us to explicitly model genetic recombination and local linkage. The diffusion approximation we use in our approach allows us to avoid the restriction that the allele frequency trajectory of the underlying population remains far away from allele fixation or loss, which was imposed by the Gaussian approximation used in Terhorst *et al.* (2015). Our posterior computation is carried out with the particle marginal Metropolis-Hastings (PMMH) algorithm developed by Andrieu *et al.* (2010), which enables us to efficiently calculate the likelihood. Also, our method can handle sampled chromosomes with unknown alleles, which is common in aDNA data due to postmortem damage. In addition, our method can be readily extended to model a range of complex evolutionary scenarios like nonconstant demographic histories.

We evaluate the performance of our Bayesian inference procedure through extensive simulations. We show that our method enables efficient and accurate estimation of selection coefficients from time series genetic data, regardless of whether sampled chromosomes contain unknown alleles or not. We present two scenarios where existing single-locus methods fail to deliver precise estimates for selection coefficients whereas our approach still works well, especially when the loci are tightly linked. This shows the efficacy of our method in modeling genetic recombination and local linkage. Finally, we apply our Bayesian inference procedure to analyze the aDNA data associated with white spotting patterns in horses from Wutke *et al.* (2016) and find that, in horses, there is no evidence showing that the tobiano pattern is positively selected but strong evidence of the sabino pattern being negatively selected.

## Materials and Methods

In this section, we begin with a short review of the Wright-Fisher diffusion for two linked loci evolving subject to natural selection over time, and then describe our Bayesian procedure for the inference of natural selection at the two linked loci from temporally spaced samples, *e.g.*, how to set up the HMM framework and how to compute the posterior probability distribution for the population genetic quantities of interest with Markov chain Monte Carlo (MCMC) techniques.

### Wright-Fisher diffusion

Consider a diploid population of randomly mating individuals at two linked loci A and B evolving under natural selection according to the two-locus Wright-Fisher model with selection (see, *e.g.*, He *et al.* 2017), for which we assume discrete time and nonoverlapping generations. At each locus, there are two possible allele types, labeled , and , , respectively, resulting in four possible haplotypes , , , and , labeled haplotypes 1, 2, 3, and 4, respectively. We attach symbols and to the mutant alleles, which are assumed to arise only once in the population and be positively selected once it exists, and we attach symbols and to the ancestral alleles, which are assumed to originally exist in the population.

We incorporate viability selection into the population dynamics and assume that the viability is fixed from the time that the mutant allele was created in the population and is only determined by the genotype at each single locus. More specifically, we assume that the relative viabilities of the 16 possible (ordered) genotypes at the two loci are determined multiplicatively from the relative viabilities at individual loci, and the relative viabilities of the three possible genotypes at each single locus, *e.g.*, genotypes , , and at a given locus A, are taken to be 1, , and , respectively, where is the selection coefficient and is the dominance parameter. For example, the relative viability of the genotype is . We let be the recombination rate of the two loci on the same chromosome (*i.e.*, the fraction of recombinant offspring showing a crossover between the two loci). We assume that the population size is fixed to be *N* diploid individuals for all time.

#### Two-locus Wright-Fisher diffusion with selection:

We consider a scaling limit of the Wright-Fisher model, where the unit of time is rescaled by . The scaled selection coefficients and , and the scaled recombination rate *ρ* = 4*Nr* are kept constant while the population size *N* is taken to infinity. As the population size approaches infinity, the haplotype frequency trajectories follow a standard diffusion limit of the two-locus Wright-Fisher model with selection (see, *e.g.*, He *et al.* 2020), called the two-locus Wright-Fisher diffusion with selection. The Wright-Fisher diffusion has already been successfully applied in the inference of natural selection from time series allele frequency data. The partial differential equation (PDE) satisfied by the transition probability density function of the Wright-Fisher diffusion was used in *e.g.*, Bollback *et al.* (2008), Steinrücken *et al.* (2014), He *et al.* (2019). In this work, as used in *e.g.*, Schraiber *et al.* (2016), we characterize the Wright-Fisher diffusion as the solution of the stochastic differential equation (SDE) instead.

We let * X_{i}*(

*t*) denote the frequency of haplotype

*i*in the population at time

*t*for , and be the frequencies of the four possible haplotypes in the population by

*(*

**X***t*), which evolves in the state space (

*i.e.*, a three-simplex)and satisfies the SDE in the following form(1)with initial condition

*(*

**X***t*

_{0}) =

**x**_{0}. In Equation 1, the drift term

**(**

*μ**) is a four-dimensional vector being(2)the diffusion term*

**x***(*

**ν****) is a four by three matrix satisfying(3)and**

*x**(*

**W***) is a three-dimensional standard Brownian motion. The term*

**t***x*

_{1}

*x*

_{4}−

*x*

_{2}

*x*

_{3}in Equation 2 is a measure of the linkage disequilibrium between the A and B loci introduced by Lewontin and Kojima (1960), which quantifies the nonrandom association of the alleles at the two loci.

#### Forward-in-time simulation of the Wright-Fisher diffusion:

To obtain a numerical solution of the Wright-Fisher SDE in Equation 1, we need to compute the diffusion term * ν*(

*) which we have to perform at each time step in most existing numerical simulation schemes. The diffusion term*

**x***(*

**ν***) can be analytically derived with the Cholesky decomposition (Sato 1976), which, however, explodes at the boundaries. There exist other matrix decompositions capable of computing the diffusion term*

**x***(*

**ν***) such as spectral decomposition, which are valid for positive semidefinite matrices, typically at the expense of either additional numerical errors and computational costs, or limitations in applicability to the infinitesimal covariance matrix*

**x****Σ**(

*) of the form in Equation 3.*

**x**Following He *et al.* (2020), we reformulate the Wright-Fisher SDE in the following form(4)with initial condition ** X**(

*t*

_{0}) =

**x**_{0}, where the diffusion term

**(**

*σ**) can be explicitly written down as(5)and*

**x***(*

**W***t*) is a six-dimensional standard Brownian motion. Combining Equations 3 and 5, we havewhich implies that the two Wright-Fisher SDE’s have the same infinitesimal generatorthus having the same weak solution. This guarantees that we can achieve the solution of the Wright-Fisher SDE of the form in Equation 1 by numerically solving the Wright-Fisher SDE of the form in Equation 4, which enables us to avoid boundary issues and reduce computational costs.

There exist a number of numerical simulation schemes for SDEs [see Kloeden and Platen (1992) for an excellent introduction]. The numerical approach we adopt in this work is the commonly used Euler-Maruyama scheme, one of the most popular numerical methods for SDEs in practice due to its high efficiency and low complexity. More specifically, we divide each generation into *L* subintervals by setting and then the Euler-Maruyama approximation of the Wright-Fisher diffusion can be formulated asfor *i* = 1,2,3,4, where are independent and normally distributed with mean 0 and variance Δ*t* for *j* = 1,2,…,6. The Euler-Maruyama scheme is numerically stable and strongly consistent (see, *e.g.*, Kloeden and Platen 1992), and the convergence of the Euler-Maruyama approximation of the Wright-Fisher diffusion is guaranteed by Zhang (2006).

### Bayesian inference of natural selection

Suppose that the available data are always sampled from the underlying population at a finite number of distinct time points, say where the time is measured in units of 2*N* generations to be consistent with the timescale of the Wright-Fisher diffusion. At the *k*-th sampling time point, we let and denote the counts of mutant alleles and ancestral alleles observed at loci A and B in the sample of *n _{k}* chromosomes drawn from the underlying population, respectively. The population genetic quantities of interest in this work are the scaled selection coefficients and the dominance parameters and and the scaled recombination rate

*ρ*, which are denoted by

#### Hidden Markov model:

Similar to Bollback *et al.* (2008), the underlying population is assumed to evolve according to the two-locus Wright-Fisher diffusion with selection in our HMM framework, and the observations are modeled as independent samples drawn from the underlying population at each sampling time point (see Figure 1 for the graphical representation of our HMM framework). To compute the posterior probability distribution we condition and integrate over all possible haplotype frequency trajectories of the underlying population at each sampling time point. More specifically, we let denote the haplotype frequency trajectories of the underlying population at the sampling time points *t*_{1:}* _{K}*. The posterior probability distribution for the population genetic quantities of interest can then be written down as(6)where(7)In Equation 7, is the prior probability distribution for the population genetic quantities of interest and can be taken to be a uniform prior over the parameter space if prior knowledge is poor, is the probability distribution for the haplotype frequency trajectories of the underlying population at the sampling time points

*t*

_{1:K}, and is the conditional probability for the observations at the sampling time points

*t*_{1:}

*given the haplotype frequency trajectories of the underlying population.*

_{K}Since the Wright-Fisher diffusion is shown to be a Markov process, the probability distribution for the haplotype frequency trajectories of the underlying population at the sampling time points *t*_{1:}* _{K}* can be decomposed aswhere is the prior probability distribution for the haplotype frequencies of the underlying population at the initial sampling time point and can be taken to be a uniform prior over the state space Ω

*, known as the flat Dirichlet distribution, if prior knowledge is poor. The term in the product above, is the transition probability density of the Wright-Fisher diffusion between two consecutive sampling time points for which can be obtained by numerically solving the Kolmogorov backward equation (or its adjoint) associated with the Wright-Fisher diffusion. However, this requires a fine enough discretisation of the state space*

_{X}*Ω*

*, if a finite difference method is used, and strongly depends on the underlying population genetic parameters (Ragsdale and Gutenkunst 2017). In addition, numerically solving such a PDE in three dimensions for our posterior computation is computationally challenging and prohibitively expensive. We therefore resort to an “exact-approximate” Monte Carlo procedure (Andrieu and Vihola 2016) in this work that only involves simulating the Wright-Fisher SDE in the form of Equation 4, as a tractable alternative.*

_{X}Given the haplotype frequency trajectories of the underlying population, the observations at each sampling time point are independent of one another, which means thatwhere is the conditional probability for the observations at the *k*-th sampling time point given the haplotype frequency trajectories of the underlying population for To calculate the emission probability we let denote the counts of the , , , and haplotypes in the sample at the *k*-th sampling time point, which are usually unobserved (see Figure 2 for the graphical representation of our HMM framework incorporating the additional level of sampling noise). We then have(8)whereConditional on the haplotype frequency trajectories of the underlying population at the *k*-th sampling time point, the haplotype counts of the sample can be modeled through multinomial sampling from the underlying population with sample size *n _{k}*. We can then formulate the first term in the summation of Equation 8 as(9)The second term in the summation of Equation 8 can be decomposed as(10)Let

*ϕ*denote the probability that a sampled chromosome at a single locus is of unknown type, which we assume to be identical for all loci. We therefore have(11)(12)where(13)The probability that the sampled chromosome at a single locus is of unknown type is usually unavailable, but we can estimate it with

#### Particle marginal Metropolis-Hastings:

To compute the marginal posterior , we resort to MCMC techniques since the posterior probability distribution in Equation 6 is unavailable in a closed form. A Metropolis-Hastings (MH) scheme can be devised to explore the population genetic quantities of interest with a fairly arbitrary proposal probability distribution, *e.g.*, a random walk proposal, where a sample of new candidates of the parameters is drawn from the proposal and is accepted with the MH ratio(15)Our problem reduces to the calculation of the intractable marginal likelihood in Equation 15, which can be formulated asand achieved with a MC estimate (Beaumont 2003; Andrieu and Roberts 2009). This pseudomarginal MCMC algorithm exploits the fact that the MC estimate of the marginal likelihood is unbiased (or has a constant bias independent of the parameters ϑ) and targets the marginal posterior

We adopt a closely related approach developed by Andrieu *et al.* (2010), which obtains an unbiased sequential MC (SMC) estimate of the marginal likelihood and targets the joint posterior This method is called PMMH and permits a joint update of the population genetic quantities of interest and the latent population haplotype frequency trajectories. The coestimation of the haplotype frequency trajectories of the underlying population is interesting in its own right, but our interest here lies only in the population genetic parameters. We therefore employ a special case of the PMMH algorithm in this work, where we do not generate and store the haplotype frequency trajectories of the underlying population in the state of the Markov chain. Full details about the PMMH algorithm can be found in Andrieu *et al.* (2010). Fearnhead and Künsch (2018) provided a review of MC methods for estimating parameters in the HMM based on the particle filter.

In our Bayesian inference procedure, the implementation of the PMMH algorithm requires the SMC estimate of the marginal likelihood This can be achieved by the bootstrap particle filter introduced by Gordon *et al.* (1993) in the following manner. For the sampling time point *t*_{1}, we first generate a sample of *M* particles, denoted by from the prior and assign each particle a weight given byfor where the superscript *m* denotes the particle label. We then calculate the SMC estimate of the marginal likelihood for the observations **u**_{1} and **v**_{1} byand resample *M* times with replacement from the sample of particles with the probabilities given by the normalized weights We repeat the following steps for the sampling time points *t*_{2:}* _{K}*:

Step 1: Generate each particle by simulating the Wright-Fisher diffusion

(*X**t*) on the time interval starting at the frequency with the Euler-Maruyama scheme forStep 2: Assign each particle a weight given by

for

Step 3: Calculate the SMC estimate of the marginal likelihood for the observations

*u*_{1:}and_{k}*v*_{1:}by_{k}Step 4: Resample

*M*times with replacement from the sample of particles with the probabilities given by the normalized weights

Our Bayesian inference procedure then consists in the followings. We first generate a sample of initial candidates of the parameters ϑ from the prior and then run a bootstrap particle filter with the proposed parameters ϑ to obtain the SMC estimate of the marginal likelihood We repeat the following steps until a sufficient number of the samples of the parameters ϑ have been obtained:

Step 1: Generate a sample of new candidates of the parameters from the proposal

Step 2: Run a bootstrap particle filter with the proposed parameters to obtain the SMC estimate of the marginal likelihood

Step 3: Accept the proposed parameters with the Metropolis-Hastings ratio

Once enough samples of the parameters ϑ have been obtained, we can get the minimum mean square error (MMSE) estimates for the population genetic quantities of interest, defined byAlternatively, using nonparametric density estimation techniques (see Izenman, 1991, for a review), we can compute the posterior with the samples of the parameters ϑ and achieve the maximum *a posteriori* probability (MAP) estimates for the population genetic quantities of interest, defined by

### Data availability

The authors state that all data necessary for confirming the conclusions of this work are represented fully within the article. Source code implementing the approach described in this work is available at https://github.com/zhangyi-he/WFM-2L-DiffusApprox-FwdPMMH. Supplemental material available at figshare: https://doi.org/10.25386/genetics.12821585.

## Results

In this section, we show how our Bayesian inference method performs on simulated datasets with known population genetic parameter values, including a scenario where sampled chromosomes contain unknown alleles. We also present two examples to show the improvement in the inference of natural selection by explicitly modeling genetic recombination and local linkage. Finally, we apply our approach to the aDNA data associated with horse white spotting patterns from previous studies of Ludwig *et al.* (2009), Pruvost *et al.* (2011) and Wutke *et al.* (2016).

### Analysis of simulated data

We run forward-in-time simulations of the two-locus Wright-Fisher model with selection and evaluate the performance of our approach on these replicate simulations by examining the bias and the root mean square error (RMSE) of our Bayesian estimates. In what follows, we take the dominance parameters to be and (*i.e.*, the heterozygous fitness is the arithmetic average of the homozygous fitness, called genic selection) and choose a population size of unless otherwise noted. In principle, however, the conclusions hold for any other values of the dominance parameters and the population size

For each simulated dataset, given the values of the population genetic parameters ϑ and the initial population haplotype frequencies we simulate the haplotype frequency trajectories of the underlying population according to the two-locus Wright-Fisher model with selection. After obtaining the simulated population haplotype frequency trajectories, we draw the unobserved sample haplotype counts independently at each sampling time point according to the multinomial distribution in Equation 9 first, and then we generate the observed sample mutant allele counts and ancestral allele counts with Equations 10–13.

#### Power to infer natural selection:

We vary the selection coefficients with and and the recombination rate with in our simulation studies. We perform 100 replicates for each of the 12 possible combinations of the selection coefficients and the recombination rate. For each replicate, we set the initial population haplotype frequencies and simulate the haplotype frequency trajectories of the underlying population according to the two-locus Wright-Fisher model with selection. We sample 50 chromosomes from the underlying population at every 50 generations throughout 500 generations.

We choose a uniform prior over the interval [−1,1] for the selection coefficients and and a flat Dirichlet prior for the initial population haplotype frequencies **x**_{0} in our Bayesian inference method. We divide each generation into five subintervals in the Euler-Maruyama scheme and run the PMMH algorithm with 1500 particles and 10,000 iterations. We discard the initial 2000 iterations as the burn-in period, and then thin the remaining PMMH output by selecting every fourth value.

The resulting boxplots of the empirical studies are shown in Figure 3 for the allele frequency datasets generated without missing values ( in Equations 11 and 12) and Figure 4 for the allele frequency datasets generated with missing values ( in Equations 11 and 12), respectively. In the two figures, the tips of the whiskers denote the 2.5%-quantile and the 97.5%-quantile, and the boxes represent the first and third quartile with the median in the middle. As can be seen from the boxplot results, our estimates for the selection coefficients at both loci show little bias across different parameter ranges, no matter whether sampled chromosomes contain unknown alleles or not, although one can discern a slight bias for small selection coefficients. With the increase of the selection coefficients, our estimates for the selection coefficients become more accurate. See the bias and the RMSE of the resulting MMSE estimates in Supplemental Material, Tables S1 and S2.

For each combination of the selection coefficients and the recombination rate, we calculate the proportion of the 95% HPD intervals that include the true values, shown in the bottom left corner of each boxplot in Figures 3 and 4. On average, 92.00% of the runs result in the true values of the selection coefficients being within the 95% HPD intervals for the simulated datasets without missing values, 93.33% for tightly linked loci and 90.67% for loosely linked loci. For simulated datasets with 2% missing values, 92.08% of the runs result in the true values of the selection coefficients being within the 95% HPD intervals, *i.e.*, 93.33% for tightly linked loci and 90.83% for loosely linked loci. We can see that small recombination rates can lead to better results for both loci.

In Figure 5, we illustrate the resulting boxplots of the empirical studies where the simulated data are given as haplotype frequencies (instead of allele frequencies in Figures 3 and 4). Compared to the estimates from allele frequency data, our estimates from haplotype frequency data are closer to their true values with smaller variances, especially for tightly linked loci. On average, 92.50% of the runs result in the true values of the selection coefficients being within their 95% HPD intervals on average, with 93.67% for tightly linked loci and 91.33% for loosely linked loci. The bias and the RMSE of the resulting MMSE estimates are summarized in Table S3. This improvement in the performance of our estimates is to be expected as all else being equal haplotype frequency data contain more information than allele frequency data. The complex interplay between the four haplotypes in the sample can be directly observed in haplotype frequency data but only partially observed in allele frequency data.

However, as illustrated in Figure 5, our estimates are still slightly biased for small selection coefficients. This may be caused by the initial population frequencies of the haplotypes that contain mutant alleles being close to 0 in our simulated data. In such a situation, the population frequency trajectories of these haplotypes will be, with high probability, near 0 during the sampling period for small selection coefficients (see Figures S1 and S2). This can cause a number of simulated datasets to have sample counts 0 for the haplotypes that contain mutant alleles, especially when the selection coefficients are small. Such datasets contain little information on the underlying selection coefficients. As can be observed from Figure 6, the bias can be almost completely eliminated for all combinations of the selection coefficients and the recombination rate if the starting population frequencies of the haplotypes that contain mutant alleles are taken to be intermediate values like The bias and the RMSE of the resulting MMSE estimates are summarized in Table S4. The haplotype frequency trajectories of the underlying population for the haplotype frequency datasets simulated with the initial population haplotype frequencies can be found in Figures S3 and S4. We also assess the performance of our method for the case that a new mutation arose in the population (at frequency ) at *t* = 0 when the neighboring mutation became established. See Figure S5 and Table S5 for boxplots of the resulting MMSE estimates with their bias and RMSE, which show that our approach can still produce precise estimates of the selection coefficients in this case. It should be noticed that in this case we condition the mutant alleles at both loci to survive until the most recent sampling time point and sample 50 chromosomes from the underlying population at every 120 generations throughout 1200 generations so that a significant number of the realizations of the haplotype frequency trajectories of the underlying population can capture a significant proportion of the selective sweep.

In summary, our Bayesian inference procedure can deliver accurate estimates of the selection coefficients based on time series data of allele frequencies across different parameter ranges, regardless of whether sampled chromosomes contain unknown alleles or not. We also generate datasets with other selection schemes, *e.g.*, the dominance parameters and The resulting boxplots of the simulation studies are shown in Figure S6, with the bias and the RMSE of the resulting MMSE estimates summarized in Table S6. In addition to MMSE estimates, we present the bias and the RMSE of MAP estimates (see Tables S7–S12), which display very similar characteristics to the MMSE estimates. The boxplots for MAP estimates show little bias, with the upper and lower quartiles of the MAP estimates being similar to those of the MMSE estimates (see Figures S7–S12).

#### Improvement from modeling genetic recombination and local linkage:

In the case where a pair of loci are both suspected to be subject to natural selection, one can still use a single-locus method to each locus to estimate selection coefficient. To our knowledge, there has been a considerable amount of work on the statistical inference of natural selection at a single locus from time series data of allele frequencies (*e.g.*, Bollback *et al.*, 2008; Malaspinas *et al.*, 2012; Steinrücken *et al.* 2014; Schraiber *et al.* 2016; Ferrer-Admetlla *et al.* 2016; He *et al.* 2019). However, using a single-locus approach may lead to inaccurate estimates of the selection coefficients when the two loci are linked (He *et al.* 2020). In the case of tightly linked loci, modeling genetic recombination and local linkage becomes necessary, thus our two-locus method is far more desirable. Below, we illustrate with two examples of tightly linked loci with the recombination rate *r* = 0.00001. We simulate the haplotype frequency trajectories of the underlying population through the two-locus Wright-Fisher model with selection and draw 200 chromosomes from the underlying population at generations 0, 100, 200, 300, 400, and 500.

In the first example, we consider a positively selected locus A tightly linked with a selectively neutral locus B, where we set the selection coefficients and respectively. We take the initial haplotype frequencies of the underlying population to be The mutant allele frequency trajectories of the sample are shown in Figure 7A. The posterior probability distributions obtained through our single-locus approach, described in File S3, are shown in Figure 7B, and the posterior probability distributions achieved with our two-locus method, described in *Bayesian inference of natural selection*, are shown in Figure 7C. Bayesian estimates of the selection coefficients and are summarized in Table 1.

One can observe that with a single-locus method, the estimate for the selection coefficient is reasonably accurate, but the estimate for the selection coefficient is off by a large amount. The true value for the selection coefficient is 0, but the single-locus approach produces an estimate of roughly 0.005 and a 95% HPD interval that only encompasses positive values, which is strong evidence for the presence of positive selection. In comparison, the estimates for both of the selection coefficients and are fairly accurate with the two-locus method.

To understand the poor performance of the single-locus method in this example, we plot the mutant allele frequency trajectories of the underlying population in Figure 7D and the haplotype frequency trajectories of the underlying population in Figure 7E. The increase in the frequency of the allele with time, despite it having a selection coefficient of 0, is caused by the haplotype, which has a selection coefficient of 0.01. This compensates for the decrease in the frequency of the haplotype, resulting in an increasing trajectory for the allele, albeit with a slower rate of increase than the allele. With the two-locus approach, however, the interplay between all four haplotypes are taken into account and it produces accurate estimates for both of the selection coefficients and

In the second example, we consider two positively selected and tightly linked loci A and where we take the selection coefficients to be and respectively, and set the initial haplotype frequencies of the underlying population to be The results are illustrated in Figure 8 and summarized in Table 2. In this example, with the single-locus method, the estimate for the selection coefficient is reasonably accurate, but the estimate for the selection coefficient is off by a large amount, *i.e.*, its true value lies outside the 95% HPD interval. In fact, although the allele is favored by natural selection with a selection coefficient of 0.005, the resulting estimate for the selection coefficient is roughly −0.0015 with a 95% HPD interval that includes the value 0, which implies no strong evidence for natural selection. In comparison, the two-locus method again produces fairly accurate estimates for both of the selection coefficients and

As shown in Figure 8, the frequency of the allele increases with time due to the increase in the frequencies of the and haplotypes, which are the two most selected haplotypes, with the selection coefficients of 0.015 and 0.01, respectively. The allele is made up of the and haplotypes, with the selection coefficients of 0.01 and 0.005, respectively, which are the second and third most selected haplotypes. As a result of their initial conditions and selection coefficients, the frequency of the allele roughly holds constant in time, since it is somewhat out-competed by the allele. Viewing the trajectory of the allele in isolation does not give strong evidence that it is selectively advantageous, which results in an estimate of roughly 0 in its selection coefficient through the single-locus approach. Moreover, even the 95% HPD interval for the single-locus method does not include the true selection coefficient of 0.005 for the allele. Using the two-locus approach, we are again able to obtain accurate estimates for both of the selection coefficients and

In these two examples, we choose a uniform prior over the interval [−1,1] for the selection coefficients, and we select a flat Dirichlet prior for the initial population haplotype frequencies in the two-locus method and a uniform prior over the interval [0,1] for the initial population allele frequency in the single-locus method, respectively. Other settings in the Euler-Maruyama scheme and the PMMH algorithm are the same as we applied in the empirical studies in *Power to infer natural selection*. Compared to existing single-locus approaches, our two-locus method explicitly incorporates the effect of genetic recombination and the information of local linkage through the two-locus Wright-Fisher diffusion with selection. Indeed, the dynamics of the two-locus Wright-Fisher diffusion with selection can demonstrate complex and unpredictable behavior (see, *e.g.*, Yu and Etheridge 2010; Cuthbertson *et al.* 2012), which can result in inaccurate estimates of the selection coefficients if one simply employs a single-locus approach. In contrast, applying our two-locus method can yield precise estimates of the selection coefficients at both loci.

### Analysis of real data

We apply our Bayesian inference method to real data by reanalyzing time serial samples of segregating alleles of the equine homolog of proto-oncogene c-kit (*KIT*). These data come from previous studies of Ludwig *et al.* (2009), Pruvost *et al.* (2011) and Wutke *et al.* (2016), and the sample information and genotyping results for all successfully typed horses can be found in Wutke *et al.* (2016), which are summarized in Table 3. The *KIT* gene in horses resides on the long arm of chromosome 3 and lies in two intervals associated with white spotting patterns, one in the intron 13, which codes for tobiano (*KIT13*), with the other in intron 16, which codes for sabino (*KIT16*). At the *KIT13* locus, the ancestral allele is designated *KM0*, while the mutant allele, associated with the tobiano pattern and acting as dominant (Brooks *et al.* 2007), is designated *KM1*. The tobiano pattern is characterized by depigmented patches of skin and associated hair that often cross the dorsal midline and cover the legs. At the *KIT16* locus, the ancestral allele is designated *sb1*, while the mutant allele associated with the sabino pattern and acting as semidominant (Brooks and Bailey 2005), is designated *SB1*. The sabino pattern is characterized by irregularly bordered white patches of skin and associated hair that begin at the extremities and face, and may extend up to the belly and midsection.

We set the dominance parameters *h* = 0 for *KIT13* as the *KM1* allele is dominant, and *h* = 0.5 for *KIT16* as the *SB1* allele is semidominant. Following Der Sarkissian *et al.* (2015), we take the population size to be *N* = 16,000 and the average length of a generation of horse to be 8 years, the same as in Schraiber *et al.* (2016). As can be seen in Table 3, there are various sampling time points when the sequencing of the aDNA material yielded a number of unknown alleles at loci *KIT13* and/or *KIT16*. We show all possible mutant allele frequency trajectories of the sample at the *KIT13* and *KIT16* loci in Figure 9. Neither mutant allele was found in the first two samples dated 17,146 and 7029 years before present (BP). Indeed, both sabino and tobiano patterns are only present in domestic horses (Wutke *et al.* 2016). We assume that both mutant alleles, *KM1* and *SB1*, arose after the domestication of the horse, which is thought to have started in the Eurasian Steppes ∼5500 years BP (Outram *et al.* 2009). We therefore discard the first two samples from our analysis in this section, but, for completeness, in Figures S13–S18, we also present the results of the inference when these two samples are taken into account.

As a result of the low quality of the *KIT* dataset, it becomes difficult to intuit whether either or both mutant alleles at the *KIT13* and *KIT16* loci are selected by simply inspecting the mutant allele frequency trajectories of the sample. Using our two-locus Bayesian inference procedure, described in *Bayesian inference of natural selection*, we jointly estimate the selection coefficients for the mutant alleles at the *KIT13* and *KIT16* loci under the case that sampled chromosomes contain variants with unknown alleles. For the recombination rate, we choose three average rates of recombination, 5 × 10^{−9}, 1 × 10^{−8}, and 5 × 10^{−8} crossovers/bp, as suggested in Dumont and Payseur (2008), and multiply them by the genetic distance 4688 bp to get the recombination rates between the *KIT13* and *KIT16* loci. All settings in the Euler-Maruyama scheme and the PMMH algorithm are the same as we applied in the previous section. The resulting posterior probability distributions are shown in Figure 10, and the MAP and MMSE estimates, as well as the 95% HPD intervals, are summarized in Table 4.

As can be found in Table 4, the MMSE estimates with different values of the recombination rate are essentially unchanged, while the MAP estimates vary a bit more than the MMSE estimates. This may be caused by the way we achieve our MAP estimates, where the posterior probability distribution is approximated through the two-dimensional kernel density estimation with an axis-aligned bivariate normal kernel (Venables and Ripley 2002). Therefore, the MAP estimates may depend on the number of the iterations of the PMMH. The resulting Bayesian estimates of the selection coefficients suggest that the *KM1* allele at the *KIT13* locus is weakly positively selected, whereas the *SB1* allele at the *KIT16* locus is strongly negatively selected, but the 95% HPD intervals for both selection coefficients include the value 0. For the *KIT13* locus, the posterior probability for positive selection is 0.564, not strong evidence for the *KM1* allele at the *KIT13* locus being positively selected. However, for the *KIT16* locus, the posterior probability for negative selection is 0.982, strong evidence to support the *SB1* allele at the *KIT16* locus being negatively selected. This conclusion is further confirmed with the estimates obtained with different values of the population size (*i.e.*, *N* = 8000 and *N* = 32,000), which can be found in Figures S19 and S20.

We also used our single-locus Bayesian inference procedure, described in File S3, to independently estimate the selection coefficients for the mutant alleles at the *KIT13* and *KIT16* loci under the case that sampled chromosomes contain unknown alleles. All settings in the Euler-Maruyama scheme and the PMMH algorithm are the same as we applied in the previous section. The resulting posterior probability distributions are shown in Figure 11, and the MAP and MMSE estimates, as well as the 95% HPD intervals, are summarized in Table 5. The resulting Bayesian estimates of the selection coefficients suggest that the *KM1* allele at the *KIT13* locus is weakly selectively advantageous, whereas the *SB1* allele at the *KIT16* locus is weakly selectively deleterious. However, as illustrated in Figure 5, the posterior probability distributions for the *KIT13* and *KIT16* loci are both roughly symmetric about 0. This indicates that there is no evidence to support the *KM1* allele at the *KIT13* locus or the *SB1* allele at the *KIT16* locus being selected, which is consistent with the findings of Ludwig *et al.* (2009) obtained using the approach of Bollback *et al.* (2008). Compared to the results shown in Figure 10 and Table 4, we fail to tease apart negative selection at the *KIT16* locus without considering genetic recombination effect and local linkage information. We present an example that mimics the *KIT13* and *KIT16* loci, *i.e.*, a negatively selected locus tightly linked with a selectively neutral locus, which shows similar results to those using the real dataset (see Figure S21 and Table S13).

### Computational issues

In the PMMH algorithm, it is desirable to generate a large number of particles in the bootstrap particle filter to yield an accurate estimate of the marginal likelihood However, this can be computational burdensome since each iteration of the PMMH algorithm requires a run of the bootstrap particle filter even though fewer iterations are required. Balancing the particle number and the MCMC iteration number to obtain good performance at a reasonable computational cost was investigated by Pitt *et al.* (2012) and Doucet *et al.* (2015). In pseudomarginal algorithms, if the estimates of the marginal likelihood are too noisy, the chain tends to be “sticky” with excessive autocorrelation (Beaumont 2003). A simple rule-of-thumb is to select a particle number such that the standard deviation (SD) of the log-likelihood estimates is in the range from 1.0 to 1.7. Nevertheless, the PMMH algorithm exactly targets the marginal posterior for any number of particles.

In each run of the bootstrap particle filter, we simulate the particles according to the two-locus Wright-Fisher diffusion with selection using the Euler-Maruyama scheme. It is desirable to take a large *L* in the Euler-Maruyama scheme to get an accurate approximation of the Wright-Fisher diffusion, but large *L* increases the computational load. Stramer and Bognar (2011) suggested choosing L to be *L*^{*} such that the estimates of the marginal likelihood are approximately the same for any *L* > *L*^{*}, where *L*^{*} can be obtained through extensive simulations.

In practice, we divide each generation into five subintervals in the Euler-Maruyama scheme, *i.e.*, *L* = 5. Our running time for a single iteration of the PMMH algorithm with 1500 particles (see Figure 12), achieving the SD of the log-likelihood at ∼1.504, on a single core of an Intel Core i7 processor at 4.2 GHz, is ∼12.360 sec for the *KIT* dataset. In principle, every particle can be simulated in parallel on a different core. Running 10,000 iterations of the PMMH is sufficient for a relatively smooth resulting posterior surface, as shown in Figure 10. We discard the initial 2000 iterations as the burn-in period and then thin the remaining PMMH output, taking every fourth value and regarding these as independent. Dahlin and Schön (2015) outlined a selected number of possible improvements and best practices for implementation. All of our code in this work is written in R with C++ by using Rcpp and RcppArmadillo.

Exact-approximate particle filtering approaches such as the PMMH algorithm we use in this work seem to be useful for the inference of population genetic parameters from time series data of allele frequencies. This methodology can be generalized to a range of complex evolutionary scenarios, *e.g.*, nonconstant demographic histories. Although computationally demanding, improvements to the PMMH algorithm continue to be developed (*e.g.*, Yıldırım *et al.* 2018).

## Discussion

In this work, we have developed a novel MCMC-based approach to jointly estimate natural selection at two linked loci from time series genetic data while explicitly accounting for genetic recombination and local linkage. Our Bayesian inference procedure is built on an HMM framework incorporating the two-locus Wright-Fisher diffusion with selection. Our Bayesian estimates of selection coefficients are achieved with the PMMH algorithm. We have demonstrated that our method can accurately and efficiently estimate selection coefficients from simulated data, regardless of whether sampled chromosomes contain unknown alleles or not. We have found that, under certain circumstances, especially in the case of tightly linked loci, existing single-locus approaches fail to deliver precise estimates for selection coefficients, but our two-locus method still works well. We have applied our Bayesian inference procedure to the *KIT* gene in horses, which is involved in the formation of white spotting patterns.

As noted earlier, the ancient horse DNA dataset has been the subject of earlier analyses by Malaspinas *et al.* (2012), Steinrücken *et al.* (2014), Schraiber *et al.* (2016), and He *et al.* (2019). Compared with many datasets describing experimental evolution under controlled laboratory or field mesocosm conditions, aDNA datasets are more likely to be composed of short degraded DNA fragments, typically with a high degree of genotyping error (Racimo *et al.* 2016). However, aDNA data provide an opportunity to investigate the chronology and tempo of natural selection across evolutionary timescales, which has the advantage of being associated with an interesting narrative (MacHugh *et al.* 2017). A motivation for the analysis is to see whether the statistical developments described here can shed further light on these data. We have found strong evidence showing that the sabino pattern caused by the *SB1* allele at locus *KIT16* has been selectively deleterious, but no evidence showing that the tobiano pattern caused by the *KM1* allele at locus *KIT13* has been selectively advantageous. One explanation for our findings may be that there was a decreasing attractiveness of spotted horses since the Middle Ages due to religious and cultural ideas (Wutke *et al.* 2016). Based on ancient Roman records, solid horses were preferred to spotted horses as the latter were considered to be of inferior quality. During medieval times, spotted horses had a negative connotation after several epidemics, resulting in a lower religious prestige for these patterns. Additionally, people might no longer see the need to distinguish wild (solid) horses from domesticated (spotted) horses as wild populations gradually became scarcer and approached extinction. Further reasons for the spotted horses being selectively deleterious might have been novel developments in weaponry such as the longbow, with these horses being an easier target than solid ones (see Wutke *et al.* 2016, and references therein).

In addition to our method, Terhorst *et al.* (2015) is the only existing approach that can model linked loci and genetic drift for the inference of natural selection from temporal changes in allele frequencies. In Terhorst *et al.* (2015), the underlying population dynamics at multiple linked loci was modeled using the Wright-Fisher model in their HMM framework, and the likelihood computation was carried out by approximating the Wright-Fisher model through a deterministic path with added Gaussian noises, which aims to fit a mathematically convenient transition probability density function by equating the first two moments of the Wright-Fisher model. Such a moment-based approximation works well for many applications when modeling the allele frequencies with intermediate values. However, as soon as the allele frequencies get close to their boundaries 0 or 1 (*i.e.*, allele loss or fixation), the Wright-Fisher model will be poorly approximated due to the infinite support of the Gaussian distribution that will leak probability mass into the values of the allele frequency that are <0 or >1, which is not mathematically possible. This issue becomes more problematic in the inference of natural selection since natural selection is expected to rapidly drive the allele frequencies toward the boundaries.

The MCMC-based method we have developed in this work is built on the standard diffusion limit of the Wright-Fisher model of the stochastic evolutionary dynamics under natural selection at a pair of linked loci, which is shown to be a good approximation even if the allele frequencies get close to their boundaries 0 or 1 (He *et al.* 2020). The diffusion approximation enables our approach to work well for the allele frequencies with all possible values. Our method can handle sampled chromosomes that contain unknown alleles, which one might expect to encounter in real data, especially in aDNA studies. Even though we have only illustrated the utility of our method on aDNA data in this work, our Bayesian inference procedure can also be used to analyze Pool-Seq time series data from E&R experiments, as in Terhorst *et al.* (2015). Given the PMMH algorithm we used to infer natural selection in this work, our method lends itself naturally to joint estimates of the haplotype frequency trajectories of the underlying population without any increase in computational complexity. Furthermore, our method can be readily extended to model a range of complex evolutionary scenarios, *e.g.*, time-varying population size and selection coefficients, as it is built on simulating the Wright-Fisher diffusion.

One limitation of our approach is that we assume that mutant alleles were created before the initial sampling time point. Once a sample contains at least one copy of the mutant allele, we can reasonably assume that the mutant allele arose before the time of that sample. However, in the case of earlier samples without any mutant allele, there is uncertainty in pinpointing when the mutant allele arose. This problem can be remedied by coestimating the allele age as in *e.g.*, Malaspinas *et al.* (2012), Schraiber *et al.* (2016) and He *et al.* (2019), but these works all investigate natural selection at a single locus. Jointly estimating the selection coefficients at linked loci along with the allele ages can be expected to be cumbersome as there are many cases to take into account. In the case of the ancient horse DNA data, we did not wish to make the assumption that the mutant alleles, *KM1* and *SB1*, arose earlier than the time that horses were domesticated. However, we can compare the inference results obtained with different choices of the initial sampling time point (see Tables S14–S16) and reach the same conclusion that there is no strong evidence for the *KM1* allele at locus *KIT13* to be positively selected, but there is strong evidence for the *SB1* allele at locus *KIT16* to be negatively selected.

Our Bayesian statistical framework lends itself to being extended to infer natural selection at multiple linked loci from time series data of allele frequencies, which might further improve the inference results of natural selection. The challenge is that, with the increase in the number of linked loci, modeling the underlying population dynamics subject to natural selection becomes increasingly difficult. For example, there are eight haplotypes to take into account in the case of three linked loci each with two alleles. As a tractable alternative, we can apply our approach to multiple linked loci in a pairwise manner by using the PMMH algorithm within the Gibbs sampler, but this might only work for a small number of linked loci due to the computational cost of our two-locus approach. In practice, it will be necessary to find a good approximation of the Wright-Fisher model for the method to be computationally feasible, which will be the topic of future investigation. An important consideration is to what degree the results of the inference of natural selection are affected by the choice of stochastic or deterministic dynamics for the allele frequency trajectories (Jewett *et al.* 2016), and whether, in many scenarios, approximation with a deterministic model is satisfactory.

## Acknowledgments

We are grateful to the anonymous reviewers and the communicating editors for their helpful comments on the earlier version of this work. This work was funded in part by the Engineering and Physical Sciences Research Council (EPSRC) grant EP/I028498/1 to F.Y.

## Footnotes

Supplemental material available at figshare: https://doi.org/10.25386/genetics.12821585.

*Communicating editor: R. Nielsen*

- Received November 13, 2019.
- Accepted August 15, 2020.

- Copyright © 2020 by the Genetics Society of America