Abstract
Allele frequencies vary across populations and loci, even in the presence of migration. While most differences may be due to genetic drift, divergent selection will further increase differentiation at some loci. Identifying those is key in studying local adaptation, but remains statistically challenging. A particularly elegant way to describe allele frequency differences among populations connected by migration is the F-model, which measures differences in allele frequencies by population specific FST coefficients. This model readily accounts for multiple evolutionary forces by partitioning FST coefficients into locus- and population-specific components reflecting selection and drift, respectively. Here we present an extension of this model to linked loci by means of a hidden Markov model (HMM), which characterizes the effect of selection on linked markers through correlations in the locus specific component along the genome. Using extensive simulations, we show that the statistical power of our method is up to twofold higher than that of previous implementations that assume sites to be independent. We finally evidence selection in the human genome by applying our method to data from the Human Genome Diversity Project (HGDP).
MIGRATION is a major evolutionary force homogenizing evolutionary trajectories of populations by promoting the exchange of genetic material. At some loci, however, the influx of new genetic material may be modulated by selection. In case of strong local adaptation, for instance, migrants may carry maladapted alleles that are selected against. Identifying loci that contribute to local adaptation is of major interests in evolutionary biology because these loci are thought to constitute the first step toward ecological speciation (e.g., Wu 2001; Feder et al. 2012) and allow us to understand the role of selection in shaping phenotypic differences between populations and species (e.g., Bonin et al. 2006; Fournier-Level et al. 2011).
A simple, yet flexible and useful, approach to identify loci contributing to local adaptation is to scan the genome using statistics that quantify divergence between populations. One frequently used statistic is FST, which measures population differentiation, and loci with greatly elevated FST have been reported for many population comparisons (e.g., Jones et al. 2012; Andrew and Rieseberg 2013; Stölting et al. 2013). While other statistics measuring absolute divergence (Cruickshank and Hahn 2014) or incongruence between a population tree and locus-specific genealogies (Durand et al. 2011; Peter 2016) may be more suited in some situations, genome scans suffer from two inherent limitations. First, multiple evolutionary scenarios may explain the deviations in those statistics, making interpretation difficult (e.g., Cruickshank and Hahn 2014; Eriksson and Manica 2012). Second, the definition of outliers is arbitrary, allowing for the detection of candidate loci only. Indeed, loci also vary in their divergence between populations that were never subjected to selection, but outlier approaches would still identify outliers.
Multiple methods have thus been developed that explicitly incorporate the stochastic effects of genetic drift. A first important step to improve the reliability of outlier scans was the proposal to compare observed values of such statistics against the distribution expected under a null model. Among the first, Beaumont and Nichols (1996) proposed to obtain the distribution of FST through simulations performed under an island model. While the idea to evidence selection by comparing FST to its expectations is far from new (e.g., Lewontin and Krakauer 1973), the difficulty of properly parameterizing the null model was quickly realized (e.g., Nei and Maruyama 1975). The success of the method by Beaumont and Nichols (1996) relies on tailoring the parameters of the underlying island model to match the observed heterozygosity at each locus—an approach that is also easily extended to structured island models (Excoffier et al. 2009).
A more formal approach is given by means of the F-model (Rannala and Hartigan 1996; Balding 2003; Falush et al. 2003; Gaggiotti and Foll 2010), under which allele frequencies are measured by locus and population specific coefficients that reflect the amount of drift that occurred in population j at locus l since its divergence from a common ancestral population. In the case of biallelic loci, the current frequencies
are then given by a beta distribution (Beaumont and Balding 2004)
(1)where pl are the frequencies in the ancestral population and θlj is given by
It is straightforward to extend this model to account for different evolutionary forces that affect the degree of genetic differentiation. For instance, Beaumont and Balding (2004) proposed to partition the effects of genetic drift and selection into locus-specific and population-specific components αl and βj, as well as a locus-by-population specific error term γij:
(2)Loci with αl ≠ 0 are interpreted to be affected by either balancing (αl < 0) or divergent (αl > 0) selection, either because they are targets of selection or through hitch-hiking (Beaumont and Balding 2004). Such loci may be identified by contrasting models with αl = 0 or αl ≠ 0 for each locus l, either through Bayesian variable selection (Riebler et al. 2008) or via reversible-jump MCMC, as is done in the popular software BayeScan (Foll and Gaggiotti 2008).
A common problem of this, and many other, genome-scan methods is the assumption of independence among loci, which is easily violated when working with genomic data. By evaluating information from multiple linked loci jointly, however, the statistical power to detect outlier regions is likely increased considerably. Indeed, even a weak signal of divergence may become detectable if it is shared among multiple loci. Similarly, false positives may be avoided as their signal is unlikely to be shared with linked loci.
Unfortunately, fully accounting for linkage is often statistically challenging as well as computationally very costly. One solution is to split the problem by first inferring haplotypes for each sample, and then performing selection scans on the haplotype structure. The extended haplotype homozygosity (EHH) and its derived statistics (Sabeti et al. 2002; Voight et al. 2006; Sabeti et al. 2007; Tang et al. 2007), for instance, identify shared haplotypes of exceptional length. More recently, Fariello et al. (2013) introduced methods that identify haplotype clusters with particularly large frequency differences between populations and showed that using haplotypes rather than single markers increases power substantially.
An alternative solution is to model linkage through the autocorrelation of hierarchical parameters along the genome, which does not require knowledge of the underlying haplotype structure. Boitard et al. (2009) and Kern and Haussler (2010), for instance, proposed a genome-scan method in which each locus was classified as selected or neutral, and then used a Hidden Markov Model (HMM) to account for the fact that linked loci likely belonged to the same class, while ignoring autocorrelation in the genetic data itself.
Here, we build on this idea to develop a genome-scan method based on the F-model. While an HMM implementation of the F-model was previously proposed to deal with linked sites when inferring admixture proportions (Falush et al. 2003), we use it here to characterize autocorrelations in the strength of selection αl among linked markers. As we show using both simulations and an application to human data, aggregating information across loci results in an increase in power of up to twofold at the same false-discovery rate (FDR).
Materials and Methods
A model for genetic differentiation and observations
We assume the classic F-model, in which J populations diverged from a common ancestral population. Since divergence, each population experienced genetic drift at a different rate. We quantify this drift of population at locus
by θjl. We further assume each locus to be biallelic with ancestral frequencies pl, in which case the current frequencies
are given by a beta distribution (Beaumont and Balding 2004), as shown in (2). We thus have
(3)where ql = 1 − pl,
,
and
is the gamma function.
Let njl denote the allele counts in a sample of Njl haplotypes from population j at locus l, which is given by a binomial distributionand hence
(4)Equations (3) and (4) combine to a beta-binomial distribution

Model of selection
We decompose θjl into a population-specific component βj shared by all loci, and a locus-specific component αl shared by all populations:Here, the locus-specific component αl quantifies an excess or dearth of differentiation, which is attributed to the effect of either divergent or balancing selection, respectively (Beaumont and Balding 2004). Note that we adopt here the formulation of Foll and Gaggiotti (2008) and omit the error term γij of the original model of Beaumont and Balding (2004) shown in (2), as there is generally not enough information to estimate these parameters from the data (Beaumont and Balding 2004).
To account for autocorrelation among the locus-specific component, we propose to discretize αl = α (sl), where are the states of a ladder-type Markov model with m = 2smax + 1 states such that
(6)for some positive parameters αmax. The transition matrix of this Markov model shall be a finite-state birth-and-death process
(7)with elements
denoting the probabilities to go from state i at locus l − 1 to state j at locus l given the strength of autocorrelation measured by the positive scaling parameter κ and the known distance dl between these loci, either in physical or in recombination space. Here, L is the m × m generating matrix
where the middle row at position smax + 1 reflects neutrality, and is given by the element
As exemplified in Figure 1, the two parameters μ and ν control the distribution of sites affected by selection (i.e., having αl ≠ 0) in the genome, with ν affecting the number of selected regions and μ their extent and selection strength, with higher values leading to more sites affected selection. It is important to note that we do not assume all sites with αl ≠ 0 to be targets of selection. Instead, many will be linked to a target of selection and experience αl ≠ 0 due to hitch-hiking.
(A) Expected proportion of neutral sites as a function of rates μ and ν. (B, C) Example paths of αl along 1000 loci simulated at a distance of dl = 100 with smax = 10 positive and negative states up to αmax = 3.0. Autocorrelation among loci was simulated with log(κ) = −3.0, v = 0.02, and μ = 0.91 (B, square) or μ = 0.74 (C, circle), respectively. The two cases correspond to an expected proportion of 20% and 10% of the genome under selection, as marked in (A).
The stationary distribution of this Markov chain is given bywith
Note that, as κ → ∞, our model approaches that of Foll and Gaggiotti (2008) implemented in BayeScan, but with discretized αl.
Hierarchical island models
Hierarchical island models, first introduced by Slatkin and Voelm (1991), address the fact that divergence might vary among groups of populations. They were previously used to infer divergent selection, both using a simulation approach (Excoffier et al. 2009) as well as in the case of F-models (Foll et al. 2014). Here, we describe how our model is readily extended to additional hierarchies.
Consider G groups each subdivided into Jg populations with population-specific allele frequencies that derive from group-specific frequencies pgl as described above with group-specific parameters μg, vg, and κg. Analogously, we now assume group-specific frequencies to have diverged from a global ancestral frequency Pl according to locus-specific and group-specific parameters Θgl. Specifically,
such that
(8)where Ql = 1 − Pl and qgl = 1 − pgl. The parameter Θgl is given by
(9)As above, Bg quantifies group specific drift,
are the states of a Markov model with m states and transition matrix
with parameters μ and ν, a positive scaling parameter κ, and A (Sl) and Amax defined as in (6). Hence, we assume independent HMM models of the exact same structure at both levels of the hierarchy, as outlined in Figure 2. Additional levels could be added analogously.
A directed acyclic graph (DAG) of the proposed model with two hierarchical levels.
Inference
We developed a Bayesian inference scheme for the parameters of the proposed model using a Markov chain Monte Carlo (MCMC) approach with Metropolis–Hastings updates, as detailed in the Supplementary Material. As priors, we usedFollowing Beaumont and Balding (2004), we used μb = −2 and
throughout. We further set ap = bp = 1.
To identify candidate regions under selection, we used MCMC samples to determine the FDRsfor divergent and balancing selection, respectively, where
and
denote the full data.
Implementation
We implemented the proposed Bayesian inference scheme in the easy-to-use C++ program Flink.
Given the heavy computational burden of the proposed model, we introduce several approximations. Most importantly, we group the distances dl into E + 1 ensembles such that el = log2 dl, and use the same transition matrix Q(2e) for all loci in ensemble e. We then calculate Q(1) for the first ensemble using the computationally cheap yet accurate approximation
with
where D = 2smax + 1 is the dimensionality of the transition matrix (Ferrer-Admetlla et al. 2016). The transition matrices of all other ensembles is obtained through the recursion
(See Supplementary Information for other details regarding the implementation).
Data availability
The authors affirm that all data necessary for confirming the conclusions of the article are present within the article or available from repositories as indicated. The source-code of Flink is available through the git repository https://bitbucket.org/wegmannlab/flink, along with detailed information on its usage. Additional scripts used to conduct simulations are found at https://doi.org/10.5281/zenodo.3949763. Supplemental material available at figshare: https://doi.org/10.25386/genetics.13077284.
Results
Comparison with BayeScan
Simulation parameters:
To quantify the benefits of accounting for autocorrelation in the locus specific components αl among linked loci, we first compared our method implemented in Flink against the method implemented in BayeScan (Foll and Gaggiotti 2008) on simulated data. All simulations were conducted under the model laid out above for a single group, using routines available in Flink and with parameter settings similar to those used in (Foll and Gaggiotti 2008). Specifically, we focused on a reference simulation in which we sampled N = 50 haplotypes from J = 10 populations with βj chosen such that in the neutral case (αl = 0). Following Foll and Gaggiotti (2008), we simulated all
and about 20% of sites affected by selection (i.e., with αl ≠ 0) by setting μ = 0.91 and v = 0.02. We further set smax = 10 (resulting in m = 21 states) and
and simulated 103 loci for each of 10 chromosomes, with a distance of 100 positions between adjacent sites and strength of autocorrelation log(κ) = −3. We then varied the number of populations J, the sample size N,
or the strength of autocorrelation κ individually, while keeping all other parameters constant (Table 1). We further added a case without linkage (i.e., κ → ∞) by simulating each locus on its own chromosome.
To infer parameters with Flink, we set smax and αmax to the true values and ran the MCMC for 7·105 iterations, of which we discarded the first 20·105 as burn-in. During the chain, we recorded parameter values every 100 iterations as posterior samples. To infer parameters with Bayescan, we used version 2.1 and set the prior odds for the neutral model to 50, which we found to result in the same power as Flink in the reference simulation (see below) and in the absence of linkage (κ → ∞). We identified loci under selection at an FDR threshold of 5% for both methods.
Power of inference:
We first evaluated the power of Flink in inferring the hierarchical parameters βj, ν, μ, and κ. As shown through the distributions of posterior means across all simulations, these estimates were very accurate and unbiased, regardless of the parameter values used in the simulations (Figure 3). This suggests that the power to identify selected loci is not limited by the number of loci we used to infer hierarchical parameters.
Boxplot of the parameters β1 (left), ν and μ (center), and log(κ) (right). The values are obtained from the mean of the posterior distributions obtained using Flink on the 10 simulations run for each of the set of parameters reported in Table 1. The red dotted lines show the true values of the respective parameters.
We next studied the impact of the sample size and the strength of population differentiation on the power (the true positive rate) to identify loci affected by selection (i.e., loci with αl ≠ 0). In line with findings reported by Foll and Gaggiotti (2008), power generally increased with , the number of sampled haplotypes, and the number of sampled populations (Figure 4, A–C). Larger sample sizes or stronger differentiation was particularly relevant for detecting loci under balancing selection, for which the power was generally lower and virtually zero at low differentiation
or if only few populations were sampled (J = 2). Importantly, the FDR was below the chosen 5% threshold in 100% and 98.6% of all simulations conducted for loci identified as affected by divergent and balancing selection, respectively (Supplemental Material, Figure S1). The false positive rates (FPR) for these classes was < 0.1% in 98.6% and 97.1% of all simulations (see Figure 4 for neutral sites).
The true positive rate (power) in classifying loci as neutral (black) or under divergent (orange) or balancing selection (blue) as a function of the FST between populations (A), the number of haplotypes N (B), the number of populations J (C), and the strength of autocorrelation κ (D). Lines indicate the mean and range of true positive rates obtained with Flink (solid) and BayeScan (dashed) across 10 replicate simulations. Filled dots and the vertical gray line indicate the reference simulation shown in each plot.
Compared to BayeScan run on the same set of simulations, Flink had a higher power at the same FDR across all simulations, and often considerably so, unless if very many populations were sampled (Figure 4). If J = 10 populations were sampled, for instance, the power of Flink was about 0.2 higher for loci under divergent selection, and even up to 0.4 higher for those under balancing selection (Figure 4, A and B). Importantly, this increase in power described here is fully explained by Flink accounting for autocorrelation among the αl values as we chose the prior odds in BayeScan to result in the same power if the strength of autocorrelation vanishes (i.e., κ → ∞). Exploiting information from linked sites to identify divergent or balancing selection can thus strongly increase power, certainly if linkage extends to many loci. This is maybe best illustrated by the much higher power of Flink to identify loci under balancing selection at low differentiation (, Figure 4A), in which case even many neutral loci are expected to show virtually no difference in allele frequencies and only an aggregation of loci with a subtle reduction in
can be interpreted as a reliable signal for a target of selection in the region (Foll and Gaggiotti 2008).
Runtime:
Thanks to careful optimization, there is little to no overhead of our implementation compared to that of BayeScan. On the reference simulation of 104 loci from 10 populations, for instance, Flink took on average 130 min on a modern computer if calculations were spread over four CPU cores. On the same data, BayeScan took 361 min. However, we note that comparing the two implementations is difficult due to many settings that strongly impact run times, such as the number of iterations or the use of pilot runs in BayeScan. Without pilot runs, the run time of BayeScan reduced to 182 min on average for the default number of iterations (105 including burn-in). In the same time, Flink runs for close to 106 iterations, but also requires more to converge.
But since computation times scale linearly with the number of loci, they remain prohibitively slow for whole genome applications in a single run. However, computations are easily spread across many computers by analyzing the genome in independent chunks such as for each chromosome or chromosome arm independently. This is justified because (1) linkage does not persist across chromosome boundaries and is usually weak across the centromere, and (2) because our simulations indicate that 104 polymorphic loci were sufficient to estimate the hierarchical parameters accurately.
Effect of model misspecification
The F-model makes the explicit assumption that the allele frequencies in a structured population can be characterized by a multinomial Dirichlet distribution. This distribution is appropriate for a wide range of demographic models, but not if some pairs of populations share a more recent ancestry than others (Beaumont and Balding 2004; Excoffier et al. 2009). Unsurprisingly, several previous studies found high FPRs when challenging BayeScan with models of isolation-by-distance (IBD), recent range expansions, recent admixture, or asymmetric divergence (e.g., Lotterhos and Whitlock 2014; Luu et al. 2017). These high FPRs are partially mitigated by choosing higher prior odds (e.g., 50 as used here, Lotterhos and Whitlock 2014) or when using the hierarchical version of BayeScan (Foll et al. 2014), particularly in case of asymmetric divergence. In the case of a recent range expansion or recent admixture, however, the F-model is unlikely to be appropriate and other methods have been shown to outperform BayeScan, in particular hapFLK (Fariello et al. 2013) and pcadapt (Luu et al. 2017).
Here, we investigated how the sensitivity of the linkage-aware implementation of an F-model in Flink is affected by such model misspecifications. We focused on the case of a recent range expansion as this model is difficult to accommodate even with a hierarchical F-model. Using quantiNemo (Neuenschwander et al. 2018), we simulated genomic data from 11 populations with carrying capacity 103 each that form a one-dimensional stepping-stone model. Initially, only the left-most population contained individuals that then colonized the remaining populations through symmetric dispersal between neighboring populations at rate 0.1 and with a population growth rate of 0.1. After 103 generations, 20 diploid individuals were sampled from each population. We simulated 10 independent chromosomes of 104 neutral loci each with initial allele frequencies drawn from a Beta distribution We run these simulations for different recombination rates by setting the total length of the genetic map per chromosome to either 1, 10, or 100 cM. We then inferred selection on all loci still polymorphic at the end of the simulations with both BayeScan and Flink for 10 replicates per set.
Across all simulations, BayeScan identified no locus as affected as balancing selection and only 0.16% as affected by divergent selection. This low FPR is consistent with the generally low power of BayeScan to identify loci affected by balancing selection as well as the used prior odds of 50 in favor of the neutral model. Similar results were obtained with Flink on simulations with high recombination (genetic map of 100 cM), in which case no linkage information could be exploited. Across these simulations, Flink identified no locus as affected by balancing selection and only 0.14% as affected by divergent selection. The number of false positives, however, was rising sharply with decreasing recombination rate. At a genetic map of 10 cM, 5.0% and 2.8% of all loci were wrongly inferred as affected by balancing and divergent selection, respectively. At a genetic map of only 1 cM and, hence, tight linkage, the corresponding FPRs were 22.7% and 7.5%, respectively. These results thus highlight that the power gained by Flink in exploiting linkage information also translates into a higher FPR in case the model is misspecified. Under such scenarios, other methods such as hapFLK (Fariello et al. 2013) or PCAdapt (Luu et al. 2017) are thus more appropriate.
Application to humans
To illustrate the usefulness of Flink, we applied it to SNP data of 46 populations analyzed as part of the HGDP (Rosenberg et al. 2002, 2005) and available at https://www.hagsc.org/hgdp/files.html. We then used Plink v1.90 (Chang et al. 2015) to transpose the data into vcf files, and used the liftOver tool of the UCSC Genome Browser (Kent et al. 2002) to convert the coordinates to the human reference GRCh38.
We divided the 46 populations into six groups (Table 2) of between 4 and 15 populations each according to genetic landscapes proposed by Peter et al. (2017). We then inferred divergent and balancing selection using the hierarchical version of Flink on all 22 autosomes, but excluded 5 Mb on each side of the centromere and adjacent to the telomeres. The final data set consisted of 563,589 SNPs. We analyzed each chromosome arm individually with αmax = 4.0, smax = 10 and using an MCMC chain with 7·105 iterations, of which we discarded the first 2·105 as burn-in. Estimates of hierarchical parameters are shown in Figure S2 and the locus-specific FDRs qd (l) and qb (l) are shown for all loci, all groups as well as the higher hierarchy in Figures S4–S42. All regions identified as potential targets for selection are further detailed in the Supplementary Files. As summarized in Table 2, we discovered between 759 and 1889 and between 433 and 1735 candidate regions for divergent and balancing selection, respectively, spanning together about 10% of the genome.
Comparison with BayeScan
We first validated our results by running BayeScan on the same data. We then identified divergent regions as continuous sets of SNP markers that passed an FDR threshold of 0.01 or 0.05 for each method, and determined the FDR threshold necessary to identify at least one locus within these regions by the other method. To ensure the observed differences between methods is due to accounting for linkage only, we used the hierarchical version BayeScanH (Foll et al. 2014) that also implements the same hierarchical island model as Flink.
As shown in Figure 5A for selected regions among Europeans, the majority of regions identified by BayeScanH were replicated by Flink at small FDR thresholds. In contrast, most of the regions identified by Flink were not replicated by BayeScanH, in line with a higher statistical power for the former. Visual inspection indeed revealed that for most regions identified by Flink but not BayeScanH, the latter also showed a signal of selection at multiple markers, each of which not passing the FDR threshold individually (see Figure 5B for examples). In contrast, sites identified by BayeScanH but not Flink usually consisted of a signal at a single site, suggesting many of those are likely false positives (Figure 5C).
(A) The fraction of regions identified as divergent among Europeans by Flink (green) and BayescanH (black) at a false discovery rate (FDR) of 0.01 (solid) and 0.05 (dashed) also identified by the other method at different FDR. (B–D) Examples of regions found under divergent selection by Flink (B), BayeScanH (C), or both (D) among Europeans. Dashed lines indicate the 0.01 FDR threshold.
Results were similar for the other groups (Figure S3), but the correspondence between the methods was higher for the African group and considerably lower for the American group, likely due to the different patterns of divergence among populations (Figure S2).
Comparison with a recent scan for selective sweeps
Positive selection acting in a subset of populations may also lead to an increase in population differentiation (Nielsen 2005). We therefore compared our outlier regions also to those of a recent scan for positive selection that combined multiple test for selection using a machine learning approach (Sugden et al. 2018). Among the 593 candidate loci reported for the CEU population of the 1000 Genomes Project (1000 Genomes Project Consortium et al. 2015) and overlapping the chromosomal segments studied here, 293 loci (49.4%) fall within a region we identified as under divergent selection either among European populations (154 loci), at the higher hierarchy (132 loci), or both (7 loci).
To test if this overlap exceeds random expectations, we generated 10,000 bootstrapped data sets by randomly sampling the same amount of loci among all those found polymorphic in the 1000 Genome Project CEU samples and within the chromosomal segments studied here. We then determined the overlap with our outlier regions for each data set. On average, 46.6 loci overlapped with our regions identified among European populations or at the higher hierarchy. Importantly, the largest overlap observed among the bootstrapped data set (72 loci) was much smaller than that observed (293 loci, P < 10−4).
Example: the LCT region
As an illustration, we show the FDRs qd (l) and qb (l) for 30 Mb around the LCT gene in Figure 6 for the higher hierarchy as well as the European, Middle Eastern, and East Asian group. The LCT gene is a well studied target of positive selection that has acted to increase lactase persistence in several human populations, including Europeans (Bersaglieri et al. 2004; Burger et al. 2020). Lactase persistence varies among Europeans and decreases on a roughly north–south cline (Bersaglieri et al. 2004; Burger et al. 2007; Leonardi et al. 2012), consistent with the signal of divergent selection we detected among European populations (Figure 6). In line with previous findings (e.g., Grossman et al. 2013), we detected a signal of divergent selection among Europeans also in various genes around LCT, most notably in R3HDM1 but also MIR128-1, UBXN4 and DARS. In contrast, we detected no such signal for the other groups.
Signal of selection around the LCT gene on Chromosome 2q. The orange and blue lines indicate the locus-specific FDR for divergent (orange) and balancing (blue) selection, respectively. The black dashed line shows the 1% FDR threshold. A zoom of the highlighted region is shown on the right indicating the position of several genes: R3HDM1 (R3), MIR128-1 (MI), UBXN4 (UB), MCM6 (MC), DARS (DA), and DARS-AS1 (DA1). The entire Chromosome 2q is shown in Figure S7.
Discussion
Genome scans are common methods to identify loci that contribute to local adaptation among populations. Here, we extend the particularly powerful method implemented in BayeScan (Foll and Gaggiotti 2008) to linked sites.
Accounting for linkage in population genetic methods, while desirable, is often computationally hard. We propose to alleviate this problem by modeling the dependence among linked sites through autocorrelation among hierarchical parameters, rather than the population allele frequencies or haplotypes themselves. In the context of genome scans, this has been previously used successfully to classify each locus as selected or neutral using HMMs (Boitard et al. 2009; Kern and Haussler 2010). Here, we extend this idea by modeling autocorrelation among the degree of spatial differentiation acting at individual loci. While ignoring autocorrelation at the genetic level certainly leads to a loss of information, the resulting method remains computationally tractable. And as we showed here with simulations and an application to human data, the resulting method features much improved statistical power compared to BayeScan—a similar method that ignores linkage completely.
This is particularly evident for loci with more similar allele frequencies among populations than expected by the genome-wide divergence. These loci are generally interpreted as being under balancing selection (Beaumont and Balding 2004; Foll and Gaggiotti 2008), but may also be the result of purifying selection restricting alleles from reaching high allele frequencies. Given the large number of loci we inferred in this class from the HGDP data (about 5% of the genome), we speculate that balancing selection is unlikely the main driver, and caution against overinterpreting these results. But we note that the empirical FDR for loci under balancing selection was extremely low in our simulations, except if the assumptions underlying the F-model was violated.
A benefit of accounting for autocorrelation among locus-specific effects was previously postulated by Guo et al. (2009), who proposed a conditional autoregressive (CAR) prior on αl such thatwhere α−l denotes the collection of all other αm, m ≠ l,
indicates the covariance between loci l and m, which is assumed to decrease exponentially with distance, and
While Guo et al. (2009) did not evaluate the benefit of their CAR implementation on the power of selection inference, they found that it was a better fit to high resolution data. Here, we show that the power increase by exploiting autocorrelation among loci is substantial: of all regions identified as under divergent selection by Flink, less than half were also identified by BayeScan, despite evidence that these consist mostly of true outliers.
In this context, it is important to note that due to computational challenges, Guo et al. (2009) suggested to run their method on low-resolution data with few markers first, and then to apply the CAR version on inferred candidate regions only. As our analysis suggests, such an approach would likely fail to harvest the full benefit of accounting for autocorrelation among locus-specific parameters. Running Flink on high-resolution data are possible because the first-order Markov assumption on locus-specific effects αl allows for cheap MCMC updates at a single locus that does not require a recalculation of the prior on the full vector . Unfortunately, however, no implementation of the method by Guo et al. (2009) is available for a direct comparison.
Our proposed model has yet another computational advantage: while the hierarchical parameters of the exponential decay in the model by Guo et al. (2009) need to be fixed upfront due to numerical instabilities, the hierarchical parameters of the discrete Markov model proposed here are all estimated well if sufficient sites are provided. Our simulations indicated that 104 polymorphic loci were sufficient, based on which we decided to parallelize the analysis of the human data by chromosome arm. Smaller windows may be considered, but the model may struggle to differentiate between population-specific and locus-specific components if too few consecutive loci are used. The window analyzed should therefore span significantly more loci than are expected to be affected by selection within an outlier region. But note that the model does not make any assumption regarding the spacing of loci within the analyzed window, nor does it assume that all individuals have data: it accounts for both the distances between loci as well as the locus-specific sample size explicitly. Hence, Flink may well be used on data obtained with reduced representation techniques such as RAD-seq, albeit with little benefit over BayeScan if loci are in weak linkage only.
Another major difference between Flink and the CAR method of Guo et al. (2009) is that the former discretizes the locus-specific effects αl. While such a discretization leads to a loss of precision in estimating locus specific effects, it allows to directly calculate a FDR to identify outlier loci at any desired level of confidence, similar to BayeScan or the method of Riebler et al. (2008). In contrast, the method by Guo et al. (2009) identifies outliers indirectly as those for which the posterior distributions on θl are significantly different from the distribution of θl values under the inferred hyper-parameters. Importantly, the discretization seems to come at no cost on power: in our simulations, Flink and BayeScan had virtually identical power if we simulated unlinked data.
An obvious drawback of modeling the locus-specific selection coefficients as a discrete Markov Chain is that, for most candidate regions we detected, multiple loci showed a strong signal of selection, making it difficult to identify the causal variant. However, once a region is identified, estimates of FST can be obtained for each locus individually to identify the locus with the strongest signal. Complementary methods such as SWIF(r) (Sugden et al. 2018) may further be used on the identified regions to infer locus-specific selection coefficients or other statistics informative about the targets of selection.
Finally, we note that the implementation provided through Flink allows to group populations hierarchically. Accounting for multiple hierarchies was previously shown to reduce the number of false positives in FST based genome scans (Excoffier et al. 2009) and also applied in an F-model setting (Foll et al. 2014). Aside from accounting for structure more accurately, a hierarchical implementation also allows for genome-wide association studies (GWAS) with population samples. In such a setting, each sampling location would constitute a “group” of, say, two “populations”, one for each phenotype (e.g., cases and controls). The parameters at the higher hierarchy will then accurately describe population structure and loci associated with the phenotype will be identified as those highly divergent between the two “populations”. A natural assumption would then be that the locus-specific coefficients αl are shared among all groups, i.e., that they are governed by a single HMM. While we have not made use of such a setting here, we note that it is readily available as an option in Flink.
Acknowledgments
We are grateful for the very constructive feedback of two anonymous reviewers and the editor Nick Barton on an earlier version of this manuscript. This study was supported by two Swiss National Foundation grants to DW with numbers 31003A_149920 and 31003A_173062.
Footnotes
Supplemental material available at figshare: https://doi.org/10.25386/genetics.13077284.
Communicating editor: N. Barton
- Received May 17, 2020.
- Accepted October 3, 2020.
- Copyright © 2020 by the Genetics Society of America