# Estimating Barriers to Gene Flow from Distorted Isolation-by-Distance Patterns

^{*}Institute of Science and Technology Austria, Klosterneuburg A-3400, Austria^{†}Department of Botany and Biodiversity Research, University of Vienna, A-1030, Austria

- 1Corresponding author: Institute of Science and Technology Austria, Am Campus 1, 3400 Klosterneuburg, Austria. E-mail: hringbauer{at}ist.ac.at

## Abstract

In continuous populations with local migration, nearby pairs of individuals have on average more similar genotypes than geographically well-separated pairs. A barrier to gene flow distorts this classical pattern of isolation by distance. Genetic similarity is decreased for sample pairs on different sides of the barrier and increased for pairs on the same side near the barrier. Here, we introduce an inference scheme that uses this signal to detect and estimate the strength of a linear barrier to gene flow in two dimensions. We use a diffusion approximation to model the effects of a barrier on the geographic spread of ancestry backward in time. This approach allows us to calculate the chance of recent coalescence and probability of identity by descent. We introduce an inference scheme that fits these theoretical results to the geographic covariance structure of bialleleic genetic markers. It can estimate the strength of the barrier as well as several demographic parameters. We investigate the power of our inference scheme to detect barriers by applying it to a wide range of simulated data. We also showcase an example application to an *Antirrhinum majus* (snapdragon) flower-color hybrid zone, where we do not detect any signal of a strong genome-wide barrier to gene flow.

- barriers to gene flow
- isolation by distance
- identity by descent
- demographic inference

MANY populations are distributed across geographically extended habitats that are sometimes interrupted by barriers to gene flow. They can arise due to physical obstacles that reduce migration, but can also be caused by genetic incompatibilities, which reduce gene flow across a hybrid zone (Barton 1979). Barriers can prevent locally adapted populations from being swamped by dispersal and they can facilitate divergence, ultimately leading to speciation. Therefore, they play a central role not only in conservation, but also in evolutionary biology and ecology. As direct observations of individual movement and reproduction are time consuming, expensive, and can only give a snapshot in time, there is much interest in indirect methods that infer such barriers from observed geographic genetic structure.

Such methods to detect barriers from genetic data can be grouped into two distinct approaches (Guillot *et al.* 2009): clustering methods, which detect geographic genetic discontinuities between populations by grouping individuals into population units based on genetic similarity (Dupanloup *et al.* 2002; Falush *et al.* 2003; Guillot *et al.* 2005); and edge-detection methods, which identify areas of sharp genetic change (Womble 1951; Manni *et al.* 2004; Cercueil *et al.* 2007). Neither of these approaches is directly linked to any spatial population genetic model. They can therefore infer the existence of a barrier, but cannot give meaningful and biologically interpretable estimates of its strength. In addition, these approaches are often confounded by isolation-by-distance patterns (Safner *et al.* 2011; Meirmans 2012), whereby individuals nearby are more similar than distant individuals (Wright 1943) due to recent coancestry. While the description of this effect has a long history in theoretical population genetics of homogeneous populations (Malécot 1948; Slatkin 1993; Rousset 1997; Hardy and Vekemans 1999; Barton *et al.* 2002), it has not been included into a practically applicable method to estimate the strength of a barrier to gene flow.

Here, we fill this gap and introduce a method that infers the strength of a barrier in a two-dimensional population by fitting a population genetic model. Our method uses the fact that a barrier to gene flow distorts classical isolation-by-distance patterns (Figure 1). Based on the theoretical work of Nagylaki (1988), Barton (2008) constructed a theoretical framework. He showed that in two spatial dimensions, where fluctuations of allele frequencies are more localized than in one dimension, these effects of a barrier on allele-frequency fluctuations can already be significant for intermediate barrier strengths. This signal therefore has great potential for demographic inference. The derivation of Barton (2008) also shows that the effect of a barrier depends primarily on short-lived, localized fluctuations. In general, isolation-by-distance patterns equilibrate relatively quickly and depend mostly on recent demography (Barton *et al.* 2013; Aguillon *et al.* 2017). Therefore, an inference scheme based on distorted isolation-by-distance patterns infers contemporary barriers to gene flow, and should be robust to confounding effects of ancestral structure.

Here, we first expand previous theoretical results that describe the effect of a barrier on classical isolation-by-distance patterns (Nagylaki 1988; Barton 2008). We introduce a model where ancestry diffuses backward in time and is partially reflected by a barrier. This allows us to numerically calculate the probability of recent coancestry, which can then be fitted to genetic data. As single nucleotide polymorphism (SNP) data sets are currently widely used, we develop and implement ways to fit such biallelic genetic markers.

We test our inference scheme on synthetic data simulated under an explicit population genetics model and investigate how it is affected by confounding factors, for instance in a scenario of secondary contact. We also show a practical application in which we apply our inference scheme to a hybrid zone population of *Antirrhinum majus*, in which a sharp transition in flower color and a causal flower-color gene occurs (Whibley *et al.* 2006). We apply our method to test whether there is also a genome-wide barrier to contemporary gene flow. To this end, we analyze a data set of 12,389 individuals and 60 suitable SNP markers.

## Materials and Methods

We first outline the model underlying our inference scheme and discuss its assumptions, and then describe how our method fits this model to observed genotype data. In brief, we use a diffusion approximation for the spread of ancestry to calculate the probability of recent identity by descent between pairs of samples. We then fit our model to data by finding the demographic parameters that maximize the fit of observed homozygosity between all sample pairs.

### Model

No model can capture all complexities of the real demographic history of a population. Therefore, the aim is not to have a mathematically exact model, but one that robustly captures general patterns of spatial fluctuations of allele frequencies. We use a model of a two-dimensional continuous habitat that is interrupted by a barrier and assume that the demographic parameters are the same on both sides. In short, we calculate the chance of pairwise coalescence before a long-distance migration or mutation event. We use a diffusion model to trace lineages backward in time and assume that rare long-distance migration events, which counteract the buildup of local allele-frequency fluctuations, occur at a constant rate. To calculate the equilibrium identity-by-descent pattern, *i.e.*, the probability that two lineages coalesce before the occurrence of a rare long-distance migration or mutation event, we first derive the coalescence probability at specific times *t* in the past and then integrate over *t*.

Our model is closely related to previous theoretical treatments of allelic identity by state in presence of a barrier to gene flow (Nagylaki 1988; Barton 2008). For a population occupying a linear habitat, Nagylaki (1988) derived continuous equations for identity by state by taking the limit of a model of linear demes that exchange migrants (the so-called stepping-stone model). Barton (2008) expanded this by solving an analogous equation for two-dimensional populations. His formulas are given as a numerical Fourier transform that diverges for nearby individuals. These equations for two-dimensional populations are formally problematic as they were not obtained by rescaling (which is impossible in two spatial dimensions), but Barton (2008) demonstrated that the solution is in close agreement with the solutions from a discrete stepping-stone model for all but very close distances.

Here, we base our inference scheme on a different approach. We use a diffusion approximation to describe the spread of ancestry backward in time. While the results are formally equivalent, we found our model to be computationally more robust and, most importantly, more efficient. This approach allows us to apply our method to sample sizes of hundreds to thousands of individuals, and dozens to hundreds of loci.

#### Diffusion approximation:

We model the spread of ancestry using a geographic diffusion approximation, which has a long history in population genetics (Fisher 1937; Wright 1943; Malécot 1948; Nagylaki 1978). Tracing a lineage of one locus backward in time, the total spatial movement is the sum of many independent migration events. If these events are sufficiently uncorrelated, the central limit theorem establishes that the total displacement tends toward a normal distribution. Therefore, the overall spread of ancestry can be approximated as a random walk process. This approximation is accurate as long as rare large-scale events do not significantly influence the movement of ancestral lineages. The diffusion approximation is expected to be most accurate on recent to intermediate timescales, on which large-scale events such as colonizations often play only a minor role.

In the absence of a barrier, the process is a free diffusion and the probability density function of finding an ancestor at position *x* at time *t* back along a given axis is given by a Gaussian probability density function around the current position and variance that increases linearly backward in time:(1)The dispersal rate *σ* describes the speed of the spread of ancestry. In the case of a homogeneous density of individuals across the landscape, the backward dispersal probability density equals the probability density of lineages moving forward in time. In this particular case, the dispersal rate can be interpreted as the axial variance of the one generational dispersal kernel (Rousset 1997). The diffusion approximation is fully determined by the equation:(2)with initial condition:

#### Partial barrier to gene flow:

We model a barrier as a partially permeable barrier to diffusion of ancestry. For a barrier at the following interface boundary conditions have to be supplied (Grebenkov *et al.* 2014):(4)The first line describes the constancy of the flux across the barrier. For there is no flux across the barrier and the barrier is infinitely strong. On the other hand, the case implies the continuity of the probability density across the barrier and the solution reduces to free diffusion. Comparing to the differential equation 54 in Nagylaki (1988), which is derived by rescaling a stepping-stone model, gives an intuitive interpretation of *κ*: This parameter corresponds to the fraction of successful migrants across a barrier if demes are spaced one dispersal unit apart. The quotient corresponds to an equivalent factor in equation 54 of Nagylaki (1978). It is also the inverse of the barrier strength parameter *B* defined by Barton and Bengtsson (1986), which has the dimension of distance.

Equations 2–4 allow for an analytic solution for the probability density with a barrier (Grebenkov *et al.* 2014):(5)where erfc denotes the complementary error function. These expressions are valid for starting positions and their extension to is straightforward by the symmetry The transition density converges to the Gaussian of free Brownian motion for For a two-dimensional diffusion process with a linear barrier, the full transition density is obtained by multiplying the one-dimensional density functions from Equation 2 for movement parallel to the barrier with Equation 5 for movement orthogonal to the barrier.

Forien (2017) showed that the one-dimensional transition density can be also obtained as a scaling limit of a one-dimensional discrete random walk with a partially reflective barrier. This result indicates that the continuous model is also a good approximation for stepping-stone barrier models. We compared the continuous formula to discrete random walk simulations, and there is indeed close agreement (Figure 2).

#### Pairwise coalescent probabilities:

We use the diffusion approximation to model the distribution of coalescence times for pairs of individuals. We approximate the chance of coancestry originating in a small time interval around time *t* ago as the product of the probability of coming close and a rate of local coalescence (Ringbauer *et al.* 2017). The local density describes a rate with which nearby lineages coalesce (Wright 1943). In a stepping-stone model, corresponds to the number of diploid individuals per deme (Barton *et al.* 2002). This approximation ignores that lineages do not move apart again once they have coalesced. An equivalent simplification is made by Barton (2008), and it is accurate as long as coalescence is sufficiently rare (Wilkins 2004). Since the dispersal process is symmetrical in time in our model of constant population density, the chance that two lineages at current position and are close at time *t* back equals the probability density that a single lineage moves from to in time Formally, the probability density of coalescence at time *t* in the past is approximated by(6)Empirical simulations for a stepping-stone model confirm that this model approximates the pairwise coalescence time distribution on recent to intermediate timescales (Supplemental Material, File S1).

#### Identity by descent:

We define identity by descent *F* of two samples at and as the chance that two lineages coalesce before a long-distance migration or, equivalently (but unlikely for SNPs), a mutation event occurs along one of the lineages. This definition is closely related to the widely used fixation index and both definitions agree in the limiting case of an infinite population (see Rousset 2002 for a review). If one assumes that mutation or long-distance migration occurs at a constant rate *μ*, it is straightforward to calculate the probability that two lineages coalesce before a long-distance event:(7)In the absence of a barrier, the probability of identity by descent depends only on the Euclidean distance between two individuals. In this special case, the integral in Equation 7 has an analytical solution, the classical Wright–Malécot formula (Barton *et al.* 2002, 2013):(8)where is the modified Bessel function of the second kind of degree zero. A caveat of this analytical solution is that diverges logarithmically as (Barton *et al.* 2002). Similarly, the integral in Equation 7 diverges for nearby individuals. As for the Wright–Malécot formula, this is caused by a behavior of the diffusion approximation for short timescales, as the chance of two lineages being close diverges as for A solution to circumvent this problem is to start integration at time Here, we choose one generation time, a biologically plausible value. We could not find an analytical solution, but Equation 7 can be numerically integrated.

Our calculations show that if a barrier to gene flow is present, identity is decreased across the barrier, and increased for points on the same side of the barrier (Figure 3). Interestingly, the increase of *F* for a pair of points and on the same side of the barrier equals the decrease of *F* between points and where is the reflection of across the barrier. This symmetry originates from a reflection principle of the underlying random-walk model, as lineages that do not cross the barrier behave as if they were reflected. This symmetry already occurs in the barrier point density function (Equation 5). It implies that, for a complete barrier, identity by descent can increase to at most twice of the value in absence of a barrier, as observed in the equivalent case of a range boundary (Wilkins 2004).

As already noted by Barton (2008), a barrier to gene flow mostly affects recent coancestry, whereas deeper coalescence probabilities are not distorted much (File S1). In particular, weighted mean coalescence times for individuals from the same deme remain completely invariant in deme models with conservative migration (*i.e.*, constant population density), independent of the specific migration scheme (Nagylaki 1998; Laporte and Charlesworth 2002). We used models of this type in our simulations below to approximate our continuous spatial theory. However, this invariance of mean coalescence times is not problematic for our inference method. When calculating allele frequency correlations, pairwise coalescence probabilities are weighted by a factor which accounts for rare, but nonnegligible long-distance migration events. It decays exponentially with time. Therefore, our method is based on a signal of mostly recent coalescence events. Reassuringly, this is also the timescale on which barriers affect pairwise coalescence probabilities the most, and on which the spatial diffusion approximation is likely most accurate (File S1).

#### Rescaling:

Not all parameters in Equation 7 are independent. Consequently, they cannot be estimated separately, as in absence of a barrier (Rousset 1997; Barton *et al.* 2013). Therefore, we replace the four demographic parameters in Equation 6 with three compound parameters neighborhood size [a classical parameter that goes back to Wright (1943)], a scaled barrier parameter (corresponding to the inverse of Barton’s *B*), and a scaled long-distance migration rate (Appendix).

### Fitting the model to data

A typical data set consists of diploid genotypes for a marker *i* and individuals at geographic positions To infer the underlying demographic parameters from observed data, we have to develop a way to fit our model to such data.

In principle, it is straightforward to transform the probability of identity by descent as calculated by our model into expected allele-frequency covariances. Denoting the probability of identity by descent between a pair of samples *k* and *l* by and the population mean allele frequency of a marker *i* by the expected covariance between their genotypes is given byHowever, often mean allele frequencies have to be estimated from the data, and estimating the means of allele frequencies for many markers would lead to overfitting. To circumvent this caveat, we developed and tested different methods to fit identity by descent to genotype data without directly estimating all mean allele frequencies. We included one approach that models individual genotypes as binomial draws from latent allele frequencies modeled by a Gaussian random field (File S2).

#### Fitting pairwise homozygosity:

Of all tested fitting methods, a relatively simple approach that fits the fraction of pairwise homozygosity (defined as the fraction of identical genotypes) has the least bias and sampling variation (File S2). Throughout this work, we use this method for data analysis. In the following we give a brief outline of how we calculate expected homozygosity and how we fit it to data.

The observed average homozygosity for a pair of individuals *k* and *l* with genotypes and at markers can be straightforwardly calculated from the data:(9)Our model predicts the pairwise chance of identity by descent *F*, and these probabilities can be used to calculate the expected values of average homozygosities The expected pairwise homozygosity of a pair of samples *k* and *l* at a marker *i* is given by:The first term gives the probability of having the same genotype due to identity of descent and the second describes the probability of having the same genotype by chance. To avoid estimating all mean allele frequencies, we average over all *n* markers to get the expected fraction of pairwise identical genotypes:(10)Instead of fitting all unknown allele frequencies, now only one additional compound parameter *s* has to be fit in addition to the demographic parameters *θ*. We tried fitting this formula to observed data with a composite likelihood approach. However, we found that minimizing the sum of all squared deviations between the expected and observed pairwise homozygosities(11)gives almost identical results, while also being much faster (File S2).

Fitting pairwise homozygosities can be extended to deme data easily, where nearby individuals have been binned, by plugging deme allele frequencies into Equation 9.

#### Estimation uncertainties:

To learn about estimation uncertainty, we bootstrap over genetic markers. Unlinked markers contain almost independent information because their spatial movements are typically correlated only on very short timescales. Therefore, resampling loci at random is expected to yield accurate empirical confidence intervals.

### Implementation

In brief, our inference scheme runs the following three computational steps for a given set of demographic parameters

Calculate pairwise

*F*for all pairs of samples (integral Equation 7).Use the pairwise

*F*to calculate the expected pairwise homozygosity for all pairs of samples (Equation 10).Calculate the sum of squared differences between the expected and observed pairwise homozygosities (Equation 11).

Our program then finds the parameters that minimize this function by using the Levenberg–Marquardt algorithm, as implemented in the Python package Scipy.

We implemented the described simulation and inference methods mostly in Python. To speed up calculations, we parallelized the calculations for pairwise *F* so that they can be run simultaneously on different central processing units (CPUs). The evaluation of the integrand in Equation 7 is a computational bottleneck. We implemented this calculation in C, to make use of the superior speed of a compiled language.

The inference scheme has to compute the expected identity by state for every pair of samples. It therefore scales quadratically with the number of individuals, as there are such pairs. It has a run time of several hours for an individual/deme number of 1000 when run on a single standard desktop CPU. To produce a sufficient number of replicates and bootstraps, we used a scientific computer cluster at the Institute of Science and Technology Austria. To speed up run time, individuals can be grouped into demes. If clustering is done on small scales for bins at most a few *σ* in diameter, it does not significantly affect the estimation scheme (File S2). In our results, we capped *γ* to a maximum value of 1, as for the effect of a barrier becomes negligible (Figure 2).

### Simulations

We extensively tested our inference scheme on simulated data sets. We used a stepping-stone model with individuals per deme, and we traced ancestry backward in time (Figure 4). Every generation, each individual picks an ancestor at random with probabilities given by a dispersal kernel. Here, we use a discretized Laplace distribution as an axial dispersal kernel. Due to the rapid convergence to the continuous diffusion approximation (Figure 2), the specific choice of dispersal kernel has no significant impact as long as its axial variance remains finite. If two lineages happen to pick the same ancestor, they coalesce into a single ancestral lineage. We simulate long-distance migration events to occur at a constant rate. If they occur, the corresponding lineage picks an allele at random from the population mean allele frequency To model the effects of a barrier, we follow Nagylaki (1988) and realize migration events across the barrier only with relative probability *κ*, the barrier strength parameter. For constant deme sizes, this backward model is equivalent to a forward model in which a large number of gametes disperse with the same dispersal kernel (Nagylaki 1988).

After a preset maximum number of generations, every lineage picks an allele at random according to the mean allele frequency Different, unlinked loci were simulated as independent runs. We picked mean allele frequencies at random according to a predetermined distribution, usually Gaussian with SD around an overall mean of 0.5. We also investigated how robust our inference scheme is to scenarios of secondary contact. We simulated them by assigning each ancestral lineage an allele with probability or according to its location at time of secondary contact.

We stress that the data are simulated under a process very similar to our model. For real data, there could be other deviations which might further reduce the power of the inference scheme. Therefore, our results should be seen as limits for the inference scheme in the case of ideal data.

*A. majus* data

To show the practical utility of our inference scheme, we applied it to data from an *A. majus* hybrid zone. This hybrid zone is located in a valley in the Eastern Pyrenees. It shows a geographically narrow transition between two flower-color morphs, in which a range of hybrid flower-color phenotypes occur. This transition is mainly determined by three major flower-color genotypes that regulate the intensity and patterning of flower color (Whibley *et al.* 2006; Bradley *et al.* 2017). We applied our method to a data set of 12,389 plants collected between 2009 and 2014, which were genotyped for 112 SNP markers.

To satisfy the assumptions of our model as good as possible, we filtered markers based on four quality criteria: minor allele frequency, large-scale geographic correlation, linkage disequilibrium, and deviations from local Hardy–Weinberg equilibrium. SNP design and filtering are explained in detail in File S3. After this data cleaning step, we were left with 60 unlinked polymorphic SNPs that were spaced throughout most of the genome (File S3).

### Data availability

The source code for the implementation of our inference scheme is freely accessible at the Github repository https://github.com/hringbauer/BarrierInferPublic.git.

The *A. majus* data set is a subset of samples collected from 2009 to 2014 with the long-term goal to build a pedigree. The details of this data set and data filtering are described in File S3.

## Results

### Inference on simulated data

We investigated the overall capability of this method to estimate barrier strengths and the accuracy of empirical bootstrap uncertainty estimates. Our tests show that the inference scheme can reliably recover barrier strengths as well as demographic parameters (Figure 5 and Figure 7). Estimates of the neighborhood size are robust, but show a slight upward bias. These slight biases are likely due to the fact that a continuous model is used to fit to discrete simulations. Estimates of the long-distance migration rate *m* are more variable, but are not significantly biased. In all cases, the range of bootstrap estimates mostly overlaps with the true value to the expected degree. This result indicates that bootstrapping gives accurate uncertainty estimates (Figure 7).

The power to infer a barrier depends on several factors, including the particular sampling scheme, the number of sampled loci, various demographic parameters, as well as the strength of the barrier. We simulated several specific scenarios to explore overall patterns. Our results indicate that estimates of the barrier strength *γ* get more variable with increasing *γ* (Figure 5, notice the two different scales). Weak barriers () were sometimes inferred as no barrier at all, and vice versa. In contrast, strong barriers () were mostly also estimated as such. When varying demographic parameters, estimation uncertainty of *γ* grows with increasing long-distance migration rate *m* and neighborhood size (Figure 6). Changing the variance of the mean allele frequencies has comparatively little effect. In addition, we performed power simulations for varying numbers of sampled loci and sampled individuals. They indicate that at least a few dozen biallelic markers and several hundred samples are needed to robustly infer a strong barrier to gene flow (File S2).

#### Secondary contact:

Barriers to gene flow sometimes coincide with areas of secondary contact, for instance in secondary hybrid zones (Barton and Hewitt 1985). Allele frequencies might have diverged before this contact, and present day allele-frequency differences are not caused by the presence of a barrier. However, these clines resemble the effect of a barrier and might be mistakenly inferred as such. One salient way to deal with this problem is to remove markers that show a large-scale geographic structure of allele frequency. One can base inference on a subset of markers that have a similar mean allele frequency across the whole population range and only display fluctuations on small geographic scales, which equilibrate quickly (Barton *et al.* 2013). We tested this approach on simulated data. When applying the inference scheme to a simulated scenario of secondary contact with divergent allele frequencies, it wrongly infers a barrier in case of no filtering (File S2). However, when using the subset of loci that show no large-scale correlation with geography, the false positive signal decreases. Moreover, filtering out loci with a large-scale structure does not remove the signal in case of a true barrier (File S2). However, if sampling is only done on small spatial scales, such filtering could become problematic as one might remove signal from local fluctuations as well. Therefore, we advise to always check that the sampling area is bigger than the spatial scale of isolation-by-distance patterns before any markers are removed.

#### Unknown barrier locations:

Our inference scheme assumes that the location of the putative barrier is known *a priori*. In practice, one might not always have this information or one perhaps wants to test the hypothesis of barriers in different locations. In this case, one can repeatedly apply the inference scheme and fit different potential barrier positions. When testing this approach on simulated data, the inference scheme only inferred a strong barrier near the true position (Figure 7 and Figure 8). The estimated uncertainties on the habitat edges are inflated. This effect is caused by limited power to infer barriers near sampling edges: one needs a sufficient number of samples on both sides of a barrier to fit the strength of a barrier.

### Hybrid-zone analysis

We observed a clear isolation-by-distance pattern across the *Antirrhinum* hybrid zone (Figure 9). On average, mean identity by state for nearby plants is elevated by ∼2% above the background level, and falls away with increasing pairwise distance, most rapidly over the first 2000 m. Our inference scheme fits this pattern well, with an estimated neighborhood size of 188 ( bootstrap confidence interval: Figure 10).

Obviously, there are demographic complications that are not captured by our model, such as heterogeneities in plant distributions and density. The population is not distributed uniformly in two dimensions, as plants are often found in patches of suitable habitat (Figure 10). However, our analysis indicates that isolation-by-distance patterns are neither strongly influenced by the cardinal direction or relative plant positions, nor the geographic location within the hybrid zone (File S3). These observations imply that our model assumptions are not grossly violated.

For putative barriers in the center of the hybrid zone, our inference scheme estimates no barrier (). Bootstrap estimates rarely fall below and all of them are above When testing for barriers toward the flank of the hybrid zone, estimates get more variable. Some estimates in both flanks indicate an intermediate barrier to gene flow, but in each case some of the corresponding bootstrap estimates also take the value of no barrier (). This signal likely reflects the lower power to infer barriers in these regions, as there is a higher sampling density in the center of the hybrid zone ( of the samples originate from within 500 m of the flower-color transition). Only in one case, for the leftmost tested barrier, the bootstrap estimates do not overlap with the value of the null hypothesis (). This area also shows the strongest small-scale isolation-by-distance pattern (File S3), and the density of plants is very low in sampled patches (M. Melo, personal communication). Therefore, a potential explanation for this significant deviation from the null hypothesis of constant demography and no barrier to gene flow is an exceptionally low plant density in this area.

Given the overall good fit of isolation by distance with our inference scheme (Figure 9), our results indicate that there is no strong genome-wide barrier to contemporary gene flow that coincides with the flower-color transition. As such a strong barrier would require many barrier loci spaced densely throughout the genome (Barton and Bengtsson 1986), this result comes as no surprise. At the moment, no other traits apart from flower color are known to be divergent across the hybrid zone, despite much work to detect them (M. Melo, personal communication).

Previous results suggest the presence of a barrier to exchanging flower-color alleles (Whibley *et al.* 2006; Bradley *et al.* 2017) and indicate that selection maintains differences in flower color (Ellis 2016). Therefore, we applied our method to a subset of polymorphic markers in the genetic neighborhood of two genes, *Rosea* and *Eluta*, known to affect flower-color variation in the hybrid zone. However, bootstrap estimates varied widely for all tested barrier locations (results not shown), which indicates that there is not sufficient signal in the data. This lack of power is likely due to the low number of suitable SNP markers without steep allele-frequency clines near this region in our data set (). Simulations confirmed that for such a low number of markers there is not enough power to detect even strong barriers (File S2).

## Discussion

To our knowledge, our scheme is the first method that infers the strength of the barrier based on an explicit spatial population genetic model. There are several similarities with the inference method BEDASSLE (Bradburd *et al.* 2013), which aims to disentangle the effects of geographic isolation by distance and differences in ecological variables. This scheme is not based on an explicit population genetic model, but rather fits the decay of genetic similarity with distance to a heuristic formula. In light of possibly very complex demographic structures, this approach is not necessarily worse than fitting our spatial model. However, this approach does not take the increased covariances on the same side of a barrier into account, and does not make use of some valuable signal because of that. Moreover, it uses an MCMC approach that is based on a model of Gaussian random fields, which is computationally too expensive to apply to hundreds or even thousands of individuals or demes. In contrast, our method is well suited to data sets of this magnitude.

Another widely used method to infer barriers to gene flow, Geneland, clusters individuals using their explicit geographic coordinates (Guillot *et al.* 2005). Safner *et al.* (2011) identified it as one of the most potent methods to infer barriers to gene flow. Therefore, we tested its performance on some of the data sets which we have generated to test our scheme (File S4). As described previously (Guillot and Santos 2009; Safner *et al.* 2011), Geneland’s ability to accurately infer barriers to gene flow decreases when isolation-by-distance patterns are present, as its underlying model assumptions of discrete populations with well-defined allele frequencies are violated. Indeed, it fails to detect a barrier to gene flow in our test data sets, which all exhibit such isolation-by-distance patterns (File S4). In contrast, the method introduced here can give accurate estimates of the barrier strength in these scenarios. It is not confounded by isolation-by-distance patterns; in fact, it relies on the presence of this signal. Our method can therefore be seen as complementary to Geneland.

In contrast to BEDASSLE, Geneland, and most other existing methods, which all heuristically describe the strength of the parameter; the inference scheme introduced here fits an explicit spatial population genetics model. It corresponds to Nagylaki’s *γ* (Nagylaki 1978), whose inverse is equal to Barton’s barrier strength *B* (Barton and Bengtsson 1986). This correspondence makes the inferred barrier strength parameter *γ* interpretable directly in terms of population genetic theory. For instance, the parameter which has a dimension of time, must be large to retard the spread of even a neutral allele (Barton and Bengtsson 1986).

Our method can reliably estimate the presence of strong barrier (), but there is little power to distinguish between a weak barrier and no barrier (Figure 5 and Figure 7). The reason for this is not a shortcoming of our inference scheme, but due to the fact that relatively weak barriers do not significantly affect the spread of ancestry (Figure 2). Therefore, it is infeasible to estimate the strength of weak barriers to gene flow, simply because they do not have a significant effect on allele frequency covariances. This effect was already observed by Barton *et al.* (2013), who found that in two spatial dimensions the effect of a barrier starts to have appreciable effects on the spatial pattern of genetic marker alleles when barrier strength *B* (which corresponds to ) is

The exact power of the inference method depends on a range of factors. As a barrier mostly affects nearby allele-frequency fluctuations, a high sampling density spread evenly on both sides of a putative barrier is ideal (Figure 8). Our results demonstrate that estimates of the barrier strength get more variable for increasing long-distance migration/mutation rate, or increasing neighborhood size (Figure 6). These patterns are not surprising, since the strength of isolation by distance, which constitutes the signal for our inference method, decreases when *m* or Nbh grow. Our method is data hungry as it needs at least a few dozen SNP markers and at least hundreds of sampled individuals (File S2). This wealth of data is required since recent pairwise coalescence events, which constitute the signal for our inference scheme, are rare events in most realistic scenarios. As a rough rule of thumb, our inference scheme can be applied in cases in which there is sufficient power to detect an overall isolation-by-distance pattern.

Any realistic scenario has its own set of parameters and specific sampling scheme. Our power simulations can only test a tiny fraction of possible combinations of these. They helped to elucidate general underlying patterns, but cannot cover all specific cases. Therefore, we recommend doing customized power simulations. Simulating the specific sampling scheme, the marker number together with likely demographic parameters will help to determine whether the inference scheme has sufficient power to detect putative barriers for the specific scenario of interest.

### Outlook

The inference scheme introduced here fits a linear barrier, the most straightforward model for a barrier in two dimensions. We used an analytical formula (Equation 5) to model the spread of ancestry, which in turn allows one to reduce calculations for pairwise *F* to a single numerical integral (Equation 7). However, barriers might be geographically more complex in practice. There could also be multiple barriers in different locations which would be partly indicated by our method, but also invalidate its underlying model of a single barrier. Such more-complicated scenarios will most likely not allow for simple formulas, and calculations for the chances of recent coancestry become much more challenging. One could trace the geographic ancestry distribution back with discretized simulations, and use them to first calculate the expected distributions of recent pairwise coalescence (Equation 6) and consequently identity-by-descent patterns (Equation 7). This salient extension to our model poses a numerical challenge, but it seems to be within reach of present day computational power.

Our method fits allele frequencies that can be confounded by deeper ancestral patterns. By filtering loci that show large-scale geographic variation, one can in principle remove some of this ancestral genetic structure, but one might accidentally remove true signal as well by doing so. This problem can be a severely confounding factor when applying the inference method to scenarios where ancestral structure is present, for instance in zones of putative secondary contact.

One promising way to overcome this problem would be to base inference on identity-by-descent blocks, the direct genetic traces of recent coancestry (Browning and Browning 2012). As blocks of ancestral genetic material are split up at a constant rate by recombination, the probability of sharing a block of length *l* decays exponentially back in time (Ralph and Coop 2013). For example, blocks >5 cM are therefore very unlikely to originate from coancestry older than 100 generations, even under relatively extreme demographic scenarios. Moreover, the length of the blocks contains information about the time of coalescence. Identifying such blocks is a nontrivial task, particularly when only unphased genotype data are available (Browning and Browning 2012), as it requires dense genotype data and linkage information. But in cases where identity-by-descent blocks can be robustly called—as is already possible for humans and some model organisms—an inference scheme based on this signal holds great potential. Our method to model the spread of ancestry can be combined with formulas for block sharing (Ralph and Coop 2013; Ringbauer *et al.* 2017) to calculate the expected number of shared identify-by-descent blocks in presence of a barrier. These results could be used to fit observed block-sharing data.

In summary, our method is only a first step to robustly infer barriers to gene flow from genotype data. The techniques outlined here can be expanded in various directions to better deal with the complexities of real data and to make full use of the opportunities within the era of population genomics. We hope that this will ultimately lead to a better understanding of barriers to gene flow within many natural populations.

## Acknowledgments

We thank Himani Sachdeva and Raphaël Forien for several useful discussions and for comments on previous drafts of this article. We thank Maria Melo for many discussions about the *Antirrhinum* system, as well as members of the Barton and Coen laboratory and many volunteers for organizing and carrying out the field work that gave rise to the data set analyzed here. We thank Jon Wilkins and an anonymous reviewer for their very helpful suggestions on how to improve the manuscript.

## Appendix

Here we give the full formula we fit and describe the rescaling of a set of independent effective parameters. Let denote the *x*-coordinate of the samples and and their separation along each axis. For the identity by state on different sides of the barrier, plugging into Equation 7 gives for pairwise *F*:We now rescale time such that The integral () transforms to:Defining and gives the full formula used for inference:The formula for the same sides of the barrier is rescaled analogously. The additional term in the integrand becomes:

## Footnotes

Supplemental material is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.117.300638/-/DC1.

*Communicating editor: J. Wakeley*

- Received October 19, 2017.
- Accepted December 23, 2017.

- Copyright © 2018 by the Genetics Society of America