## Abstract

We look at how to choose genetic distance so as to maximize the power of detecting spatial structure. We answer this question through analyzing two population genetic models that allow for a spatially structured population in a continuous habitat. These models, like most that incorporate spatial structure, can be characterized by a separation of timescales: the history of the sample can be split into a scattering and a collecting phase, and it is only during the scattering phase that the spatial locations of the sample affect the coalescence times. Our results suggest that the optimal choice of genetic distance is based upon splitting a DNA sequence into segments and counting the number of segments at which two sequences differ. The size of these segments depends on the length of the scattering phase for the population genetic model.

WE consider the problem of learning about spatial structure from population genetic data. We focus on the situation where we have both genetic and spatial data from a random sample of individuals from a population in a continuous habitat. The spatial information relates to the sampling location of the individuals, and the genetic information will be the genetic type of those individuals at a series of loci. From this data we want to answer questions such as whether there is spatial structure within the population (as opposed to the data being consistent with a panmictic population), and if so to quantify features of how this structure affects the genetic diversity of the population.

A simple, but commonly used, approach to answering whether there is spatial structure is to look for correlation between the spatial and genetic distance between two individuals from the population. This can be calculated by considering all pairs of individuals within the data set, calculating the correlation between the set of paired spatial and genetic distances, and then assessing the significance of any observed correlation through a permutation test (Sokal and Oden 1978; Shimatani and Takahashi 2003). This idea can be extended to look at the relationship of spatial separation on genetic difference by plotting a smoothed estimate of how genetic distance varies with spatial separation for the pairs of individuals within the data set (see, *e.g.*, Shimatani and Takahashi 2003; French *et al.* 2005).

However, to implement these approaches requires the definition of spatial and genetic distance for a pair of individuals. Often Euclidean distance is a natural choice for spatial distance. However, there can be multiple possible choices of genetic distance, and in some situations the choice of distance can affect the results of the subsequent analysis (Shimatani and Takahashi 2003).

As a motivating example, consider the study of *Campylobacter jejuni* in French *et al.* (2005). Here the genetic data for each *C. jejuni* isolate consisted of multilocus sequence types (MLSTs). An MLST records the DNA sequence of the isolate at ≈500-bp fragments of seven housekeeping genes that are roughly evenly spread around the genome. If we consider the data from two isolates at a single gene, then two natural measures of genetic distance are (i) the number of polymorphic differences between the two sequences and (ii) whether or not the sequences are identical. There are also alternative measures of distance that could be considered (see methods). A natural and important question is which choice of distance is best in terms of detecting and learning about the effect of any spatial structure on genetic diversity.

We investigate this question via analysis of two spatial population genetic models (see methods). Both models assume a population that exists in a continuous habitat and that the spatial location of an offspring is centered around the location of its parent. Both models apply only to nonrecombining loci, and thus we focus on the choice of genetic distance for a single nonrecombining locus. (We are unaware of appropriate spatial-genetic models that incorporate recombination.)

## METHODS

#### Spatial-genetic models:

Our results are based on two population genetic models for continuous spatial habitats, also known as isolation-by-distance (IBD) models. The first assumes complete density regulation: the population density is constant through space and time. This model can be constructed as the limit of a two-dimensional stepping-stone model as the number of demes tends to infinity. This model has been analyzed by Maruyama (1971), Malécot (1975), Barton and Wilson (1995, 1996), and Barton *et al.* (2002) among others. However, here we use the simulation method and analytic approximations of Wilkins (2004), and throughout this article we call this model the Wilkins' IBD model.

The second model is based on the isolation-by-distance model of Wright (1943). We call this Wright's IBD model. This model has no density regulation, which has the disadvantage that it produces infinite clumping of the population (Felsenstein 1975).

As we are interested in the property of estimators that use the genetic and spatial information on pairs of chromosomes, we consider samples of size 2 from these models. We consider a single nonrecombining locus and assume this locus consists of *L* sites, with two alleles at each site. We further assume the same mutation rate at each site and parameterize the mutation rate in terms of a scaled rate per site θ = 2*N*_{e}*u*, where *N*_{e} is the effective (haploid) population size and *u* is the per generation mutation rate for the locus. The effective population size is defined so that the mean number of mutations in the locus that separates a randomly sampled pair of haploid individuals is *L*θ.

#### Wilkins' isolation-by-distance model:

We consider a haploid population inhabiting a square habitat [0, 10] × [0, 10]. The model is parameterized in terms of a population density, ρ, and a dispersion parameter σ^{2}. A simple description of the ancestral process for this model is as described below (see Wilkins and Wakeley 2002 and Wilkins 2004 for fuller details). Note that this model is equivalent to one for a habitat [0, 10/*c*] × [0, 10/*c*] with population density *c*^{2}ρ and dispersal rate σ^{2}/*c*^{2} for any *c* > 0.

We consider a sample taken from known locations. We can then trace the ancestry of our sample back in time. At any time in the past this ancestry will consist of a number of lineages, which correspond to the unique descendants of the population at that time. The position of a lineage undergoes a two-dimensional symmetric Gaussian random walk, with variance σ^{2} in each direction. (We assume reflecting boundaries at the edge of the habitat.) Two lineages coalesce (share a common ancestor) if the lineages fall within an area containing a single individual (which is of size 1/ρ).

Wilkins (2004) shows that qualitatively the genealogy from this model can be split into two phases, known as the “scattering” and the “collecting” phase (this terminology was first used in Wakeley 1999). The scattering phase is the initial phase of the genealogy and corresponds to the period of time that the coalescence times depend on the sampling locations. This is then followed by the collecting phase, when coalescences are independent of the sampling locations and the genealogy can be closely approximated by Kingman's coalescent (Kingman 1982).

During the collecting phase, the distribution of the genealogy is described by a single parameter: the effective population size *N*_{e}. This governs the rate of coalescence of a pair of lineages (which is 1/*N*_{e}). Wilkins (2004) gives various approximations for *N*_{e} in terms of the parameters of the model, and this can also be estimated through simulation. Within the scattering phase, the distribution of the coalescent time for a pair of individuals sampled at *x*_{1} and *x*_{2}, respectively, depends on the scaled distance ‖*x*_{1} − *x*_{2}‖/σ, where ‖ · ‖ is the standard Euclidean distance.

To show this we plotted the hazard function of the coalescent time distribution for a range of distances between the sampled individuals. The hazard function of a random variable *T* is defined as Pr(*T* = *t*)/Pr(*T* ≥ *t*). Under a panmictic population model, the hazard function of the coalescent time would be constant through time and equal to 1/*N*_{e}. Figure 1a shows the hazard functions we obtained, and we see that these tend to a constant value of ∼1/*N*_{e} regardless of the position of the sample. In this case convergence occurs at around the 1000th generation. Prior to this time, we note quite different behavior in the hazard functions.

A further important parameter of the model is the time at which the scattering phase ends and the collecting phase starts, which we call *T*_{c}. Wilkins (2004) gives ways of calculating this, although we have resorted to using illustrations such as Figure 1 to estimate an appropriate value. (In practice this time is not clearly defined, and rough estimates, such as the value of 1000 generations for Figure 1a, are sufficient for our needs.)

We considered a range of parameter values for the results we present here. In each case we calculated the distribution of the coalescence time for a sample of two individuals. We examined this both using the analytic approximation of Wilkins (2004) and through simulation using the tracker program (http://www.santafe.edu/∼wilkins/software.html).

In all cases we sampled individuals from close to the center of the habitat, to avoid any edge effects of the model.

#### Wright's isolation-by-distance model:

This is also a model for a haploid population. We consider a slight generalization of the IBD model of Wright (1943).

We consider a random sample from a structured population. By random, we mean that the probability of an individual being sampled does not depend on its genetic type. We do allow the sampling to depend on the location of the individuals and calculate the distribution of the coalescence time of a pair of individuals conditional on their sampling locations. To calculate this conditional distribution we first need to consider the unconditional distribution of the coalescence time and the distribution of the spatial locations given the coalescence time.

Forward in time, the model assumes a fixed population size evolving over discrete generations. Ignoring the spatial information, the evolution of the population is given by the Wright–Fisher model: each descendant in the next generation “chooses” its parent independently and uniformly from the individuals in the current generation. Conditional on this, the location of the descendant is a small perturbation of the location of its parent. By considering a suitable limit as the population size tends to infinity we have a model where, marginal to the spatial information, the genealogy of a sample is given by Kingman's coalescent (Kingman 1982). Conditional on the genealogy and the location of the most recent common ancestor (MRCA) of the sample, the location of the sampled individuals is obtained by simulating Brownian motion along the branches of the genealogy. Thus if we consider a branch in the genealogy of length *t*, if the parental node is at location *y*, then the location of the daughter node (or tip) is drawn from a transition density *p*_{t}(**x** | **y**), where *p*_{t}(**x** | **y**) is a (two-dimensional) Gaussian distribution with mean *y* and variance *t*σ^{2} in each direction.

We consider the special case of a model for a population on a closed habitat. We again choose the habitat to be [0, 10] × [0, 10]. For this model the transition density, *p*_{t}(*x* | *y*), is defined to be that of two-dimensional Brownian motion constrained to the habitat. So if **x** = (*x*_{1}, *x*_{2}) and **y** = (*y*_{1}, *y*_{2})where is the density of the Gaussian random variable with mean μ and variance σ^{2}. The infinite sum in this expression is to allow for the reflecting boundary of the habitat (see the Appendix of Wilkins 2004). This model is chosen to have the same spatial dynamics as Wilkins' IBD model. For this model, the distribution of the location of the MRCA is given by the stationary distribution of the transition density *p*_{t}(*x* | *y*), which is just uniform on the habitat.

Intuitively this model behaves similarly to a neutral coalescent model, where we treat the location of the individual in the same way as a genetic locus. In this case the “type” of the individual at this locus lies in [0, 10] × [0, 10], and the “mutation” process is Brownian motion with reflecting boundaries. The type of the MRCA is drawn from the stationary distribution of this mutation process.

Now we can calculate the conditional distribution of the coalescence time of a sample of size 2 given their sampling locations. Consider a sample taken from specified locations, **x** and **y**, say. For a population of effective population size *N*_{e} the conditional distribution of the number of generations until the sample has a common ancestor is given, using Bayes' formula, byHere the first term comes from the exponential prior distribution of the coalescence time, and the second term is the conditional distribution of the location of the sample given the coalescence time. (This simplifies due to the reversibility of the dispersal process.) Note that here we have defined time in terms of generations, so σ^{2} for our model is defined in terms of the variance of the dispersal over one generation.

Plots of the hazard function of the coalescence times for this model again show a separation of timescales effect (Figure 1b). Again it will be appropriate to summarize the model by two parameters: the effective sample size, *N*_{e}, and the time at which the collecting phase starts (the hazard rate is ∼1/*N*_{e}), *T*_{c}. For this model the time depends only on σ and the habitat size. For our habitat *T*_{c} ≈ 10/σ^{2}.

#### Genetic distances:

The two most natural measures of genetic distance between a pair of sequences at a locus are (i) the number of segregating sites, which we call *d _{L}*, and (ii) whether or not the sequences at the locus are the same, which we call

*d*

_{1}. Note that

*d*

_{1}=

*I*(

*d*> 0).

_{L}We can obtain a range of measurements in between these extremes by considering segments made up of subsets of the *L* sites at the locus. Consider such a set of *l* such segments, each of which consists of *L*/*l* sites. (The natural definition would be to consider the first segment to consist of the first *L*/*l* sites, the second the next *L*/*l* sites, and so on.) Now define the genetic distance to be the number of these segments at which the two sequences differ. We define this to be *d _{l}*.

#### Power to detect spatial structure:

Consider a choice of genetic distance. Let be the expectation of this distance under a panmictic population model and μ(*x*) and *V*(*x*) be the expectation and variance of this under a spatial model for samples chosen at distance *x* from each other. Then we (indirectly) measure the power to detect spatial structure through(1)This is a natural measure, as it directly relates to the noncentrality parameter of a chi-square statistic to detect whether .

## RESULTS

We first examined the distribution of the number of segregating sites in a sample of size 2. We simulated data assuming a biallelic mutation model at each site, with mutation rate θ = 2*N*_{e}*u* = 0.01. Thus for a locus consisting of 500 bases we would expect around five segregating sites in a sample of two chromosomes, which is consistent with *C. jejuni* MLST data. (Fearnhead *et al.* 2005).

In Figure 2 we plot the hazard function for the number of segregating sites for Wilkins' IBD model. We consider a range of demographic parameters and three sizes of locus: 500, 2500, and 12,500 bp. If *S* is the number of segregating sites then the hazard function evaluated at the value *s* is defined as Pr(*S* = *s*)/Pr(*S* ≥ *s*). We plot this as under a panmictic model: the hazard function is constant and thus it highlights deviation from the panmictic model.

Each plot shows the hazard for four different degrees of separation of the sample chromosomes (for full details see Figure 2 legend). The common feature of the plots is that for the 500-bp locus, the only notable difference in the hazard is for 0 segregating sites, with greater probability for closely sampled pairs of chromosomes. As the size of the locus increases (and with it the mutation rate of the locus) we observe differences in the hazard for other numbers of segregating sites.

The importance of this result is that for small loci the only information about the spatial model will be found in the proportion of pairs of identical chromosomes at different distances. (Conditional on *S* > 0 the hazards are almost identical for different spatial separations, so there is almost no information in the conditional distribution of the number of *S* given *S* > 0.) Thus measuring genetic distance via whether two chromosomes are identical at that locus will be optimal. However, for larger loci there is likely to be information over and above whether two chromosomes are identical, and in these situations other genetic distances may perform better at detecting spatial structure.

The reason for this dependence on locus size is related to the separation of timescales property of spatial models. It is only during the scattering phase (up to time *T*_{c}, see methods) that the hazard of coalescence times depends on the spatial separation of the chromosomes. This difference will manifest itself in the hazard only for small numbers of segregating sites. For two lineages that coalesce prior to time *T*_{c}, the expected number of mutations will be bounded by 2*T*_{c}*Lu* = *L*θ*T*_{c}/*N*_{e}, and only when this is of the order of ≥1 will one observe a notable difference in the hazard function at ≥1 segregating sites. For the columns of Figure 2, *T*_{c}/*N*_{e} ≈ 0.04, 0.08, 0.04, and 0.01, respectively. Thus *L*θ*T*_{c}/*N*_{e} < 1 for *L* = 500 in all cases and also for *L* = 2500 for the rightmost column.

Note that the results depend primarily on the value of *L*θ, and changing *L* and θ while keeping this product the same has little effect on the results (at least while θ ≪ 1). Furthermore the patterns we observe in Figure 2 are representative of a range of choices of the demographic parameters (results not shown). Note that for Wilkins' IBD model there is a limit on the range of values of *T*_{c}/*N*_{e} that can be obtained (due to their codependence on σ), and the choices given in Figure 2 show models with a reasonable spread over the possible values of *T*_{c}/*N*_{e}.

Similar qualitative results are obtained for Wright's IBD model (see first three columns of Figure 3). The first three columns of Figure 3 have identical *T*_{c} and *N*_{e} values to the first three columns of Figure 2, and we obtain qualitatively very similar results. For the rightmost column we have chosen *N*_{e} to be sufficiently small so that *T*_{c}/*N*_{e} = 0.2 and in this case we observe differences in the hazard for *S* = 1 even when *L* = 500.

Second, we looked at the effect of different measures of genetic distance on the power to detect spatial structure. We did this through fixing a spatial separation *x* and calculating the normalized distance *D*(*x*) (see Equation 1 in methods). Larger values of *D*(*x*) correspond to greater power. The top two rows of Figure 4 correspond to four different spatial models, for *x* = 0 and *x* = 5σ. There are similar patterns for the different values of *x* for a given model—the main difference is that the power to detect variation from a panmictic model increases as *x* decreases. This pattern is observed over a greater range of *x* values (results not shown).

For each plot in the top two rows of Figure 4, we plot the power for four different choices of genetic distance as a function of the size of the locus being analyzed. These genetic distances include the two extremes, namely the number of segregating sites (denoted *d _{L}*, see methods) and whether or not the two sequences are identical (

*d*

_{1}). The two further distances are based on splitting the locus into two or three segments and counting the number of segments at which the two sequences differ (

*d*

_{2}and

*d*

_{3}, respectively). The optimal choice of distance varies both with size of locus and with the spatial model.

To investigate this further, the bottom row of Figure 4 shows equivalent results in the *x* = 0 case, but now each curve is based on splitting the locus into segments of different size. We see that for each scenario there appears to be an optimal size of segment; and now increasing the size of the locus has little effect, with the *D*(*x*) values converging to a fixed value for each choice of size of segment.

The optimal size of segment varies between spatial models, and the reason for this is the different *T*_{c}/*N*_{e} values for these models. Choosing different-sized segments is equivalent to looking at differences in the distribution of coalescent events over different timescales. Very large segments correspond to focusing on differences over very short timescales, whereas small segments focus on differences over much larger timescales. Thus models with small *T*_{c}/*N*_{e} values should use large segments and vice versa. As an approximate rule, the optimal segment size has population-scaled mutation rate θ_{s} where θ_{s}*T*_{c}/*N*_{e} ≈ 1. This can be seen from the bottom row of plots in Figure 4. The *N*_{e}/*T*_{c} values are ∼28, 100, 28, and 5 for the four models. Over the values of θ_{s} considered, the optimum appears to be 100, 20, and 5 for the last three models, with θ_{s} = 20 and θ_{s} = 50 giving very similar results for the first model.

## DISCUSSION

We have looked at two spatial population genetic models to give us insight into the choice of genetic distance for detecting spatial structure in population genetic data. The models we considered differed in their assumptions of density regulation and they each take one of the two extreme possibilities. Wilkins' IBD model assumes complete density regulation, whereas Wright's IBD model assumes no density regulation. Both models are unrealistic in real life (in particular Wright's IBD model leads to infinite population density) but the similarity in the results we obtain for both cases suggests that our results are informative about more realistic scenarios that lie between these two extremes.

The qualitative features of our results can be traced to the separation-of-timescales properties of these models, namely that there is an often short scattering period where the spatial location of the samples affects the genealogy, and that this is then followed by a collecting phase where the initial locations have no effect on the genealogy. The choice of distance should be based upon looking for differences across segments of DNA, where the size of the segment is chosen so that there will be of the order of 1 mutation expected between two haploid individuals that coalesce within the scattering phase. The fact that many spatial models (Wilkins and Wakeley 2002; Wilkins 2004; Slade and Wakeley 2005) can be described via a separation of timescales suggests that this guideline will apply quite generally.

One difficulty with applying this result is knowing or inferring the length of the scattering phase, *T*_{c} (or the ratio *T*_{c}/*N*_{e}). All we can do is draw some general conclusions. For models with strong density regulation, such as Wilkins' IBD model, there are constraints on the values that *T*_{c}/*N*_{e} can take, and the results that we presented are representative of the results obtained for a range of different parameter values for the model. These suggest the segment you choose should have a high mutation rate—of the order of 20–50. (It is possible to choose parameter values that require larger segments, but not smaller ones.) So, for example, for the *C. jejuni* MLST data the mutation rate is of the order of 5–10 for a gene fragment. Thus the optimal choice of genetic distance will be to look at whether or not sequences for a gene fragment are identical. Note also that some spatial models are equivalent to *T*_{c} ≈ 0 (Slade and Wakeley 2005), in which case measuring genetic distance by whether or not the complete DNA sequence at a locus is identical would be best. Nonequilibrium changes in population structure, such as range expansion or contraction, will affect the values that *T*_{c}/*N*_{e} can take. Historic contraction would increase the value *T*_{c}/*N*_{e}, which would result in segments with lower mutation rates being preferred.

In our study we have ignored recombination; this is due to the difficulty with analyzing or simulating from spatial-genetic models that include recombination. However, the results we present may be robust to situations where there is some recombination. The reason for this is that in looking for spatial structure, we are looking for differences in the distribution of coalescence times within the scattering phase, up to *T*_{c}. Thus it is only recombination events that occur before *T*_{c} that will affect our conclusions. As *T*_{c}/*N*_{e} is often small, the probability of such recombination events will be small except for large or highly recombinant loci. In particular, for genetic loci that do not include a recombination hotspot (Mcvean *et al.* 2004), the effect of recombination may be small.

One final conclusion from our work is that, as spatial structure affects the genealogy of a sample only during the generally short scattering phase, large amounts of genetic data may be needed to make strong conclusions about the presence and effect of spatial structure on genetic variation. In particular, it is likely to be beneficial to type fewer individuals over more of the genome.

## Acknowledgments

I thank Jon Wilkins for help with using tracker. This work was funded by Engineering and Physical Research Council grant no. EP/C531558.

## Footnotes

Communicating editor: J. Wakeley

- Received February 23, 2007.
- Accepted June 29, 2007.

- Copyright © 2007 by the Genetics Society of America