- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Wilkins, J. F.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Wilkins, J. F.
Genetics, Vol. 168, 2227-2244, December 2004, Copyright © 2004
doi:10.1534/genetics.103.022830
A Separation-of-Timescales Approach to the Coalescent in a Continuous Population
Jon F. Wilkins1
Society of Fellows, Harvard University, Cambridge, Massachusetts 02138
1 Address for correspondence: Bauer Center for Genomics Research, 7 Divinity Ave., Harvard University, Cambridge, MA 02138.
E-mail: jwilkins{at}cgr.harvard.edu
>ABSTRACT
THE MODEL
THE SCATTERING PHASE
THE COLLECTING PHASE
EFFECTIVE POPULATION SIZE
THE COALESCENCE PROBABILITY...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
This article presents an analysis of a model of isolation by distance in a continuous, two-dimensional habitat. An approximate expression is derived for the distribution of coalescence times for a pair of sequences sampled from specific locations in a rectangular habitat. Results are qualitatively similar to previous analyses of isolation by distance, but account explicitly for the location of samples relative to the habitat boundaries. A separation-of-timescales approach takes advantage of the fact that the sampling locations affect only the recent coalescent behavior. When the population size is larger than the number of generations required for a lineage to cross the habitat range, the long-term genealogical process is reasonably well described by Kingman's coalescent with time rescaled by the effective population size. This long-term effective population size is affected by the local dispersal behavior as well as the geometry of the habitat. When the population size is smaller than the time required to cross the habitat, deep branches in the genealogy are longer than would be expected under the standard neutral coalescent, similar to the pattern expected for a panmictic population whose population size was larger in the past.
NATURAL populations are often geographically structured. That is, there is frequently a correlation between the geographic locations of parents and their offspring, resulting in the accumulation of genetic differences between local populations. The connection between genealogies and geography has long been a subject of interest in population genetics. In particular, many efforts have been made to construct methods for making geographic and demographic inferences about populations from patterns of genetic diversity.
The best-studied models of geographic population structure assume that the population is divided into a number of subpopulations, or demes. In the "island model" (WRIGHT 1931; MARUYAMA 1970a), migration between demes occurs through a common migrant pool. Because the destination of a migrant is independent of its origin, this model lacks explicit geography. Geographically explicit models are typically some variant of the stepping-stone model (KIMURA and WEISS 1964). In these models, demes are arrayed in a lattice, with migration limited to adjacent demes, or biased in favor of nearby demes. Models of continuous habitats, in which geographic structure is more commonly referred to as "isolation by distance" (WRIGHT 1943), are in some sense a special case of a stepping-stone model. If we enforce strict density regulation and let the deme size become small, the stepping stone becomes a lattice model, with one individual occupying each position in the lattice. Models of this sort have been developed by MALéCOT (1968) and WEISS and KIMURA (1965) and specifically in the case of a two-dimensional habitat of finite extent by MARUYAMA (1971) and MALéCOT (1975).
Most analyses of stepping-stone models have relied on one of two types of assumption to secure mathematical tractability. One class of models (the infinite stepping stone) assumes an infinite array of demes (WEISS and KIMURA 1965; NAGYLAKI 1974a; SAWYER 1976, 1977). These models are useful for considering the short-term relationship between genetic and geographic distance, but on longer timescales they predict infinite divergence between individuals (SAWYER 1976; GRIFFITHS 1981; WILKINSON-HERBOTS 1998). The other class of models invokes periodic boundary conditions (MARUYAMA 1970b, 1971; NAGYLAKI 1974b, 1977; STROBECK 1987; SLATKIN 1991). In these models, the ends of the array of demes are joined together to form the "circular stepping stone" in one dimension. In two dimensions, the array of demes takes on the shape of a torus. Other analyses have undertaken the problem of a finite array of demes and the effect of range boundaries on the genetic structure of populations (MARUYAMA 1970c,d, 1972; FLEMING and SU 1974; MALéCOT 1975; NAGYLAKI and BARCILON 1988). The general result of these analyses is that proximity to the edge of the deme array increases probabilities of identity and the covariance in gene frequency across demes.
Much of the recent work on geographic inference is based on the coalescent (KINGMAN 1982a,b; HUDSON 1983; TAJIMA 1983). Coalescent models focus on the history of a particular sample, rather than on the population as a whole. In this framework, models of population structure are analyzed to make probabilistic statements about the times to common ancestry for samples drawn from specific geographic locations. Observed patterns of genetic diversity, then, can be used to infer parameter values (e.g., migration rate, population size) that best describe particular populations within a given model. In the context of geographically structured populations, this approach has been formalized as the structured coalescent (NOTOHARA 1990; WILKINSON-HERBOTS 1998). Analysis of genealogical processes in a continuous habitat faces a particular challenge, pointed out by FELSENSTEIN (1975). If not all organisms reproduce, and offspring appear in locations close to where their parents were, the population becomes clumped. However, most analyses have assumed a uniform population density that is inconsistent with this mode of reproduction. A common method for addressing this inconsistency (and the one that is followed in this analysis) is to assume strong local density regulation that forcibly maintains the uniform population density. A fully occupied lattice model represents an extreme form of local density regulation.
Traditional (precoalescent) population genetics models have focused on measures such as the probability of identity and the covariance in gene frequency. These measures are closely tied to the distribution of coalescence times (SLATKIN 1991); however, the formulation of the problem within the coalescent framework provides results that make greater use of the information in DNA sequence data. The question of how best to infer migration patterns in a two-dimensional habitat has been considered by a number of authors (SLATKIN and BARTON 1989; SLATKIN and MADDISON 1990; BARTON and WILSON 1996; ROUSSET 1997). The effects of habitat range boundaries on coalescence times have been investigated in the one-dimensional stepping-stone model by HEY (1991) and HERBOTS (1994)(pp. 66 and 145146), who found shorter coalescence times nearer to the edges of the array, consistent with the effects on probability of identity and covariance in gene frequency. The analysis of WILKINS and WAKELEY (2002) in the continuous one-dimensional model found a similar relationship between sampling location and coalescence time.
The goal of the work described in this article is to derive explicit expressions for the distribution of coalescence times for pairs of sequences sampled from specific locations within a finite two-dimensional habitat. Geographic structuring arises in this model as a result of limited intergenerational dispersal, or gene flow. In terms of the coalescent process, we can imagine tracing the locations of the ancestors of sequences in our sample backward in time. The locations of these ancestral lineages are modeled as a random walk in the habitat, where the size of the steps is determined by the rate of gene flow. Two lineages coalesce if they are derived from the same individual in a given generation. Under strict density regulation, this is equivalent to saying that the lineages coalesce if they approach the same location at the same time. The proximity within which two lineages must approach each other to coalesce is determined by the population density. To construct the distribution of coalescence times, we need to consider the locations of the two lineages back through time, conditional on their not yet having coalesced. Because of this conditioning, the two random walks are not independent of each other. The dependence of each random walk on the location of the other lineage is the source of the difficulty of obtaining mathematical expressions for this process.
The problem of coalescence times was treated originally under different terms by WRIGHT (1943), but has been addressed most extensively by BARTON and WILSON (1995)(1996). Their work derives recursion equations for the probability of coalescence for pairs of sequences. In the recent past, coalescence probabilities are determined only by local dynamics. However, for deeper portions of the genealogy, coalescence probabilities are influenced by the fact that populations inhabit finite ranges. Barton and Wilson present an expression for their solution for the toroidal habitat and indicate how an expression can be derived under reflecting boundary conditions, where the habitat would be rectangular. More recent work derives recursion equations in a continuous habitat for different degrees of local density regulation (BARTON et al. 2002).
The challenge of using genealogical data to infer geographic structure lies both in the fact that those analytic expressions that can be derived are often unwieldy and in the fact that only a fraction of the genealogical history contains information relevant to geographic structure. As Barton and Wilson point out, the deeper branches in genealogical trees will often represent lineages that have crossed the species range multiple times. The fact that geographically relevant information is restricted to recent parts of the genealogy is the basis of the success of rare allele methods of estimating gene flow (SLATKIN 1985). Other analyses of models of gene flow have also found deep branches in the genealogies to be uninformative (SLATKIN and MADDISON 1990). Recent simulation work on genealogical structures in continuous populations (IRWIN 2002) illustrates the poor correspondence between processes occurring at the geographic level and the deeper branches of genealogies. Under higher migration rates, deep genealogical divisions show little correlation with geography. Under low migration, these deep divisions can suggest specific barriers to gene flow where none actually exist.
A number of models of population structure have recently been studied with some success using a separation-of-timescales approach. These analyses treat the coalescent process in the recent past as a function of the details of the population structure and the sampling scheme. In the more distant past, the genealogical process is assumed to be independent of the sampling scheme. WAKELEY (1999) has referred to these two processes as the "scattering phase" and the "collecting phase," respectively, and has successfully used this method to analyze island-type models with large numbers of demes (WAKELEY 1998, 1999, 2000, 2001; WAKELEY and ALIACAR 2001; WAKELEY and LESSARD 2004). Similar methods have been used to describe the coalescent process in plants with selfing (NORDBORG 1997, 2000; NORDBORG and DONNELLY 1997; MöHLE 1998).
Separation of timescales is most powerful when the scattering phase is very short compared with the collecting phase, so that it may be treated as essentially instantaneous. However, even when these conditions do not hold, the technique may permit the generation of tractable, if approximate, expressions to describe genealogies. A further affordance of the separation-of-timescales approach accrues if the collecting phase can be described using the equations developed for the standard neutral coalescent model (SNCM; KINGMAN 1982a,b; HUDSON 1983; TAJIMA 1983).
Coalescent simulations (presented below) indicate that a range of conditions exists over which the genealogical process in a continuous two-dimensional habitat converges approximately on the SNCM. That is, the long-term genealogical behavior can be described by some effective population size Ne that is independent of the initial sampling scheme. It is possible to construct a separation-of-timescales-based description of the genealogical process by combining this description of the long-term behavior with a location-dependent description of the short-term coalescence probabilities. This approach results in explicit expressions for the distribution of coalescent times for a pair of sequences. This distribution is a function of the sampling locations, dispersal rate, population density, and habitat geometry.
ABSTRACT
>THE MODEL
THE SCATTERING PHASE
THE COLLECTING PHASE
EFFECTIVE POPULATION SIZE
THE COALESCENCE PROBABILITY...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
L2, although some consideration is also given to a torus that is the direct product of two circles of lengths L1 and L2, where once again L1
L2. These two geometries can also be thought of as two different treatments of the boundary conditions on the same L1 x L2 rectangle. In both cases, the purpose of the boundary conditions is to retain migrating lineages within the habitat. In the "rectangular" case, reflecting boundary conditions are assumed. That is, if migration would have carried a lineage a certain distance beyond the habitat boundary, that lineage moves to a location an equally far distance from the boundary, but inside the habitat, as if the lineage had bounced off the boundary. In the toroidal case, periodic boundary conditions are assumed. A lineage exiting the habitat in this case would reenter through the boundary at the opposite end of the rectangle. These two formulations are illustrated in Figure 1.
|
The population density
is equal to N/A. Population density regulation is absolute, so the structure is that of a lattice model, with exactly one individual occupying each point on the lattice every generation. This is equivalent to the lattice model investigated by SLATKIN and MADDISON (1990) and by BARTON and WILSON (1995)(1996) for absolute local density regulation. Generations are nonoverlapping, with each individual producing a large number of propagules, which are dispersed according to a bivariate normal distribution with variance
2 in each direction. A large number of propagules thus arrive at each point in the lattice in every generation, and one of those propagules is chosen at random to produce the adult at that location.
In the coalescent process, samples are taken from particular geographic locations. The locations of lineages corresponding to the ancestors of those samples are modeled as a random walk, where each step is drawn from a two-dimensional normal distribution with variance
2 in each direction. The probability that the two lineages coalesce in the previous generation is a function of the distance between them. BARTON et al. (2002) show that the effective density is inversely proportional to the integral of this probability over distance. Under the assumptions of Gaussian dispersal and perfect density regulation, the effective density is equal to the actual density
, and coalescence can be assumed to occur when the two lineages simultaneously fall within an area occupied by only one individual (1/
). Under these conditions, the neighborhood size, Nb, is one over the probability that two lineages starting from the same location both fall within an area of 1/
after each has taken one step in its random walk. As Nb becomes large, its value approaches 4

2 (WRIGHT 1943).
In the recent past (the scattering phase), the distribution of coalescence times for a pair of sequences depends on their relative sampling locations, the neighborhood size, and the location of nearby habitat boundaries. In the more distant past (the collecting phase), the coalescent process becomes independent of the original sampling scheme and depends only on properties of the population as a whole. For certain parameter values, the collecting phase is reasonably approximated by the standard equations for the coalescent process under the SNCM. In these cases, the long-term effective population size depends on the local dispersal behavior and the habitat geometry. For clarity, only the major results are presented in the text, along with comparisons of the analytic results to simulations. Derivations and more technical discussion have been relegated to the APPENDIX.
ABSTRACT
THE MODEL
>THE SCATTERING PHASE
THE COLLECTING PHASE
EFFECTIVE POPULATION SIZE
THE COALESCENCE PROBABILITY...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
![]() | (1) |
Equation 1 is in the form of a recursion relationship. That is, to calculate the probability of coalescence in generation t, it is first necessary to calculate the probabilities of coalescence in generations 1 through t 1, which becomes unwieldy for large values of t. An approximate expression for this probability, which does not require recursive calculation, is derived in the APPENDIX. The corresponding equation for the cumulative probability distribution (the probability that two sequences sampled from the same location share a common ancestor no more than T generations in the past) is given by Equation A14 and has the following form:
![]() | (2) |
In Equation 2,
1 is Euler's gamma (
0.5772). Approximate values for the subsequent
i terms are given by expression (A12) in the APPENDIX. This expression becomes less accurate as the neighborhood size decreases. In particular, it is not valid if Nb < Log(T). The complexity of these equations arises from the nonindependence of the two random walks. Specifically, if the two lineages have not yet coalesced by a particular time T, the distance between them will, on average, be greater than would be expected from two independent random walks. This interference between lineages is, ultimately, the source of the elevation in effective population size seen in geographically structured populations.
To evaluate the range of parameter values over which the approximations invoked in this analysis are valid, the results of Equation 2 were compared with simulations. Figure 2 presents the cumulative density function (CDF) for coalescence of a pair of sequences sampled from adjacent lattice points in an unbounded habitat. Data are presented for several different values of the neighborhood size. In each case, simulation results are presented along with the distribution described by the equations presented in this article. These results show that the method fails as the neighborhood size becomes small (<20), even at smaller values of T. The probability of identity for a pair of samples from the same location was also calculated for a number of different parameter combinations by taking the sum of (F(T) F(T 1))e2µT over all T based on Equation 2. This comparison assumes a per-generation mutation rate of µ under an infinite-sites model of mutation. These values were compared against Malécot's approximation under the same conditions (Equation 13 of BARTON et al. 2002). For all values tested, the two estimates differed by no >0.22%.
|
Description of the coalescence of two samples drawn from different locations is both an easier and a harder problem. The addition of a nonzero distance between the samples complicates the equations, but the degree of lineage interference is less, so fewer terms need be considered to generate an acceptable approximation. For sufficiently large values of Nb, the probability density function for coalescence of a pair of sequences separated by a distance of 2
(where
is in units of
) is given by Equation 3 (A17 in the APPENDIX),
![]() | (3) |
* = 0.562, and
indicates the incomplete gamma function, which is given by Equation A18.
Equation 3 was compared to Malécot's approximation for probabilities of identity (Equation 15 of BARTON et al. 2002) for a number of parameter combinations just as Equation 2 was. Equation 3 represents a cruder approximation than Equation 2, and the deviation from Malécot's approximation is correspondingly larger. For a neighborhood size of 50 and a separation between samples of 10
, the deviation ranged from 12.4% (µ = 103) to 21.4% (µ = 105). For Nb = 500, the deviations were much smaller, varying between 1.2% (µ = 103) and 2.2% (µ = 105). The deviations were nearly identical for a separation of 20
. The increasing error at smaller neighborhood sizes results from the fact that Equation 3 incompletely accounts for the interference between lineages, and this interference is greatest when the neighborhood size is small. The larger errors associated with lower mutation rates result from the fact that Equation 3 becomes more inaccurate farther in the past, and older coalescence times are more relevant to the probability of identity when the mutation rate is small. When the mutation rate was set to 105, coalescence times >200,000 generations in the past contributed measurably to the probability of identity based on Equation 3. With µ = 103, the probability of identity was not significantly influenced by coalescent events >10,000 generations in the past. The separation-of-timescales approach employed here means that we will generally require accuracy from Equation 3 only at the smaller values of t. The accuracy of Equation 3 over the duration of the scattering phase is illustrated in the context of the overall method for particular cases below (Figures 810).
|
|
Given these expressions for the coalescent process, we can use the method of images to construct an expression for the distribution of coalescence times for a pair of sequences. This method can be applied to an arbitrary pair of sampling locations, but is restricted to a rectangular habitat. To account for the boundary conditions, we need to assume a number of competing coalescent processes that occur simultaneously, corresponding to reflections off various habitat boundaries. Similar reasoning applies to samples drawn from separate locations. One must consider not only the coalescent process corresponding to the distance
between the two locations, but also each distance
i corresponding to the distance from one sample to each reflected image of the other sample. The total number and arrangement of images that need to be considered will depend on the relative lengths of the axes of the habitat. A minimum of eight images will be required to account for the four habitat boundaries and four corners. Additional images corresponding to multiple reflections may also need to be considered, particularly if the habitat is much longer in one dimension than in the other. In principle, the number of such images is infinite. What is important for this method is that we include all of the images that correspond to distances shorter than that of the maximally distant included image. The process of determining image distances is discussed in more detail in the APPENDIX and is illustrated in Figure 3 for a particular hypothetical habitat. The application of the method of images to a toroidal habitat is described in detail by BARTON and WILSON (1995)(1996) and in the APPENDIX. Unfortunately, the method of images does not lead directly to a method for rigorously treating other habitat shapes. In certain cases, it may be possible to treat the recent past for nearby samples by considering only nearby boundaries. The validity of such an approach is not considered further in the present analysis, however.
|
ABSTRACT
THE MODEL
THE SCATTERING PHASE
>THE COLLECTING PHASE
EFFECTIVE POPULATION SIZE
THE COALESCENCE PROBABILITY...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
Simulations were used to explore the range of parameter values over which the collecting phase converges on the SNCM. Coalescent simulations were performed on samples of size 25. In each case, 100,000 replicates were completed. The results of these simulations were analyzed as skyline plots (PYBUS et al. 2000; STRIMMER and PYBUS 2001), that is, the average number of generations in which there were i lineages left in the sample, where 2
i
25. This average duration was then multiplied by
. The rate of coalescence is thus represented in terms of the effective population for which that rate would be expected under the SNCM. For each value of i, this value was plotted against the mean time at which the period of i lineages ended. For the SNCM, in which KINGMAN's (1982a) coalescent has time scaled by the effective population size (Ne), the expected plot would be a horizontal line near Ne. Figure 4 illustrates the convergence of the coalescent process for different sampling schemes in a square habitat. For this set of parameter values (N = 10,000; L1 = L2 = 10;
= 100;
= 0.3; Nb = 113), the coalescent process converges to an effective population size of
10,700, regardless of the original sampling scheme. It is worth noting that the process converges in <600 generations, compared to an average tree depth of >20,000 generations. Thus, in this case, the collecting phase is well described by the SNCM and accounts for >97% of the genealogical process.
|
Figure 5 shows results for rectangular habitats of a variety of length-to-width ratios. In all cases, all other parameter values are identical to those used to generate Figure 4. The total area is held constant at 100. This graph illustrates the influence of habitat geometry on the long-term coalescent process and suggests conditions under which the SNCM is an adequate description of the collecting phase. Specifically, if the length of the major axis of the habitat is too long with respect to the dispersal rate, the coalescent process does not approach the SNCM. As the ratio L1/L2 becomes very large, the coalescent process is expected to converge on that for a one-dimensional habitat. It is difficult to compare this limit explicitly with the one-dimensional solution (WILKINS and WAKELEY 2002) due to the fact that as L2 becomes small, a very large number of reflections will need to be considered.
|
The average coalescence time for a pair of sequences drawn from a panmictic population of size N is N generations. A sequence taking a one-dimensional random walk with step size
will have to travel a distance L1 in
(L1/
)2 generations. Let us define M* as N
2/L21, where L1 is the length of the major axis of the habitat. If M* >> 1, lineages cross the habitat quickly relative to the rate of the coalescent process. If M* << 1, lineage movement will be rate limiting with respect to coalescence. Thus M* can serve as an indicator of the applicability of the effective-population-size approximation for the collecting phase. The parameter values used to generate Figure 4 correspond to a value of M* = 4, and the skyline plots in Figure 5 have M* values ranging from 0.36 (for the uppermost set of points) to 9.0 (at the bottom). Figure 6 presents the results of simulations demonstrating the correspondence between M* and the uniformity of the long-term coalescent process. Each point on the graph indicates the slope of the last five points of a skyline plot like those presented in Figures 4 and 5. Slopes close to zero indicate a good correspondence to the SNCM approximation. Under the SNCM, the expected waiting time for the last coalescent event is equal to the effective population size and equal to the expected waiting time for all previous coalescent events combined. This slope thus provides a crude estimate of the fractional size of the error introduced by approximating the genealogical process in this manner. For example, if the slope is 0.25, the ratio of the expected waiting time for the last coalescent event to earlier (more recent) waiting times will be on the order of 25% greater than that expected under the SNCM. For values of M* > 1, slopes are consistently <0.05, suggesting that for these sets of parameter values, the error introduced by this approximation will be <5%. The points corresponding to the curves in Figure 5 are derived from 100,000 replicate simulations and are represented by solid circles. Crosses represent the outcome of 10,000 simulations.
|
Analogous simulations for a toroidal habitat produced qualitatively similar results (data not shown), but with the SNCM applied to values of M* > 0.25. Intuitively, a torus whose major axis is of length L1 represents a less extended habitat than a rectangle of length L1. Specifically, a lineage need travel only a distance L1/2 to have crossed the toroidal habitat. Since M* is proportional to 1/L21, this produces the factor of four difference in the value of M* at which the long-term behavior changes from a genealogical process that is well approximated by the SNCM to one that is expected to produce longer coalescence times for the deepest branches.
The conditions under which the SNCM approximation holds bear some resemblance to the strong migration limit (NAGYLAKI 1980, 2000; NOTOHARA 1993). However, the conditions considered here are less stringent. In the strong migration limit, the location of each lineage becomes independent of the locations of the other lineages. In the model considered here, where conservative migration is assumed, this would correspond to the limit in which the effective population size would approach the census size, and the duration of the scattering phase would approach zero. The regime considered here corresponds to those conditions under which the long-term coalescent process appears to be reasonably well described by the largest nonunit eigenvalue of the corresponding Markov chain transition matrix. Furthermore, a range of cases exists (where M* > 1, or, equivalently, Nb > 4
r, where r is the ratio of the lengths of the major and minor axes of the habitat) where the eigenvalues corresponding to subsequent coalescent events bear the same relation to each other as those in the SNCM do. In this range, migration is not strong enough to eliminate geographic structure completely, but the primary effect of limited gene flow is to alter the constant factor by which the coalescent process is rescaled. Figure 4 (where M* = 9) presents an example of a degree of population structure falling within this range. For more restricted gene flow (M* < 1), this rescaling is not constant throughout the coalescent process, leading to the same sort of effects on tree shape that would be expected from a population that had a larger effective population size in the past.
ABSTRACT
THE MODEL
THE SCATTERING PHASE
THE COLLECTING PHASE
>EFFECTIVE POPULATION SIZE
THE COALESCENCE PROBABILITY...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
![]() | (4) |
Computation of Equation 4 is feasible because the terms of the sum become small as either n2 or m2 becomes large. CHARLESWORTH et al. (2003) present an excellent approximation to Equation 4 for the case where L = L1 = L2. Translated into the terms of this article, Equation 9 of CHARLESWORTH et al. (2003) is
![]() | (5) |
![]() | (6) |
Equations 4 and 6 were compared with values derived from simulations (Figure 7). Each simulation-derived effective population size was taken from the last point in a skyline plot like those in Figures 4 and 5. If we plot Ne directly, it is not possible to get clear visual separation between the various curves at all values of
simultaneously. Therefore, for clarity of presentation, simulation and analytic values are given in terms of FST (= (Ne N)/Ne). For a fixed value of N, higher values of FST correspond to higher values of Ne. The fit is reasonable over the sets of parameter values considered. The total habitat area is the same for all curves in Figure 7.
|
The effective population size increases with the total population size and decreases with increasing dispersal rate. In the limit of high dispersal, the population becomes effectively panmictic, and Ne is equal to N. The long-term effective population size is also a function of the habitat geometry. For fixed values of A,
(and therefore N), and
, a rectangular habitat will have a larger value of Ne than a torus of the same dimensions. Intuitively, this is a consequence of the fact that two lineages can be functionally separated by a greater distance in a rectangle than in a torus. The maximum distance between any two points in an L1 x L2 rectangle is
, whereas in a torus it is 
. Recalling that we have assumed that L1
L2, Ne increases with L1/L2 in both the rectangular and toroidal geometries. The intuitive explanation for this effect is similar to that for the difference between the rectangle and torus. The more protracted the rectangle, the greater the possible distance separating two lineages.
A number of the points plotted in Figure 7 correspond to conditions for which M* < 1. It is worth noting that Equations 4 and 6 appear to provide reasonable approximations for the expected waiting time for the most ancient coalescent event, even when the rest of the collecting phase is not well described by the SNCM. For example, for the 25 x 4 rectangular habitat with N = 10,000 and
= 0.1 (M* = 0.16), the effective population size predicted by Equation 6 is 25,974. The average number of generations required to go from two lineages to the common ancestor of the entire sample in 1 million simulated genealogies was 25,894. The mean wait times for the next four most ancient coalescent events (each scaled by
were 22,835, 21,280, 20,290, and 19,612. This suggests that Equations 4 and 6 may provide a valid description of the collecting phase even for some cases where M* < 1, so long as our analysis is restricted to a pair of samples.
ABSTRACT
THE MODEL
THE SCATTERING PHASE
THE COLLECTING PHASE
EFFECTIVE POPULATION SIZE
>THE COALESCENCE PROBABILITY...
DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
at which we make the transition from the scattering phase to the collecting phase. Under many models employing separation of timescales, a point exists in the genealogical process at which the collecting-phase description becomes completely accurate. In this model, by contrast, there is no finite time for which all geographic information has been lost. Rather, the genealogical process asymptotically converges to the collecting phase. Thus, there is no single correct choice for
. I present two methods for calculating
, each of which is dependent on the number of images considered in the analysis. Both have the feature that the results will be relatively insensitive to small changes in
. The accuracy of the description is improved by increasing the number of images considered, but so is the computational burden. Furthermore, the approximations used become less valid as t becomes larger. The choice of the number of images, or image radius (see Figure 3), will therefore be governed by this trade-off.
The first method for calculating
applies to samples drawn from nearby locations. For such samples, the rate of coalescence will be >1/Ne in the recent past and then decay asymptotically toward 1/Ne. The scattering-phase description using a finite number of images will provide a good approximation to the coalescent process at short timescales, but with a coalescence rate that decays to zero. It is appealing, therefore, to set
to the point where the rate of coalescence is equivalent under the two descriptions. This point will be close to the time when the rates of coalescence are equal if we neglect interference between lineages in both processes:
![]() | (7) |
The term on the left-hand side of Equation 7 is summed over all image distances. Figure 8 compares the distribution of coalescence times for three different choices of the image radius for a pair of samples drawn from the adjacent locations in a rectangular habitat. The dip in the plot at t = 2 is the result of substantial inaccuracies in Equation 2 for very small values of t. If the details of the coalescence time distribution in this region are of interest, it would be better to use the recursion equation of BARTON and WILSON (1995), which is given in the APPENDIX as Equation A1. For times >10 generations in the past, Equation 2 provides a good approximation (see Figure 2).
For a pair of samples drawn from two more distant locations, there may be more than one value of
that will satisfy Equation 7, in which case the largest value should be used as the transition time. For two samples drawn from widely separated locations, the rate of coalescence will initially be close to zero and will gradually approach 1/Ne without ever exceeding it. Thus, for some pairs of sampling locations, there will be no value of
that will satisfy Equation 7. Considering a finite number of images, the coalescence rate given by the scattering-phase description will increase to some maximum and then decline to zero in the distant past. The coalescence rate at that maximum will be closer to 1/Ne for larger numbers of images. Under these conditions, I suggest setting
to a value that gives this maximum, which is found by solving Equation 8 for
:
![]() | (8) |
Figure 9 compares the distribution of coalescence times for a pair of samples drawn from two widely separated locations for three different choices of the number of images.
|
Taken together, Equations 28 allow us to construct approximate expressions for the distribution of coalescence times for pairs of sequences drawn from a rectangular habitat. The details of constructing this distribution are provided in the APPENDIX. To provide some validation for the approximations used, Equations A23 and A28 have been compared with Monte Carlo results for a few particular parameter combinations (Figure 10). These data illustrate the dependence of the coalescent process on location within the habitat as well as the distance between samples. As expected from previous analyses (MARUYAMA 1970c,d, 1972; FLEMING and SU 1974; MALéCOT 1975; NAGYLAKI and BARCILON 1988; HEY 1991; HERBOTS 1994, pp. 66 and 145146; WILKINS and WAKELEY 2002), coalescence time increases with the distance between samples and is greater for samples drawn farther from the edge of the habitat.
ABSTRACT
THE MODEL
THE SCATTERING PHASE
THE COLLECTING PHASE
EFFECTIVE POPULATION SIZE
THE COALESCENCE PROBABILITY...
>DISCUSSION
APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
The fact that the separation-of-timescales approach provides a reasonably accurate approximation has implications both for computational methods of analyzing geographic structure and for the interpretation of geographic patterns of genetic diversity. Markov chain Monte Carlo (MCMC)-based approaches to the analysis of coalescent processes typically rely on having an analytic expression for the waiting time until the next event (coalescence, migration, recombination, etc.). For the simple case of n lineages in a panmictic population, there are n 1 coalescent events. The waiting time to go from k to k 1 lineages is exponentially distributed with mean
/N. Thus, one complete genealogy can be constructed simply by generating n 1 exponential waiting times (see, e.g., HUDSON 1990). The computational time required is roughly independent of the population size. One of the challenges in applying MCMC sampling in a continuous habitat is that analogous expressions for these waiting times are not readily available. Put another way, the waiting time to the next "event" is always one generation, since each lineage moves every generation. When lineages are far apart, their movement can be drawn from a Gaussian spanning several generations. However, when they are close together, every generation must be considered explicitly.
This analysis indicates a class of models for which the long-term coalescent behavior is independent of the original sampling scheme. Under the equilibrium model, there is a range of parameter values where this long-term behavior converges on the standard neutral coalescent model with some effective population size that depends on the local dispersal behavior and the geometry of the habitat. This location-independent long-term behavior typically describes the majority of the tree depth. Thus, for an arbitrary number of samples, genealogies could be generated computationally using a two-step process. The recent part of the genealogy can be generated by explicitly simulating every generation. At some point (the same point where we switch from the scattering phase to the collecting phase), this simulation process could be replaced with the conventional process based on waiting times. Such an approach might be used to improve the efficiency of computationally intensive approaches to the analysis of geographically structured data.
For many populations, the deep branches of genealogies will be shaped by nonequilibrium processes (e.g., population bottlenecks, range expansions, selection at linked loci, etc.) and by geographical heterogeneity (e.g., barriers to gene flow). A separation-of-timescales approach may lead to methods for separating isolation-by-distance effects from these sorts of nonequilibrium events. For example, MCMC integration of the scattering phase could be used to project a set of samples from particular locations onto a distribution of ancestral samples at the boundary between the scattering and collecting phases. This ancestral distribution could then be queried for patterns associated with particular demographic histories or for evidence of additional subdivision (in the form of geographic structuring of clades beyond what is explained by ongoing local dispersal). The application of multiple timescales to separate local dispersal from other factors shaping genealogical structure could provide a valuable tool in the development of statistically rigorous phylogeographic methods (KNOWLES and MADDISON 2002; KNOWLES 2004).
The analysis presented here also suggests certain specific features of the genealogical process that may aid in the interpretation of empirical data. In particular, the long-term coalescent behavior observed in simulations may be relevant to efforts to use genealogical structure to infer the existence of barriers to gene flow. First, the duration of the collecting phase is longer than that of the scattering phase by a factor on the order of Nb. This means that in most populations, the scattering phase will account for only a tiny fraction of the genealogy. This is consistent with previous arguments that the deep branches in any genealogy are unlikely to carry useful information about long-term patterns of gene flow under equilibrium conditions.
I have proposed the term M* as a useful metric for characterizing the nature of the genealogical process in the collecting phase. When M* > 1, the collecting phase is well approximated by the SNCM. The wide spread of the points in the M* < 1 region of Figure 6 suggests that M* is not sufficient to fully characterize the collecting phase in those cases where it deviates significantly from the panmictic process and should therefore be used simply to characterize a particular population as falling within one of the two regimes. M* is defined as N
2/L21, where L1 is the length of the longer of the two habitat axes. The condition M* > 1 can be rewritten as Nb > 4
r, where r is the ratio of the lengths of the two habitat axes. Since r can be no smaller than 1, no population is expected to converge to the SNCM if Nb < 4
(
12.5). The more extended the habitat becomes, the larger the neighborhood size must be for this convergence to occur.
The simulation work by IRWIN (2002) in a one-dimensional habitat showed that under certain parameter values genealogies generated from a model of simple isolation by distance could give the appearance of a deep phylogeographic break, which could lead to the erroneous inference of a barrier to gene flow. This signature is in the form of coalescence into two geographically distinct clades, with a deep genealogical split between them. Irwin notes that the likelihood of this outcome increases as the dispersal distance (
) or the population size (N) decreases. Inspection of Figure 4 from IRWIN (2002) indicates that the region of parameter values for which deep phylogeographic breaks are likely corresponds to values of M* that are less than one.
Of course, the variance of the coalescent process is large, even in models lacking geographic structure, and deep genealogical divisions can arise by chance. In attempting to assess the likelihood of the existence of a barrier to gene flow, the best course of action is to consider multiple independently segregating loci. While individual loci may manifest deep phylogeographic breaks by chance, in the absence of linkage, multiple loci are unlikely to manifest geographically coincident deep genealogical divisions unless there is, or has been, some barrier to gene flow. However, when only a single locus is available, determination of whether M* is likely to be less than or greater than one could serve as a check on the plausibility of attributing deep phylogeographic structure to simple isolation by distance in individual cases.
The motivation behind this work has been to generate solutions for a model of geographic structure that is somewhat more realistic than many other models for which analytic solutions have been derived. At the same time, the goal has been to develop a solution that is more easily computed than one that relies on explicitly calculated recursions or Monte Carlo simulations. It is my hope that results such as these may contribute to the development of a useful middle ground between realism and tractability. The implementation of these results into an inferential framework has not been pursued here, but C programs for performing many of the basic calculations are available from the author.
ABSTRACT
THE MODEL
THE SCATTERING PHASE
THE COLLECTING PHASE
EFFECTIVE POPULATION SIZE
THE COALESCENCE PROBABILITY...
DISCUSSION
>APPENDIX
ACKNOWLEDGEMENTS
LITERATURE CITED
Two sequences from the same location:
I begin by deriving the distribution of coalescence times for a pair of sequences drawn from the same location. Following BARTON and WILSON (1995)(1996), the probability that the two lineages coalesce t generations in the past is given by the recursion equation:
![]() | (A1) |
This equation can be rewritten in a nonrecursive form as
![]() | (A2) |
The additional terms continue to alternate sign, with each subsequent term multiplied by 1/Nb and including an additional nested sum. Because each summation is to the previous index minus 1 (e.g., the sum on j is from 1 to i 1), the number of nonzero terms in Equation A2 is equal to t (e.g., for t = 3, only the 1/Nb, 1/Nb2, and 1/Nb3 terms are not equal to zero). The sums in Equation A2 can be eliminated using some approximate forms. Denoting the nested sum following the 1/Nbj term as
j, Equation A2 can be written as
![]() | (A3) |
The
j terms are approximated as follows:
![]() | (A4) |
In these expressions,
is Euler's gamma (
0.5772). The
j terms were calculated explicitly for values of j up through seven and for values of t up through 2000. The approximations in (A4) were in reasonable agreement over this range of values, with errors of <5% over the vast majority of points. These approximations become less accurate as j increases, and there is no guarantee that the form specified by Equations A4 will hold for very large values of j. For this reason, this approach is valid only where the series of sums can be truncated. Since the lead term of
j has the form j Logj1(t + 3/2 j), terms in the sum should be decreasing as long as Log(t) < Nb. Substituting these approximations into (A2) yields an expression of the form
![]() | (A5) |
Owing to the form of the approximations in A4, any term occurring in a particular
j will appear in all subsequent terms
j'>j. If we collect these terms together, this expression becomes
![]() | (A6) |
We can then approximate the series of 1/Nbj terms by 1/(1 + (1/Nb)) to yield an expression of the form
![]() | (A7) |
In principle, this form will include terms up through
![]() | (A8) |
For purposes of calculation, however, it will be useful to truncate Expression A7 after a small number of terms. Once again, the magnitude of the terms in (A7) will diminish so long as Log(t) < Nb and will decrease more rapidly for smaller values of Log(t)/Nb. The corresponding CDF, or the probability that our two lineages coalesce no more than T generations in the past, is derived simply by summing Equation A7 from 1 to T:
![]() | (A9) |
The form of (A7) includes different numbers of terms for different values of t, and the inclusion of the 1/(1 + (1/Nb)) term assumes the existence of subsequent terms. However, a reasonable approximation of the CDF can be derived for larger values of T:
![]() | (A10) |
The 1/Nb sum is approximately equal to (Log(T) +
)/Nb. The "2
" term can be similarly approximated as 2
(Log(T) +
1)/(Nb2 + Nb). The other terms asymptotically approach a similar form such that for large T Equation A10 can be written as
![]() | (A11) |
Approximate values for the first few
i terms are
![]() | (A12) |
Equation A11 can then be rearranged to collect the Log terms:
![]() | (A13) |
This can be further simplified to
![]() | (A14) |
1 =
. This gives a final form for the CDF that does not include any nested sums. Inspection of Equation A14 supports our prior conclusion that this approach is valid only for Log(T) < Nb, since F(T) cannot be >1. Furthermore, inspection of the series of
i terms in (A12) suggests that the terms in the sum in (A14) will decrease in magnitude so long as Nb >
15. A number of the approximations invoked here assume the existence of an infinite number of terms. For very small values of T, this leads to significant errors (see Figure 8), but appears to be reasonably accurate for T > 10 (Figure 2).
Two sequences from different locations:
Consider the case of two samples drawn from locations separated by a distance x. As before, I start with the recursion derived by BARTON and WILSON (1995)(1996). The probability that the two lineages coalesce t generations in the past is given by
![]() | (A15) |
If we rescale distance relative to dispersal, defining
= x/2
, and assume that Nb is sufficiently large that we can neglect terms of order 1/Nb3, this can be rewritten as
![]() | (A16) |
Equation A16 can be approximated by
![]() | (A17) |
* is 0.562, and
(0, z1, z2) is the incomplete gamma function, defined as
![]() | (A18) |
Unfortunately, a general approximate expression for the CDF corresponding to Equation A17 is not readily forthcoming. The CDF can, of course, be written as
![]() | (A19) |
For t >>
*, Equation A17 can be approximated by
![]() | (A20) |
Boundary effects and images:
The effect of habitat boundaries is approximated by an approach analogous to the "method of images" in electrostatics. For a pair of sequences drawn from the same location a distance z from a reflecting boundary, the coalescence probability can be modeled as the combination of two unbounded coalescent processes: one for two sequences from the same location and one for two sequences separated by a distance 2z (
= z/
). This can be imagined as taking one of the two samples and creating an additional, "mirror image" sample by reflection across the boundary. The coalescent process for two sequences drawn from different locations can be constructed in an analogous manner. For two sequences 



, corresponding to the distance separating two neighboring points on the two-dimensional lattice. Agreement is good for values of Nb >10. No boundary effects are included in either the simulations or the analytic results.






























