- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Wilkins, J. F.
- Articles by Wakeley, J.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Wilkins, J. F.
- Articles by Wakeley, J.
The Coalescent in a Continuous, Finite, Linear Population
Jon F. Wilkinsa and John Wakeleyba Program in Biophysics, Harvard University, Cambridge, Massachusetts 02138
b Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts 02138
Corresponding author: Jon F. Wilkins, Biological Laboratories-2102, 16 Divinity Ave., Cambridge, MA 02138., jfwilkin{at}fas.harvard.edu (E-mail)
Communicating editor: M. W. FELDMAN
| ABSTRACT |
|---|
In this article we present a model for analyzing patterns of genetic diversity in a continuous, finite, linear habitat with restricted gene flow. The distribution of coalescent times and locations is derived for a pair of sequences sampled from arbitrary locations along the habitat. The results for mean time to coalescence are compared to simulated data. As expected, mean time to common ancestry increases with the distance separating the two sequences. Additionally, this mean time is greater near the center of the habitat than near the ends. In the distant past, lineages that have not undergone coalescence are more likely to have been at opposite ends of the population range, whereas coalescent events in the distant past are biased toward the center. All of these effects are more pronounced when gene flow is more limited. The pattern of pairwise nucleotide differences predicted by the model is compared to data collected from sardine populations. The sardine data are used to illustrate how demographic parameters can be estimated using the model.
MIGRATION often plays an important role in shaping patterns of genetic diversity. Under conditions of restricted gene flow, the geographical and genetic structures of a population tend to become correlated. In the most basic terms, we expect individuals in close geographical proximity to be genetically more similar than geographically distant individuals. This differentiation will arise even in the absence of local adaptation, due to locally occurring genetic drift. As a result, it is possible to use existing patterns of neutral genetic variation to make inferences about the geographic structure of populations.
The best-studied models of geographic structure are the island (![]()
![]()
![]()
Coalescent theory differs from classical population genetics in its focus on the time to the most recent common ancestor of two or more sequences, rather than on the properties of the population as a whole. This focus on the genealogical structure of a sample provides a framework in which properties of populations can be estimated. Coalescent theory applied to geographically structured populations with discrete demes has been formalized as the "structured coalescent" (see, e.g., ![]()
The coalescent model developed in this article assumes a population distributed uniformly along a finite, one-dimensional habitat. Gene flow is restricted, so locations of parents and offspring are correlated. A diffusion approximation is used to characterize the locations of ancestors of sampled sequences. Applied to pairs of sequences, this approach fully specifies the probability density for the times and locations of their most recent common ancestors and also provides summary statistics such as the mean time to coalescence.
The model is analogous to the one-dimensional stepping-stone model, but with some important differences that illustrate the motivation for this work. Although there has been a lot of work on finite stepping-stone models (discussed below), most analyses of the stepping-stone model rely on nonrealistic treatments of habitat boundaries. The best-studied models fall into two categories. Models in the first category assume that the ends of the array are joined together (the circular stepping-stone model, or the toroidal model in two dimensions; e.g., ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
The model studied here assumes a finite linear habitat (e.g., along a stretch of coastline). The analysis indicates that the expected pattern of genetic diversity does, in fact, depend on location in the habitat, suggesting that application of a circular model to a finite linear population is problematic. At the very least, this misapplication entails discarding information encoded in the variation in genetic diversity along the population range.
Models of isolation by distance in a continuous population date back to ![]()
Work from within the classical population genetics paradigm provides some insight to the properties of finite linear models similar to the one considered here. Finite one-dimensional stepping-stone models have been analyzed by ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
While all of these results are intimately linked to the distribution of coalescence times (![]()
A model similar to the one presented here was proposed by ![]()
![]()
| THE MODEL |
|---|
The model assumes a uniformly distributed population of N haploid individuals in a linear habitat, but can be applied to a population of N/2 diploid individuals without modification. Distance is scaled such that any location along the habitat is indexed by a number between 0 and 1 (with 1/2 being the midpoint of the habitat). Absolute density-dependent population regulation is assumed. Each individual occupies a space of width 1/N, from which all other individuals are excluded. The structure of the population is a one-dimensional lattice, as in the voter model (![]()
![]()
Generations are nonoverlapping. Each individual produces a very large number of gametes, which are dispersed according to a normal distribution centered at the location of the individual and with variance 2
2m. Thus in each generation an effectively infinite number of gametes arrives at each location. One of these gametes is selected at random to become the adult at that location in the next generation. The distribution of the origins of those gametes is the correlation of the normal with a delta function at the individual's location. Thus, the location of a parent is normally distributed around the location of its offspring, with variance 2
2m (a grandparent is normally distributed around the same location with variance 4
2m, and so on).
The boundaries of the habitat are reflecting, so a gamete that would otherwise land outside the habitat range is reflected back an equal distance within it. Each individual thus has the same expected number of offspring regardless of its location. This means that migration is conservative, so migration alone is sufficient to maintain the relative population densities at all locations in the habitat (![]()
As ![]()
Applying a coalescent approach to the analysis of this model involves tracking the location of the ancestors of a particular sequence back in time. The location of a single sampled lineage can be approximated using a diffusion process with diffusion constant
2m. The precise location of a lineage is known only for the generation in which sampling occurred, so its location in the past is represented here as a probability distribution. Considering only a single sequence sampled from a particular location z0 and its ancestors, one can imagine how this probability distribution broadens going back in time. In the distant past, the distribution becomes completely flat, when the ancestral sequence is equally likely to have been anywhere in the range. The exact distribution of ancestral locations can be derived for such a system using a Fourier series.
The analysis presented here derives the distribution of the time to coalescence for a pair of sequences drawn from locations z01 and z02 in the habitat. Each lineage is subject to a diffusion process backward in time, with the probability of coalescence related to the overlap of the two probability distributions (Fig 1). The solution employs a two-dimensional Fourier method, but is more complicated than the case of a single sequence because the ancestral location distributions z1(t) and z2(t), conditional on not yet having coalesced, are not independent of each other.
|
If we consider a single lineage, disregarding habitat boundaries for the moment, it is equally likely to have come from a location to its left or its right in the previous generation. It follows that when we consider two lineages at some distance from each other, they are equally likely to have been closer together or farther apart in the previous generation. However, if the two shared a common ancestor in the previous generation, they must have been closer together. Thus, conditional on not coalescing in the previous generation, the two lineages are slightly more likely to have been farther apart than closer together. In contrast to the single-sequence case, as time in the past becomes very large, the ancestral lineages are not equally likely to be found anywhere in the habitat range. If the two lineages are still distinct, they are likely to have been more geographically separated than a uniform distribution dictates.
The analysis involves a transformation of variables to create two new parameters that do diffuse independently. The first parameter encodes the distance between the two sequences, and the second their average position:
![]() |
(1) |
![]() |
(2) |
The "1" terms in Equation 1 and Equation 2 are included for mathematical convenience, and other coordinate systems would yield the same results, so long as diffusion is isotropic. The distribution of ancestral locations t generations in the past can now be represented as a two-dimensional probability surface in x and y:
![]() |
(3) |
This probability distribution is nonzero within a right triangle ranging from -1 to 1 in y and from |y| to 1 in x (Fig 2).
|
The genealogical history is now modeled as a single diffusion process in this triangular state space. The diffusion constant is 2
2m, twice that for the single-particle diffusion process discussed above. This factor of two arises from the fact that the new parameters are the sum and difference of two
2m single-particle diffusion processes.
Diffusion is subject to two different boundary conditions in the triangular state space. The two short sides of the triangle are reflecting boundaries. Reflection at these lines corresponds to reflection of a lineage off of a habitat boundary. The long side of the triangle is a more complex, partially reflecting, partially absorbing boundary. Positions along this line represent states where the two ancestral lineages are very close together in space. Reflection is equivalent to the two lineages moving past each other. Absorption is equivalent to a coalescent event.
The diffusion process in the transformed state space is isotropic, but not separable. Although diffusion in one dimension is independent of diffusion in the other dimension, it is not independent of location, due to the fact that the state space is triangular. The diagonal reflecting boundaries can be eliminated by placing a mirror image of the state space opposite the boundary. Reflection at the boundary is now represented as movement into the mirror-image state space. Three such reflections transform the state space into a square ranging from -1 to 1 in both x and y. All four edges are coalescent boundaries, and the symmetric shape makes the x and y diffusion processes separable as
![]() |
(4) |
where each distribution satisfies the diffusion equation
![]() |
(5) |
![]() |
(6) |
Each pair of locations (z1, z2) in the real habitat corresponds to four locations in this square space, one in each of the four images of the triangular state space (see Equation A40).
The boundary conditions at the edges of the square depend on both the population density and the dispersal (migration) rate. If the population density and dispersal rate are very high, then when the two lineages come close together, they are likely to pass by each other rather than share a common ancestor, because there are a large number of individuals within their dispersal range. This corresponds to a more reflecting boundary in the two-dimensional state space. On the other hand, if the population density and dispersal rate are low, neighborhood size is small, and two lineages that are close together in space will be more likely to share a common ancestor, corresponding to a more absorbing boundary. Mathematically speaking, the flux rate of the probability distribution across the boundary is equal to the probability that the two lineages coalesce in the previous generation.
Because the model assumes perfect density-dependent population regulation, there is exactly one haploid lineage in each span of width 1/N. A coalescent event occurs when the two ancestral lineages are found within the same 1/N span in a given generation. Note that giving the lineages a finite physical width addresses the concern raised by ![]()
![]() |
(7) |
![]() |
(8) |
These series can be truncated for purposes of making calculations without significant loss of accuracy. The
i and
*j terms are determined by the boundary conditions (see Equation A13aEquation A13b), and the fi and f*j terms are determined by the two sampling locations (Equation A21aEquation A21b). Discussion of Fourier-series solutions of the diffusion equation can be found in most texts (e.g., ![]()
At time t, the probability of the two lineages not yet having coalesced is equal to the volume under the probability surface defined by U (x, y, t) within the square -1 < x < 1, -1 < y < 1. Intuitively, this is because coalescence is represented by the diffusion of the probability density out of the square. Thus, the instantaneous rate of coalescence at time t is given by the rate at which the probability volume within the square is decreasing at that time. From this relationship it is possible to derive the expectation of the time to coalescence for two sequences sampled from locations corresponding to x0 and y0:
![]() |
(9) |
It is possible using this approach to derive a number of other analytic results, including the full probability distribution for the time to coalescence (Equation A25 and Equation A33). Each of the moments of the distribution has a simple form analogous to that for the expectation (Equation A31 and Equation A34). It is also possible to write down the exact distribution of the locations of two lineages that have not yet coalesced (A40), as well as the distribution of locations of the coalescent events (A44A46).
| RESULTS |
|---|
The results derived in this article have been compared to simulation data to assess the accuracy of the diffusion approximation in this context. This section contains analytic and simulation results for a number of values over a range of parameters. These results provide both reassurance regarding the accuracy of the equations and insight into the behavior of the coalescent process in a continuous, linear habitat.
Monte Carlo simulations were performed backward in time using two different migration/coalescence processes. In the first process, the locations of the two lineages were kept as floating point numbers. Migration each generation was performed by drawing a random number from a normal distribution. If the new location for the lineage lay outside the habitat range, the new location was selected by reflection at the habitat boundary. A coalescent event occurs if, after translocation and reflection, the two lineages lie within a distance 1/(2N) of each other. Times and locations of coalescent events were averaged over a large number of sample runs.
The second process used a discrete lattice model. Each of the two lineages was assigned an integer location between 1 and N. Each generation two pseudorandom integers were drawn from independent Poisson distributions for each lineage, which were then translocated by the difference of the two Poissons. This produces a discrete distribution that approximates the shape of a normal distribution. Reflections were performed as in the first process. If, after migration and reflection, the two lineages were at the same location (had the same integer location value), a coalescent event was considered to have occurred. Again, the times and locations of the coalescent events were averaged over a large number of sample runs.
The relative value of the mean time to coalescence is determined largely by the product N
2m, which is analogous to Nm in discrete-deme models of geographical structure. Presented here are three sets of data representing three different values of N
2m. In all three cases N = 200. The values of
2m are 0.005, 0.0005, and 0.00005
. For each value of
2m, the mean time to coalescence was determined for a pair of sequences for a number of sampling locations. Fig 3 presents mean time to coalescence determined analytically from Equation 9 for the parameters used in Table 2
. Mean times to coalescence in Table 1 Table 2 Table 3 are determined by three methods. The first two values are simulation results (by the Poisson and normal methods described above) from 1 million replicates. The third value is the analytically determined value from Equation 9.
|
|
|
|
Inspection of the tables reveals that the analytically derived mean time to coalescence is in good agreement with simulated data, with the results of the three methods differing typically by no more than 0.5%. Table 1 Table 2 Table 3 and Fig 3 also immediately reveal two features of this model. First is the intuitively pleasing result that the mean time to coalescence increases with the physical distance between the two sampled sequences. The rate of increase with distance is dependent on the migration rate, with lower migration rates corresponding to higher rates of genetic divergence with distance.
A second, less intuitive, result from these data is the dependence of time to coalescence on the location of the two sampled sequences (in contrast to their separation). The pattern is most easily seen by considering pairs of adjacent sequences sampled from various locations along the habitat. A pair of adjacent sequences sampled from the center of the population range has a longer mean time to coalescence than pairs sampled closer to the ends, an effect that is more pronounced at lower migration rates. This result is anticipated by classical population genetics results in which the probability of identity by descent is higher for demes near a reflecting boundary (![]()
![]()
![]()
The model also provides results regarding the locations of the lineages and common ancestors (Equation A46). Fig 4 shows the probability surface for the time and location of coalescence for three different pairs of sequences (from Equation A44). Recent coalescent events are likely to be found in the region between the two sampling sites. In the more distant past, the probability distribution depends only on the migration rate and not the sampling locations. This distant-past distribution is biased toward the center of the range. Another result that can be derived from the model is the distribution of the lineage locations conditional on their still being separate at a time t in the past (Equation A40). In the distant past, this distribution is skewed toward the edges. Intuitively, if the two lineages are separated by a very deep genealogical branch, it most likely results from their having spent a lot of time at opposite ends of the range. A number of other results are also derived and presented in the Appendix, including the strong-migration limit (![]()
![]()
|
| APPLICATION TO DATA |
|---|
The expected time to coalescence can be used to estimate demographic parameters. In this section published sequence data are used to fit the model and estimate the effective population size and the genetic dispersal rate. Our purpose here is not to determine specific parameter values for a particular organism. In fact, the population considered below is likely to violate one or more assumptions of the model. Our goal is simply to illustrate the fact that patterns of genetic diversity such as the one predicted by the model may be found in nature. We also want to emphasize the fact that the finite linear model makes different predictions under neutrality than either an island or a circular model and that these differences may alter our interpretation of sequence data.
We have applied the model developed here to sequence data collected from the mitochondrial control region in the five different regional forms of sardines (Sardinops) in the Indian and Pacific oceans (![]()
![]()
On the basis of these observations, it may be reasonable to apply the finite, linear model to this system, treating the populations as linearly arrayed from Japan to California to Chile to Australia to South Africa, with genetic contact possible only between adjacent populations. The total length of this range is
25,000 miles, with the five sampling sites occurring at
5000-mile intervals. These five sites yield 15 pairwise comparisons, which were fit to the model by finding the parameter values that minimize the sum of the squares of the differences between the predicted and observed values. The statistical properties of these estimators are not investigated here. However, in this case, the model does appear to fit the data reasonably well, reproducing the same pattern of genetic diversity, and it provides a framework that highlights certain features of the data.
Fig 5 shows the observed and expected average number of pairwise nucleotide differences (proportional to the expected coalescence time) in the data set for parameter values minimizing the sum of the squares of the errors. The observed data manifest the two key features of the model: increasing genetic differentiation with distance and greater genetic diversity near the center of the habitat. The product N
2m was estimated to be 0.017, and
(= 2Nµ) was estimated to be 1.64. Assuming a per generation mutation rate of
7.5 x 10-6 for the mitochondrial control region (on the basis of a divergence rate of 15% per million years between lineages and a mean generation time of 2 years, ![]()
![]()
, which corresponds to a mean intergenerational migration distance of
8 miles and a (mitochondrial) neighborhood size of
200.
|
Inferences can also be drawn from differences in observed and expected values. In the fourth column of Fig 5 (labeled "Calif."), the observed interpopulation values are all lower than the corresponding expected values. In the fifth column ("Japan"), on the other hand, the expected values are lower than the observed. This pattern suggests that the California and Chile populations are more closely connected genetically than might be predicted from distance alone, whereas the Japan and California populations are genetically more distant than expected. This observation supports BOWEN and GRANT's (1997) argument that the tropical barrier in the eastern Pacific is, or has recently been, traversible. In fact, the data suggest that the tropical water in the eastern Pacific may represent less of a genetic barrier than the equally large band of temperate water between North America and Japan.
This analysis is presented not to address specific issues in sardine biogeography or question the conclusions of Bowen and Grant, who attribute the observed pattern of genetic diversity to a range expansion of the sardine populations. Our purpose has been simply to show that patterns predicted by the model can, in fact, be found in natural populations and to illustrate how the model can be employed to estimate interesting demographic parameters. Furthermore, the analysis suggests how observed patterns of genetic diversity can be made more meaningful when compared to a more sophisticated null model.
| DISCUSSION |
|---|
We have presented a model for analyzing genetic diversity in a finite, continuous, linear population. Using a diffusion approximation, the model fully characterizes the distribution of possible genealogical histories of a pair of sequences sampled from such a population. Results derived from the model include the full distribution of coalescence times and locations, as well as a number of summary statistics, such as the mean time to the most recent common ancestor.
The analytic results derived from the model are in good agreement with simulations over a wide range of parameter values. This agreement extends even to extremely small neighborhood sizes (approaching one), allowing us to relax the usual coalescent theory assumption of a large local population size. In the other extreme, the strong migration limit where the neighborhood size approaches the population size, the model converges on well-established results for the coalescent process in a panmictic population.
The model makes several predictions regarding genealogies in a finite continuous habitat. In addition to the intuitive result that genetic divergence increases with distance, the model predicts that genetic diversity will be greater near the center of the habitat than at the edges. Coalescent events in the recent past are most likely to occur between the sampling locations of the two sequences. In the distant past, the distribution of locations of coalescent events becomes independent of sampling location and is concentrated toward the center of the habitat. The locations of lineages (conditional on not having coalesced), on the other hand, are biased toward the edges of the habitat in the distant past. All of these effects are more pronounced under lower migration.
In this development of the model, we have assumed reflecting habitat boundaries, meaning that individuals suffer no loss of fecundity when they are situated at the edge of the habitat. It may be more reasonable to assume absorbing habitat boundaries. In the forward-time model, this would mean that gametes that dispersed outside the habitat range would be lost. The effect in the backward-time model would be to bias the distribution of lineage locations slightly toward the center of the habitat, decreasing slightly the time to common ancestry, but preserving the broad patterns described for the reflecting-boundaries case.
One property of the model not discussed above is the dependence of the coalescent process on neighborhood size. For a given pair of locations, the ratio of the expected time to coalescence to the total population size is determined primarily by the product N
2m, which is analogous to Nm in demic models of population structure. However, N and
m enter into minor terms in the equations separately, meaning that it is possible, in principle, to estimate these two parameters independently without an independent estimate of the population size (e.g., from
and µ). In practice, however, it seems unlikely that sufficient data could be collected (or that a population could be found that conformed closely enough to the model) to separately estimate these parameters with any certainty.
It is also possible using this model to derive other values for a particular set of parameters. ![]()
![]() |
(10) |
where
0 represents the mean time to coalescence for a pair of sequences drawn from the same deme, and
represents the mean time to coalescence for a pair drawn at random from the entire population. These two values can be derived from Equation 9, where
0 is the average value along the line
, and
is the average value over the entire space (-1 < x0 < 1, -1 < y0 < 1). In fact, the value of
0 is very nearly N, independent of the value of
2m, consistent with the observation that the mean within-deme coalescence time will average to N in any system with conservative migration (![]()
![]()
The distribution of coalescence times given by Equation A25 can also be combined with a particular mutational model. Integration of this probability against the mutational process will yield the probability of identity in state or the likelihood of a particular set of differences between the two sequences. Results such as these may be valuable in the analysis of sequence data.
Mathematica files for generating the results described in this article are available from the authors, as is a C program for estimating parameter values from sequence data.
| ACKNOWLEDGMENTS |
|---|
We thank N. H. Barton, J. L. Cherry, T. Nagylaki, A. Platt, J. Wall, and three anonymous reviewers for helpful discussions and comments on the manuscript. This work was supported by a grant from the Howard Hughes Medical Institute to J.F.W. and in part by National Science Foundation grant no. DEB-9815367 to J.W.
Manuscript received June 10, 2001; Accepted for publication March 4, 2002.
| APPENDIX |
|---|
Derivation of formulas:
Let the initial positions of the two sequences be denoted by z01 and z02, where z01 and z02 represent the relative locations along the length of the entire habitat, and therefore both lie between 0 and 1. Let x0 and y0 be the following transformations of these coordinates:
![]() |
(A1a) |
![]() |
(A1b) |
A pair of coordinates (x, y) then fully describes the system of two identical particles along the line from 0 to 1. The state space in this coordinate system is a right triangle ranging from -1 to 1 in y and from |y| to 1 in x. The state of the system at a time t generations in the past is given by a probability function
on this state space. Migration of the two particles along the line is considered to be a diffusion process with constant
2m, and the corresponding diffusion process in the transformed (x, y) coordinate system is a two-dimensional diffusion process with constant 2
2m in each direction.
The two short sides of the triangular state space represent reflecting boundaries, which may be eliminated by the method of reflecting the state space across the boundary. Three such reflections generate a square state space ranging from -1 to 1 in each direction. Note that crossing over the diagonals of this square involves a transposition of x and y. However, since the diffusion process is isotropic, this transposition does not affect the analysis.
Because the x and y diffusion processes are now separable, further discussion focuses on a one-dimensional diffusion process. The two-dimensional process can be reconstructed by multiplication of two such one-dimensional processes. The derivation uses only x. The equations for the diffusion process in y are identical.
The long side of the triangular state space, which is now replicated four times as the boundary of the new square state space, represents a partially reflecting, partially absorbing boundary. The diffusion process has now been reduced to a Sturm-Liouville-type problem, where the boundary conditions are set by a relationship between the flux rate across the boundary and the density function within the boundary.
The function Ux(x, t) must satisfy the diffusion equation within the range (-1, 1),
![]() |
(A2) |
and is subject to the boundary conditions
![]() |
(A3a) |
![]() |
(A3b) |
In these equations, the flux J across the boundary is set equal to the average probability over the next generation that the separation between the two lineages is <1/(2N). Since there are N individuals in the population and complete density-dependent population regulation, it is assumed that two lineages separated by less than one-half the "width" of an individual must be the same lineage. Thus the flux rate across the boundary is equal to the rate of coalescence. The probability that the two lineages coalesce in the previous generation is proportional both to this width and to the probability that the two lineages are separated by a distance s and that this separation decreases by s in one generation of dispersal, integrated over all possible values of s. In Equation A3a and Equation A3b, a normal distribution has been assumed for the single-generation dispersal pattern. Other dispersal patterns are, of course, possible. However, most patterns will converge on a normal distribution after a relatively small number of generations. Reflections have been ignored in these two equations, as it is assumed that single-generation migration events much longer than the total habitat length are extremely rare.
The general solution to the diffusion equation is
![]() |
(A4) |
and we can incorporate the boundary conditions by using Taylor-series expansions
![]() |
(A5a) |
![]() |
(A5b) |
where U(n)x represents the nth derivative of Ux with respect to x. The right-hand sides of these equations can be integrated term by term to give
![]() |
(A6a) |
![]() |
(A6b) |
Substituting in expressions for the derivatives of Ux (derived from Equation A4), we get
![]() |
(A7a) |
and
![]() |
(A7b) |
Collecting terms by C1 and C2 and simplifying, these conditions become
![]() |
(A8a) |
and
![]() |
(A8b) |
Two classes of nontrivial solutions satisfy both of these conditions simultaneously. For the first class of solutions, C2 = 0 and
satisfies the following:
![]() |
(A9) |
The second class of solutions will have C1 = 0 and values of
that satisfy
![]() |
(A10) |
These relationships can be further simplified once we consider the following two relationships for the gamma function:
![]() |
(A11a) |
![]() |
(A11b) |
The specific solution for the diffusion equation with these boundary conditions then becomes
![]() |
(A12) |
where
i satisfies
![]() |
(A13a) |
and
*j satisfies
![]() |
(A13b) |
The normalization values for the eigenfunctions are
![]() |
(A14a) |
![]() |
(A14b) |
which makes the normalized solution at time t = 0,
![]() |
(A15) |
where the C1 and C2 terms are derived by integrating the product of the probability distribution at time t = 0 with each term:
![]() |
(A16a) |
![]() |
(A16b) |
If we take the initial probability distribution to be the
function
(x - x0), then the system in x is represented as
![]() |
(A17) |
The
i series terms (from Equation A13a) are of the form
, where 0 <
i+1 <
i < 1/2. Similarly, the
*j terms (from Equation A13a) are of the form
, where 0 <
*j+1 <
*j <
. Since both series are monotonically increasing, and
2 terms are found in the exponential, the series can be truncated for purposes of making calculations without significant loss of accuracy. The number of terms required is determined by parameter values, with more terms needed for smaller neighborhood sizes, and for samples that are closer together. The calculations presented in Table 1 Table 2 Table 3 were truncated after 40, 20, and 7 terms, respectively.
The equations for the
i and a*j can be simplified in the case where the neighborhood size is very large (
N
m >> 1), in which case these equations become approximately
![]() |
(A18a) |
and
![]() |
(A18b) |
The form of the solution in Equation A17 works well so long as the probability of coalescence in the first generation in the past is very small, that is, when the two sequences are separated by a sufficient distance (when x0 is not close to 1) or when the neighborhood size is large. If x0 is close to 1 and the neighborhood size is not large, we must use a more complex form for the initial conditions. This results from the fact that the model assumes discrete time steps, and so the shortest possible coalescence time is one generation. The diffusion approximation solution, on the other hand, assumes continuous time, and coalescence is possible at any time t > 0. When the probability of coalescence occurring within the first time step is very small, this correction is negligible. However, under certain conditions, we must account for the fact that one generation of migration occurs prior to the first opportunity for coalescence. This migration effectively moves the initial condition probability peak away from the boundary, resulting in a longer predicted time to coalescence.
The more accurate form of the solution, valid over all values of x0, takes as its initial conditions a normal distribution of variance 2
2m, centered at x0 and reflected off of the boundary at x = 1. The initial probability distribution is given approximately by
![]() |
(A19) |
Then the probability distribution in x as a function of t becomes
![]() |
(A20) |
where
![]() |
(A21a) |
![]() |
(A21b) |
Only a single reflection (at the point x = 1) is considered in deriving the initial conditions. More accurately, reflection off of the x = -1 point would also be included, as well as higher-order reflections (off of both x = 1 and x = -1, etc.). However, unless the migration rate is extremely high (high enough to completely eliminate any geographical structure), these terms are negligible if x0 > 0. We have assumed by definition that x0 > 0, but y0 is allowed to range between -1 and 1. It should be noted, however, that all solutions for y0 will be identical under the transformation y0 = -y0. Therefore, by considering only y0 > 0, we can retain a complete description of the system and ignore migratory reflections off of the
boundary in the first generation.
Assuming this distribution for the initial conditions is equivalent to permitting one-half generation of migration to occur prior to initiation of the coalescent process. In this way, coalescent events that occur over the first generation take place in the range of times 1/2 < t < 3/2. Note that t = 1, the first time when coalescence can occur in the discrete time model, lies at the center of this range. This "premigration," which is necessary to compensate for approximations made in the translation from discrete to continuous time, gives rise to the "t - 1/2" terms in Equation A20 and in subsequent derivations. Once these factors have been taken into account, a full description of the state of the system at time t is given by
, which can be manipulated to yield a number of other results.
Cumulative distribution function of the coalescence time:
First we derive the probability that the two lineages have coalesced prior to time t. Recall that, in this formulation, coalescence is equivalent to diffusion outside of the square state space. The formulas derived above have been normalized so that, at t = 0, the volume under the probability surface is equal to one. At a time t, the probability that the two lineages are still separate is simply given by the volume under the probability surface within the square:
![]() |
(A22) |
Due to symmetry, the sine terms in the series expansions of Ux and Uy integrate to zero, leaving only a product of the integrals of the cosine series, and the cumulative distribution function of the coalescence time is
![]() |
(A23) |
Since all
i are nonzero, this value approaches one as t goes to infinity.
Probability density function of the coalescence time:
The probability that the two lineages coalesce at time t is given by the time derivative of the cumulative distribution function,
![]() |
(A24) |
which yields
![]() |
(A25) |
for values of t > 1/2. For t < 1/2, C(t) = 0.
Moments of the distribution:
The expectation and variance for the distribution of coalescence times, as well as all higher moments, can be derived from the expression for C(t). The pth moment of the distribution is given by
![]() |
(A26) |
Since C(t) = 0 for t < 1/2, this becomes
![]() |
(A27) |
The solution to
![]() |
(A28) |
has a simple form, which can be used to derive the moments:
![]() |
(A29) |
The expectation for the coalescence time then becomes
![]() |
(A30) |
and the variance is
![]() |
(A31) |
Simplified form:
If the two samples are taken from locations separated by more than the single-generation dispersal range, or if the neighborhood size is large, we can neglect the premigration correction introduced above and use a simplified form of the solution where the initial state of the system is the
function
(x - x0, y - y0). The cumulative distribution function, probability density function, and pth moment of the distribution of the time to coalescence are given by
![]() |
(A32) |
![]() |
(A33) |
![]() |
(A34) |
High migration limit:
Considering the strong-migration limit, where N
2m becomes large, we expect the model to converge on a panmictic population. As N
2m becomes large,
1 approaches zero, and only the first term in Equation A34 contributes significantly to the sum. Equation A33 thus simplifies to
![]() |
(A35) |
Furthermore, sin(
1) approaches
1, and cos(
1 x0) and cos(
1 y0) both approach one:
![]() |
(A36) |
Finally, using a series expansion for cot(
1), we find that, for large N
2m,
![]() |
(A37) |
![]() |
(A38) |
which gives us
![]() |
(A39) |
which is exactly what is expected for a panmictic population of size N.
Locations of the lineages:
The probability distribution for the locations of two lineages that have not yet coalesced is fully described by U(x, y, t). The transformed (triangular) state space is represented four times within the square space over which we have been considering the values of U. Thus, each combination of lineage locations (z1, z2) corresponds to four pairs of (x, y) coordinates. The likelihood at time t that the two lineages are at positions z1 and z2 is given by
![]() |
(A40) |
where
![]() |
(A41) |
This probability distribution is not conditional on the two lineages still being separate. That is, the total probability over all (z1, z2) between zero and one will not integrate to one, but rather to the probability that the two lineages have not yet coalesced.
Locations of the coalescence events:
The instantaneous rate of coalescence at a particular location is equivalent to the flux across the boundary at the point corresponding to that location. The probability of a coalescent event occurring at a particular location in the habitat, z0, corresponds to flux at four locations in the transformed state space, one on each of the four sides, and the differential width in the habitat,
z, is twice the corresponding differential width (
x or
y) in the transformed space. The coalescence rate within the range z0 to z0 +
z is





































































