Genetics, Vol. 161, 873-888, June 2002, Copyright © 2002

The Coalescent in a Continuous, Finite, Linear Population

Jon F. Wilkinsa and John Wakeleyb
a Program in Biophysics, Harvard University, Cambridge, Massachusetts 02138
b Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts 02138

Corresponding author: Jon F. Wilkins, Biological Laboratories-2102, 16 Divinity Ave., Cambridge, MA 02138., jfwilkin{at}fas.harvard.edu (E-mail)

Communicating editor: M. W. FELDMAN


*  ABSTRACT
*TOP
*ABSTRACT
*THE MODEL
*RESULTS
*APPLICATION TO DATA
*DISCUSSION
*APPENDIX
*LITERATURE CITED

In this article we present a model for analyzing patterns of genetic diversity in a continuous, finite, linear habitat with restricted gene flow. The distribution of coalescent times and locations is derived for a pair of sequences sampled from arbitrary locations along the habitat. The results for mean time to coalescence are compared to simulated data. As expected, mean time to common ancestry increases with the distance separating the two sequences. Additionally, this mean time is greater near the center of the habitat than near the ends. In the distant past, lineages that have not undergone coalescence are more likely to have been at opposite ends of the population range, whereas coalescent events in the distant past are biased toward the center. All of these effects are more pronounced when gene flow is more limited. The pattern of pairwise nucleotide differences predicted by the model is compared to data collected from sardine populations. The sardine data are used to illustrate how demographic parameters can be estimated using the model.


MIGRATION often plays an important role in shaping patterns of genetic diversity. Under conditions of restricted gene flow, the geographical and genetic structures of a population tend to become correlated. In the most basic terms, we expect individuals in close geographical proximity to be genetically more similar than geographically distant individuals. This differentiation will arise even in the absence of local adaptation, due to locally occurring genetic drift. As a result, it is possible to use existing patterns of neutral genetic variation to make inferences about the geographic structure of populations.

The best-studied models of geographic structure are the island (WRIGHT 1931 Down; MARUYAMA 1970A Down) and stepping-stone (KIMURA and WEISS 1964 Down) models. Both types of model assume a population composed of a number of subpopulations, or demes, connected to each other through migration. Each deme is assumed to be panmictic. In the island model, there is no explicit geography, in that each migration event occurs via a common migrant pool. Stepping-stone models, in contrast, permit migration only between neighboring demes. In the one-dimensional model, the demes are arrayed in a line, and each deme exchanges migrants only with the two adjacent demes. The analogous two-dimensional model assumes a grid of demes, with each deme exchanging migrants with some number of neighbors (e.g., four).

Coalescent theory differs from classical population genetics in its focus on the time to the most recent common ancestor of two or more sequences, rather than on the properties of the population as a whole. This focus on the genealogical structure of a sample provides a framework in which properties of populations can be estimated. Coalescent theory applied to geographically structured populations with discrete demes has been formalized as the "structured coalescent" (see, e.g., WILKINSON-HERBOTS 1998 Down).

The coalescent model developed in this article assumes a population distributed uniformly along a finite, one-dimensional habitat. Gene flow is restricted, so locations of parents and offspring are correlated. A diffusion approximation is used to characterize the locations of ancestors of sampled sequences. Applied to pairs of sequences, this approach fully specifies the probability density for the times and locations of their most recent common ancestors and also provides summary statistics such as the mean time to coalescence.

The model is analogous to the one-dimensional stepping-stone model, but with some important differences that illustrate the motivation for this work. Although there has been a lot of work on finite stepping-stone models (discussed below), most analyses of the stepping-stone model rely on nonrealistic treatments of habitat boundaries. The best-studied models fall into two categories. Models in the first category assume that the ends of the array are joined together (the circular stepping-stone model, or the toroidal model in two dimensions; e.g., MARUYAMA 1970B Down; NAGYLAKI 1974A Down, NAGYLAKI 1977 Down; STROBECK 1987 Down; SLATKIN 1991 Down). This assumption secures mathematical tractability by making all demes identical and migration isotropic (STROBECK 1987 Down), but is directly applicable to few systems in nature (e.g., a population inhabiting the entire coast of an island). Models in the second category assume a linear habitat of infinite length (e.g., WEISS and KIMURA 1965 Down; NAGYLAKI 1974B Down, NAGYLAKI 1976 Down; SAWYER 1976 Down, SAWYER 1977 Down). While these models provide useful insights regarding the short-term behavior of populations in a one-dimensional habitat, they predict infinite divergence between individuals (SAWYER 1976 Down; GRIFFITHS 1981 Down; WILKINSON-HERBOTS 1998 Down).

The model studied here assumes a finite linear habitat (e.g., along a stretch of coastline). The analysis indicates that the expected pattern of genetic diversity does, in fact, depend on location in the habitat, suggesting that application of a circular model to a finite linear population is problematic. At the very least, this misapplication entails discarding information encoded in the variation in genetic diversity along the population range.

Models of isolation by distance in a continuous population date back to WRIGHT 1943 Down, who defined the effective neighborhood population size as the reciprocal of the probability of self-fertilization. That is, the neighborhood size is approximately the number of individuals within the single-generation dispersal range. Wright's work shows that the correlation between adjacent individuals and the differentiation between neighborhoods both increase as the neighborhood size becomes small compared with the total population size. However, much of the theoretical work since has focused on populations subdivided into discrete demes. While a discrete model may be appropriate for many organisms, others may be distributed more or less continuously across a particular range, but nevertheless be geographically structured due to limited gene flow. The model presented here assumes a continuous population, but can be applied in modified form to the discrete-demes model.

Work from within the classical population genetics paradigm provides some insight to the properties of finite linear models similar to the one considered here. Finite one-dimensional stepping-stone models have been analyzed by MARUYAMA 1970C Down, FLEMING and SU 1974 Down, and MALECOT 1975 Down. These analyses derive expectations for classical measures such as the covariance in gene frequencies across demes. NAGYLAKI and BARCILON 1988 Down have considered probabilities of identity in a semiinfinite linear habitat. MARUYAMA 1971 Down has also derived probabilities of identity for a continuous population on a torus and the rate of decrease in genetic variability in a finite two-dimensional population (MARUYAMA 1970D Down, MARUYAMA 1972 Down). HEY 1991 Down has compared the mean coalescence time for a pair of sequences sampled from opposite ends of a finite linear stepping stone with that of a pair sampled at random from the entire population. The result that coalescence times are longer near the center of the habitat range is consistent with findings of HERBOTS 1994 Down(pp. 66 and 145–146), who found a similar pattern in linear stepping-stone models with three and five demes.

While all of these results are intimately linked to the distribution of coalescence times (SLATKIN 1991 Down), many classical population-genetic analyses do not make use of all of the information in DNA sequence data, making the coalescent approach presented here preferable. Furthermore, all of these analyses rely on approximations that assume a large local population size, an assumption that is relaxed in the present analysis. However, it is noteworthy that the results derived here are consistent with, and in some cases anticipated by, much of this previous work.

A model similar to the one presented here was proposed by BARTON and WILSON 1995 Down, BARTON and WILSON 1996 Down, who applied a coalescent approach to a continuous population in two dimensions, deriving recursion equations that describe the coalescent process for a pair of sequences. These equations agree closely with simulated distributions of coalescence times. However, the method becomes cumbersome for long coalescence times and does not readily lead to summary statistics such as the moments of the probability distribution. While limited to pairs of genes in a one-dimensional habitat, the model presented here is easily applied to both long and short coalescence times and yields summary statistics that can be used to make inferences regarding demographic history from genetic data. Simulation results confirm that the diffusion approximation used in this model provides an accurate characterization of the entire coalescent process under a broad range of parameter values.


*  THE MODEL
*TOP
*ABSTRACT
*THE MODEL
*RESULTS
*APPLICATION TO DATA
*DISCUSSION
*APPENDIX
*LITERATURE CITED

The model assumes a uniformly distributed population of N haploid individuals in a linear habitat, but can be applied to a population of N/2 diploid individuals without modification. Distance is scaled such that any location along the habitat is indexed by a number between 0 and 1 (with 1/2 being the midpoint of the habitat). Absolute density-dependent population regulation is assumed. Each individual occupies a space of width 1/N, from which all other individuals are excluded. The structure of the population is a one-dimensional lattice, as in the voter model (HOLLEY and LIGGETT 1975 Down), a contact-process model used in many ecological applications (DURRETT and LEVIN 1994 Down). The distribution of coalescence times is found using a continuous approximation. Another way to think of the model is as a stepping-stone model consisting of N demes, each of size 1.

Generations are nonoverlapping. Each individual produces a very large number of gametes, which are dispersed according to a normal distribution centered at the location of the individual and with variance 2{sigma}2m. Thus in each generation an effectively infinite number of gametes arrives at each location. One of these gametes is selected at random to become the adult at that location in the next generation. The distribution of the origins of those gametes is the correlation of the normal with a delta function at the individual's location. Thus, the location of a parent is normally distributed around the location of its offspring, with variance 2{sigma}2m (a grandparent is normally distributed around the same location with variance 4{sigma}2m, and so on).

The boundaries of the habitat are reflecting, so a gamete that would otherwise land outside the habitat range is reflected back an equal distance within it. Each individual thus has the same expected number of offspring regardless of its location. This means that migration is conservative, so migration alone is sufficient to maintain the relative population densities at all locations in the habitat (NAGYLAKI 1980 Down). Nonreflecting boundaries would correspond to the case where those gametes dispersing outside the habitat range are lost. In such a system, individuals near the edges of the habitat would have a reduced effective fecundity relative to those nearer to the center.

As FELSENSTEIN 1975 Down pointed out, most continuous-space models in population genetics assume a uniform population density that would not actually be maintained by the proposed reproductive scheme. A normal distribution of gametes without severe density regulation generates a population that is clumped together at certain locations and sparsely populated at others. With its absolute density regulation at all locations, the model of reproduction proposed here will immediately generate and maintain a population that is uniformly distributed across its habitat range.

Applying a coalescent approach to the analysis of this model involves tracking the location of the ancestors of a particular sequence back in time. The location of a single sampled lineage can be approximated using a diffusion process with diffusion constant {sigma}2m. The precise location of a lineage is known only for the generation in which sampling occurred, so its location in the past is represented here as a probability distribution. Considering only a single sequence sampled from a particular location z0 and its ancestors, one can imagine how this probability distribution broadens going back in time. In the distant past, the distribution becomes completely flat, when the ancestral sequence is equally likely to have been anywhere in the range. The exact distribution of ancestral locations can be derived for such a system using a Fourier series.

The analysis presented here derives the distribution of the time to coalescence for a pair of sequences drawn from locations z01 and z02 in the habitat. Each lineage is subject to a diffusion process backward in time, with the probability of coalescence related to the overlap of the two probability distributions (Fig 1). The solution employs a two-dimensional Fourier method, but is more complicated than the case of a single sequence because the ancestral location distributions z1(t) and z2(t), conditional on not yet having coalesced, are not independent of each other.



View larger version (15K):
In this window
In a new window
Download PPT slide
 
Figure 1. The diffusion process for two lineages. The graphs illustrate the process of lineage diffusion backward in time. (A) The probability distribution for the locations of the two lineages at a time in the recent past. For sequences sampled from different locations, there is little overlap in the two distributions and therefore little possibility of a coalescent event. Going farther back in the past (B and C), the overlap between the two distributions initially increases, and the probability of the two sequences sharing a common ancestor increases. More recent coalescent events are most likely to occur near the center of the space separating the two samples (B). In the more distant past, the overlap between the two decreases and broadens, and the range over which coalescent events are likely to have occurred becomes less well defined (C).

If we consider a single lineage, disregarding habitat boundaries for the moment, it is equally likely to have come from a location to its left or its right in the previous generation. It follows that when we consider two lineages at some distance from each other, they are equally likely to have been closer together or farther apart in the previous generation. However, if the two shared a common ancestor in the previous generation, they must have been closer together. Thus, conditional on not coalescing in the previous generation, the two lineages are slightly more likely to have been farther apart than closer together. In contrast to the single-sequence case, as time in the past becomes very large, the ancestral lineages are not equally likely to be found anywhere in the habitat range. If the two lineages are still distinct, they are likely to have been more geographically separated than a uniform distribution dictates.

The analysis involves a transformation of variables to create two new parameters that do diffuse independently. The first parameter encodes the distance between the two sequences, and the second their average position:

(1)


(2)

The "1" terms in Equation 1 and Equation 2 are included for mathematical convenience, and other coordinate systems would yield the same results, so long as diffusion is isotropic. The distribution of ancestral locations t generations in the past can now be represented as a two-dimensional probability surface in x and y:

(3)

This probability distribution is nonzero within a right triangle ranging from -1 to 1 in y and from |y| to 1 in x (Fig 2).



View larger version (21K):
In this window
In a new window
Download PPT slide
 
Figure 2. The state space of the system in the transformed coordinates. Diffusion of the two lineages in the one-dimensional habitat is represented as the diffusion of a single point in a two-dimensional state space. The x coordinate encodes the distance between the two lineages, with x = 0 corresponding to the maximal separation of the two and x = 1 corresponding to the two lineages being at the same location. The y coordinate encodes the average position of the two lineages, with y = -1 corresponding to both lineages being at the end of the range where z = 0 and y = 1 corresponding to z = 1. The point x = 0, y = 0 is at the center of the square that is generated by the three reflections.

The genealogical history is now modeled as a single diffusion process in this triangular state space. The diffusion constant is 2{sigma}2m, twice that for the single-particle diffusion process discussed above. This factor of two arises from the fact that the new parameters are the sum and difference of two {sigma}2m single-particle diffusion processes.

Diffusion is subject to two different boundary conditions in the triangular state space. The two short sides of the triangle are reflecting boundaries. Reflection at these lines corresponds to reflection of a lineage off of a habitat boundary. The long side of the triangle is a more complex, partially reflecting, partially absorbing boundary. Positions along this line represent states where the two ancestral lineages are very close together in space. Reflection is equivalent to the two lineages moving past each other. Absorption is equivalent to a coalescent event.

The diffusion process in the transformed state space is isotropic, but not separable. Although diffusion in one dimension is independent of diffusion in the other dimension, it is not independent of location, due to the fact that the state space is triangular. The diagonal reflecting boundaries can be eliminated by placing a mirror image of the state space opposite the boundary. Reflection at the boundary is now represented as movement into the mirror-image state space. Three such reflections transform the state space into a square ranging from -1 to 1 in both x and y. All four edges are coalescent boundaries, and the symmetric shape makes the x and y diffusion processes separable as

(4)

where each distribution satisfies the diffusion equation

(5)


(6)

Each pair of locations (z1, z2) in the real habitat corresponds to four locations in this square space, one in each of the four images of the triangular state space (see Equation A40).

The boundary conditions at the edges of the square depend on both the population density and the dispersal (migration) rate. If the population density and dispersal rate are very high, then when the two lineages come close together, they are likely to pass by each other rather than share a common ancestor, because there are a large number of individuals within their dispersal range. This corresponds to a more reflecting boundary in the two-dimensional state space. On the other hand, if the population density and dispersal rate are low, neighborhood size is small, and two lineages that are close together in space will be more likely to share a common ancestor, corresponding to a more absorbing boundary. Mathematically speaking, the flux rate of the probability distribution across the boundary is equal to the probability that the two lineages coalesce in the previous generation.

Because the model assumes perfect density-dependent population regulation, there is exactly one haploid lineage in each span of width 1/N. A coalescent event occurs when the two ancestral lineages are found within the same 1/N span in a given generation. Note that giving the lineages a finite physical width addresses the concern raised by SAWYER 1976 Down that common ancestry in a continuous model requires two lineages to have a physical separation of zero, which leads to pathological behaviors when applied to models of more than one dimension. In our continuous approximation of the population, we assume that a coalescent event occurs whenever the two lineages are separated by a distance of <1/(2N), that is, when 1 - 1/(2N) < x < 1. This approximates the probability that both lineages are found within the same fixed span of width 1/N. Applying this criterion at the boundaries, the result is an infinite series of sine and cosine terms. The full solution is derived in the Appendix, and the main results are reproduced here in the text. For example, the joint probability density for the locations of the two lineages is given by

(7)


(8)

These series can be truncated for purposes of making calculations without significant loss of accuracy. The {alpha}i and {alpha}*j terms are determined by the boundary conditions (see Equation A13aEquation A13b), and the fi and f*j terms are determined by the two sampling locations (Equation A21aEquation A21b). Discussion of Fourier-series solutions of the diffusion equation can be found in most texts (e.g., CHURCHILL and BROWN 1987 Down). The treatment of initial conditions used here is standard, and the boundary conditions are incorporated using the Sturm-Liouville method.

At time t, the probability of the two lineages not yet having coalesced is equal to the volume under the probability surface defined by U (x, y, t) within the square -1 < x < 1, -1 < y < 1. Intuitively, this is because coalescence is represented by the diffusion of the probability density out of the square. Thus, the instantaneous rate of coalescence at time t is given by the rate at which the probability volume within the square is decreasing at that time. From this relationship it is possible to derive the expectation of the time to coalescence for two sequences sampled from locations corresponding to x0 and y0:

(9)

It is possible using this approach to derive a number of other analytic results, including the full probability distribution for the time to coalescence (Equation A25 and Equation A33). Each of the moments of the distribution has a simple form analogous to that for the expectation (Equation A31 and Equation A34). It is also possible to write down the exact distribution of the locations of two lineages that have not yet coalesced (A40), as well as the distribution of locations of the coalescent events (A44–A46).


*  RESULTS
*TOP
*ABSTRACT
*THE MODEL
*RESULTS
*APPLICATION TO DATA
*DISCUSSION
*APPENDIX
*LITERATURE CITED

The results derived in this article have been compared to simulation data to assess the accuracy of the diffusion approximation in this context. This section contains analytic and simulation results for a number of values over a range of parameters. These results provide both reassurance regarding the accuracy of the equations and insight into the behavior of the coalescent process in a continuous, linear habitat.

Monte Carlo simulations were performed backward in time using two different migration/coalescence processes. In the first process, the locations of the two lineages were kept as floating point numbers. Migration each generation was performed by drawing a random number from a normal distribution. If the new location for the lineage lay outside the habitat range, the new location was selected by reflection at the habitat boundary. A coalescent event occurs if, after translocation and reflection, the two lineages lie within a distance 1/(2N) of each other. Times and locations of coalescent events were averaged over a large number of sample runs.

The second process used a discrete lattice model. Each of the two lineages was assigned an integer location between 1 and N. Each generation two pseudorandom integers were drawn from independent Poisson distributions for each lineage, which were then translocated by the difference of the two Poissons. This produces a discrete distribution that approximates the shape of a normal distribution. Reflections were performed as in the first process. If, after migration and reflection, the two lineages were at the same location (had the same integer location value), a coalescent event was considered to have occurred. Again, the times and locations of the coalescent events were averaged over a large number of sample runs.

The relative value of the mean time to coalescence is determined largely by the product N{sigma}2m, which is analogous to Nm in discrete-deme models of geographical structure. Presented here are three sets of data representing three different values of N{sigma}2m. In all three cases N = 200. The values of {sigma}2m are 0.005, 0.0005, and 0.00005 . For each value of {sigma}2m, the mean time to coalescence was determined for a pair of sequences for a number of sampling locations. Fig 3 presents mean time to coalescence determined analytically from Equation 9 for the parameters used in Table 2 . Mean times to coalescence in Table 1 Table 2 Table 3 are determined by three methods. The first two values are simulation results (by the Poisson and normal methods described above) from 1 million replicates. The third value is the analytically determined value from Equation 9.



View larger version (161K):
In this window
In a new window
Download PPT slide
 
Figure 3. This surface represents the mean coalescence times for pairs of sequences drawn from various locations in the habitat range. The data here are equivalent to those presented in Table 2 (N = 100, {sigma}2m = 0.001). The x- and y-axes represent the two sampling locations. Darker areas correspond to shorter mean coalescence times. The lower-left and upper-right corners represent the case where both samples are taken from one end of the range. The other two corners (with the longest coalescence times) represent the case where the two samples are taken from opposite ends of the range. Mean coalescence times are longer for samples taken from the center of the array than for samples taken from near the ends.


 
View this table:
In this window
In a new window

 
Table 1. Comparison of derived and simulated mean coalescence times (N = 100; {sigma}2m = 0.0001)


 
View this table:
In this window
In a new window

 
Table 2. Comparison of derived and simulated mean coalescence times (N = 100; {sigma}2m = 0.001)


 
View this table:
In this window
In a new window

 
Table 3. Comparison of derived and simulated mean coalescence times (N = 100; {sigma}2m = 0.01)

Inspection of the tables reveals that the analytically derived mean time to coalescence is in good agreement with simulated data, with the results of the three methods differing typically by no more than 0.5%. Table 1 Table 2 Table 3 and Fig 3 also immediately reveal two features of this model. First is the intuitively pleasing result that the mean time to coalescence increases with the physical distance between the two sampled sequences. The rate of increase with distance is dependent on the migration rate, with lower migration rates corresponding to higher rates of genetic divergence with distance.

A second, less intuitive, result from these data is the dependence of time to coalescence on the location of the two sampled sequences (in contrast to their separation). The pattern is most easily seen by considering pairs of adjacent sequences sampled from various locations along the habitat. A pair of adjacent sequences sampled from the center of the population range has a longer mean time to coalescence than pairs sampled closer to the ends, an effect that is more pronounced at lower migration rates. This result is anticipated by classical population genetics results in which the probability of identity by descent is higher for demes near a reflecting boundary (MARUYAMA 1970C Down; NAGYLAKI and BARCILON 1988 Down) and by the work of HERBOTS 1994 Down.

The model also provides results regarding the locations of the lineages and common ancestors (Equation A46). Fig 4 shows the probability surface for the time and location of coalescence for three different pairs of sequences (from Equation A44). Recent coalescent events are likely to be found in the region between the two sampling sites. In the more distant past, the probability distribution depends only on the migration rate and not the sampling locations. This distant-past distribution is biased toward the center of the range. Another result that can be derived from the model is the distribution of the lineage locations conditional on their still being separate at a time t in the past (Equation A40). In the distant past, this distribution is skewed toward the edges. Intuitively, if the two lineages are separated by a very deep genealogical branch, it most likely results from their having spent a lot of time at opposite ends of the range. A number of other results are also derived and presented in the Appendix, including the strong-migration limit (NAGYLAKI 1980 Down, NAGYLAKI 2000 Down) and application of these results to a linear array of demes.





View larger version (66K):
In this window
In a new window
Download PPT slide
 
Figure 4. Probability surfaces for the time and location of the most recent common ancestor. (A–C) The probability distributions for the times and locations of coalescent events for three different pairs of sampling locations. Location (spanning the entire habitat range) is plotted along the short axis, and time (from 0 to 2000 generations in the past) along the long axis; 2N = 2000 and {sigma}2m = 0.00005. The sample locations (z01, z02) are (A) (0, 0.33), (B) (0.33, 0.67), and (C) (0, 0.67). Initially there is no chance of coalescence due to the separation of the sequences. In all three cases the most recent coalescence events are positioned right between the two sampling sites. Note that the surface is taller in A than in B, in spite of the fact that the two graphs represent the same separation between samples. This difference corresponds to the shorter average coalescence times for sequences taken from closer to the edges of the habitat range. Note also that the peak in C is smaller and shifted out on the time axis relative to A and B, as a result of the larger distance separating the sampling locations. As time in the past becomes very large, all three distributions approach a common shape, with coalescence locations biased toward the center of the range.


*  APPLICATION TO DATA
*TOP
*ABSTRACT
*THE MODEL
*RESULTS
*APPLICATION TO DATA
*DISCUSSION
*APPENDIX
*LITERATURE CITED

The expected time to coalescence can be used to estimate demographic parameters. In this section published sequence data are used to fit the model and estimate the effective population size and the genetic dispersal rate. Our purpose here is not to determine specific parameter values for a particular organism. In fact, the population considered below is likely to violate one or more assumptions of the model. Our goal is simply to illustrate the fact that patterns of genetic diversity such as the one predicted by the model may be found in nature. We also want to emphasize the fact that the finite linear model makes different predictions under neutrality than either an island or a circular model and that these differences may alter our interpretation of sequence data.

We have applied the model developed here to sequence data collected from the mitochondrial control region in the five different regional forms of sardines (Sardinops) in the Indian and Pacific oceans (BOWEN and GRANT 1997 Down). Sardines are characterized by an antitropical distribution and are restricted to five temperate upwelling zones off the coasts of Japan, California, Chile, Australia, and South Africa. Temperate waters extend continuously from South Africa through Australia to Chile and from Japan to North America. The two temperate zones are separated by warmer tropical waters in the Pacific. However, BOWEN and GRANT 1997 Down point out that this tropical zone is fairly narrow in the eastern Pacific, along the west coast of Mexico, suggesting that genetic contact between the California and Chile sardine populations is or has been possible. In the western Pacific, the tropical zone is much broader, making genetic exchange between Japan and Australia unlikely.

On the basis of these observations, it may be reasonable to apply the finite, linear model to this system, treating the populations as linearly arrayed from Japan to California to Chile to Australia to South Africa, with genetic contact possible only between adjacent populations. The total length of this range is ~25,000 miles, with the five sampling sites occurring at ~5000-mile intervals. These five sites yield 15 pairwise comparisons, which were fit to the model by finding the parameter values that minimize the sum of the squares of the differences between the predicted and observed values. The statistical properties of these estimators are not investigated here. However, in this case, the model does appear to fit the data reasonably well, reproducing the same pattern of genetic diversity, and it provides a framework that highlights certain features of the data.

Fig 5 shows the observed and expected average number of pairwise nucleotide differences (proportional to the expected coalescence time) in the data set for parameter values minimizing the sum of the squares of the errors. The observed data manifest the two key features of the model: increasing genetic differentiation with distance and greater genetic diversity near the center of the habitat. The product N{sigma}2m was estimated to be 0.017, and {theta}(= 2Nµ) was estimated to be 1.64. Assuming a per generation mutation rate of ~7.5 x 10-6 for the mitochondrial control region (on the basis of a divergence rate of 15% per million years between lineages and a mean generation time of 2 years, BUTLER et al. 1996 Down), this gives a total mitochondrial effective population size of 2 x 105. This value is likely much smaller than the census size, possibly resulting from a higher variance of reproductive success than that assumed by the model and also consistent with suggestions of population fluctuations in the species (SOUTAR and ISAACS 1974 Down). This produces an estimate of , which corresponds to a mean intergenerational migration distance of ~8 miles and a (mitochondrial) neighborhood size of ~200.



View larger version (59K):
In this window
In a new window
Download PPT slide
 
Figure 5. Average pairwise differences for the mitochondrial control region in sardines. The average number of nucleotide differences between pairs of sequences sampled from five locations (dark shaded bars) are plotted here along with the values predicted from the model (light shaded bars). Parameter values for the expected results were chosen to minimize the sum of the squares of the differences between the expected and observed values. In the fourth column (California), the observed values are consistently lower than expected, suggesting that the barrier to gene flow between California and Chile is lower than might be suggested by distance alone. Similarly, the higher-than-expected values in the fifth column (Japan) suggest that the temperate zone in the northern Pacific represents a larger barrier than would be predicted by distance alone.

Inferences can also be drawn from differences in observed and expected values. In the fourth column of Fig 5 (labeled "Calif."), the observed interpopulation values are all lower than the corresponding expected values. In the fifth column ("Japan"), on the other hand, the expected values are lower than the observed. This pattern suggests that the California and Chile populations are more closely connected genetically than might be predicted from distance alone, whereas the Japan and California populations are genetically more distant than expected. This observation supports BOWEN and GRANT's (1997) argument that the tropical barrier in the eastern Pacific is, or has recently been, traversible. In fact, the data suggest that the tropical water in the eastern Pacific may represent less of a genetic barrier than the equally large band of temperate water between North America and Japan.

This analysis is presented not to address specific issues in sardine biogeography or question the conclusions of Bowen and Grant, who attribute the observed pattern of genetic diversity to a range expansion of the sardine populations. Our purpose has been simply to show that patterns predicted by the model can, in fact, be found in natural populations and to illustrate how the model can be employed to estimate interesting demographic parameters. Furthermore, the analysis suggests how observed patterns of genetic diversity can be made more meaningful when compared to a more sophisticated null model.


*  DISCUSSION
*TOP
*ABSTRACT
*THE MODEL
*RESULTS
*APPLICATION TO DATA
*DISCUSSION
*APPENDIX
*LITERATURE CITED

We have presented a model for analyzing genetic diversity in a finite, continuous, linear population. Using a diffusion approximation, the model fully characterizes the distribution of possible genealogical histories of a pair of sequences sampled from such a population. Results derived from the model include the full distribution of coalescence times and locations, as well as a number of summary statistics, such as the mean time to the most recent common ancestor.

The analytic results derived from the model are in good agreement with simulations over a wide range of parameter values. This agreement extends even to extremely small neighborhood sizes (approaching one), allowing us to relax the usual coalescent theory assumption of a large local population size. In the other extreme, the strong migration limit where the neighborhood size approaches the population size, the model converges on well-established results for the coalescent process in a panmictic population.

The model makes several predictions regarding genealogies in a finite continuous habitat. In addition to the intuitive result that genetic divergence increases with distance, the model predicts that genetic diversity will be greater near the center of the habitat than at the edges. Coalescent events in the recent past are most likely to occur between the sampling locations of the two sequences. In the distant past, the distribution of locations of coalescent events becomes independent of sampling location and is concentrated toward the center of the habitat. The locations of lineages (conditional on not having coalesced), on the other hand, are biased toward the edges of the habitat in the distant past. All of these effects are more pronounced under lower migration.

In this development of the model, we have assumed reflecting habitat boundaries, meaning that individuals suffer no loss of fecundity when they are situated at the edge of the habitat. It may be more reasonable to assume absorbing habitat boundaries. In the forward-time model, this would mean that gametes that dispersed outside the habitat range would be lost. The effect in the backward-time model would be to bias the distribution of lineage locations slightly toward the center of the habitat, decreasing slightly the time to common ancestry, but preserving the broad patterns described for the reflecting-boundaries case.

One property of the model not discussed above is the dependence of the coalescent process on neighborhood size. For a given pair of locations, the ratio of the expected time to coalescence to the total population size is determined primarily by the product N{sigma}2m, which is analogous to Nm in demic models of population structure. However, N and {sigma}m enter into minor terms in the equations separately, meaning that it is possible, in principle, to estimate these two parameters independently without an independent estimate of the population size (e.g., from {theta} and µ). In practice, however, it seems unlikely that sufficient data could be collected (or that a population could be found that conformed closely enough to the model) to separately estimate these parameters with any certainty.

It is also possible using this model to derive other values for a particular set of parameters. SLATKIN 1991 Down derived FST in relation to mean time to coalescence for pairs of genes as

(10)

where 0 represents the mean time to coalescence for a pair of sequences drawn from the same deme, and represents the mean time to coalescence for a pair drawn at random from the entire population. These two values can be derived from Equation 9, where 0 is the average value along the line , and is the average value over the entire space (-1 < x0 < 1, -1 < y0 < 1). In fact, the value of 0 is very nearly N, independent of the value of {sigma}2m, consistent with the observation that the mean within-deme coalescence time will average to N in any system with conservative migration (STROBECK 1987 Down; NAGYLAKI 1998 Down).

The distribution of coalescence times given by Equation A25 can also be combined with a particular mutational model. Integration of this probability against the mutational process will yield the probability of identity in state or the likelihood of a particular set of differences between the two sequences. Results such as these may be valuable in the analysis of sequence data.

Mathematica files for generating the results described in this article are available from the authors, as is a C program for estimating parameter values from sequence data.


*  ACKNOWLEDGMENTS

We thank N. H. Barton, J. L. Cherry, T. Nagylaki, A. Platt, J. Wall, and three anonymous reviewers for helpful discussions and comments on the manuscript. This work was supported by a grant from the Howard Hughes Medical Institute to J.F.W. and in part by National Science Foundation grant no. DEB-9815367 to J.W.

Manuscript received June 10, 2001; Accepted for publication March 4, 2002.


*  APPENDIX
*TOP
*ABSTRACT
*THE MODEL
*RESULTS
*APPLICATION TO DATA
*DISCUSSION
*APPENDIX
*LITERATURE CITED

Derivation of formulas:
Let the initial positions of the two sequences be denoted by z01 and z02, where z01 and z02 represent the relative locations along the length of the entire habitat, and therefore both lie between 0 and 1. Let x0 and y0 be the following transformations of these coordinates:

(A1a)


(A1b)

A pair of coordinates (x, y) then fully describes the system of two identical particles along the line from 0 to 1. The state space in this coordinate system is a right triangle ranging from -1 to 1 in y and from |y| to 1 in x. The state of the system at a time t generations in the past is given by a probability function on this state space. Migration of the two particles along the line is considered to be a diffusion process with constant {sigma}2m, and the corresponding diffusion process in the transformed (x, y) coordinate system is a two-dimensional diffusion process with constant 2{sigma}2m in each direction.

The two short sides of the triangular state space represent reflecting boundaries, which may be eliminated by the method of reflecting the state space across the boundary. Three such reflections generate a square state space ranging from -1 to 1 in each direction. Note that crossing over the diagonals of this square involves a transposition of x and y. However, since the diffusion process is isotropic, this transposition does not affect the analysis.

Because the x and y diffusion processes are now separable, further discussion focuses on a one-dimensional diffusion process. The two-dimensional process can be reconstructed by multiplication of two such one-dimensional processes. The derivation uses only x. The equations for the diffusion process in y are identical.

The long side of the triangular state space, which is now replicated four times as the boundary of the new square state space, represents a partially reflecting, partially absorbing boundary. The diffusion process has now been reduced to a Sturm-Liouville-type problem, where the boundary conditions are set by a relationship between the flux rate across the boundary and the density function within the boundary.

The function Ux(x, t) must satisfy the diffusion equation within the range (-1, 1),

(A2)

and is subject to the boundary conditions

(A3a)


(A3b)

In these equations, the flux J across the boundary is set equal to the average probability over the next generation that the separation between the two lineages is <1/(2N). Since there are N individuals in the population and complete density-dependent population regulation, it is assumed that two lineages separated by less than one-half the "width" of an individual must be the same lineage. Thus the flux rate across the boundary is equal to the rate of coalescence. The probability that the two lineages coalesce in the previous generation is proportional both to this width and to the probability that the two lineages are separated by a distance s and that this separation decreases by s in one generation of dispersal, integrated over all possible values of s. In Equation A3a and Equation A3b, a normal distribution has been assumed for the single-generation dispersal pattern. Other dispersal patterns are, of course, possible. However, most patterns will converge on a normal distribution after a relatively small number of generations. Reflections have been ignored in these two equations, as it is assumed that single-generation migration events much longer than the total habitat length are extremely rare.

The general solution to the diffusion equation is

(A4)

and we can incorporate the boundary conditions by using Taylor-series expansions

(A5a)


(A5b)

where U(n)x represents the nth derivative of Ux with respect to x. The right-hand sides of these equations can be integrated term by term to give

(A6a)


(A6b)

Substituting in expressions for the derivatives of Ux (derived from Equation A4), we get

(A7a)

and

(A7b)

Collecting terms by C1 and C2 and simplifying, these conditions become

(A8a)

and

(A8b)

Two classes of nontrivial solutions satisfy both of these conditions simultaneously. For the first class of solutions, C2 = 0 and {alpha} satisfies the following:

(A9)

The second class of solutions will have C1 = 0 and values of {alpha} that satisfy

(A10)

These relationships can be further simplified once we consider the following two relationships for the gamma function:

(A11a)


(A11b)

The specific solution for the diffusion equation with these boundary conditions then becomes

(A12)

where {alpha}i satisfies

(A13a)

and {alpha}*j satisfies

(A13b)

The normalization values for the eigenfunctions are

(A14a)


(A14b)

which makes the normalized solution at time t = 0,

(A15)

where the C1 and C2 terms are derived by integrating the product of the probability distribution at time t = 0 with each term:

(A16a)


(A16b)

If we take the initial probability distribution to be the {delta} function {delta}(x - x0), then the system in x is represented as

(A17)

The {alpha}i series terms (from Equation A13a) are of the form , where 0 < {epsilon}i+1 < {epsilon}i < 1/2. Similarly, the {alpha}*j terms (from Equation A13a) are of the form , where 0 < {epsilon}*j+1 < {epsilon}*j < . Since both series are monotonically increasing, and {alpha}2 terms are found in the exponential, the series can be truncated for purposes of making calculations without significant loss of accuracy. The number of terms required is determined by parameter values, with more terms needed for smaller neighborhood sizes, and for samples that are closer together. The calculations presented in Table 1 Table 2 Table 3 were truncated after 40, 20, and 7 terms, respectively.

The equations for the {alpha}i and a*j can be simplified in the case where the neighborhood size is very large (N{sigma}m >> 1), in which case these equations become approximately

(A18a)

and

(A18b)

The form of the solution in Equation A17 works well so long as the probability of coalescence in the first generation in the past is very small, that is, when the two sequences are separated by a sufficient distance (when x0 is not close to 1) or when the neighborhood size is large. If x0 is close to 1 and the neighborhood size is not large, we must use a more complex form for the initial conditions. This results from the fact that the model assumes discrete time steps, and so the shortest possible coalescence time is one generation. The diffusion approximation solution, on the other hand, assumes continuous time, and coalescence is possible at any time t > 0. When the probability of coalescence occurring within the first time step is very small, this correction is negligible. However, under certain conditions, we must account for the fact that one generation of migration occurs prior to the first opportunity for coalescence. This migration effectively moves the initial condition probability peak away from the boundary, resulting in a longer predicted time to coalescence.

The more accurate form of the solution, valid over all values of x0, takes as its initial conditions a normal distribution of variance 2{sigma}2m, centered at x0 and reflected off of the boundary at x = 1. The initial probability distribution is given approximately by

(A19)

Then the probability distribution in x as a function of t becomes

(A20)

where

(A21a)


(A21b)

Only a single reflection (at the point x = 1) is considered in deriving the initial conditions. More accurately, reflection off of the x = -1 point would also be included, as well as higher-order reflections (off of both x = 1 and x = -1, etc.). However, unless the migration rate is extremely high (high enough to completely eliminate any geographical structure), these terms are negligible if x0 > 0. We have assumed by definition that x0 > 0, but y0 is allowed to range between -1 and 1. It should be noted, however, that all solutions for y0 will be identical under the transformation y0 = -y0. Therefore, by considering only y0 > 0, we can retain a complete description of the system and ignore migratory reflections off of the boundary in the first generation.

Assuming this distribution for the initial conditions is equivalent to permitting one-half generation of migration to occur prior to initiation of the coalescent process. In this way, coalescent events that occur over the first generation take place in the range of times 1/2 < t < 3/2. Note that t = 1, the first time when coalescence can occur in the discrete time model, lies at the center of this range. This "premigration," which is necessary to compensate for approximations made in the translation from discrete to continuous time, gives rise to the "t - 1/2" terms in Equation A20 and in subsequent derivations. Once these factors have been taken into account, a full description of the state of the system at time t is given by , which can be manipulated to yield a number of other results.

Cumulative distribution function of the coalescence time:
First we derive the probability that the two lineages have coalesced prior to time t. Recall that, in this formulation, coalescence is equivalent to diffusion outside of the square state space. The formulas derived above have been normalized so that, at t = 0, the volume under the probability surface is equal to one. At a time t, the probability that the two lineages are still separate is simply given by the volume under the probability surface within the square:

(A22)

Due to symmetry, the sine terms in the series expansions of Ux and Uy integrate to zero, leaving only a product of the integrals of the cosine series, and the cumulative distribution function of the coalescence time is

(A23)

Since all {alpha}i are nonzero, this value approaches one as t goes to infinity.

Probability density function of the coalescence time:
The probability that the two lineages coalesce at time t is given by the time derivative of the cumulative distribution function,

(A24)

which yields

(A25)

for values of t > 1/2. For t < 1/2, C(t) = 0.

Moments of the distribution:
The expectation and variance for the distribution of coalescence times, as well as all higher moments, can be derived from the expression for C(t). The pth moment of the distribution is given by

(A26)

Since C(t) = 0 for t < 1/2, this becomes

(A27)

The solution to

(A28)

has a simple form, which can be used to derive the moments:

(A29)

The expectation for the coalescence time then becomes

(A30)

and the variance is

(A31)

Simplified form:
If the two samples are taken from locations separated by more than the single-generation dispersal range, or if the neighborhood size is large, we can neglect the premigration correction introduced above and use a simplified form of the solution where the initial state of the system is the {delta} function {delta}(x - x0, y - y0). The cumulative distribution function, probability density function, and pth moment of the distribution of the time to coalescence are given by

(A32)


(A33)


(A34)

High migration limit:
Considering the strong-migration limit, where N{sigma}2m becomes large, we expect the model to converge on a panmictic population. As N{sigma}2m becomes large, {alpha}1 approaches zero, and only the first term in Equation A34 contributes significantly to the sum. Equation A33 thus simplifies to

(A35)

Furthermore, sin({alpha}1) approaches {alpha}1, and cos({alpha}1 x0) and cos({alpha}1 y0) both approach one:

(A36)

Finally, using a series expansion for cot({alpha}1), we find that, for large N{sigma}2m,

(A37)


(A38)

which gives us

(A39)

which is exactly what is expected for a panmictic population of size N.

Locations of the lineages:
The probability distribution for the locations of two lineages that have not yet coalesced is fully described by U(x, y, t). The transformed (triangular) state space is represented four times within the square space over which we have been considering the values of U. Thus, each combination of lineage locations (z1, z2) corresponds to four pairs of (x, y) coordinates. The likelihood at time t that the two lineages are at positions z1 and z2 is given by

(A40)

where

(A41)

This probability distribution is not conditional on the two lineages still being separate. That is, the total probability over all (z1, z2) between zero and one will not integrate to one, but rather to the probability that the two lineages have not yet coalesced.

Locations of the coalescence events:
The instantaneous rate of coalescence at a particular location is equivalent to the flux across the boundary at the point corresponding to that location. The probability of a coalescent event occurring at a particular location in the habitat, z0, corresponds to flux at four locations in the transformed state space, one on each of the four sides, and the differential width in the habitat, {delta}z, is twice the corresponding differential width ({delta}x or {delta}y) in the transformed space. The coalescence rate within the range z0 to z0 + {delta}z is