Estimates of gene flow between subpopulations based on FST (or NST) are shown to be confounded by the reproduction parameters of a model of skewed offspring distribution. Genetic evidence of population subdivision can be observed even when gene flow is very high, if the offspring distribution is skewed. A skewed offspring distribution arises when individuals can have very many offspring with some probability. This leads to high probability of identity by descent within subpopulations and results in genetic heterogeneity between subpopulations even when Nm is very large. Thus, we consider a limiting model in which the rates of coalescence and migration can be much higher than for a Wright–Fisher population. We derive the densities of pairwise coalescence times and expressions for FST and other statistics under both the finite island model and a many-demes limit model. The results can explain the observed genetic heterogeneity among subpopulations of certain marine organisms despite substantial gene flow.
NATURAL populations of organisms are often subdivided by geography. Individuals may or may not migrate between these subpopulations. Modeling gene flow between subpopulations can be traced back to Wright (1931), whose island model describes a population subdivided into discrete, local subpopulations by geography, with limited migration between individual subpopulations. With the advent of techniques to characterize genetic variation, several measures of population subdivision have been proposed on the basis of probabilities of identity (see Rousset 2002 for a review). These include Wright's (1951) FST, Nei's (1982) γST, and Lynch and Crease's (1990) NST.
The quantity FST can, under the assumption of equilibrium, be used to estimate levels of gene flow from allozyme data (Wright 1951). The quantity γST can be calculated from DNA sequence data and is equivalent to FST if the mutation rate is very low (Slatkin 1991). Levels of gene flow between subpopulations can thus also be estimated from DNA sequence data. However, Slatkin (1991) argues that FST is appropriate for allozyme data, whereas the gene genealogy-based method of Slatkin and Maddison (1989) is appropriate for DNA sequence data. As FST continues to be used in investigations of population structure and, recently, as a tool for identifying loci under selection (e.g., Murray and Hare 2006), we are concerned with FST and related measures below.
We derive expressions for FST and NST under the island model of population subdivision with symmetric migration (Nagylaki 1980; Strobeck 1987) and skewed offspring distribution among individuals in a population. When the offspring distribution is skewed, individuals have some nonnegligible probability of having very many offspring. The population model of skewed offspring distribution we adopt in this work can result in an ancestral process with asynchronous multiple mergers (Eldon and Wakeley 2006). An ancestral process with asynchronous multiple mergers, or Λ-coalescent, was introduced by Pitman (1999) and also derived by Sagitov (1999) from a Cannings (1974) model. In a Λ-coalescent, any number of ancestral lines can coalesce at once to a single ancestor. In contrast, the Kingman coalescent (Kingman 1982a,b) allows only two lines to coalesce each time. For a single population, the ancestral process obtained from the population model of Eldon and Wakeley (2006), and employed in this work, is a special case of the Λ-coalescent of Pitman (1999) and Sagitov (1999).
Type III survivorship curve, and high fecundity, characterize a diverse group of organisms (e.g., many plants and marine animals). A prime example are marine species with broadcast spawning, including Atlantic cod (Gadus morhua; Árnason 2004), Pacific oysters (Crassostrea gigas; Beckenbach 1994; Hedgecock 1994a), and red drum (Sciaenops ocellatus; Turner et al. 2002). A model of skewed offspring distribution, in which individuals can have very many offspring with a nonnegligible probability, may therefore better apply in such cases than the Wright–Fisher (Fisher 1930; Wright 1931) or the Moran (Moran 1958, 1962) models.
Genetic observations from these species also argue against the standard population models. Genetic diversity is observed to be much lower than expected on the basis of population size for some marine populations (Hedgecock et al. 1982; Nei and Graur 1984; Avise et al. 1988; Avise 1994). In particular, low effective to actual population size ratios have been reported for Atlantic cod (Árnason 2004), red drum (Turner et al. 2002), and the Pacific oyster (Hedgecock 1994a), and this has been explained by high variance in offspring distribution (Crow and Kimura 1970; Hedrick 2005). Second, models of skewed offspring distribution predict a large number of singleton variants (Eldon and Wakeley 2006; Sargsyan and Wakeley 2008), a feature observed, for example, in Pacific oysters (Boom et al. 1994), Atlantic cod (Árnason 2004), and some hydrothermal vent taxa (Won et al. 2003; Hurtado et al. 2004; Johnson et al. 2006; Young et al. 2008).
Genetic heterogeneity on a small spatial scale has been observed for many marine populations, including the purple sea urchin (Strongylocentrotus purpuratus; Edmands et al. 1996), even though planktonic larvae disperse over wide-ranging habitats (Johnson and Black 1984; Watts et al. 1990; Hedgecock 1994b; David et al. 1997). A range of explanations has been proposed for the observed heterogeneity (see Burton 1983; Palumbi 1994). Our aim is to address, by analytic methods, the problem concerning the genetic population structure of a highly fecund species with potentially highly skewed offspring distribution, like the Atlantic cod (Árnason et al. 2000).
We obtain the probability distributions of pairwise coalescence times, and expressions for FST, for both the finite island and a many-demes limit model. Our main result is that evidence of population subdivision can be observed in genetic data even if the usual migration rate Nm is very large. In essence, a skewed offspring distribution leads to high probabilities of identity by descent within subpopulations and thus high FST. Therefore, patterns in genetic data indicating population subdivision cannot be taken to indicate low levels of gene flow in a population with a skewed offspring distribution. In fact, estimates of migration rate based on FST (or NST) are confounded by the reproduction parameters of our model of skewed offspring distribution. These results may explain the genetic heterogeneity among subpopulations of some marine species like the purple sea urchin (S. purpuratus; Edmands et al. 1996), despite the potential for wide dispersal of long-lived planktotrophic larvae (Burton 1983; Palumbi 1994).
METHODS AND RESULTS
Throughout we are concerned with neutral genetic diversity at a single nonrecombining locus in a haploid population. As usual, N is the population size. The results should hold for a diploid population with gametic migration if we replace N with 2N. The population model we consider is a modification of the well-known Moran model of reproduction (Moran 1958, 1962). In the Moran model, a single randomly chosen individual reproduces each time step. To keep the population size constant a randomly chosen individual, but not the offspring, dies to make room for the offspring.
In our model, which was first presented in Eldon and Wakeley (2006), a single randomly chosen individual (the parent) reproduces each time step. With probability 1 – ε the parent has one offspring. Alternatively, with probability ε the parent has ψN – 1 offspring (a large reproduction event) with 0 < ψ < 1. To keep population size constant when a large reproduction event occurs, a total of ψN – 1 individuals die to make room for the new offspring. In our model the parent always persists. The parameter ψ represents the fraction of the population that is replaced by the offspring of the parent. Eldon and Wakeley (2006) show that this modified Moran model of overlapping generations gives rise to a coalescent process that allows for asynchronous multiple mergers of ancestral lines, i.e., is of the same type as the ancestral process considered by Pitman (1999) and Sagitov (1999).
For ease of presentation, we define the following quantities: Nγ, cN, λγ, and IA. The quantity Nγ is the coalescence timescale in our model. The coalescence timescale is proportional to the number of time steps, on average, it takes for two individuals to coalesce (in a single population). It depends on the value of ε that we assume has the form ε ≡ 2ϕ/Nγ for some constants ϕ and γ with 0 < ϕ, γ < ∞. In our model, the coalescence timescale is Nγ/2 time steps when 0 < γ < 2. In the usual Moran model, the timescale is N2/2 time steps, which is also the value of Nγ when γ ≥ 2.
For a single population, Eldon and Wakeley (2006) show that different coalescent processes result depending on γ. Multiple mergers of ancestral lines are allowed in the coalescent process when 0 < γ ≤ 2, while Kingman's coalescent (Kingman 1982a,b) results when γ > 2. The probability that two individuals do coalesce in a single time step is denoted by cN and depends on ε. The rate λγ of coalescence of two individuals is obtained from cN by “speeding up” time by a factor of Nγ. When 0 < γ ≤ 2, λγ depends on the reproduction parameters ϕ and ψ. In mathematical notation, Nγ is expressed as , and the coalescence probability cN is(1)For notational convenience, we also define the indicator function IA asFor example, Iγ<2 = 1 if γ < 2, and zero otherwise. In our model a large reproduction event occurs when the number of offspring of the parent equals ψN – 1. These events occur with probability ε. Our choice of ε = 2ϕ/Nγ results in the coalescence timescale being Nγ. The rate λγ of coalescence is then(2)The coalescence rate λγ is a key quantity in nearly all of our results below.
Model of subdivision:
We now consider the finite island model of population subdivision with the simplifying assumption that migration does not change the sizes of the subpopulations (Nagylaki 1980; Strobeck 1987; Herbots 1997). Reproduction in all the subpopulations follows the modified Moran model described above. The discrete-time ancestral process for a sample of size 2 is a Markov chain with transition probabilities given in Equation A1 in the appendix.
We are concerned with small migration rates, specifically those on the order of 1/Nγ time steps. This means that a single individual resides in the same subpopulation for 2Nγ time steps, on average, before migrating to a different subpopulation. When 0 < γ < 2, each individual resides in the same subpopulation for only Nγ time steps, on average. This time can be much shorter (when 0 < γ < 1) than the usual average of N time steps assumed in Wright–Fisher populations. In other words, a large number of individuals migrate during N time steps when 0 < γ < 1. We let m denote the probability that a single individual resided in a different subpopulation in the previous time step and model m as m = mγ ≡ κ/(2Nγ) in which κ is a finite constant (0 < κ < ∞).
To illustrate the difference between our migration rate κ and the usual migration rate Nm let M* ≡ N2mγ denote a migration rate scaled in units of N2 time steps (or N generations). This corresponds to the usual “Nm” in the Wright–Fisher model. Substituting for mγ gives . If, for example, γ = , then . When γ < 2 the migration rate M* is very high; i.e., as since κ is finite. However, in our modified model of reproduction coalescence also occurs on the timescale of N3/2 time steps (or generations when γ = ) and thus “counteracts” the effects of high migration rate.
The main results of this work concern expected coalescence times (Equations 3 and 5) and FST-like measures (Equations 10–12). We also derive the densities of the coalescence times (see appendix). The densities are used to derive distribution functions for the number of segregating sites between two sequences (see the appendix), which in turn yield expressions for FST-like measures including mutation (Equations 13 and 14).
The distributions of the coalescence times are functions of λγ:
DNA sequences differ because they have accumulated mutations from the time of their most recent common ancestor until they are sampled. By assuming a very low mutation rate, Slatkin (1991) derived an expression for FST in terms of expected values of coalescence times. The time until two genes coalesce is therefore a fundamental quantity in theoretical work on structured populations. Given two genes sampled from a structured population, two different coalescence times arise that are of interest: the time T0 until two genes sampled from the same subpopulation coalesce and time T1 until two genes sampled from different subpopulations reach a common ancestor. The densities of T0 and T1 were previously derived under the structured coalescent by Takahata (1988) and Nath and Griffiths (1993) in the case of two subpopulations and by Herbots (1997) for any finite number of subpopulations.
Given the transition rates in Equation A2, we can obtain the distributions of the coalescence times T0 and T1 (see the appendix). Figure 1 shows the distributions of T0 and T1, respectively, as functions of time for different values of ψ (the fraction of the population replaced by the offspring of a single individual). As ψ increases (i.e., tends to 1), the coalescence times T0 and T1 become very short.
The expected value and variance of T0 are both less than the corresponding quantities for T1. Specifically,(3)and(4)Equation 3 holds a key result, namely that E(T0) is always less than E(T1).
The significance of the result in Equation 3 is best understood by an example. When γ < 2, say then the timescale is Nγ = N3/2, and λγ = ψ2 (assuming ϕ = 1). Our migration parameter is then κ = mγNγ = mγN3/2. Migration is scaled in units of N2 time steps in a standard Moran population. If we let M* ≡ N2mγ be a scaled migration rate in units of N2 time steps, then if mγ is of order 1/N3/2 as above, M* becomes very high in a large population. Specifically, since γ = , we have (as ), since κ is a constant. The result in Equation 3 says that even when one will still see evidence of population structure in DNA sequence data, since coalescence occurs on a timescale of time steps (in a large population) when γ = . In fact, as whenever 0 < γ < 2.
Similarly, is always less than . In addition, the expected value and variance of T0 are inversely proportional to λγ and thus will be small when the probability of large reproduction events is close to one. The expressions for E(T0) and E(T1) (Equation 3) obtained under the usual reproduction models (Nei and Feldman 1972; Li 1976; Griffiths 1981) can be recovered by assuming that large reproduction events occur on a longer timescale (γ > 2) than usual (e.g., Wright–Fisher) sampling, in which case λγ = 1. The variances of T0 and T1 were first derived by Hey (1991) under the structured coalescent and can be recovered in the same way from Equation 4.
A many-demes limit:
The structured coalescent simplifies under certain migration mechanisms when the number of subpopulations is taken to be much greater than the sample size of DNA sequences (Wakeley 1998). The convergence of the ancestral process under a many-demes limit (i.e., when ) follows from the work of Möhle (1998), which shows how events in a stochastic process that occur on different timescales can be separated (see the appendix for a more detailed description). We consider the ancestral process in the limit and . Switching the order of the limits leads to the same coalescent process (see the appendix).
The limit process of two genes sampled from a population subdivided into very many subpopulations (), each of which is very large (), is of the form P*etG* in which P* and G* are given by Equations A16 and A19, respectively. The form of P* tells us that the ancestral process immediately enters the continuous-time process if the two genes are sampled from two different demes. If the two genes are sampled from the same subpopulation, they coalesce with probability or enter the continuous-time process by moving to different subpopulations with probability . In the continuous-time process the two lines wait with exponential time with rate on a timescale of DNγ time steps until they coalesce. The ancestral process under the many-demes limit model (Equation A19) differs from the limit process obtained when the number of subpopulations is finite (Equation A2), in that G* has a zero entry for the transition where the two alleles enter the same subpopulation, after having been separated. When D < ∞, the corresponding rate is κ/(D – 1) (Equation A2). Ancestral lines can coalesce, however, only if they reside in the same subpopulation. The matrix B* (Equation A18) ensures that the two lines do arrive in the same subpopulation.
Again we are interested in the coalescence times T0 and T1 of two genes sampled from the same, or different, subpopulations, respectively. The distribution of T0 is a mixture distribution (see appendix), and we obtain(5)and(6)The expressions for the expected value and variance of T0 and T1 obtained under the many-demes limit model (Equations 5 and 6) are functions of λγ and κ in the same way as the corresponding expected values and variances (Equations 3 and 4) obtained for a finite number of subpopulations. In particular, we always expect a shorter coalescence time for two ancestral lines sampled from the same subpopulation than if they were sampled from different subpopulations.
Deriving FST and NST:
The quantity FST is commonly used to assess levels of population subdivision. The inbreeding coefficient of an individual relative to a collection of subpopulations, FIT, can be attributed to nonrandom mating within a subpopulation (FIS) and to differences among subpopulations (FST; Wright 1951). Two sequences are identical by descent if they have not experienced mutation from the time of their most recent common ancestral sequence until they are sampled. If we let f0 and f denote the probability of identity by descent of two genes sampled from the same subpopulation (f0) and at random from the collection of subpopulations (f), we can express FST as(7)(Nei 1973). By the definition of FST in terms of inbreeding coefficients (as in Equation 7), FST depends on the mutation rate (μ). By forcing μ to be very low Slatkin (1991) derived an approximation of FST that is a function of expectation of coalescence times and is given by(8)in which T is the coalescence time of two lines randomly sampled from the collection of subpopulations, T0 is the time to coalescence of two lines from the same subpopulation, and μ is the mutation rate.
To obtain an expression of in terms of coalescence times under skewed offspring distribution, we can proceed by first obtaining the expected coalescence time E(T) of two genes randomly sampled from the collection of subpopulations, which is readily obtained from Equations 3 and A10 and is given by(9)When the number of subpopulations D is finite, the general form of is(10)For example, when 0 < γ < 2, the rate of coalescence is λγ = ψ2 (with ϕ = 1) and Equation 10 gives . The expression for in Equation 10 has the same form as the one derived by Slatkin (1991) under the standard coalescent. The key difference is that, under skewed offspring distribution, FST is a function of the rate λγ (Equation 2) of coalescence and thus a function of the reproduction parameters ϕ and ψ. The result that Slatkin (1991) obtained can be recovered from Equation 10 by taking γ > 2, in which case λγ = 1 (recall that the probability of large reproduction events ∝ 1/Nγ).
Following Wright (1951), the value of FST has often been used to estimate levels of gene flow. Figure 2 shows , obtained from Equation 11, as a function of ψ for different values of FST (Figure 2a) and ϕ (Figure 2b) and for two different values of λγ. Since FST is a function of ψ and ϕ, so is any estimate of gene flow based on FST, as Figure 2 clearly shows.
Lynch and Crease (1990) used the number of pairwise sequence differences of DNA sequences to estimate levels of genetic heterogeneity. In that context, Lynch and Crease (1990) introduced the quantity NST that has the form in which and are the average number of pairwise differences between sequences sampled from different, or the same, subpopulations, respectively. If mutation rate is constant and mutations occur according to the infinite-sites model (Watterson 1975), then NST estimates (Slatkin 1993). Using the results obtained for expected coalescence times (Equation 3), we obtain as in Equation 11 for the many-demes limit model of population subdivision and(12)when D < ∞. The effect of skewed offspring distribution is the same on NST as it is on FST. Under the infinite-sites mutation model we do not need an assumption of small mutation rate to obtain an expression of NST in terms of coalescence times, unlike the case for FST. As NST is defined, the mutation parameter cancels out (Slatkin 1993).
Number of segregating sites between pairs of sequences:
By the definition of FST in terms of probabilities of identity by descent (Equation 7), FST depends on mutation. Eldon and Wakeley (2006) show that the limit process (as ) of our model of skewed offspring distribution predicts nonzero levels of genetic variation only when γ > 1. If we (as in Eldon and Wakeley 2006) let μ denote the probability of mutation for each offspring in a single time step, we define the mutation rate θ as (and γ > 1). We can include mutation in an expression for FST by first obtaining the probability distributions of the number of segregating sites, under the infinite-sites model (Watterson 1975), between two genes given a model of population subdivision with migration. Let K0 denote the number of segregating sites between two genes sampled from the same subpopulation and K denote the number of segregating sites between two genes sampled randomly from the collection of subpopulations. The distributions of K0 and K are derived in the appendix, along with the distribution of the number of segregating sites K1 between two genes sampled from different subpopulations. Then by the definition of FST given in Equation 7 we obtain(13)When ,(14)From Equation 14 we conclude that mutation can affect FST only if θ is large relative to λγ. The expression for FST in Equation 14 has the same form as the one derived by Wilkinson-Herbots (1998) and by Nei (1975) and Takahata (1983) by other methods, under the Wright–Fisher model, including mutation. In Figure 3, FST from Equation 14 is graphed as a function of ψ for different values of θ and κ. The interpretation of Figure 3 is that FST, as a function of ψ, can vary considerably when the timescale of coalescence (and migration) is in units of Nγ/2 generations with 1 < γ < 2 (Figure 3, b and d).
Nei's genetic distance d:
Not all indicators of separation between populations depend on λγ. Nei's (1972) genetic distance is more appropriate for estimating divergence time between species, and FST-like quantities are more suitable for inferring population structure within species (Slatkin 1991). Nei's (1972) genetic distance measure is given by in which f0 and f1 are the probabilities of identity by descent of two genes sampled from the same or different subpopulations, respectively, and we add the subscript N to remind us that time is discrete. If we now assume that 0 < μE(ti) < 1 for i = 0, 1, then using the Maclaurin series expansion of the logarithmic function we obtain (previously obtained by Slatkin 1991) in which t0 and t1 are the coalescence times for two genes sampled from the same, or different, subpopulations, respectively. To obtain an expression of d for continuous time, we assume that the product converges to a constant θ as (and γ > 1). Rewriting the approximation for dN gives(15)which has the continuous-time limit(16)However, using the expressions for E(T1) and E(T0) (Equation 3), we obtain and so Nei's (1972) genetic distance is independent of λγ. Another way of deriving an expression for d is to note that the probability of identity by descent of two genes is the same as the probability that no mutations occur from the time they are sampled until they reach a common ancestor. Thus fi = P(Ki = 0) for i = 0, 1. We can therefore write(17)for any model of population subdivision. For the many-demes limit model under consideration,Using either the limit approach (Equation 16) or the substitution approach (Equation 17) in the many-demes limit model, and assuming small θ/κ (i.e., 0 < θ/κ < 1), d is of the form θ/κ. The same result is obtained for a finite number of subpopulations. Indeed, when D is finite, we obtain from Equations A28 and A29(18)Thus, if 0 < (θ/2)(D – 1)/κ < 1, we have from Equations 17 and 18(19)Even if (θ/2)(D – 1)/κ > 1, we have from Equations 17 and 18 that d is not a function of λγ. Thus Nei's (1972) genetic distance can be used to estimate divergence times of species even if one or both species have skewed offspring distribution, since d is proportional to the time of separation of two populations (Nei 1972; Slatkin 1991).
Some organisms, for example Pacific oysters (Beckenbach 1994; Hedgecock 1994a) and Atlantic cod (Bekkevold et al. 2002; Árnason 2004), may exhibit skewed offspring distribution among individuals in a population. Both Beckenbach (1994) and Hedgecock (1994a) describe the reproductive mode of oysters, for example, as a lottery, in which only the offspring of a few lucky females survive. Oyster and cod females have very high reproductive potential, as they may produce millions of eggs in a single spawning (May 1967; Strathmann 1987; Chambers and Waiwood 1996; Kjesbu et al. 1996). The Wright–Fisher model does not capture the skewed offspring distribution possibly exhibited by organisms with high fecundities and high early mortality. The models of Pitman (1999) and Sagitov (1999), and later of Eldon and Wakeley (2006) and Sargsyan and Wakeley (2008) for overlapping generations in a single population, incorporate the skewness and may thus better apply to organisms with highly fecund individuals and sweepstakes-style recruitment. By deriving distributions of coalescence times for two genes sampled from a subdivided population, we show how skewed offspring distribution confounds estimates of migration rate between subpopulations when based on FST-like measures of population subdivision.
An important result of this work is that FST depends not only on the migration rate κ but also on the parameters (ψ and ϕ) of our model of large offspring numbers. Demographic processes such as population size fluctuations, founder effects, or skewed offspring distribution have been thought to increase genetic differentiation among subpopulations. As defined and calculated from genetic data, common indicators of population subdivision then take on high values, thus suggesting low levels of migration (Boileau et al. 1992; Whitlock 1992; Slatkin 1993; Hedgecock 1994a). Our main conclusions are twofold. First, FST is shown to depend on the parameters controlling the size (ψ) and frequency (ϕ) of large reproduction events (the probability that the offspring of a single individual replace a fraction ψ of the population is ε = 2ϕ/Nγ) and can thus indicate high or low levels of genetic heterogeneity depending on ψ and ϕ. To illustrate, consider the expression for FST derived under the many-demes limit model without mutation (Equation 11), and let the timescale of coalescence occur on Nγ/2 time steps (i.e., 0 < γ < 2). In that case the rate of coalescence λγ = ψ2 (by taking ϕ = 1), and the rate of large reproduction events is high. By fixing κ, we see that FST ranges from very low (when ψ is low), to ≈1/(1 + κ), when ψ ≈ 1. Second, to the extent that FST (or NST) is used in estimating levels of gene flow, these estimates are confounded by λγ and thus by ϕ and ψ. Also, migration in our model is not the usual Nm quantity, but is given by κ = mγNγ. This means that even when Nm is very large, we may still observe genetic heterogeneity, since the rate of large reproduction events is also large. In a population where individuals can have very many offspring, gene flow is not the only demographic force that influences genetic heterogeneity.
The coalescence times T0 and T1 (for genes sampled from the same or different subpopulations, respectively) are fundamental quantities of the ancestral process of genes in subdivided populations. The time during which DNA sequences accumulate mutations is determined by T0 and T1. As we have shown, the coalescence times depend on the skewness of the offspring distribution through the rate λγ (Equation 2) of coalescence. By deriving the distributions of T0 and T1 for two genes in a structured population, we have obtained insight into how skewed offspring distribution shapes the genetics of structured populations. Since T0 and T1 are functions of ϕ and ψ through the rate of coalescence λγ, all the quantities of interest in regard to investigation of the genetics of structured populations, including expected values, number of segregating sites, and indicators of population subdivision, are functions of λγ.
One such insight is that genetic heterogeneity can be observed in genetic data even if gene flow is very high by the usual standard (). Edmands et al. (1996) found significant genetic heterogeneity among subpopulations of the purple sea urchin S. purpuratus sampled along the coast of California and Baja California. The ecology and physiology of S. purpuratus indicate the capacity for highly skewed offspring distribution: external fertilization and very high fecundity. Despite a planktonic larval period of several weeks (Strathmann 1978), and thus a potential for high dispersal, both allozyme and mtDNA sequence data revealed genetic differentiation, even over short distances (Edmands et al. 1996). We have shown that, regardless of the timescale of migration, E(T0) < E(T1). Genetic heterogeneity can, therefore, be observed in DNA sequence data even if gene flow is very high, in a population with skewed offspring distribution. Population turnover in a metapopulation model when demes that become extinct are recolonized by one or a few individuals can also lead to increased FST (Wade and McCauley 1988; Whitlock and McCauley 1990; Pannell 2003). Indeed, a model of metapopulation structure that allows only one founder for every deme that is recolonized necessarily results in a coalescent process with multiple mergers, if the founder can have many offspring.
In summary, we consider the coalescence times of a subdivided population following a sweepstakes-style recruitment. The expected coalescence time for two genes sampled from the same subpopulation is always less than the expected coalescence time for two genes sampled from different subpopulations, even when migration occurs on a very short timescale. Estimates of migration rate based on FST are confounded by the rate λγ of coalescence, since FST-like measures of genetic heterogeneity are a function of the reproduction parameters of our model of skewed offspring distribution. These results underscore the importance of choosing an appropriate limit process for the population under consideration.
The discrete-time transition matrix:
The probability transition matrix ΠN (Equation A1) for the ancestral process over one time step for the case of arbitrary fixed number D ≥ 2 of subpopulations has three states: (1) two lines in the same deme but not coalesced, (2) two lines in different demes, and (3) the two lines have coalesced. We do not distinguish between subpopulations. The transition probabilities in ΠN are derived under the assumption that migration does not alter the subpopulation sizes (Nagylaki 1980; Strobeck 1987; Herbots 1997). We let m denote the single backward migration fraction. The matrix ΠN is(A1)in which cN is the coalescence probability andThe matrix in Equation A1 is a generalization of the matrix for the same migration mechanism (cf. Wakeley 2008) obtained under the usual Wright–Fisher model of reproduction. In Equation A1, the probability of coalescence is cN, instead of the usual 1/N, as is the case for the haploid Wright–Fisher model. The corresponding continuous-time process has rate matrix G given by(A2)
Distribution functions of the coalescence times when D is finite:
In this section we derive the distribution functions for the coalescence times T0 and T1 when the number of subpopulations is finite. Given the distributions of T0 and T1 we can determine the distribution of the coalescence time T of two genes sampled at random from the collection of subpopulations. The distributions of these coalescence times allow us to derive expressions for FST with or without mutation.
We can use the rate matrix (Equation A2) to obtain the density functions for T0 and T1. Using Laplace transforms (see Herbots 1997) we obtain(A3)in which(A4)and(A5)for i = 1, 2. To obtain the density function of T1, we note that T1 can be represented as a sum of two independent random variables, T1 = Y1 + T0, where Y1 is an exponential random variable with rate κγ/(D – 1). By direct calculation,(A6)The form of the continuous functions and (Equations A3 and A6) immediately yields the cumulative densities and for T0 and T1, respectively. Namely, writing λi = –ri for i = 1, 2,(A7)and(A8)in whichLet T denote the time to coalescence for two genes sampled at random from the collection of subpopulations. Then, with probability 1/D the two genes are sampled from the same subpopulation, and with probability 1 – 1/D they are sampled from different subpopulations. The cumulative density function (c.d.f.) FT of T is then given by(A9)in which denotes the c.d.f. of Ti for i = 0, 1. The expected value of T is(A10)in which E(T0) and E(T1) are given by Equation 3 and the variance of T is(A11)in which Var(T0) and Var(T1) are given by Equation 4. Note that although the expected value of T lies between E(T0) and E(T1), Equation A11 tells us that the variance of T may not lie between the variance of T0 and T1. If , then Var(T) > Var(T1).
A many-demes limit model:
In this section we derive the ancestral process for two genes in the many-demes limit () and as . Since now , the single-generation backward transition matrix, following Möhle (1998), can be written(A12)in which the matrices A and B are given below. The matrix A describes the probabilities of the transitions that occur on a timescale of time steps. The matrix B/D describes transitions that occur on a timescale of D time steps, thus forming the continuous-time part of the ancestral process. The limit process (as ) is then given by Π(t) = PetPBP in which describes the equilibrium process of the events that occur on the timescale of time steps (Möhle 1998). The ancestral history of a sample is first adjusted by an instantaneous process described by P and then enters a continuous-time process described by the rate matrix PBP. Given the ancestral process, we can derive the distributions of T0 and T1.
The instantaneous matrix A is(A13)with eigenvalues λ1 = λ2 = 1 andand we observe that . By calculating the corresponding eigenvectors (not shown) we obtain the limit matrix P by diagonalizing A,(A14)in which(A15)and in the limit we obtain(A16)When but holding N finite, the ancestral process is given by PetPBP, in which the matrix B is given by(A17)and we obtain(A18)The rate matrix G* = P*B*P* then takes the general format(A19)The ancestral process in the limit and is P*etG* and immediately yields the density functions for T0 and T1 as follows. The time T1 to coalescence for two lines sampled from different demes follows in each case (γ > 2, γ = 2, and 0 < γ < 2) an exponential distribution with rate . Now consider the time T0 to coalescence for two lines sampled from the same subpopulation. Going back in time, two lines in the same subpopulation can either coalesce with probability or they enter the continuous-time process with probability , in which case one of the two lines migrates to a different subpopulation. Thus T0 follows a mixture distribution with cumulative density function(A20)
Order of limits irrelevant in the many-demes limit model:
In this section we show that the same ancestral process is obtained irrespective of the order of the limits and . We have already derived the process when first and then . Now we show that the same ancestral process is obtained when first and then .
As , we obtain the ancestral process described by the rate matrix(A21)We remark that An = (–κ – λγ)n−1A for n ≥ 1. Since A is a rate matrix, we have, with a = κ + λγ,(A22)Equation A22 immediately gives us the instantaneous matrix(A23)Equations A21 and A23 then give us the rate matrix G = PBP after first taking the limit and then .
By similar arguments we can show that the ancestral process does indeed result in coalescence regardless of initial state. Indeed,(A24)which gives, writing b = κλγ/(κ + λγ),(A25)The equilibrium distribution (as ) is then(A26)and we obtain that two genes do reach a common ancestor regardless of initial state—i.e.,(A27)
Number of segregating sites:
We now consider the number of segregating sites, under the infinite-sites model (Watterson 1975), between two genes in the two models of population subdivision discussed previously. First, consider the model of finite number of subpopulations. Let K0 and K1 denote the number of segregating sites for two genes sampled from the same or different subpopulations, respectively. We let θ denote the scaled mutation rate and note that, given a specific length of time t, the number of segregating sites is Poisson distributed with rate θt/2. The probability distribution for the number of segregating sites for two genes sampled from the same subpopulation is(A28)in which A1 and A2 are given in Equation A4, and λi = –ri from Equation A5.
The probability distribution for the number of segregating sites between two genes sampled from different subpopulations is(A29)in which Bi = Aiκ/(κ + (D – 1)ri) for i = 1, 2. The expected value and variance of K0 are both less than the corresponding quantities for K1. Indeed,(A30)and(A31)The probability mass distribution of the number of segregating sites K for two genes sampled at random from the collection of subpopulations is(A32)in which Ci = Ai/D + (D – 1)Bi/D for i = 1, 2 and . One can write K ∼ (1/D)K0 + (1 – 1/D)K1; i.e., K is a mixture distribution. In fact, K is distributed as K0 with probability 1/D and as K1 with probability 1 – 1/D. Hence,(A33)and so E(K0) < E(K) < E(K1). However, the variance of K (Equation A34), which can be obtained in the same way as the variance of the time to coalescence of two genes sampled at random from the collection of subpopulations (Equation A11), may not lie between the variance of K0 and K1. The variance of K is(A34)
Number of segregating sites under the many-demes limit model:
Under the many-demes limit model of population subdivision, the probability mass distribution for the number of segregating sites K1 between two genes sampled from different subpopulations is(A35)The expected number and the variance of the number of segregating sites between two genes sampled from different subpopulations is then(A36)(A37)The probability mass function for K0, the number of segregating sites between two genes sampled from the same subpopulation, is (k = 0, 1, …)(A38)which gives expected value(A39)and the variance of K0 is(A40)
We thank two anonymous reviewers for helpful comments and suggestions.
- Received July 23, 2008.
- Accepted November 25, 2008.
- Copyright © 2009 by the Genetics Society of America