Abstract
A nonequilibrium migration model is proposed and applied to genetic data from humans. The model assumes symmetric migration among all possible pairs of demes and that the number of demes is large. With these assumptions it is straightforward to allow for changes in demography, and here a single abrupt change is considered. Under the model this change is identical to a change in the ancestral effective population size and might be caused by changes in deme size, in the number of demes, or in the migration rate. Expressions for the expected numbers of sites segregating at particular frequencies in a multideme sample are derived. A maximum-likelihood analysis of independent polymorphic restriction sites in humans reveals a decrease in effective size. This is consistent with a change in the rates of migration among human subpopulations from ancient low levels to present high ones.
THE early reports of mitochondrial DNA (mtDNA) haplotype diversity among humans indicated a recent large increase in population size starting from a small number of founders (Cannet al. 1987; Vigilantet al. 1991). This pattern, which is marked by an increase in rare variants, has not subsequently been observed for nuclear loci. Instead, nuclear loci show an excess of polymorphic sites segregating at intermediate frequencies (Hey 1997). This is illustrated by Tajima's (1989) statistic, D, which is positive, although not significantly so, for four extensive nuclear DNA sequence data sets (Hardinget al. 1997; Clarket al. 1998; Zietkiewiczet al. 1998; Harris and Hey 1999). In addition, single nucleotide polymorphism (SNP) data do not show evidence of population growth (Nielsen 2000). Here, I consider a subdivided population model, which allows for a single change in the effective size of the population, and apply the results to restriction fragment length polymorphism (RFLP) data from a worldwide sample of humans (Bowcocket al. 1987; Matulloet al. 1994; Poloniet al. 1995). The data provide strong evidence for a change in demography: a decrease in effective size. This is most likely due to a shift from a more ancient subdivided population to one with less structure today.
Numerous models of subdivision with migration have been proposed to explain patterns of genetic variation among local populations. The most important distinction among these is that some models, such as Kimura and Weiss' (1964) stepping-stone model, generate isolation by distance (Wright 1943), whereas others, such as Wright's (1931) island model, do not. The model considered here is of the latter type. Wakeley (1998) studied the ancestral genealogical process for samples from a D-deme version of the finite island model, in which the number of demes is assumed to be large. In the finite island model (Maruyama 1970; Latter 1973), all demes exchange migrants regardless of their geographic location. The single-generation probability of migration is assumed to be the same between each pair of demes. Specifically, m is the probability that a gene came from one of the other D − 1 demes to its present one in the previous generation. This model is realistic for species in which individuals choose either to stay put or to move to a location picked uniformly from the entire species range. In certain situations, it may also apply when individuals choose either to migrate a great distance or not at all.
When the number of demes is large, the genealogical history of a sample taken from such a population can be divided into two parts, which I will call the scattering phase and the collecting phase; see Figure 1. The scattering phase is the very recent history of the sample, during which coalescent and migration events bring it rapidly to the start of the second, and much longer, collecting phase of the history. The collecting phase begins when each ancestral lineage is in a separate deme and is characterized by migration events that move lineages from deme to deme, the great majority of which are unoccupied. Eventually, a migration event will place a pair of lineages into the same deme, at which time they can coalesce. The ability to break the genealogy into these two parts, and to consider them separately, depends only on the number of demes being large. When this is true, the time spent in the scattering phase can be ignored because the collecting phase dominates the history (Wakeley 1998). A similar separation of time scales has been found in the study of genealogies under partial selfing (Nordborg and Donnelly 1997).
A hypothetical genealogy of a sample from d = 3 demes under the model. In this case there was just one migration event during the scattering phase, and this was in the sample from deme 2. Thus ,
,
, and n′ = d + 1 = 4.
Consider first the equilibrium version of this island model. We assume that each of D demes is of constant diploid size N, but any results will also apply to haploids if N is placed by N/2. This model is characterized by two parameters: θ = 4NDu, in which u is the neutral mutation rate, and M = 2NDm/(D − 1). When D is large, M becomes simply 2Nm. A sample of n items, taken from d different demes, which are arbitrarily labeled 1 through d, is denoted n = (n1, n2, …, nd), where ni is the sample size from the ith deme. Of course
The scattering phase of the genealogy takes the singledeme sample, ni, through a series of migration and coalescent events to a new configuration where each remaining lineage is in a separate deme. The number of these recent ancestral lineages, n′i (1 ≤ n′i ≤ ni), depends on the sample size and the migration rate. It is important to note that each of these n′j lineages is in a separate deme at the end of the scattering phase. The distribution of n′i is given by
Thus, the distribution of the number of ancestral lineages of each deme's sample at the start of the collecting phase is the same as the distribution of the number of alleles in the Ewens sampling formula (Ewens 1972; Karlin and McGregor 1972) but with mutation replaced by migration. There is one major difference: the n′i items at the start of the collecting phase represent exactly that number of ancestral lineages, whereas the alleles counted by the Ewens sampling formula do not. Let
The genealogical history of these n′ ancestral lineages during the collecting phase is a coalescent (Kingman 1982a, b) when time is measured in units of twice the effective population size
Because the collecting phase is a coalescent, it is relatively simple to allow for changes in effective population size. Here I consider the possibility of a single abrupt change in demography at generation t in the past. Thus, the collecting phase itself is divided into two parts. Before time t, the population is assumed to have been of effective size NeA = NADA[1 + 1/(2MA)], which can be compared to Equation 3. Thus, the change in demography might have been caused by a change in deme size, in the number of demes, or in the migration rate. Part of the flexibility of the model is that the change might have been any combination of these. If the ancestral population was panmictic we have the demographic model that Takahata (1995) proposed for human history, but without extinction and recolonization.
This nonequilibrium model necessitates two parameters in addition to θ and M. The first is the time of the event, T = t/2ND), measured in units of 2ND generations. The second is R = NeA/(ND), the ratio of the ancestral effective size to the current total population size. Thus, values of R/[1 + 1/(2M)] greater than one indicate that the ancestral effective size was larger than the current effective size. Given an estimate of R and assuming that ND = kNADA, where k is some constant, we can retrieve
Neutral mutations are modeled as a Poisson process along the branches of the genealogy (Hudson 1983, 1990). When a mutation happens it is assumed to occur at a previously unmutated site. This assumption is shared by Watterson's (1975) infinite sites model without recombination and Kimura's (1969) model, which assumes free recombination between sites. We apply these models to genetic data when the mutation rate per site is so small that multiple changes at any one site are unlikely. Watterson (1975) and others have studied the distribution of segregating sites under the neutral infinite sites model, and Sawyer and Hartl (1992) have done so for both selection and neutrality under Kimura's (1969) model. Under neutrality, both models make the same prediction about the expected number of segregating sites in a sample, either taken as a whole or broken into categories by frequency (Hudson 1983; Tajima 1989; Sawyer and Hartl 1992; Fu and Li 1993). In other words, expectations derived under particular assumptions about the recombination rate are valid for all rates of recombination. Here, I assume Watterson's (1975) model. The effect of linkage and recombination is on the variances, and these are largest under when the recombination rate is lower.
Under these mutations-at-unique-sites models, each segregating site will divide the sample into two groups: one that has inherited the mutant base and one that still shows the ancestral base. Let i = (i1, i2, …, id) be the counts of mutant bases at a single polymorphic site in the sample n = (n1, n2, …, nd). Further, let zi be the number of sites in the sample that show the pattern i = (i1, i2, …, id). In this article, I calculate
THEORY
The approach taken to deriving E(zi) is to condition on the numbers of lineages and the mutant counts at the start of the collecting phase and at the time of the abrupt change in demography. Thus, we have
Consider the genealogy of the n′ lineages during the collecting phase of the history. Because this is a coalescent, the demographic change in the model is identical to a single, instantaneous change in effective population size from NeA = NADA[1 + 1/(2MA)] to Ne of Equation 3. Wakeley and Hey (1997) found expressions for the expected site frequencies under this size-change model by summing the numbers expected to have occurred before and after the change. Their Equation 20 is adapted to the present situation by replacing their θ1 with θ[1 + 1/(2M)], and their θA with θR, and by rescaling time according to Equation 3. This gives
Pc(i′|i″, n″) in Equation 7 is given by expression 18 in Wakeley and Hey (1997), or equivalently by
Equation 7 applies to the unobserved ancestral sample at the start of the collecting phase. This is related to the observable sample in Equation 6 by the probability that a mutant found in i′ copies in the collecting phase sample, n′, will appear in (i1, i2, …, id) copies from each deme in the multideme sample (n1, n2, …, nd). This probability, Pr[i|i′, n′], is derived as follows. First, the distribution of the i′ mutant copies among the sequences that make up the ancestral sample, (
Next, consider how mutants with counts (i′1, …, i′d) in the collecting phase sample, (n′1, …, n′d), become ones with counts (i1, …, id) in the original sample, (n1, …, nd). As the scattering phase occurs independently for each deme—e.g., see Equation 2—this probability, which is called Ps(ij|i′j, n′j), is calculated separately for each deme and the results multiplied together. appendix b shows that
The subscript in
Thus, we have the probability that a mutant in i′ copies in the collecting phase sample is distributed as (i1, …, id) in the starting sample, (n1, …, nd),
Equation 6 becomes computationally prohibitive as the sample size per deme and the number of demes sampled increase. A great deal of time is saved by noting that the denominator of Equation 5 is equivalent to the expected number of segregating sites in the sample, and that is more efficiently computed as
The subsampled data
APPLICATION TO HUMAN RFLP DATA
The method was applied to some data derived from the extensive RFLP surveys of Bowcock et al. (1987), Matullo et al. (1994), and Poloni et al. (1995). These data overlap with those analyzed by Nielsen et al. (1998). I considered data from four localities, which I call demes: Japan, New Guinea, Senegal, and Italy. Joanna Mountain kindly supplied these data, which consisted of nj = 20, for j = 1, 2, 3, 4, at 45 independent loci. Due to the computational burden of calculating the likelihood 6, I made a smaller data set of nj = 7, for all j, by randomly sampling haplotypes within demes (without replacement). One locus was lost because it was not polymorphic in the subsample. This left 44 independent loci for the analysis, and these data are shown in Table 1. I also performed all of the analyses described below on subsamples of size nj = 3 and nj = 5, and found nearly identical results. The full data sent to me by Joanna Mountain included two more African sample locations: Central African Republic and Zaire. I excluded these two samples for reasons discussed below. However, I also analyzed just the three African samples and the results were identical in overall pattern to what is reported here for the four widely separated samples. The biggest difference was that the estimate of the current migration rate was almost five times larger.
A contour plot of the likelihood of the data in Table 1 over a range of R and T, when M = 1.83 (see text). Any waviness in the surface results from the fact that the likelihood was calculated at a finite number (25 × 25 = 625) of points.
For the data in Table 1, the maximum-likelihood estimate of the migration rate was
A number of conclusions can be drawn from this figure. First, it is clear that small values of R and T (lower left in Figure 2), which could represent recent population expansion, are strongly rejected. Next, the present model reduces to an equilibrium model when R = 1 + 1/(2M), that is when the effective size was the same before and after T. For M = 1.83, this gives R = 1.27 for all T, a value that falls between the third and the fourth contours in Figure 2. These contours define a broad flat area more than 6 log-likelihood units below the maximum, which occupies much of the graph. Thus, equilibrium migration (i.e., no change in demography) is also strongly rejected. As T either increases or decreases for a given finite value of R, the likelihood returns to this broad flat area, which is consistent with an equilibrium model. This makes intuitive sense because the change in demography will have little effect if it is either extremely recent or extremely ancient.
As Figure 2 implies, there is no upper bound for R. The 95% confidence interval for R is (5.0, ∞), and in fact the likelihood of the data continues to increase as R grows. As R increases, En′(zi′) and En′(S)—see Equations 7 and 15—become linear in R as well as θ. For instance, En′(S) converges on
The timing of this change in demography is also difficult to establish precisely. Figure 2 shows that the lower bound of the 95% confidence interval for T does not depend much on R, but that the upper bound is positively correlated with it. If we were to draw a line along the ridge of the maximum-likelihood surface in Figure 2, it would run from T = 0.81 at R = 10 to T = 3.42 at R = 106. The more ancient the event was, the more extreme it has to have been to explain the data. Using the limiting expressions for En′(zi′) and En′(S). gives a 95% confidence interval for T as (0.2, ∞). Figure 3b shows that the likelihood increases with increasing T, but that the surface becomes quite flat above T = 2.0. Translating the confidence interval for T into years requires values for the recent effective size and the generation time for humans. Some typical values are Ne = 104 and g = 15 yr for the generation time (Takahataet al. 1995; Mountain 1998). Assuming that Ne is given by Equation 3, that M = 1.8, and using the lower bound of 0.2 for T, we estimate that this demographic event could have occurred as recently as 2NeTg/[1 + 1/(2M)] ≈ 47,000 years ago, but might have been much more ancient.
DISCUSSION
The subdivided population coalescent model considered here and in Wakeley (1998) is flexible and easy to analyze. The genealogical history of a sample is characterized by a scattering phase and a collecting phase. The scattering phase depends on the current number of demes, the current deme size, and the current migration rate. The collecting phase is a coalescent on a time scale of 2Ne generations, where Ne is given by Equation 3. Because of this it is straightforward to incorporate changes in the ancestral numbers of demes, deme sizes, and migration rates, by considering their influence on the effective population size. We may write
The results presented here are preliminary, as some of the assumptions of the model are violated for humans (see below). The purpose of this work has been to suggest that changes in migration rates over time may have shaped patterns of genetic polymorphism in humans. Because these RFLP data span the human genome (Bowcocket al. 1987; Matulloet al. 1994; Poloniet al. 1995), we are probably correct in attributing the patterns they show to population structure rather than to natural selection. This is because population-level processes affect all loci in a similar way while selection acts on particular sites or regions. The contrasting pattern seen in mtDNA may very well be the result of selection on the tightly linked mitochondrial genome, as suggested by Hey (1997). Alternatively, because of the fourfold difference in effective size between nuclear and mtDNA, the mitochondrial data might reflect more recent events and the nuclear data more ancient ones (Fay and Wu 1999; Hey and Harris 1999). The present work suggests that historical patterns of migration could also be important in accounting for differences between nuclear and mitochondrial data, for instance if the male and female migration rates vary in different ways over time.
One assumption of the model that is clearly violated for humans is that migration is equal and symmetric among every possible pair of demes, regardless of geographic location. Poloni et al. (1995), for example, found significant isolation by distance in a larger sample of demes that included those considered here. In the hope of approximating an island model, the four samples used in this study were chosen because they are relatively equidistant and widely separated. Slatkin and Barton (1989) found that samples taken from a two-dimensional stepping-stone model will behave roughly, at least in terms of FST, like samples from an island model if the distances between them are much greater than the characteristic scale of variation. Other assumptions of the model that might not hold include the assumption that all demes are equal in size and that only a single abrupt demographic event occurred. Clearly, these issues deserve to be explored further. The last of them would be straightforward to address. Using the approach of Wakeley and Hey (1997), multiple demographic events could be included—as, for instance, Hey and Harris (1999) have recently done for population bottleneck—and this could be used to approximate a continuous change in effective size. However, a continuous change that occurs rapidly will be approximated well by a single abrupt event—see, for example, Reich et al. (1999)—so this might not be a serious problem with the present analysis.
A further concern is the effect of ascertainment bias in the data, which could produce spuriously high values of R. These RFLP data were originally selected for study because they were polymorphic among Europeans. If there were a detailed record of how this was done, we could use Nielsen's (2000) method to correct for the bias, after adjusting the method to take population subdivision into account. There appears to be no such detailed record, but simulations have been carried out to measure the possible effect of ascertainment bias (Mountain and Cavalli-Sforza 1994). These indicate that the bias will be strongest for European samples and ones genetically close to them, and that ascertainment bias alone does not account for the genetic signal in the data. In an attempt to gauge the importance of ascertainment bias to this work, I excluded the European sample and redid the analysis. This did not change the results appreciably. As mentioned above, I also looked at just three African samples and found similar results. Because other nuclear data—both DNA sequences (Hardinget al. 1997; Clarket al. 1998; Ziet-kiewiczet al. 1998; Harris and Hey 1999) and SNPs (Nielsen 2000)—appear to show a similar pattern, it seems likely that these RFLP data reflect past human demography rather than, or at least in addition to, biased sampling.
The conclusions reached here stand in contrast to most models of human history that assume a panmictic ancestral population; for a recent review of these see Mountain (1998). Here it was found that subdivision was stronger among ancestral human populations than it is now. This may be easy to accept given what we know about recent human demography, but it brings into focus another very important point. We can probably reject models of the isolation and subsequent divergence of human populations without gene flow. These would be characterized by very small values of M, probably small values of R, and a value of T corresponding to the time of isolation. Figure 2 shows that such values are strongly rejected. Some caution is needed here because strict isolation models are not simple limiting cases of corresponding migration models (Nath and Griffiths 1993). However, the data indicate that humans are a web of demes interconnected by gene flow rather than a tree of isolated populations.
A C program that performs the above analyses is available from the author.
Acknowledgments
I thank Rasmus Nielsen, Joanna Mountain, Monty Slatkin, and Carl Morris for helpful discussions of this material. Jody Hey and Warren Ewens provided comments on an earlier version. I thank Naoyuki Takahata and an anonymous referee for helpful reviews. Joanna Mountain kindly provided the RFLP data. This work was supported by a grant from the National Science Foundation (DEB-9815367).
APPENDIX A
The result given in Equation 3 is implicit in Wakeley (1998), where the expectation and variance of the total length of the genealogy during the collecting phase were derived. Under the assumption that the number of demes is large (D ⪢ n), the time to a common ancestor of n′ sequences is a two-step process. For two members of the sample to coalesce, they must first be in the same deme. When time is measured in units of 2ND generations, the waiting time to this event is exponentially distributed with parameter
Let k be the number of times that two of the sequences have the chance to coalesce before they actually do. Then k is geometrically distributed:
APPENDIX B
The scattering process that leads to Equation 1 is identical to what Arratia et al. (1992) call the Feller coupling. This is one of several stochastic processes known to generate the Ewens sampling formula. It follows, then, that not only does the (marginal) distribution in Equation 1 hold, but also that the distribution of the numbers of descendents of these lineages is given by the full Ewens sampling formula. Consider the singledeme sample, nj, and to simplify notation, for the moment let r = nj and
Footnotes
-
Communicating editor: N. Takahata
- Received May 6, 1999.
- Accepted August 12, 1999.
- Copyright © 1999 by the Genetics Society of America