| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
Genetics, Vol. 176, 969-981, June 2007, Copyright © 2007
doi:10.1534/genetics.107.071464
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
,1
* Department of Statistics and
Department of Mathematics, Cornell University, Ithaca, New York 14853
1 Corresponding author: 523 Malott Hall, Cornell University, Ithaca, NY 14853.
E-mail: rtd1{at}cornell.edu
| ABSTRACT |
|---|
|
|
|---|
An alternative approach to modeling spatial structure that does not suffer from this defect is the stepping-stone model. In this article, we investigate the consequences of modeling space as a two-dimensional stepping-stone model in which there is an L x L grid of colonies and migration only to neighboring colonies. The migration scheme is very simple; however, it results in what Wright called isolation by distance. In other words, it takes a number of migration events for the lineages to spread across the system. As we will see, this feature, which is certainly present in Drosophila and early human populations, causes a dramatic change in the coalescence structure of lineages.
The reason for this is intuitively clear. At small times the lineages have not had a chance to spread across the population, so the effective population size is reduced. The coalescence rate is increased, reducing the number of low-frequency-derived alleles, skewing the site frequency spectrum, and increasing linkage disequilibrium. These effects occur in the island model as well; two lineages sampled from one deme have an increased coalescence rate until one of them migrates, at which point they behave like a random sample from the overall population. In contrast, as we later explain, in the stepping-stone model the effective population size increases roughly linearly in time.
The main point of this article is to argue that spatial structure in the form of the stepping-stone model has a different effect from the symmetric island model and can have a much greater impact on genealogies, so it should also be considered when one wants to assess the impact of spatial structure on estimation procedures or statistical tests. We begin by reviewing theoretical results of COX and DURRETT (2002) and ZÄHLE et al. (2005) for the coalescence time of a sample of size n and contrast these results with the corresponding facts about the symmetric island model. The strange nonlinear time scaling needed to reduce genealogies in the stepping-stone model to Kingman's coalescent indicates that there is a strong effect on commonly used statistics, but the exact nature of the changes is difficult to analyze mathematically. Because of this, we turn to simulations to demonstrate the effect of stepping-stone spatial structure on the decay of linkage disequilibrium, the site frequency spectrum, and the distribution of test statistics based on the site frequency spectrum.
| THEORETICAL RESULTS |
|---|
|
|
|---|
the genealogy of a sample of size n is the same as that of a homogenously mixing population of size
![]() | (1) |
The reason for the simplification for large D is easy to understand. At most times all lineages are in different demes, their actual locations are irrelevant, and the coalescence times will have the lack of memory property that characterizes the exponential. If we sample n chromosomes from one deme then there is an initial period called the "scattering phase," which ends when all of the surviving lineages are in different demes. This initial phase is short compared to the coalescence time, so it is equivalent to a random reduction in the sample size. For more details, see p. 1864 of WAKELEY (1999).
LESSARD and WAKELEY (2004) have recently extended this analysis to the two-locus ancestral graph in a subdivided population. WAKELEY and LESSARD (2003) have applied these results to the study of linkage disequilibrium (LD) in humans. They found that their model with a large number of demes fitted the data for humans well (see Figure 2 on p. 1049 of WAKELEY and LESSARD 2003), in contrast to REICH et al. (2002), who did not get a good fit from the two-island model.
|
COX and DURRETT (2002) investigated the Moran model in which each individual is replaced at rate 1, replacement comes from a different deme with the migration probability m (called
by Cox and Durrett), and the probability a migrant to x comes from y is q(y x), where q is a probability distribution with finite range that has the same symmetries as the two-dimensional lattice. This symmetry assumption implies that the two coordinates are uncorrelated and have the same variance
2. The grid of colonies is an L x L square and the difference y x is computed modulo L; i.e., we have periodic boundary conditions that identify opposite edges of the square. This assumption is a mathematical convenience that has been used in many previous studies, but is not necessary. The proofs in COX and DURRETT (2002) and ZÄHLE et al. (2005) extend easily to a flat universe with migration out of the system suppressed or reflected back in, and the qualitative behavior is the same.
Theorem 4 of COX and DURRETT (2002) shows that if we pick any two chromosomes and
as L
then the coalescence time divided by
converges to an exponential distribution with mean 1. In words, if the per colony migration rate, Nm, is much larger than log L the system is essentially homogeneously mixing. This is a strong migration limit that corresponds to results of NAGYLAKI (1980) and NOTOHARA (1993) for the island model. It should also be compared to the remark of KIMURA and MARUYAMA (1971, pp. 125126) that "marked local differentiation of gene frequencies can occur if
" while "if
the whole population tends to behave as a panmictic unit." As we can see from the mathematical result, the cutoff between the two behaviors is not constant somewhere between 1 and 4, but depends on the number of demes and increases like log L.
To state results for the case in which the local population size N is not much larger than log L, we need the following rescaled migration rate:
![]() |
is the natural rescaled migration rate for the stepping-stone model, which is the analog of M = 4Nm for the island model. Note that in the stepping-stone model the variance
2 joins the product Nm to give the composite parameter that describes the strength of migration, but in contrast to the island model this quantity is divided by log L. The 2
that comes from the central limit theorem is included to make the next formula simple.
Theorem 5 of COX and DURRETT (2002) shows that two chromosomes sampled at random from the population have a coalescence time that is asymptotically, as L
, exponential with mean
![]() |
= 1 the mean coalescence time of two individuals is
![]() |
inside the logarithm are there to make the approximation more accurate for small L. They are not important as L
, so they should be dropped when comparing with our asymptotic result, but even with this there remains the difference of the factor of 1 +
. It is difficult to say what causes this difference since the result cited by CHARLESWORTH et al. (2003) and attributed to BARTON et al. (2002) does not appear in that article.
Theorem 2 in ZÄHLE et al. (2005) extends the result for the coalescence time of two chromosomes by showing that a sample of n chromosomes chosen at random from the population has the same genealogy as a sample of size n from a homogeneously mixing population of size
![]() | (2) |
. To illustrate the use of these formulas, consider our 10 x 10 grid with migration to nearest neighbors so L = 10 and
. If the local population size is N = 25 and we choose m = 0.1 so that 4Nm = 1 then
=
(0.25)/log(10) = 0.341 and Ne = 2500(1.341)/0.341 = 9829 vs. the actual population size of 2500.
In many genetic studies, samples are not chosen at random from the population as a whole. For example, one of the samples in SABETI et al. (2002) consists of 73 Beni individuals who are civil servants in Benin City, Nigeria. To capture this type of local sample in our framework, we assume that the n chromosomes are sampled at random from a Lß x Lß square of colonies. Theorem 3 in ZÄHLE et al. (2005) shows that we get the ordinary coalescent after a nonlinear time change in which times
with
correspond to time
in the coalescent, and then time proceeds at the usual linear rate for a population with the Ne given in (2). Here, we have changed ZÄHLE et al.'s (2005) 2m in the denominator to 4
m
2. This does not affect the limit theorem, but as we will see, it makes for a better approximation.
To help explain the time change, we note that the coalescence rate at time s =
[that is,
] is
![]() |
and hence area of order s, so the effective population size is of order s.
|
|
with
1. At times
L2/4
m
2 the lineages have had time to spread across the entire space. At this point the coalescence time, which is of order
, is much larger than the time, of order L2/m
2, needed for the random walk on an L x L square that jumps at rate 2m and has variance
2 to equilibrate in the uniform distribution (COX and DURRETT 2002). Thus the relative positions of lineages are unimportant and the system behaves as if it were homogeneously mixing. Since we have changed ZÄHLE et al.'s (2005) 2m in the denominator to 4
m
2, 1 over the coalescence rate increases until it becomes constant in the second regime, and the transition is continuous. ZÄHLE et al. (2005) were able to compute various quantities for samples of size two under the infinite-sites model including the expected number of pairwise differences and the probability for no coalescence before recombination for two loci with a per generation recombination probability r. They showed that the latter quantity decayed more slowly in the stepping-stone model compared to a homogenously mixing population, but it is hard to relate this to commonly used measures of linkage disequilibrium.
The limit theorem of ZÄHLE et al. (2005) is difficult to use for computations because the coalescence rate changes in time. For the ordinary coalescent one can easily compute the correlation between coalescence times at two loci for samples of size two by considering whether recombination or coalescence occurs first. One can find this result of GRIFFITHS (1981) explained on pp. 8083 of DURRETT (2002). However, in the situation of ZÄHLE et al. (2005) one must also consider the time at which the event occurs, and one can no longer obtain the answer by solving three equations in three unknowns. Thus, to investigate the effect of stepping-stone population structure on samples of size n > 2, we turn to simulations.
| METHODS |
|---|
|
|
|---|
To try to minimize the differences between the spatial structures, we use Ne formulas from NEI and TAKAHATA (1993) and ZÄHLE et al. (2005) given earlier as (1) and (2) to determine the number of diploids per colony, N, so that the computed Ne for the population is
10,000. Table 1 lists the scaled migration rate (4Nm), number of diploids per colony (N), and computed effective population sizes (Ne). The definition of effective population size we use here is one-half the average coalescence time of two lineages, so this makes the mean number of pairwise differences the same.
|
Decay of linkage disequilibrium:
Following PRITCHARD and PRZEWORSKI (2001), we compute the square of the correlation coefficient, which for two loci with two alleles A and a and B and b is
![]() |
10, 30, and 50 kb. The number of simulations used to determine the distribution of r2 was 350,000.
Site frequency spectrum:
Since the alignment and the ancestral state are known, we can compute for each SNP the observed number of chromosomes i (1
i
39) that have the mutant nucleotide. This number is then divided by the total number of segregating sites from all 350,000 replications, to get the site frequency distribution.
SNP density:
We choose our physical sequence length to be 10 kb. The number of segregating sites for 350,000 simulations was tabulated and normalized.
TAJIMA's (1989) D-statistic and FAY and WU's (2000) H are calculated for each replication using Hudson's "sample_stats" program, which is included with the ms program. The median, 2.5, and 97.5 percentiles are computed over 350,000 simulations.
| RESULTS |
|---|
|
|
|---|
Decay of linkage disequilibrium:
As Figure 3 shows, when 4Nm = 1, there is a large difference in the rate of decay of r2 between the homogeneously mixing and the migration models. As expected, the stepping-stone model has considerably more LD than the island model. When 4Nm = 1, the stepping-stone local sample has r2
0.9 at a distance of 100 kb, which is considerably larger than values observed in the human genome, but of course our universe is only a 10 x 10 array of colonies. The random samples have a faster decay of r2 than the local samples, but in contrast to the theoretical results quoted above, their behavior is not the same as that of the homogeneously mixing population. One reason for this is that the sample size is n = 40, so n2 = 1600 is much larger than the number of demes, 100, and the assumption of the limit theorem is not justified. A simple calculation for 40 lineages and 100 demes shows that the probability that all lineages will be in separate demes is 0.000122. Using a Poisson approximation with mean 0.4 for the number of lineages in a deme, we see that on the average 5.26 demes will have 2 lineages.
|
Distribution of r2:
Figure 4 compares the distribution of r2 for a homogeneously mixing population and a stepping-stone local sample with 4Nm = 10, for SNPs separated by 1011, 3031, or 5051 kb. Not only is the mean of r2 larger in the stepping-stone model, but also there is a significant probability that r2 = 1, even for SNPs separated by 5051 kb.
Site frequency spectrum:
As Figure 5 shows, in the case of a random sample, the site frequency spectrum for the island and the stepping-stone model with 4Nm = 1 behaves like a homogeneously mixing population. That is, as FU (1995) has shown, the probability that k members of the sample have the mutant nucleotide is c/k, where c is a constant that makes the probabilities sum to 1. This is the behavior expected on the basis of the theoretical results for random samples discussed earlier, which show that the genealogy of a random sample converges to that of Kingman's coalescent.
|
For 4Nm = 1 and 4Nm = 3, both models exhibit an excess of intermediate- and high-frequency-derived nucleotides compared to the homogeneously mixing model. When 4Nm = 10, the discrepancy at the high-frequency end is almost gone but there are significant differences for rare alleles.
SNP density:
Figure 6 shows that for random samples the SNP density from both migration models match closely the one for a homogeneously mixing population, even when 4Nm = 1. For local samples the SNP density is changed in the spatial models (see Figure 6, bd) with the number of SNPs shifted toward smaller values and the shift more pronounced for the stepping-stone model than for the island model. Note that when 4Nm = 1, >35% of our 10-kb segments have zero SNPs in the stepping-stone model, compared to 5% for the island model, while this almost never happens in a homogeneously mixing population. The skew in the distribution persists in local samples from both spatial models when 4Nm = 3 and even when 4Nm =10.
|
|
|
| DISCUSSION |
|---|
|
|
|---|
Our simulation results have shown that stepping-stone population structure produces a slow decay of linkage disequilibrium and dramatically increases the probability of perfect correlation; that is, r2 = 1. This change in the two-locus sampling distribution may cause trouble for likelihood methods, such as the ones MCVEAN et al. (2004) have used to estimate recombination rates in humans. Low recombination rates could be assigned to intervals between markers with high correlation, when in reality this is due to spatial structure. In making the last comment, we are not suggesting that when spatial structure is taken into account, all cold spots will suddenly become warm. However, the qualitative picture of recombination rate variability may be significantly changed.
For a concrete example where spatial structure may have contributed to erroneously rejecting neutrality, we consider results of HAMBLIN and AQUADRO (1996) for the glucose dehydrogenase gene based on a sample of 11 Drosophila simulans collected in 1984 in Raleigh, North Carolina. The only test that suggested a pattern of nonneutral evolution was Fu and Li's test. In that case observing one singleton of 26 segregating sites had a probability of P < 0.05 in a two-tailed test, but might not be significant if one took into account that local sampling could itself reduce the number of singleton mutations.
The effect of local population sampling in humans, particularly in European or North American populations may not be as dramatic, since as if one goes back a few dozen generations the lineages disperse over a wide area. However, it could be more problematic when a population has occupied a small area for a long period of time. SABETI et al. (2002) sampled 73 Beni from Benin City, Nigeria, and sequenced a region around the G6PD locus. To argue that selection had acted on this locus they examined the decay of the extended haplotype homozygosity (EHH), i.e., the probability that two randomly chosen chromosomes carrying the same core haplotype were identical by descent up to that point. To evaluate the likelihood of the observed data, they used Hudson's ms program to simulate homogeneously mixing populations of constant size, with exponential growth, a bottleneck, or a two-island model. However, as we have shown, stepping-stone spatial structure can dramatically reduce the decay of linkage disequilibrium.
We are not the first to have suggested that spatial structure of the human population may have played a role in the spurious detection of positive selection. MEKEL-BOBROV et al. (2005) and EVANS et al. (2005) showed in studies of abnormal spindle-like microcephaly-associated (ASPM) and microcephalin genes that the presence of haplotypes with unusually large frequencies was caused by positive selection. CURRAT et al. (2006) argued that human demographic models with structure followed by population growth could explain the observed haplotype frequency patterns. In reply, MEKEL-BOBROV et al. (2006) argued that the bottleneck in Currat et al.'s model was unrealistically long and narrow.
While one can debate the impact of spatial structure on genealogies that cannot be directly observed, there is clear evidence that some allele frequencies show pronounced spatial structure: for example, lactase persistence (BERSAGLIERI et al. 2004; TISHKOFF et al. 2007), the CCR5-
32 deletion that leads to strong resistance against HIV-1 (STEPHENS et al. 1998), and the Duffy blood group (HAMBLIN et al. 2002). One final observation that argues for the importance of spatial structure in shaping patterns of variability is that of ROSENBERG et al. (2002), who have shown that with information about a large number of microsatellite loci, one can classify most of the 1056 individuals in a sample from 52 populations into their correct geographical regions.
In the other direction, one might argue that patterns of nucleotide variability are shaped over longer timescales than microsatellites, so only recent mutations will show the effects of population structure. However, this article has clearly shown that the "isolation by distance" in the stepping-stone model has a profound effect on patterns of variability, even when Nm is not much larger than 1, so it should also be considered when one wants to assess the impact of population subdivision on estimation procedures or statistical tests.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
| LITERATURE CITED |
|---|
|
|
|---|
BERSAGLIERI, T., P. C. SABETI, N. PATTERSON, T. VANDERPLOEG, S. F. SCHAFFNER et al., 2004 Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74: 11111120.[CrossRef][Medline]
BARTON, N. H., F. DEPAULIS and A. M. ETHERIDGE, 2002 Neutral evolution in a spatially continuous population. Theor. Popul. Biol. 61: 3148.[CrossRef][Medline]
CHARLESWORTH, B., D. CHARLESWORTH and N. H. BARTON, 2003 The effects of genetic and geographic structure on neutral variation. Annu. Rev. Ecol. Evol. Syst. 34: 99125.[CrossRef]
COX, J. T., and R. DURRETT, 2002 The stepping stone model: new formulas expose old myths. Ann. Appl. Probab. 12: 13481377.[CrossRef]
CURRAT, M., L. EXCOFFIER, W. MADDISON, S. P. OTTO, N. RAY et al., 2006 Comment on "Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens" and "Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans" Science 313: 172a.
DURRETT, R., 2002 Probability Models for DNA Sequence Evolution. Springer, New York.
EVANS, P. D., S. L. GILBERT, N. MEKEL-BOBROV, E. J. VALLENDER, J. R. ANDERSON et al., 2005 Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans. Science 309: 17171720.
FAY, J. C., and C.-I WU, 2000 Hitchhiking under positive Darwinian selection. Genetics 155: 14051413.
FU, Y. X., 1995 Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172197.[CrossRef][Medline]
FU, Y. X., and W. H. LI, 1993 Statistical tests of neutrality of mutations. Genetics 133: 693709.[Abstract]
GRIFFITHS, R. C., 1981 Neutral two-locus multiple allele models with recombination. Theor. Popul. Biol. 19: 169186.[CrossRef]
HAMBLIN, M. T., and C. F. AQUADRO, 1996 High nucleotide sequence variation in a region of low recombination in Drosophila simulans is consistent with the background selection model. Mol. Biol. Evol. 13: 11331140.[Abstract]
HAMBLIN, M. T., E. E. THOMPSON and A. DIRIENZO, 2002 Complex signatures of natural selection at the Duffy blood group locus. Am. J. Hum. Genet. 70: 369383.[CrossRef][Medline]
HUDSON, R., 2002 Generating samples under a WrightFisher neutral model of genetic variation. Bioinformatics 18: 337338.
KIMURA, M., and T. MARUYAMA, 1971 Patterns of neutral polymorphism in a geographically structured population. Genet. Res. 18: 125131.[Medline]
LESSARD, S., and J. WAKELEY, 2004 The two-locus ancestral graph in a subdivided population: convergence as the number of demes grow in the island model. J. Math. Biol. 48: 275292.[CrossRef][Medline]
MCVEAN, G. A. T., S. R. MYERS, S. HUNT, P. DELOUKAS, D. R. BENTLEY et al., 2004 The fine scale structure of recombination rate variation in the human genome. Science 304: 581584.
MEKEL-BOBROV, N., S. L. GILBERT, P. D. EVANS, E. J. VALLENDER, J. R. ANDERSON et al., 2005 Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens. Science 309: 17201722.
MEKEL-BOBROV, N., P. D. EVANS, S. L. GILBERT, E. J. VALLENDER, R. R. HUDSON et al., 2006 Response to comment on "Ongoing adaptive evolution of ASPM, a brain size determinant in Homo sapiens" and "Microcephalin, a gene regulating brain size, continues to evolve adaptively in humans". Science 313: 172b.
NAGYLAKI, T., 1980 The strong-migration limit in geographically structured populations. J. Math. Biol. 9: 101114.[CrossRef][Medline]
NEI, M., and N. TAKAHATA, 1993 Effective population size, genetic diversity, and coalescence time in subdivided populations. J. Mol. Evol. 37: 240244.[Medline]
NOTOHARA, M., 1993 The strong-migration limit for the genealogical process in geographically structured populations. J. Math. Biol. 31: 115122.
PRITCHARD, J. K., and M. PRZEWORSKI, 2001 Linkage disequilibrium in humans: models and data. Am. J. Hum. Genet. 69: 114.[CrossRef][Medline]
PTAK, S. E., and M. PRZEWORSKI, 2002 Evidence for population growth in humans is confounded by fine-scale population structure. Trends Genet. 18: 559563.[CrossRef][Medline]
REICH, D. E., S. F. SCHAFFNER, M. J. DALY, G. MCVEAN, J. C. MULLIKIN et al., 2002 Human genome sequence variation and the influence of gene history, mutation, and recombination. Nat. Genet. 32: 135142.[CrossRef][Medline]
ROSENBERG, N. A., J. K. PRITCHARD, J. L. WEBER, H. M. CANN, K. K. KIDD et al., 2002 Genetic structure of human populations. Science 298: 23812385.
SABETI, P. C., D. E. REICH, J. M. HIGGINS, H. Z. P. LEVINE, D. J. RICHTER et al., 2002 Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832837.[CrossRef][Medline]
STEPHENS, L. C., D. E. REICH, D. B. GOLDSTEIN, H. D. SHIN, M. W. SMITH et al., 1998 Dating the origin of the CCR5-
32 AIDS-resistance allele by the coalescence of haplotypes. Am. J. Hum. Genet. 62: 15071515.[CrossRef][Medline]
TAJIMA, F., 1989 Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123: 585595.
THORNTON, K. R., and J. D. JENSEN, 2007 Controlling the false positive rate in multi-locus genome scans for selection. Genetics 175: 737750.
TISHKOFF, S. A., F. A. REED, A. RANCIARO, B. F. VOIGHT, C. C. BABBIT et al., 2007 Convergent adaptation of human lactase persistence in Africa and Europe. Nat. Genet. 39: 3140.[CrossRef][Medline]
WAKELEY, J., 1998 Segregating sites in Wright's island model. Theor. Popul. Biol. 53: 166174.[CrossRef][Medline]
WAKELEY, J., 1999 Nonequilibrium migration in human history. Genetics 153: 18631871.
WAKELEY, J., and S. LESSARD, 2003 Theory of the effects of population structure and sampling on patterns of linkage disequilibrium applied to genomic data from humans. Genetics 164: 10431053.
WILKINS, J., 2004 A separation-of-timescales approach to the coalescent in a continuous population. J. Math. Biol. 31: 115122.
ZÄHLE, I., J. T. COX and R. DURRETT, 2005 The stepping stone model, II. Genealogies and the infinite sites model. Ann. Appl. Probab. 15: 671699.[CrossRef]
Communicating editor: M. W. FELDMANThis article has been cited by other articles:
![]() |
T. Stadler, U. Arunyawat, and W. Stephan Population Genetics of Speciation in Two Closely Related Wild Tomatoes (Solanum Section Lycopersicon) Genetics, January 1, 2008; 178(1): 339 - 350. [Abstract] [Full Text] [PDF] |
||||
![]() |
U. Arunyawat, W. Stephan, and T. Stadler Using Multilocus Sequence Data to Assess Population Structure, Natural Selection, and Linkage Disequilibrium in Wild Tomatoes Mol. Biol. Evol., October 1, 2007; 24(10): 2310 - 2322. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |