Abstract
We develop predictions for the correlation of heterozygosity and for linkage disequilibrium between two loci using a simple model of population structure that includes migration among local populations, or demes. We compare the results for a sample of size two from the same deme (a single-deme sample) to those for a sample of size two from two different demes (a scattered sample). The correlation in heterozygosity for a scattered sample is surprisingly insensitive to both the migration rate and the number of demes. In contrast, the correlation in heterozygosity for a single-deme sample is sensitive to both, and the effect of an increase in the number of demes is qualitatively similar to that of a decrease in the migration rate: both increase the correlation in heterozygosity. These same conclusions hold for a commonly used measure of linkage disequilibrium (r2). We compare the predictions of the theory to genomic data from humans and show that subdivision might account for a substantial portion of the genetic associations observed within the human genome, even though migration rates among local populations of humans are relatively large. Because correlations due to subdivision rather than to physical linkage can be large even in a single-deme sample, then if long-term migration has been important in shaping patterns of human polymorphism, the common practice of disease mapping using linkage disequilibrium in “isolated” local populations may be subject to error.
WE derive the correlation in coalescence times at a pair of loci, with recombination between them, in a sample of two chromosomes at each locus from a subdivided population. Under the assumption that variation is selectively neutral at both loci and that either the infinite-sites mutation model (Watterson 1975) holds or the mutation rates at both loci are very small, (e.g., see Slatkin 1991 and Nielsen 2000), these correlations in coalescence times are easily translated into correlations in heterozygosity. Griffiths (1981) studied the correlation in heterozygosity at a pair of loci in a panmictic population, as did Hudson (1983) and Kaplan and Hudson (1985) who extended the results to cover many linked sites by averaging over pairs of sites. These results have been used to construct estimators of the population recombination rate on the basis of the variance of pairwise differences (Hudson 1987; Wakeley 1997)—which turn out to perform poorly compared to other estimators (Hudson 2001)—and to discuss the trade-off between sequencing longer stretches of DNA and taking a larger sample when the goal is to measure amounts of DNA polymorphism (Pluzhnikov and Donnelly 1996).
Two recent developments, one theoretical and one empirical, provide renewed motivation for studies of the correlation in heterozygosity or coalescence times. On the empirical front, Reich et al. (2002) measured the correlations in polymorphism level between many pairs of short loci in samples of two human chromosomes. These are the first genome-wide estimates of such correlations, corrected for ascertainment bias and variation in mutation rate among loci. Thus, we can now directly compare theoretical predictions about the correlation in heterozygosity to observed data. On the theoretical side, McVean (2002) showed that correlations in coalescence times can be used to predict r2, a commonly employed measure of linkage disequilibirum, defined to be the correlation coefficient between alleles at two loci (Hill and Robertson 1968). Thirty years ago, Ohta and Kimura (1971) proposed an approximation to the expected value of r2 at a pair of loci by expressing it as a ratio of expected values, which appears to be accurate as long as the frequencies of alleles at the two loci are not too small (Hudson 1985; McVean 2002). Using the results of Strobeck and Morgan (1978) and Hudson (1985) and assuming that the per-site mutation rate at each locus is close to zero, McVean (2002) showed that Ohta and Kimura’s (1971) estimate can be expressed as a ratio of covariances in coalescence times, or expected products of coalescence times, at the two loci.
This demonstration of a direct relationship between expected products of coalescence times at pairs of loci and linkage disequilibrium, as measured by r2, should help to connect the large body of theoretical work on genealogical processes to the ongoing empirical effort to describe and to understand patterns of linkage disequilibrium in the human genome. Human history has been marked by the growth and decline of populations, subdivision both with and without migration, and the admixture of subpopulation samples (Takahata 1995; Harpendinget al. 1998; Hawkset al. 2000), all of which can strongly influence patterns of linkage disequilibrium (Slatkin 1994; Ewens and Spielman 1995; Kaplanet al. 1995; Lander and Schork 1996; Kruglyak 1999; Pritchard and Przeworski 2001). It is particularly important to have an accurate picture of these factors in view of the rising importance of linkage disequilibrium mapping as a tool in disease association studies (Jorde 1995; Lander 1996; Risch and Merikangas 1996) because our ability to judge the significance of associations depends on having an accurate picture of human demographic history. Predictions about r2 are helpful in this because they measure variability in the more commonly used statistics D and D′ (Lewontin 1964). For example, the history of migration we study here predicts larger values of r2 than would be expected in a panmictic species. This means that values of D that would be judged significant if humans were panmictic might be observed more often by chance among human populations connected by migration.
The study of Reich et al. (2002) on samples of size two followed another recent study (Reichet al. 2001), which examined linkage disequilibrium in a larger sample, but within regions defined by the existence of a high-frequency coding single-nucleotide polymorphism (SNP) and thus subject to ascertainment bias. Both studies reached the same conclusions. First, correlations in heterozygosity and linkage disequilibrium extend farther than predicted by Kruglyak’s (1999) model of an unstructured human population that has grown in size, in which it was assumed that the recombination rate does not vary across the genome. Second, the preferred explanation for this pattern is variation in the recombination rate across the genome rather than demographic factors. However, Reich et al. (2001) considered only a population bottleneck in an unstructured human population and Reich et al. (2002) considered either a bottle-neck or the subdivision of humans into two subpopulations. Here, we focus on population subdivision and show that the conclusions of Reich et al. (2002) could be a consequence of the fact that only two subpopulations were considered.
It has been known for some time that population subdivision can increase the level of association between alleles in a population, due to covariance in allele frequencies among demes (Nei and Li 1973). Our results are entirely consistent with this previous work and are especially close to those of Ohta (1982) who partitioned the components of linkage disequilibrium in a subdivided population in much the same way that F-statistics are partitioned. However, we take a coalescent approach and this allows us to more naturally address questions of sampling over geography. Because the effects of subdivision on samples of multiple chromosomes from each of a number of demes are direct and obvious (Nei and Li 1973), we focus mainly on two possibilities: that all samples come from a single deme (single-deme sample) or that each sample comes from a different deme (scattered sample).
We find that subdivision-induced associations are negligible for scattered samples, but can be very strong in single-deme samples. The dual nature of migration is the source of these inflated levels of association between loci in samples from a single deme compared to what would be expected for a pair of loci with the same rate of recombination in a panmictic population. Restricted migration structures genetic variation among demes so that immigration events bring genetically dissimilar genomes into a deme from outside. Thus, lower rates of migration in the population lead to stronger average levels of association. We also find a strong dependence on the number of demes in the population. Levels of association in single-deme samples become stronger as the number of demes increases even if the migration rate among demes remains constant. We invoke this as at least a partial explanation of the correlations in the data considered by Reich et al. (2002). Clearly, variation in recombination rates across the genome, for which there is good direct evidence (Konget al. 2002), must also have had a role in shaping the observed “block structure” of linkage disequilibrium in humans (Gabrielet al. 2002). Our results show that (1) demographic factors may not be so easily dismissed if humans are subdivided into more than two demes, and (2) it may be possible to avoid the misleading portions of the linkage disequilibrium structure of the human genome, those that are due to subdivision and migration, by looking at scattered samples.
THEORY AND METHODS
Assume that a sample of two chromosomes is taken at each of two loci and that T(1) is the length of the genealogy at the first locus and T(2) is the length of the genealogy at the second locus. Note that these are equal to twice the coalescence time at each locus. Our goal is to compute
Equation 1 is what Reich et al. (2002) estimate using genomic data from humans and call ρ(τx, τx+d) for a pair of loci separated by d intervening nucleotides. The value of ρ(τx, τx+d) depends on whether the two copies at each locus are linked on the same two chromosomes, share one chromosome, or are located on two distinct pairs of chromosomes (Strobeck and Morgan 1978), and Reich et al. (2002) call these possibilities cis, trans, and dis, respectively. In addition to (1), we use the covariances of tree lengths to compute
In contrast to the case of a single randomly mating population, in a subdivided population the expected values that go into Equations 1 and 3 will depend on how the sample is distributed among demes. Thus, we cannot simply use cis, trans, and dis here and instead develop an expanded notation (described below) for samples from a subdivided population. We begin with a general statement of the model in the next section and then use this framework to compute (1) and (3) in the finite island model (Wright 1931; Moran 1959; Latter 1973; Maruyama 1974) of subdivision for different sample configurations. We also present some limiting expressions that hold for the two-locus ancestral process in the island model as the number of demes becomes large, a topic we treat in detail elsewhere (Lessard and Wakeley 2003).
The ancestral recombination graph for a pair of sites in a structured population: We assume discrete, nonoverlapping generations in a diploid population structured into D demes, with backward migration rates constant through time. We consider two loci or sites with recombination rate r per generation between them (not to be confused with the r in r2). Elsewhere (Lessard and Wakeley 2003) we describe the ancestral process for this model, which is a generalization of the single-locus structured coalescent (Wilkinson-Herbots 1998; Nordborg 2001), or an extension of the two-locus ancestral graph (Griffiths 1991) to the case of a structured population. Here, we restate enough of the frame-work to solve the problem at hand. In the general version of the model, demes may be of different relative sizes, ci, for i = 1,..., D, and migration rates can vary across the population. We let mij be the proportion of deme i that came from deme j in the previous generation, and let Mij = 4Nmij be the scaled migration rate. Note that in doing this, we have made the usual coalescent assumption that the migration rate mij is small and deme size N is large.
In considering the genealogy of a sample of chromosomal segments at the two sites, it is necessary to distinguish three kinds of segments: those ancestral to the sample at site 1 only (type 1), those ancestral at site 2 only (type 2), and those ancestral at both sites (type 3). If deme i contains
After spending an exponential amount of time in state n, there is a jump to another state n′ with transition probabilities
We can use this framework to obtain systems of equations for the quantities required to compute the covariances of genealogical tree lengths at two sites in a structured population. Suppose that the Markov chain is currently in state n, given by Equation 4. Let
The island model with an arbitrary number of demes: In the island model, the D demes are assumed to be of the same size and the backward migration rates to other demes are all equal. Therefore, we have ci = 1 for all i and Mij = M/(D - 1) for all j ≠ i. These assumptions make the ancestral graph at two sites symmetric with respect to any permutation of the demes in addition to being symmetric with respect to the two sites. In computing (1) and (3), we need to consider only samples of two chromosomes at each site. This simplifies the state space considerably, and Table 1 lists all the possibilities numbered for simplicity from 1 to 15. We focus below on states 5 and 11, which represent the most common sample configuration for a sample of size two at each of a pair of sites, or loci, in a subdivided population. These are both cis comparisons; state 5 is when the two chromosomes are sampled from the same deme and state 11 is when they come from different demes.
Notation for states in the island model
The quantities in Equations 1, 2, and 3 that depend only on the history at a single site have been known for some time; see Hey (1991) and references therein. If we represent the sample states by their numbers, then for site 1 we have
We use Equation 11 to obtain the other necessary quantities, i.e., the expectations of the product of the tree lengths at two sites, depending on the distribution of the ancestral segments within and between demes. To save space, we let
A simpler ancestral process when D is large: The un-wieldy expressions for the case of arbitrary D can be checked against simpler predictions that hold in the limit as D goes to infinity. When the number of demes is large, the ancestral process for the sample becomes much simpler. In particular, in the island model both with recombination (Lessard and Wakeley 2003) and without it (Wakeley 1998, 1999), the history of a scattered sample is the same as that in a panmictic population of ND diploid organisms, but on a timescale longer by the factor (1 + 1/M). Samples that include multiple segments from single demes are subject to a rapid stochastic sample-size adjustment, the “scattering phase” (Wakeley 1999), before they enter this much longer panmictic process. Coalescent events during the scattering phase are the source of the greater within-deme than between-deme relatedness predicted under the island model.
The appendix gives approximations for
By substituting these large-D approximations into Equation 1 we have
Equations 17 and 18 illustrate a surprising fact about the large-D ancestral process for a scattered sample, namely, that the appropriate recombination parameter continues to be equal to R = 4NDr, even as D goes to infinity (Lessard and Wakeley 2003). The reason this is surprising is that the timescale of the coalescent process, and thus the time over which mutation and recombination events can occur in the history of the sample, is increased by the factor (1 + 1/M). The number of mutation events in the history of the sample does indeed depend on θ(1 + 1/M) (Wakeley 1998), but the number of potentially observable recombination events depends only on R (Lessard and Wakeley 2003). While (1 + 1/M) more recombination events do occur in the history, a fraction 1/(1 + M) of these are repaired instantaneously because the two resulting segments will initially be present in the same deme and thus will be like a single-deme sample, subject to a scattering phase. What remains is R × (1 + 1/M) × M/(M + 1) = R. This might have important practical consequences because disease loci can be mapped with greater resolution when recombination is more frequent, as long as enough SNPs are present, and higher numbers of SNPs lead to increased power for a given recombination map (Kruglyak 1999). We do not pursue these issues here, but note that Nordborg and Tavaré (2002) discuss this in relation to the effect that partial selfing has on the numbers of SNPs and recombination events (Nordborg 2000). Lessard and Wakeley (2003) detail the similarities and differences between partial selfing and island-model migration.
RESULTS
Population subdivision provides an additional axis for comparison of polymorphism levels and associations/correlations between loci. It introduces the possibility of making within- vs. between-deme comparisons [as embodied by Wright’s (1951) well-known fixation index FST], an idea that Ohta (1982) extended to measures of linkage disequilibrium. Thus, the distribution of the sample among demes, as well as the distribution of ancestral genetic material among chromosomes, becomes important. Figure 1 shows that the correlation of tree lengths at two loci depends strongly on how the sample is taken, either from two different demes or from the same deme, as well as on the distance separating the two sites, the demic migration rate, and the number of demes in the population. Note that in Figure 1, a and b have a broader range on the vertical axis than c and d. For Figure 1, a-d, the population rate of recombination was assumed to be 5.2 × 10-4/site, following Reich et al. (2002). Thus, the distances in a and b correspond to a range of values of R from 0.052 (= 0.1 kb × 103 bp/ kb × 5.2 × 10-4/bp) to 52. For c and d, the distance was set to 10 kb, which corresponds to R = 5.2.
—Analytic predictions for the correlation in tree lengths at two loci in samples of two chromosomes at each locus. (a and b) The correlation as a function of the distance between the two sites, measured in kilobases (kb), and the number of demes D in an island model population with M = 1. (c and d) The correlation at two sites separated by 10 kb as a function of the migration rate M and the number of demes. As discussed in the text, the population per-site rate of recombination was assumed to be R = 5.2 × 10-4 after Reich et al. (2002). a and c are for the case of two chromosomes [cis using the terminology of Reich et al. (2002)] sampled from two different demes; i.e., they plot Equation 1 for state 11 in Table 1. b and d are for the case of two chromosomes (again, cis) sampled from the same deme; i.e., they plot Equation 1 for state 5 in Table 1.
For a single-deme sample, Figure 1, a and c, the predicted correlation depends on all three quantities: the distance, M, and D. Clearly, the distance between the two sites, which in our model linearly determines the recombination rate between them, has the strongest effect, with shorter distances corresponding to higher correlations. The migration rate is the next most important factor for single-deme samples, with lower migration rates producing stronger correlations in tree lengths. Finally, there is an effect of the number of demes on the correlation of tree lengths at two loci in a single-deme sample, with larger numbers of demes yielding stronger correlations. A large number of demes is qualitatively similar to a small migration rate because in both cases samples share either a very recent common ancestor at both loci due to coalescence within the deme or a very ancient one if a migration event occurs in their recent history, prior to which it may be a long time before another migration event again puts them in the same deme so that they have the chance to coalesce. This subdivision-induced inflation of the correlation is true for pairs of loci at any distance, but the effect is stronger for loci that are farther apart.
For a scattered sample, Figure 1, b and d, only the distance between the two sites strongly affects the correlation, again with shorter distances producing stronger correlations. Surprisingly, neither the migration rate nor the number of demes has a large effect on the correlation of tree lengths for scattered samples. There is some dependence on M and D when D is small, but the magnitude of the change in the correlation there is much smaller than that when the distance between sites is varied. Thus, the correlation in tree lengths for a scattered sample is similar to that in a panmictic population even though the average lengths of the trees change substantially with both M and D; see Equation 13. This constancy for scattered samples is predicted for large values of D—see Equation 17 and Lessard and Wakeley (2003)—but it appears to hold approximately for small D as well.
—Comparison of the predictions of the island model to the data of Reich et al. (2002). The population per-site rate of recombination was set, as in Figure 1, to be R = 5.2 × 10-4 and was constant across the genome. The different curves for the island model predictions are for D = 1, 4, 16, ∞, from bottom to top. The curves for D = 1 are the ones labeled “theoretical prediction” in Reich et al. (2002). In a, M = 1 and the prediction curves plot Equation 1 for state 11 in Table 1. In b, M = 4 and the prediction curves plot Equation 3 for state 11 in Table 1. Data points are redrawn from Figure 5, a and c, in Reich et al. (2002): the open/solid squares in both a and b give the upper/lower bounds for the correlation and for on the basis of variability in the correction for among-locus variation for mutation rate in comparisons with chimpanzee sequences, and the open circles in b show r2 calculated from the Utah CEPH data of Gabriel et al. (2002).
The protocol of Reich et al. (2002) categorized pairs of sites as cis, trans, and dis, but did not distinguish among the possible sample configurations in Table 1. The chance that both samples at both loci are from the same deme will be greatest for cis comparisons and smallest for dis comparisons, so the effects of subdivision on the correlations of tree lengths will be strongest for the cis samples. More detailed statements are difficult due to the heterogeneity of population origin of the samples used by Reich et al. (2002). In addition to these correlations in tree lengths at pairs of loci, Reich et al. (2002) plotted r2 calculated from a separate data set that included only Americans of European ancestry from the CEPH Utah pedigrees (Gabrielet al. 2002). For the sake of illustration, we assume that both sets of data are composed of single-deme samples. Figure 2 shows that for given migration rates for these two sets of data (M = 1 and M = 4, respectively; see below), increasing the number of demes brings the theoretical predictions closer to the observed data.
Both the pairwise data of Reich et al. (2002) and the Utah CEPH data of Gabriel et al. (2002) are subject to ascertainment bias with respect to the process of migration. In the case of Reich et al. (2002), an upper bound was exerted on the number of SNPs per read, and in the case of Gabriel et al. (2002) SNPs were discovered in an initial smaller sample and then typed in a larger one and an upper bound was exerted on the number of SNPs per read. Reich et al. (2002) corrected for ascertainment when calculating correlations, but neither set of authors considered the effect of ascertainment bias on realized, or estimated, migration rates. We can infer that realized migration rates in the Utah CEPH data are greater than the actual migration rates (Wakeleyet al. 2001), but in the pairwise data they should be smaller than actual rates because the protocol truncated only the upper tail of the distribution of the number of SNPs per read. This occurs because migrants tend to differ from residents, so the number of migrant chromosomes detected by this method is less than would be found in a random sample. This is the justification for using different values, M = 1 and M = 4, for the two data sets in Figure 2. These values were chosen for the sake of illustration, but they are roughly consistent with estimates of M based on FST (Cavalli-Sforzaet al. 1994) when ascertainment bias is taken into account.
DISCUSSION
Human beings do not mate randomly on a global scale. Instead, they are subdivided into local populations, or demes, among which there is substantial gene flow. Substantial here means roughly M ≥ 1, as appears to be true of many local human populations (Cavalli-Sforzaet al. 1994). We have shown that the number of demes is important in determining levels of association between alleles and of correlations in genealogical tree lengths at different loci when multiple samples come from single demes; see Figure 1, a and c. In particular, average two-locus correlations in a single-deme sample become stronger both when the migration rate decreases and when the number of demes increases. Thus, the conclusion of Reich et al. (2002) that migration rates are not low enough among human demes to explain observed long-distance correlations is at least in part a consequence of the fact that only a two-deme model was considered.
We have also shown that associations due to subdivision are negligible for scattered samples; see Figure 1, b and d. This has consequences for the prospect of disease-locus mapping using patterns of linkage disequilibrium. There is active debate over population choice for linkage disequilibrium mapping studies, based both on the chance that the disease is less heterogeneous and on the knowledge of the demographic history of local populations (Wrightet al. 1999). For instance, it is popular to focus on recently founded local populations because these are expected to have preserved levels of linkage disequilibrium due to sampling of founders. There is some empirical evidence of this, e.g., Jorgensen et al. (2002), but there is also evidence that some local populations do not follow this pattern (Eaveset al. 2000). So far, there has been little attention to migration in these discussions. The migration model we studied here predicts that local populations that receive fewer migrants each generation are expected to show greater correlations between loci and higher levels of linkage disequilibrium. When M is small, migrants will differ from residents at many loci simultaneously, and the model predicts an equilibrium level of linkage disequilibrium as a balance among immigration, genetic drift, and recombination. If a sample contains recent migrants, the chance of spurious association increases.
While we have not studied any particular disease model and have considered only levels of association between neutral markers, further study of the role of migration among local populations of humans in establishing genomic patterns of linkage disequilibrium seems warranted. Oversimplified models of human history are not consistent with available data (Przeworskiet al. 2000; Pluzhnikovet al. 2002), and appropriate tools for linkage disequilibrium mapping are necessary. Methods in the spirit of Pritchard et al. (2000), which use background levels of association to control for population structure, appear promising. Note, however, that the method of Pritchard et al. (2000) assumes no association within demes and so does not appear to be applicable if there has been migration. If the island model with a large number of demes (or a related model; see below) holds for humans, the present analysis, e.g., Figure 1, b and d, suggests that linkage disequilibrium mapping studies should be done using scattered samples because these should have the lowest level of subdivision-induced association. This would not be recommended, however, if the causes of the disease under study differed from population to population.
Of course, the island migration model has shortcomings: it assumes that the population has always been subdivided, that every deme is of the same size and has the same migration rate, and that the number of demes has remained constant. Perhaps most importantly, it lacks true geographic structure because every deme can exchange migrants with every other deme. Many of these problems can be dealt with in the case where the number of demes is large, and extinction/recolonization of demes can also be included (Wakeley and Aliacar 2001). Thus, the results we describe here should hold, at least qualitatively, in a much broader setting. We have not here considered changes in population size or migration rate, but these too can be incorporated relatively easily into the model when the number of demes is large (Wakeley 1999).
The work we have presented here is similar in spirit to the recent work of Vitalis and Couvet (2001a,b) who studied two-locus probabilities of identity under the infinite island model, with the possible addition of partial selfing. Thus, it might be expected that our results, in the limit as D goes to infinity, should match theirs. However, the model considered by Vitalis and Couvet (2001a,b) differs from ours in one very important respect. While we assume that the population mutation rate θ= 4NDu and the population recombination rate R = 4NDr remain finite as D goes to infinity, Vitalis and Couvet (2001a,b) assume that the demic mutation rate 4Nu and the demic recombination rate 4Nr are not necessarily small. The population rates of mutation and recombination in their model are, implicitly, infinite. One consequence of this is that their model predicts zero identity linkage disequilibrium when the migration rate for a deme is very large (or, equivalently, for a scattered sample); see Figures 4 and 5 in Vitalis and Couvet (2001b). In addition, their model would predict an infinite number of SNPs, or mutation events, at a locus. On the other hand, the large-D limit we consider predicts zero SNPs at a locus in a single-deme sample if no migration events occur in the history of the sample. Which of these models is most appropriate depends on whether allelic data or sequence data are analyzed and whether mutation and recombination events that occur in the recent history of a deme are the focus of study or can be ignored because they are far outnumbered by mutation and recombination events that occur earlier in the history of a sample.
APPENDIX
Using Equations 9 and 11, we obtain the following equilibrium equations for the expected product of tree lengths at two sites in the island model,
Solving the system of equations above and taking the limit as D goes to infinity, we obtain the following approximations for the expected product of tree lengths at the two loci:
Acknowledgments
We thank David Reich and Stephen Shaffner for helpful discussions of Reich et al. (2002) and for making their article available before it was published. Kristin Ardlie and Monty Slatkin provided helpful comments on an earlier version of the manuscript. J.W. was supported by a Career Award (DEB-0133760) and by a grant (DEB-9815367) from the National Science Foundation. S.L. was supported by grants from the Natural Sciences and Engineering Research Council of Canada, the Fonds Québécois de la Recherche sur la Nature et les Technologies, and the Université de Montréal.
Footnotes
-
Communicating editor: W. Stephan
- Received November 4, 2002.
- Accepted March 13, 2003.
- Copyright © 2003 by the Genetics Society of America