Abstract
The degree of association between alleles at different loci, or linkage disequilibrium, is widely used to infer details of evolutionary processes. Here I explore how associations between alleles relate to properties of the underlying genealogy of sequences. Under the neutral, infinite-sites assumption I show that there is a direct correspondence between the covariance in coalescence times at different parts of the genome and the degree of linkage disequilibrium. These covariances can be calculated exactly under the standard neutral model and by Monte Carlo simulation under different demographic models. I show that the effects of population growth, population bottlenecks, and population structure on linkage disequilibrium can be described through their effects on the covariance in coalescence times.
MEASURES of the nonrandom association between alleles at different loci, or linkage disequilibrium, are widely used to infer properties of population history, recombination, and the location of mutations contributing to disease susceptibility and adaptive evolution. Associations between alleles are generated by the stochastic nature of mutation and sampling in a finite population, as well as certain forms of geographical structure (e.g., Ohta 1982), and natural selection (e.g., Strobeck 1983). In contrast, recombination acts to break down such associations. Comparison of empirical patterns of linkage disequilibrium to those expected from population genetics theory, and across different genomic regions, can provide much information about the forces shaping genetic diversity.
The rise of coalescent theory (Kingman 1982) as a tool for interpreting patterns of genetic diversity in samples has led to a shift in focus in theoretical population genetics from mutations to genealogies. Most importantly, if mutations have no effect on organismal fitness, the genealogy of a sample can be separated entirely from the mutational process. Consequently, all information about important evolutionary parameters (such as demography and the action of selection at linked sites) is contained in the genealogy, which can be estimated only indirectly from the distribution of mutations among sampled chromosomes.
The question of how statistics of linkage disequilibrium relate to aspects of the underlying genealogy is therefore of considerable interest. Here I show that a quantity that approximates the expectation of a commonly used statistic of linkage disequilibrium, r², can be expressed in terms of covariances in coalescence times. The result provides an intuitive basis for understanding how linkage disequilibrium behaves under different demographic scenarios.
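For readers who want a concrete handle on the statistic, the following is a minimal sketch of how r² could be computed from a sample of two-locus haplotypes, using the standard definition as the squared correlation between allele indicators at the two loci, with D = f_AB - f_A f_B. The function name, the 0/1 coding of alleles, and the toy sample are illustrative assumptions, not part of the original analysis.

```python
def r_squared(haplotypes):
    """Squared correlation between alleles at two biallelic loci.

    `haplotypes` is a list of (a, b) pairs, where a and b are 0/1 allele
    codes at loci x and y (an illustrative encoding, assumed here).
    """
    n = len(haplotypes)
    f_a = sum(a for a, _ in haplotypes) / n            # frequency of allele 1 at x
    f_b = sum(b for _, b in haplotypes) / n            # frequency of allele 1 at y
    f_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = f_ab - f_a * f_b                               # disequilibrium coefficient D
    denom = f_a * (1 - f_a) * f_b * (1 - f_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    return d * d / denom


# Hypothetical sample of 10 chromosomes.
sample = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0)] + [(0, 1)]
print(r_squared(sample))
```

For this toy sample the value is about 0.36. Relabelling which allele is coded 1 at either locus leaves the result unchanged, because D only changes sign.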
GENEALOGICAL APPROACH
Linkage disequilibrium and identity coefficients: The r² statistic of linkage disequilibrium is equivalent to the square of the correlation coefficient between the alleles A at locus x and B at locus y,
Consider first the numerator in (2). The expectation of D is zero (irrespective of the recombination rate and demographic model), hence E[D²] = Var(D). Strobeck and Morgan (1978) and Hudson (1985) showed that the expected square of disequilibrium can be written in terms of identity coefficients for sets of sequences,
Identity coefficients in a genealogical context: We consider the identity coefficients in (3) for the case where both sites are polymorphic and each polymorphism is the result of a single mutation [single-nucleotide polymorphisms (SNPs)]. When there are just two alleles at both loci, the square of the disequilibrium coefficient is independent of how alleles are defined; hence we consider the identity coefficients between the derived mutations (denoted by an asterisk). The identity coefficient
For finite sample size, a modification is required to include the possibility that i, j, k, and l are not all distinct,
—Statistics of the genealogy.
—Cov[tx(ij), ty(ik)] measures the covariance in coalescence time at site x for chromosomes i and j and site y for chromosomes i and k.
Conditional linkage disequilibrium: If the expectation is conditioned on the exclusion of rare mutations (those represented fewer than a times in the sample), the covariances in coalescence times in (9) have to be augmented by the covariances in times between coalescing and the first point that the lineage ancestral to the MRCA has at least a descendants in the sample. However, the magnitude of the extra terms is small, and a good approximation is obtained with a slight modification to the denominator, replacing E[t] with E[t] - E[δa], where E[δa] is the expected time until an ancestral lineage has at least a descendants. In the standard coalescent E[δa] = 2(a - 1)/n for a < n (Saunders et al. 1984).
In practice, linkage disequilibrium is typically conditioned on the exclusion of rare alleles, rather than rare mutations. However, because rare alleles typically represent rare mutations, the error introduced by conditioning on rare mutations rather than rare alleles is small. For example, among loci for which the rare allele is represented only once, the rare allele represents the rare mutation with probability 1 - 1/n under the standard neutral model.
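The 1 - 1/n figure can be checked with a short calculation. The sketch below assumes the well-known neutral site-frequency spectrum, in which the expected number of sites whose derived mutation is carried by i of n chromosomes is proportional to 1/i. A site whose rarer allele has count k then arises either from a derived mutation at count k or from one at count n - k. The function name is hypothetical.

```python
from fractions import Fraction


def prob_rare_allele_is_derived(n, k):
    """Probability that the allele with count k (k <= n/2) is the derived one.

    Assumes the standard neutral expectation that sites with derived
    count i occur in proportion 1/i.
    """
    if not (1 <= k and 2 * k <= n):
        raise ValueError("require 1 <= k <= n/2")
    weight_derived_rare = Fraction(1, k)        # derived mutation at count k
    weight_derived_common = Fraction(1, n - k)  # derived mutation at count n - k
    return weight_derived_rare / (weight_derived_rare + weight_derived_common)


print(prob_rare_allele_is_derived(50, 1))   # 49/50, i.e. 1 - 1/n
print(prob_rare_allele_is_derived(50, 5))   # 9/10, i.e. 1 - k/n
```

Under this assumption the probability simplifies to 1 - k/n, so the approximation is best for the rarest alleles, which are the ones most often excluded.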
LINKAGE DISEQUILIBRIUM IN THE STANDARD NEUTRAL MODEL
The expectation of (9) can be derived under the standard coalescent using the results of Griffiths (1981, 1991; see also Pluzhnikov and Donnelly 1996). If the sample size is sufficiently large such that all sequences picked at random from the sample are distinct, the covariances in coalescence times (in units of 2Ne generations) are
DISCUSSION
Interpreting linkage disequilibrium in terms of the underlying genealogy can help in understanding the behavior of linkage disequilibrium under different demographic scenarios, such as population growth (Slatkin 1994; Kruglyak 1999), bottlenecks (Reich et al. 2001), and geographical subdivision (Wakeley et al. 2001).
—The relationship between the scaled recombination rate, (lines), and the average value of r² (points) for all segregating sites (solid line and triangles) and those for which the derived mutation is present in at least 10% of samples (dotted line and squares). Values of r² were obtained by coalescent simulation under the standard neutral model with n = 50.
Growing populations: The effect of population growth is to distort genealogies such that the mean coalescence time is reduced relative to the case of no growth and, more importantly, the variance in coalescence times is reduced even further. The effects of population growth on the correlations in coalescence times are more subtle. Consider two genealogies that have experienced the same number of recombination events, one generated under the standard neutral model and one generated under a growing population model. Under high rates of growth, gene genealogies assume a star-like shape, such that the vast majority of the total tree length is composed of external branches. So if a recombination event is thrown onto the genealogy, the probability that it occurs in the history of a randomly chosen pair of sequences from the sample approaches 2/n. In contrast, in a constant population size, the probability that the recombination event affects the ancestry of the chosen pair is
Population bottlenecks: Recent population bottlenecks can increase linkage disequilibrium considerably, because in contrast to the case of population growth, bottlenecks affect the mean coalescence time more than the variance. If the probability that a pair of chromosomes coalesces during a recent bottleneck is ϕ, and we assume the bottleneck is instantaneous (hence chromosomes coalescing during the bottleneck have coalescence time equal to zero), the mean coalescence time is 1 - ϕ and the variance is 1 - ϕ² (in units of 2Ne generations). So the ratio E[t]²/Var(t) is reduced relative to the case of no bottleneck.
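The mean and variance quoted for the instantaneous bottleneck can be reproduced with a small sketch. It assumes, as the stated mean implies, that a pair escaping the bottleneck has an exponentially distributed coalescence time with mean 1 (in units of 2Ne generations), so the pairwise time is 0 with probability ϕ and Exp(1) otherwise. The function names and the choice ϕ = 0.5 are illustrative only.

```python
import random
import statistics


def bottleneck_moments(phi):
    """Exact mean and variance of the pairwise coalescence time."""
    mean = 1.0 - phi                      # E[t] = (1 - phi) * 1
    second_moment = 2.0 * (1.0 - phi)     # E[t^2] of Exp(1) is 2
    variance = second_moment - mean ** 2  # equals 1 - phi^2
    return mean, variance


def simulate_pair_times(phi, n_pairs, seed=1):
    """Monte Carlo draws: 0 if the pair coalesces in the bottleneck, else Exp(1)."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < phi else rng.expovariate(1.0)
            for _ in range(n_pairs)]


phi = 0.5
mean, var = bottleneck_moments(phi)
times = simulate_pair_times(phi, 200_000)

print("exact:     mean", mean, "variance", var, "E[t]^2/Var(t)", mean ** 2 / var)
print("simulated: mean", statistics.fmean(times),
      "variance", statistics.pvariance(times))
```

Under these assumptions the ratio E[t]²/Var(t) equals (1 - ϕ)/(1 + ϕ), which is 1 with no bottleneck and falls toward 0 as ϕ approaches 1.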
The effect of population bottlenecks on correlations in coalescence time is more complex. Bottlenecks distort gene genealogies such that the majority of the tree length occurs when there are few ancestral lineages (those that survived the bottleneck); consequently most recombination events will influence these ancestral lineages. The correlations in coalescence time are therefore increased by the probability of coalescing during the bottleneck and decreased by the effects of prebottleneck recombination. For weak bottlenecks, ancestral recombination is more important, whereas for strong bottlenecks, correlations are increased by coalescence events during the bottleneck. Table 2 shows the effects of recent bottlenecks on the correlations in coalescence time for the same average number of recombination events in the history of the sample. Overall, the effect on the variance in coalescence times is more important than the effect on correlations, such that LD is increased by bottlenecks.
The effect of exponential population growth (rate λ) on genealogical correlations for a sample of n = 50 chromosomes (from 10⁶ coalescent simulations)
The effect of recent bottlenecks of severity ϕ (the probability of coalescence during the bottleneck for a pair of chromosomes) on genealogical correlations for a sample of n = 50 chromosomes (from 10⁶ coalescent simulations)
Population structure: Population structure increases linkage disequilibrium because of the correlations in coalescence times induced by coalescent events within subpopulations. This is true even for unlinked sites. Consider a two-deme model with symmetric migration between them at rate m per chromosome, per generation. Under such conditions, and assuming large n (sampled evenly from the two demes), the expected covariances in coalescence times at unlinked sites x and y are
Other measures of LD: It is worth noting that another widely used statistic of linkage disequilibrium, |D′| (Lewontin 1964), behaves in a very different manner to
Acknowledgments
Many thanks to Molly Przeworski, David Reich, Paul Fearnhead, Carsten Wiuf, Mikkel Schierup, Simon Myers, and two anonymous reviewers. G.M. is funded by the Royal Society.
Footnotes
- Communicating editor: W. Stephan
- Received April 10, 2002.
- Accepted July 15, 2002.
- Copyright © 2002 by the Genetics Society of America