A Genealogical Interpretation of Linkage Disequilibrium
Gilean A. T. McVean
Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom
Corresponding author: Gilean A. T. McVean, 1 S. Parks Rd., Oxford OX1 3TG, UK. E-mail: mcvean{at}stats.ox.ac.uk
Communicating editor: W. STEPHAN
| ABSTRACT |
|---|
The degree of association between alleles at different loci, or linkage disequilibrium, is widely used to infer details of evolutionary processes. Here I explore how associations between alleles relate to properties of the underlying genealogy of sequences. Under the neutral, infinite-sites assumption I show that there is a direct correspondence between the covariance in coalescence times at different parts of the genome and the degree of linkage disequilibrium. These covariances can be calculated exactly under the standard neutral model and by Monte Carlo simulation under different demographic models. I show that the effects of population growth, population bottlenecks, and population structure on linkage disequilibrium can be described through their effects on the covariance in coalescence times.
MEASURES of the nonrandom association between alleles at different loci, or linkage disequilibrium, are widely used to infer properties of population history, recombination, and the location of mutations contributing to disease susceptibility and adaptive evolution. Associations between alleles are generated by the stochastic nature of mutation and sampling in a finite population, as well as certain forms of geographical structure.
The rise of coalescent theory (![]()
The question of how statistics of linkage disequilibrium relate to aspects of the underlying genealogy is therefore of considerable interest. Here I show that a quantity that approximates the expectation of a commonly used statistic of linkage disequilibrium, $r^2$, can be expressed in terms of covariances in coalescence times. The result provides an intuitive basis for understanding how linkage disequilibrium behaves under different demographic scenarios.
| GENEALOGICAL APPROACH |
|---|
Linkage disequilibrium and identity coefficients:
The $r^2$ statistic of linkage disequilibrium is equivalent to the square of the correlation coefficient between the alleles A at locus x and B at locus y,

$$r^2 = \frac{D_{A(x)B(y)}^2}{f_{A(x)}(1 - f_{A(x)})\,f_{B(y)}(1 - f_{B(y)})}, \tag{1}$$

where $D_{A(x)B(y)} = f_{A(x)B(y)} - f_{A(x)}f_{B(y)}$ is the standard measure of linkage disequilibrium, with $f_{A(x)B(y)}$ indicating the frequency of chromosomes carrying the A and B alleles. Although it is impossible to derive a simple analytic expression for the expectation of (1), we can consider the related quantity of the ratio of expectations

$$\sigma_d^2 = \frac{E[D_{A(x)B(y)}^2]}{E[f_{A(x)}(1 - f_{A(x)})\,f_{B(y)}(1 - f_{B(y)})]}. \tag{2}$$

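The difference between the per-sample statistic in (1) and the ratio of expectations in (2) can be illustrated with a minimal sketch. The function names and the 0/1 coding of alleles are assumptions made for illustration; they do not come from the article.

```python
def ld_components(haplotypes):
    """Return (D, denominator) for one sample of two-locus haplotypes.

    haplotypes is a list of (a, b) pairs, where a is 1 if the chromosome
    carries allele A at locus x and b is 1 if it carries allele B at locus y.
    """
    n = len(haplotypes)
    f_a = sum(a for a, _ in haplotypes) / n
    f_b = sum(b for _, b in haplotypes) / n
    f_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = f_ab - f_a * f_b
    denom = f_a * (1 - f_a) * f_b * (1 - f_b)
    return d, denom


def r_squared(haplotypes):
    """Squared correlation between alleles for a single sample."""
    d, denom = ld_components(haplotypes)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    return d * d / denom


def sigma_d_squared(samples):
    """Ratio of the average D^2 to the average denominator across samples."""
    total_d2 = 0.0
    total_denom = 0.0
    for haplotypes in samples:
        d, denom = ld_components(haplotypes)
        total_d2 += d * d
        total_denom += denom
    return total_d2 / total_denom
```

Averaging `r_squared` over replicate samples estimates the expectation of (1), whereas `sigma_d_squared` estimates the ratio in (2); the two generally differ.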
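Consider first the numerator in (2). The expectation of D is zero (irrespective of the recombination rate and demographic model), hence $E[D^2] = \mathrm{Var}(D)$, which can be written in terms of two-locus identity coefficients,

$$E[D^2] = F_{x(ij)y(ij)} - 2F_{x(ij)y(ik)} + F_{x(ij)y(kl)}. \tag{3}$$
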
The three terms are, respectively, the probability that two sequences i and j are identical in state at both sites x and y; the probability that sequences i and j are identical at site x and that i and k are identical at site y; and finally, the probability that sequences i and j are identical at site x and sequences k and l are identical at site y. Note that for finite sample sizes, the possibility that i, j, k, and l are not all distinct has to be taken into account.
Identity coefficients in a genealogical context:
We consider the identity coefficients in (3) for the case where both sites are polymorphic and each polymorphism is the result of a single mutation [single-nucleotide polymorphisms (SNPs)]. When there are just two alleles at both loci, the square of the disequilibrium coefficient is independent of how alleles are defined; hence we consider the identity coefficients between the derived mutations (denoted by an asterisk). The identity coefficient $F^*_{x(ij)y(kl)}$ can be expressed as the expectation of the probability that the mutations occur in the portion of the genealogy ancestral to sequences i and j at site x and ancestral to k and l at site y, divided by the probability that one mutation occurs at each site. Assuming the mutation rate per base pair per generation, $\mu$, is the same at both sites,

$$F^*_{x(ij)y(kl)} = \frac{E[\mu I_{mx(ij)} e^{-\mu T_x}\,\mu I_{my(kl)} e^{-\mu T_y}]}{E[\mu T_x e^{-\mu T_x}\,\mu T_y e^{-\mu T_y}]}, \tag{4}$$

where $I_{mx(ij)}$ is the branch length (in generations) leading from the most recent common ancestor (MRCA) of sequences i and j to the MRCA of the entire sample and $T_x$ and $T_y$ are the total tree lengths at sites x and y. The mutation rate is a nuisance parameter that can be eliminated by taking the limit as $\mu \to 0$,

$$\lim_{\mu \to 0} F^*_{x(ij)y(kl)} = \frac{E[I_{mx(ij)} I_{my(kl)}]}{E[T_x T_y]}, \tag{5}$$

where $E[T_x T_y]$ is the expected product of the total tree length at sites x and y.
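By writing

$$I_{mx(ij)} = T_{mx} - t_{x(ij)} \tag{6}$$

(Fig 1), where $t_{x(ij)}$ is the coalescence time for sequences i and j at site x, and $T_{mx}$ is the time until the MRCA for the entire sample at site x, it can be shown that

$$\lim_{\mu \to 0} E[D^2] = \frac{E[t_{x(ij)} t_{y(ij)}] - 2E[t_{x(ij)} t_{y(ik)}] + E[t_{x(ij)} t_{y(kl)}]}{E[T_x T_y]}. \tag{7}$$
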
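The step from branch lengths to pairwise coalescence times can be sketched as follows. This outline assumes that the sampled chromosomes are exchangeable, so that $E[T_{mx}\,t_{y(ab)}]$ and $E[t_{x(ab)}\,T_{my}]$ take the same value for every pair (a, b); it is offered as a reading aid under that assumption, not as the article's own derivation.

```latex
\begin{align*}
E[I_{mx(ij)} I_{my(kl)}]
  &= E[(T_{mx} - t_{x(ij)})(T_{my} - t_{y(kl)})] \\
  &= E[T_{mx} T_{my}] - E[T_{mx}\, t_{y(kl)}] - E[t_{x(ij)}\, T_{my}]
     + E[t_{x(ij)}\, t_{y(kl)}].
\end{align*}
% The first three terms do not depend on the configuration (ij,ij), (ij,ik)
% or (ij,kl). Combining the three configurations with weights 1, -2, 1
% therefore cancels them, because 1 - 2 + 1 = 0:
\begin{align*}
E[I_{mx(ij)} I_{my(ij)}] - 2E[I_{mx(ij)} I_{my(ik)}] + E[I_{mx(ij)} I_{my(kl)}]
  = E[t_{x(ij)} t_{y(ij)}] - 2E[t_{x(ij)} t_{y(ik)}] + E[t_{x(ij)} t_{y(kl)}].
\end{align*}
% Each product E[t t'] equals E[t]^2 + Cov[t, t'], and the E[t]^2 terms
% cancel in the same way, so the right-hand side is also
% Cov[t_{x(ij)}, t_{y(ij)}] - 2 Cov[t_{x(ij)}, t_{y(ik)}] + Cov[t_{x(ij)}, t_{y(kl)}].
```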
We can use a similar procedure to find the denominator in Equation 2. The expectation $E[f_{A(x)}(1 - f_{A(x)})\,f_{B(y)}(1 - f_{B(y)})]$ for the case of SNPs can be expressed as the expected probability that two alleles drawn with replacement will be different at the x locus and another two drawn with replacement will be different at the y locus. Taking the limit as $\mu \to 0$,

$$\lim_{\mu \to 0} E[f_{A(x)}(1 - f_{A(x)})\,f_{B(y)}(1 - f_{B(y)})] = \frac{E[t]^2 + \mathrm{Cov}[t_{x(ij)}, t_{y(kl)}]}{E[T_x T_y]}, \tag{8}$$

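where $E[t]$ is the expected coalescence time for a pair of chromosomes. Combining Equation 7 and Equation 8 gives an expression for $\sigma_d^2$:

$$\sigma_d^2 = \frac{\mathrm{Cov}[t_{x(ij)}, t_{y(ij)}] - 2\,\mathrm{Cov}[t_{x(ij)}, t_{y(ik)}] + \mathrm{Cov}[t_{x(ij)}, t_{y(kl)}]}{E[t]^2 + \mathrm{Cov}[t_{x(ij)}, t_{y(kl)}]}. \tag{9}$$
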
In other words, the expected linkage disequilibrium as measured by the $r^2$ statistic can be approximated in terms of the covariance in coalescence times for pairs of sequences. For example, the middle term in the numerator of (9) is the covariance in coalescence time at site x for sequences i and j and at site y for sequences i and k; see Fig 2. More generally, the kth moment of the distribution of D will depend on the covariances in coalescence times for sets of up to k chromosomes ancestral at each site. Because no assumptions are made about the underlying demographic model in the derivation of (9), it provides a general way of describing the relationship between linkage disequilibrium and aspects of the underlying genealogy.
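The relationship described above can be written as a small helper that takes the three covariances and the mean pairwise coalescence time, however they were obtained (analytically or by simulation). This is a minimal sketch; the function and argument names are illustrative assumptions.

```python
def sigma_d2_from_covariances(cov_ij_ij, cov_ij_ik, cov_ij_kl, mean_t):
    """Approximate expected LD from covariances in pairwise coalescence times.

    cov_ij_ij: covariance of the times for the same pair at sites x and y
    cov_ij_ik: covariance for pairs sharing one chromosome
    cov_ij_kl: covariance for two disjoint pairs
    mean_t:    expected coalescence time for a pair of chromosomes
    All times must be measured in the same units.
    """
    numerator = cov_ij_ij - 2.0 * cov_ij_ik + cov_ij_kl
    denominator = mean_t ** 2 + cov_ij_kl
    return numerator / denominator
```

When the genealogies at the two sites are independent, all three covariances are zero and the function returns zero, as expected for sites in linkage equilibrium.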
For finite sample size, a modification is required to include the possibility that i, j, k, and l are not all distinct,
![]() |
(10) |
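$\sigma_d^2$ can also be written in terms of correlations in coalescence times, $\rho$,

$$\sigma_d^2 = \frac{\rho_{ij,ij} - 2\rho_{ij,ik} + \rho_{ij,kl}}{E[t]^2/\mathrm{Var}(t) + \rho_{ij,kl}}, \tag{11}$$
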
where the subscripts refer to the three configurations of sample chromosomes. One advantage of writing the expression in terms of correlations rather than covariances is that correlations will be influenced largely by recombination, whereas demographic factors can strongly influence the mean and variance of coalescence times.
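The correlation form follows from the covariance form in (9) by a single rescaling. The sketch below assumes that the variance of the pairwise coalescence time, $\mathrm{Var}(t)$, is the same at both sites and for every pair, so that each correlation is the corresponding covariance divided by $\mathrm{Var}(t)$.

```latex
\begin{align*}
\sigma_d^2
  &= \frac{\bigl(\mathrm{Cov}[t_{x(ij)}, t_{y(ij)}] - 2\,\mathrm{Cov}[t_{x(ij)}, t_{y(ik)}]
       + \mathrm{Cov}[t_{x(ij)}, t_{y(kl)}]\bigr)/\mathrm{Var}(t)}
          {\bigl(E[t]^2 + \mathrm{Cov}[t_{x(ij)}, t_{y(kl)}]\bigr)/\mathrm{Var}(t)} \\
  &= \frac{\rho_{ij,ij} - 2\rho_{ij,ik} + \rho_{ij,kl}}
          {E[t]^2/\mathrm{Var}(t) + \rho_{ij,kl}}.
\end{align*}
```

Written this way, demography enters mainly through the single ratio $E[t]^2/\mathrm{Var}(t)$, which is what the later sections on growth and bottlenecks examine.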
Conditional linkage disequilibrium:
If the expectation is conditioned on the exclusion of rare mutations (those represented fewer than a times in the sample), the covariances in coalescence times in (9) have to be augmented by the covariances in times between coalescing and the first point that the lineage ancestral to the MRCA has at least a descendants in the sample. However, the magnitude of the extra terms is small, and a good approximation is obtained with a slight modification to the denominator, replacing $E[t]$ with $E[t] - E[\tau_a]$, where $E[\tau_a]$ is the expected time until an ancestral lineage has at least a descendants. In the standard coalescent
for a < n (![]()
In practice, linkage disequilibrium is typically conditioned on the exclusion of rare alleles, rather than rare mutations. However, because rare alleles typically represent rare mutations, the error introduced by conditioning on rare mutations rather than rare alleles is small. For example, among loci for which the rare allele is represented only once, the rare allele represents the rare mutation with probability 1 - 1/n under the standard neutral model.
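The figure of 1 - 1/n can be checked with a short calculation. The sketch assumes the standard neutral result that the expected number of sites at which the derived mutation appears i times in a sample of n is proportional to 1/i; the function name is hypothetical.

```python
from fractions import Fraction


def prob_singleton_is_derived(n):
    """Probability that an allele seen once in n chromosomes is the mutation.

    A site shows a singleton allele either when the derived mutation is
    present once (weight 1/1) or when it is present n - 1 times, leaving the
    ancestral allele as the singleton (weight 1/(n - 1)).
    """
    if n < 3:
        raise ValueError("n must be at least 3 so the two cases are distinct")
    weight_derived_once = Fraction(1, 1)
    weight_derived_n_minus_1 = Fraction(1, n - 1)
    return weight_derived_once / (weight_derived_once + weight_derived_n_minus_1)
```

For a sample of 10 chromosomes this gives 9/10, so the rare allele is the derived mutation at nine singleton sites in ten on average.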
| LINKAGE DISEQUILIBRIUM IN THE STANDARD NEUTRAL MODEL |
|---|
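The expectation of (9) can be derived under the standard coalescent from the covariances in pairwise coalescence times. With time measured in units of $2N_e$ generations and $C = 4N_e r$, where r is the recombination rate between the two sites per generation, $E[t] = 1$ and

$$\mathrm{Cov}[t_{x(ij)}, t_{y(ij)}] = \frac{18 + C}{18 + 13C + C^2}, \quad \mathrm{Cov}[t_{x(ij)}, t_{y(ik)}] = \frac{6}{18 + 13C + C^2}, \quad \mathrm{Cov}[t_{x(ij)}, t_{y(kl)}] = \frac{4}{18 + 13C + C^2},$$

hence the ratio of the expectations (2) is

$$\sigma_d^2 = \frac{10 + C}{22 + 13C + C^2}. \tag{12}$$
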
This is the same result as given by ![]()
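The link between the covariance form in (9) and a closed-form expression under the standard neutral model can be checked numerically. The covariance expressions used below are the commonly quoted standard-coalescent results for time in units of 2Ne generations, with C = 4Ner; they are taken here as assumptions, and the function names are illustrative.

```python
def neutral_covariances(c):
    """Covariances in pairwise coalescence times (assumed standard results).

    Returns the covariances for the configurations (ij, ij), (ij, ik) and
    (ij, kl) as a function of the scaled recombination rate c = 4 * Ne * r.
    """
    denom = 18.0 + 13.0 * c + c * c
    return (18.0 + c) / denom, 6.0 / denom, 4.0 / denom


def sigma_d2_neutral(c):
    """Evaluate the covariance form of sigma_d^2 with E[t] = 1."""
    cov_ij_ij, cov_ij_ik, cov_ij_kl = neutral_covariances(c)
    mean_t = 1.0
    numerator = cov_ij_ij - 2.0 * cov_ij_ik + cov_ij_kl
    return numerator / (mean_t ** 2 + cov_ij_kl)


def sigma_d2_closed_form(c):
    """Closed form obtained by simplifying the expression above."""
    return (10.0 + c) / (22.0 + 13.0 * c + c * c)
```

The two functions agree for any non-negative c; at c = 0 both give 5/11, and both decay roughly as 1/c when recombination is frequent.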
| DISCUSSION |
|---|
Interpreting linkage disequilibrium in terms of the underlying genealogy can help in understanding the behavior of linkage disequilibrium under different demographic scenarios, such as population growth, population bottlenecks, and population structure.
Growing populations:
The effect of population growth is to distort genealogies such that the mean coalescence time is reduced relative to the case of no growth and, more importantly, the variance in coalescence times is reduced even further. The effects of population growth on the correlations in coalescence times are more subtle. Consider two genealogies that have experienced the same number of recombination events, one generated under a standard neutral model and one generated under a growing population model. Under high rates of growth, gene genealogies assume a star-like shape, such that the vast majority of the total tree length is composed of external branches. So if a recombination event is thrown onto the genealogy, the probability that it occurs in the history of a randomly chosen pair of sequences from the sample approaches 2/n. In contrast, in a population of constant size, the probability that the recombination event affects the ancestry of the chosen pair is $1/\sum_{i=1}^{n-1} 1/i$, which is >2/n for n > 3. Consequently, in growing populations fewer recombination events will influence the history of a randomly chosen pair of chromosomes from the sample, leading to higher correlations in coalescence times; see Table 1. Overall, the reduction in variance of coalescence times caused by population growth has a greater effect on linkage disequilibrium (LD) than the increase in correlations, leading to a decrease in LD.
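The comparison between star-like and constant-size genealogies can be made concrete with exact fractions. The sketch assumes that, for a constant population size, the probability is approximated by the ratio of expected branch lengths: twice the expected pairwise coalescence time divided by the expected total tree length, which in units of 2Ne generations is 2 divided by twice the sum of 1/i for i from 1 to n - 1. Function names are illustrative.

```python
from fractions import Fraction


def pair_fraction_constant_size(n):
    """Share of tree length ancestral to a given pair, constant population."""
    harmonic = sum(Fraction(1, i) for i in range(1, n))
    return 1 / harmonic


def pair_fraction_star(n):
    """Share of tree length ancestral to a given pair in a star genealogy."""
    return Fraction(2, n)
```

For n = 50 the constant-size value is about 0.22, compared with 0.04 for a star-like genealogy, so a given recombination event is several times more likely to touch the history of a chosen pair when the population has not grown.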
Population bottlenecks:
Recent population bottlenecks can increase linkage disequilibrium considerably, because in contrast to the case of population growth, bottlenecks affect the mean coalescence time more than the variance. If the probability that a pair of chromosomes coalesces during a recent bottleneck is $p$, and we assume the bottleneck is instantaneous (hence chromosomes coalescing during the bottleneck have coalescence time equal to zero), the mean coalescence time is $1 - p$ and the variance is $1 - p^2$ (in units of $2N_e$ generations). So the ratio $E[t]^2/\mathrm{Var}(t) = (1 - p)/(1 + p)$ is reduced relative to the case of no bottleneck, where it equals one.
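The mean and variance quoted for the instantaneous bottleneck can be checked by simulation. In this minimal sketch the pairwise coalescence time is zero with probability p and otherwise exponential with mean one (in units of 2Ne generations); the symbol p, the function names, and the fixed seed are assumptions made for illustration.

```python
import random


def simulate_bottleneck_times(p, n_draws, seed=0):
    """Draw pairwise coalescence times under an instantaneous bottleneck."""
    rng = random.Random(seed)
    times = []
    for _ in range(n_draws):
        if rng.random() < p:
            # The pair coalesces during the bottleneck, at time zero.
            times.append(0.0)
        else:
            # The pair escapes and coalesces as in a constant population.
            times.append(rng.expovariate(1.0))
    return times


def mean_and_variance(values):
    """Return the sample mean and the (population) variance of values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return mean, variance
```

With p = 0.3 the simulated mean and variance should lie close to 0.7 and 0.91, giving a ratio of squared mean to variance near 0.54 rather than one.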
The effect of population bottlenecks on correlations in coalescence time is more complex. Bottlenecks distort gene genealogies such that the majority of the tree length occurs when there are few ancestral lineages (those that survived the bottleneck); consequently most recombination events will influence these ancestral lineages. The correlations in coalescence time are therefore increased by the probability of coalescing during the bottleneck and decreased by the effects of prebottleneck recombination. For weak bottlenecks, ancestral recombination is more important, whereas for strong bottlenecks, correlations are increased by coalescence events during the bottleneck. Table 2 shows the effects of recent bottlenecks on the correlations in coalescence time for the same average number of recombination events in the history of the sample. Overall, the effect on the variance in coalescence times is more important than the effect on correlations, such that LD is increased by bottlenecks.
Population structure:
Population structure increases linkage disequilibrium because of the correlations in coalescence times induced by coalescent events within subpopulations. This is true even for unlinked sites. Consider a two-deme model with symmetric migration between them at rate m per chromosome, per generation. Under such conditions, and assuming large n (sampled evenly from the two demes), the expected covariances in coalescence times at unlinked sites x and y (in units of $2N_e$ generations) are

$$\mathrm{Cov}[t_{x(ij)}, t_{y(ij)}] = \frac{1}{4M^2}, \quad \mathrm{Cov}[t_{x(ij)}, t_{y(ik)}] = \mathrm{Cov}[t_{x(ij)}, t_{y(kl)}] = 0,$$

where $M = 4N_e m$ ($N_e$ is the sum of the effective population sizes for the two demes). So the ratio of expectations (2) is

$$\sigma_d^2 = \frac{1}{(1 + 2M)^2}. \tag{13}$$

The implication of the result is that significant LD, even between unlinked markers, is expected in subdivided populations when the population migration rate is low (M < 1).
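How LD between unlinked sites arises from structure alone can be sketched by conditioning on where the sampled chromosomes come from. The sketch assumes standard structured-coalescent expectations for two demes (mean pairwise coalescence time 1 within a deme and 1 + 1/M between demes, in units of 2Ne generations), genealogies at unlinked sites that are independent given the sampling locations, and a large, evenly sampled n so that a random pair comes from the same deme with probability 1/2. The function name is illustrative.

```python
def sigma_d2_two_demes_unlinked(m_scaled):
    """Ratio of expectations for unlinked sites in a symmetric two-deme model.

    m_scaled is M = 4 * Ne * m, with Ne the total effective size of both demes.
    """
    mean_within = 1.0
    mean_between = 1.0 + 1.0 / m_scaled
    mean_t = 0.5 * mean_within + 0.5 * mean_between
    # Same pair at both sites: the covariance is the variance, across
    # sampling configurations, of the conditional mean coalescence time.
    cov_ij_ij = (0.5 * (mean_within - mean_t) ** 2
                 + 0.5 * (mean_between - mean_t) ** 2)
    # Pairs sharing one chromosome, or disjoint pairs, have independent
    # same-deme indicators under the large-n assumption.
    cov_ij_ik = 0.0
    cov_ij_kl = 0.0
    numerator = cov_ij_ij - 2.0 * cov_ij_ik + cov_ij_kl
    return numerator / (mean_t ** 2 + cov_ij_kl)
```

Under these assumptions the result simplifies to 1/(1 + 2M)^2, which is 0.25 at M = 0.5 and falls below 0.01 once M exceeds about 4.5.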
Other measures of LD:
It is worth noting that another widely used statistic of linkage disequilibrium, |D'|, cannot be interpreted in terms of covariances in pairwise coalescence times in the same way as $\sigma_d^2$. This is because |D'| can be less than one only if all four possible haplotypes are present for a pair of segregating sites. The expectation of |D'| therefore depends on higher moments of coalescence times than the expectation of $r^2$.
| ACKNOWLEDGMENTS |
|---|
Many thanks to Molly Przeworski, David Reich, Paul Fearnhead, Carsten Wiuf, Mikkel Schierup, Simon Myers, and two anonymous reviewers. G.M. is funded by the Royal Society.
Manuscript received April 10, 2002; Accepted for publication July 15, 2002.
| LITERATURE CITED |
|---|
GRIFFITHS, R. C., 1981 Neutral two-locus multiple allele models with recombination. Theor. Popul. Biol. 19:169-186.
GRIFFITHS, R. C., 1991 The two-locus ancestral graph, pp. 100-117 in Selected Proceedings on the Symposium on Applied Probability (IMS Lecture Notes, Monograph Series, Vol. 18), edited by I. V. BASAWA and R. L. TAYLOR. Institute of Mathematical Statistics, Hayward, CA.
HILL, W. G. and A. R. ROBERTSON, 1968 Linkage disequilibrium in finite populations. Theor. Appl. Genet. 38:226-231.
HUDSON, R. R., 1985 The sampling distribution of linkage disequilibrium under an infinite allele model without selection. Genetics 109:611-631.
KAPLAN, N. and R. R. HUDSON, 1985 The use of sample genealogies for studying a selectively neutral m-loci model with recombination. Theor. Popul. Biol. 28:382-396.
KINGMAN, J. F. C., 1982 The coalescent. Stoch. Proc. Appl. 13:235-248.
KRUGLYAK, L., 1999 Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22:139-144.
LEWONTIN, R. C., 1964 The interaction of selection and linkage. I. General considerations; heterotic models. Genetics 49:49-67.
NIELSEN, R., 2000 Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154:931-942.
OHTA, T., 1982 Linkage disequilibrium with the island model. Genetics 101:139-155.
OHTA, T. and M. KIMURA, 1971 Linkage disequilibrium between two segregating nucleotide sites under the steady flux of mutations in a finite population. Genetics 68:571-580.
PLUZHNIKOV, A. and P. DONNELLY, 1996 Optimal sequencing strategies for surveying molecular genetic diversity. Genetics 144:1247-1262.
REICH, D. E., M. CARGILL, S. BOLK, J. IRELAND, and P. C. SABETI et al., 2001 Linkage disequilibrium in the human genome. Nature 411:199-204.
SAUNDERS, I., S. TAVARÉ, and G. WATTERSON, 1984 On the genealogy of nested subsamples from a haploid population. Adv. Appl. Probab. 16:471-491.
SLATKIN, M., 1994 Linkage disequilibrium in growing and stable populations. Genetics 137:331-336.
STROBECK, C., 1983 Expected linkage disequilibrium for a neutral locus linked to a chromosomal arrangement. Genetics 103:545-555.
STROBECK, C. and K. MORGAN, 1978 The effect of intragenic recombination on the number of alleles in a finite population. Genetics 88:829-844.
SVED, J. A., 1971 Linkage disequilibrium and homozygosity of chromosome segments in finite populations. Theor. Popul. Biol. 2:125-141.
WAKELEY, J., R. NIELSEN, S. N. LIU-CORDERO, and K. ARDLIE, 2001 The discovery of single-nucleotide polymorphisms and inferences about human demographic history. Am. J. Hum. Genet. 69:1332-1347.
WEIR, B. S. and W. G. HILL, 1986 Nonuniform recombination within the human β-globin gene cluster. Am. J. Hum. Genet. 38:776-778.





= 4Ner, 









) on genealogical correlations for a sample of n = 50 chromosomes (from 106 coalescent simulations)


