## Abstract

The “LD curve” relates the linkage disequilibrium (LD) between pairs of nucleotide sites to the distance that separates them along the chromosome. The shape of this curve reflects natural selection, admixture between populations, and the history of population size. This article derives new results about the last of these effects. When a population expands in size, the LD curve grows steeper, and this effect is especially pronounced following a bottleneck in population size. When a population shrinks, the LD curve rises but remains relatively flat. As LD converges toward a new equilibrium, its time path may not be monotonic. Following an episode of growth, for example, it declines to a low value before rising toward the new equilibrium. These changes happen at different rates for different LD statistics. They are especially slow for estimates of , which therefore allow inferences about ancient population history. For the human population of Europe, these results suggest a history of population growth.

LINKAGE disequilibrium (LD) refers to the statistical association between pairs of genetic loci. It is used routinely in localizing disease genes, in detecting natural selection, and in studying population history. In all of these contexts, it is necessary to account for effects of changes in population size.

These effects arise because inhabitants of small populations tend to be close relatives. The genealogical paths that separate them are short, and this reduces the opportunity for recombination. For this reason, LD rises after a fall in population size and falls after a rise.

These effects are understood in a general way and are often studied by computer simulation (Kruglyak 1999; Pritchard and Przeworski 2001). Although this approach has led to important insights, our understanding is still rudimentary. This article uses a deterministic algorithm to explore the effects of growth, of decline, and of temporary reductions (bottlenecks) in population size. It shows that each type of history leaves a distinctive signature in the “LD curve,” which relates the LD between pairs of sites to the distance that separates them along the chromosome.

The paper uses (defined below) as a measure of LD. This choice is unusual, because has always been of secondary importance. As we shall see, however, has dynamical properties that give it deeper time depth than alternative measures of LD. It is readily estimated from data and can be predicted by a deterministic theory, which makes it easy to study the response to changes in population size. This article shows that is of more than secondary importance. It is useful in its own right as a measure of LD.

## Materials and Methods

### Software

The various methods were evaluated against simulated data generated by MACS (Chen *et al.* 2009). Simulation results in Figure 1 are based on my own coalescent program, hetage, written in C. All other analyses were done using ldpsiz, a package of C programs available at github.com/alanrogers/ldpsiz. This package includes several programs: eld, which estimates and *r*^{2} from genetic data; preld, which calculates from the history of population size; and sald, which fits history parameters to an observed LD curve, using the method of simplex-simulated annealing (Press *et al.* 1992, pp. 451–455), with modifications described by Gao and Han (2012).

### Measuring LD

Consider a pair of loci (nucleotide sites), *A* and *B*. At locus *A*, alleles *A*_{1} and *A*_{0} have frequencies *a* and 1 − *a*. At locus *B*, alleles *B*_{1} and *B*_{0} have frequencies *b* and 1 − *b*. The disequilibrium coefficient, *D*, is defined such that *ab* + *D* is the frequency of gamete type *A*_{1}*B*_{1}. The sign of *D* is arbitrary, depending on how one labels the alleles, so the magnitude of LD is often measured by *D*^{2}.

These measures are sensitive to heterozygosity at the two loci, so many authors prefer the squared correlation coefficient (Hill and Robertson 1968), (1)Unfortunately, there is no consensus regarding the expected value of this statistic, even in the simplest case of neutral loci in a randomly mating population of constant size (compare Sved 2009 with Song and Song 2007 and Durrett 2008, p. 98).

Ohta and Kimura (1969, p. 233) proposed a related measure of LD, the “squared standard linkage deviation,” which was also motivated by a desire to minimize the effect of heterozygosity: (2)This measure is usually viewed as an approximation to *E*[*r*^{2}]. It is most useful in this role when the population size is large and constant and both loci have appreciable heterozygosity (Hudson 1985). In other situations, the two parameters can differ greatly. But even when fails as an approximation, it is still useful as a measure of LD.

This is not to say that provides a complete description of variation at a pair of loci. That is a problem with multiple dimensions, which cannot be solved by any scalar-valued measure of LD (Weir 1996, pp. 125–127). Such measures are simplifications and are necessarily incomplete (Schaper *et al.* 2012). However, as we shall see, captures enough to provide an interesting window into the history of population size.

### Estimation of

One can estimate from data by replacing the expected values in Equation 2 with averages across the genome. For this purpose, I treat each polymorphic site [single-nucleotide polymorphism (SNP)] as “focal” and compare the focal SNP with each other SNP within some given range. From each comparison, I calculate *D*^{2} and *a*(1 − *a*)*b*(1 − *b*). These values are tabulated separately within bins based on the distance in centimorgans between the SNPs. After all comparisons have been tabulated, the estimate is calculated as the sum of *D*^{2} within a bin divided by the corresponding sum of *a*(1 − *a*)*b*(1 − *b*).

In this article, I ignore pairs of SNPs separated by >0.3 cM, because there is little LD at these distances in most parts of the human genome.

To estimate uncertainties, I use a moving-blocks bootstrap (Lieu and Singh 1992), with 300 SNPs per block. With diploid data, some genotypes may be unphased. In such cases, I use the EM algorithm to estimate *D*^{2}, as explained in the *Appendix*.

### Deterministic evolution of

Under neutral evolution with constant population size, converges to an equilibrium value (Ohta and Kimura 1971; McVean 2002). In addition, several authors have introduced recurrence equations, which make it possible to study the transient behavior of after changes in population size (Weir and Cockerham 1974; Hill 1975; Strobeck and Morgan 1978). The model of Strobeck and Morgan (1978) allows faster calculations but is less numerically stable than that of Hill (1975). I present some results using the former model but focus primarily on the latter. In the *Appendix*, I summarize Hill’s model and show that it holds not only under the mutational model that he studied, but also under the model of infinite sites (Kimura 1969). It is thus appropriate for use with DNA sequence data.

For bootstrap confidence intervals, I accelerate these calculations by using Equation A2 of the *Appendix* to approximate a system of ordinary differential equations, which is then solved using standard software.

### Sampling bias

Drawing a sample is equivalent to one generation of evolution with a very small population—one whose size equals that of the sample. Because drift introduces LD, there is much more LD in a sample than in the population from which it was drawn.

Hudson (1985, Equation 6) showed that sampling bias in *r*^{2} equals 1/*n*, when the sample consists of *n* gametes. Sampling bias also inflates the value of The program preld includes this bias in calculations of the expected value, These calculations use the model of Hill (1975) or Strobeck and Morgan (1978) to project the state vector forward one generation, with the population size set equal to the sample size.

### Fitting population history to an observed LD curve

The program eld (a part of ldpsiz) estimates the LD curve from genetic data. A second program, sald, can then be used to search for the population history parameters that minimize the sum of squared differences between observed and predicted LD curves. This minimization problem searches a complex surface, with many local minima. To search for the global minimum, sald uses the method of simplex-simulated annealing. For each data set, sald starts 10 simulated annealing jobs on parallel threads. Typically, several of these converge to the same global minimum. sald returns the best of the 10 solutions as fitted parameters.

## Results

### is average *r*^{2}, weighted by heterozygosities

Equation 2 is equivalent to (3)where *H _{A}* = 2

*a*(1 −

*a*) is the expected heterozygosity at locus

*A*and

*H*is that at locus

_{B}*B*. This implies that is the expectation of

*r*

^{2}weighted by the product of the two heterozygosities.

This weighting also carries over to the estimate, which is obtained by replacing expectations with averages. Such estimates will be insensitive to loci with low heterozygosity and, for this reason, also insensitive to sequencing error. They should be useful with low-coverage sequence data.

### Loci with high heterozygosity have deep gene trees

Weighting by heterozygosity exaggerates the influence of unusually deep gene trees. At some given nucleotide position, let *T* represent the age of the last common ancestor of the sample. I call this the “depth” of the gene tree for that nucleotide position. Griffiths and Tavaré (1998, Equation 1.5, p. 276) derive the conditionally expected depth, *E*[*T* | *x*, *n*], given that *x* copies of the derived allele were observed in a sample of haploid size *n*. Given the heterozygosity, we can solve for *x*, but we cannot tell the derived from the ancestral allele. It may be present in *x* copies or in *n* − *x*. At mutation–drift equilibrium, however, these alternatives have probabilities proportional to 1/*x* and 1/(*n* − *x*) (Fu 1995, equation 1). Using these values as weights, I average *E*[*T* | *x*, *n*] and *E*[*T* | *n* − *x*, *n*] to calculate expected tree depth given heterozygosity. The results are shown as a solid line in Figure 1. As Figure 1 shows, tree depth increases with heterozygosity. Simulated values—shown as open circles—agree closely with the theory.

Simulated values also allow us to examine the upper tail of the distribution, as seen in Figure 1, right. The 95th percentile of tree depth increases with heterozygosity even more steeply than does the mean. Thus, heterozygosity has an exaggerated effect on the upper tail of the distribution.

Geneticists often study loci selected for their high heterozygosity (Lewontin 1967; Rogers and Jorde 1996; Clark *et al.* 2005). A sample of loci selected in this fashion will have gene trees that are unusually deep, especially in the upper tail of the distribution.

Because is weighted by heterozygosity, it exaggerates the influence of these deep gene trees. For this reason, it should be sensitive to earlier portions of a population’s history. Following a change in population size, it should converge more slowly to the new equilibrium.

### Effect of population growth

LD will decline following an expansion of population size, because genetic drift is weaker in large populations and produces less LD. This process is illustrated in Figure 2, which shows the effect of an expansion from size 2*N* = 10^{3} to 2*N* = 10^{5}. The LD curve of the initial population (labeled *t* = 0) represents an equilibrium between mutation, drift, and recombination. After the expansion, the LD curve will eventually converge to a new equilibrium, which is labeled *t* = ∞. This new equilibrium, however, is reached only gradually.

LD is measured by in Figure 2, left, and by *r*^{2} in Figure 2, right. The dashed lines were calculated using the method of Hill (1975), and the open circles were estimated from data simulated using MACS (Chen *et al.* 2009). Note that is still far from equilibrium at generation 500 and does not approach equilibrium until about generation 1500.

The situation is very different in Figure 2, right, where LD is measured by *r*^{2}. By generation 500, *r*^{2} is close to equilibrium. At the left edge of the graph, it is slightly *below* the equilibrium. We return to this point later, but for the moment, note simply that *r*^{2} converges much faster than For this reason, the two measures are useful for studying different timescales. We learn from *r*^{2} about the recent past and from about more ancient events. Presumably, this difference arises because is weighted toward loci with high heterozygosity. As shown in Figure 1, this weighting makes it more sensitive to the distant past.

Returning to Figure 2, left, note that the right portion of the curve converges faster than the left. This is because the postexpansion population is large, and genetic drift is weak. The dynamics are therefore dominated by recombination, which is stronger on the right side of the graph. The result is that midway through the process—say at generation 500—the LD curve declines very steeply. Thus, a steeply declining LD curve is the signature of recent population growth (Kruglyak 1999; Pritchard and Przeworski 2001).

### The method of Hayes *et al.* (2003)

Figure 2 suggests that is likely to be of greater use than *r*^{2} in inferring the ancient history of population size. Yet the latter statistic is often used instead (Hayes *et al.* 2003; McEvoy *et al.* 2011; Tenesa *et al.* 2007), using a method that is based on the formula *E*[*r*^{2}] ≈ *ρ*, where (4)Here, *N* is diploid population size, and *c* is the recombination rate (Sved 1971; Sved and Feldman 1973). Although this formula has been criticized (Littler 1973, p. 272; McVean 2002, pp. 987–988; McVean 2007, p. 923; Durrett 2008, p. 98), it is widely used.

Hayes *et al.* (2003) study a generalization of *r*^{2} and argue from simulations that the expectation of this statistic is ≈*ρ*. Furthermore, this approximation even works for populations that have increased in size at a constant linear rate, provided that one interprets *N* as the population size that obtained 1/2*c* generations in the past. By inverting Equation 4, they are able to estimate *N* over a wide range of values of *c*, which correspond under their model to different points in the past.

Tenesa *et al.* (2007) work directly with *r*^{2}, but estimate the history of population size in the same way. They also employ a modified predictor, (5)which in their view accounts better for the effect of mutation. This method is also used by McEvoy *et al.* (2011). There are two difficulties with this formula. First, Tenesa *et al.* (2007, p. 521) derive it from a formula of Hill (1975, p. 124), which refers not to *E*[*r*^{2}] but to Thus, it approximates *E*[*r*^{2}] only to the extent that does. In addition, it uses a fairly crude standard of approximation, taking 10 as ≈11.

The arguments justifying both of these methods assume that population size has increased linearly. Of course, no population can increase linearly forever, so the period of linear increase must end at some point in the past. Furthermore, the method’s results are often interpreted in terms of nonlinear growth trajectories, such as bottlenecks (Hayes *et al.* 2003, pp. 639–670; Tenesa *et al.* 2007, p. 525; McEvoy *et al.* 2011, p. 822). Let us consider how the method behaves in this broader context.

Figure 3 considers a population that expands suddenly in size and is observed 500 generations later. In both panels of Figure 3, the solid lines show predicted LD. In Figure 3, left, LD is measured by and predicted by (Hill 1975). In Figure 3, right, it is measured by *r*^{2} and predicted by *ρ* and The open circles show values simulated using MACS (Chen *et al.* 2009). In the context of this population history, provides excellent predictions of but neither *ρ* nor predicts *r*^{2}.

Presumably, this reflects the nonlinearity of the growth trajectory assumed in Figure 3. We expect better accuracy in Figure 4, which evaluates *ρ* and against a history involving 300 generations of linear population growth, approximated as a series of 20 steps. The period of linear growth corresponds to the region to the right of the dashed line in Figure 4, right. At least in this region, predicted and simulated values should match. Yet as before, predicts well, but *ρ* and do not.

Finally, Figure 5 evaluates *ρ*, and against a history of constant population size. In this case, we get an accurate prediction of from and a fairly accurate prediction from This makes sense, because (as mentioned above) is an approximation to the equilibrium formula for On the other hand, neither *ρ* nor provides a useful prediction of *r*^{2}.

In these analyses, the observed and predicted values of *r*^{2} are distressingly far apart. Perhaps this discrepancy reflects an error in the simulation software (MACS) or in my own code for estimating *r*^{2} from simulated data. To find out, I used this software to replicate the published results of Hudson (1985). The results, shown in Figure 6, are reassuring. Given the same parameter values, the code used here produces results indistinguishable from those of Hudson.

It seems fair, therefore, to conclude that the problem lies in the predictors, *ρ* and neither of which is useful in the cases studied here. It seems unlikely that they will be useful as a basis for inference in real populations.

### Effect of population collapse

A collapse in population size produces an effect on the LD curve very different from that of population growth. This is shown in Figure 7, which illustrates two important points. First, the whole process is over very quickly. Even with we cannot look very far into the past. Second, the transient curves (the dashed lines) are quite flat. As time passes, the initial curve (labeled *t* = 0) is elevated without much change in shape. Presumably, this is because the effect of drift is dominant in a small population, so differences in recombination rate do not matter much. The result is that the LD curve becomes high but flat after a collapse in population size.

This pattern is superficially similar to that generated by gene conversion, which reduces LD between closely linked sites (Ardlie *et al.* 2001). It seems unlikely, however, that these effects will be confused. The effect of gene conversion is probably minor for sites separated by >1 kb (Frisse *et al.* 2001, p. 838; Chen *et al.* 2007, p. 764), whereas that of a population collapse extends much farther.

### Effect of a bottleneck

Figure 8 shows the effect of a 100-generation bottleneck in population size. Population size was 2*N* = 10^{3} during the bottleneck but 10^{5} before and after. The bottleneck ended at time 0, and we observe it in various subsequent generations. In the graph, the curve labeled *t* = 0 is at the end of the bottleneck and exhibits an LD curve that is high but flat, for reasons just discussed. As time passes, the right portion of the curve (the portion with high rates of recombination) falls much faster than the left, so that by generation 1000, the curve is almost L shaped. This is the signature of a bottleneck in population size.

Note the dramatic difference between Figure 2 and Figure 8—between expansion from equilibrium and from a bottleneck. Both curves are declining toward the same equilibrium, but the left portion of the curve declines much more slowly after a bottleneck than after expansion from equilibrium. This can only reflect the state of the population just before the increase in size. The initial population of Figure 2 had been small much longer than that of Figure 8. Consequently, it had less heterozygosity. As we see in the next section, this accelerates the rate of decline in LD between closely linked sites.

It is sometimes suggested that bottlenecks inflate long-range LD, just the opposite of the pattern seen above (Slatkin 2008, pp. 481–482; Tenaillon *et al.* 2008). This discrepancy, however, evaporates on close inspection. When Tenaillon *et al.* (2008, figure 6) discuss “long-range” LD, they are referring to pairs of nucleotide sites separated by ∼0.01 cM. In my own analysis, such pairs would be called “tightly linked” and would fall at the left edge of the LD curve. In this region of Figure 8, LD is indeed elevated, in agreement with Tenaillon *et al.* (2008). In another study, Thornton and Andolfatto (2006, p. 1611) report elevated LD following a bottleneck. But as they do not discuss the shape of the curve, there is no conflict between their results and mine.

### The time paths of individual points on the LD curve

Figure 9 shows the time path of after an expansion in population size. Because the new population is larger, the new equilibrium will have less LD. But does not decline monotonically toward this new equilibrium. Instead, it falls rapidly to a smaller value before climbing back slowly toward to the new equilibrium.

The initial decline in does not result from any decline in its numerator, *E*[*D*^{2}]. Indeed, Figure 10 shows that the numerator actually grows. The decline in happens because growth in its numerator is outstripped by that in its denominator, *E*[*a*(1 − *a*)*b*(1 − *b*)], which increases under the influence of mutation. The proportional increase is large, because our initial population was small and therefore had little heterozygosity. For this reason, the denominator was initially so small that increments caused by mutation had a large proportional effect.

Two factors account for the postexpansion growth in the numerator, *E*[*D*^{2}], of First, *D* is proportional to heterozygosity (Kaplan and Weir 1992, p. 334), which increases in response to mutation. Ordinarily, the positive effect on *D*^{2} would be offset by the negative effect of recombination. But in the early generations following our expansion, this negative effect is very weak. This is because the initial population had low heterozygosity. Few recombinant gametes are produced in such a population, because such gametes are produced only by double heterozygotes.

The nonmonotone time path in Figure 9 refers to but it seems plausible that *r*^{2} might obey similar dynamics. This may explain why, in Figure 2, right, the left end of the curve for *t* = 500 is below the equilibrium.

This also explains why the left portion of the LD curve declines faster after expansion from equilibrium than after a bottleneck. The left portion of the curve refers to tightly linked sites, with weak recombination. In the postexpansion population, genetic drift is also weak because the population is large. Because recombination and drift are both weak, mutation dominates the dynamics. The proportional effect of mutation is large when heterozygosity is low. In the case of expansion from equilibrium, the population has been small a long time, so heterozygosity is low, the proportional effect of mutation is large, and declines rapidly. Heterozygosity is not so low at the end of a bottleneck, because the population has been small only briefly. Consequently, the proportional effect of mutation is smaller, and declines more slowly.

### LD in the human population of Europe

The methods discussed above generate predictions about LD from assumptions about the history of population size. It is also possible, by fitting predicted to observed curves, to work in the other direction—from data to inferences about population history. This is a problem in statistical inference, and methods for this purpose will be described elsewhere.

The present methods, however, are useful in exploratory data analysis, as shown in Figure 11. The open circles show as estimated from chromosome 1 in European data (1000 Genomes Project Consortium 2012). These are surrounded by dashed lines, which indicate a 95% confidence region generated by moving-blocks bootstrap. These show that chromosome 1 provides accurate estimates of Hill (1981) expressed skepticism about the possibility of estimating population size from data on LD. At that time only limited data were available, and it was not possible to estimate LD statistics with great accuracy. This inaccuracy bled into estimates of population size. The narrow confidence region in Figure 11 shows that things have changed.

The solid line in Figure 11 shows the equilibrium curve that fits these data best—the one that minimizes squared errors between observed and predicted values of This is the curve for a population of constant size, 2*N* = 8621. In Figure 11, right, we see the differences between observed and fitted values. Because these differences are positive on the left but negative on the right, it is clear that the observed LD curve declines more steeply than any equilibrium curve. As we have seen, this suggests a history of population expansion in Europe, in agreement with many other analyses of European data.

## Discussion

has never been valued for its own sake. It is seen instead as an approximation to the quantity of real interest—the expected value of *r*^{2}. Yet it is often a poor approximation, even in populations of constant size (Maruyama 1982; Hudson 1985, pp. 616–617). It is even worse when population size varies.

Following a change in population size, the mean of *r*^{2} converges toward its new equilibrium far faster than does Presumably, this is because is sensitive to loci with high heterozygosity, and the gene trees of such loci are deep. Whatever the cause, this difference in rates of convergence has two effects. First, it makes useless as an approximation to *r*^{2} in populations that have varied in size. On the other hand, it also means that itself provides a deep record of demographic history.

To take advantage of this extended record, one must estimate directly from data. This is easy to do, by using averages in place of the expectations in Equation 2. With genome-scale data, such estimates are quite accurate. This provides a measure of LD that is unique in that one can easily calculate expected LD from the history of population size. With other measures, such inferences would require extensive computer simulations.

Following a population expansion, drift is weak and the dynamics of LD are dominated by recombination. The LD curve begins to decline, and this decline is fastest in the right-hand portion of the curve, where recombination rates are highest. Consequently, the curve will be unusually steep for hundreds of generations following a population expansion.

Following a population collapse, drift becomes strong and dominates the dynamics. It pushes LD upward rapidly and relatively uniformly throughout the curve. Thus, the LD curve becomes high and flat. This pattern has been reported for several human populations, including high-latitude foragers (Kaessmann *et al.* 2002) and Ashkenazi Jews (Shifman and Darvasi 2001). Yet as Pritchard and Przeworski (2001, p. 7) observe, it has been difficult to explain: We have several examples in which large regions exhibit more LD than would be expected under either a model of constant population size or a model with rapid population growth. Yet, at the same time studies of polymorphism at a small scale reveal less LD than would be expected. These observations at different scales are hard to accommodate in a single explanation since factors that increase long-distance LD will tend to have an even larger effect on closely linked sites.

As we have seen, this pattern arises naturally from a recent reduction in population size.

A bottleneck in population size begins with an episode of population collapse. At the end of the bottleneck, the LD curve is therefore high but flat. As population size rises at the end of the bottleneck, genetic drift becomes weak and recombination grows in importance. The right portion of the LD curve falls faster than the left, so the curve becomes steeper.

In the left portion of the curve, recombination is weak. Genetic drift is also weak, because of the increase in population size. This allows mutation to play an important role, and its effect distinguishes the two forms of expansion: from equilibrium and from a bottleneck. In the former case, the preexpansion population had little heterozygosity, so each mutational increment has a large proportional effect on the denominator of After a bottleneck, the opposite is true, so the left portion of the LD curve declines more slowly. The curve becomes even steeper after a bottleneck than after an expansion from equilibrium.

In summary, (1) when recombination is strong relative to drift, LD declines and the curve becomes steeper; (2) when drift is strong relative to recombination, LD rises but the curve stays flat; and (3) where drift and recombination are both weak, the rate of decline in LD decreases with heterogyzosity.

In the human population of Europe, the LD curve is steep, suggesting a history of population expansion. This might reflect the spread of modern humans into Europe, the spread of farmers during the Neolithic, or the spread of Indo-European speakers.

## Acknowledgments

I am grateful for comments from R. Bohlender, E. Cashdan, R. Gutenkunst, H. Harpending, W. G. Hill, and J. Seger and for access to computer facilities in the laboratories of Lynn Jorde and Henry Harpending.

## Appendix

### Hill’s model of LD evolution

Hill (1975) incorporates mutation, using the model of infinite alleles. However, most modern work involves either DNA sequence data or SNPs. In these data, alleles are nucleotide states: A, T, G, or C. Nearly all loci have just two alleles, so it is not appropriate to assume an infinity of alleles. Yet as we shall see, Hill’s (1975) results also apply to a more appropriate mutational model—that of “infinite sites” (Kimura 1969), which assumes that mutation never strikes the same site twice.

Hill begins with a vector of moments:Here *a _{i}* and

*b*are the frequencies of allele

_{j}*A*at locus

_{i}*A*and allele

*B*at locus

_{j}*B*. The disequilibrium coefficients,

*D*, are defined such that

_{ij}*a*+

_{i}b_{j}*D*is the frequency of gamete type

_{ij}*A*. The dynamics of these moments are approximately linear, after dropping terms in

_{i}B_{j}*u*

^{2}, 1/

*N*

^{2}, and

*u*/

*N*, where

*u*is the mutation rate and

*N*the diploid population size. Thus, Hill (1975, equation 1) shows thatwhere

**D**,

**R**, and

**M**are matrices describing the linear effects of drift, recombination, and mutation. So far, the model describes only the changes in frequency of existing alleles. Thus it applies not only to Hill’s model of infinite alleles, but also to the model of infinite sites.

To incorporate mutation, Hill defines a new vector, (A1)where the sums run across all pairs of alleles at each locus. The dynamics of this new vector include an additive contribution, Δ*x*_{mut}, which represents the effect of mutation to new alleles: (A2)This equation can be iterated across many generations, and it is easy to incorporate changes in population size. At the end of this process, is calculated as *x*_{3}/2*x*_{1} (Hill 1975, p. 124).

The first term on the right side of Equation A1 applies equally to both mutational models. Some reinterpretation is required, however, to relate the second term to the model of infinite sites. In this new context, Δ*x*_{mut} becomes the contribution of mutations at other sites in the genome, rather than that from new alleles at the same pair of sites. The mutational increments are, however, identical in the two models.

To see that this is so, let us take a close look at the mutational contributions to *x*, the vector defined above in Equation A1. Under the model of infinite sites, mutation never strikes the same site twice, so each polymorphic site has exactly two alleles. In this biallelic context, we can simplify Hill’s vector of moments as (A3)Here, *a* and *b* are the frequencies of alleles *A*_{1} and *B*_{1}, and *D* is the coefficient of linkage disequilibrium. It is defined such that *ab* + *D* is the frequency of gamete type *A*_{1}*B*_{1}. When each locus is biallelic, Hill’s multiallelic moments reduce to *x*_{1} = 4*h*_{1}, *x*_{2} = 4*h*_{2}, and *x*_{3} = 8*h*_{3} (Hill 1975, p. 123).

The mutational increments, Δ*x*_{mut} and Δ*h*_{mut}, both depend on the heterozygosity, *H* = *E*[2*a*(1 − *a*)]. The dynamics of heterozygosity under infinite sites are the same as those under infinite alleles (Hill 1975, p. 120):The *h _{i}* are nonzero only if both loci are polymorphic. On the other hand, our model assumes that mutation affects only monomorphic sites. This implies that mutational increments to

*h*occur only at pairs of sites in which one site is polymorphic and the other monomorphic. Table A1 summarizes the case in which

*A*is initially polymorphic, and a mutation strikes an initially monomorphic locus,

*B*.

Consider first the increment to *h*_{1} = *E*[*a*(1 − *a*)*b*(1 − *b*)]. Before mutation, *h*_{1} = 0 because *b* = 0. After mutation, *b* = 1/2*N*, so *h*_{1} becomes *E*[*a*(1 − *a*)(1/2*N* − 1/4*N*^{2})] ≈ *H*/4*N*. This accounts for only half the effect of mutation, because there are also pairs at which *A* is monomorphic and *B* polymorphic. The expected effect of a single mutation is thus ∼*H*/2*N*. The expected number of such mutations per generation is 2*Nu*. In aggregate, therefore, the increment from mutation is Δ_{mut}*h*_{1} ≈ *uH*.

To derive the mutational increment to *h*_{2} = *E*[(1 − 2*a*)(1 − 2*b*)*D*], we assume as before that *A* is polymorphic but *B* is not. Because *D* is zero when either site is monomorphic, *h*_{2} = 0 before mutation. After mutation, there are two cases to consider. With probability *a*, the mutation falls on an *A*_{1}-bearing gamete, and *h*_{2} becomes (1 − 2*a*)(1 − 2/2*N*)(1 − *a*)/2*N*, as shown in Table A1. On the other hand, with probability 1 − *a* the mutation falls on an *A*_{0}-bearing gamete, and *h*_{2} becomes −(1 − 2*a*)(1 − 2/2*N*)*a*/2*N*. In expectation, the new value of *h*_{2} is 0. This result is conditional on a mutation at locus *B*, but an identical argument applies for mutation at locus *A*. Thus, Δ_{mut}*h*_{2} = 0.

The third moment is *h*_{3} = *E*[*D*^{2}]. Because *B* is initially monomorphic, *D*_{2} = 0 before the mutation. When a mutation occurs at *B*, it strikes an *A*_{1}-bearing gamete with probability *a* and an *A*_{0}-bearing gamete with probability 1 − *a*. The resulting values of *D* are shown in Table A1. Squaring these and weighting by *a* and 1 − *a* gives

This is the effect on *D*^{2} of a single mutation at *B*. There are 2*Nu* such mutations per generation, so the expected increment from mutation at locus *B* is 2*Nu* × *H*/8*N*^{2} = *uH*/4*N*. Finally, multiply by 2 to account for cases in which *A* is initially monomorphic and *B* polymorphic. This gives Δ_{mut}*h*_{3} = *uH*/2*N* ≈ 0, ignoring terms of order *u*/*N*.

In summary, (A4)in agreement with Hill (1975, p. 121). This shows that the mutational increment is the same under the models of infinite sites and infinite alleles. Because of this equivalence, Hill’s results apply equally to both models of mutation.

### Estimating Linkage Disequilibrium with Partially Phased Diploid Data

In 1000-genotypes data, not all genotypes are phased. At different loci, the unphased genotypes may correspond to different individuals. Thus, we need a method that can deal with unphased genotypes scattered throughout the data matrix.

The symbols *j*, *k*, *l*, and *m* represent alleles and will always equal either 0 or 1. I write phased two-locus genotypes in form which says that a diploid individual has genotype *jk* at locus *A* and genotype *lm* at locus *B*. This represents the union of gametes and This genotype is unordered, in that we cannot distinguish the maternal gamete from the paternal one. Consequently, there is no distinction between and When I write genotypes in this form, I imply that linkage phase is known. In other words, genotypes and are not equivalent.

Consider a tiny two-locus data set, consisting of three diploid individuals:

On the right, each row corresponds to a locus and each column to a gamete. To estimate linkage disequilibrium (*D*), we would need to calculate where *x _{i}* is the

*i*th value in the upper row and

*y*is the corresponding value in the lower one. This sum can be written asHere,

_{i}*s*=

_{i}*j*+

_{i}l_{i}*k*is the contribution of the

_{i}m_{i}*i*th diploid genotype. This calculation requires phased data, and it is the only step where such data are required in estimating

*D*. To cope with unphased data, we need a method to estimate

*s*.

_{i}Consider the function If at least one genotype is homozygous, then phasing does not matter. For example, We can calculate *s* regardless of phasing. The only genotypes that need concern us are those in which both loci are heterozygous.

As mentioned above, all genic values (*j*, *k*, *l*, and *m*) are either 0 or 1. Double heterozygotes will look like either or (I ignore the equivalent representations and because genotypes are unordered.) With unphased data, we cannot distinguish between these two cases. Yet they imply different values: but For double heterozygotes, I replace *s* with its expected value, which equals the probability, *w*, that the genotype is of form rather than

To calculate this probability, I begin with standard results for the frequencies of the four gamete types, as shown in Table A2. Following Rogers and Huff (2009), I ignore recombination in the most recent generation. Under random mating, these gametes form at random to produce two-locus genotypes. The frequency of genotype among double heterozygotes is thus (A5)whereandThese results can be used to estimate *D* using the EM algorithm (Dempster *et al.* 1977). Let *K* represent the number of unphased double heterozygotes. Each of these is of type with probability *w* and of type with probability 1 − *w*. The expected log likelihood isNote that *E* ln *L* depends on *p*_{0}, *p*_{1}, *p*_{2}, and *p*_{3}, which are themselves functions of *D*. The maximum-likelihood estimate of *D* is the value that maximizes *E* ln *L*.

This is a unidimensional maximization problem, because *D* is the only unknown. *D* cannot fall outside the range where is the maximum of −*a*(1 − *b*) and −*b*(1 − *a*), and is the minimum of *a*(1 − *b*) and *b*(1 − *a*) (Lewontin 1964, p. 55). The initial value of *D* is set assuming that *w* = 1/2. In each iteration, if *d*^{2}*E* ln *L*/*dD*^{2} < 0, then the algorithm tries a Newton step (Hamming 1973, p. 68). If the result is within then the Newton step is accepted. Otherwise, the algorithm uses a modified version of the bisect algorithm (Hamming 1973, p. 62) to move in the uphill direction, as indicated by the sign of *dE* ln *L*/*dD*.

The modification to bisect allows the algorithm to make large steps when it seems likely that the optimum is at a boundary. For example, if *dE* ln *L*/*dD* > 0 at the current value of *D*, then motion is to the right, toward Before deciding how far to move, the algorithm checks the sign of the derivative at If both derivatives are positive, then the algorithm takes a big step, moving 80% of the distance from *D* to But if the two derivatives have opposite sign, then the algorithm takes a small step, moving only 50% of the way to When *dE* ln *L*/*dD* < 0, the algorithm is similar, except that motion is to the left, toward

## Footnotes

*Communicating editor: J. Wall*

- Received November 10, 2013.
- Accepted May 28, 2014.

- Copyright © 2014 by the Genetics Society of America