# Closed-Form Two-Locus Sampling Distributions: Accuracy and Universality

^{*}Computer Science Division and^{†}Department of Statistics, University of California, Berkeley, California 94720

- 1
*Corresponding author:*Department of Electrical Engineering and Computer Sciences, University of California, 683 Soda Hall #1776, Berkeley, CA 94720-1776. E-mail: yss{at}cs.berkeley.edu

## Abstract

Sampling distributions play an important role in population genetics analyses, but closed-form sampling formulas are generally intractable to obtain. In the presence of recombination, there is no known closed-form sampling formula that holds for an arbitrary recombination rate. However, we recently showed that it is possible to obtain useful closed-form sampling formulas when the population-scaled recombination rate ρ is large. Specifically, in the case of the two-locus *infinite-alleles* model, we considered an asymptotic expansion of the sampling formula in inverse powers of ρ and obtained closed-form expressions for the first few terms in the expansion. In this article, we generalize this result to an arbitrary *finite-alleles* mutation model and show that, up to the first few terms in the expansion that we are able to compute analytically, the functional form of the asymptotic sampling formula is common to all mutation models. We carry out an extensive study of the accuracy of the asymptotic formula for the two-locus parent-independent mutation model and discuss in detail a concrete application in the context of the composite-likelihood method. Furthermore, using our asymptotic sampling formula, we establish a simple sufficient condition for a given two-locus sample configuration to have a *finite* maximum-likelihood estimate (MLE) of ρ. This condition is the first analytic result on the classification of the MLE of ρ and is instantaneous to check in practice, provided that one-locus probabilities are known.

FOR a given population genetics model, the probability of observing a sample of DNA sequences plays a fundamental role in various applications, including parameter estimation and ancestral inference. However, except in the one-locus case with a special model of mutation such as the infinite-alleles model or the finite-alleles parent-independent mutation (PIM) model, closed-form sampling formulas are generally unknown. In particular, when recombination is involved, obtaining an analytic formula for the sampling distribution has so far remained an intractable problem, and most research has focused on developing Monte Carlo methods, such as importance sampling (Griffiths and Marjoram 1996; Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Griffiths *et al.* 2008) and Markov chain Monte Carlo (Kuhner *et al.* 2000; Nielsen 2000; Wang and Rannala 2008). These Monte Carlo methods have led to useful tools for population genetics analysis, but they tend to be computationally intensive—in some cases prohibitively so—and it is difficult to provide a theoretical characterization of their accuracy. In the case of the infinite-sites model of mutation, Lyngsø *et al.* (2008) recently proposed a new approach to compute likelihoods using ideas from parsimony, but that method is not scalable.

The main mathematical framework that underlies the above-mentioned computational methods is the coalescent with recombination (Griffiths 1981; Kingman 1982a,b; Hudson 1983), which models the genealogical history of sample chromosomes. When the rate of recombination is large, the genealogies sampled by Monte Carlo methods are typically very complicated, containing many recombination events. However, in contrast to this increase in complexity in the coalescent, we in fact expect the dynamics to be easier to study for large recombination rates, since the loci under consideration would then be less dependent. It seems plausible that a stochastic process exists that is simpler than the standard coalescent with recombination, which describes the dynamics of the relevant degrees of freedom for large recombination rates. To study whether such a “dual” description of weakly interacting loci might exist, we (Jenkins and Song 2009) recently revisited the problem of obtaining a closed-form sampling formula for a model with recombination and obtained useful analytic results in the case that the population-scaled recombination rate ρ is large. More precisely, we considered the diffusion limit of the two-locus *infinite-alleles* model with population-scaled mutation rates θ_{A} and θ_{B} at the two loci. For a given sample configuration **n** (defined later in the text), we found the asymptotic expansion of the sampling probability in inverse powers of ρ,(1)where *q*_{0}, *q*_{1}, and *q*_{2} are independent of ρ. The zeroth-order term corresponds to the completely unlinked case, given by a product of marginal one-locus sampling distributions found by Ewens (1972). Our main contribution was to derive a closed-form formula for the first-order term and to show that the second-order term can be decomposed into two parts, one for which we obtained a closed-form formula and the other that satisfies a simple recursion that can be easily evaluated using dynamic programming.

The goal of this article is to generalize the above result to an arbitrary *finite-alleles* model of mutation and to study the accuracy of our asymptotic sampling formula (ASF). The main theoretical result presented here is that the functional form of our ASF is *universal* in the following sense: For all mutation models, the first two terms *q*_{0} and *q*_{1} in the expansion can be expressed as linear combinations of products of marginal one-locus probabilities, with coefficients depending on the sample configuration but not on mutation parameters. Hence, *q*_{0} and *q*_{1} for different mutation models can be obtained simply by substituting the corresponding one-locus sampling probabilities into the formulas. Whether this universality property extends to higher-order terms in the expansion is an open question. As in the infinite-alleles case, the second-order term *q*_{2} for an arbitrary finite-alleles mutation model can be decomposed into two parts, one for which we obtain a closed-form formula and the other that satisfies a simple recursion relation. Since we are not able to obtain a closed-form solution to the recursive part, we do not know whether the entire second-order term *q*_{2} will obey the universality property.

We carry out an extensive study of the accuracy of our ASF for the two-locus PIM model, in which case we can numerically solve for the true sampling probabilities for small sample sizes. Since any diallelic recurrent mutation model can be reduced to a PIM model (reviewed below), our results should have practical applications. We discuss a concrete application in the context of the composite-likelihood method (Hudson 2001a), which uses two-locus sampling probabilities as the basic building blocks. LDhat (McVean *et al.* 2002, 2004), a widely used software package for estimating recombination rates, is based on this composite-likelihood approach. LDhat assumes the symmetric diallelic model of mutation and relies on the importance sampling scheme developed by Fearnhead and Donnelly (2001) for the coalescent with recombination, to generate exhaustive lookup tables containing two-locus sampling probabilities for all inequivalent diallelic sample configurations. This process of generating exhaustive lookup tables is very computationally expensive. The developers of LDhat suggest using a range of integral ρ-values from 0 to 100, in which case the two-locus sampling probability for ρ > 100 is approximated by that at ρ = 100. An alternative to this truncation approach is to use our ASF for large ρ-values. In this article, we examine the effect that the truncation used by LDhat has on the two-locus sampling probability.

Another application of our work is classification of the maximum-likelihood estimate (MLE) of the recombination rate ρ. Specifically, we establish a simple sufficient condition for a given two-locus sample configuration to have a *finite* MLE of ρ. To our knowledge, this is the first analytic result on the classification of the MLE of ρ. Further, the sufficient condition is instantaneous to check, provided that one-locus probabilities are known.

We have written a C++ program, called ASF, that computes the first three terms in the expansion (1) for either the infinite-alleles model or the finite-alleles PIM model. This program is publicly available at http://www.eecs.berkeley.edu/∼yss/software.html.

## REVIEW OF ONE-LOCUS SAMPLING DISTRIBUTIONS

Here we briefly review two mutation models for which one-locus sampling formulas are known in closed form. As mentioned above, our two-locus sampling formula is given in terms of one-locus sampling formulas.

#### Notation:

The sample configuration at a locus is denoted by a vector , where *n _{i}* denotes the number of gametes with allele

*i*at the locus. Note that

*K*takes on slightly different meanings depending on the assumed model of mutation. In an infinite-alleles model,

*K*refers to the number of observed alleles, whereas in a finite-alleles model it refers to the number of possible alleles in the model. We use

*n*to denote the total sample size . The probability of an

*unordered*sample configuration is denoted by ; the dependence on parameters is not shown for ease of notation.

Let 𝒜_{n} denote an *ordered* configuration of *n* sequentially sampled gametes such that the corresponding unordered configuration is given by **n**. By exchangeability, the probability of 𝒜_{n} does not change under an arbitrary permutation of the sampling order, so we can write this probability of an ordered sample as without ambiguity; again, we suppress the dependence on parameters. In what follows, our theoretical results are presented in the case of ordered samples.

Throughout, we consider the diffusion limit of a neutral haploid exchangeable model of random mating with constant population size 2*N*. Let *u* denote the probability of mutation at a locus per gamete per generation. Then, in the diffusion limit, *N* → ∞ and *u* → 0 with the population-scaled mutation rate θ = 4*Nu* held fixed. Mutation events occur according to a Poisson process with rate θ/2.

Finally, given a nonnegative real number *x* and a positive integer *n*, we use (*x*)_{n} := *x*(*x* + 1) … (*x* + *n* − 1) to denote the *n*th ascending factorial of *x*.

#### Infinite-alleles model:

Under the infinite-alleles model, any two gametes can be compared to determine whether or not they have the same allele, but it is not possible to determine how the alleles are related when they are different. Therefore, allelic label is arbitrary. Each mutation gives rise to a new allele that has never been seen before in the population. For this model, the sampling distribution of an ordered sample is given by(2)(Ewens 1972). The probability of an unordered sample configuration is related to aswhere α_{i} denotes the number of allele types represented *i* times; *i.e.*, .

#### Finite-alleles PIM model:

Assume that allelic changes are described by a Markov chain with transition matrix , with *P _{ij}* being the probability of transition from type

*i*to type

*j*. For a given sample configuration , satisfies the recursion(3)with boundary conditions , for

*i*= 1, …,

*K*, where is a

*K*-dimensional unit vector with a 1 at the

*i*entry and is the stationary distribution corresponding to

**P**. See for example Lundstrom

*et al.*(1992) and Griffiths and Tavaré (1994) for ways to obtain (3) [Griffiths and Tavaré's corresponds to our ]. A general solution to the above recursion is not known, but a closed-form solution can be found for the parent-independent mutation model. As its name suggests, the PIM model satisfies

*P*=

_{ij}*P*for all

_{j}*i*, and the stationary distribution of is . (In some sense the infinite-alleles model may be viewed as a PIM model. However, henceforth in this article we take “PIM” to mean a finite-alleles model whose transition matrix satisfies the above condition.) Further, the stationary sampling probability of an ordered configuration is given by(4)(Wright 1949), while the probability of an unordered sample configuration is related to by

The diallelic case (*i.e.*, with *K* = 2) is of particular interest. The mutation process is specified by , where **I** is the identity matrix, so two models with the same are in fact equivalent. Hence, given any diallelic model with mutation parameter θ_{Diallelic} and transition matrixit can be transformed into an equivalent PIM model with mutation parameter θ_{PIM} = θ_{Diallelic}(*P*_{12} + *P*_{21}) and transition matrixwhere *P*_{1} = *P*_{21}/(*P*_{12} + *P*_{21}).

The above equivalence can be checked algebraically, but the intuition behind it is perhaps more illuminating. The diagonal entries *P _{ii}* represent transitions from allele

*i*to the same allele

*i*, and they are effectively invisible in the genealogy. These “invisible” mutations provide an extra degree of freedom in the transition matrix, and when the matrix is 2 × 2 this is sufficient to force it into a parent-independent form. The appropriate choice for the rate of invisible mutations is to set

*P*

_{11}=

*P*

_{21}and set

*P*

_{22}=

*P*

_{12}. This ensures that the entries in any column of

**P**are all the same (

*i.e*., we have parent independence). Then, one normalizes each entry by (

*P*

_{12}+

*P*

_{21}) (also adjusting θ

_{Diallelic}in a corresponding manner) so that

**P**is still a stochastic matrix. This explains the expressions for θ

_{PIM}and

*P*

_{1}given above.

## ASYMPTOTIC SAMPLING FORMULA FOR THE TWO-LOCUS MODEL

We now consider two-locus models with recombination. We denote the two loci by *A* and *B* and use θ_{A} and θ_{B} to denote the respective population-scaled mutation rates. In the case of an infinite-alleles model, we use *K* and *L* to denote the number of distinct allelic types observed at locus *A* and locus *B*, respectively, while for a finite-alleles model, *K* and *L* denote the number of possible allelic types at the two loci. The population-scaled recombination rate is denoted by ρ = 4*Nr*, where *r* is the probability of recombination between the two loci per gamete per generation. In the diffusion limit, *N* → ∞ and *r* → 0, while ρ is held fixed.

#### Extended sample configuration for two loci:

To obtain a closed system of recursion relations satisfied by two-locus sampling probabilities, the type space must be extended to allow some gametes to be specified at only one of the two loci. The two-locus sample configuration is denoted by , where with *a _{i}* being the number of gametes with allele

*i*at locus

*A*and unspecified alleles at locus

*B*, with

*b*being the number of gametes with unspecified alleles at locus

_{j}*A*and allele

*j*at locus

*B*, and is a

*K*×

*L*matrix with

*c*being the multiplicity of gametes with allele

_{ij}*i*at locus

*A*and allele

*j*at locus

*B*. Throughout, we use the following notation:We also use and to denote the marginal sample configurations of

**c**restricted to locus

*A*and locus

*B*, respectively. It is important to emphasize the distinction between the vectors

**a**and

**b**, which represent gametes with alleles specified at only one of the loci, and the vectors and , which represent the one-locus marginal configurations of gametes with both alleles observed. Even if we observe both alleles of every gamete in a sample [in the form ], the vectors

**a**and

**b**are still needed when we think about the sample's ancestry. An ancestral gamete might transmit genetic material to extant gametes in the sample at only one of the two loci, whereupon we use

**a**or

**b**to avoid specifying the allele at the nonancestral locus.

#### Two-locus recursion:

We use to denote the sampling probability of an ordered sample with configuration . As in the one-locus case, we suppress the dependence on parameters. Golding (1984) considered generalizing the infinite-alleles model to include recombination and constructed a recursion relation satisfied by at stationarity in the diffusion limit. Ethier and Griffiths (1990) later undertook a more mathematical analysis of the model and provided several theoretical results. For a *finite-alleles* model with transition matrices and at the two loci, one can show that satisfies the following recursion at stationarity,(5)with boundary conditions for all *i* ∈ {1, … , *K*} and for all *j* ∈ {1, … , *L*}, where and are stationary distributions corresponding to and , respectively. There are several ways to derive the above recursion. One can argue directly by considering the probabilities of the most recent event going backward in time in the coalescent with recombination. Alternatively, the recursion can be obtained from the underlying diffusion process (Ethier and Griffiths 1990; Griffiths and Tavaré 1994).

#### Asymptotic sampling formula:

In the case of the two-locus infinite-alleles model with a large ρ, we (Jenkins and Song 2009) found an asymptotic expansion of the form(6)where *q*_{0}, *q*_{1}, and *q*_{2} are independent of ρ. Below we generalize this work to an arbitrary finite-alleles mutation model. Our closed-form formulas are given in terms of the marginal one-locus sampling formulas *q ^{A}* and

*q*, where

^{B}*q*depends only on , and the marginal sample configuration of

^{A}**n**at locus

*A*, while

*q*depends only on , and the marginal sample configuration of

^{B}**n**at locus

*B*.

We now state our main results. All proofs are deferred to the appendix.

Proposition 1. *In the asymptotic expansion* (*6*) *of the two-locus sampling formula for either an infinite-alleles or an arbitrary finite-alleles model*, *the zeroth order term* *is given by*(7)

Proposition 1 is intuitive, since the first term in the asymptotic expansion should give the exact solution for ρ = ∞. When the two loci are unlinked, the complete sampling probability is then simply the product of two marginal one-locus sampling probabilities, each applied to the marginal sample configuration at that locus. The proceeding terms in the asymptotic expansion are described below.

Theorem 1. *In the asymptotic expansion* (*6*) *of the two-locus sampling formula for either an infinite-alleles or an arbitrary finite-alleles model*, *the first-order term* *is given by*(8)*for arbitrary configurations* *of nonnegative integers*.

This theorem generalizes our previous result (Jenkins and Song 2009) to an arbitrary finite-alleles model of mutation. This result is *universal* in the following sense: The functional form of *q*_{1} in terms of *q ^{A}* and

*q*is given by Equation 8, regardless of whether we are considering an infinite or a finite set of possible alleles at each locus. If the one-locus sampling probabilities are known in closed form, then (8) is instantaneous to evaluate. One simply applies the applicable one-locus solutions for

^{B}*q*and

^{A}*q*:

^{B}For an infinite-alleles model,

*q*and^{A}*q*are given by the Ewens sampling Equation 2, with θ replaced with θ^{B}and θ_{A}, respectively._{B}For a finite-alleles model,

*q*and^{A}*q*are the solutions to (3) with appropriate θ and^{B}*P*for each locus. Closed-form expressions to these equations are not known in general, unless mutation is parent independent, in which case_{ij}*q*and^{A}*q*are given by (4) with appropriate θ and^{B}*P*._{ij}

Note that . However, it turns out that does not vanish in general, and we do not have an analytic expression for it. The following theorem shows that can be decomposed into two parts, one for which we have a closed-form expression and the other :

Theorem 2. *In the asymptotic expansion* (*6*) *of the two-locus sampling formula for an arbitrary finite-alleles model, the second-order term* *is of the form*(9)*where* *is given by the analytic formula shown in the appendix*, *and* *satisfies the recursion relation*(10)*with boundary conditions* *for i* = 1, …, *K and* *for j* = 1, …, *L.*

If one-locus sampling distributions are known in closed form, evaluating the analytic part in is instantaneous for all sample configurations. Unfortunately, we do not have a closed-form expression for , so we are not able to conclude whether the above-mentioned universality result extends to the second-order term. However, for the PIM model, one can sum over the index *t* in (10), so can be evaluated efficiently using dynamic programming. In practice, this procedure is orders of magnitude faster than solving (5) directly. Furthermore, as discussed in the next section, the relative contribution of the recursive term to the asymptotic sampling distribution is generally very small, so we may safely discard that term without a considerable loss of accuracy.

## ACCURACY OF THE ASYMPTOTIC SAMPLING FORMULA

Our asymptotic sampling formula is applicable to a range of models of genetic data. Equipped with such a formula, it is natural to consider when it can be used to approximate the probability of a given sample configuration. As discussed before, evaluating the likelihood associated with a sample configuration is at the heart of many problems in population genetics. While obtaining such likelihoods is computationally expensive for most samples of interest, the ASF can be evaluated almost instantaneously.

#### Test setup:

To test the accuracy of the ASF, we focus on the diallelic recurrent mutation model, which has been used to model single-nucleotide polymorphism (SNP) data. In what follows, each locus represents a single site and we make the following distinction: We use the term *diallelic* to refer to a model in which there are two possible alleles at a locus. This is in contrast to the term *dimorphic*, which refers to a sample in which precisely two alleles are observed. When we refer to a sample from a two-locus model as being dimorphic, we mean that both loci are dimorphic.

We address the following simple model of sequence evolution. Assume a diallelic, symmetric mutation model for each locus, with the same mutation rate θ_{Diallelic} and transition matrix . As discussed earlier, it can be cast as a PIM model with mutation rate θ = 2θ_{Diallelic} and transition matrix . We use this PIM model at each locus in our calculation of the ASF, so θ_{A} = θ_{B} = θ. In what follows we condition on segregation at both sites by considering only dimorphic samples and normalize where appropriate.

Realistic choices for θ depend on the biological sample of interest. Here, we consider two values of θ. First, for many neutral regions of the human genome, typical mutation rates per base are in the range 0.001 ≤ θ ≤ 0.01 (Nachman and Crowell 2000; McVean *et al.* 2002). Below, we present results for θ = 0.01; results for θ = 0.001 were almost identical (not shown). When mutation rates are low, the infinite-alleles assumption is also reasonable, and we obtained very similar results using the ASF under an infinite-alleles model with θ = 0.01. For brevity these results are also omitted. Second, we consider θ = 1.0. This model might be assumed for the higher levels of polymorphism seen, for example, at synonymous sites in human immunodeficiency virus (McVean *et al.* 2002).

Throughout this section, we ignore missing data by considering sample configurations only of the form .

#### A first look into accuracy:

We want to answer the following broad question: For a given sample size *c* and recombination rate ρ, how well does the ASF approximate the true likelihood? In our study, we truncate the asymptotic expansion (6) at the second order and define asBy comparing the performance of against that of and , we can gain an idea of how much improvement is provided by each additional term in . As motivation, consider the likelihood curves shown in Figure 1, a and b, which compare different levels of approximation with the true likelihood for the configurations with and , respectively, for θ = 0.01. The former is illustrative of a configuration for which the MLE of ρ is infinite, and the latter is an example for which the MLE is finite. These configurations have sample sizes small enough that the true likelihoods can be solved directly from Equation 5. We can make a number of preliminary observations from Figure 1. As more terms are added to the ASF, it sustains a higher accuracy, as we would hope. In particular, seems to approximate the true likelihood very accurately for ρ ≥ 20. As illustrated in Figure 1, a similar level of accuracy is achieved by , an approximation to defined byRecall that comprises two terms (*cf*. Theorem 2): , which must be solved by dynamic programming, and , for which we derived an analytic expression. Calculating imposes a small computational burden compared to solving (5) directly, but the associated running time grows with sample size. The approximation simply ignores this potentially burdensome term. As we discuss later, in general this seems to lead to a negligible sacrifice in accuracy. Empirically, for most configurations we examined, is at least an order of magnitude smaller than the analytic term . We return to this below and carry out a more thorough study of the relative contribution of to .

#### Distribution of unsigned relative errors:

To ensure that the above results were not unique to the particular choices of data set illustrated, we performed a more systematic study as follows. For a given **c**, we measure the accuracy of by the *unsigned relative error*with corresponding definitions for the unsigned relative errors of , , and . By drawing according to the true sampling probability, we may speak of the distribution of the unsigned relative error. This distribution forms the basis of our measure of accuracy, and when calculating it, we must be sure to weight each sample correctly. The correct weight for is times a combinatorial factor capturing the number of distinct orderings of the sample; this product is denoted by . To evaluate , we looked at sample sizes sufficiently small so that the true likelihood could be calculated directly by solving Equation 5. We implemented a computer program for solving the recursion numerically.

We can make one further efficiency saving. Computing the true probability for every configuration of size *c* ≥ 30 is computationally demanding. However, when mutation rates are the same for the two loci, observe that, by symmetry, distinct configurations exist with the same sampling probability. For example, the samples and are distinct observations when *x*, *y* are distinct nonnegative integers, but their sampling probabilities are the same. We can therefore collapse our space of distinct sample configurations into a smaller one of equivalence classes, where two distinct samples are considered equivalent if they have the same probability. For *c* = 30, this meant that for 5456 distinct configurations only 752 probabilities required evaluation. Similar arguments can be made for the infinite-alleles model.

For the finite-alleles PIM models described above, the cumulative distribution of the unsigned relative error across all dimorphic samples of size *c* = 20 is shown in Figure 2. As is clear from the plot, the ASF offers substantial accuracy even for ρ as “low” as 50. For example, for a dimorphic sample of size 20 drawn at random from with θ = 0.01, the probability that its ASF is within 1% of the truth is 0.61, while the probability that is within 5% of the truth is 0.87. The corresponding probabilities when using alone are 0.36 and 0.43, respectively. Note the spike of probability mass around zero even for the curve for . This suggests that assuming ρ = ∞ will still be accurate in about half of cases, but with substantial errors in the rest. This is not the case when employing instead.

The distributions for the two mutation parameters are rather similar, suggesting a certain degree of robustness in the ASF to θ. The case with θ = 1 seems to have a lower probability of achieving a high accuracy (say within 1%) but a higher probability of achieving a moderate accuracy (within 10%).

#### Effect of sample size and recombination rate:

To assess the sensitivity of to the sample size *c* and the recombination rate ρ, we repeated the above plots for varying values of these parameters; results are shown in Figures 3 and 4. As we might expect, accuracy improves with increasing ρ. For example, for θ = 0.01 and ρ = 200, the probability of drawing a sample with within 1% of the truth reaches 1.00, compared with a probability of 0.54 when using alone. On the other hand, for a fixed ρ, the accuracy of generally begins to diminish with increasing *c*. One possible explanation for this trend is that larger samples may have more subtle shapes to their likelihood curves, such as multiple turning points. The ASF effectively has to fit a curve with at most one turning point to approximate with high accuracy for large ρ. This might be at the expense of accuracy for smaller ρ (see also Figure 9 later).

#### Signed relative error:

Using the unsigned relative error as a measure of accuracy loses information on the question of whether the ASF systematically under- or overestimates the true sample probability. To investigate this, we calculated the expected signed relative error across all configurations for each combination of parameters considered above. The signed relative errors of , and are defined similarly. The results are given in Tables 1 and 2. Although most of these expectations are positive, they are generally very small, and all fall within ±1.25%. This suggests that if deviations from the true likelihood tend to err in a particular direction, then this skewness is negligible.

#### Monomorphic samples:

One might argue that excluding nonsegregating sites from the analyses above is unwarranted, since the observation that a site is nonsegregating also contains information about the parameters of the model. We therefore reanalyzed the distribution of the unsigned relative error across all samples of size *c* = 20 (with ρ = 50), rather than just dimorphic samples. Results are shown in Figure 5.

Compare Figure 5 with Figure 2 (note the different scales on the *x*-axes). Clearly, across all samples there is now a much greater level of accuracy, particularly when the mutation rate is small. This may be explained by the following two observations. First, the probability that one or both loci are monomorphic is relatively large: for θ = 0.01 it is >0.99 and for θ = 1.0 it is 0.44. Hence, these configurations will have a substantial effect on the error distribution, particularly for smaller mutation rates. Second, the ASF seems to be much more accurate precisely for those configurations with one or both loci monomorphic. This raises the question: Can we identify any further patterns for which the ASF is more accurate? When the true likelihood is unavailable, this might provide a guide as to how accurate the ASF is likely to be.

#### For which samples is the ASF most accurate?

The figures described above suggest that for most configurations the relative error of the ASF will be very small, with a few having larger errors. It would be of interest to determine whether there is any feature of a given sample configuration that indicates how accurate the ASF will be. Figure 6 plots on a log-scale the unsigned relative error of the ASF of a sample against its sampling probability , across all configurations of sample size *c* = 20 (with θ = 0.01 and θ = 1, and ρ = 50). Encouragingly, there is a clear negative correlation between unsigned relative error and sampling probability (Pearson's correlation coefficient = −0.76 for Figure 6a and −0.61 for Figure 6b). That is, those configurations for which the ASF is most accurate are also observed most often. However, the point of this comparison is to determine which configurations have a more accurate ASF *without* having to calculate . We therefore identified the following categories for the samples that appear to best explain most of the variation in accuracy (the category of each sample is annotated in Figure 6):

▿, the sample is monomorphic at both loci.

▵, the sample is monomorphic at one locus.

○, for at least one allele at a locus, the sample has multiplicity one;

*i.e.*, a row or column of**c**sums to one.+, the sample has very low linkage disequilibrium (LD) as measured by

*r*^{2}(Hudson 2001b), satisfying*r*^{2}≤ 0.02. (This LD measure is also often denoted by Δ^{2}.)◊, the sample is in perfect LD;

*i.e.*, it is of the form or , where*x*+*y*=*c*, for*x*≥ 0 and*y*≥ 0.□, the sample is in perfect LD except for one haplotype;

*i.e.*, it is of the form , , , or , where*x*+*y*=*c*− 1, for*x*≥ 0 and*y*≥ 0.•, the sample does not fall into any of the above categories.

Some configurations fell into more than one category, in which case the higher category in this list was given priority.

Figure 6a confirms that samples monomorphic at one or both loci have among the highest sampling probabilities and the highest accuracies. This is followed by another clearly identifiable cluster of samples of high accuracy, which have in common that one allele has multiplicity one. For the remaining samples, the amount of LD seems to be highly informative of the accuracy of the ASF; for example, at the upper end of the error distribution are those in perfect or almost perfect LD. We do not have an intuitive explanation for this observation. Most of the remaining samples fall around a cluster of intermediate accuracy (∼1%), with the samples with lowest LD at the center of this cluster. For higher mutation rates, illustrated in Figure 6b, we observed a similar, albeit less distinguished, pattern. When mutation rates are higher, the difference in probability between any two sample configurations tends to be attenuated. We also observed qualitatively similar patterns for different choices of sample size and recombination rate (not shown).

#### A quick approximation to the ASF:

Recall that contains a term—namely, —that needs to be evaluated recursively using dynamic programming. (See Theorem 2.) This imposes a small but growing (with sample size) computational burden on the calculation of . Empirical tests (for example, Figure 2) suggested that simply ignoring leads to almost no loss in accuracy. It is worth addressing the contribution of this term in more detail. We measure its contribution by the following quantity:(11)As before, we weight this contribution by the sampling probability of each configuration. Its distribution is shown in Figure 7, for dimorphic samples of size *c* = 20 and ρ = 50.

It is clear from Figure 7 that for these parameter values, using in place of affects the result by no more than 1%, and with a high probability the effect is <0.2%. Indeed, when θ = 1, the contribution is <0.2% with an estimated probability of 1. These results are encouraging. They suggest that we may safely discard the term without a considerable loss of accuracy. This also seems to hold for larger sample sizes. For all samples of size *c* = 100, one can still calculate the contribution (11), although it is now computationally impractical to weight each sample by its true probability . Instead we looked at the *number* of sample configurations within a given range of contributions. For example, when ρ = 50 and θ = 0.01, the contribution (11) was within the range ± 0.5% for 93.2% of dimorphic samples of size *c* = 100. When θ = 1, 95.7% of dimorphic sample configurations were within this range. Thus, even for larger sample sizes, when calculating the term becomes impractical, using the approximation seems to be an attractive option. For the infinite-alleles model, we (Jenkins and Song 2009) have obtained a close upper bound to estimate the contribution of this term whenever calculating it directly is computationally intractable.

## DISCUSSION

Computing likelihoods under the coalescent with recombination is a challenging problem that has received much attention in the past. Here, building on our recent work (Jenkins and Song 2009), we provided an alternative perspective on the likelihood computation and obtained analytic results for the two-locus model with an arbitrary finite-alleles model of mutation. As mentioned in the Introduction, a stochastic process may exist that is simpler than the well-established coalescent with recombination that captures the important dynamics of the relevant degrees of freedom as the rate of recombination gets large. We think that the fact that the first-order term [see (8)] is so simple supports this speculation. For θ = 1, the Ewens sampling formula for the one-locus infinite-alleles model reduces to the classic formula for the joint distribution of cycle counts in a random permutation (see Arratia *et al.* 2003 for a discussion of this and other combinatorial connections). It would be interesting to provide such a combinatorial interpretation of . Further, we believe that finding a closed-form expression for the first-order term for more than two loci might be possible.

As a concrete application of our asymptotic sampling formula, one can analytically compute, for large values of ρ, the expectation of linkage disequilibrium measures such as *r*^{2}. Song and Song (2007) showed analytically that the populationwide expectation of *r*^{2} approaches 1/ρ as ρ → ∞. For a finite sample size *n*, Hill and Weir (1994) derived for the samplewise expectation 𝔼[*r*^{2}] of *r*^{2}. This result was obtained under restrictive assumptions that the populationwide marginal allele frequencies are all and that heterozygote superiority holds at each locus. For a given sample size *n*, it would be interesting to use our asymptotic sampling formula to derive a general asymptotic expression for the samplewise expectation 𝔼[*r*^{2}].

Below we discuss in detail two other concrete applications of our sampling formula. First, we discuss how we can improve the composite-likelihood method (Hudson 2001a). Then, we provide the first analytic result on the classification of the MLE of ρ.

#### Composite-likelihood method:

LDhat (McVean *et al.* 2002, 2004) is a software package aimed at using patterns of variation in population genetic data to estimate fine-scale recombination rates. The procedure is efficient and has been applied to genomewide estimates of the human genetic map (Myers *et al.* 2005; International Hapmap Consortium 2007) (henceforth, we assume that recombination events are always resolved as crossovers and ignore the effects of homologous gene conversion). Part of its efficiency stems from the use of the pairwise composite likelihood first proposed by Hudson (2001a). For a given genetic map, LDhat employs a Bayesian reversible-jump Markov chain Monte Carlo (rjMCMC) procedure to propose updates to this map and to sample from an approximate posterior distribution (the approximation resulting from the use of a pseudolikelihood defined below). To calculate the acceptance probabilities of proposed moves, one must calculate the likelihood of the data. To do so exactly (*e.g.*, by solving a multilocus version of Equation 5) or even by a computationally intensive full-likelihood procedure such as importance sampling (Fearnhead and Donnelly 2001) is highly impractical for the sizes of data sets of interest. To circumvent this problem, Hudson (2001a) proposed the *composite-likelihood* method and showed that it works very well as an estimator for ρ. This method uses two-locus likelihoods as building blocks and LDhat uses the following variation on the theme: Suppose that the input data set *D* contains *m* polymorphic sites, and let *D _{ij}* denote the data set restricted to the two sites

*i*and

*j*. Then, the composite likelihood is defined as(12)In this formula, ρ is assumed to be constant across the region and the probability associated with each pair uses a rescaled value for ρ to account for the physical distance between the pair. It is straightforward to modify this definition to allow for variable recombination rates, as implemented in LDhat. The composite-likelihood approximation treats the collection {

*D*:

_{ij}*i*<

*j*and

*i*,

*j*= 1, …,

*m*} as independent observations. Of course, in reality these observations are highly dependent through their shared ancestry, but often in practice little information about the MLE of ρ is lost. The key to this approach is that it is much more straightforward to calculate each term in the product (12) than it is to calculate the full likelihood, since now one has to deal only with two sites at any one time. Hudson (2001a) considered an infinite-sites model in the limit as θ → 0, conditional on the observed variation. McVean

*et al.*(2002) extended this approach to finite-alleles models and finite θ and used the importance sampling scheme of Fearnhead and Donnelly (2001) to estimate each of these two-site likelihoods. This importance sampling stage is the most computationally intensive part of the LDhat estimation procedure. By considering all possible pairwise sample configurations of a given size, it is possible to precompute these likelihoods, and indeed the LDhat software package is accompanied by precomputed lookup tables for various sample sizes and mutation rates. For each sample configuration and mutation rate, the likelihood is calculated over a grid of ρ-values known as the

*driving values*; the default choice is over the integers ρ = 0, 1,…, 100. Genetic distances are then rounded to the nearest integer. For pairs of sites whose genetic distance is >100, the two-site likelihood simply truncates ρ to 100 and uses this precomputed value for the likelihood instead.

Here, we investigate the effect that this truncation has on the two-site likelihood. Since the ASF applies to the diallelic, symmetric mutation model employed by LDhat, an obvious alternative to truncation is to use the ASF. As ρ increases, we expect the truncation procedure to introduce more error, while the ASF becomes more accurate. Furthermore, the ASF is a continuous function of ρ and so no further error is introduced by rounding to the nearest driving value of ρ. Assuming a recombination rate of 1 cM/Mb (Kong *et al.* 2002) and a human diploid effective population size of 10,000 (Myers *et al.* 2005), a cutoff of ρ = 100 corresponds to a marker separation of ∼250 kb. In the presence of recombination hotspots this distance could be much lower. The study of Myers *et al.* (2005) breaks up genomewide data into windows of 2000 SNPs, using a data set with an average marker spacing of 1.9 kb (Hinds *et al.* 2005). A rough approximation to the mean window size is thus ∼3.8 Mb; pairs of sites separated by a distance >ρ = 100 are therefore encountered frequently by the rjMCMC procedure.

Using its estimate of the likelihood at ρ = 100, LDhat encounters two sources of error: one from the truncation of the range of ρ and one from the stochastic error of the importance sampling procedure. To restrict our attention to the first of these sources, we find a lower bound on the error of LDhat by looking at the quantitywhere *L*(ρ) is the *true* likelihood obtained previously by solving Equation 5. The true likelihood at ρ = 100 will of course be at least as accurate an approximation as that reported by LDhat. Table 3 reports the mean unsigned relative error one would obtain using *L*(100) in place of *L*(ρ), for various choices of ρ and various choices of mutation rates used by the lookup tables in LDhat. For comparison we also show the unsigned relative error of . In Figure 8, we also show the complete distribution of these errors for one set of parameters in the table, namely, *c* = 30 and θ = 0.002. (Note that this choice of mutation rate would be reported as θ = 0.001 in LDhat, since it uses the mutation transition matrix at each locus instead of used here.) Since ρ is not fixed throughout the rjMCMC procedure, it is not clear in this setting how to interpret the idea of drawing a sample configuration at random. Here, we therefore take a simple arithmetic mean unsigned relative error across all dimorphic samples, rather than weighting each configuration by its sampling probability as before.

As is clear from Table 2, using *L*(100) as a proxy for *L*(ρ) when ρ > 100 rapidly becomes very inaccurate with increasing ρ. Even for these moderate sample sizes, there may be a considerable amount of information in the sample so that the likelihoods at ρ = 100 and ρ = 200, say, differ substantially. For the sample sizes considered here, using the ASF instead of LDhat becomes preferable very soon above ρ = 100 and certainly before ρ = 150. For larger samples the critical point at which the ASF becomes preferable may be somewhat higher. Also, note that the error in both methods seems to be highly robust to the choice of mutation rate, at least for the set of realistic choices shown. For both methods, there is a slight decrease in error rate with increasing mutation rate.

Figure 8 confirms the wider distribution of errors when using *L*(100) to estimate *L*(200) rather than using *L*_{ASF}(200). Since we plot here the signed relative error rather than the unsigned relative error, we are also able to observe that the distribution of errors is not symmetric about zero; there is much more probability mass on nonnegligible positive relative errors than on negative relative errors. This suggests that LDhat could systematically overestimate the likelihood of pairs of markers for which a large recombination rate is proposed, biasing recombination rate estimates upward; proposed alterations to the genetic map that increase the recombination rate in a region will have erroneously inflated acceptance rates. When using *L*_{ASF}(ρ), the asymmetry in errors is much smaller.

#### Classification of the MLE of ρ:

The ASF can be used to make inferences about whether or not the MLE of ρ is finite. Intuitively, when , the asymptotic approximation approaches from above as ρ → ∞. Moreover, by virtue of it being an asymptotic approximation, for large enough ρ it must be sufficiently accurate to infer that must also approach from above. We therefore have the following theorem, for which a formal proof may be given via a short analysis argument (in the appendix):

Theorem 3. *For a given sample configuration* , if the first-order term [*see* (8)] *in the asymptotic expansion satisfies**then the MLE of* ρ *is finite*.

If one-locus marginal sampling probabilities are known, as in the infinite-alleles case or the PIM case, then is instantaneous to compute. To our knowledge, this is the first analytic result on the classification of the MLE of ρ. Note, however, that this is a sufficient condition, but not a necessary condition. That is, this condition does not identify *all* samples with a finite MLE. Indeed, it is tempting to argue for the converse to the proposition above—namely, that implies an infinite MLE—but this is false, as the following simple counterexample demonstrates.

Consider the sample configuration determined by , and suppose θ = 0.01. It is straightforward to verify that but that ρ_{0} exists with *L*(ρ_{0}) > *L*(∞). The likelihood curve for this configuration can be found by solving Equation 5 directly and is illustrated in Figure 9. Although the curve approaches the asymptote *L*(∞) from below as ρ → ∞ [as one expects when ], it actually exhibits a local minimum at ρ = 1608, as well as a local maximum farther down at ρ = 74. In this example the likelihood curve is extremely flat, but minima with higher curvature may be found for larger sample sizes. To our knowledge, this is the first confirmation of a local minimum in a likelihood curve for ρ.

## APPENDIX

The proofs given here are similar in structure to our previous proofs (Jenkins and Song 2009) for the infinite-alleles model of mutation. It turns out that much of the argument is independent of the particular model of mutation assumed. Therefore, we give only an outline here and refer the reader to that article for further details.

*Proof of Proposition* 1. The infinite-alleles case was proved in Jenkins and Song (2009), so we focus on the finite-alleles case here. First, assume *c* > 0. One can obtain a recursion satisfied by by substituting our asymptotic expansion (6) into (5), dividing by ρ*c*, and letting ρ → ∞. We obtain(A1)By repeated application of (A1) this becomes(A2)which of course also holds when *c* = 0. Now, by substituting (6) into (5) with and then letting ρ → ∞, we obtain [after some manipulation, invoking (A2) to eliminate terms for which *c* > 0](A3)Boundary conditions are for *i* = 1, …, *K* and for *j* = 1, …, *L*. It is straightforward to verify that the solution to (A3) is(A4)where and are marginal one-locus sampling probabilities for loci *A* and *B*, respectively. Together, (A2) and (A4) give the desired result.▪

*Proof of Theorem* 1. Again, the infinite-alleles case was proved in Jenkins and Song (2009), so we focus on the finite-alleles case here. Substitute the asymptotic expansion (6) into the recursion (5), eliminate terms of order ρ by applying (A1), and let ρ → ∞. After invoking (A3) and (7), with some rearrangement we obtain(A5)where(A6)Above we assumed *c* > 0; we define if .

Although at first sight somewhat daunting, Equation A5 can be solved via a simple probabilistic interpretation of the coefficients of the second term on the right-hand side. In particular, this term can be interpreted as the expected value of after removing a gamete from **c** uniformly at random and redistributing it to **a** and **b**. After this removal, the term is given recursively by discarding further gametes. More generally, we (Jenkins and Song 2009) showed that(A7)where is the random subsample obtained by selecting *m* gametes from the sample **c** uniformly without replacement; that is, it is distributed according to a multivariate hypergeometric distribution with parameters . Further, we define and . The expectations in (A7) are easy to compute and to sum over *m*, resulting in (8).▪

*Proof of Theorem* 2. In a similar manner to the proof of Theorem 1, one can show that(A8)where has a known, but rather cumbersome, closed-form expression. Since these expectations can be evaluated, the sum composes the closed-form part of ; *i.e.*,Unlike the form of , that of depends on the model of mutation. We repeated the steps in Jenkins and Song (2009) using Equation 5—rather than Golding's recursion (Golding 1984), which applies to the infinite-alleles model—to obtain the following expression for . We omit the simple but lengthy algebraic details.

Let *Q ^{A}* denote ,

*Q*denote ,

_{i}^{A}*Q*denote , and so on. Then,The final term, , is a function whose form depends on the model of mutation. Using

_{ik}^{A}*Q*

_{i,+t}

^{A}to denote , and so on, we obtainwhere δ

_{ij}denotes the Kronecker delta, and in a PIM model this simplifies to For completeness, we also give the corresponding expression for the infinite-alleles model,(Jenkins and Song 2009).

We now outline the steps to obtain the recursion shown in (10). One can use Equation A8 to show that(A9)After substituting the asymptotic expansion (6) with into (5) and comparing coefficients of terms proportional to 1/ρ^{2}, one obtains the recursionwith boundary conditions for *i* = 1, …, *K* and for *j* = 1, …, *L*. By appealing to (A9), this can be simplified to the form given in Equation 10.▪

*Proof of Theorem* 3. Since is an asymptotic approximation to , by definitionHence, using , we conclude that ρ′ exists such that, for all ρ′ ≤ ρ < ∞,which implies for ρ′ ≤ ρ < ∞. Since , this implies that the MLE for ρ must be finite.▪

## Acknowledgments

We thank Fulton Wang for writing a computer program that numerically solves two-locus recursions and Rasmus Nielsen and Monty Slatkin for helpful comments. This research is supported in part by a National Institutes of Health grant R00-GM080099 (to Y.S.S.), an Alfred P. Sloan Research Fellowship (to Y.S.S.), and a Packard Fellowship for Science and Engineering (to Y.S.S.).

## Footnotes

Communicating editor: J. Wakeley

- Received July 30, 2009.
- Accepted September 4, 2009.

- Copyright © 2009 by the Genetics Society of America