Closed-Form Two-Locus Sampling Distributions: Accuracy and Universality
Paul A. Jenkins, Yun S. Song

Abstract

Sampling distributions play an important role in population genetics analyses, but closed-form sampling formulas are generally intractable to obtain. In the presence of recombination, there is no known closed-form sampling formula that holds for an arbitrary recombination rate. However, we recently showed that it is possible to obtain useful closed-form sampling formulas when the population-scaled recombination rate ρ is large. Specifically, in the case of the two-locus infinite-alleles model, we considered an asymptotic expansion of the sampling formula in inverse powers of ρ and obtained closed-form expressions for the first few terms in the expansion. In this article, we generalize this result to an arbitrary finite-alleles mutation model and show that, up to the first few terms in the expansion that we are able to compute analytically, the functional form of the asymptotic sampling formula is common to all mutation models. We carry out an extensive study of the accuracy of the asymptotic formula for the two-locus parent-independent mutation model and discuss in detail a concrete application in the context of the composite-likelihood method. Furthermore, using our asymptotic sampling formula, we establish a simple sufficient condition for a given two-locus sample configuration to have a finite maximum-likelihood estimate (MLE) of ρ. This condition is the first analytic result on the classification of the MLE of ρ and is instantaneous to check in practice, provided that one-locus probabilities are known.

FOR a given population genetics model, the probability of observing a sample of DNA sequences plays a fundamental role in various applications, including parameter estimation and ancestral inference. However, except in the one-locus case with a special model of mutation such as the infinite-alleles model or the finite-alleles parent-independent mutation (PIM) model, closed-form sampling formulas are generally unknown. In particular, when recombination is involved, obtaining an analytic formula for the sampling distribution has so far remained an intractable problem, and most research has focused on developing Monte Carlo methods, such as importance sampling (Griffiths and Marjoram 1996; Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Griffiths et al. 2008) and Markov chain Monte Carlo (Kuhner et al. 2000; Nielsen 2000; Wang and Rannala 2008). These Monte Carlo methods have led to useful tools for population genetics analysis, but they tend to be computationally intensive—in some cases prohibitively so—and it is difficult to provide a theoretical characterization of their accuracy. In the case of the infinite-sites model of mutation, Lyngsø et al. (2008) recently proposed a new approach to compute likelihoods using ideas from parsimony, but that method is not scalable.

The main mathematical framework that underlies the above-mentioned computational methods is the coalescent with recombination (Griffiths 1981; Kingman 1982a,b; Hudson 1983), which models the genealogical history of sample chromosomes. When the rate of recombination is large, the genealogies sampled by Monte Carlo methods are typically very complicated, containing many recombination events. In contrast to this increase in complexity in the coalescent, however, we expect the dynamics to become easier to study for large recombination rates, since the loci under consideration are then less dependent. It seems plausible that there exists a stochastic process, simpler than the standard coalescent with recombination, that describes the dynamics of the relevant degrees of freedom for large recombination rates. To study whether such a "dual" description of weakly interacting loci might exist, we (Jenkins and Song 2009) recently revisited the problem of obtaining a closed-form sampling formula for a model with recombination and obtained useful analytic results in the case that the population-scaled recombination rate ρ is large. More precisely, we considered the diffusion limit of the two-locus infinite-alleles model with population-scaled mutation rates θA and θB at the two loci. For a given sample configuration n (defined later in the text), we found the asymptotic expansion of the sampling probability $q(\mathbf{n})$ in inverse powers of ρ, $q(\mathbf{n}) = q_0(\mathbf{n}) + \frac{q_1(\mathbf{n})}{\rho} + \frac{q_2(\mathbf{n})}{\rho^2} + O\!\left(\frac{1}{\rho^3}\right)$ (1), where q0, q1, and q2 are independent of ρ. The zeroth-order term $q_0(\mathbf{n})$ corresponds to the completely unlinked case, given by a product of marginal one-locus sampling distributions found by Ewens (1972). Our main contribution was to derive a closed-form formula for the first-order term $q_1(\mathbf{n})$ and to show that the second-order term $q_2(\mathbf{n})$ can be decomposed into two parts, one for which we obtained a closed-form formula and the other satisfying a simple recursion that can be evaluated easily using dynamic programming.

The goal of this article is to generalize the above result to an arbitrary finite-alleles model of mutation and to study the accuracy of our asymptotic sampling formula (ASF). The main theoretical result presented here is that the functional form of our ASF is universal in the following sense: For all mutation models, the first two terms q0 and q1 in the expansion can be expressed as linear combinations of products of marginal one-locus probabilities, with coefficients depending on the sample configuration but not on mutation parameters. Hence, q0 and q1 for different mutation models can be obtained simply by substituting the corresponding one-locus sampling probabilities into the formulas. Whether this universality property extends to higher-order terms in the expansion is an open question. As in the infinite-alleles case, the second-order term q2 for an arbitrary finite-alleles mutation model can be decomposed into two parts, one for which we obtain a closed-form formula and the other that satisfies a simple recursion relation. Since we are not able to obtain a closed-form solution to the recursive part, we do not know whether the entire second-order term q2 will obey the universality property.

We carry out an extensive study of the accuracy of our ASF for the two-locus PIM model, in which case we can numerically solve for the true sampling probabilities for small sample sizes. Since any diallelic recurrent mutation model can be reduced to a PIM model (reviewed below), our results should have practical applications. We discuss a concrete application in the context of the composite-likelihood method (Hudson 2001a), which uses two-locus sampling probabilities as the basic building blocks. LDhat (McVean et al. 2002, 2004), a widely used software package for estimating recombination rates, is based on this composite-likelihood approach. LDhat assumes the symmetric diallelic model of mutation and relies on the importance sampling scheme developed by Fearnhead and Donnelly (2001) for the coalescent with recombination, to generate exhaustive lookup tables containing two-locus sampling probabilities for all inequivalent diallelic sample configurations. This process of generating exhaustive lookup tables is very computationally expensive. The developers of LDhat suggest using a range of integral ρ-values from 0 to 100, in which case the two-locus sampling probability for ρ > 100 is approximated by that at ρ = 100. An alternative to this truncation approach is to use our ASF for large ρ-values. In this article, we examine the effect that the truncation used by LDhat has on the two-locus sampling probability.

Another application of our work is classification of the maximum-likelihood estimate (MLE) of the recombination rate ρ. Specifically, we establish a simple sufficient condition for a given two-locus sample configuration to have a finite MLE of ρ. To our knowledge, this is the first analytic result on the classification of the MLE of ρ. Further, the sufficient condition is instantaneous to check, provided that one-locus probabilities are known.

We have written a C++ program, called ASF, that computes the first three terms in the expansion (1) for either the infinite-alleles model or the finite-alleles PIM model. This program is publicly available at http://www.eecs.berkeley.edu/~yss/software.html.

REVIEW OF ONE-LOCUS SAMPLING DISTRIBUTIONS

Here we briefly review two mutation models for which one-locus sampling formulas are known in closed form. As mentioned above, our two-locus sampling formula is given in terms of one-locus sampling formulas.

Notation:

The sample configuration at a locus is denoted by a vector $\mathbf{n} = (n_1, \ldots, n_K)$, where ni denotes the number of gametes with allele i at the locus. Note that K takes on slightly different meanings depending on the assumed model of mutation. In an infinite-alleles model, K refers to the number of observed alleles, whereas in a finite-alleles model it refers to the number of possible alleles in the model. We use n to denote the total sample size, $n = \sum_{i=1}^{K} n_i$. The probability of an unordered sample configuration is denoted by $\tilde{q}(\mathbf{n})$; the dependence on parameters is not shown for ease of notation.

Let 𝒜n denote an ordered configuration of n sequentially sampled gametes such that the corresponding unordered configuration is given by n. By exchangeability, the probability of 𝒜n is invariant under permutations of the sampling order, so we can write the probability of an ordered sample as $q(\mathbf{n})$ without ambiguity; again, we suppress the dependence on parameters. In what follows, our theoretical results are presented for ordered samples.

Throughout, we consider the diffusion limit of a neutral haploid exchangeable model of random mating with constant population size 2N. Let u denote the probability of mutation at a locus per gamete per generation. Then, in the diffusion limit, N → ∞ and u → 0 with the population-scaled mutation rate θ = 4Nu held fixed. Mutation events occur according to a Poisson process with rate θ/2.

Finally, given a nonnegative real number x and a positive integer n, we use $(x)_n := x(x + 1) \cdots (x + n - 1)$ to denote the nth ascending factorial of x.
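The ascending factorial appears throughout the sampling formulas below; a minimal Python sketch:

```python
def ascending_factorial(x, n):
    """Return (x)_n = x (x + 1) ... (x + n - 1), with (x)_0 = 1 by convention."""
    result = 1.0
    for k in range(n):
        result *= x + k
    return result
```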

Infinite-alleles model:

Under the infinite-alleles model, any two gametes can be compared to determine whether or not they have the same allele, but when the alleles differ it is not possible to determine how they are related. Allelic labels are therefore arbitrary. Each mutation gives rise to a new allele that has never been seen before in the population. For this model, the sampling distribution of an ordered sample is given by $q(\mathbf{n}) = \frac{\theta^{K}}{(\theta)_n} \prod_{i=1}^{K} (n_i - 1)!$ (2) (Ewens 1972). The probability $\tilde{q}(\mathbf{n})$ of an unordered sample configuration is related to $q(\mathbf{n})$ as $\tilde{q}(\mathbf{n}) = \frac{n!}{\prod_{i=1}^{K} n_i! \prod_{j=1}^{n} \alpha_j!}\, q(\mathbf{n})$, where αj denotes the number of allele types represented j times; i.e., $\alpha_j = |\{i : n_i = j\}|$.
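Equation 2, the standard Ewens formula for an ordered sample, is immediate to evaluate; a Python sketch:

```python
from math import factorial

def ewens_ordered(theta, counts):
    """Ewens (1972) ordered-sample probability under the infinite-alleles
    model: theta^K * prod_i (n_i - 1)! / (theta)_n, where `counts` lists
    the multiplicities of the K observed alleles."""
    n, K = sum(counts), len(counts)
    numerator = theta ** K
    for ni in counts:
        numerator *= factorial(ni - 1)
    denominator = 1.0
    for k in range(n):
        denominator *= theta + k  # builds the ascending factorial (theta)_n
    return numerator / denominator
```

For a sample of two gametes, the two cases (same allele, different alleles) have probabilities 1/(θ + 1) and θ/(θ + 1), which sum to 1, as they must.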

Finite-alleles PIM model:

Assume that allelic changes are described by a Markov chain with transition matrix $P = (P_{ij})$, with Pij being the probability of transition from type i to type j. For a given sample configuration $\mathbf{n} = (n_1, \ldots, n_K)$, the ordered sampling probability $q(\mathbf{n})$ satisfies the recursion $Math$ (3), with boundary conditions $q(\mathbf{e}_i) = \pi_i$, for i = 1, …, K, where $\mathbf{e}_i$ is a K-dimensional unit vector with a 1 at the ith entry and $\boldsymbol{\pi} = (\pi_1, \ldots, \pi_K)$ is the stationary distribution corresponding to P. See, for example, Lundstrom et al. (1992) and Griffiths and Tavaré (1994) for ways to obtain (3) [Griffiths and Tavaré's $Math$ corresponds to our $Math$]. A general solution to this recursion is not known, but a closed-form solution exists for the parent-independent mutation model. As its name suggests, the PIM model satisfies Pij = Pj for all i, and the stationary distribution of the mutation chain is $\boldsymbol{\pi} = (P_1, \ldots, P_K)$. (In some sense the infinite-alleles model may be viewed as a PIM model. However, henceforth in this article we take "PIM" to mean a finite-alleles model whose transition matrix satisfies the above condition.) Further, the stationary sampling probability of an ordered configuration is given by $q(\mathbf{n}) = \frac{1}{(\theta)_n} \prod_{i=1}^{K} (\theta P_i)_{n_i}$ (4) (Wright 1949), while the probability of an unordered sample configuration is related to $q(\mathbf{n})$ by $\tilde{q}(\mathbf{n}) = \frac{n!}{\prod_{i=1}^{K} n_i!}\, q(\mathbf{n})$.
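Wright's formula (4) is likewise a one-line computation; a Python sketch, taking P as the vector (P1, …, PK) of parent-independent transition probabilities:

```python
def pim_ordered(theta, P, counts):
    """Wright (1949) ordered-sample probability under the K-allele PIM
    model: prod_i (theta * P_i)_{n_i} / (theta)_n."""
    def asc(x, m):  # ascending factorial (x)_m
        r = 1.0
        for k in range(m):
            r *= x + k
        return r
    numerator = 1.0
    for Pi, ni in zip(P, counts):
        numerator *= asc(theta * Pi, ni)
    return numerator / asc(theta, sum(counts))
```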

The diallelic case (i.e., with K = 2) is of particular interest. The mutation process is specified by $\theta(P - I)$, where I is the identity matrix, so two models with the same $\theta(P - I)$ are in fact equivalent. Hence, any diallelic model with mutation parameter θDiallelic and transition matrix $\begin{pmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{pmatrix}$ can be transformed into an equivalent PIM model with mutation parameter θPIM = θDiallelic(P12 + P21) and transition matrix $\begin{pmatrix} P_1 & 1 - P_1 \\ P_1 & 1 - P_1 \end{pmatrix}$, where P1 = P21/(P12 + P21).

The above equivalence can be checked algebraically, but the intuition behind it is perhaps more illuminating. The diagonal entries Pii represent transitions from allele i back to the same allele i and are effectively invisible in the genealogy. These "invisible" mutations provide an extra degree of freedom in the transition matrix, and when the matrix is 2 × 2 this is sufficient to force it into parent-independent form. The appropriate choice for the rate of invisible mutations is to set P11 = P21 and P22 = P12, which makes the entries in each column of P equal (i.e., parent independence). One then normalizes each entry by P12 + P21 (adjusting θDiallelic correspondingly) so that P remains a stochastic matrix. This explains the expressions for θPIM and P1 given above.
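The conversion described above can be packaged in a short routine; a Python sketch for the diallelic case:

```python
def diallelic_to_pim(theta_diallelic, P12, P21):
    """Map a diallelic mutation model (off-diagonal transition
    probabilities P12, P21) to the equivalent PIM parameterization
    (theta_PIM, (P1, P2))."""
    theta_pim = theta_diallelic * (P12 + P21)
    P1 = P21 / (P12 + P21)
    return theta_pim, (P1, 1.0 - P1)
```

For the symmetric model with P12 = P21 = 1, this returns θPIM = 2θDiallelic and P1 = P2 = 1/2, the parameterization used in the accuracy study below.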

ASYMPTOTIC SAMPLING FORMULA FOR THE TWO-LOCUS MODEL

We now consider two-locus models with recombination. We denote the two loci by A and B and use θA and θB to denote the respective population-scaled mutation rates. In the case of an infinite-alleles model, we use K and L to denote the number of distinct allelic types observed at locus A and locus B, respectively, while for a finite-alleles model, K and L denote the number of possible allelic types at the two loci. The population-scaled recombination rate is denoted by ρ = 4Nr, where r is the probability of recombination between the two loci per gamete per generation. In the diffusion limit, N → ∞ and r → 0, while ρ is held fixed.

Extended sample configuration for two loci:

To obtain a closed system of recursion relations satisfied by two-locus sampling probabilities, the type space must be extended to allow some gametes to be specified at only one of the two loci. The two-locus sample configuration is denoted by $\mathbf{n} = (\mathbf{a}, \mathbf{b}, \mathbf{c})$, where $\mathbf{a} = (a_1, \ldots, a_K)$ with ai being the number of gametes with allele i at locus A and unspecified alleles at locus B, $\mathbf{b} = (b_1, \ldots, b_L)$ with bj being the number of gametes with unspecified alleles at locus A and allele j at locus B, and $\mathbf{c} = (c_{ij})$ is a K × L matrix with cij being the multiplicity of gametes with allele i at locus A and allele j at locus B. Throughout, we write $a = \sum_{i=1}^{K} a_i$, $b = \sum_{j=1}^{L} b_j$, $c = \sum_{i=1}^{K} \sum_{j=1}^{L} c_{ij}$, and $n = a + b + c$. We also use $\mathbf{c}_A$ and $\mathbf{c}_B$, with entries $(\mathbf{c}_A)_i = \sum_j c_{ij}$ and $(\mathbf{c}_B)_j = \sum_i c_{ij}$, to denote the marginal sample configurations of c restricted to locus A and locus B, respectively. It is important to emphasize the distinction between the vectors a and b, which represent gametes with alleles specified at only one of the loci, and the vectors $\mathbf{c}_A$ and $\mathbf{c}_B$, which represent the one-locus marginal configurations of gametes with both alleles observed. Even if we observe both alleles of every gamete in a sample [in the form $(\mathbf{0}, \mathbf{0}, \mathbf{c})$], the vectors a and b are still needed when we think about the sample's ancestry. An ancestral gamete might transmit genetic material to extant gametes in the sample at only one of the two loci, whereupon we use a or b to avoid specifying the allele at the nonancestral locus.
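The bookkeeping implied by this notation is straightforward; a Python sketch that computes the marginal one-locus configurations (a plus the locus-A marginal of c, and b plus the locus-B marginal of c) from a configuration (a, b, c):

```python
def one_locus_marginals(a, b, c):
    """Given a two-locus configuration (a, b, c), with c a K x L matrix
    stored as a list of rows, return the marginal one-locus
    configurations for locus A and locus B."""
    cA = [sum(row) for row in c]         # row sums: c restricted to locus A
    cB = [sum(col) for col in zip(*c)]   # column sums: c restricted to locus B
    return ([ai + ci for ai, ci in zip(a, cA)],
            [bj + cj for bj, cj in zip(b, cB)])
```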

Two-locus recursion:

We use $q(\mathbf{a}, \mathbf{b}, \mathbf{c})$ to denote the sampling probability of an ordered sample with configuration $(\mathbf{a}, \mathbf{b}, \mathbf{c})$. As in the one-locus case, we suppress the dependence on parameters. Golding (1984) considered generalizing the infinite-alleles model to include recombination and constructed a recursion relation satisfied by $q(\mathbf{a}, \mathbf{b}, \mathbf{c})$ at stationarity in the diffusion limit. Ethier and Griffiths (1990) later undertook a more mathematical analysis of the model and provided several theoretical results. For a finite-alleles model with transition matrices $P^A$ and $P^B$ at the two loci, one can show that $q(\mathbf{a}, \mathbf{b}, \mathbf{c})$ satisfies the following recursion at stationarity, $Math$ (5), with boundary conditions $Math$ for all i ∈ {1, … , K} and $Math$ for all j ∈ {1, … , L}, where $\boldsymbol{\pi}^A$ and $\boldsymbol{\pi}^B$ are the stationary distributions corresponding to $P^A$ and $P^B$, respectively. There are several ways to derive this recursion. One can argue directly by considering the probabilities of the most recent event going backward in time in the coalescent with recombination. Alternatively, the recursion can be obtained from the underlying diffusion process (Ethier and Griffiths 1990; Griffiths and Tavaré 1994).

Asymptotic sampling formula:

In the case of the two-locus infinite-alleles model with large ρ, we (Jenkins and Song 2009) found an asymptotic expansion of the form $q(\mathbf{a}, \mathbf{b}, \mathbf{c}) = q_0(\mathbf{a}, \mathbf{b}, \mathbf{c}) + \frac{q_1(\mathbf{a}, \mathbf{b}, \mathbf{c})}{\rho} + \frac{q_2(\mathbf{a}, \mathbf{b}, \mathbf{c})}{\rho^2} + O\!\left(\frac{1}{\rho^3}\right)$ (6), where q0, q1, and q2 are independent of ρ. Below we generalize this work to an arbitrary finite-alleles mutation model. Our closed-form formulas are given in terms of the marginal one-locus sampling formulas $q^A$ and $q^B$, where $q^A$ depends only on $\theta_A$, $P^A$, and the marginal sample configuration of n at locus A, while $q^B$ depends only on $\theta_B$, $P^B$, and the marginal sample configuration of n at locus B.

We now state our main results. All proofs are deferred to the appendix.

Proposition 1. In the asymptotic expansion (6) of the two-locus sampling formula for either an infinite-alleles or an arbitrary finite-alleles model, the zeroth-order term is given by $q_0(\mathbf{a}, \mathbf{b}, \mathbf{c}) = q^A(\mathbf{a} + \mathbf{c}_A)\, q^B(\mathbf{b} + \mathbf{c}_B)$ (7).

Proposition 1 is intuitive, since the first term in the asymptotic expansion should give the exact solution at ρ = ∞. When the two loci are unlinked, the complete sampling probability is simply the product of the two marginal one-locus sampling probabilities, each applied to the marginal sample configuration at its locus. The subsequent terms in the asymptotic expansion are described below.
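Combining Proposition 1 with the one-locus PIM formula (4) gives an immediately computable leading term; a self-contained Python sketch:

```python
def q0_pim(a, b, c, thetaA, PA, thetaB, PB):
    """Zeroth-order term of the expansion for the PIM model:
    q0(a, b, c) = q^A(a + c_A) * q^B(b + c_B), each factor given by
    Wright's one-locus formula (4)."""
    def asc(x, m):  # ascending factorial (x)_m
        r = 1.0
        for k in range(m):
            r *= x + k
        return r
    def pim(theta, P, counts):
        num = 1.0
        for Pi, ni in zip(P, counts):
            num *= asc(theta * Pi, ni)
        return num / asc(theta, sum(counts))
    cA = [sum(row) for row in c]          # locus-A marginal of c
    cB = [sum(col) for col in zip(*c)]    # locus-B marginal of c
    return (pim(thetaA, PA, [x + y for x, y in zip(a, cA)]) *
            pim(thetaB, PB, [x + y for x, y in zip(b, cB)]))
```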

Theorem 1. In the asymptotic expansion (6) of the two-locus sampling formula for either an infinite-alleles or an arbitrary finite-alleles model, the first-order term $Math$ is given by $Math$ (8), for arbitrary configurations $(\mathbf{a}, \mathbf{b}, \mathbf{c})$ of nonnegative integers.

This theorem generalizes our previous result (Jenkins and Song 2009) to an arbitrary finite-alleles model of mutation. This result is universal in the following sense: The functional form of q1 in terms of qA and qB is given by Equation 8, regardless of whether we are considering an infinite or a finite set of possible alleles at each locus. If the one-locus sampling probabilities are known in closed form, then (8) is instantaneous to evaluate. One simply applies the applicable one-locus solutions for qA and qB:

1. For an infinite-alleles model, qA and qB are given by the Ewens sampling formula (2), with θ replaced by θA and θB, respectively.

2. For a finite-alleles model, qA and qB are the solutions to (3) with the appropriate θ and Pij for each locus. Closed-form solutions to these recursions are not known in general, unless mutation is parent independent, in which case qA and qB are given by (4) with the appropriate θ and Pij.

Note that $Math$. However, it turns out that $Math$ does not vanish in general, and we do not have an analytic expression for it. The following theorem shows that $Math$ can be decomposed into two parts, one for which we have a closed-form expression and the other $Math$:

Theorem 2. In the asymptotic expansion (6) of the two-locus sampling formula for an arbitrary finite-alleles model, the second-order term $Math$ is of the form $Math$ (9), where $Math$ is given by the analytic formula shown in the appendix, and $Math$ satisfies the recursion relation $Math$ (10), with boundary conditions $Math$ for i = 1, …, K and $Math$ for j = 1, …, L.

If one-locus sampling distributions are known in closed form, evaluating the analytic part $Math$ in $Math$ is instantaneous for all sample configurations. Unfortunately, we do not have a closed-form expression for $Math$, so we cannot conclude whether the above-mentioned universality result extends to the second-order term. However, for the PIM model, one can sum over the index t in (10), so $Math$ can be evaluated efficiently using dynamic programming. In practice, this procedure is orders of magnitude faster than solving (5) directly. Furthermore, as discussed in the next section, the relative contribution of the recursive term $Math$ to the asymptotic sampling distribution is generally very small, so we may safely discard that term without considerable loss of accuracy.

ACCURACY OF THE ASYMPTOTIC SAMPLING FORMULA

Our asymptotic sampling formula is applicable to a range of models of genetic data. Equipped with such a formula, it is natural to consider when it can be used to approximate the probability of a given sample configuration. As discussed before, evaluating the likelihood associated with a sample configuration is at the heart of many problems in population genetics. While obtaining such likelihoods is computationally expensive for most samples of interest, the ASF can be evaluated almost instantaneously.

Test setup:

To test the accuracy of the ASF, we focus on the diallelic recurrent mutation model, which has been used to model single-nucleotide polymorphism (SNP) data. In what follows, each locus represents a single site and we make the following distinction: We use the term diallelic to refer to a model in which there are two possible alleles at a locus. This is in contrast to the term dimorphic, which refers to a sample in which precisely two alleles are observed. When we refer to a sample from a two-locus model as being dimorphic, we mean that both loci are dimorphic.

We consider the following simple model of sequence evolution. Assume a diallelic, symmetric mutation model for each locus, with the same mutation rate θDiallelic and transition matrix $\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. As discussed earlier, this can be cast as a PIM model with mutation rate θ = 2θDiallelic and transition matrix $\begin{pmatrix} 1/2 & 1/2 \\ 1/2 & 1/2 \end{pmatrix}$. We use this PIM model at each locus in our calculation of the ASF, so θA = θB = θ. In what follows we condition on segregation at both sites by considering only dimorphic samples and normalize where appropriate.

Realistic choices for θ depend on the biological sample of interest. Here, we consider two values of θ. First, for many neutral regions of the human genome, typical mutation rates per base are in the range 0.001 ≤ θ ≤ 0.01 (Nachman and Crowell 2000; McVean et al. 2002). Below, we present results for θ = 0.01; results for θ = 0.001 were almost identical (not shown). When mutation rates are low, the infinite-alleles assumption is also reasonable, and we obtained very similar results using the ASF under an infinite-alleles model with θ = 0.01. For brevity these results are also omitted. Second, we consider θ = 1.0. This model might be assumed for the higher levels of polymorphism seen, for example, at synonymous sites in human immunodeficiency virus (McVean et al. 2002).

Throughout this section, we ignore missing data by considering only sample configurations of the form $(\mathbf{0}, \mathbf{0}, \mathbf{c})$.

A first look into accuracy:

We want to answer the following broad question: for a given sample size c and recombination rate ρ, how well does the ASF approximate the true likelihood? In our study, we truncate the asymptotic expansion (6) at second order and define $q_{\mathrm{ASF}}$ as $q_{\mathrm{ASF}}(\mathbf{a}, \mathbf{b}, \mathbf{c}) = q_0(\mathbf{a}, \mathbf{b}, \mathbf{c}) + \frac{q_1(\mathbf{a}, \mathbf{b}, \mathbf{c})}{\rho} + \frac{q_2(\mathbf{a}, \mathbf{b}, \mathbf{c})}{\rho^2}$. By comparing the performance of $Math$ against that of $Math$ and $Math$, we can gauge how much improvement each additional term in $Math$ provides. As motivation, consider the likelihood curves shown in Figure 1, a and b, which compare different levels of approximation with the true likelihood for the configurations $Math$ with $Math$$Math$ and $Math$$Math$, respectively, for θ = 0.01. The former illustrates a configuration for which the MLE of ρ is infinite; the latter is an example for which the MLE is finite. These configurations have sample sizes small enough that the true likelihoods can be computed directly from Equation 5. We can make a number of preliminary observations from Figure 1. As more terms are added to the ASF, its accuracy improves, as we would hope. In particular, $Math$ seems to approximate the true likelihood very accurately for ρ ≥ 20. As illustrated in Figure 1, a similar level of accuracy is achieved by $Math$, an approximation to $Math$ defined by $Math$. Recall that $Math$ comprises two terms (cf. Theorem 2): $Math$, which must be solved by dynamic programming, and $Math$, for which we derived an analytic expression. Calculating $Math$ imposes a small computational burden compared to solving (5) directly, but the associated running time grows with sample size. The approximation $Math$ simply ignores this potentially burdensome term. As we discuss later, this generally leads to a negligible sacrifice in accuracy: empirically, for most configurations we examined, $Math$ is at least an order of magnitude smaller than the analytic term $Math$. We return to this below and carry out a more thorough study of the relative contribution of $Math$ to $Math$.
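Given precomputed expansion terms, the truncation used throughout this section is a one-line evaluation; a minimal sketch:

```python
def q_asf(rho, q0, q1, q2):
    """Second-order truncation of the asymptotic expansion (6):
    q0 + q1 / rho + q2 / rho^2."""
    return q0 + q1 / rho + q2 / rho ** 2
```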

Figure 1.—

Likelihood curves for ρ, comparing different levels of approximation. A symmetric, finite-alleles model with θ = 0.01 is assumed, as described in the text. The true likelihood is shown as a thick solid line. Formulas compared are $Formula$ (dotted line), $Formula$ (dashed line), $Formula$ (thin solid line), and $Formula$ (dotted-dashed line, largely concealed by the thin solid line). (a) $Formula$$Formula$. (b) $Formula$$Formula$.

Distribution of unsigned relative errors:

To ensure that the above results were not unique to the particular data sets illustrated, we performed a more systematic study as follows. For a given c, we measure the accuracy of $q_{\mathrm{ASF}}$ by the unsigned relative error $\frac{|q_{\mathrm{ASF}}(\mathbf{a}, \mathbf{b}, \mathbf{c}) - q(\mathbf{a}, \mathbf{b}, \mathbf{c})|}{q(\mathbf{a}, \mathbf{b}, \mathbf{c})}$, with corresponding definitions for the unsigned relative errors of $Math$, $Math$, and $Math$. By drawing $Math$ according to the true sampling probability, we may speak of the distribution of the unsigned relative error. This distribution forms the basis of our measure of accuracy, and when calculating it we must be sure to weight each sample correctly. The correct weight for $Math$ is $Math$ times a combinatorial factor capturing the number of distinct orderings of the sample; this product is denoted by $Math$. To evaluate $Math$, we considered sample sizes small enough that the true likelihood could be calculated directly by solving Equation 5, and we implemented a computer program to solve the recursion numerically.
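The error measure and its weighted distribution can be sketched as follows; the (weight, q_true, q_approx) triples here are hypothetical inputs, with weights summing to 1:

```python
def relative_error(q_approx, q_true):
    """Unsigned relative error |q_approx - q_true| / q_true."""
    return abs(q_approx - q_true) / q_true

def error_mass_within(samples, threshold):
    """Total probability mass of samples whose approximation lies within
    `threshold` of the truth; `samples` is a list of triples
    (weight, q_true, q_approx)."""
    return sum(w for w, q_true, q_approx in samples
               if relative_error(q_approx, q_true) <= threshold)
```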

We can make one further efficiency saving. Computing the true probability $Math$ for every configuration of size c ≥ 30 is computationally demanding. However, when mutation rates are the same for the two loci, observe that, by symmetry, distinct configurations exist with the same sampling probability. For example, the samples $Math$$Math$ and $Math$$Math$ are distinct observations when x, y are distinct nonnegative integers, but their sampling probabilities are the same. We can therefore collapse our space of distinct sample configurations into a smaller one of equivalence classes, where two distinct samples are considered equivalent if they have the same probability. For c = 30, this meant that for 5456 distinct configurations only 752 probabilities required evaluation. Similar arguments can be made for the infinite-alleles model.
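The collapsing step can be implemented by mapping each 2 × 2 configuration to a canonical representative of its equivalence class; a Python sketch, assuming equivalence under relabeling the alleles at either locus (row and column swaps of c) and, since both loci carry the same mutation model, swapping the two loci (transpose):

```python
from itertools import product

def count_classes(n):
    """Enumerate all 2x2 nonnegative integer configurations c summing to n
    and count equivalence classes under row swaps, column swaps, and
    transposition (8 symmetries in total)."""
    def canonical(m):
        (a, b), (c, d) = m
        variants = set()
        for (w, x), (y, z) in (((a, b), (c, d)), ((a, c), (b, d))):      # identity, transpose
            for (p, q), (r, s) in (((w, x), (y, z)), ((y, z), (w, x))):  # row swap
                variants.add(((p, q), (r, s)))
                variants.add(((q, p), (s, r)))                           # column swap
        return min(variants)

    configs = [((a, b), (c, d))
               for a, b, c, d in product(range(n + 1), repeat=4)
               if a + b + c + d == n]
    return len(configs), len({canonical(m) for m in configs})
```

For n = 30 this recovers the counts quoted above: 5456 distinct configurations falling into 752 equivalence classes.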

For the finite-alleles PIM models described above, the cumulative distribution of the unsigned relative error across all dimorphic samples of size c = 20 is shown in Figure 2. As is clear from the plot, the ASF offers substantial accuracy even for ρ as “low” as 50. For example, for a dimorphic sample $Math$ of size 20 drawn at random from $Math$ with θ = 0.01, the probability that its ASF $Math$ is within 1% of the truth is 0.61, while the probability that $Math$ is within 5% of the truth is 0.87. The corresponding probabilities when using $Math$ alone are 0.36 and 0.43, respectively. Note the spike of probability mass around zero even for the curve for $Math$. This suggests that assuming ρ = ∞ will still be accurate in about half of cases, but with substantial errors in the rest. This is not the case when employing $Math$ instead.

Figure 2.—

Cumulative distribution of unsigned relative error of $Formula$ (dotted line), $Formula$ (dashed line), $Formula$ (dotted-dashed line, almost concealed by the solid line), and $Formula$ (solid line). We considered dimorphic sample configurations $Formula$ drawn from $Formula$ in the finite-alleles model with c = 20 and ρ = 50. (a) θ = 0.01. (b) θ = 1.

The distributions for the two mutation parameters are rather similar, suggesting a certain degree of robustness in the ASF to θ. The case with θ = 1 seems to have a lower probability of achieving a high accuracy (say within 1%) but a higher probability of achieving a moderate accuracy (within 10%).

Effect of sample size and recombination rate:

To assess the sensitivity of $Math$ to the sample size c and the recombination rate ρ, we repeated the above plots for varying values of these parameters; results are shown in Figures 3 and 4. As we might expect, accuracy improves with increasing ρ. For example, for θ = 0.01 and ρ = 200, the probability of drawing a sample $Math$ with $Math$ within 1% of the truth reaches 1.00, compared with a probability of 0.54 when using $Math$ alone. On the other hand, for a fixed ρ, the accuracy of $Math$ generally begins to diminish with increasing c. One possible explanation for this trend is that larger samples may have more subtle shapes to their likelihood curves, such as multiple turning points. The ASF effectively has to fit a curve with at most one turning point to approximate $Math$ with high accuracy for large ρ. This might be at the expense of accuracy for smaller ρ (see also Figure 9 later).

Figure 3.—

The effect of varying c (fixing ρ = 50). Plotted is the cumulative distribution of the unsigned relative error of $Formula$ across all dimorphic sample configurations of size c = 10 (solid line), c = 20 (dashed line), and c = 30 (dotted line). (a) θ = 0.01. (b) θ = 1.

Figure 4.—

The effect of varying ρ (fixing c = 20). Plotted is the cumulative distribution of the unsigned relative error of $Formula$ across all dimorphic sample configurations of size c = 20 for ρ = 200 (solid line), ρ = 100 (dotted-dashed line), ρ = 50 (dashed line), and ρ = 25 (dotted line). (a) θ = 0.01. (b) θ = 1.

Signed relative error:

Using the unsigned relative error as a measure of accuracy loses information on whether the ASF systematically under- or overestimates the true sample probability. To investigate this, we calculated the expected signed relative error $Math$ across all configurations for each combination of parameters considered above; the signed relative errors of the other approximations are defined similarly. The results are given in Tables 1 and 2. Although most of these expectations are positive, they are generally very small, and all fall within ±1.25%. This suggests that, insofar as deviations from the true likelihood err in a particular direction, the bias is negligible.

TABLE 1

Expected signed relative error (percentage) for different approximations across all dimorphic samples with c = 20

TABLE 2

Expected signed relative error (percentage) of qASF(0, 0, c) across all dimorphic samples

Monomorphic samples:

One might argue that excluding nonsegregating sites from the analyses above is unwarranted, since the observation that a site is nonsegregating also contains information about the parameters of the model. We therefore reanalyzed the distribution of the unsigned relative error across all samples of size c = 20 (with ρ = 50), rather than just dimorphic samples. Results are shown in Figure 5.

Figure 5.—

A repeat of Figure 2 but across all samples rather than all dimorphic samples (c = 20, ρ = 50). Note the different scales on the x-axes. The curves show the distribution of the unsigned relative error of $Formula$ (dotted line), $Formula$ (dashed line), $Formula$ (dotted-dashed line, concealed by the solid line), and $Formula$ (solid line). (a) θ = 0.01. (b) θ = 1.

Compare Figure 5 with Figure 2 (note the different scales on the x-axes). Clearly, across all samples there is now a much greater level of accuracy, particularly when the mutation rate is small. This may be explained by the following two observations. First, the probability that one or both loci are monomorphic is relatively large: for θ = 0.01 it is >0.99 and for θ = 1.0 it is 0.44. Hence, these configurations will have a substantial effect on the error distribution, particularly for smaller mutation rates. Second, the ASF seems to be much more accurate precisely for those configurations with one or both loci monomorphic. This raises the question: Can we identify any further patterns for which the ASF is more accurate? When the true likelihood is unavailable, this might provide a guide as to how accurate the ASF is likely to be.

For which samples is the ASF most accurate?

The figures described above suggest that for most configurations the relative error of the ASF will be very small, with a few having larger errors. It would be of interest to determine whether there is any feature of a given sample configuration that indicates how accurate the ASF will be. Figure 6 plots on a log-scale the unsigned relative error of the ASF of a sample against its sampling probability $Math$, across all configurations of sample size c = 20 (with θ = 0.01 and θ = 1, and ρ = 50). Encouragingly, there is a clear negative correlation between unsigned relative error and sampling probability (Pearson's correlation coefficient = −0.76 for Figure 6a and −0.61 for Figure 6b). That is, those configurations for which the ASF is most accurate are also observed most often. However, the point of this comparison is to determine which configurations have a more accurate ASF without having to calculate $Math$. We therefore identified the following categories for the samples that appear to best explain most of the variation in accuracy (the category of each sample is annotated in Figure 6):

1. ▿, the sample is monomorphic at both loci.

2. ▵, the sample is monomorphic at one locus.

3. ○, for at least one allele at a locus, the sample has multiplicity one; i.e., a row or column of c sums to one.

4. +, the sample has very low linkage disequilibrium (LD) as measured by r² (Hudson 2001b), satisfying r² ≤ 0.02. (This LD measure is also often denoted by Δ².)

5. ◊, the sample is in perfect LD; i.e., it is of the form $Math$$Math$ or $Math$, where x + y = c, for x ≥ 0 and y ≥ 0.

6. □, the sample is in perfect LD except for one haplotype; i.e., it is of the form $Math$$Math$, $Math$, $Math$, or $Math$, where x + y = c − 1, for x ≥ 0 and y ≥ 0.

7. •, the sample does not fall into any of the above categories.

Some configurations fell into more than one category, in which case the higher category in this list was given priority.
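For the diallelic case, the categories above can be sketched as a classifier on a 2×2 table of haplotype counts, checked in priority order. The function names and the diagonal test are our reading of the definitions in the text; the r² computation is the standard sample formula.

```python
# Sketch: classifying a diallelic two-locus sample, given as a 2x2 table of
# haplotype counts c[i][j], into categories 1-7 above (highest-priority
# category wins). The tests are our reading of the definitions in the text.

def r_squared(c):
    """Sample r^2 for a 2x2 haplotype count table."""
    n = sum(sum(row) for row in c)
    pA = (c[0][0] + c[0][1]) / n           # frequency of allele 0 at locus A
    pB = (c[0][0] + c[1][0]) / n           # frequency of allele 0 at locus B
    D = c[0][0] / n - pA * pB              # LD coefficient
    denom = pA * (1 - pA) * pB * (1 - pB)
    return 0.0 if denom == 0 else D * D / denom

def on_one_diagonal(c):
    """True if all gametes lie on a single diagonal (perfect LD)."""
    return (c[0][1] == 0 and c[1][0] == 0) or (c[0][0] == 0 and c[1][1] == 0)

def categorize(c):
    rows = [c[0][0] + c[0][1], c[1][0] + c[1][1]]   # allele counts, locus A
    cols = [c[0][0] + c[1][0], c[0][1] + c[1][1]]   # allele counts, locus B
    mono_A, mono_B = 0 in rows, 0 in cols
    if mono_A and mono_B:
        return 1                                    # monomorphic at both loci
    if mono_A or mono_B:
        return 2                                    # monomorphic at one locus
    if 1 in rows or 1 in cols:
        return 3                                    # an allele of multiplicity one
    if r_squared(c) <= 0.02:
        return 4                                    # very low LD
    if on_one_diagonal(c):
        return 5                                    # perfect LD
    for i in range(2):                              # perfect LD bar one haplotype
        for j in range(2):
            if c[i][j] > 0:
                d = [row[:] for row in c]
                d[i][j] -= 1
                if on_one_diagonal(d):
                    return 6
    return 7
```

For example, the table [[12, 1], [0, 7]] is one gamete away from perfect LD and falls into category 6, while the balanced table [[5, 5], [5, 5]] has r² = 0 and falls into category 4.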

Figure 6.—

The relationship between unsigned relative error of the ASF and a configuration's sample probability $Formula$, across all configurations of size c = 20 with ρ = 50. Configurations are categorized as monomorphic at both loci (∇), monomorphic at one locus (▵), one or both loci have a singleton allele (○), very low LD (+), in perfect LD (◊), in perfect LD except for one haplotype (□), and the rest (•). See text for details of these definitions. (a) θ = 0.01. (b) θ = 1.0.

Figure 6a confirms that samples monomorphic at one or both loci have among the highest sampling probabilities and the highest accuracies. This is followed by another clearly identifiable cluster of samples of high accuracy, which have in common that one allele has multiplicity one. For the remaining samples, the amount of LD seems to be highly informative of the accuracy of the ASF; for example, at the upper end of the error distribution are those in perfect or almost perfect LD. We do not have an intuitive explanation for this observation. Most of the remaining samples fall around a cluster of intermediate accuracy (∼1%), with the samples with lowest LD at the center of this cluster. For higher mutation rates, illustrated in Figure 6b, we observed a similar, albeit less distinct, pattern. When mutation rates are higher, the difference in probability between any two sample configurations tends to be attenuated. We also observed qualitatively similar patterns for different choices of sample size and recombination rate (not shown).

A quick approximation to the ASF:

Recall that $Math$ contains a term—namely, $Math$—that needs to be evaluated recursively using dynamic programming. (See Theorem 2.) This imposes a small but growing (with sample size) computational burden on the calculation of $Math$. Empirical tests (for example, Figure 2) suggested that simply ignoring $Math$ leads to almost no loss in accuracy. It is worth addressing the contribution of this term in more detail. We measure its contribution by the following quantity:

$Math$ (11)

As before, we weight this contribution by the sampling probability $Math$ of each configuration. Its distribution is shown in Figure 7, for dimorphic samples of size c = 20 and ρ = 50.

Figure 7.—

Distribution of the contribution (11) of $Formula$ to $Formula$, with respect to dimorphic samples of size c = 20 drawn at random from $Formula$ for ρ = 50. (a) θ = 0.01. (b) θ = 1.

It is clear from Figure 7 that for these parameter values, using $Math$ in place of $Math$ affects the result by no more than 1%, and with high probability the effect is <0.2%. Indeed, when θ = 1, the contribution is <0.2% with an estimated probability of 1. These results are encouraging. They suggest that we may safely discard the $Math$ term without a considerable loss of accuracy. This also seems to hold for larger sample sizes. For samples of size c = 100, one can still calculate the contribution (11), although it is now computationally impractical to weight each sample by its true probability $Math$. Instead we looked at the number of sample configurations within a given range of contributions. For example, when ρ = 50 and θ = 0.01, the contribution (11) was within the range ±0.5% for 93.2% of dimorphic samples of size c = 100. When θ = 1, 95.7% of dimorphic sample configurations were within this range. Thus, even for larger sample sizes, for which calculating the term $Math$ becomes impractical, using the approximation $Math$ is an attractive option. For the infinite-alleles model, we (Jenkins and Song 2009) obtained a tight upper bound on the contribution of this term, for use whenever calculating it directly is computationally intractable.
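The per-configuration computation behind these percentages can be sketched generically. The exact definition of quantity (11) is not reproduced here; this is only the generic relative-change calculation, with hypothetical inputs.

```python
# Sketch: the relative contribution (in percent) of a single term to an
# approximation, computed as |q_full - q_dropped| / q_full. This is a
# generic reading of quantity (11), not its exact definition; the inputs
# below are hypothetical.

def contribution_pct(q_full, q_dropped):
    """Percentage change in the approximation when the term is dropped."""
    return 100.0 * abs(q_full - q_dropped) / q_full

# Dropping the recursive term changes this hypothetical probability by 0.15%:
print(contribution_pct(0.0200, 0.019970))
```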

DISCUSSION

Computing likelihoods under the coalescent with recombination is a challenging problem that has received much attention in the past. Here, building on our recent work (Jenkins and Song 2009), we provided an alternative perspective on the likelihood computation and obtained analytic results for the two-locus model with an arbitrary finite-alleles model of mutation. As mentioned in the Introduction, there may exist a stochastic process, simpler than the well-established coalescent with recombination, that captures the important dynamics of the relevant degrees of freedom as the rate of recombination gets large. That the first-order term $Math$ [see (8)] is so simple lends support to this speculation. For θ = 1, the Ewens sampling formula for the one-locus infinite-alleles model reduces to the classic formula for the joint distribution of cycle counts in a random permutation (see Arratia et al. 2003 for a discussion of this and other combinatorial connections). It would be interesting to provide such a combinatorial interpretation of $Math$. Further, we believe it may be possible to find a closed-form expression for the first-order term for more than two loci.

As a concrete application of our asymptotic sampling formula, one can analytically compute, for large values of ρ, the expectation of linkage disequilibrium measures such as r². Song and Song (2007) showed analytically that the populationwide expectation of r² approaches 1/ρ as ρ → ∞. For a finite sample size n, Hill and Weir (1994) derived $Math$ for the samplewise expectation 𝔼[r²] of r². This result was obtained under the restrictive assumptions that the populationwide marginal allele frequencies are all $Math$ and that heterozygote superiority holds at each locus. For a given sample size n, it would be interesting to use our asymptotic sampling formula to derive a general asymptotic expression for the samplewise expectation 𝔼[r²].

Below we discuss in detail two other concrete applications of our sampling formula. First, we discuss how we can improve the composite-likelihood method (Hudson 2001a). Then, we provide the first analytic result on the classification of the MLE of ρ.

Composite-likelihood method:

LDhat (McVean et al. 2002, 2004) is a software package aimed at using patterns of variation in population genetic data to estimate fine-scale recombination rates. The procedure is efficient and has been applied to genomewide estimates of the human genetic map (Myers et al. 2005; International Hapmap Consortium 2007) (henceforth, we assume that recombination events are always resolved as crossovers and ignore the effects of homologous gene conversion). Part of its efficiency stems from the use of the pairwise composite likelihood first proposed by Hudson (2001a). For a given genetic map, LDhat employs a Bayesian reversible-jump Markov chain Monte Carlo (rjMCMC) procedure to propose updates to this map and to sample from an approximate posterior distribution (the approximation resulting from the use of a pseudolikelihood defined below). To calculate the acceptance probabilities of proposed moves, one must calculate the likelihood of the data. To do so exactly (e.g., by solving a multilocus version of Equation 5) or even by a computationally intensive full-likelihood procedure such as importance sampling (Fearnhead and Donnelly 2001) is highly impractical for the sizes of data sets of interest. To circumvent this problem, Hudson (2001a) proposed the composite-likelihood method and showed that it works very well as an estimator for ρ. This method uses two-locus likelihoods as building blocks, and LDhat uses the following variation on the theme: Suppose that the input data set D contains m polymorphic sites, and let Dij denote the data set restricted to the two sites i and j. Then, the composite likelihood is defined as

$Math$ (12)

In this formula, ρ is assumed to be constant across the region, and the probability associated with each pair uses a rescaled value for ρ to account for the physical distance between the pair. It is straightforward to modify this definition to allow for variable recombination rates, as implemented in LDhat.
The composite-likelihood approximation treats the collection {Dij : i < j and i, j = 1, …, m} as $Math$ independent observations. Of course, in reality these observations are highly dependent through their shared ancestry, but often in practice little information about the MLE of ρ is lost. The key to this approach is that it is much more straightforward to calculate each term in the product (12) than it is to calculate the full likelihood, since now one has to deal only with two sites at any one time. Hudson (2001a) considered an infinite-sites model in the limit as θ → 0, conditional on the observed variation. McVean et al. (2002) extended this approach to finite-alleles models and finite θ and used the importance sampling scheme of Fearnhead and Donnelly (2001) to estimate each of these two-site likelihoods. This importance sampling stage is the most computationally intensive part of the LDhat estimation procedure. By considering all possible pairwise sample configurations of a given size, it is possible to precompute these likelihoods, and indeed the LDhat software package is accompanied by precomputed lookup tables for various sample sizes and mutation rates. For each sample configuration and mutation rate, the likelihood is calculated over a grid of ρ-values known as the driving values; the default choice is over the integers ρ = 0, 1,…, 100. Genetic distances are then rounded to the nearest integer. For pairs of sites whose genetic distance is >100, the two-site likelihood simply truncates ρ to 100 and uses this precomputed value for the likelihood instead.
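The pairwise scheme in (12) can be sketched as follows. The `pair_loglik` callback stands in for a precomputed two-site log-likelihood lookup (as in LDhat); its interface and the data layout here are hypothetical.

```python
# Sketch of the pairwise composite likelihood (12): a sum of two-site
# log-likelihoods over all pairs of polymorphic sites, each evaluated at a
# rho rescaled by the pair's physical separation. `pair_loglik` is a
# hypothetical stand-in for a precomputed lookup table.

def composite_loglik(rho_per_unit, positions, pair_loglik):
    """Composite log-likelihood over all pairs of the m sites.

    rho_per_unit -- population-scaled recombination rate per unit of
                    physical distance (assumed constant across the region)
    positions    -- sorted physical positions of the m polymorphic sites
    pair_loglik  -- pair_loglik(i, j, rho_ij): log q(D_ij) at the rescaled
                    recombination rate rho_ij
    """
    m = len(positions)
    total = 0.0
    for i in range(m):
        for j in range(i + 1, m):
            # rescale rho by the physical distance separating sites i and j
            rho_ij = rho_per_unit * (positions[j] - positions[i])
            total += pair_loglik(i, j, rho_ij)
    return total

# Toy usage: three sites with a dummy pair log-likelihood.
print(composite_loglik(2.0, [0, 1, 3], lambda i, j, r: -r))  # -12.0
```

In LDhat, each rho_ij would additionally be rounded to the nearest driving value and truncated at ρ = 100 before the table lookup; the ASF would instead be evaluated directly at rho_ij.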

Here, we investigate the effect that this truncation has on the two-site likelihood. Since the ASF applies to the diallelic, symmetric mutation model employed by LDhat, an obvious alternative to truncation is to use the ASF. As ρ increases, we expect the truncation procedure to introduce more error, while the ASF becomes more accurate. Furthermore, the ASF is a continuous function of ρ, so no further error is introduced by rounding to the nearest driving value of ρ. Assuming a recombination rate of 1 cM/Mb (Kong et al. 2002) and a human diploid effective population size of 10,000 (Myers et al. 2005), a cutoff of ρ = 100 corresponds to a marker separation of ∼250 kb; in the presence of recombination hotspots this distance could be much shorter. The study of Myers et al. (2005) breaks up genomewide data into windows of 2000 SNPs, using a data set with an average marker spacing of 1.9 kb (Hinds et al. 2005). A rough approximation to the mean window size is thus ∼3.8 Mb, so pairs of sites separated by a genetic distance of ρ > 100 are encountered frequently by the rjMCMC procedure.
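The ∼250-kb figure follows from the standard conversion ρ = 4·N_e·r·d, with r the per-bp recombination probability; a quick check under the stated assumptions:

```python
# Back-of-envelope conversion used above: rho = 4 * Ne * r * d, where r is
# the per-bp recombination probability. At 1 cM/Mb, r = 1e-8 per bp.

def separation_for_rho(rho, Ne=10_000, cM_per_Mb=1.0):
    """Marker separation in bp corresponding to a given rho."""
    r_per_bp = cM_per_Mb / 100.0 / 1e6   # cM/Mb -> recombination prob. per bp
    return rho / (4 * Ne * r_per_bp)

print(separation_for_rho(100))  # ~250,000 bp, i.e. the ~250-kb cutoff
```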

Using its estimate of the likelihood at ρ = 100, LDhat encounters two sources of error: one from the truncation of the range of ρ and one from the stochastic error of the importance sampling procedure. To restrict our attention to the first of these sources, we find a lower bound on the error of LDhat by looking at the quantity

$Math$

where L(ρ) is the true likelihood obtained previously by solving Equation 5. The true likelihood at ρ = 100 will of course be at least as accurate an approximation as that reported by LDhat. Table 3 reports the mean unsigned relative error one would obtain using L(100) in place of L(ρ), for various choices of ρ and various choices of mutation rates used by the lookup tables in LDhat. For comparison we also show the unsigned relative error of $Math$. In Figure 8, we also show the complete distribution of these errors for one set of parameters in the table, namely, c = 30 and θ = 0.002. (Note that this choice of mutation rate would be reported as θ = 0.001 in LDhat, since it uses the mutation transition matrix $Math$$Math$ at each locus instead of $Math$$Math$ used here.) Since ρ is not fixed throughout the rjMCMC procedure, it is not clear in this setting how to interpret the idea of drawing a sample configuration at random. Here, we therefore take a simple arithmetic mean unsigned relative error across all dimorphic samples, rather than weighting each configuration by its sampling probability $Math$ as before.

TABLE 3

The mean unsigned relative error (percentage) of L(100) and of LASF(ρ) compared to the true likelihood L(ρ), across all dimorphic samples of size c

Figure 8.—

Distribution of signed relative errors (%) of (a) L(100) and (b) LASF(200) when estimating L(200), across all dimorphic configurations of size c = 30. The mutation rate is θ = 0.002.

As is clear from Table 3, using L(100) as a proxy for L(ρ) when ρ > 100 rapidly becomes very inaccurate with increasing ρ. Even for these moderate sample sizes, the sample may carry enough information that the likelihoods at, say, ρ = 100 and ρ = 200 differ substantially. For the sample sizes considered here, the ASF becomes preferable to LDhat very soon above ρ = 100 and certainly before ρ = 150. For larger samples the critical point at which the ASF becomes preferable may be somewhat higher. Note also that the error of both methods appears highly robust to the choice of mutation rate, at least for the set of realistic choices shown; for both methods, there is a slight decrease in error with increasing mutation rate.

Figure 8 confirms the wider distribution of errors when using L(100) to estimate L(200) rather than using LASF(200). Since we plot here the signed relative error rather than the unsigned relative error, we are also able to observe that the distribution of errors is not symmetric about zero; there is much more probability mass on nonnegligible positive relative errors than on negative relative errors. This suggests that LDhat could systematically overestimate the likelihood of pairs of markers for which a large recombination rate is proposed, biasing recombination rate estimates upward; proposed alterations to the genetic map that increase the recombination rate in a region will have erroneously inflated acceptance rates. When using LASF(ρ), the asymmetry in errors is much smaller.

Classification of the MLE of ρ:

The ASF can be used to make inferences about whether or not the MLE of ρ is finite. Intuitively, when $Math$, the asymptotic approximation $Math$ approaches $Math$ from above as ρ → ∞. Moreover, by virtue of it being an asymptotic approximation, for large enough ρ it must be sufficiently accurate to infer that $Math$ must also approach $Math$ from above. We therefore have the following theorem, for which a formal proof may be given via a short analysis argument (in the appendix):

Theorem 3. For a given sample configuration $Math$, if the first-order term $Math$ [see (8)] in the asymptotic expansion satisfies

$Math$

then the MLE of ρ is finite.

If one-locus marginal sampling probabilities are known, as in the infinite-alleles case or the PIM case, then $Math$ is instantaneous to compute. To our knowledge, this is the first analytic result on the classification of the MLE of ρ. Note, however, that this is a sufficient condition, but not a necessary condition. That is, this condition does not identify all samples with a finite MLE. Indeed, it is tempting to argue for the converse to the proposition above—namely, that $Math$ implies an infinite MLE—but this is false, as the following simple counterexample demonstrates.

Consider the sample configuration $Math$ determined by $Math$$Math$, and suppose θ = 0.01. It is straightforward to verify that $Math$ but that a ρ0 exists with L(ρ0) > L(∞). The likelihood curve for this configuration can be found by solving Equation 5 directly and is illustrated in Figure 9. Although the curve approaches the asymptote L(∞) $Math$ from below as ρ → ∞ [as one expects when $Math$], it actually exhibits a local minimum at ρ = 1608, as well as a local maximum at ρ = 74. In this example the likelihood curve is extremely flat, but minima with higher curvature may be found for larger sample sizes. To our knowledge, this is the first confirmation of a local minimum in a likelihood curve for ρ.
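Extrema like those reported above can be located numerically once the likelihood has been evaluated on a grid of ρ-values. A minimal sketch, with a purely illustrative curve:

```python
# Sketch: finding strict interior local maxima/minima of a likelihood curve
# evaluated on a grid of rho values. The toy curve below is illustrative
# only; it mimics a local maximum followed by a local minimum before the
# curve climbs toward its asymptote.

def local_extrema(L):
    """Indices of strict interior local maxima and minima of the sequence L."""
    maxima, minima = [], []
    for k in range(1, len(L) - 1):
        if L[k] > L[k - 1] and L[k] > L[k + 1]:
            maxima.append(k)
        elif L[k] < L[k - 1] and L[k] < L[k + 1]:
            minima.append(k)
    return maxima, minima

L = [0.10, 0.12, 0.11, 0.09, 0.095, 0.10]
print(local_extrema(L))  # ([1], [3])
```

For a curve as flat as the one in Figure 9, a fine grid (or higher floating-point care) would be needed for such a scan to resolve the extrema.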

Figure 9.—

(a) Example with a local minimum and a local maximum in the likelihood curve for ρ. This example is for $Formula$$Formula$ under a finite-alleles model with symmetric mutation rates and θ = 0.01. The curve has a local minimum at ρ = 1608 (shown in detail in b) and a local maximum at ρ = 74. The true likelihood is shown as a thick solid line. For comparison, also shown are $Formula$ (dotted line) and $Formula$ (thin solid line).

APPENDIX

The proofs given here are similar in structure to our previous proofs (Jenkins and Song 2009) for the infinite-alleles model of mutation. It turns out that much of the argument is independent of the particular model of mutation assumed. Therefore, we give only an outline here and refer the reader to that article for further details.

Proof of Proposition 1. The infinite-alleles case was proved in Jenkins and Song (2009), so we focus on the finite-alleles case here. First, assume c > 0. One can obtain a recursion satisfied by $Math$ by substituting our asymptotic expansion (6) into (5), dividing by ρc, and letting ρ → ∞. We obtain

$Math$ (A1)

By repeated application of (A1), this becomes

$Math$ (A2)

which of course also holds when c = 0. Now, by substituting (6) into (5) with $Math$ and then letting ρ → ∞, we obtain [after some manipulation, invoking (A2) to eliminate terms for which c > 0]

$Math$ (A3)

Boundary conditions are $Math$ for i = 1, …, K and $Math$ for j = 1, …, L. It is straightforward to verify that the solution to (A3) is

$Math$ (A4)

where $Math$ and $Math$ are marginal one-locus sampling probabilities for loci A and B, respectively. Together, (A2) and (A4) give the desired result.▪

Proof of Theorem 1. Again, the infinite-alleles case was proved in Jenkins and Song (2009), so we focus on the finite-alleles case here. Substitute the asymptotic expansion (6) into the recursion (5), eliminate terms of order ρ by applying (A1), and let ρ → ∞. After invoking (A3) and (7), with some rearrangement we obtain

$Math$ (A5)

where

$Math$ (A6)

Above we assumed c > 0; we define $Math$ if $Math$.

Although at first sight somewhat daunting, Equation A5 can be solved via a simple probabilistic interpretation of the coefficients of the second term on the right-hand side. In particular, this term can be interpreted as the expected value of $Math$ after removing a gamete from c uniformly at random and redistributing it to a and b. After this removal, the term is given recursively by discarding further gametes. More generally, we (Jenkins and Song 2009) showed that

$Math$ (A7)

where $Math$ is the random subsample obtained by selecting m gametes from the sample c uniformly without replacement; that is, it is distributed according to a multivariate hypergeometric distribution with parameters $Math$. Further, we define $Math$ and $Math$. The expectations in (A7) are easy to compute and to sum over m, resulting in (8).▪
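The subsampling device in (A7) can be illustrated directly: drawing m gametes uniformly without replacement from the sample yields a multivariate hypergeometric subsample, over which the expectation is taken. A Monte Carlo sketch follows; the article computes these expectations exactly, so this is for illustration only.

```python
import random

# Sketch of the subsampling in (A7): select m gametes uniformly without
# replacement from a sample with haplotype counts `counts`, giving a
# multivariate hypergeometric subsample; estimate E[f(subsample)] by
# Monte Carlo. For illustration only -- the article evaluates these
# expectations in closed form.

def expectation_over_subsamples(counts, m, f, reps=20000, seed=1):
    rng = random.Random(seed)
    # one pool entry per gamete, labeled by its haplotype index
    pool = [h for h, n in enumerate(counts) for _ in range(n)]
    total = 0.0
    for _ in range(reps):
        sub = [0] * len(counts)
        for h in rng.sample(pool, m):
            sub[h] += 1
        total += f(sub)
    return total / reps

# Sanity check against the exact hypergeometric mean: E[count of type 0
# in a subsample of size m] = m * counts[0] / sum(counts).
est = expectation_over_subsamples([3, 7], 5, lambda sub: sub[0])
print(est)  # close to 5 * 3 / 10 = 1.5
```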

Proof of Theorem 2. In a similar manner to the proof of Theorem 1, one can show that

$Math$ (A8)

where $Math$ has a known, but rather cumbersome, closed-form expression. Since these expectations can be evaluated, the sum $Math$ composes the closed-form part of $Math$; i.e.,

$Math$

Unlike the form of $Math$, that of $Math$ depends on the model of mutation. We repeated the steps in Jenkins and Song (2009) using Equation 5—rather than Golding's recursion (Golding 1984), which applies to the infinite-alleles model—to obtain the following expression for $Math$. We omit the simple but lengthy algebraic details.

Let Q^A denote $Math$, Q^A_i denote $Math$, Q^A_{ik} denote $Math$, and so on. Then,

$Math$

The final term, $Math$, is a function whose form depends on the model of mutation. Using Q^A_{i,+t} to denote $Math$, and so on, we obtain

$Math$

where δ_ij denotes the Kronecker delta. In a PIM model this simplifies to

$Math$

For completeness, we also give the corresponding expression for the infinite-alleles model (Jenkins and Song 2009):

$Math$

We now outline the steps to obtain the recursion shown in (10). One can use Equation A8 to show that

$Math$ (A9)

After substituting the asymptotic expansion (6) with $Math$ into (5) and comparing coefficients of terms proportional to 1/ρ², one obtains the recursion

$Math$

with boundary conditions $Math$ for i = 1, …, K and $Math$ for j = 1, …, L. By appealing to (A9), this can be simplified to the form given in Equation 10.▪

Proof of Theorem 3. Since $Math$ is an asymptotic approximation to $Math$, by definition

$Math$

Hence, using $Math$, we conclude that ρ′ exists such that, for all ρ′ ≤ ρ < ∞,

$Math$

which implies $Math$ for ρ′ ≤ ρ < ∞. Since $Math$, this implies that the MLE for ρ must be finite.▪

Acknowledgments

We thank Fulton Wang for writing a computer program that numerically solves two-locus recursions and Rasmus Nielsen and Monty Slatkin for helpful comments. This research is supported in part by a National Institutes of Health grant R00-GM080099 (to Y.S.S.), an Alfred P. Sloan Research Fellowship (to Y.S.S.), and a Packard Fellowship for Science and Engineering (to Y.S.S.).

Footnotes

• Communicating editor: J. Wakeley