An Accurate Sequentially Markov Conditional Sampling Distribution for the Coalescent With Recombination
Joshua S. Paul, Matthias Steinrücken, Yun S. Song

Abstract

The sequentially Markov coalescent is a simplified genealogical process that aims to capture the essential features of the full coalescent model with recombination, while being scalable in the number of loci. In this article, the sequentially Markov framework is applied to the conditional sampling distribution (CSD), which is at the core of many statistical tools for population genetic analyses. Briefly, the CSD describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. A hidden Markov model (HMM) formulation of the sequentially Markov CSD is developed here, yielding an algorithm with time complexity linear in both the number of loci and the number of haplotypes. This work provides a highly accurate, practical approximation to a recently introduced CSD derived from the diffusion process associated with the coalescent with recombination. It is empirically demonstrated that the improvement in accuracy of the new CSD over previously proposed HMM-based CSDs increases substantially with the number of loci. The framework presented here can be adopted in a wide range of applications in population genetics, including imputing missing sequence data, estimating recombination rates, and inferring human colonization history.

THE conditional sampling distribution (CSD) describes the probability, under a particular population genetic model, that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. In many important settings, the relevant population genetic model is the coalescent with recombination, for which the true CSD, denoted by π, does not have a known analytic formula. Nevertheless, the CSD π and, in particular, approximations thereof have found a wide range of applications in population genetics.

One important problem in which the CSD plays a fundamental role is describing the posterior distribution of genealogies under the coalescent process. Stephens and Donnelly (2000) showed that the true posterior distribution can be written in terms of π and can be approximated by using an approximate CSD, denoted Math. This observation has been used (Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Fearnhead and Smith 2005; Griffiths et al. 2008) to construct importance sampling schemes for likelihood computation and ancestral inference under the coalescent, including extensions such as recombination and population structure. In conjunction with composite-likelihood frameworks (Hudson 2001; Fearnhead and Donnelly 2002), these importance sampling methods have been used, for example, to estimate fine-scale recombination rates (McVean et al. 2004; Fearnhead and Smith 2005; Johnson and Slatkin 2009).

Li and Stephens (2003) introduced a different application of the CSD, observing that the probability of sampling a set of haplotypes can be decomposed into a product of CSDs and therefore can be approximated by a product of approximate CSDs Math. Similar applications of the CSD have yielded methods for estimating recombination rates (Li and Stephens 2003; Crawford et al. 2004; Stephens and Scheet 2005) and gene conversion parameters (Gay et al. 2007; Yin et al. 2009), for phasing genotype data into haplotype data (Stephens and Scheet 2005), for imputing missing data to improve power in association studies (Stephens and Scheet 2005; Li and Abecasis 2006; Scheet and Stephens 2006; Marchini et al. 2007; Howie et al. 2009), for inferring ancestry in admixed populations (Price et al. 2009), and for inferring demography (Hellenthal et al. 2008; Davison et al. 2009).

In all applications, the fidelity with which the surrogate CSD Math approximates the true CSD π is critical to the quality of the result. Furthermore, the time required to compute probabilities under the CSD is important, as many of the above methods are now routinely applied to genome-scale data sets. As a result, many approximate CSDs have been proposed, particularly for the coalescent with recombination. Fearnhead and Donnelly (2001) introduced an approximation in which an additionally sampled haplotype is constructed as an imperfect mosaic of previously sampled haplotypes, with mosaic breakpoints caused by recombination events and imperfections corresponding to mutation events. The resulting CSD, which we denote by Math, can be cast as a hidden Markov model (HMM), and the associated conditional sampling probability (CSP) can be computed with time complexity linear in both the number of previously sampled haplotypes and the number of loci. Li and Stephens (2003) proposed a related model that can be viewed as a modification to Math limiting the state space of the HMM, hence providing a constant factor improvement in the time complexity; we denote the corresponding CSD by Math.

Following the theoretical work of De Iorio and Griffiths (2004a), Griffiths et al. (2008) derived an approximate CSD from the Wright–Fisher diffusion process associated with the two-locus coalescent with recombination. More recently, Paul and Song (2010) generalized this work to an arbitrary number of loci and demonstrated that the resulting CSD, which we denote by Math, can also be described by a genealogical process. Though it is more accurate than both Math and Math, computing the CSP under Math has time complexity superexponential in the number of loci. To ameliorate this limitation, Paul and Song introduced the approximate CSD Math, which follows from prohibiting coalescence events in the genealogical process associated with Math. Computing the CSP under Math has time complexity exponential in the number of loci. Although this is an improvement over the superexponential complexity associated with Math, it is still impracticable to use Math for >20 loci.

In this article, we introduce an alternate approximation that is scalable in the number of loci, while maintaining the key features of Math that lead to high accuracy. Specifically, motivated by the sequentially Markov coalescent (SMC) introduced by McVean and Cardin (2005), we derive a sequentially Markov approximation to Math. The key idea is to consider the marginal genealogies at each locus sequentially, using the genealogical description of Math. In general, the sequence of marginal genealogies is not Markov, but, as in McVean and Cardin (2005), we make approximations to provide a Markov construction for the sequence. We denote the resulting approximation of Math by Math. The CSD Math can also be obtained from Math by prohibiting a certain class of coalescence events, a fact that mirrors the relation between the SMC and the coalescent with recombination (McVean and Cardin 2005). We formalize this relation by proving that Math is, in fact, equal to Math.

Due to its sequentially Markov construction, Math can be cast as an HMM. Unfortunately, the state space of the HMM is continuous, and so efficient algorithms for CSP computation and posterior inference are not known. Our solution is to discretize the state space. The discretization procedure we develop is related, though not identical, to the Gaussian quadrature method employed by Stephens and Donnelly (2000) and Fearnhead and Donnelly (2001). Although we focus on the CSD problem here, we believe that our general approach has the potential to foster applications of the SMC in other settings as well (see Hobolth et al. 2007; Dutheil et al. 2009).

Having discretized the continuous state space, we apply standard HMM theory to obtain an efficient dynamic program for computing the CSP under the discretized approximation of Math. The resulting time complexity is linear in both the number of previously sampled haplotypes and the number of loci. This time complexity is the same as that for Math and Math and hence is a substantial improvement over Math. In summary, the work presented here provides a practical approximation to Math, which was derived from the diffusion process associated with the coalescent with recombination. Furthermore, as detailed later, the improvement in accuracy of our new CSD over Math and Math increases substantially with the number of loci.

The remainder of this article is organized as follows. In model, we present the necessary notation and background and describe our new CSD Math. We also give an overview of the proof that Math is equivalent to Math and demonstrate several other useful properties. In discretization of the hmm, we describe the discretization of Math, and in empirical results, we provide empirical evidence that the discretized approximation performs well, with regard to both accuracy and run time. Finally, in discussion we mention some connections to existing models and describe possible applications and extensions, in particular conditionally sampling more than one haplotype.

MODEL

In this section, we describe the key transition and emission distributions for the HMM underlying Math. Further, we demonstrate that Math is equivalent to Math, the variant of Math with coalescence disallowed, and also show that the transition density satisfies several useful properties.

Notation:

We consider haplotypes in the finite-sites finite-alleles setting. Denote the set of loci by L = {1, … ,k} and the set of alleles at locus ℓ ∈ L by E. Mutations occur at locus ℓ ∈ L at rate θ/2 and according to the stochastic matrix Math. Denote the set of breakpoints by B = {(1, 2), … , (k − 1, k)}, where recombination occurs at breakpoint bB at rate ρb/2.

The space of k-locus haplotypes is denoted by ℋ = E1 × … × Ek. Given a haplotype α ∈ ℋ, we denote by α[ℓ] ∈ E the allele at locus ℓ ∈ L and by α[1 : ℓ] the partial haplotype (α[1], … , α[ℓ]). A sample configuration of haplotypes is specified by a vector Math, with nα being the number of haplotypes of type α in the sample. The total number of haplotypes in the sample is denoted by |n| = n. Finally, we use eα to denote the singleton configuration comprising a single α haplotype.

A brief review of the CSD Math:

The approximate CSD Math is described by a genealogical process closely related to the coalescent with recombination. We provide below a brief description of the framework and refer the reader to Paul and Song (2010) for further details.

Suppose that, conditioned on having already observed a haplotype configuration Math, we wish to sample a new haplotype α. Define 𝒜*(n) to be the nonrandom trunk ancestry for Math, in which lineages associated with the haplotypes do not mutate, recombine, or coalesce with one another, but rather extend infinitely into the past. We assume that the unknown ancestry associated with Math is 𝒜*(n) and sample a conditional ancestry C associated with α. Within the conditional ancestry, lineages evolve backward in time with the following rates:

  • Mutation: Each lineage mutates at locus ℓ ∈ L with rate θ/2, according to P(ℓ).

  • Recombination: Each lineage undergoes recombination at breakpoint bB with rate ρb/2.

  • Coalescence: Each pair of lineages coalesces with rate 1.

  • Absorption: Each lineage is absorbed into each lineage of 𝒜*(n) at rate 1/2.

When every lineage has been absorbed into 𝒜*(n), the process terminates. The type of every lineage in C can now be inferred, and a sample for α is generated. An illustration of this process is presented in Figure 1A.

Figure 1.—

Illustration of the corresponding genealogical and sequential interpretations for a realization of Formula. The three loci of each haplotype are each represented by a solid circle, with the color indicating the allelic type at that locus. The trunk genealogy 𝒜*(n) and conditional genealogy C are indicated. Time is represented vertically, with the present (time 0) at the bottom of the illustration. (A) The genealogical interpretation: Mutation events, along with the locus and resulting haplotype, are indicated by small arrows. Recombination events, and the resulting haplotype, are indicated by branching events in C. Absorption events, and the corresponding absorption time [t(a) and t(b)] and haplotype [h(a) and h(b), respectively], are indicated by dotted-dashed horizontal lines. (B) The corresponding sequential interpretation: The marginal genealogies at the first, second, and third locus (S1, S2, and S3) are emphasized as dotted, dashed, and solid lines, respectively. Mutation events at each locus, along with resulting allele, are indicated by small arrows. Absorption events at each locus are indicated by horizontal lines.

Although a recursion for computing the CSP Math is known (Paul and Song 2010, Equation 7), it is computationally intractable, and Paul and Song approximate the genealogical process by disallowing coalescence within the conditional genealogy, denoting the resulting CSD by Math. The recursion for Math (Paul and Song 2010, Equation 12) is amenable to dynamic programming, though it still has time complexity exponential in the number k of loci.

The sequentially Markov coalescent:

The sequential interpretation of the coalescent with recombination was introduced by Wiuf and Hein (1999). They observed that an ancestral recombination graph (ARG) may be simulated sequentially along the chromosome. In particular, the marginal coalescent tree at a given locus can be sampled conditional on the marginal ARG for all previous loci. The full ARG is then sampled by first sampling a coalescent tree at the leftmost locus and then proceeding to the right.

McVean and Cardin (2005) proposed a simplification of this process. Though McVean and Cardin presented their work for the infinite-sites model, we state (but do not derive) the analogous results for a finite-sites, finite-alleles model. In their approach, the marginal coalescent tree at locus ℓ is sampled conditional only on the marginal coalescent tree at locus ℓ − 1. In particular, setting b = (ℓ − 1, ℓ) ∈ B, (1) recombination breakpoints are realized as a Poisson process with rate ρb/2 on the marginal coalescent tree at locus ℓ − 1, (2) the lineage branching from each recombination breakpoint associated with locus ℓ − 1 is removed, and (3) the lineage branching from each recombination breakpoint associated with locus ℓ is subject to coalescence with other lineages at rate 1. The resulting tree is the marginal genealogy at locus ℓ. This approximation is called the sequentially Markov coalescent (SMC) and is equivalent to a variant of the coalescent with recombination that disallows coalescence between lineages ancestral to disjoint regions of the sequence (McVean and Cardin 2005).

The sequentially Markov CSD Math

We now describe a sequentially Markov approximation to the genealogical process underlying Math. Our construction is similar to that given by McVean and Cardin (2005), described above, though the resulting dynamics are less involved since the conditional genealogy is constructed for a single haplotype. First, observe that under Math, the marginal conditional genealogy at a given locus ℓ ∈ L is entirely determined by two random variables: the absorption time, which we denote T, and the absorption haplotype, which we denote H. The present corresponds to time 0 and T ∈ [0, ∞]. See Figure 1B for an illustration. For convenience, we write S = (T, H) for the random marginal conditional genealogy at locus ℓ ∈ L and s = (t, h) for a realization.

Within the marginal conditional genealogy at locus ℓ ∈ L, note that T and H are independent, with T distributed exponentially with parameter n/2 and H distributed uniformly over the n haplotypes of Math. Thus, the marginal conditional genealogy S at locus ℓ is distributed with density ζ(n), whereMath(1)

Conditioning on Sℓ−1= sℓ−1=(tℓ−1, hℓ−1), the marginal conditional genealogy S, for ℓ ≥ 2, is sampled by a process analogous to that described above for the SMC. Setting b = (ℓ − 1, ℓ) ∈ B, the sampling procedure is as follows (see Figure 2 for an accompanying illustration): (1) Recombination breakpoints are realized as a Poisson process with rate ρb/2 on the marginal conditional genealogy sℓ−1; (2) going backward in time, the lineage associated with locus ℓ−1 branching from each recombination breakpoint is removed, so that only the lineage more recent than the first (i.e., the most recent) breakpoint remains; and (3) the lineage associated with locus ℓ branching from the first recombination breakpoint is absorbed into a particular lineage of 𝒜*(n) at rate 1/2.

Figure 2.—

Illustration of the (Markov) process for sampling the absorption time T given the absorption time Tℓ−1 = tℓ−1. In step 1, recombination breakpoints are realized as a Poisson process with rate ρb/2 on the marginal conditional genealogy with absorption time tℓ−1. In step 2, the lineage branching from each breakpoint associated with locus ℓ−1 is removed, so that only the lineage more recent than the first breakpoint, at time Tr, remains. In step 3, the lineage branching from the first recombination breakpoint associated with locus is absorbed after time Ta distributed exponentially with rate n/2. Thus, T = Tr + Ta.

From the above description, we deduce that there is no recombination between loci ℓ−1 and ℓ with probability Math, and in this case the marginal conditional genealogy is unchanged; that is, S = sℓ−1. Otherwise, the time Tr of the first recombination breakpoint is distributed exponentially with parameter ρb/2, truncated at time tℓ−1, and the additional time Ta until absorption is distributed exponentially with parameter n/2. Thus we have S = (Tr + Ta, H), where H is chosen uniformly at random from the sample Math. Taking a convolution of Tr and Ta, the transition density Math is given byMath(2)where Math.

Finally, conditioning on S = s, recall that mutations are realized as a Poisson process (cf. Stephens and Donnelly 2000) with rate θ/2. Therefore, a particular allele aE is observed with probabilityMath(3)Hereafter, we omit the superscript Math and the subscripts θ and ρb from these densities, whenever the context is unambiguous.

The sequentially Markov approximation to Math can be cast as a continuous-state HMM. In generating a haplotype α, the observed state, the hidden state, and initial, transition, and emission densities are given by the following:

  • Observed state: At locus ℓ ∈ L, the observed state is the allele α[ℓ].

  • Hidden state: At locus ℓ ∈ L, the hidden state is the marginal genealogy S = (T, H).

  • Initial density: ζ is defined in (1).

  • Transition density: ϕ is defined in (2).

  • Emission density: ξ is defined in (3).

Writing Math for the sequentially Markov approximation to Math, we can use the forward recursion (see, e.g., Doucet and Johansen 2008) to getMath(4)where fSMC (·, ·) is defined byMath(5)with base caseMath(6)Though we cannot analytically solve the above recursion for Math, in the next section we derive a discretized approximation with time complexity linear in both the number of loci k and the number of haplotypes n. Before doing so, we briefly discuss some appealing properties of Math.

Properties of Math:

Recall that the SMC approximation of McVean and Cardin (2005) is equivalent to a variant of the coalescent with recombination disallowing coalescence events between lineages ancestral to disjoint regions. Similarly, the CSD Math, when used to sample a single haplotype, is a variant of Math disallowing the same class of coalescence events. We might therefore expect that the sequentially Markov approximation of Math described above is equivalent to Math, and in fact we can show that this is true.

Proposition 1. For an arbitrary single haplotype α ∈ ℋ and haplotype configuration Math, Math.

We present a sketch of the proof here and refer the reader to supporting information, File S1, for further details.

Sketch of Proof. The key idea of the proof is to introduce a genealogical recursion for f(α, sk), the joint density function associated with sampling haplotype α Math and the marginal genealogy at the last locus sk. This recursion can be constructed following the lines of Griffiths and Tavaré (1994) to explicitly incorporate coalescent time into a genealogical recursion.

By partitioning with respect to the most recent event occurring at the last locus k, it is possible to inductively show that fSMC(α, sk) = f(α, sk). Furthermore, the equality Math can be verified, and thus we conclude thatMath

We now describe other intuitively appealing properties of Math. In particular, it can be verified that the detailed-balance conditionMath(7)holds for the initial and transition densities, ζ and ϕ, respectively. This immediately implies that the initial distribution ζ is stationary under the given transition dynamics; i.e., the invariance conditionMathis satisfied. Thus, S is marginally distributed according to ζ for all loci ℓ ∈ L, and in particular the marginal distribution of T is exponential with rate n/2. This parallels the fact that the marginal genealogies under the SMC (and the coalescent with recombination) are distributed according to Kingman's coalescent.

Similarly, the transition density exhibits a consistency property, which we call the locus-skipping property. Intuitively, this property states that transitioning directly from locus ℓ − 1 to ℓ + 1 can be accomplished by using the transition density parameterized with the sum of the recombination rates. Formally, the following equality holds for all ρ1, ρ2 ≥ 0:Math(8)This property, in conjunction with recursion (5), is computationally useful, as it enables loci ℓ ∈ L for which α[ℓ] is unobserved to be skipped in computing the CSP Math.

Finally, the conditional expectation of T given Tℓ−1 = tℓ−1 isMath(9)where b = (ℓ − 1, ℓ) ∈ B. Asymptotically, this expression provides several intuitive results. As ρb → ∞, 𝔼[T|tℓ−1] → 2/n; that is, recombination happens immediately, and 2/n is the expectation of the additional absorption time Ta. As ρb → 0, we get 𝔼[T|tℓ − 1] → tℓ−1. In this case there is no recombination, and the absorption time does not change. Further, 𝔼[T|tℓ − 1] → 2/ρb + 2/n holds as tℓ−1 → ∞. Here, recombination must occur, and the exponentially distributed time is not truncated, so the expectation is the sum of the expectations of two exponentials. Finally, as tℓ−1 → 0 we have 𝔼[T|tℓ − 1] → 0. No recombination can occur, and so the absorption time is unchanged.

DISCRETIZATION OF THE HMM

In the previous section we described a sequentially Markov approximation of the CSD Math and showed that it can be cast as an HMM. Because the absorption time component of the hidden state is continuous, the dynamic program associated with the classical HMM forward recursion is not applicable. However, by discretizing the continuous component, we are once again able to obtain a dynamic programming algorithm, resulting in an approximate CSP computation linear in both the number of loci and the number of haplotypes.

Rescaling time:

Recall from the previous section that the marginal absorption time at each locus is exponentially distributed with parameter n/2. To use the same discretization for all n, we follow Stephens and Donnelly (2000) and Fearnhead and Donnelly (2001) and transform the absorption time to a more natural scale in which the marginal absorption time is independent of n. In particular, define the transformed state Σ = (𝒯, H) where 𝒯 = (n/2)T. We denote a realization of Σ by σ = (τ, h). In the appendix, we provide expressions for the transformed quantities Math and Math derived from (1), (2), (3), and (5), respectively.

Using this time-rescaled model, the marginal absorption time at each locus is exponentially distributed with parameter 1. Because this distribution is independent of n and the coalescent model parameters ρ and θ, we expect that a single discretization of the transformed absorption time is appropriate for a wide range of haplotype configurations and parameter values.

Discretizing absorption time:

Our next objective is to discretize the absorption time 𝒯 ∈ ℝ≥0. Let 0 = x0 < x1 < ··· < xd = ∞ be a finite strictly increasing sequence in ℝ≥0 ∪ {∞} so that D = {Dj = [xj−1, xj)}j=1, … ,d is a d-partition of ℝ≥0.

Toward formulating a D-discretized version of the dynamics exhibited by the transformed HMM, we define the following D-discretized version of the density MathMath(10)for all ℓ ∈ L. Unfortunately, we cannot obtain a recursion for Math via the definition of Math. Therefore, we make an additional approximation, namely that the transition and emission densities are conditionally dependent on the absorption time 𝒯 only through the event {Dj ∋ 𝒯 }; i.e., the densities depend on the interval Dj to which 𝒯 belongs but not on the actual value of 𝒯. Abusing notation, define Math and Math as the transition and emission densities, respectively, conditioned on the event {Dj ∋ 𝒯}. Formally, we make the following approximations:Math(11)Math(12)Together with the building blocks of the time-rescaled HMM, these assumptions provide a recursive approximation of Math, which we denote by Math. Specifically, assumptions (11) and (12) imply that the integral recursion for Math reduces to the discrete recursionMath(13)with base caseMath(14)where we have defined distributions Math and Math. Setting Math, we getMath(15)Turning to the transition density Math, which is conditioned on the event {Dj ∋ 𝒯}, and recalling that 𝒯 is marginally exponentially distributed with parameter 1, we obtainMath(16)with analytic expressions for y(i) and z(i,j) provided in the appendix. Note that assumption (11) is not used here; rather, the formula follows from using the time-rescaled version of the transition density (2) in the double integral. An expression for the emission density Math can be similarly obtained,Math(17)with an analytic expression for υ(i)(k) also given in the appendix. Again, assumption (12) is not used here; the second equality of (17) follows from using the time-rescaled version of the emission probability (3) in the integral. In summary, Math can be computed efficiently using (13), and Math can be approximated byMath(18)

Equations 1318 provide the requisite D-discretized versions of the transformed densities. Note that these equations characterize an HMM; that the Markov property holds on the discretized state space D follows from assumptions (11) and (12) (Rosenblatt 1959). In fact, (13–18) may alternatively be obtained by assuming that the Markov property holds on D and writing down the relevant transition and emission probabilities with the interpretations given above. In the remainder of this section, we examine some general properties of the discretized dynamics and also provide one method for choosing a discretization D.

Computational complexity of the discretized recursion:

We first consider the asymptotic complexity of computing the CSP under the D-discretized approximation for Math. Substituting Equation 16 into the key recursion (13) givesMath(19)for ℓ ≥ 2. For a fixed discretization D, the expressions Math, y(i), and z(i,j) depend only on the total sample size n, the mutation and recombination rates (θ and ρ), and the boundary points x0, … , xd of D; these may be precomputed and cached for relevant ranges of values. In conjunction with the base case (14), there is a dynamic program (see the appendix for details) for computing the CSP under the D-discretized approximation (18) for Math with time complexity O(k · (nd + d2)), where k is the number of loci. As in Fearnhead and Donnelly (2001), this time complexity is better than O(k · (nd)2), the result that would be obtained by naive use of the HMM forward algorithm.

Properties of the discretization:

Recall the detailed-balance condition (7) associated with Math. Using expressions (15) and (16), together with Bayes' rule, we find thatMath(20)holds (the details are provided in the appendix). Thus, the discretized approximation of Math satisfies an analogous detailed balance condition. As a result, the marginal distribution at each locus of the discretized Markov chain is (again) given by Math and the approximation exhibits the expected symmetries; for example, equal CSPs are computed whether starting at the leftmost locus and proceeding right or starting at the rightmost locus and proceeding left.

Furthermore, recall the locus-skipping property (8) associated with Math. The first equality in (16) and assumption (11) imply the relationMath(21)for all ρ1, ρ2 ≥ 0 (see the appendix for details). Thus, the discretized approximation of Math approximately satisfies an analogous locus-skipping condition, up to the error introduced via approximation (11). This approximation is particularly useful in scenarios when data are missing (i.e., α[] is unknown for one or more L), since this property reduces the time complexity of the dynamic program given above. In particular, when m of the k loci are missing, the time complexity is reduced to O((km) · (nd + d2)). This is relevant, for example, in importance sampling applications (Fearnhead and Donnelly 2001).

Discretization choice and the definition of Math:

Finally, we discuss a method for choosing a discretization D of the absorption time. Recalling that marginally the transformed absorption time is exponentially distributed with parameter 1, let {(w(j), τ(j))}j = 1, … ,d be the d-point Gaussian quadrature associated with the function f(τ) = e−τ (Abramowitz and Stegun 1972, Section 25.4.45). Set x0 = 0, and set xj such that Math. Since Math, the points Math determine a partition D = {Dj = [xj−1, xj)}j=1, … ,d of ℝ≥0.

The use of Gaussian quadrature evokes the work of Stephens and Donnelly (2000) and Fearnhead and Donnelly (2001). Although the method we employ is related, it is different in that we do not use the quadrature directly [for example, the values of the quadrature points {τ(j)} are never used explicitly]; rather, we use the Gaussian quadrature as a reasonable way of choosing a discretization D. We henceforth write Math for the d-point Gaussian quadrature-discretized version of Math.

EMPIRICAL RESULTS

In the previous section, we defined a discretized approximation Math of the CSD Math. In this section, we examine the accuracy of this approximation and also compare it to the widely used CSDs Math and Math, thereby providing evidence that Math is a more accurate and computationally tractable CSD.

Data simulation:

For simplicity, we consider a two-allele model with Math, θ = θ for ℓ ∈ L and ρb = ρ for bB. We sample a k-locus haplotype configuration Math by (i) using a coalescent with recombination simulator, with ρ = ρ0 and θ = θ0, to sample a k0-locus (with k0 >> k) n-haplotype configuration n0, and (ii) restricting attention to the central k segregating loci in n0. This procedure corresponds to the usage of the CSD on typical genomic data, in which only segregating sites are considered.

Given a k-locus n-haplotype configuration n, we obtain a k-locus n-haplotype conditional configuration C = (α, neα) by withholding a single haplotype α from Math uniformly at random. For notational simplicity, we define π on such a conditional configuration in the natural way: π(C) = π(α | neα).

CSD accuracy:

We evaluate the accuracy of a CSD Math relative to a reference CSD π0 using the expected absolute log-ratio (ALR) error,Math(22)where N denotes the number of simulated data sets and C(i) is a k-locus n-haplotype conditional configuration sampled as indicated above, and both Math and π0 are evaluated using the true parameter values θ = θ0 and ρ = ρ0. For example, if Math, the CSP obtained using Math differs from that obtained by π0 by a factor of 10, on average, for a randomly sampled k-locus n-haplotype conditional configuration.

Using the ALR error, we evaluate the accuracy of several CSDs: Math (Fearnhead and Donnelly 2001); Math (Li and Stephens 2003); Math, evaluated using the recursion for Math (Paul and Song 2010); and Math, the d-point quadrature-discretized version of Math, for d ∈ {4, 18, 16}. We also evaluate Math, a variant of Math introduced in Paul and Song (2010) with computational time complexity O(k3 · n); the CSD Math is described in more detail in the appendix.

In what follows, we set θ0 = 0.01 and ρ0 = 0.05 and fix n = 10. For k ≤ 10, it is possible to obtain a very good approximation to the true CSD π using computationally intensive importance sampling. The resulting values of ALRErrk,n(· | π) are plotted in Figure 3A, as a function of k. Supporting the conclusion of Paul and Song (2010), Math is more accurate than both Math and Math, with the disparity increasing as k increases. Moreover, the CSD Math is nearly as accurate as Math, suggesting that the discretization is fairly accurate even for modest values of d. Finally, the CSD Math has accuracy that is indistinguishable from Math.

Figure 3.—

Absolute log-ratio error (ALRErr) of various conditional sampling distributions. See (22) for a formal definition of ALRErrk,n(· | ·). The accuracy of Formula is almost indistinguishable from that of Formula, the most accurate of all approximate CSDs considered here. As expected, discretization reduces the accuracy somewhat, but even Formula is substantially more accurate than Formula and Formula. With θ0 = 0.01 and ρ0 = 0.05, we used the methodology described in the text to sample 250 conditional configurations, each with n = 10 haplotypes and k loci. (A) Error is measured relative to the true CSD π, estimated using computationally intensive importance sampling. (B) Error is measured relative to Formula, computed by numerically solving a recursion for the equivalent CSD Formula.

To investigate these results as k increases, we consider the ALR error relative to Math, which can be evaluated exactly for Math; the resulting values of Math are plotted in Figure 3B, as a function of k. As k increases, both Math and Math continue to diverge from Math, suggesting that the increasing disparity in accuracy, directly observable in Figure 3A, continues for larger values of k. As expected, the discretized approximation Math shows increased fidelity to Math for larger values of d, and even Math is substantially more accurate, relative to Math, than are Math and Math.

It is too computationally expensive to compute Math for k > 20. However, Figure 3B suggests that the CSD Math is nearly indistinguishable from Math. Motivated by this observation, we consider the error relative to Math for k > 20. The values of Math and the analogously defined signed log-ratio (SLR) error Math are plotted as a function of k in Figure 4, A and B, respectively. The trends observed in Figure 3 are recapitulated in Figure 4A, suggesting that they continue to hold for substantially larger values of k. Interestingly, Figure 4B shows that Math and Math produce values significantly smaller than Math (and Math); for example, Math takes values that are, on average, a factor of 10 smaller than Math for k = 100. In conjunction with our conclusion that Math is more accurate than Math and Math, this suggests a similar systematic error with respect to the true CSD.

Figure 4.—

Comparison of the accuracy of various conditional sampling distributions relative to Formula (see Figure 3 for the accuracy of Formula). A and B illustrate that the improvement in accuracy of Formula over Formula and Formula is amplified as the number of loci k increases and that both Formula and Formula produce significantly smaller values than Formula (and Formula). For θ0 = 0.01 and ρ0 = 0.05, we used the methodology described in the text to sample 250 conditional configurations with n = 10 haplotypes and k loci. (A) Absolute log-ratio error. (B) Signed log-ratio error.

For a discussion of CSD accuracy in the context of the product of approximate conditionals (PAC) method (Li and Stephens 2003), we refer the reader to Paul and Song (2010). Since Math is very close to Math (as demonstrated in the present paper), we anticipate that using it produces similar results for PAC likelihood estimation and recombination rate inference.

Running time comparison:

We next consider the empirically observed running time required to compute each CSP. The results, obtained using the conditional configurations with n = 10 and k ∈ {1, … , 100} simulated as previously described, are presented in Table 1. Looking across each row, it is evident that the running time under Math, and Math depends linearly on the number of loci k, matching the asymptotic time complexity. Similarly, the running time under Math is well matched by the theoretical cubic dependence on k.

View this table:
TABLE 1

Asymptotic time complexity and empirically observed average running time

Next, comparing Math, and Math, observe that the running time for Math is approximately a factor of 10 slower than Math and approximately a factor of 2 slower than Math. Similarly, Math is approximately a factor of 20 and of 4 slower than Math and Math, respectively; and Math is approximately a factor of 40 and of 8 slower than Math and Math, respectively. Importantly, these factors are constant, depending on neither the number of loci k nor the number of haplotypes n. Also note that the time required to compute the CSD for Math appears to depend linearly, rather than quadratically, on d for the modest (but relevant) values considered.

DISCUSSION

We have formulated a sequentially Markov approximation of Math, which we call Math. The relationship between the genealogical process underlying Math and Math is analogous to the relationship between the coalescent with recombination and the SMC. In particular, Math is equivalent to Math with a certain class of coalescence events disallowed. In the case of sampling one additional haplotype, this corresponds to disallowing all coalescence events, the same approximation used to obtain Math, and so we find that Math.

Though the CSD Math can be cast as an HMM, the associated CSP cannot be evaluated using typical HMM methodology because of the continuous state space; to our knowledge, exact evaluation is possible only via the known recursion for Math, which has time complexity exponential in the number of loci. By discretizing the continuous state space into d intervals, obtained using Gaussian quadrature, we obtain the discretized approximation Math for which computing the CSP has time complexity linear in both the number of loci and the number of haplotypes. We find that, even for modest values of d, Math is a very good approximation of Math. Importantly, Math is more accurate than Math and Math with only a (small) constant factor penalty in run time. We remark that we investigated alternative methods for discretizing the CSP computation (e.g., point-based rather than interval-based methods), but settled on the described approach as it exhibited desirable properties and is theoretically well motivated.

We attribute the observed increase in accuracy of Math to the incorporation of two key features of the coalescent with recombination that are not integrated into either Math or Math. Consider the genealogy associated with two particular haplotypes within an ARG. First, observe that the times to the most recent common ancestor (MRCA) at two neighboring loci are dependent, even if ancestral lineages at the two loci are separated by a recombination event. Math explicitly models a Markov approximation to the analogous absorption-time dependence across breakpoints, whereas both Math and Math assume independence. Second, if the time to the MRCA at a locus is small, the probability of recombination between this locus and neighboring loci is small, since it would have had to occur prior to the MRCA. While Math models this property by diminishing the probability of recombination between neighboring loci if the absorption time at the first locus is small, Math and Math assume that recombination is independent of absorption time. We believe that Math and Math tend to underestimate, on average, the true CSP (as suggested in Figure 4B) due to the omission of these key features. The relationship between several CSDs, including Math and Math, is illustrated in Figure 5.

Figure 5.—

Illustration of the relationship between various CSDs. The CSD at the head of each arrow can be seen as an approximation to the CSD at the tail. Each arrow is also annotated with a (short) description of this approximation. The CSDs below the dashed line can be cast as an HMM: Those above the dotted line (including a continuous-state version of Formula, which we denote Formula) have a continuous and infinite state space, while those below have a finite and discrete state space and are therefore amenable to simple dynamic programming algorithms. For more thorough descriptions of each approximation, see the main text and also Paul and Song (2010). Recall in particular that the equality Formula holds only for conditionally sampling a single haplotype.

Toward future research, recall that the CSD can be extended to sampling more than one additional haplotype (Paul and Song 2010). Of particular importance to population genetics tools (Stephens and Scheet 2005; Marchini et al. 2007; Howie et al. 2009) for diploid organisms is sampling two additional haplotypes. Though we focused on conditionally sampling a single additional haplotype in the present work, we note that the sequentially Markov approximation to Math is, in principle, applicable to sampling multiple haplotypes. However, the state space of the resulting HMM description increases exponentially with the number of haplotypes. In this domain, we anticipate that randomized techniques for CSP computation, such as importance sampling and Markov chain Monte Carlo, will exhibit high accuracy and the efficiency required for modern data sets. We pursue this line of research in a forthcoming article.

We believe that it is possible to extend the ideas presented here to different demographic scenarios, for example, spatial structure or models of population subdivision (Davison et al. 2009). It should be possible to extend the principled approach of Paul and Song (2010) toward the CSD via the diffusion generator to these scenarios, as in De Iorio and Griffiths (2004b) and Griffiths et al. (2008). In other scenarios, for example varying population size, the principled approach might not be applicable, so one would have to modify the genealogical interpretation heuristically, e.g., varying coalescence rates. As in the present article, prohibiting certain coalescence events in the conditional genealogy should then allow for an efficient implementation of the resulting CSDs as HMMs.

Though the SMC has been used for simulating population genetic samples (Marjoram and Wall 2006; Chen et al. 2009), it can also be cast as an HMM and used for inference in scenarios in which using the full coalescent with recombination is cumbersome. As described above, the state space of the HMM increases exponentially with the number of haplotypes, making exact computation intractable for large numbers of haplotypes. Nevertheless, research (Hobolth et al. 2007; Dutheil et al. 2009) is in progress for modest numbers of haplotypes. We believe that choosing a discretization using Gaussian quadrature, as described in discretization of the hmm, and the forthcoming randomized techniques alluded to above, will foster progress in this area.

We conclude by recalling that a broad range of population genetic tools have been developed, and will continue to be developed, on the basis of the CSD. These tools typically employ Math, or a similar variant, because the underlying HMM structure admits simple and fast recursions for the relevant calculations (e.g., the CSP). We have introduced a new CSD Math and a discretized approximation Math, which also have simple underlying HMM structures and substantially improve upon the accuracy of Math and Math. We believe that Math, when used in the same contexts as Math and Math, has the potential to produce more accurate results, with only a small constant factor penalty in run time.

APPENDIX

Time-transformed model:

Rewriting the HMM Equations 16 in terms of the transformed state Σ = (𝒯, H) introduced in discretization of the hmm yieldsMath(A1)where the transformed density Math is given byMath(A2)with the base caseMath(A3)The transformed initial, transition, and emission densities are given byMath(A4)Math(A5)Math(A6)Note that care must be taken upon transforming the Dirac-δ in the expression for Math.

Analytic expressions for emission and transition probabilities:

We now provide analytic expressions for the quantities y(i), z(i,j), and υ(i)(k) introduced for the transition probability (16) and the emission probability (17). Recalling that Di = [xi−1, xi) and Dj = [xj−1, xj) and evaluating the associated integrals, we getMath(A7)Math(A8)for ρbn,Math(A9)for ρb = n, andMath(A10)Note that the recursive structure of υ(i)(k) [together with Math and the sum in Equation 17] suggests an efficient implementation.

Description of the dynamic program for D-discretized Math:

Let D = {D1, … , Dd} be a finite partition of ℝ≥0 as described in the text. Recalling the recursion for Math given in Equation 19, consider the following dynamic programming algorithm for computing the D-discretized approximation of Math:

  1. For each DjD and h ∈ ℋ such that nh > 0, compute Math using (14), and set Math

  2. For each ℓ ∈ {2, … , k},

  1. For each DjD, compute Math

  2. For each DjD and h ∈ ℋ such that nh > 0, computeMathMath

  • 3. Compute D-discretized approximation MathMath.

The time complexities of steps 2a and 2b are O(d2) and O(nd), respectively. The time complexities of steps 1 and 3 are both O(nd). We can therefore conclude that the time complexity of the dynamic program is Math.

Detailed balance and locus skipping:

The detailed-balance condition (20) for the discretized model Math can be shown using expressions (15) and (16). Together with Bayes' rule, we find that the following holds:Math(A11)Using expression (16) and assumption (11) we can show thatMath(A12)holds; thus the locus-skipping property (21) for the discretized model Math holds only approximately. Here we make explicit that the error is introduced by approximation (11) in the third step. Thus it is possible to explicitly assess the error and it goes to zero as the number of intervals used for the discretization becomes large.

A description of Math:

Computing the CSP for Math can be done via a genealogical recursion (Paul and Song 2010, Equation 12), but has time complexity exponential in the number of loci, k. To improve upon this result, Paul and Song suggest using the genealogical recursion until the first mutation, and thereafter using a fast alternative CSD Math (Paul and Song 2010, Equation 13). In particular, choosing Math yields Math, for which CSP computation has asymptotic time complexity O(k3· n).

Similarly, choosing Math yields Math, for which CSP computation has the same asymptotic time complexity O(k3 · n). Importantly, Math is more accurate than Math, and so the resulting CSD Math is more accurate than Math.

Acknowledgments

This research is supported in part by National Institutes of Health (NIH) grant R01-GM094402, an Alfred P. Sloan Research Fellowship, and a Packard Fellowship for Science and Engineering. The research of J.S.P was funded in part by an NIH National Research Service Award Trainee appointment on T32-HG00047.

Footnotes

  • Received November 30, 2010.
  • Accepted January 21, 2011.

References

View Abstract