An Accurate Sequentially Markov Conditional Sampling Distribution for the Coalescent With Recombination
Joshua S. Paul, Matthias Steinrücken, Yun S. Song

## Abstract

The sequentially Markov coalescent is a simplified genealogical process that aims to capture the essential features of the full coalescent model with recombination, while being scalable in the number of loci. In this article, the sequentially Markov framework is applied to the conditional sampling distribution (CSD), which is at the core of many statistical tools for population genetic analyses. Briefly, the CSD describes the probability that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. A hidden Markov model (HMM) formulation of the sequentially Markov CSD is developed here, yielding an algorithm with time complexity linear in both the number of loci and the number of haplotypes. This work provides a highly accurate, practical approximation to a recently introduced CSD derived from the diffusion process associated with the coalescent with recombination. It is empirically demonstrated that the improvement in accuracy of the new CSD over previously proposed HMM-based CSDs increases substantially with the number of loci. The framework presented here can be adopted in a wide range of applications in population genetics, including imputing missing sequence data, estimating recombination rates, and inferring human colonization history.

THE conditional sampling distribution (CSD) describes the probability, under a particular population genetic model, that an additionally sampled DNA sequence is of a certain type, given that a collection of sequences has already been observed. In many important settings, the relevant population genetic model is the coalescent with recombination, for which the true CSD, denoted by π, does not have a known analytic formula. Nevertheless, the CSD π and, in particular, approximations thereof have found a wide range of applications in population genetics.

One important problem in which the CSD plays a fundamental role is describing the posterior distribution of genealogies under the coalescent process. Stephens and Donnelly (2000) showed that the true posterior distribution can be written in terms of π and can be approximated by using an approximate CSD, denoted $Math$. This observation has been used (Stephens and Donnelly 2000; Fearnhead and Donnelly 2001; De Iorio and Griffiths 2004a,b; Fearnhead and Smith 2005; Griffiths et al. 2008) to construct importance sampling schemes for likelihood computation and ancestral inference under the coalescent, including extensions such as recombination and population structure. In conjunction with composite-likelihood frameworks (Hudson 2001; Fearnhead and Donnelly 2002), these importance sampling methods have been used, for example, to estimate fine-scale recombination rates (McVean et al. 2004; Fearnhead and Smith 2005; Johnson and Slatkin 2009).

Li and Stephens (2003) introduced a different application of the CSD, observing that the probability of sampling a set of haplotypes can be decomposed into a product of CSDs and therefore can be approximated by a product of approximate CSDs $Math$. Similar applications of the CSD have yielded methods for estimating recombination rates (Li and Stephens 2003; Crawford et al. 2004; Stephens and Scheet 2005) and gene conversion parameters (Gay et al. 2007; Yin et al. 2009), for phasing genotype data into haplotype data (Stephens and Scheet 2005), for imputing missing data to improve power in association studies (Stephens and Scheet 2005; Li and Abecasis 2006; Scheet and Stephens 2006; Marchini et al. 2007; Howie et al. 2009), for inferring ancestry in admixed populations (Price et al. 2009), and for inferring demography (Hellenthal et al. 2008; Davison et al. 2009).

In all applications, the fidelity with which the surrogate CSD $Math$ approximates the true CSD π is critical to the quality of the result. Furthermore, the time required to compute probabilities under the CSD is important, as many of the above methods are now routinely applied to genome-scale data sets. As a result, many approximate CSDs have been proposed, particularly for the coalescent with recombination. Fearnhead and Donnelly (2001) introduced an approximation in which an additionally sampled haplotype is constructed as an imperfect mosaic of previously sampled haplotypes, with mosaic breakpoints caused by recombination events and imperfections corresponding to mutation events. The resulting CSD, which we denote by $Math$, can be cast as a hidden Markov model (HMM), and the associated conditional sampling probability (CSP) can be computed with time complexity linear in both the number of previously sampled haplotypes and the number of loci. Li and Stephens (2003) proposed a related model that can be viewed as a modification to $Math$ limiting the state space of the HMM, hence providing a constant factor improvement in the time complexity; we denote the corresponding CSD by $Math$.

Following the theoretical work of De Iorio and Griffiths (2004a), Griffiths et al. (2008) derived an approximate CSD from the Wright–Fisher diffusion process associated with the two-locus coalescent with recombination. More recently, Paul and Song (2010) generalized this work to an arbitrary number of loci and demonstrated that the resulting CSD, which we denote by $Math$, can also be described by a genealogical process. Though it is more accurate than both $Math$ and $Math$, computing the CSP under $Math$ has time complexity superexponential in the number of loci. To ameliorate this limitation, Paul and Song introduced the approximate CSD $Math$, which follows from prohibiting coalescence events in the genealogical process associated with $Math$. Computing the CSP under $Math$ has time complexity exponential in the number of loci. Although this is an improvement over the superexponential complexity associated with $Math$, it is still impracticable to use $Math$ for >20 loci.

In this article, we introduce an alternate approximation that is scalable in the number of loci, while maintaining the key features of $Math$ that lead to high accuracy. Specifically, motivated by the sequentially Markov coalescent (SMC) introduced by McVean and Cardin (2005), we derive a sequentially Markov approximation to $Math$. The key idea is to consider the marginal genealogies at each locus sequentially, using the genealogical description of $Math$. In general, the sequence of marginal genealogies is not Markov, but, as in McVean and Cardin (2005), we make approximations to provide a Markov construction for the sequence. We denote the resulting approximation of $Math$ by $Math$. The CSD $Math$ can also be obtained from $Math$ by prohibiting a certain class of coalescence events, a fact that mirrors the relation between the SMC and the coalescent with recombination (McVean and Cardin 2005). We formalize this relation by proving that $Math$ is, in fact, equal to $Math$.

Due to its sequentially Markov construction, $Math$ can be cast as an HMM. Unfortunately, the state space of the HMM is continuous, and so efficient algorithms for CSP computation and posterior inference are not known. Our solution is to discretize the state space. The discretization procedure we develop is related, though not identical, to the Gaussian quadrature method employed by Stephens and Donnelly (2000) and Fearnhead and Donnelly (2001). Although we focus on the CSD problem here, we believe that our general approach has the potential to foster applications of the SMC in other settings as well (see Hobolth et al. 2007; Dutheil et al. 2009).

Having discretized the continuous state space, we apply standard HMM theory to obtain an efficient dynamic program for computing the CSP under the discretized approximation of $Math$. The resulting time complexity is linear in both the number of previously sampled haplotypes and the number of loci. This time complexity is the same as that for $Math$ and $Math$ and hence is a substantial improvement over $Math$. In summary, the work presented here provides a practical approximation to $Math$, which was derived from the diffusion process associated with the coalescent with recombination. Furthermore, as detailed later, the improvement in accuracy of our new CSD over $Math$ and $Math$ increases substantially with the number of loci.

The remainder of this article is organized as follows. In model, we present the necessary notation and background and describe our new CSD $Math$. We also give an overview of the proof that $Math$ is equivalent to $Math$ and demonstrate several other useful properties. In discretization of the hmm, we describe the discretization of $Math$, and in empirical results, we provide empirical evidence that the discretized approximation performs well, with regard to both accuracy and run time. Finally, in discussion we mention some connections to existing models and describe possible applications and extensions, in particular conditionally sampling more than one haplotype.

## MODEL

In this section, we describe the key transition and emission distributions for the HMM underlying $Math$. Further, we demonstrate that $Math$ is equivalent to $Math$, the variant of $Math$ with coalescence disallowed, and also show that the transition density satisfies several useful properties.

#### Notation:

We consider haplotypes in the finite-sites finite-alleles setting. Denote the set of loci by L = {1, … ,k} and the set of alleles at locus ℓ ∈ L by E. Mutations occur at locus ℓ ∈ L at rate θ/2 and according to the stochastic matrix $Math$. Denote the set of breakpoints by B = {(1, 2), … , (k − 1, k)}, where recombination occurs at breakpoint bB at rate ρb/2.

The space of k-locus haplotypes is denoted by ℋ = E1 × … × Ek. Given a haplotype α ∈ ℋ, we denote by α[ℓ] ∈ E the allele at locus ℓ ∈ L and by α[1 : ℓ] the partial haplotype (α[1], … , α[ℓ]). A sample configuration of haplotypes is specified by a vector $Math$, with nα being the number of haplotypes of type α in the sample. The total number of haplotypes in the sample is denoted by |n| = n. Finally, we use eα to denote the singleton configuration comprising a single α haplotype.

#### A brief review of the CSD $Math$$Math$:

The approximate CSD $Math$ is described by a genealogical process closely related to the coalescent with recombination. We provide below a brief description of the framework and refer the reader to Paul and Song (2010) for further details.

Suppose that, conditioned on having already observed a haplotype configuration $Math$, we wish to sample a new haplotype α. Define 𝒜*(n) to be the nonrandom trunk ancestry for $Math$, in which lineages associated with the haplotypes do not mutate, recombine, or coalesce with one another, but rather extend infinitely into the past. We assume that the unknown ancestry associated with $Math$ is 𝒜*(n) and sample a conditional ancestry C associated with α. Within the conditional ancestry, lineages evolve backward in time with the following rates:

• Mutation: Each lineage mutates at locus ℓ ∈ L with rate θ/2, according to P(ℓ).

• Recombination: Each lineage undergoes recombination at breakpoint bB with rate ρb/2.

• Coalescence: Each pair of lineages coalesces with rate 1.

• Absorption: Each lineage is absorbed into each lineage of 𝒜*(n) at rate 1/2.

When every lineage has been absorbed into 𝒜*(n), the process terminates. The type of every lineage in C can now be inferred, and a sample for α is generated. An illustration of this process is presented in Figure 1A.

Figure 1.—

Illustration of the corresponding genealogical and sequential interpretations for a realization of $Formula$. The three loci of each haplotype are each represented by a solid circle, with the color indicating the allelic type at that locus. The trunk genealogy 𝒜*(n) and conditional genealogy C are indicated. Time is represented vertically, with the present (time 0) at the bottom of the illustration. (A) The genealogical interpretation: Mutation events, along with the locus and resulting haplotype, are indicated by small arrows. Recombination events, and the resulting haplotype, are indicated by branching events in C. Absorption events, and the corresponding absorption time [t(a) and t(b)] and haplotype [h(a) and h(b), respectively], are indicated by dotted-dashed horizontal lines. (B) The corresponding sequential interpretation: The marginal genealogies at the first, second, and third locus (S1, S2, and S3) are emphasized as dotted, dashed, and solid lines, respectively. Mutation events at each locus, along with resulting allele, are indicated by small arrows. Absorption events at each locus are indicated by horizontal lines.

Although a recursion for computing the CSP $Math$ is known (Paul and Song 2010, Equation 7), it is computationally intractable, and Paul and Song approximate the genealogical process by disallowing coalescence within the conditional genealogy, denoting the resulting CSD by $Math$. The recursion for $Math$ (Paul and Song 2010, Equation 12) is amenable to dynamic programming, though it still has time complexity exponential in the number k of loci.

#### The sequentially Markov coalescent:

The sequential interpretation of the coalescent with recombination was introduced by Wiuf and Hein (1999). They observed that an ancestral recombination graph (ARG) may be simulated sequentially along the chromosome. In particular, the marginal coalescent tree at a given locus can be sampled conditional on the marginal ARG for all previous loci. The full ARG is then sampled by first sampling a coalescent tree at the leftmost locus and then proceeding to the right.

McVean and Cardin (2005) proposed a simplification of this process. Though McVean and Cardin presented their work for the infinite-sites model, we state (but do not derive) the analogous results for a finite-sites, finite-alleles model. In their approach, the marginal coalescent tree at locus ℓ is sampled conditional only on the marginal coalescent tree at locus ℓ − 1. In particular, setting b = (ℓ − 1, ℓ) ∈ B, (1) recombination breakpoints are realized as a Poisson process with rate ρb/2 on the marginal coalescent tree at locus ℓ − 1, (2) the lineage branching from each recombination breakpoint associated with locus ℓ − 1 is removed, and (3) the lineage branching from each recombination breakpoint associated with locus ℓ is subject to coalescence with other lineages at rate 1. The resulting tree is the marginal genealogy at locus ℓ. This approximation is called the sequentially Markov coalescent (SMC) and is equivalent to a variant of the coalescent with recombination that disallows coalescence between lineages ancestral to disjoint regions of the sequence (McVean and Cardin 2005).

#### The sequentially Markov CSD $Math$$Math$

We now describe a sequentially Markov approximation to the genealogical process underlying $Math$. Our construction is similar to that given by McVean and Cardin (2005), described above, though the resulting dynamics are less involved since the conditional genealogy is constructed for a single haplotype. First, observe that under $Math$, the marginal conditional genealogy at a given locus ℓ ∈ L is entirely determined by two random variables: the absorption time, which we denote T, and the absorption haplotype, which we denote H. The present corresponds to time 0 and T ∈ [0, ∞]. See Figure 1B for an illustration. For convenience, we write S = (T, H) for the random marginal conditional genealogy at locus ℓ ∈ L and s = (t, h) for a realization.

Within the marginal conditional genealogy at locus ℓ ∈ L, note that T and H are independent, with T distributed exponentially with parameter n/2 and H distributed uniformly over the n haplotypes of $Math$. Thus, the marginal conditional genealogy S at locus ℓ is distributed with density ζ(n), where$Math$(1)

Conditioning on Sℓ−1= sℓ−1=(tℓ−1, hℓ−1), the marginal conditional genealogy S, for ℓ ≥ 2, is sampled by a process analogous to that described above for the SMC. Setting b = (ℓ − 1, ℓ) ∈ B, the sampling procedure is as follows (see Figure 2 for an accompanying illustration): (1) Recombination breakpoints are realized as a Poisson process with rate ρb/2 on the marginal conditional genealogy sℓ−1; (2) going backward in time, the lineage associated with locus ℓ−1 branching from each recombination breakpoint is removed, so that only the lineage more recent than the first (i.e., the most recent) breakpoint remains; and (3) the lineage associated with locus ℓ branching from the first recombination breakpoint is absorbed into a particular lineage of 𝒜*(n) at rate 1/2.

Figure 2.—

Illustration of the (Markov) process for sampling the absorption time T given the absorption time Tℓ−1 = tℓ−1. In step 1, recombination breakpoints are realized as a Poisson process with rate ρb/2 on the marginal conditional genealogy with absorption time tℓ−1. In step 2, the lineage branching from each breakpoint associated with locus ℓ−1 is removed, so that only the lineage more recent than the first breakpoint, at time Tr, remains. In step 3, the lineage branching from the first recombination breakpoint associated with locus is absorbed after time Ta distributed exponentially with rate n/2. Thus, T = Tr + Ta.

From the above description, we deduce that there is no recombination between loci ℓ−1 and ℓ with probability $Math$, and in this case the marginal conditional genealogy is unchanged; that is, S = sℓ−1. Otherwise, the time Tr of the first recombination breakpoint is distributed exponentially with parameter ρb/2, truncated at time tℓ−1, and the additional time Ta until absorption is distributed exponentially with parameter n/2. Thus we have S = (Tr + Ta, H), where H is chosen uniformly at random from the sample $Math$. Taking a convolution of Tr and Ta, the transition density $Math$ is given by$Math$(2)where $Math$.

Finally, conditioning on S = s, recall that mutations are realized as a Poisson process (cf. Stephens and Donnelly 2000) with rate θ/2. Therefore, a particular allele aE is observed with probability$Math$(3)Hereafter, we omit the superscript $Math$ and the subscripts θ and ρb from these densities, whenever the context is unambiguous.

The sequentially Markov approximation to $Math$ can be cast as a continuous-state HMM. In generating a haplotype α, the observed state, the hidden state, and initial, transition, and emission densities are given by the following:

• Observed state: At locus ℓ ∈ L, the observed state is the allele α[ℓ].

• Hidden state: At locus ℓ ∈ L, the hidden state is the marginal genealogy S = (T, H).

• Initial density: ζ is defined in (1).

• Transition density: ϕ is defined in (2).

• Emission density: ξ is defined in (3).

Writing $Math$ for the sequentially Markov approximation to $Math$, we can use the forward recursion (see, e.g., Doucet and Johansen 2008) to get$Math$(4)where fSMC (·, ·) is defined by$Math$(5)with base case$Math$(6)Though we cannot analytically solve the above recursion for $Math$, in the next section we derive a discretized approximation with time complexity linear in both the number of loci k and the number of haplotypes n. Before doing so, we briefly discuss some appealing properties of $Math$.

#### Properties of $Math$$Math$:

Recall that the SMC approximation of McVean and Cardin (2005) is equivalent to a variant of the coalescent with recombination disallowing coalescence events between lineages ancestral to disjoint regions. Similarly, the CSD $Math$, when used to sample a single haplotype, is a variant of $Math$ disallowing the same class of coalescence events. We might therefore expect that the sequentially Markov approximation of $Math$ described above is equivalent to $Math$, and in fact we can show that this is true.

Proposition 1. For an arbitrary single haplotype α ∈ ℋ and haplotype configuration $Math$, $Math$.

We present a sketch of the proof here and refer the reader to supporting information, File S1, for further details.

Sketch of Proof. The key idea of the proof is to introduce a genealogical recursion for f(α, sk), the joint density function associated with sampling haplotype α $Math$ and the marginal genealogy at the last locus sk. This recursion can be constructed following the lines of Griffiths and Tavaré (1994) to explicitly incorporate coalescent time into a genealogical recursion.

By partitioning with respect to the most recent event occurring at the last locus k, it is possible to inductively show that fSMC(α, sk) = f(α, sk). Furthermore, the equality $Math$ can be verified, and thus we conclude that$Math$

We now describe other intuitively appealing properties of $Math$. In particular, it can be verified that the detailed-balance condition$Math$(7)holds for the initial and transition densities, ζ and ϕ, respectively. This immediately implies that the initial distribution ζ is stationary under the given transition dynamics; i.e., the invariance condition$Math$is satisfied. Thus, S is marginally distributed according to ζ for all loci ℓ ∈ L, and in particular the marginal distribution of T is exponential with rate n/2. This parallels the fact that the marginal genealogies under the SMC (and the coalescent with recombination) are distributed according to Kingman's coalescent.

Similarly, the transition density exhibits a consistency property, which we call the locus-skipping property. Intuitively, this property states that transitioning directly from locus ℓ − 1 to ℓ + 1 can be accomplished by using the transition density parameterized with the sum of the recombination rates. Formally, the following equality holds for all ρ1, ρ2 ≥ 0:$Math$(8)This property, in conjunction with recursion (5), is computationally useful, as it enables loci ℓ ∈ L for which α[ℓ] is unobserved to be skipped in computing the CSP $Math$.

Finally, the conditional expectation of T given Tℓ−1 = tℓ−1 is$Math$(9)where b = (ℓ − 1, ℓ) ∈ B. Asymptotically, this expression provides several intuitive results. As ρb → ∞, 𝔼[T|tℓ−1] → 2/n; that is, recombination happens immediately, and 2/n is the expectation of the additional absorption time Ta. As ρb → 0, we get 𝔼[T|tℓ − 1] → tℓ−1. In this case there is no recombination, and the absorption time does not change. Further, 𝔼[T|tℓ − 1] → 2/ρb + 2/n holds as tℓ−1 → ∞. Here, recombination must occur, and the exponentially distributed time is not truncated, so the expectation is the sum of the expectations of two exponentials. Finally, as tℓ−1 → 0 we have 𝔼[T|tℓ − 1] → 0. No recombination can occur, and so the absorption time is unchanged.

## DISCRETIZATION OF THE HMM

In the previous section we described a sequentially Markov approximation of the CSD $Math$ and showed that it can be cast as an HMM. Because the absorption time component of the hidden state is continuous, the dynamic program associated with the classical HMM forward recursion is not applicable. However, by discretizing the continuous component, we are once again able to obtain a dynamic programming algorithm, resulting in an approximate CSP computation linear in both the number of loci and the number of haplotypes.

#### Rescaling time:

Recall from the previous section that the marginal absorption time at each locus is exponentially distributed with parameter n/2. To use the same discretization for all n, we follow Stephens and Donnelly (2000) and Fearnhead and Donnelly (2001) and transform the absorption time to a more natural scale in which the marginal absorption time is independent of n. In particular, define the transformed state Σ = (𝒯, H) where 𝒯 = (n/2)T. We denote a realization of Σ by σ = (τ, h). In the appendix, we provide expressions for the transformed quantities $Math$ and $Math$ derived from (1), (2), (3), and (5), respectively.

Using this time-rescaled model, the marginal absorption time at each locus is exponentially distributed with parameter 1. Because this distribution is independent of n and the coalescent model parameters ρ and θ, we expect that a single discretization of the transformed absorption time is appropriate for a wide range of haplotype configurations and parameter values.

#### Discretizing absorption time:

Our next objective is to discretize the absorption time 𝒯 ∈ ℝ≥0. Let 0 = x0 < x1 < ··· < xd = ∞ be a finite strictly increasing sequence in ℝ≥0 ∪ {∞} so that D = {Dj = [xj−1, xj)}j=1, … ,d is a d-partition of ℝ≥0.

Toward formulating a D-discretized version of the dynamics exhibited by the transformed HMM, we define the following D-discretized version of the density $Math$$Math$(10)for all ℓ ∈ L. Unfortunately, we cannot obtain a recursion for $Math$ via the definition of $Math$. Therefore, we make an additional approximation, namely that the transition and emission densities are conditionally dependent on the absorption time 𝒯 only through the event {Dj ∋ 𝒯 }; i.e., the densities depend on the interval Dj to which 𝒯 belongs but not on the actual value of 𝒯. Abusing notation, define $Math$ and $Math$ as the transition and emission densities, respectively, conditioned on the event {Dj ∋ 𝒯}. Formally, we make the following approximations:$Math$(11)$Math$(12)Together with the building blocks of the time-rescaled HMM, these assumptions provide a recursive approximation of $Math$, which we denote by $Math$. Specifically, assumptions (11) and (12) imply that the integral recursion for $Math$ reduces to the discrete recursion$Math$(13)with base case$Math$(14)where we have defined distributions $Math$ and $Math$. Setting $Math$, we get$Math$(15)Turning to the transition density $Math$, which is conditioned on the event {Dj ∋ 𝒯}, and recalling that 𝒯 is marginally exponentially distributed with parameter 1, we obtain$Math$(16)with analytic expressions for y(i) and z(i,j) provided in the appendix. Note that assumption (11) is not used here; rather, the formula follows from using the time-rescaled version of the transition density (2) in the double integral. An expression for the emission density $Math$ can be similarly obtained,$Math$(17)with an analytic expression for υ(i)(k) also given in the appendix. Again, assumption (12) is not used here; the second equality of (17) follows from using the time-rescaled version of the emission probability (3) in the integral. In summary, $Math$ can be computed efficiently using (13), and $Math$ can be approximated by$Math$(18)

Equations 1318 provide the requisite D-discretized versions of the transformed densities. Note that these equations characterize an HMM; that the Markov property holds on the discretized state space D follows from assumptions (11) and (12) (Rosenblatt 1959). In fact, (13–18) may alternatively be obtained by assuming that the Markov property holds on D and writing down the relevant transition and emission probabilities with the interpretations given above. In the remainder of this section, we examine some general properties of the discretized dynamics and also provide one method for choosing a discretization D.

#### Computational complexity of the discretized recursion:

We first consider the asymptotic complexity of computing the CSP under the D-discretized approximation for $Math$. Substituting Equation 16 into the key recursion (13) gives$Math$(19)for ℓ ≥ 2. For a fixed discretization D, the expressions $Math$, y(i), and z(i,j) depend only on the total sample size n, the mutation and recombination rates (θ and ρ), and the boundary points x0, … , xd of D; these may be precomputed and cached for relevant ranges of values. In conjunction with the base case (14), there is a dynamic program (see the appendix for details) for computing the CSP under the D-discretized approximation (18) for $Math$ with time complexity O(k · (nd + d2)), where k is the number of loci. As in Fearnhead and Donnelly (2001), this time complexity is better than O(k · (nd)2), the result that would be obtained by naive use of the HMM forward algorithm.

#### Properties of the discretization:

Recall the detailed-balance condition (7) associated with $Math$. Using expressions (15) and (16), together with Bayes' rule, we find that$Math$(20)holds (the details are provided in the appendix). Thus, the discretized approximation of $Math$ satisfies an analogous detailed balance condition. As a result, the marginal distribution at each locus of the discretized Markov chain is (again) given by $Math$ and the approximation exhibits the expected symmetries; for example, equal CSPs are computed whether starting at the leftmost locus and proceeding right or starting at the rightmost locus and proceeding left.

Furthermore, recall the locus-skipping property (8) associated with $Math$. The first equality in (16) and assumption (11) imply the relation$Math$(21)for all ρ1, ρ2 ≥ 0 (see the appendix for details). Thus, the discretized approximation of $Math$ approximately satisfies an analogous locus-skipping condition, up to the error introduced via approximation (11). This approximation is particularly useful in scenarios when data are missing (i.e., α[] is unknown for one or more L), since this property reduces the time complexity of the dynamic program given above. In particular, when m of the k loci are missing, the time complexity is reduced to O((km) · (nd + d2)). This is relevant, for example, in importance sampling applications (Fearnhead and Donnelly 2001).

#### Discretization choice and the definition of $Math$$Math$:

Finally, we discuss a method for choosing a discretization D of the absorption time. Recalling that marginally the transformed absorption time is exponentially distributed with parameter 1, let {(w(j), τ(j))}j = 1, … ,d be the d-point Gaussian quadrature associated with the function f(τ) = e−τ (Abramowitz and Stegun 1972, Section 25.4.45). Set x0 = 0, and set xj such that $Math$. Since $Math$, the points $Math$ determine a partition D = {Dj = [xj−1, xj)}j=1, … ,d of ℝ≥0.

The use of Gaussian quadrature evokes the work of Stephens and Donnelly (2000) and Fearnhead and Donnelly (2001). Although the method we employ is related, it is different in that we do not use the quadrature directly [for example, the values of the quadrature points {τ(j)} are never used explicitly]; rather, we use the Gaussian quadrature as a reasonable way of choosing a discretization D. We henceforth write $Math$ for the d-point Gaussian quadrature-discretized version of $Math$.

## EMPIRICAL RESULTS

In the previous section, we defined a discretized approximation $Math$ of the CSD $Math$. In this section, we examine the accuracy of this approximation and also compare it to the widely used CSDs $Math$ and $Math$, thereby providing evidence that $Math$ is a more accurate and computationally tractable CSD.

#### Data simulation:

For simplicity, we consider a two-allele model with $Math$, θ = θ for ℓ ∈ L and ρb = ρ for bB. We sample a k-locus haplotype configuration $Math$ by (i) using a coalescent with recombination simulator, with ρ = ρ0 and θ = θ0, to sample a k0-locus (with k0 >> k) n-haplotype configuration n0, and (ii) restricting attention to the central k segregating loci in n0. This procedure corresponds to the usage of the CSD on typical genomic data, in which only segregating sites are considered.

Given a k-locus n-haplotype configuration n, we obtain a k-locus n-haplotype conditional configuration C = (α, neα) by withholding a single haplotype α from $Math$ uniformly at random. For notational simplicity, we define π on such a conditional configuration in the natural way: π(C) = π(α | neα).

#### CSD accuracy:

We evaluate the accuracy of a CSD $Math$ relative to a reference CSD π0 using the expected absolute log-ratio (ALR) error,$Math$(22)where N denotes the number of simulated data sets and C(i) is a k-locus n-haplotype conditional configuration sampled as indicated above, and both $Math$ and π0 are evaluated using the true parameter values θ = θ0 and ρ = ρ0. For example, if $Math$, the CSP obtained using $Math$ differs from that obtained by π0 by a factor of 10, on average, for a randomly sampled k-locus n-haplotype conditional configuration.

Using the ALR error, we evaluate the accuracy of several CSDs: $Math$ (Fearnhead and Donnelly 2001); $Math$ (Li and Stephens 2003); $Math$, evaluated using the recursion for $Math$ (Paul and Song 2010); and $Math$, the d-point quadrature-discretized version of $Math$, for d ∈ {4, 18, 16}. We also evaluate $Math$, a variant of $Math$ introduced in Paul and Song (2010) with computational time complexity O(k3 · n); the CSD $Math$ is described in more detail in the appendix.

In what follows, we set θ0 = 0.01 and ρ0 = 0.05 and fix n = 10. For k ≤ 10, it is possible to obtain a very good approximation to the true CSD π using computationally intensive importance sampling. The resulting values of ALRErrk,n(· | π) are plotted in Figure 3A, as a function of k. Supporting the conclusion of Paul and Song (2010), $Math$ is more accurate than both $Math$ and $Math$, with the disparity increasing as k increases. Moreover, the CSD $Math$ is nearly as accurate as $Math$, suggesting that the discretization is fairly accurate even for modest values of d. Finally, the CSD $Math$ has accuracy that is indistinguishable from $Math$.

Figure 3.—

Absolute log-ratio error (ALRErr) of various conditional sampling distributions. See (22) for a formal definition of ALRErrk,n(· | ·). The accuracy of $Formula$ is almost indistinguishable from that of $Formula$, the most accurate of all approximate CSDs considered here. As expected, discretization reduces the accuracy somewhat, but even $Formula$ is substantially more accurate than $Formula$ and $Formula$. With θ0 = 0.01 and ρ0 = 0.05, we used the methodology described in the text to sample 250 conditional configurations, each with n = 10 haplotypes and k loci. (A) Error is measured relative to the true CSD π, estimated using computationally intensive importance sampling. (B) Error is measured relative to $Formula$, computed by numerically solving a recursion for the equivalent CSD $Formula$.

To investigate these results as k increases, we consider the ALR error relative to $Math$, which can be evaluated exactly for $Math$; the resulting values of $Math$ are plotted in Figure 3B, as a function of k. As k increases, both $Math$ and $Math$ continue to diverge from $Math$, suggesting that the increasing disparity in accuracy, directly observable in Figure 3A, continues for larger values of k. As expected, the discretized approximation $Math$ shows increased fidelity to $Math$ for larger values of d, and even $Math$ is substantially more accurate, relative to $Math$, than are $Math$ and $Math$.

It is too computationally expensive to compute $Math$ for k > 20. However, Figure 3B suggests that the CSD $Math$ is nearly indistinguishable from $Math$. Motivated by this observation, we consider the error relative to $Math$ for k > 20. The values of $Math$ and the analogously defined signed log-ratio (SLR) error $Math$ are plotted as a function of k in Figure 4, A and B, respectively. The trends observed in Figure 3 are recapitulated in Figure 4A, suggesting that they continue to hold for substantially larger values of k. Interestingly, Figure 4B shows that $Math$ and $Math$ produce values significantly smaller than $Math$ (and $Math$); for example, $Math$ takes values that are, on average, a factor of 10 smaller than $Math$ for k = 100. In conjunction with our conclusion that $Math$ is more accurate than $Math$ and $Math$, this suggests a similar systematic error with respect to the true CSD.

Figure 4.—

Comparison of the accuracy of various conditional sampling distributions relative to $Formula$ (see Figure 3 for the accuracy of $Formula$). A and B illustrate that the improvement in accuracy of $Formula$ over $Formula$ and $Formula$ is amplified as the number of loci k increases and that both $Formula$ and $Formula$ produce significantly smaller values than $Formula$ (and $Formula$). For θ0 = 0.01 and ρ0 = 0.05, we used the methodology described in the text to sample 250 conditional configurations with n = 10 haplotypes and k loci. (A) Absolute log-ratio error. (B) Signed log-ratio error.

For a discussion of CSD accuracy in the context of the product of approximate conditionals (PAC) method (Li and Stephens 2003), we refer the reader to Paul and Song (2010). Since $Math$ is very close to $Math$ (as demonstrated in the present paper), we anticipate that using it produces similar results for PAC likelihood estimation and recombination rate inference.

#### Running time comparison:

We next consider the empirically observed running time required to compute each CSP. The results, obtained using the conditional configurations with n = 10 and k ∈ {1, … , 100} simulated as previously described, are presented in Table 1. Looking across each row, it is evident that the running time under $Math$, and $Math$ depends linearly on the number of loci k, matching the asymptotic time complexity. Similarly, the running time under $Math$ is well matched by the theoretical cubic dependence on k.

View this table:
TABLE 1

Asymptotic time complexity and empirically observed average running time

Next, comparing $Math$, and $Math$, observe that the running time for $Math$ is approximately a factor of 10 slower than $Math$ and approximately a factor of 2 slower than $Math$. Similarly, $Math$ is approximately a factor of 20 and of 4 slower than $Math$ and $Math$, respectively; and $Math$ is approximately a factor of 40 and of 8 slower than $Math$ and $Math$, respectively. Importantly, these factors are constant, depending on neither the number of loci k nor the number of haplotypes n. Also note that the time required to compute the CSD for $Math$ appears to depend linearly, rather than quadratically, on d for the modest (but relevant) values considered.

## DISCUSSION

We have formulated a sequentially Markov approximation of $Math$, which we call $Math$. The relationship between the genealogical process underlying $Math$ and $Math$ is analogous to the relationship between the coalescent with recombination and the SMC. In particular, $Math$ is equivalent to $Math$ with a certain class of coalescence events disallowed. In the case of sampling one additional haplotype, this corresponds to disallowing all coalescence events, the same approximation used to obtain $Math$, and so we find that $Math$.

Though the CSD $Math$ can be cast as an HMM, the associated CSP cannot be evaluated using typical HMM methodology because of the continuous state space; to our knowledge, exact evaluation is possible only via the known recursion for $Math$, which has time complexity exponential in the number of loci. By discretizing the continuous state space into d intervals, obtained using Gaussian quadrature, we obtain the discretized approximation $Math$ for which computing the CSP has time complexity linear in both the number of loci and the number of haplotypes. We find that, even for modest values of d, $Math$ is a very good approximation of $Math$. Importantly, $Math$ is more accurate than $Math$ and $Math$ with only a (small) constant factor penalty in run time. We remark that we investigated alternative methods for discretizing the CSP computation (e.g., point-based rather than interval-based methods), but settled on the described approach as it exhibited desirable properties and is theoretically well motivated.

We attribute the observed increase in accuracy of $Math$ to the incorporation of two key features of the coalescent with recombination that are not integrated into either $Math$ or $Math$. Consider the genealogy associated with two particular haplotypes within an ARG. First, observe that the times to the most recent common ancestor (MRCA) at two neighboring loci are dependent, even if ancestral lineages at the two loci are separated by a recombination event. $Math$ explicitly models a Markov approximation to the analogous absorption-time dependence across breakpoints, whereas both $Math$ and $Math$ assume independence. Second, if the time to the MRCA at a locus is small, the probability of recombination between this locus and neighboring loci is small, since it would have had to occur prior to the MRCA. While $Math$ models this property by diminishing the probability of recombination between neighboring loci if the absorption time at the first locus is small, $Math$ and $Math$ assume that recombination is independent of absorption time. We believe that $Math$ and $Math$ tend to underestimate, on average, the true CSP (as suggested in Figure 4B) due to the omission of these key features. The relationship between several CSDs, including $Math$ and $Math$, is illustrated in Figure 5.

Figure 5.—

Illustration of the relationship between various CSDs. The CSD at the head of each arrow can be seen as an approximation to the CSD at the tail. Each arrow is also annotated with a (short) description of this approximation. The CSDs below the dashed line can be cast as an HMM: Those above the dotted line (including a continuous-state version of $Formula$, which we denote $Formula$) have a continuous and infinite state space, while those below have a finite and discrete state space and are therefore amenable to simple dynamic programming algorithms. For more thorough descriptions of each approximation, see the main text and also Paul and Song (2010). Recall in particular that the equality $Formula$ holds only for conditionally sampling a single haplotype.

Toward future research, recall that the CSD can be extended to sampling more than one additional haplotype (Paul and Song 2010). Of particular importance to population genetics tools (Stephens and Scheet 2005; Marchini et al. 2007; Howie et al. 2009) for diploid organisms is sampling two additional haplotypes. Though we focused on conditionally sampling a single additional haplotype in the present work, we note that the sequentially Markov approximation to $Math$ is, in principle, applicable to sampling multiple haplotypes. However, the state space of the resulting HMM description increases exponentially with the number of haplotypes. In this domain, we anticipate that randomized techniques for CSP computation, such as importance sampling and Markov chain Monte Carlo, will exhibit high accuracy and the efficiency required for modern data sets. We pursue this line of research in a forthcoming article.

We believe that it is possible to extend the ideas presented here to different demographic scenarios, for example, spatial structure or models of population subdivision (Davison et al. 2009). It should be possible to extend the principled approach of Paul and Song (2010) toward the CSD via the diffusion generator to these scenarios, as in De Iorio and Griffiths (2004b) and Griffiths et al. (2008). In other scenarios, for example varying population size, the principled approach might not be applicable, so one would have to modify the genealogical interpretation heuristically, e.g., varying coalescence rates. As in the present article, prohibiting certain coalescence events in the conditional genealogy should then allow for an efficient implementation of the resulting CSDs as HMMs.

Though the SMC has been used for simulating population genetic samples (Marjoram and Wall 2006; Chen et al. 2009), it can also be cast as an HMM and used for inference in scenarios in which using the full coalescent with recombination is cumbersome. As described above, the state space of the HMM increases exponentially with the number of haplotypes, making exact computation intractable for large numbers of haplotypes. Nevertheless, research (Hobolth et al. 2007; Dutheil et al. 2009) is in progress for modest numbers of haplotypes. We believe that choosing a discretization using Gaussian quadrature, as described in discretization of the hmm, and the forthcoming randomized techniques alluded to above, will foster progress in this area.

We conclude by recalling that a broad range of population genetic tools have been developed, and will continue to be developed, on the basis of the CSD. These tools typically employ $Math$, or a similar variant, because the underlying HMM structure admits simple and fast recursions for the relevant calculations (e.g., the CSP). We have introduced a new CSD $Math$ and a discretized approximation $Math$, which also have simple underlying HMM structures and substantially improve upon the accuracy of $Math$ and $Math$. We believe that $Math$, when used in the same contexts as $Math$ and $Math$, has the potential to produce more accurate results, with only a small constant factor penalty in run time.

## APPENDIX

#### Time-transformed model:

Rewriting the HMM Equations 16 in terms of the transformed state Σ = (𝒯, H) introduced in discretization of the hmm yields$Math$(A1)where the transformed density $Math$ is given by$Math$(A2)with the base case$Math$(A3)The transformed initial, transition, and emission densities are given by$Math$(A4)$Math$(A5)$Math$(A6)Note that care must be taken upon transforming the Dirac-δ in the expression for $Math$.

#### Analytic expressions for emission and transition probabilities:

We now provide analytic expressions for the quantities y(i), z(i,j), and υ(i)(k) introduced for the transition probability (16) and the emission probability (17). Recalling that Di = [xi−1, xi) and Dj = [xj−1, xj) and evaluating the associated integrals, we get$Math$(A7)$Math$(A8)for ρbn,$Math$(A9)for ρb = n, and$Math$(A10)Note that the recursive structure of υ(i)(k) [together with $Math$ and the sum in Equation 17] suggests an efficient implementation.

#### Description of the dynamic program for D-discretized $Math$$Math$:

Let D = {D1, … , Dd} be a finite partition of ℝ≥0 as described in the text. Recalling the recursion for $Math$ given in Equation 19, consider the following dynamic programming algorithm for computing the D-discretized approximation of $Math$:

1. For each DjD and h ∈ ℋ such that nh > 0, compute $Math$ using (14), and set $Math$

2. For each ℓ ∈ {2, … , k},

1. For each DjD, compute $Math$

2. For each DjD and h ∈ ℋ such that nh > 0, compute$Math$$Math$

• 3. Compute D-discretized approximation $Math$$Math$.

The time complexities of steps 2a and 2b are O(d2) and O(nd), respectively. The time complexities of steps 1 and 3 are both O(nd). We can therefore conclude that the time complexity of the dynamic program is $Math$.

#### Detailed balance and locus skipping:

The detailed-balance condition (20) for the discretized model $Math$ can be shown using expressions (15) and (16). Together with Bayes' rule, we find that the following holds:$Math$(A11)Using expression (16) and assumption (11) we can show that$Math$(A12)holds; thus the locus-skipping property (21) for the discretized model $Math$ holds only approximately. Here we make explicit that the error is introduced by approximation (11) in the third step. Thus it is possible to explicitly assess the error and it goes to zero as the number of intervals used for the discretization becomes large.

#### A description of $Math$$Math$:

Computing the CSP for $Math$ can be done via a genealogical recursion (Paul and Song 2010, Equation 12), but has time complexity exponential in the number of loci, k. To improve upon this result, Paul and Song suggest using the genealogical recursion until the first mutation, and thereafter using a fast alternative CSD $Math$ (Paul and Song 2010, Equation 13). In particular, choosing $Math$ yields $Math$, for which CSP computation has asymptotic time complexity O(k3· n).

Similarly, choosing $Math$ yields $Math$, for which CSP computation has the same asymptotic time complexity O(k3 · n). Importantly, $Math$ is more accurate than $Math$, and so the resulting CSD $Math$ is more accurate than $Math$.

## Acknowledgments

This research is supported in part by National Institutes of Health (NIH) grant R01-GM094402, an Alfred P. Sloan Research Fellowship, and a Packard Fellowship for Science and Engineering. The research of J.S.P was funded in part by an NIH National Research Service Award Trainee appointment on T32-HG00047.