## Abstract

The potential for imputed genotypes to enhance an analysis of genetic data depends largely on the accuracy of imputation, which in turn depends on properties of the reference panel of template haplotypes used to perform the imputation. To provide a basis for exploring how properties of the reference panel affect imputation accuracy theoretically rather than with computationally intensive imputation experiments, we introduce a coalescent model that considers imputation accuracy in terms of population-genetic parameters. Our model allows us to investigate sampling designs in the frequently occurring scenario in which imputation targets and templates are sampled from different populations. In particular, we derive expressions for expected imputation accuracy as a function of reference panel size and divergence time between the reference and target populations. We find that a modestly sized “internal” reference panel from the same population as a target haplotype yields, on average, greater imputation accuracy than a larger “external” panel from a different population, even if the divergence time between the two populations is small. The improvement in accuracy for the internal panel increases with increasing divergence time between the target and reference populations. Thus, in humans, our model predicts that imputation accuracy can be improved by generating small population-specific custom reference panels to augment existing collections such as those of the HapMap or 1000 Genomes Projects. Our approach can be extended to understand additional factors that affect imputation accuracy in complex population-genetic settings, and the results can ultimately facilitate improvements in imputation study designs.

GENOTYPE imputation is the estimation of genotypes at untyped markers using patterns of haplotype structure (Halperin and Stephan 2009; Li *et al.* 2009; Marchini and Howie 2010). Imputation is a powerful tool in modern genetic studies. It is routinely used to increase the fraction of the genome covered in genome-wide association studies (GWAS) conducted in human populations, thereby increasing power to detect risk variants through linkage disequilibrium (LD) mapping (Marchini *et al.* 2007; Becker *et al.* 2009; Hao *et al.* 2009; Spencer *et al.* 2009; Li *et al.* 2010). Imputation based on a shared set of markers in data sets genotyped on different platforms enables large-scale meta-analyses (Barrett *et al.* 2008; de Bakker *et al.* 2008; Zeggini *et al.* 2008). In sequencing studies of rare variation, imputation can improve the accuracy of genotype calls from sequencing reads (Li *et al.* 2010, 2011), and power for rare-variant tests of association can be increased by augmenting sequence data with imputed samples (Zawistowski *et al.* 2010).

Imputation procedures generally involve a set of target samples in which genotypes will be imputed, a reference panel of phased haplotypes from which genotypes are copied, and an algorithm for the copying procedure. Each target sample is genotyped for single-nucleotide polymorphisms (SNPs) whose genotypes serve as a scaffold for more complete haplotypes. Reference samples are more densely genotyped than the target samples, and might even be fully sequenced. Imputation algorithms typically employ a computationally intensive hidden Markov model that uses genotypes from the SNP scaffold of a target sample to choose haplotypes from the reference panel that are most similar (Scheet and Stephens 2006; Browning and Browning 2007; Marchini *et al.* 2007; Purcell *et al.* 2007; Li *et al.* 2010). Genotypes are then imputed in the target by locally copying reference haplotypes that provide the best match.

Analyses of the imputed genotypes, such as in marker-based association tests, depend on imputation accuracy (Huang *et al.* 2009b), which in turn depends on numerous factors, including the haplotypic diversity of the target population, the size of the reference panel, and the genetic similarity of the reference and target individuals (Huang *et al.* 2009a, 2011; Browning and Browning 2009; International HapMap3 Consortium 2010; Jostins *et al.* 2011). Therefore, when performing imputation, and when prospectively designing imputation-based studies, it is important to understand how sampling approaches and population parameters affect imputation accuracy.

Currently, imputation accuracy is generally assessed by experiments on real or simulated data, in which known genotypes are masked, imputed, and compared to their true values. A thorough analysis of the effect of study design variables on imputation accuracy requires many experiments, a computationally expensive task that restricts the number of parameter combinations that can be tested. An alternative strategy is to develop a theoretical model that incorporates the study design variables as parameters and permits analytical calculations of imputation accuracy as a function of these variables. Such a model could enable faster predictions of imputation accuracy across a wider range of study designs than use of the empirical method. Moreover, analytical expressions for imputation accuracy as a function of model parameters can facilitate an intuitive understanding of the relationship between imputation accuracy and features of empirical studies.

In this article, we introduce a coalescent-based theoretical model of imputation. We consider a single target haplotype to be imputed for untyped markers and a reference panel of haplotypes, of which one will be chosen as the template for imputation. Our model relies on the premise that for a given target haplotype, an imputation algorithm ideally chooses as a template the reference haplotype whose number of sequence differences from the target is smallest. We take from the reference panel the haplotype with the closest genealogical history to the target, in terms of coalescence time, to serve as the template; this haplotype will have, on average, the fewest sequence differences from the target. By selecting the template haplotype in this way, our model mimics existing imputation approaches that attempt to select the haplotype, or set of haplotypes, with the closest genealogical relationship to the target (Pasaniuc *et al.* 2010; Howie *et al.* 2011). Under our imputation scheme, we quantify imputation accuracy as a function of reference panel size and demographic parameters for the populations containing the target and reference haplotypes.

The flexibility of the coalescent framework enables complex population-genetic models to be considered. Here, we use the model to investigate imputation-based study designs in the age of next-generation sequencing. In practice, researchers have typically relied on large, publicly available data sets to serve as reference panels. Although public panels are easily accessible and often result in sufficient imputation accuracy, the haplotypes available often derive from a different population than the study sample in which genotypes are imputed. Advances in sequencing technology now permit the creation of custom reference panels that contain genome sequences of individuals from the same source population as the study sample—perhaps even a subset of the study sample itself. Custom panels, however, will likely be smaller than public panels, owing to the sequencing cost required to create them. It is therefore of interest to determine the practical utility of custom panels by assessing the improvement in imputation accuracy for either replacing a public panel with a potentially smaller custom panel or augmenting the public panel with custom reference haplotypes.

Using our coalescent framework, we address this question by considering two potential reference panels for imputation in the same target population. The first is an “internal” reference panel of haplotypes drawn from the target population and is meant to represent a custom panel. The second is an “external” reference panel of haplotypes from a distinct population, such as one included in a public database. Our model predicts that an internal reference panel, even when considerably smaller than an existing external reference panel, nearly always improves imputation accuracy. We examine the dependence of this improvement on the relative sizes of the two panels and on demographic parameters such as divergence times and population growth rates. Our results suggest that in practice, augmenting an existing external reference panel (1000 Genomes, for example) with even relatively few haplotypes from the study population of interest can improve genotype imputation.

## Overview of the Model

We use a coalescent framework to model genotype imputation at a nonrecombining genetic locus intended to represent a short region along a chromosome. We assume that genotypes for a single target haplotype *T* will be imputed. We define a reference panel to be a set of sequenced or densely genotyped haplotypes that does not include *T*. One haplotype from the reference panel will be chosen as an imputation template, and alleles from the template haplotype will be copied onto *T*. Assuming that mutations accumulate in proportion to time, the reference haplotypes with the fewest sequence differences from *T* are the descendants of the lineage with which *T* first coalesces. Under this assumption— which holds in expectation under the infinitely many-sites model (*e.g.*, Wakeley 2008)—an imputation algorithm that always selects the haplotype in the reference panel with the fewest sequence differences from *T* (or one such haplotype in case of a tie) is equivalent to the algorithm choosing the reference haplotype with the closest genealogical history to *T*. Thus, when predicting the accuracy produced by a reference panel, to a first approximation, it is reasonable to use a model that considers only coalescence times, rather than a more complete model with stochastic mutations. As we will see, considering only coalescence times is desirable because it makes analytical computations simple, fast, and straightforward to interpret.

By including haplotypes from multiple reference panels in our model, we can compare the performance of reference panels from different populations. Here, we consider a scenario with two possible reference panels for imputing *T*. The first is an internal reference panel consisting of *n*_{1} haplotypes sampled from the same population as *T*. The second is an external reference panel consisting of *n*_{2} haplotypes sampled from a distinct population. Defining the “optimal panel” as that which produces the highest accuracy for imputing *T*, we compute the probability of optimality for both the internal and external reference panels and quantify the gain in imputation accuracy obtained by using the optimal rather than the suboptimal panel.

We model the genealogical history of *T* and the reference haplotypes using a two-population coalescent model of divergence (Takahata and Nei 1985; Rosenberg 2003) (Figure 1A). The two populations are labeled 1 and 2, and *T* is sampled from population 1. The populations diverged from an ancestral population at time *t*_{D} in the past, and no migration has occurred between the descendant populations. Therefore, more recently than the divergence time (*t* < *t*_{D}), a lineage can coalesce only with other lineages from the same population. This assumption is reasonable for pairs of populations that are geographically isolated. More anciently than the split (*t* > *t*_{D}), all remaining lineages are assumed to belong to a homogeneous ancestral population, and any two lineages are allowed to coalesce. We assume effective population sizes of *N*_{1} diploid individuals for population 1, *N*_{2} for population 2, and *N*_{A} for the ancestral population.

At time *t* = 0, corresponding to the present, *n*_{1} reference haplotypes in addition to the single target haplotype *T* are sampled from population 1, and *n*_{2} reference haplotypes are sampled from population 2. The divergence time *t*_{D} and the panel sizes *n*_{1} and *n*_{2} are treated as model parameters. We refer to a lineage with descendants only in population 1 as a lineage of type 1. Similarly, a lineage with descendants only in population 2 has type 2, and a lineage with descendants in both populations has type 1–2. We assume that among available reference haplotypes, the optimal templates for imputing *T* are the descendants of the lineage with which *T* first coalesces. Thus, the internal reference panel is optimal if *T* first coalesces with a lineage of type 1 and the external panel is optimal if *T* first coalesces with a lineage of type 2. If *T* first coalesces with a lineage of type 1–2, then the two reference panels are equally appropriate, and we say that both are optimal.

## Model

We use our coalescent model of genotype imputation to compute the probability of optimality for each reference panel and to quantify differences in imputation accuracy between potential reference panels. For the problem of imputing the target *T*, we first derive the probability of optimality for an internal reference panel of *n*_{1} haplotypes and for an external reference panel of *n*_{2} haplotypes, sampled from populations with a divergence time of *t*_{D} (in units of 2*N*_{A} generations). Let *C*_{1} be the event that *T* first coalesces with a lineage of type 1, let *C*_{2} be the event that *T* first coalesces with a lineage of type 2, and let *C*_{12} be the event that *T* first coalesces with a lineage of type 1–2. From our definition of optimality, it follows that ℙ(*C*_{1}), the probability that the target *T* first coalesces with a lineage of type 1, is the probability that the internal reference panel is optimal for imputing *T*. Similarly, ℙ(*C*_{2}) is the probability that the external reference panel is optimal, and ℙ(*C*_{12}) is the probability that the two reference panels are both optimal, with equal expected imputation accuracy.

In the case that exactly one reference panel is optimal, it is of interest to quantify the improvement in imputation accuracy achieved by using the optimal as opposed to the suboptimal reference panel. Assuming that mutations follow a Poisson process, under the infinitely many-sites model, the expected number of sites incorrectly imputed in *T* is proportional to the coalescence time between the target and the imputation template; the sites that produce imputation errors are precisely those sites at which mutations occur on either the target or the template branch more recently than their coalescence. Expected imputation accuracy for a given reference panel can then be quantified by the expected time that *T* first coalesces with a haplotype from the panel. We now derive several measurements of imputation error to evaluate the difference in imputation accuracy between the internal and external panels.

### Reference panel optimality probabilities

We consider two approaches for obtaining optimality probabilities, a closed-form computation and a recursive computation.

#### Closed-form computation:

Let *C*_{1}, *C*_{2}, and *C*_{12}, respectively, be the events that the internal reference panel is optimal for imputing *T*, the external panel is optimal, and the two panels are equally appropriate. To compute the probabilities for each of these events, we partition the coalescent model into three components: (1) the events in population 1 more recent than the divergence from the ancestral population, (2) the events in population 2 more recent than the divergence, and (3) the events in the ancestral population (Figure 1B). Because we assume that no migration occurs between populations 1 and 2, the coalescent processes in populations 1 and 2 in the period more recent than the divergence are independent. Conditional on the numbers of lineages remaining from populations 1 and 2 at the divergence time, the coalescence events in the ancestral population are independent of the events that occur more recently than the divergence time.

First, we consider the genealogy of haplotypes in population 1 from the present back to the divergence time *t*_{D}. Define *T* coalesces more recently than *t*_{D} and * ^{c}* to be the event that

*T*does not coalesce by

*t*

_{D}. Note that if

*T*coalesces before

*t*

_{D}, then the lineage with which it first coalesces can have descendants only in population 1 and must therefore have type 1. It follows that if event

*C*

_{1}also occurs, so that ℙ(

*C*

_{1},

*T*does not coalesce before the divergence time, then it enters the ancestral population, where it can also coalesce with lineages of types 2 and 1–2. In this scenario, to identify the optimal reference panel, we must consider the coalescence events in population 2 and the ancestral population.

The coalescent process in the ancestral population depends on the numbers of lineages from populations 1 and 2 that survive at the divergence time. Let *i*_{D}, 1 ≤ *i*_{D} ≤ *n*_{1,} denote the number of lineages from population 1 (other than *T*) that survive to enter the ancestral population at the divergence time *t*_{D} (Figure 1B). Similarly, let *j*_{D}, 1 ≤ *j*_{D} ≤ *n*_{2}, be the number of lineages from population 2 that survive at the divergence time. Because the coalescent processes in populations 1 and 2 are independent, *j*_{D} is independent of both *i*_{D} and

By conditioning on the number of lineages from each population remaining at the divergence time *t*_{D}, we can write the quantity ℙ(*C*_{1}) as*h _{n}*

_{,ℓ}(

*t*;

*N*) is the probability that

*n*lineages sampled from a diploid population with effective size

*N*coalesce down to ℓ lineages at time

*t*. Tavaré (1984) demonstrated that

*n*

_{[}

_{m}_{]}=

*n*(

*n*−1) ⋯ (

*n*−

*m*+ 1),

*n*

_{(}

_{m}_{)}=

*n*(

*n*+ 1) ⋯ (

*n*+

*m*− 1), and the factor of

*N*

_{A}/

*N*in the exponent is due to the fact that time is measured in units of 2

*N*

_{A}generations.

To obtain the probability ℙ(* ^{c}*,

*i*), let

_{D}*n*lineages in a diploid population of effective size

*N*coalesce down to ℓ lineages at time

*t*. Define

*I*

_{n}_{,ℓ}, where

*n*lineages can coalesce to ℓ lineages (

*e.g.*, Rosenberg 2003). Then ℙ(

*,*

^{c}*i*

_{D}) is derived by noting that for

*and*

^{c}*i*

_{D}to both occur, the

*n*

_{1}+ 1 total lineages originating in population 1 must coalesce to

*i*

_{D}+ 1 lineages at the divergence time

*t*

_{D}. This can occur in

*n*

_{1}reference haplotypes coalesce to

*i*

_{D}lineages without

*T*also coalescing. Thus,

*) is obtained by summing over all possible values of*

^{c}*i*

_{D},

*C*

_{1}|

*,*

^{c}*i*

_{D},

*j*

_{D}), is the probability that

*T*first coalesces with a lineage of type 1 assuming that, in addition to

*T*,

*i*

_{D}lineages from population 1 and

*j*

_{D}lineages from population 2 survive to the ancestral population.

To derive ℙ(*C*_{1}|* ^{c}*,

*i*

_{D},

*j*

_{D}) in closed form, let

*i*

_{C},

*j*

_{C}, and

*k*

_{C}be the numbers of lineages of types 1, 2, and 1–2, respectively, remaining at the instant when

*T*first coalesces. The probability ℙ(

*C*

_{1}|

*,*

^{c}*i*

_{D},

*j*

_{D}) is computed by summing over all possible values of

*i*

_{C},

*j*

_{C}, and

*k*

_{C},

*δ*

_{a}_{,}

*= 1 if*

_{b}*a*=

*b*and

*δ*

_{a}_{,}

*= 0 otherwise.*

_{b}To derive the probability ℙ(*C*_{1}, *i*_{C}, *j*_{C}, *k*_{C}|* ^{c}*,

*i*

_{D},

*j*

_{D}), let

*N*(

*i*

_{D},

*j*

_{D}→

*i*

_{C},

*j*

_{C},

*k*

_{C}) be the number of ways in which

*i*

_{D}lineages of type 1 and

*j*

_{D}lineages of type 2 can coalesce down to

*i*

_{C},

*j*

_{C}, and

*k*

_{C}lineages of types 1, 2, and 1–2, respectively. Then ℙ(

*C*

_{1},

*i*

_{C},

*j*

_{C},

*k*

_{C}|

*,*

^{c}*i*

_{D},

*j*

_{D}) is given by

*N*(

*i*,

_{D}*j*→

_{D}*i*

_{C},

*j*

_{C},

*k*

_{C}) is derived in the Appendix.

The probability of optimality for the external reference panel, ℙ(*C*_{2}), and the probability that the two reference panels are equally optimal, ℙ(*C*_{12}), are computed in a similar manner to Equation 1. Because the occurrence of *C*_{1} has also occurred, however, ℙ(*C*_{2}, *C*_{12}, *C*_{2}|* ^{c}*,

*i*

_{D},

*j*

_{D}) and ℙ(

*C*

_{12}|

*,*

^{c}*i*

_{D},

*j*

_{D}) in Equations 9 and 10 are similar to Equations 7 and 8. For ℙ(

*C*

_{2}|

*,*

^{c}*i*

_{D},

*j*

_{D}), we compute

*C*

_{2},

*i*

_{C},

*j*

_{C},

*k*

_{C}|

*,*

^{c}*i*

_{D},

*j*

_{D}) is given by

*C*

_{12}|

*,*

^{c}*i*

_{D},

*j*

_{D}), we compute

*C*

_{12},

*i*

_{C},

*j*

_{C},

*k*

_{C}|

*,*

^{c}*i*

_{D},

*j*

_{D}) is given by

#### Recursive computation:

Computing the closed-form expressions (8), (12), and (14) is time–intensive for large *i*_{D} and *j*_{D}. Therefore, the probabilities ℙ(*C*_{1}), ℙ(*C*_{2}), and ℙ(*C*_{12}) are difficult to calculate using Equations 1, 9, and 10 with large panel sizes *n*_{1} and *n*_{2}. In this section, we derive an efficient recursive approach for computing the probabilities ℙ(*C*_{1}|* ^{c}*,

*i*

_{D},

*j*

_{D}), ℙ(

*C*

_{2}|

*,*

^{c}*i*

_{D},

*j*

_{D}), and ℙ(

*C*

_{12}|

*,*

^{c}*i*

_{D},

*j*

_{D}).

Assume that at some time *t* > *t*_{D}, in addition to lineage *T*, *i* lineages of type 1, *j* lineages of type 2, and *k* lineages of type 1–2 exist in the ancestral population. Conditional on this configuration, let *T* first coalesces with a lineage of type 1. We construct a recursive equation for

Let *m* = *i* + *j* + *k* + 1 be the total number of lineages remaining. Each of the *T* first coalesces with a type 1 lineage. If the next coalescence involves *T* and a type 1 lineage, an event that occurs with probability *C*_{1} occurs. Alternatively, if the next coalescence occurs between *T* and either a type 2 or a type 1–2 lineage, events that occur with probabilities *C*_{1} cannot occur. For the remaining lineage pairs, *T* is not involved in the next coalescence, and the probability of *C*_{1} depends on the lineage pair that is involved in the event. For example, two type 1 lineages coalesce with probability *i* − 1. The coalescent process then restarts with *i* − 1, *j*, and *k* lineages of types 1, 2, and 1–2, respectively. Given this new configuration, the probability of event *C*_{1} is

By conditioning on the possible lineage pairs for the next coalescence, we obtain a recursion:*i* > 0, *j* ≥ 0, *k* ≥ 0. *j*, *k* ≥ 0 because there must be at least one lineage of type 1 for event *C*_{1} to occur. The recursion is incorporated into Equation 1 by replacing ℙ(*C*_{1}|* ^{c}*,

*i*

_{D},

*j*

_{D}) with

The terms ℙ(*C*_{2}|* ^{c}*,

*i*

_{D},

*j*

_{D}) in Equation 9 and ℙ(

*C*

_{12}|

*,*

^{c}*i*

_{D},

*j*

_{D}) in Equation 10 can also be evaluated recursively, following the same logic used to obtain Equation 15. Denote by

*T*first coalesces with a type 2 lineage. Then

*C*

_{2}|

*,*

^{c}*i*

_{D},

*j*

_{D}) in Equation 9, holds for

*i*≥ 0,

*j*> 0,

*k*≥ 0.

*i*,

*k*≥ 0.

Finally, conditional on *i*, *j*, and *k*, denoting by *T* first coalesces with a lineage of type 1–2,*i*, *j* ≥ 0, because production of a type 1–2 lineage requires at least one type 1 lineage and at least one type 2 lineage. The recursion is incorporated into Equation 10 by replacing ℙ(*C*_{12}|* ^{c}*,

*i*

_{D},

*j*

_{D}) with

### Expected coalescence times

In this section, we derive formulas that quantify and compare imputation accuracy for internal and external reference panels. Let *S*_{1} be the number of sites that are incorrectly imputed when using an internal reference panel and let *S*_{2} be the number of sites incorrectly imputed when using an external reference panel. We compute the expected numbers of incorrectly imputed sites, *E*[*S*_{1}] and *E*[*S*_{2}], conditional on reference panel sizes *n*_{1} and *n*_{2} and divergence time *t*_{D}.

Let *T*_{1} be the random waiting time until *T* first coalesces with a lineage that has descendants in the internal reference panel, that is, a type 1 or type 1–2 lineage. Similarly, let *T*_{2} be the waiting time until *T* first coalesces with a lineage that has descendants in the external reference panel, that is, a type 2 or type 1–2 lineage. Here, *T*_{1} and *T*_{2} are measured in units of 2*N*_{A} generations. The template haplotype is selected to minimize the coalescence time with lineage *T* (Figure 2). Thus, the total branch length separating the target from the template is 2*T*_{1} when the internal reference panel is used and 2*T*_{2} when the external reference panel is used.

We model mutation events using the infinitely many-sites model and assume that the number of mutations at genotyped sites along a branch of length *t* units of 2*N*_{A} generations follows a Poisson distribution with mean *tθω*/2, where *θ* = 4*N*_{A}*μ* is the per-base population-scaled mutation rate, *μ* is the per-base per-generation mutation rate, and *ω* is the number of sites genotyped (or sequenced) in the reference haplotypes that will be imputed in the target. Under our model, mutations that occur along the branches separating the target haplotype and the template haplotype will be incorrectly imputed. Therefore, when the internal reference panel is used, the expected number of sites incorrectly imputed is*θω*, which does not depend on the model parameters *n*_{1}, *n*_{2}, and *t*_{D}, deriving *E*[*T*_{1}] and *E*[*T*_{2}] is sufficient to determine the expected difference *E*[*S*_{2} − *S*_{1}].

#### Derivation of E[T_{1}]:

To compute the expected waiting time *E*[*T*_{1}] until *T* first coalesces with a lineage that has descendants in the internal reference panel, we condition on the population in which lineage *T* first coalesces:* ^{c}*) are obtained using Equations 6 and 5, respectively. To obtain the expected waiting time

*E*[

*T*

_{1}|

*T*first coalesces, given that it coalesces in population 1, we integrate the conditional survival function

*T*

_{1}given

*n*lineages coalesce to

*k*lineages in time

*t*in a diploid population of size

*N*. In the fifth equality, by the same argument used to derive Equation 4, the probability that lineage

*T*does not coalesce when the

*n*

_{1}+ 1 sampled lineages coalesce to

*i*lineages is

*T*does coalesce is

The quantity *E*[*T*_{1}|* ^{c}*] is the expected time until lineage

*T*first coalesces in the ancestral population with a lineage that has descendants in population 1, given that it does not coalesce in population 1. This expected time can be found by conditioning on the number

*i*

_{D}of type 1 lineages that remain at the divergence time:

*N*, the expected sum of the lengths of all external branches in a genealogy is 4

*N*(Fu and Li 1993). Thus, the mean length of an external branch in a genealogy with

*n*+ 1 lineages—and consequently, the expected time until a specific lineage coalesces with some lineage among a set of

*n*—is 4

*N*/(

*n*+ 1). ℙ(

*,*

^{c}*i*

_{D}) and ℙ(

*) are found using Equations 4 and 5, respectively. The quantity*

^{c}*E*[

*T*

_{1}] is then obtained by inserting Equations 5, 6, 22, and 24 into Equation 21.

#### Derivation of E[T_{2}]:

To compute *E*[*T*_{2}], we condition on the number *j*_{D} of type 2 lineages remaining at the divergence time:

### Exponential growth

Given the utility of population growth models for explaining properties of human genetic variation (*e.g.*, Schaffner *et al.* 2005; Coventry *et al.* 2010), we consider a model of exponential growth in populations 1 and 2 (Figure 3). Let *N*_{1}(*t*) and *N*_{2}(*t*) be functions that define the sizes of populations 1 and 2, respectively, at time *t* in the past, where *t* is measured in units of 2*N*_{A} generations. Here, we assume an ancestral population of constant size and set *t* ∈ [0, *t*_{D}] and *α*_{1}, *α*_{2} > 0. We compare the results from this model to those of the constant-size model to evaluate effects of exponential growth on imputation reference panel selection.

Exponential growth changes only the distribution of coalescent waiting times from our computations in the previous sections. All derived equations depend on the waiting times only through the quantity *h _{n}*

_{,}

*(*

_{k}*t*;

*N*), the probability that

*n*lineages coalesce to

*k*lineages in time

*t*in a diploid population with constant size

*N*. Thus, for the exponential growth model, we replace

*h*

_{n}_{,}

*(*

_{k}*t*;

*N*) from the constant-size model by

*h*

_{n}_{,}

*(*

_{k}*t*;

*N*(0),

*α*), the probability that

*n*lineages coalesce to

*k*lineages in time

*t*for a population with size

*N*(

*t*) =

*N*(0)

*e*

^{−}

*.*

^{αt}The probability *h _{n}*

_{,}

*(*

_{k}*t*;

*N*(0),

*α*) is computed by solving for the value

*t*′ such that

*h*

_{n}_{,}

*(*

_{k}*t*;

*N*(0),

*α*) =

*h*

_{n}_{,}

*(*

_{k}*t*′;

*N*

_{c}) for a specified constant size

*N*

_{c}, and then computing

*h*

_{n}_{,}

*(*

_{k}*t*′;

*N*

_{c}) using Equation 2. Here, we take

*N*

_{c}= 1 for simplicity. Let

*g*(

*t*;

*N*(0),

*α*) denote the transformation taking time

*t*in the growing population to time

*t*′ in the population of constant size 1. Assuming the size of the growing population at time

*t*is

*N*(

*t*) =

*N*(0)

*e*

^{−}

*, the transformation*

^{αt}*g*(

*t*;

*N*(0),

*α*) is

*t*′ =

*g*(

*t*;

*N*(0),

*α*) are the same as those of the untransformed time

*t*(Griffiths and Tavaré 1994; Nordborg 2003). Then

*N*

_{A}is needed because the transformed time

*t*′ is measured in units of 2

*N*

_{A}generations. The modified versions of all equations from the constant-size model appear in Table 2. Because the case in which populations 1 and 2 have constant size is the

*α*= 0 case of the growth scenario, results for the constant model can also be obtained using the formulas in Table 2.

### Simulations

To validate our theoretical results, we carried out coalescent simulations. Given values of *n*_{1}, *n*_{2}, and *t*_{D}, in the constant-size model, following the method of Jewett and Rosenberg (2012), we simulated genealogies and estimated probabilities ℙ(*C*_{1}), ℙ(*C*_{2}), and ℙ(*C*_{12}) as the fractions of the simulated genealogies for which the events *C*_{1}, *C*_{2}, and *C*_{12} occurred. ℙ(*C*_{1}), ℙ(*C*_{2}), and ℙ(*C*_{12}) were obtained from the same set of 10^{6} simulations.

## Results

### Agreement of closed-form, recursive, and simulation-based computations

We derived exact closed-form and recursive equations for the probability ℙ(*C*_{1}) that the internal reference panel is optimal, the probability ℙ(*C*_{2}) that the external reference panel is optimal, and the probability ℙ(*C*_{12}) that both panels are equally optimal. Both the exact and recursive computations require the function *h _{n}*

_{,}

*(*

_{k}*t*;

*N*) to be evaluated. However, Equation 2 for

*h*

_{n}_{,}

*(*

_{k}*t*;

*N*) is numerically unstable for small

*t*and large

*n*. Therefore, for small

*t*and large

*n*, we used an asymptotic approximation to

*h*

_{n}_{,}

*(*

_{k}*t*;

*N*) (Griffiths 1984). For

*n*lineages sampled in the present, the distribution of the number of lineages at time

*t*(expressed in units of 2

*N*generations) is asymptotically normal with mean

*μ*= 2

*η*/

*t*and variance

*σ*

^{2}= 2

*ηt*

^{−1}(

*η*+

*β*)

^{2}[1 +

*η*/(

*η*+

*β*) −

*η*/

*α*−

*η*/(

*α*+

*β*) − 2

*η*]

*β*

^{−2}as

*t*→ 0,

*n*→ ∞, and

*β*= −

*t*/2 and

*η*=

*αβ*/{

*α*(

*e*− 1) +

^{β}*βe*}. We used the asymptotic approximation to

^{β}*h*

_{n}_{,}

*(*

_{k}*t*;

*N*) when both

*n*≥ 40 and

*t*

_{D}≤ 0.1, and we used Equation 1 otherwise.

Table 3 gives values of ℙ(*C*_{1}) computed using the exact and recursive approaches at two divergence times *t*_{D} and several reference panel sizes *n*_{1} and *n*_{2}. For larger *n*_{1} or *n*_{2}, where closed-form evaluation of ℙ(*C*_{1}) is computationally difficult, we report only values computed using the recursion (Equation 15). The closed-form and recursive probabilities agree at parameter values where a comparison is possible. To allow large reference panel sizes to be considered, we subsequently restrict our attention to the recursion.

The closed-form and recursive expressions also agree with the results of the simulations (Table 3). Because the simulations do not rely on the asymptotic approximation, when *t*_{D} ≤ 0.1 and either *n*_{1} ≥ 40 or *n*_{2} ≥ 40, differences between analytical and simulation results are potentially attributable to errors in the approximation. However, these differences are generally negligible.

### Comparison of internal and external reference panels

We report ℙ(*C*_{1}), the optimality probability for an internal reference panel, and *E*[*S*_{2} − *S*_{1}], the expected difference in the number of incorrectly imputed sites between the external and internal reference panels, for a range of panel sizes and for large and small divergence times. The optimality probability ℙ(*C*_{1}) is interpreted as the probability that a given locus is imputed more accurately when using the internal reference panel compared to the external panel. The relative accuracy for imputation using the external panel compared to the internal panel is quantified by *E*[*S*_{2} − *S*_{1}]. A positive value of *E*[*S*_{2} − *S*_{1}] indicates that using the external reference panel will, on average, result in more imputation errors than using the internal panel. A negative value indicates that the external panel will result in fewer imputation errors on average, and a value of zero indicates that the two panels are expected to produce the same number of incorrectly imputed sites. *E*[*S*_{2} − *S*_{1}] is reported in units of the population-scaled mutation rate *θω* = 4*N _{A}μω* for a set of

*ω*imputed bases.

#### Populations of constant size:

In our analyses of the constant-size model, we assumed a constant, equal size for all populations (*N*_{1} = *N*_{2} = *N*_{A}). Note that because we express imputation accuracy in terms of the population-scaled mutation rate *θω*, only the relative effective sizes of the populations and not their absolute sizes influence the results of our analyses. Under this model, if *t*_{D} is small, then *t*_{D} measured in units of 2*N*_{A} generations is related to the population differentiation statistic *F*_{st} through the approximation *t*_{D} ≈ −log(1 − *F*_{st}) ≈ *F*_{st} (Cavalli-Sforza 1969; Reynolds *et al.* 1983). We present results for divergence times of *t*_{D} = 0.005 (*F*_{st} ≈ 0.005) to represent two populations within a continental region, and *t*_{D} = 0.05 (*F*_{st} ≈ 0.05) to represent more strongly diverged populations.

Figure 4A shows ℙ(*C*_{1}) for a range of reference panel sample sizes *n*_{1} and *n*_{2} at the smaller divergence time of *t*_{D} = 0.005. Each curve corresponds to a fixed external reference panel size *n*_{2}, and ℙ(*C*_{1}) is plotted as a function of the internal reference panel size *n*_{1}. ℙ(*C*_{1}) exceeds 0.5 over most of the range of values for *n*_{1} and *n*_{2}, indicating that the internal panel is likely be optimal even when it is much smaller than the external panel. For example, ℙ(*C*_{1}) > 0.7 for internal reference panels of *n*_{1} ≥ 400 haplotypes even when considering an extremely large external panel of *n*_{2} = 10,000 haplotypes (light blue curve).

Figure 4B shows ℙ(*C*_{1}) for the same values of *n*_{1} and *n*_{2} at the larger divergence time of *t*_{D} = 0.05. For fixed values of *n*_{1} and *n*_{2}, ℙ(*C*_{1}) is greater when *t*_{D} = 0.05 than when *t*_{D} = 0.005, indicating an increase in the probability of optimality for an internal panel at larger divergence times. Taken together, Figure 4, A and B, shows that under our model, smaller internal reference panels are more likely to be optimal than much larger external panels and that this improvement increases as the two source populations for the panels become more genetically distinct.

Next, we compute the expected difference *E*[*S*_{2} − *S*_{1}] in the number of incorrectly imputed sites between internal and external reference panels to quantify the improvement in accuracy (Figure 4C, *t*_{D} = 0.005; Figure 4D, *t*_{D} = 0.05). Increasing the size *n*_{1} of the internal panel while holding the external panel size *n*_{2} fixed results in an increase in *E*[*S*_{2} − *S*_{1}]. Conversely, increasing *n*_{2} while holding *n*_{1} fixed leads to a decrease in *E*[*S*_{2} − *S*_{1}]. Because the internal reference panel size has no effect on *E*[*S*_{2}] and the external panel size does not affect *E*[*S*_{1}], an increase in *E*[*S*_{2} − *S*_{1}] for fixed *n*_{1} represents an increase in *E*[*S*_{2}], while an increase of *E*[*S*_{2} − *S*_{1}] for fixed *n*_{2} indicates a decrease of *E*[*S*_{1}]. The model thus predicts the intuitive result that increasing the size of a reference panel (internal or external) leads to fewer imputation errors. Increasing the number of haplotypes in a reference panel has the largest improvement when the panel is initially small. For example, once the internal panel has reached a certain size (*n*_{1} ≈ 30 for *t*_{D} = 0.005), including additional haplotypes in the panel produces only small increases in accuracy. Similarly, the addition of haplotypes to the external panel yields the greatest improvement in accuracy when the external panel is small; hence, a gap is visible between the lines for *n*_{2} = 50 (orange) and *n*_{2} = 100 (purple) in Figure 4, C and D, while the lines for *n*_{2} = 500 (red) and *n*_{2} = 10,000 (light blue) are nearly indistinguishable.

The magnitude of the divergence time affects the expected accuracies of the reference panels. Comparison of Figure 4, C and D, shows that the performance of the internal reference panel relative to the external panel improves for the larger divergence time. For example, at the smaller divergence time, *n*_{1} = 67 haplotypes are required in the internal panel to acquire the same expected imputation accuracy as an external panel of *n*_{2} = 100 haplotypes (Figure 4C, purple line). For the larger divergence time, an internal reference panel need only contain *n*_{1} = 17 haplotypes to provide the same expected accuracy as *n*_{2} = 100 haplotypes in an external panel (Figure 4D, purple line).

The results from Figure 4 can be directly interpreted in terms of events in the coalescent model. ℙ(*C*_{1}) increases with *t*_{D} for fixed *n*_{1} and *n*_{2} because increasing the divergence time between the populations lengthens the amount of time that the target haplotype *T* remains in population 1, where it can coalesce only with type 1 lineages. In fact, for large divergence times or large *n*_{1}, the target *T* is likely to coalesce before reaching the ancestral population, explaining why ℙ(*C*_{1}) ≈ 1 nearly independently of *n*_{2} for larger *t*_{D}. *E*[*S*_{2} − *S*_{1}] levels off quickly as *n*_{1} and *n*_{2} increase because the expected time until a single lineage in a diploid population of constant size *N* coalesces with one of ℓ other lineages is 4*N*/(ℓ + 1) (Fu and Li 1993). Thus, for our parameter values, the expected time until *T* coalesces with a type 1 lineage is *E*[*T*_{1}] = 4*N*/(*n*_{1} + 1) and the expected time until *T* coalesces with a type 2 lineage is *j*_{D} is the number of lineages from population 2 that remain at time *t*_{D}. The quantity 4*N*/(ℓ + 1) is small even when ℓ is only moderately large. Hence, changes in *E*[*T*_{1}] and *E*[*T*_{2}] with respect to *n*_{1} and *j*_{D}, respectively, are small once *n*_{1} or *j*_{D} exceeds around 50 lineages. Because *j*_{D} increases quickly with *n*_{2} for small *t*_{D}, both *E*[*T*_{1}] and *E*[*T*_{2}] change little once *n*_{1} and *n*_{2} are moderate in size. *E*[*S*_{2} − *S*_{1}], which is proportional to *E*[*T*_{2}] − *E*[*T*_{1}], therefore changes little as well.

#### Exponentially growing populations:

We next examine ℙ(*C*_{1}) and *E*[*S*_{2} − *S*_{1}] under a model of exponential growth in populations 1 and 2. Here we assume that both populations have size *N*_{A} at the divergence time *t*_{D} and size 100*N*_{A} in the present. We again show results for divergence times of *t*_{D} = 0.005 and 0.05, measured in units of 2*N*_{A} generations.

Figure 5A compares ℙ(*C*_{1}) under the constant-size model (solid lines) and exponential growth model (dashed lines) at the smaller divergence time of *t*_{D} = 0.005. For fixed *n*_{1} and *n*_{2}, the value of ℙ(*C*_{1}) under the growth model is less than the corresponding ℙ(*C*_{1}) value for the constant model. For instance, ℙ(*C*_{1}) is 0.68 for *n*_{1} = 200 and *n*_{2} = 500 under the constant-size model, but it falls to 0.35 for the growth model. The differences in accuracy between the growth and constant-size models are similar although less pronounced for the larger divergence time (Figure 5B). Thus, recent exponential growth in the populations from which the reference and target haplotypes are sampled has the effect of reducing the optimality of the internal reference panel.

The expected difference *E*[*S*_{2} − *S*_{1}] in imputation accuracy between the two panels is also affected by exponential growth. When the divergence time is small (*t*_{D} = 0.005), the expected difference in accuracy for a given *n*_{1} and *n*_{2} decreases very slightly under exponential growth (Figure 5C). The change in *E*[*S*_{2} − *S*_{1}] between the constant-size and growth models is more extreme at the larger divergence time (Figure 5C). Thus, imputation accuracy for an external reference panel relative to an internal panel improves in the exponential growth model. Recall that for the larger divergence time, only *n*_{1} = 17 internal haplotypes were required to achieve the same expected accuracy as *n*_{2} = 100 external haplotypes under the constant model. Under the exponential growth model, the number of internal haplotypes needed to match the performance of the *n*_{2} = 100 external haplotypes increases to *n*_{1} = 46. Although smaller internal reference panels can still achieve accuracy similar to that of larger external panels in the presence of exponential growth, the number of internal haplotypes required to do so increases.

The effects of growth can be understood using intuition about the coalescent model. Including exponential growth in our coalescent model increases the mean waiting time for coalescent events compared to the constant-size case (Figure 6). Therefore, the probability that the target *T* coalesces more recently than the divergence time *t*_{D} decreases, and the number of type 2 lineages *j*_{D} that enter into the ancestral population increases. Both of these factors increase the probability that *T* will survive to the ancestral population and, therefore, the probability that *T* will coalesce first with a type 2 or type 1–2 lineage. This explains the reduction in ℙ(*C*_{1}) observed for the growth model. Similarly, the decrease in *E*[*S*_{2} − *S*_{1}] under the exponential growth model compared to the constant-size model can be explained by the longer coalescent waiting times. The expected waiting time *E*[*T*_{1}] until *T* coalesces with a type 1 lineage increases; however, the expected waiting time *E*[*T*_{2}] until *T* coalesces with a type 2 or type 1–2 lineage decreases because the longer waiting times in population 2 result in a larger number of type 2 lineages surviving to the ancestral population. Both the increase in *E*[*T*_{1}] and the decrease in *E*[*T*_{2}] lead to a reduction in *E*[*S*_{2} − *S*_{1}] since *E*[*S*_{2} − *S*_{1}] ∝ *E*[*T*_{2}] − *E*[*T*_{1}].

## Discussion

We have introduced a theoretical model of genotype imputation accuracy, employing the coalescent framework to model the ancestry of an imputation target haplotype and a set of reference haplotypes. We examined an imputation algorithm that chooses the reference haplotype with the most similar genealogical history to the target as the template for imputation, and we used the expected coalescence time between the target and template haplotypes to predict the expected number of incorrectly imputed sites in the target.

Framing imputation in a coalescent model has two major benefits. First, the coalescent model enables the derivation of analytical formulas for imputation accuracy. These formulas provide a computationally fast method for studying accuracy across a range of imputation study design variables, and they facilitate the interpretation of observed relationships between imputation accuracy and demographic parameters in terms of well-understood properties of the coalescent model. Second, the coalescent allows complex modeling of population histories, so that a variety of relationships between the reference and target populations can be considered. Here we presented a simple model with a single study population and a single external population. However, this basic model can be extended to more complicated imputation scenarios by specifying different demographic settings for the coalescent. For example, the model can be reformulated to include multiple external populations, each with a unique number of available haplotypes and a distinctive divergence time from the study population. Equations analogous to the ones derived in this article can be computed for the more complicated model and used to determine the optimal imputation reference panel when several external panels are available.

We used the model to study the effect of population subdivision on imputation accuracy, employing a two-population divergence model to compare internal reference panels drawn from the same source population as the target to external panels from a distinct population. To quantify imputation accuracy, we focused on two quantities: (1) ℙ(*C*_{1}), the probability that a target lineage on which genotypes are to be imputed coalesces first with a lineage from the internal panel, and (2) *E*[*S*_{2} − *S*_{1}], the expected difference in the number of imputation errors between the external and internal reference panels.

We have interpreted ℙ(*C*_{1}) as the probability that the internal reference panel is optimal and results in fewer expected imputation errors at a locus than the external panel. ℙ(*C*_{1}) has two additional interpretations. First, a reasonable imputation strategy is to augment an available external panel with internal reference haplotypes. In this setting, ℙ(*C*_{1}) is the probability that the target lineage will coalesce first with one of the additional internal lineages. Thus, ℙ(*C*_{1}) can be interpreted as the probability that imputation accuracy improves by augmenting the existing external reference panel with internal haplotypes. A second alternative interpretation of ℙ(*C*_{1}) is obtained by considering imputation on a genome-wide scale, rather than at a single locus. In a genome-wide context, ℙ(*C*_{1}) can be viewed as the fraction of sites in the genome that are more accurately imputed by the internal reference panel than by the external reference panel.

The model predicts that even when an internal reference panel is considerably smaller than an external panel, the internal panel is nearly always optimal in the sense that it contains the haplotype with the closest genealogical history to the target. Furthermore, the probability of optimality for the internal panel, ℙ(*C*_{1}), increases as the divergence time increases. For populations of constant size, a large external reference panel can provide approximately the same accuracy as a modestly sized internal reference panel if the divergence time is small (*E*[*S*_{2}] ≈ *E*[*S*_{1}]). As the divergence time increases, a small internal reference panel results in increasingly more accurate imputation compared to the large external panel. The expected improvement in imputation accuracy for a smaller internal panel becomes less pronounced if the populations experience exponential growth following the divergence. Thus, exponential growth attenuates the effect of the divergence time, improving the relative performance of an external reference panel in terms of both optimality probability and expected imputation accuracy when compared to populations of constant size.

The results from our model have implications for imputation strategies in population-based genetic association studies. Currently, large public data sets from the HapMap (International HapMap3 Consortium, 2010) and 1000 Genomes (1000 Genomes Project Consortium, 2010) projects are frequently used as external reference panels. However, advances in sequencing technology will enable investigators to create custom internal reference panels from the same source population as their study samples. Not only will custom internal panels enable successful imputation of rare variants private to the target population, our model predicts that a custom internal reference panel will often improve imputation accuracy in general, even if it is much smaller than an existing external reference panel. Under the model, the best strategy is to combine the internal haplotypes with an available external panel to create a single cosmopolitan reference panel. This approach combines the benefits contributed by the large sample size of the external panel and the greater genetic similarity of the internal panel.

Our equations for panel optimality and imputation accuracy rely on a rule that mimics computational imputation algorithms: the reference haplotype whose coalescence time with the target is minimal serves as the imputation template. In actual data, this rule might not always hold, as stochasticity of mutations and the use of small sets of “tagging” SNPs for estimating pairwise distances among haplotypes could cause a sequence whose coalescence time with the target is not minimal to be most genetically similar to the target. The problem may be more pronounced for rare variants that do not exist on a unique background of tagging SNPs. In addition, the rule for template selection might not be strictly followed by imputation software. We have assumed that the entire length of the target haplotype is imputed using the same reference haplotype; however, owing to past recombination events, a real target haplotype is likely to be composed of multiple segments, each with a distinct optimal template. Our model, therefore, implicitly assumes that an imputation algorithm will correctly jump between reference haplotypes when imputing a target. Deviations from this assumption provide a source of imputation error that was not treated in our analysis. Furthermore, as our analysis considered known haplotypes, we did not explicitly account for haplotype phasing errors. Haplotyping accuracy is known to increase with sample size, so that small internal reference panels may be more prone than large external panels to haplotyping errors that reduce imputation accuracy (Browning and Browning 2011). Consequently, our results can be interpreted as an approximation to the imputation error owing to use of a particular reference panel, without considering stochasticity in the choice of template haplotype and without considering phasing errors or errors that might occur from the imputation algorithm. In future research, it will be important to examine factors such as stochasticity in mutation, properties of sets of tagging SNPs, and recombination.

We have assumed that no migration occurs between the two populations in our model. Relaxing this assumption will likely reduce the magnitude of the difference in accuracy obtained on the basis of the internal and external panels. Migration allows the target haplotype to coalesce with a lineage ancestral to the external reference panel more recently than the divergence time, either if the target migrates to the population containing the external reference panel or if an ancestor of an external reference haplotype migrates to the population containing the target. Therefore, including migration in the model could reduce the expected coalescence time between the target and an ancestor of the external reference panel, leading both to an increase in accuracy and to an increase in the probability of optimality for an external reference panel. Determining rates of migration that lead to a noticeable effect requires further investigation.

Despite the fact that we have presented a simplified model without recombination or migration, our results provide mathematical formulas that allow us to estimate imputation accuracy for a variety of demographic and sampling scenarios. Our equations capture a variety of phenomena pertaining to imputation accuracy, and they facilitate the interpretation of these phenomena in terms of properties of the coalescent model. Genotype imputation is a valuable tool in genetic studies of complex disease, and optimizing imputation accuracy is important for conducting analyses with imputed data. The formulas we have derived are a step toward the development of more complicated models that can be used to make practical quantitative predictions about imputation accuracy, thereby facilitating sampling design.

## Acknowledgments

The authors thank E. Buzbas, S. Gopalakrishnan, L. Huang, and two anonymous reviewers for helpful comments. Support was provided by National Institutes of Health grants R01 GM081441, R01 HG005855, and T32 HG000040, and by a grant from the Burroughs Wellcome Fund.

## Appendix: The Quantity *N*(*i*_{D}, *j*_{D} → *i*_{C}, *j*_{C}, *k*_{C})

*N*(

*i*

_{D},

*j*

_{D}→

*i*

_{C},

*j*

_{C},

*k*

_{C})

Here we derive the number of ways *N*(*i*_{D}, *j*_{D} → *i*_{C}, *j*_{C}, *k*_{C}) in which *i*_{D} type 1 lineages and *j*_{D} type 2 lineages can coalesce to *i*_{C}, *j*_{C}, and *k*_{C} lineages of types 1, 2, and 1–2. This quantity is used to obtain the closed forms of the probabilities ℙ(*C*_{1}), ℙ(*C*_{2}), and ℙ(*C*_{12}) (Equations 1, 9, and 10).

We first note that if *k*_{C} type 1–2 lineages remain, then at least *k*_{C} type 1 lineages, and at least *k*_{C} type 2 lineages, must coalesce together to produce these lineages. Let *k*_{C} lineages of type 1–2 (Figure A1). Further, let *r*th type 1–2 lineage. The possible values of *k*_{C} nonempty subsets. Similarly, the possible values of *k*_{C} nonempty subsets.

Let *ψ*(*n*, *k*) denote the number of partitions of an integer *n* into *k* positive integers (Andrews 1984, p. 16). Let *q*th partition of this kind, with *r*th part in the partition. We can permute the *k* parts of the *q*th partition in *k*! ways. Denote the *z*th permutation of partition *q* by *n* lineages—the number of sequences in which *n* lineages can coalesce to one lineage—by *F*(*n*) ≡ *I _{n}*

_{,1}, where

*I*

_{n}_{,1}is computed using Equation 3. Then the quantity

*N*(

*i*

_{D},

*j*

_{D}→

*i*

_{C},

*k*

_{C},

*j*

_{C}) is given by

The quantity *α*(*n*, *k*, *q*) is the number of ways to distribute *n* distinguishable objects into *k* unordered, nonempty subsets whose sizes are those of the parts of the *q*th partition *π ^{q}*(

*n*,

*k*). To obtain an expression for

*α*(

*n*,

*k*,

*q*), define

*a*(

*φ*;

*π*(

^{q}*n*,

*k*)) to be the number of parts of the partition

*π*(

^{q}*n*,

*k*) that have size

*φ*. Then

*α*(

*n*,

*k*,

*q*) is given by

*k*parts and

*a*(

*φ*;

*π*(

^{q}*n*,

*k*))! is the number of ways to permute the parts of size

*φ*.

*α*(

*n*,

*k*,

*q*) because we take the partitions to be unordered, and there are

*a*(

*φ*;

*π*(

^{q}*n*,

*k*))! arrangements of the subsets of size

*φ*in which the same elements in these subsets are grouped together.

In Equation A1,

The quantity *R*(*i*, *j*, *k*, *η*, *γ*) in Equation A1 is the number of labeled histories for the *k*_{C} lineages of type 1–2. Given *k*_{C} nonempty parts, and a particular partition of the *k*_{C} nonempty parts. Each one of the *k _{C}* type 1–2 lineages is made by combining a part from the partition of the

*k*! ways to pair up the parts. We index these ways by

_{C}*z*. For the

*z*th way of permuting the parts, the lineages in part

*r*th lineage of type 1–2.

The *r*th pair of parts of lineages undergoes

Because there are *r*th pair of parts as they coalesce to form the *r*th lineage of type 1–2, there are *k*_{C}! possible ways to permute the partitions of the

## Footnotes

*Communicating editor: Y. S. Song*

- Received December 19, 2011.
- Accepted May 4, 2012.

- Copyright © 2012 by the Genetics Society of America

Available freely online through the author-supported open access option.