A Coalescent Model for Genotype Imputation
Ethan M. Jewett, Matthew Zawistowski, Noah A. Rosenberg, Sebastian Zöllner

Abstract

The potential for imputed genotypes to enhance an analysis of genetic data depends largely on the accuracy of imputation, which in turn depends on properties of the reference panel of template haplotypes used to perform the imputation. To provide a basis for exploring how properties of the reference panel affect imputation accuracy theoretically rather than with computationally intensive imputation experiments, we introduce a coalescent model that considers imputation accuracy in terms of population-genetic parameters. Our model allows us to investigate sampling designs in the frequently occurring scenario in which imputation targets and templates are sampled from different populations. In particular, we derive expressions for expected imputation accuracy as a function of reference panel size and divergence time between the reference and target populations. We find that a modestly sized “internal” reference panel from the same population as a target haplotype yields, on average, greater imputation accuracy than a larger “external” panel from a different population, even if the divergence time between the two populations is small. The improvement in accuracy for the internal panel increases with increasing divergence time between the target and reference populations. Thus, in humans, our model predicts that imputation accuracy can be improved by generating small population-specific custom reference panels to augment existing collections such as those of the HapMap or 1000 Genomes Projects. Our approach can be extended to understand additional factors that affect imputation accuracy in complex population-genetic settings, and the results can ultimately facilitate improvements in imputation study designs.

GENOTYPE imputation is the estimation of genotypes at untyped markers using patterns of haplotype structure (Halperin and Stephan 2009; Li et al. 2009; Marchini and Howie 2010). Imputation is a powerful tool in modern genetic studies. It is routinely used to increase the fraction of the genome covered in genome-wide association studies (GWAS) conducted in human populations, thereby increasing power to detect risk variants through linkage disequilibrium (LD) mapping (Marchini et al. 2007; Becker et al. 2009; Hao et al. 2009; Spencer et al. 2009; Li et al. 2010). Imputation based on a shared set of markers in data sets genotyped on different platforms enables large-scale meta-analyses (Barrett et al. 2008; de Bakker et al. 2008; Zeggini et al. 2008). In sequencing studies of rare variation, imputation can improve the accuracy of genotype calls from sequencing reads (Li et al. 2010, 2011), and power for rare-variant tests of association can be increased by augmenting sequence data with imputed samples (Zawistowski et al. 2010).

Imputation procedures generally involve a set of target samples in which genotypes will be imputed, a reference panel of phased haplotypes from which genotypes are copied, and an algorithm for the copying procedure. Each target sample is genotyped for single-nucleotide polymorphisms (SNPs) whose genotypes serve as a scaffold for more complete haplotypes. Reference samples are more densely genotyped than the target samples, and might even be fully sequenced. Imputation algorithms typically employ a computationally intensive hidden Markov model that uses genotypes from the SNP scaffold of a target sample to choose haplotypes from the reference panel that are most similar (Scheet and Stephens 2006; Browning and Browning 2007; Marchini et al. 2007; Purcell et al. 2007; Li et al. 2010). Genotypes are then imputed in the target by locally copying reference haplotypes that provide the best match.

Analyses of the imputed genotypes, such as in marker-based association tests, depend on imputation accuracy (Huang et al. 2009b), which in turn depends on numerous factors, including the haplotypic diversity of the target population, the size of the reference panel, and the genetic similarity of the reference and target individuals (Huang et al. 2009a, 2011; Browning and Browning 2009; International HapMap3 Consortium 2010; Jostins et al. 2011). Therefore, when performing imputation, and when prospectively designing imputation-based studies, it is important to understand how sampling approaches and population parameters affect imputation accuracy.

Currently, imputation accuracy is generally assessed by experiments on real or simulated data, in which known genotypes are masked, imputed, and compared to their true values. A thorough analysis of the effect of study design variables on imputation accuracy requires many experiments, a computationally expensive task that restricts the number of parameter combinations that can be tested. An alternative strategy is to develop a theoretical model that incorporates the study design variables as parameters and permits analytical calculations of imputation accuracy as a function of these variables. Such a model could enable faster predictions of imputation accuracy across a wider range of study designs than use of the empirical method. Moreover, analytical expressions for imputation accuracy as a function of model parameters can facilitate an intuitive understanding of the relationship between imputation accuracy and features of empirical studies.

In this article, we introduce a coalescent-based theoretical model of imputation. We consider a single target haplotype to be imputed for untyped markers and a reference panel of haplotypes, of which one will be chosen as the template for imputation. Our model relies on the premise that for a given target haplotype, an imputation algorithm ideally chooses as a template the reference haplotype whose number of sequence differences from the target is smallest. We take from the reference panel the haplotype with the closest genealogical history to the target, in terms of coalescence time, to serve as the template; this haplotype will have, on average, the fewest sequence differences from the target. By selecting the template haplotype in this way, our model mimics existing imputation approaches that attempt to select the haplotype, or set of haplotypes, with the closest genealogical relationship to the target (Pasaniuc et al. 2010; Howie et al. 2011). Under our imputation scheme, we quantify imputation accuracy as a function of reference panel size and demographic parameters for the populations containing the target and reference haplotypes.

The flexibility of the coalescent framework enables complex population-genetic models to be considered. Here, we use the model to investigate imputation-based study designs in the age of next-generation sequencing. In practice, researchers have typically relied on large, publicly available data sets to serve as reference panels. Although public panels are easily accessible and often result in sufficient imputation accuracy, the haplotypes available often derive from a different population than the study sample in which genotypes are imputed. Advances in sequencing technology now permit the creation of custom reference panels that contain genome sequences of individuals from the same source population as the study sample—perhaps even a subset of the study sample itself. Custom panels, however, will likely be smaller than public panels, owing to the sequencing cost required to create them. It is therefore of interest to determine the practical utility of custom panels by assessing the improvement in imputation accuracy for either replacing a public panel with a potentially smaller custom panel or augmenting the public panel with custom reference haplotypes.

Using our coalescent framework, we address this question by considering two potential reference panels for imputation in the same target population. The first is an “internal” reference panel of haplotypes drawn from the target population and is meant to represent a custom panel. The second is an “external” reference panel of haplotypes from a distinct population, such as one included in a public database. Our model predicts that an internal reference panel, even when considerably smaller than an existing external reference panel, nearly always improves imputation accuracy. We examine the dependence of this improvement on the relative sizes of the two panels and on demographic parameters such as divergence times and population growth rates. Our results suggest that in practice, augmenting an existing external reference panel (1000 Genomes, for example) with even relatively few haplotypes from the study population of interest can improve genotype imputation.

Overview of the Model

We use a coalescent framework to model genotype imputation at a nonrecombining genetic locus intended to represent a short region along a chromosome. We assume that genotypes for a single target haplotype T will be imputed. We define a reference panel to be a set of sequenced or densely genotyped haplotypes that does not include T. One haplotype from the reference panel will be chosen as an imputation template, and alleles from the template haplotype will be copied onto T. Assuming that mutations accumulate in proportion to time, the reference haplotypes with the fewest sequence differences from T are the descendants of the lineage with which T first coalesces. Under this assumption— which holds in expectation under the infinitely many-sites model (e.g., Wakeley 2008)—an imputation algorithm that always selects the haplotype in the reference panel with the fewest sequence differences from T (or one such haplotype in case of a tie) is equivalent to the algorithm choosing the reference haplotype with the closest genealogical history to T. Thus, when predicting the accuracy produced by a reference panel, to a first approximation, it is reasonable to use a model that considers only coalescence times, rather than a more complete model with stochastic mutations. As we will see, considering only coalescence times is desirable because it makes analytical computations simple, fast, and straightforward to interpret.

By including haplotypes from multiple reference panels in our model, we can compare the performance of reference panels from different populations. Here, we consider a scenario with two possible reference panels for imputing T. The first is an internal reference panel consisting of n1 haplotypes sampled from the same population as T. The second is an external reference panel consisting of n2 haplotypes sampled from a distinct population. Defining the “optimal panel” as that which produces the highest accuracy for imputing T, we compute the probability of optimality for both the internal and external reference panels and quantify the gain in imputation accuracy obtained by using the optimal rather than the suboptimal panel.

We model the genealogical history of T and the reference haplotypes using a two-population coalescent model of divergence (Takahata and Nei 1985; Rosenberg 2003) (Figure 1A). The two populations are labeled 1 and 2, and T is sampled from population 1. The populations diverged from an ancestral population at time tD in the past, and no migration has occurred between the descendant populations. Therefore, more recently than the divergence time (t < tD), a lineage can coalesce only with other lineages from the same population. This assumption is reasonable for pairs of populations that are geographically isolated. More anciently than the split (t > tD), all remaining lineages are assumed to belong to a homogeneous ancestral population, and any two lineages are allowed to coalesce. We assume effective population sizes of N1 diploid individuals for population 1, N2 for population 2, and NA for the ancestral population.

Figure 1

Two-population coalescent model for imputation reference panel selection. (A) Two populations, labeled 1 and 2, of sizes N1 and N2 diploid individuals, diverge from an ancestral population of size NA at time tD. A single haplotype T for which genotypes at untyped markers are to be imputed is sampled from population 1. We consider two possible reference panels for imputing T: an internal reference panel of n1 haplotypes sampled from population 1 and an external reference panel of n2 haplotypes sampled from population 2. If T first coalesces with a type 1 lineage (blue), then the internal panel is optimal for imputing T (event C1). The external panel is optimal (event C2) if T first coalesces with a lineage of type 2 (red). Finally, if T first coalesces with a type 1–2 lineage (orange), then the two reference panels are equivalent (event C12). (B) To compute the probability of optimality for each reference panel, we condition on Formula (the event that T coalesces before the divergence), the quantities iD and jD (the numbers of lineages originating in populations 1 and 2, respectively, that remain at the time of divergence), and iC, jC, and kC (the numbers of type 1, type 2, and type 1–2 lineages remaining at the instant when T first coalesces). In the realization pictured, T does not coalesce before the divergence time (event Formulac) and iD = 3, jD = 2, iC = 2, and jC = kC = 1. Because T first coalesces with a type 1–2 lineage (event C12), the two reference panels are equivalent for imputing T.

At time t = 0, corresponding to the present, n1 reference haplotypes in addition to the single target haplotype T are sampled from population 1, and n2 reference haplotypes are sampled from population 2. The divergence time tD and the panel sizes n1 and n2 are treated as model parameters. We refer to a lineage with descendants only in population 1 as a lineage of type 1. Similarly, a lineage with descendants only in population 2 has type 2, and a lineage with descendants in both populations has type 1–2. We assume that among available reference haplotypes, the optimal templates for imputing T are the descendants of the lineage with which T first coalesces. Thus, the internal reference panel is optimal if T first coalesces with a lineage of type 1 and the external panel is optimal if T first coalesces with a lineage of type 2. If T first coalesces with a lineage of type 1–2, then the two reference panels are equally appropriate, and we say that both are optimal.

Model

We use our coalescent model of genotype imputation to compute the probability of optimality for each reference panel and to quantify differences in imputation accuracy between potential reference panels. For the problem of imputing the target T, we first derive the probability of optimality for an internal reference panel of n1 haplotypes and for an external reference panel of n2 haplotypes, sampled from populations with a divergence time of tD (in units of 2NA generations). Let C1 be the event that T first coalesces with a lineage of type 1, let C2 be the event that T first coalesces with a lineage of type 2, and let C12 be the event that T first coalesces with a lineage of type 1–2. From our definition of optimality, it follows that ℙ(C1), the probability that the target T first coalesces with a lineage of type 1, is the probability that the internal reference panel is optimal for imputing T. Similarly, ℙ(C2) is the probability that the external reference panel is optimal, and ℙ(C12) is the probability that the two reference panels are both optimal, with equal expected imputation accuracy.

In the case that exactly one reference panel is optimal, it is of interest to quantify the improvement in imputation accuracy achieved by using the optimal as opposed to the suboptimal reference panel. Assuming that mutations follow a Poisson process, under the infinitely many-sites model, the expected number of sites incorrectly imputed in T is proportional to the coalescence time between the target and the imputation template; the sites that produce imputation errors are precisely those sites at which mutations occur on either the target or the template branch more recently than their coalescence. Expected imputation accuracy for a given reference panel can then be quantified by the expected time that T first coalesces with a haplotype from the panel. We now derive several measurements of imputation error to evaluate the difference in imputation accuracy between the internal and external panels.

Reference panel optimality probabilities

We consider two approaches for obtaining optimality probabilities, a closed-form computation and a recursive computation.

Closed-form computation:

Let C1, C2, and C12, respectively, be the events that the internal reference panel is optimal for imputing T, the external panel is optimal, and the two panels are equally appropriate. To compute the probabilities for each of these events, we partition the coalescent model into three components: (1) the events in population 1 more recent than the divergence from the ancestral population, (2) the events in population 2 more recent than the divergence, and (3) the events in the ancestral population (Figure 1B). Because we assume that no migration occurs between populations 1 and 2, the coalescent processes in populations 1 and 2 in the period more recent than the divergence are independent. Conditional on the numbers of lineages remaining from populations 1 and 2 at the divergence time, the coalescence events in the ancestral population are independent of the events that occur more recently than the divergence time.

First, we consider the genealogy of haplotypes in population 1 from the present back to the divergence time tD. Define D to be the event that lineage T coalesces more recently than tD and Dc to be the event that T does not coalesce by tD. Note that if T coalesces before tD, then the lineage with which it first coalesces can have descendants only in population 1 and must therefore have type 1. It follows that if event D occurs, then C1 also occurs, so that ℙ(C1, D) = ℙ(D). If, however, T does not coalesce before the divergence time, then it enters the ancestral population, where it can also coalesce with lineages of types 2 and 1–2. In this scenario, to identify the optimal reference panel, we must consider the coalescence events in population 2 and the ancestral population.

The coalescent process in the ancestral population depends on the numbers of lineages from populations 1 and 2 that survive at the divergence time. Let iD, 1 ≤ iDn1, denote the number of lineages from population 1 (other than T) that survive to enter the ancestral population at the divergence time tD (Figure 1B). Similarly, let jD, 1 ≤ jDn2, be the number of lineages from population 2 that survive at the divergence time. Because the coalescent processes in populations 1 and 2 are independent, jD is independent of both iD and D.

By conditioning on the number of lineages from each population remaining at the divergence time tD, we can write the quantity ℙ(C1) as(C1)=(C1,D)+(C1,Dc)=(D)+iD=1n1jD=1n2(C1,Dc,iD,jD)=(D)+iD=1n1jD=1n2(C1|Dc,iD,jD)(Dc,iD,jD)=(D)+iD=1n1jD=1n2(C1|Dc,iD,jD)(Dc,iD)hn2,jD(tD;N2),(1)where the last equality follows from independence between populations 1 and 2 and hn,ℓ(t; N) is the probability that n lineages sampled from a diploid population with effective size N coalesce down to ℓ lineages at time t. Tavaré (1984) demonstrated thathn,(t;N)=m=n(2m1)(1)m(m1)n[m]!(m)!n(m)e(m2)tNA/N(2)where n[m] = n(n−1) ⋯ (nm + 1), n(m) = n(n + 1) ⋯ (n + m − 1), and the factor of NA/N in the exponent is due to the fact that time is measured in units of 2NA generations.

To obtain the probability ℙ(Dc, iD), let An(t;N) be the event that n lineages in a diploid population of effective size N coalesce down to ℓ lineages at time t. Define In,ℓ, whereIn,=(n2)(n12)(+12)=n!(n1)!2n!(1)!,(3)is the number of ways that n lineages can coalesce to ℓ lineages (e.g., Rosenberg 2003). Then ℙ(Dc, iD) is derived by noting that for Dc and iD to both occur, the n1 + 1 total lineages originating in population 1 must coalesce to iD + 1 lineages at the divergence time tD. This can occur in In1+1,iD+1 ways. In In1,iD of these, the n1 reference haplotypes coalesce to iD lineages without T also coalescing. Thus,(Dc,iD)=(Dc,An1+1iD+1(tD;N1))=(Dc|An1+1iD+1(tD;N1))(An1+1iD+1(tD;N1))=In1,iDIn1+1,iD+1(An1+1iD+1(tD;N1))=iD(iD+1)n1(n1+1)hn1+1,iD+1(tD;N1).(4)The probability ℙ(Dc) is obtained by summing over all possible values of iD,(Dc)=iD=1n1(Dc,iD),(5)and the probability ℙ(D) in Equation 1 can be computed as(D)=1(Dc)=1iD=1n1(Dc,iD).(6)The final term to derive in Equation 1, ℙ(C1|Dc, iD, jD), is the probability that T first coalesces with a lineage of type 1 assuming that, in addition to T, iD lineages from population 1 and jD lineages from population 2 survive to the ancestral population.

To derive ℙ(C1|Dc, iD, jD) in closed form, let iC, jC, and kC be the numbers of lineages of types 1, 2, and 1–2, respectively, remaining at the instant when T first coalesces. The probability ℙ(C1|Dc, iD, jD) is computed by summing over all possible values of iC, jC, and kC,(C1|Dc,iD,jD)=kC=0min{iD,jD}iC=δkC,0iDkCjC=δkC,0jDkC(C1,iC,jC,kC|Dc,iD,jD),(7)where δa,b = 1 if a = b and δa,b = 0 otherwise.

To derive the probability ℙ(C1, iC, jC, kC|Dc, iD, jD), let N(iD, jDiC, jC, kC) be the number of ways in which iD lineages of type 1 and jD lineages of type 2 can coalesce down to iC, jC, and kC lineages of types 1, 2, and 1–2, respectively. Then ℙ(C1, iC, jC, kC|Dc, iD, jD) is given by(C1,iC,jC,kC|Dc,iD,jD)=N(iD,jDiC,jC,kC)iCIiD+jD+1,iC+jC+kC.(8)The quantity N(iD, jDiC, jC, kC) is derived in the Appendix.

The probability of optimality for the external reference panel, ℙ(C2), and the probability that the two reference panels are equally optimal, ℙ(C12), are computed in a similar manner to Equation 1. Because the occurrence of D implies that the event C1 has also occurred, however, ℙ(C2, D) = ℙ(C12, D) = 0. Thus, the probability that the external reference panel is optimal can be written as(C2)=iD=1n1jD=1n2(C2|Dc,iD,jD)(Dc,iD)hn2,jD(tD;N2),(9)and the probability that the two reference panels are both optimal is(C12)=iD=1n1jD=1n2(C12|Dc,iD,jD)(Dc,iD)hn2,jD(tD;N2).(10)The closed-form expressions for ℙ(C2|Dc, iD, jD) and ℙ(C12|Dc, iD, jD) in Equations 9 and 10 are similar to Equations 7 and 8. For ℙ(C2|Dc, iD, jD), we compute(C2|Dc,iD,jD)=kC=0min{iD,jD}iC=δkC,0iDkCjC=δkC,0jDkC(C2,iC,jC,kC|Dc,iD,jD),(11)where ℙ(C2, iC, jC, kC|Dc, iD, jD) is given by(C2,iC,jC,kC|Dc,iD,jD)=N(iD,jDiC,jC,kC)jCIiD+jD+1,iC+jC+kC.(12)Similarly, for ℙ(C12|Dc, iD, jD), we compute(C2|Dc,iD,jD)=kC=1min{iD,jD}iC=0iDkCjC=0jDkC(C12,iC,jC,kC|Dc,iD,jD),(13)where ℙ(C12, iC, jC, kC|Dc, iD, jD) is given by

(C12,iC,jC,kC|Dc,iD,jD)=N(iD,jDiC,jC,kC)kCIiD+jD+1,iC+jC+kC.(14)

Recursive computation:

Computing the closed-form expressions (8), (12), and (14) is time–intensive for large iD and jD. Therefore, the probabilities ℙ(C1), ℙ(C2), and ℙ(C12) are difficult to calculate using Equations 1, 9, and 10 with large panel sizes n1 and n2. In this section, we derive an efficient recursive approach for computing the probabilities ℙ(C1|Dc, iD, jD), ℙ(C2|Dc, iD, jD), and ℙ(C12|Dc, iD, jD).

Assume that at some time t > tD, in addition to lineage T, i lineages of type 1, j lineages of type 2, and k lineages of type 1–2 exist in the ancestral population. Conditional on this configuration, let ˜(C1|i,j,k) denote the probability that T first coalesces with a lineage of type 1. We construct a recursive equation for ˜(C1|i,j,k) by conditioning on the lineage pair involved in the next coalescence event and considering its effect on the subsequent coalescent process.

Let m = i + j + k + 1 be the total number of lineages remaining. Each of the (m2) pairs of lineages is equally likely to be the next to coalesce. Nine distinct pairs of lineage types can coalesce in the next event. For each pair, we compute the probability that the next coalescence will involve lineages of the specified types. Conditional on the lineage types that coalesce, we compute the subsequent probability that T first coalesces with a type 1 lineage. If the next coalescence involves T and a type 1 lineage, an event that occurs with probability i/(m2)=2i/[m(m1)], then event C1 occurs. Alternatively, if the next coalescence occurs between T and either a type 2 or a type 1–2 lineage, events that occur with probabilities j/(m2)=2j/[m(m1)] and k/(m2)=2k/[m(m1)], respectively, then C1 cannot occur. For the remaining lineage pairs, T is not involved in the next coalescence, and the probability of C1 depends on the lineage pair that is involved in the event. For example, two type 1 lineages coalesce with probability (i2)/(m2)=i(i1)/[m(m1)], reducing the number of type 1 lineages to i − 1. The coalescent process then restarts with i − 1, j, and k lineages of types 1, 2, and 1–2, respectively. Given this new configuration, the probability of event C1 is ˜(C1|i1,k,j). The remaining cases for the recursion appear in Table 1.

View this table:
Table 1 Derivation of the recursion Formula

By conditioning on the possible lineage pairs for the next coalescence, we obtain a recursion:˜(C1|i,j,k)=2im(m1)+i(i1)+2ikm(m1)˜(C1|i1,j,k)+j(j1)+2jkm(m1)˜(C1|i,j1,k)+2ijm(m1)˜(C1|i1,j1,k+1)+k(k1)m(m1)˜(C1|i,j,k1).(15)Equation 15 holds for i > 0, j ≥ 0, k ≥ 0. ˜(C1|0,j,k)=0 for all j, k ≥ 0 because there must be at least one lineage of type 1 for event C1 to occur. The recursion is incorporated into Equation 1 by replacing ℙ(C1|Dc, iD, jD) with ˜(C1|iD,jD,0).

The terms ℙ(C2|Dc, iD, jD) in Equation 9 and ℙ(C12|Dc, iD, jD) in Equation 10 can also be evaluated recursively, following the same logic used to obtain Equation 15. Denote by ˜(C2|i,j,k) the probability that T first coalesces with a type 2 lineage. Then˜(C2|i,j,k)=2jm(m1)+i(i1)+2ikm(m1)˜(C2|i1,j,k)+j(j1)+2jkm(m1)˜(C2|i,j1,k)+2ijm(m1)˜(C2|i1,j1,k+1)+k(k1)m(m1)˜(C2|i,j,k1).(16)Equation 16, which can replace ℙ(C2|Dc, iD, jD) in Equation 9, holds for i ≥ 0, j > 0, k ≥ 0. ˜(C2|i,0,k)=0 for all i, k ≥ 0.

Finally, conditional on i, j, and k, denoting by ˜(C12|i,j,k) the probability that T first coalesces with a lineage of type 1–2,˜(C12|i,j,k)=2km(m1)+i(i1)+2ikm(m1)˜(C12|i1,j,k)+j(j1)+2jkm(m1)˜(C12|i,j1,k)+2ijm(m1)˜(C12|i1,j1,k+1)+k(k1)m(m1)˜(C12|i,j,k1).(17)The boundary condition for Equation 17 is ˜(C12|i,0,0)=˜(C12|0,j,0)=0 for i, j ≥ 0, because production of a type 1–2 lineage requires at least one type 1 lineage and at least one type 2 lineage. The recursion is incorporated into Equation 10 by replacing ℙ(C12|Dc, iD, jD) with ˜(C12|iD,jD,0).

Expected coalescence times

In this section, we derive formulas that quantify and compare imputation accuracy for internal and external reference panels. Let S1 be the number of sites that are incorrectly imputed when using an internal reference panel and let S2 be the number of sites incorrectly imputed when using an external reference panel. We compute the expected numbers of incorrectly imputed sites, E[S1] and E[S2], conditional on reference panel sizes n1 and n2 and divergence time tD.

Let T1 be the random waiting time until T first coalesces with a lineage that has descendants in the internal reference panel, that is, a type 1 or type 1–2 lineage. Similarly, let T2 be the waiting time until T first coalesces with a lineage that has descendants in the external reference panel, that is, a type 2 or type 1–2 lineage. Here, T1 and T2 are measured in units of 2NA generations. The template haplotype is selected to minimize the coalescence time with lineage T (Figure 2). Thus, the total branch length separating the target from the template is 2T1 when the internal reference panel is used and 2T2 when the external reference panel is used.

Figure 2

Coalescence times between the target T and the reference panels. T1 indicates the time at which the target haplotype T first coalesces with a type 1 or type 1–2 lineage. We choose one of the descendant reference haplotypes from that coalescence event (highlighted in purple) to be the template from the internal reference panel. We assume that when using the internal panel, the number of mutations that result in incorrectly imputed sites follows a Poisson distribution with mean 2T1θω/2, where 2T1 is the total branch length separating the target T from the templates sampled from the internal panel in units of 2NA generations. Here, θ = 4NAμ is the per-base population-scaled mutation rate, μ is the per-base per-generation mutation rate, and ω is the number of bases genotyped in the reference population that will be imputed in T. Similarly, T2 is the time at which the target haplotype T first coalesces with a type 2 or type 1–2 lineage and 2T2 is the branch length between T and the set of potential templates from the external reference panel (the best external reference panel is highlighted in green).

We model mutation events using the infinitely many-sites model and assume that the number of mutations at genotyped sites along a branch of length t units of 2NA generations follows a Poisson distribution with mean tθω/2, where θ = 4NAμ is the per-base population-scaled mutation rate, μ is the per-base per-generation mutation rate, and ω is the number of sites genotyped (or sequenced) in the reference haplotypes that will be imputed in the target. Under our model, mutations that occur along the branches separating the target haplotype and the template haplotype will be incorrectly imputed. Therefore, when the internal reference panel is used, the expected number of sites incorrectly imputed isE[S1]=E[E[S1|T1]]=E[2T1θω/2]=θωE[T1].(18)Similarly, the expected number of sites incorrectly imputed using the external panel isE[S2]=θωE[T2].(19)It follows that the expected difference in the number of sites incorrectly imputed between the external and internal reference panels isE[S2S1]=E[S2]E[S1]=θω(E[T2]E[T1]).(20)Thus, up to the scaling factor θω, which does not depend on the model parameters n1, n2, and tD, deriving E[T1] and E[T2] is sufficient to determine the expected difference E[S2S1].

Derivation of E[T1]:

To compute the expected waiting time E[T1] until T first coalesces with a lineage that has descendants in the internal reference panel, we condition on the population in which lineage T first coalesces:E[T1]=E[T1|D](D)+E[T1|Dc](Dc).(21)Here, ℙ(D) and ℙ(Dc) are obtained using Equations 6 and 5, respectively. To obtain the expected waiting time E[T1|D] until lineage T first coalesces, given that it coalesces in population 1, we integrate the conditional survival function ST1|D(t) of T1 given D:E[T1|D]=t=0tDST1|D(t)dt.(22)ST1|D(t) is calculated as follows:ST1|D(t)=(T1t|D)=1(T1<t,D)/(D)=1(T1<min{t,tD})/(D)=11(D)i=1n1+1(T1<min{t,tD}|An1+1i(min{t,tD};N1))(An1+1i(min{t,tD};N1))=11(D)i=1n1+1[1In1,i1In1+1,i]hn1+1,i(min{t,tD};N1)=11(D)i=1n1+1[1i(i1)n1(n1+1)]hn1+1,i(min{t,tD};N1).(23)In the fourth equality, Ank(t;N) is the event that n lineages coalesce to k lineages in time t in a diploid population of size N. In the fifth equality, by the same argument used to derive Equation 4, the probability that lineage T does not coalesce when the n1 + 1 sampled lineages coalesce to i lineages is In1,i1/In1+1,i. Therefore, the probability that T does coalesce is 1In1,i1/In1+1,i. The term ℙ(D) in Equation 23 is given by Equation 6. Although the integral in Equation 22 can be carried out analytically, we present the formula in its current form because it is easier to modify Equation 22 from the form given to account for exponential growth.

The quantity E[T1|Dc] is the expected time until lineage T first coalesces in the ancestral population with a lineage that has descendants in population 1, given that it does not coalesce in population 1. This expected time can be found by conditioning on the number iD of type 1 lineages that remain at the divergence time:E[T1|Dc]=iD=1n1E[T1|iD,Dc](iD|Dc)=iD=1n1(tD+4NAiD+1)(Dc,iD)(Dc).(24)Here, we have used the fact that in a population of diploid size N, the expected sum of the lengths of all external branches in a genealogy is 4N (Fu and Li 1993). Thus, the mean length of an external branch in a genealogy with n + 1 lineages—and consequently, the expected time until a specific lineage coalesces with some lineage among a set of n—is 4N/(n + 1). ℙ(Dc, iD) and ℙ(Dc) are found using Equations 4 and 5, respectively. The quantity E[T1] is then obtained by inserting Equations 5, 6, 22, and 24 into Equation 21.

Derivation of E[T2]:

To compute E[T2], we condition on the number jD of type 2 lineages remaining at the divergence time:

E[T2]=jD=1n2E[T2|jD]hn2,jD(tD;N2)=jD=1n2(tD+4NAjD+1)hn2,jD(tD;N2).(25)

Exponential growth

Given the utility of population growth models for explaining properties of human genetic variation (e.g., Schaffner et al. 2005; Coventry et al. 2010), we consider a model of exponential growth in populations 1 and 2 (Figure 3). Let N1(t) and N2(t) be functions that define the sizes of populations 1 and 2, respectively, at time t in the past, where t is measured in units of 2NA generations. Here, we assume an ancestral population of constant size and set N1(t)=N1(0)eα1t and N2(t)=N2(0)eα2t for t ∈ [0, tD] and α1, α2 > 0. We compare the results from this model to those of the constant-size model to evaluate effects of exponential growth on imputation reference panel selection.

Figure 3

The two-population coalescent model of divergence, assuming exponential growth in the descendant populations. The sizes of populations 1 and 2 change over time according to Formula and Formula, respectively, for t ∈ [0, tD]. The quantities α1, α2 > 0 are growth rates, and N1(0) and N2(0) are the sizes of populations 1 and 2 in the present. At time tD, populations 1 and 2 merge instantaneously into the ancestral population, which has constant size NA. In our analysis, to explore the effect of exponential population growth on imputation accuracy, we vary N1(0) and N2(0) while holding N1(tD) and N2(tD) fixed.

Exponential growth changes only the distribution of coalescent waiting times from our computations in the previous sections. All derived equations depend on the waiting times only through the quantity hn,k(t; N), the probability that n lineages coalesce to k lineages in time t in a diploid population with constant size N. Thus, for the exponential growth model, we replace hn,k(t; N) from the constant-size model by hn,k(t; N(0), α), the probability that n lineages coalesce to k lineages in time t for a population with size N(t) = N(0)eαt.

The probability hn,k(t; N(0), α) is computed by solving for the value t′ such that hn,k(t; N(0), α) = hn,k(t′; Nc) for a specified constant size Nc, and then computing hn,k(t′; Nc) using Equation 2. Here, we take Nc = 1 for simplicity. Let g(t; N(0), α) denote the transformation taking time t in the growing population to time t′ in the population of constant size 1. Assuming the size of the growing population at time t is N(t) = N(0)eαt, the transformation g(t; N(0), α) isg(t;N(0),α)=0t1/N(z)dz={eαt1N(0)α,ifα0,t/N(0),ifα=0,(26)where the units of the transformed time t′ = g(t; N(0), α) are the same as those of the untransformed time t (Griffiths and Tavaré 1994; Nordborg 2003). Thenhn,k(t;N(0),α)=hn,k(g(t;N(0),α);1)=i=kn(2i1)(1)ikk(i1)n[i]k!(ik)!n(i)e(i2)g(t;N(0),α)NA,(27)where the factor NA is needed because the transformed time t′ is measured in units of 2NA generations. The modified versions of all equations from the constant-size model appear in Table 2. Because the case in which populations 1 and 2 have constant size is the α = 0 case of the growth scenario, results for the constant model can also be obtained using the formulas in Table 2.

View this table:
Table 2 Reformulation of the results of Equations (1) through (25) for the case of exponential growth

Simulations

To validate our theoretical results, we carried out coalescent simulations. Given values of n1, n2, and tD, in the constant-size model, following the method of Jewett and Rosenberg (2012), we simulated genealogies and estimated probabilities ℙ(C1), ℙ(C2), and ℙ(C12) as the fractions of the simulated genealogies for which the events C1, C2, and C12 occurred. ℙ(C1), ℙ(C2), and ℙ(C12) were obtained from the same set of 106 simulations.

Results

Agreement of closed-form, recursive, and simulation-based computations

We derived exact closed-form and recursive equations for the probability ℙ(C1) that the internal reference panel is optimal, the probability ℙ(C2) that the external reference panel is optimal, and the probability ℙ(C12) that both panels are equally optimal. Both the exact and recursive computations require the function hn,k(t; N) to be evaluated. However, Equation 2 for hn,k(t; N) is numerically unstable for small t and large n. Therefore, for small t and large n, we used an asymptotic approximation to hn,k(t; N) (Griffiths 1984). For n lineages sampled in the present, the distribution of the number of lineages at time t (expressed in units of 2N generations) is asymptotically normal with mean μ = 2η/t and variance σ2 = 2ηt−1(η + β)2[1 + η/(η + β) − η/α − η/(α + β) − 2η]β−2 as t → 0, n → ∞, and 12ntα<, where β = −t/2 and η = αβ/{α(eβ − 1) + βeβ}. We used the asymptotic approximation to hn,k(t; N) when both n ≥ 40 and tD ≤ 0.1, and we used Equation 1 otherwise.

Table 3 gives values of ℙ(C1) computed using the exact and recursive approaches at two divergence times tD and several reference panel sizes n1 and n2. For larger n1 or n2, where closed-form evaluation of ℙ(C1) is computationally difficult, we report only values computed using the recursion (Equation 15). The closed-form and recursive probabilities agree at parameter values where a comparison is possible. To allow large reference panel sizes to be considered, we subsequently restrict our attention to the recursion.

View this table:
Table 3 Comparison of closed-form, recursive, and simulated probabilities

The closed-form and recursive expressions also agree with the results of the simulations (Table 3). Because the simulations do not rely on the asymptotic approximation, when tD ≤ 0.1 and either n1 ≥ 40 or n2 ≥ 40, differences between analytical and simulation results are potentially attributable to errors in the approximation. However, these differences are generally negligible.

Comparison of internal and external reference panels

We report ℙ(C1), the optimality probability for an internal reference panel, and E[S2S1], the expected difference in the number of incorrectly imputed sites between the external and internal reference panels, for a range of panel sizes and for large and small divergence times. The optimality probability ℙ(C1) is interpreted as the probability that a given locus is imputed more accurately when using the internal reference panel compared to the external panel. The relative accuracy for imputation using the external panel compared to the internal panel is quantified by E[S2S1]. A positive value of E[S2S1] indicates that using the external reference panel will, on average, result in more imputation errors than using the internal panel. A negative value indicates that the external panel will result in fewer imputation errors on average, and a value of zero indicates that the two panels are expected to produce the same number of incorrectly imputed sites. E[S2S1] is reported in units of the population-scaled mutation rate θω = 4NAμω for a set of ω imputed bases.

Populations of constant size:

In our analyses of the constant-size model, we assumed a constant, equal size for all populations (N1 = N2 = NA). Note that because we express imputation accuracy in terms of the population-scaled mutation rate θω, only the relative effective sizes of the populations and not their absolute sizes influence the results of our analyses. Under this model, if tD is small, then tD measured in units of 2NA generations is related to the population differentiation statistic Fst through the approximation tD ≈ −log(1 − Fst) ≈ Fst (Cavalli-Sforza 1969; Reynolds et al. 1983). We present results for divergence times of tD = 0.005 (Fst ≈ 0.005) to represent two populations within a continental region, and tD = 0.05 (Fst ≈ 0.05) to represent more strongly diverged populations.

Figure 4A shows ℙ(C1) for a range of reference panel sample sizes n1 and n2 at the smaller divergence time of tD = 0.005. Each curve corresponds to a fixed external reference panel size n2, and ℙ(C1) is plotted as a function of the internal reference panel size n1. ℙ(C1) exceeds 0.5 over most of the range of values for n1 and n2, indicating that the internal panel is likely be optimal even when it is much smaller than the external panel. For example, ℙ(C1) > 0.7 for internal reference panels of n1 ≥ 400 haplotypes even when considering an extremely large external panel of n2 = 10,000 haplotypes (light blue curve).

Figure 4

Imputation performance for the constant-size two-population model. For two different divergence times tD, the figure shows the probability ℙ(C1) that the internal reference panel is optimal and the expectation E[S2S1] of the number of additional imputation errors made when imputing using the external reference panel rather than the internal reference panel. (A) ℙ(C1), tD = 0.005. (B) ℙ(C1), tD = 0.05. (C) E[S2S1], tD = 0.005. (D) E[S2S1], tD = 0.05. E[S2S1] is reported in units of the population-scaled mutation rate θω = 4NAμω for the imputed region of ω bases. Reference panel size is the number of haplotypes in the panel. For clarity, the scale of C and D differs from that of A and B.

Figure 4B shows ℙ(C1) for the same values of n1 and n2 at the larger divergence time of tD = 0.05. For fixed values of n1 and n2, ℙ(C1) is greater when tD = 0.05 than when tD = 0.005, indicating an increase in the probability of optimality for an internal panel at larger divergence times. Taken together, Figure 4, A and B, shows that under our model, smaller internal reference panels are more likely to be optimal than much larger external panels and that this improvement increases as the two source populations for the panels become more genetically distinct.

Next, we compute the expected difference E[S2S1] in the number of incorrectly imputed sites between internal and external reference panels to quantify the improvement in accuracy (Figure 4C, tD = 0.005; Figure 4D, tD = 0.05). Increasing the size n1 of the internal panel while holding the external panel size n2 fixed results in an increase in E[S2S1]. Conversely, increasing n2 while holding n1 fixed leads to a decrease in E[S2S1]. Because the internal reference panel size has no effect on E[S2] and the external panel size does not affect E[S1], an increase in E[S2S1] for fixed n1 represents an increase in E[S2], while an increase of E[S2S1] for fixed n2 indicates a decrease of E[S1]. The model thus predicts the intuitive result that increasing the size of a reference panel (internal or external) leads to fewer imputation errors. Increasing the number of haplotypes in a reference panel has the largest improvement when the panel is initially small. For example, once the internal panel has reached a certain size (n1 ≈ 30 for tD = 0.005), including additional haplotypes in the panel produces only small increases in accuracy. Similarly, the addition of haplotypes to the external panel yields the greatest improvement in accuracy when the external panel is small; hence, a gap is visible between the lines for n2 = 50 (orange) and n2 = 100 (purple) in Figure 4, C and D, while the lines for n2 = 500 (red) and n2 = 10,000 (light blue) are nearly indistinguishable.

The magnitude of the divergence time affects the expected accuracies of the reference panels. Comparison of Figure 4, C and D, shows that the performance of the internal reference panel relative to the external panel improves for the larger divergence time. For example, at the smaller divergence time, n1 = 67 haplotypes are required in the internal panel to acquire the same expected imputation accuracy as an external panel of n2 = 100 haplotypes (Figure 4C, purple line). For the larger divergence time, an internal reference panel need only contain n1 = 17 haplotypes to provide the same expected accuracy as n2 = 100 haplotypes in an external panel (Figure 4D, purple line).

The results from Figure 4 can be directly interpreted in terms of events in the coalescent model. ℙ(C1) increases with tD for fixed n1 and n2 because increasing the divergence time between the populations lengthens the amount of time that the target haplotype T remains in population 1, where it can coalesce only with type 1 lineages. In fact, for large divergence times or large n1, the target T is likely to coalesce before reaching the ancestral population, explaining why ℙ(C1) ≈ 1 nearly independently of n2 for larger tD. E[S2S1] levels off quickly as n1 and n2 increase because the expected time until a single lineage in a diploid population of constant size N coalesces with one of ℓ other lineages is 4N/(ℓ + 1) (Fu and Li 1993). Thus, for our parameter values, the expected time until T coalesces with a type 1 lineage is E[T1] = 4N/(n1 + 1) and the expected time until T coalesces with a type 2 lineage is E[T2]=tD+jD=1n2hn2,jD(tD;N)[4N/(jD+1)], where jD is the number of lineages from population 2 that remain at time tD. The quantity 4N/(ℓ + 1) is small even when ℓ is only moderately large. Hence, changes in E[T1] and E[T2] with respect to n1 and jD, respectively, are small once n1 or jD exceeds around 50 lineages. Because jD increases quickly with n2 for small tD, both E[T1] and E[T2] change little once n1 and n2 are moderate in size. E[S2S1], which is proportional to E[T2] − E[T1], therefore changes little as well.

Exponentially growing populations:

We next examine ℙ(C1) and E[S2S1] under a model of exponential growth in populations 1 and 2. Here we assume that both populations have size NA at the divergence time tD and size 100NA in the present. We again show results for divergence times of tD = 0.005 and 0.05, measured in units of 2NA generations.

Figure 5A compares ℙ(C1) under the constant-size model (solid lines) and exponential growth model (dashed lines) at the smaller divergence time of tD = 0.005. For fixed n1 and n2, the value of ℙ(C1) under the growth model is less than the corresponding ℙ(C1) value for the constant model. For instance, ℙ(C1) is 0.68 for n1 = 200 and n2 = 500 under the constant-size model, but it falls to 0.35 for the growth model. The differences in accuracy between the growth and constant-size models are similar although less pronounced for the larger divergence time (Figure 5B). Thus, recent exponential growth in the populations from which the reference and target haplotypes are sampled has the effect of reducing the optimality of the internal reference panel.

Figure 5

Imputation performance for the exponential-growth two-population model. For two different divergence times tD, the figure shows the probability ℙ(C1) that the internal reference panel is optimal and the expectation E[S2S1] of the number of additional imputation errors made when imputing using the external reference panel rather than the internal reference panel. Values for the exponential growth model are plotted with dashed lines and, for comparison, the corresponding values for a constant-size model are shown with solid lines. (A) ℙ(C1), tD = 0.005. (B) ℙ(C1), tD = 0.05. (C) E[S2S1], tD = 0.005. (D) E[S2S1], tD = 0.05. E[S2S1] is reported in units of the population-scaled mutation rate θω = 4NAμω for the imputed region of ω bases. Reference panel size is the number of haplotypes in the panel. For clarity, the scale of C and D differs from that of A and B.

The expected difference E[S2S1] in imputation accuracy between the two panels is also affected by exponential growth. When the divergence time is small (tD = 0.005), the expected difference in accuracy for a given n1 and n2 decreases very slightly under exponential growth (Figure 5C). The change in E[S2S1] between the constant-size and growth models is more extreme at the larger divergence time (Figure 5C). Thus, imputation accuracy for an external reference panel relative to an internal panel improves in the exponential growth model. Recall that for the larger divergence time, only n1 = 17 internal haplotypes were required to achieve the same expected accuracy as n2 = 100 external haplotypes under the constant model. Under the exponential growth model, the number of internal haplotypes needed to match the performance of the n2 = 100 external haplotypes increases to n1 = 46. Although smaller internal reference panels can still achieve accuracy similar to that of larger external panels in the presence of exponential growth, the number of internal haplotypes required to do so increases.

The effects of growth can be understood using intuition about the coalescent model. Including exponential growth in our coalescent model increases the mean waiting time for coalescent events compared to the constant-size case (Figure 6). Therefore, the probability that the target T coalesces more recently than the divergence time tD decreases, and the number of type 2 lineages jD that enter into the ancestral population increases. Both of these factors increase the probability that T will survive to the ancestral population and, therefore, the probability that T will coalesce first with a type 2 or type 1–2 lineage. This explains the reduction in ℙ(C1) observed for the growth model. Similarly, the decrease in E[S2S1] under the exponential growth model compared to the constant-size model can be explained by the longer coalescent waiting times. The expected waiting time E[T1] until T coalesces with a type 1 lineage increases; however, the expected waiting time E[T2] until T coalesces with a type 2 or type 1–2 lineage decreases because the longer waiting times in population 2 result in a larger number of type 2 lineages surviving to the ancestral population. Both the increase in E[T1] and the decrease in E[T2] lead to a reduction in E[S2S1] since E[S2S1] ∝ E[T2] − E[T1].

Figure 6

The effect of population growth on coalescent waiting times. Increasing the present-day size N(0) of a population while holding the size N(tD) at time tD fixed increases the mean waiting time for each coalescence event.

Discussion

We have introduced a theoretical model of genotype imputation accuracy, employing the coalescent framework to model the ancestry of an imputation target haplotype and a set of reference haplotypes. We examined an imputation algorithm that chooses the reference haplotype with the most similar genealogical history to the target as the template for imputation, and we used the expected coalescence time between the target and template haplotypes to predict the expected number of incorrectly imputed sites in the target.

Framing imputation in a coalescent model has two major benefits. First, the coalescent model enables the derivation of analytical formulas for imputation accuracy. These formulas provide a computationally fast method for studying accuracy across a range of imputation study design variables, and they facilitate the interpretation of observed relationships between imputation accuracy and demographic parameters in terms of well-understood properties of the coalescent model. Second, the coalescent allows complex modeling of population histories, so that a variety of relationships between the reference and target populations can be considered. Here we presented a simple model with a single study population and a single external population. However, this basic model can be extended to more complicated imputation scenarios by specifying different demographic settings for the coalescent. For example, the model can be reformulated to include multiple external populations, each with a unique number of available haplotypes and a distinctive divergence time from the study population. Equations analogous to the ones derived in this article can be computed for the more complicated model and used to determine the optimal imputation reference panel when several external panels are available.

We used the model to study the effect of population subdivision on imputation accuracy, employing a two-population divergence model to compare internal reference panels drawn from the same source population as the target to external panels from a distinct population. To quantify imputation accuracy, we focused on two quantities: (1) ℙ(C1), the probability that a target lineage on which genotypes are to be imputed coalesces first with a lineage from the internal panel, and (2) E[S2S1], the expected difference in the number of imputation errors between the external and internal reference panels.

We have interpreted ℙ(C1) as the probability that the internal reference panel is optimal and results in fewer expected imputation errors at a locus than the external panel. ℙ(C1) has two additional interpretations. First, a reasonable imputation strategy is to augment an available external panel with internal reference haplotypes. In this setting, ℙ(C1) is the probability that the target lineage will coalesce first with one of the additional internal lineages. Thus, ℙ(C1) can be interpreted as the probability that imputation accuracy improves by augmenting the existing external reference panel with internal haplotypes. A second alternative interpretation of ℙ(C1) is obtained by considering imputation on a genome-wide scale, rather than at a single locus. In a genome-wide context, ℙ(C1) can be viewed as the fraction of sites in the genome that are more accurately imputed by the internal reference panel than by the external reference panel.

The model predicts that even when an internal reference panel is considerably smaller than an external panel, the internal panel is nearly always optimal in the sense that it contains the haplotype with the closest genealogical history to the target. Furthermore, the probability of optimality for the internal panel, ℙ(C1), increases as the divergence time increases. For populations of constant size, a large external reference panel can provide approximately the same accuracy as a modestly sized internal reference panel if the divergence time is small (E[S2] ≈ E[S1]). As the divergence time increases, a small internal reference panel results in increasingly more accurate imputation compared to the large external panel. The expected improvement in imputation accuracy for a smaller internal panel becomes less pronounced if the populations experience exponential growth following the divergence. Thus, exponential growth attenuates the effect of the divergence time, improving the relative performance of an external reference panel in terms of both optimality probability and expected imputation accuracy when compared to populations of constant size.

The results from our model have implications for imputation strategies in population-based genetic association studies. Currently, large public data sets from the HapMap (International HapMap3 Consortium, 2010) and 1000 Genomes (1000 Genomes Project Consortium, 2010) projects are frequently used as external reference panels. However, advances in sequencing technology will enable investigators to create custom internal reference panels from the same source population as their study samples. Not only will custom internal panels enable successful imputation of rare variants private to the target population, our model predicts that a custom internal reference panel will often improve imputation accuracy in general, even if it is much smaller than an existing external reference panel. Under the model, the best strategy is to combine the internal haplotypes with an available external panel to create a single cosmopolitan reference panel. This approach combines the benefits contributed by the large sample size of the external panel and the greater genetic similarity of the internal panel.

Our equations for panel optimality and imputation accuracy rely on a rule that mimics computational imputation algorithms: the reference haplotype whose coalescence time with the target is minimal serves as the imputation template. In actual data, this rule might not always hold, as stochasticity of mutations and the use of small sets of “tagging” SNPs for estimating pairwise distances among haplotypes could cause a sequence whose coalescence time with the target is not minimal to be most genetically similar to the target. The problem may be more pronounced for rare variants that do not exist on a unique background of tagging SNPs. In addition, the rule for template selection might not be strictly followed by imputation software. We have assumed that the entire length of the target haplotype is imputed using the same reference haplotype; however, owing to past recombination events, a real target haplotype is likely to be composed of multiple segments, each with a distinct optimal template. Our model, therefore, implicitly assumes that an imputation algorithm will correctly jump between reference haplotypes when imputing a target. Deviations from this assumption provide a source of imputation error that was not treated in our analysis. Furthermore, as our analysis considered known haplotypes, we did not explicitly account for haplotype phasing errors. Haplotyping accuracy is known to increase with sample size, so that small internal reference panels may be more prone than large external panels to haplotyping errors that reduce imputation accuracy (Browning and Browning 2011). Consequently, our results can be interpreted as an approximation to the imputation error owing to use of a particular reference panel, without considering stochasticity in the choice of template haplotype and without considering phasing errors or errors that might occur from the imputation algorithm. In future research, it will be important to examine factors such as stochasticity in mutation, properties of sets of tagging SNPs, and recombination.

We have assumed that no migration occurs between the two populations in our model. Relaxing this assumption will likely reduce the magnitude of the difference in accuracy obtained on the basis of the internal and external panels. Migration allows the target haplotype to coalesce with a lineage ancestral to the external reference panel more recently than the divergence time, either if the target migrates to the population containing the external reference panel or if an ancestor of an external reference haplotype migrates to the population containing the target. Therefore, including migration in the model could reduce the expected coalescence time between the target and an ancestor of the external reference panel, leading both to an increase in accuracy and to an increase in the probability of optimality for an external reference panel. Determining rates of migration that lead to a noticeable effect requires further investigation.

Despite the fact that we have presented a simplified model without recombination or migration, our results provide mathematical formulas that allow us to estimate imputation accuracy for a variety of demographic and sampling scenarios. Our equations capture a variety of phenomena pertaining to imputation accuracy, and they facilitate the interpretation of these phenomena in terms of properties of the coalescent model. Genotype imputation is a valuable tool in genetic studies of complex disease, and optimizing imputation accuracy is important for conducting analyses with imputed data. The formulas we have derived are a step toward the development of more complicated models that can be used to make practical quantitative predictions about imputation accuracy, thereby facilitating sampling design.

Acknowledgments

The authors thank E. Buzbas, S. Gopalakrishnan, L. Huang, and two anonymous reviewers for helpful comments. Support was provided by National Institutes of Health grants R01 GM081441, R01 HG005855, and T32 HG000040, and by a grant from the Burroughs Wellcome Fund.

Appendix: The Quantity N(iD, jDiC, jC, kC)

Here we derive the number of ways N(iD, jDiC, jC, kC) in which iD type 1 lineages and jD type 2 lineages can coalesce to iC, jC, and kC lineages of types 1, 2, and 1–2. This quantity is used to obtain the closed forms of the probabilities ℙ(C1), ℙ(C2), and ℙ(C12) (Equations 1, 9, and 10).

We first note that if kC type 1–2 lineages remain, then at least kC type 1 lineages, and at least kC type 2 lineages, must coalesce together to produce these lineages. Let iD* and jD* be the numbers of type 1 and type 2 lineages, respectively, that combine to create the kC lineages of type 1–2 (Figure A1). Further, let iDr* type 1 lineages and jDr* type 2 lineages combine to produce the rth type 1–2 lineage. The possible values of iD1*,,iDkD* are given by all possible partitions of iD* objects into kC nonempty subsets. Similarly, the possible values of jD1*,,jDkC* are given by all possible partitions of jD* objects into kC nonempty subsets.

Let ψ(n, k) denote the number of partitions of an integer n into k positive integers (Andrews 1984, p. 16). Let πq(n,k)=(π1q(n,k),,πkq(n,k)) denote the qth partition of this kind, with πrq(n,k) denoting the rth part in the partition. We can permute the k parts of the qth partition in k! ways. Denote the zth permutation of partition q by π(q,z)(n,k)=(π1(q,z)(n,k),,πk(q,z)(n,k)). For simplicity of notation, denote the number of labeled histories among a set of n lineages—the number of sequences in which n lineages can coalesce to one lineage—by F(n) ≡ In,1, where In,1 is computed using Equation 3. Then the quantity N(iD, jDiC, kC, jC) is given byN(iD,jDiC,kC,jC)=iD*=kCiDiCjD*=kCjDjC(iDiD*)(jDjD*)η=1ψ(iD*,kC)γ=1ψ(jD*,kC)α(iD*,kC,η)α(jD*,kC,γ)IiDiD*,iCIjDjD*,jC×R(iD*,jD*,kC,η,γ)(iD+jD(iC+kC+jC)iDiD*iC,jDjD*jC,iD*+jD*kC).(A1)

The quantity α(n, k, q) is the number of ways to distribute n distinguishable objects into k unordered, nonempty subsets whose sizes are those of the parts of the qth partition πq(n, k). To obtain an expression for α(n, k, q), define a(φ; πq(n, k)) to be the number of parts of the partition πq(n, k) that have size φ. Then α(n, k, q) is given byα(n,k,q)=(nπ1q(n,k),,πkq(n,k))φ=1ka(φ;πq(n,k))!,(A2)where (π1q(n,k),n,πkq(n,k)) is the number of ways to choose the elements in the k parts and a(φ; πq(n, k))! is the number of ways to permute the parts of size φ. φ=1ka(φ;πq(n,k))! is the factor by which we overcount α(n, k, q) because we take the partitions to be unordered, and there are a(φ; πq(n, k))! arrangements of the subsets of size φ in which the same elements in these subsets are grouped together.

Figure A1

An illustration of the quantity N(iD, jDiC, jC, kC). One possible way in which iD = 15 type 1 lineages, jD = 12 type 2 lineages, and one target lineage T can coalesce to iC = 2 type 1 lineages, jC = 2 type 2 lineages, and kC = 3 type 1–2 lineages. Type 1 lineages are in red, type 2 lineages are in blue, and type 1–2 lineages are in orange. In the scenario pictured, lineage T first coalesces with a lineage of type 2. Here Formula is the number of type 1 lineages that coalesce with Formula type 2 lineages to produce the kC = 3 type 1–2 lineages. Formula is the number of type 1 lineages that combine with Formula type 2 lineages to create the first type 1–2 lineage. In general, Formula is the number of type 1 lineages that coalesce with Formula type 2 lineages to produce the rth type 1–2 lineage.

In Equation A1, IiDiD*,iC is the number of labeled histories for the iDiD* lineages that coalesce to form type 1 lineages, and it is computed using Equation 3. Similarly, IjDjD*,jC is the number of labeled histories for the jDjD* lineages that coalesce to form type 2 lineages.

The quantity R(i, j, k, η, γ) in Equation A1 is the number of labeled histories for the iD*+jD* lineages that ultimately coalesce to form the kC lineages of type 1–2. Given iD* and jD*, consider a particular partition of the iD* lineages into kC nonempty parts, and a particular partition of the jD* lineages into kC nonempty parts. Each one of the kC type 1–2 lineages is made by combining a part from the partition of the iD* lineages with a part from the partition of the jD* lineages. To find all possible ways to pair up parts, we fix the indices of the parts of the jD* lineages and we permute the parts of the iD* lineages. There are kC! ways to pair up the parts. We index these ways by z. For the zth way of permuting the parts, the lineages in part πr(η,z)(iD*,kC) combine with the lineages in part πrγ(jD*,kC) to produce the rth lineage of type 1–2.

The rth pair of parts of lineages undergoes πr(η,z)(iD*,kC)+πrγ(jD*,kC)1 coalescence events on its way down to a single lineage. Thus, there areW(iD*,jD*,kC,π(η,z)(iD*,kC),πγ(jD*,kC))(iD*+jD*kCπ1(η,z)(iD*,kC)+π1γ(jD*,kC)1,,πkC(η,z)(iD*,kC)+πkCγ(jD*,kC)1)(A3)possible ways to order the coalescence events among all pairs of parts.

Because there are F(πr(η,z)(iD*,kC)+πrγ(jD*,kC)) labeled histories for the rth pair of parts as they coalesce to form the rth lineage of type 1–2, there are W(iD*,jD*,kC,π(η,z)(iD*,kC),πγ(jD*,kC)) r=1kCF(πr(η,z)(iD*,kC)+πrγ(jD*,kC)) labeled histories for all of the iD* and jD* lineages when paired in this way. Finally, summing over all kC! possible ways to permute the partitions of the iD* lineages with respect to the partitions of the jD* lineages,R(iD*,jD*,kC,η,γ)=z=1kC!W(iD*,jD*,kC,π(η,z)(iD*,kC),πγ(jD*,kC))×r=1kCF(πr(η,z)(iD*,kC)+πrγ(jD*,kC)).(A4)We have separately considered three parts of the labeled history of the lineages in the ancestral population: (1) the labeled history of the iDiD* lineages that coalesce to form type 1 lineages, (2) the labeled history of the jDjD* lineages that coalesce to form type 2 lineages, and (3) the labeled history of the iD*+jD* lineages that coalesce to form type 1–2 lineages. To combine these components into one full history for all lineages, we must consider only how the coalescence times in each of these components relate to the coalescence times in the other components. Thus, the final quantity in Equation A1 is the number of ways to interweave the coalescence events in these labeled histories. There are iDiD*iC coalescence events among the lineages that ultimately have type 1, jDjD*jC coalescence events among the lineages that ultimately have type 2, and iD*+jD*kC coalescence events among the lineages that ultimately have type 1–2. The number of ways to interweave the coalescences for these three histories is equal to the number of ways to choose which of the iD+jDiCjCkC total coalescence events correspond to events within each of these different histories. This quantity is the trinomial coefficient

(iD+jDiCkCjCiDiD*iC,jDjD*jC,iD*+jD*kC).(A5)

Footnotes

  • Communicating editor: Y. S. Song

  • Received December 19, 2011.
  • Accepted May 4, 2012.

Available freely online through the author-supported open access option.

Literature Cited

View Abstract