Abstract
Genealogies from rapidly growing populations have approximate “star” shapes. We study the degree to which this approximation holds in the context of estimating the time to the most recent common ancestor (T_{MRCA}) of a set of lineages. In an exponential growth scenario, we find that unless the product of population size (N) and growth rate (r) is at least ∼10^{5}, the “pairwise comparison estimator” of T_{MRCA} that derives from the star genealogy assumption has bias of 10-50%. Thus, the estimator is appropriate only for large populations that have grown very rapidly. The “tree-length estimator” of T_{MRCA} is more biased than the pairwise comparison estimator, having low bias only for extremely large values of Nr.
A fundamental development in population genetics has been the recognition that the pattern of genetic variation in a set of sampled sequences is heavily affected by the particular genealogy of the lineages. In general, however, this underlying genealogy is unknown. To account for the effect of the genealogy in analyses of population genetic data, it is useful to consider “random genealogies” that are consistent with the data and to average over many such genealogies. The coalescent framework provides a natural way to construct these random genealogies under various assumptions about the demography of populations (Donnelly 1996; Nordborg 2001).
Use of the coalescent to model unknown genealogies sometimes leads to intensive computations in statistical inference from data (Stephens 2001). Thus, for ease of computation, analyses can be conditioned on assumed genealogical shapes that are specified prior to analysis or that are inferred from genetic data. Such methods often ignore uncertainty about the exact shape of the genealogy, producing potentially biased estimates with misleadingly small confidence intervals (Slatkin and Rannala 2000; Rannala and Bertorelle 2001; Rosenberg and Nordborg 2002). When methods based on assumed genealogies are applied, it is important to quantify associated limitations.
The “star-shaped” or “maypole” genealogy (Figure 1A) is perhaps the simplest type of genealogy and the easiest to analyze, as it has the only shape for which all lineages experience independent evolution. In a star-shaped genealogy, sampled lineages provide independent replicates of the evolutionary process since the time of their most recent common ancestor (MRCA). Slatkin and Hudson (1991) found that genealogies of samples taken from populations growing exponentially in size tend to be “star-like,” much more so than genealogies from constant-sized populations (Figure 1; see also Donnelly 1996; Nordborg 2001). Because many human populations have experienced rapid population growth, star-shaped genealogies have been explicitly assumed in diverse analyses of human genetic data (Risch et al. 1995; Thomas et al. 1998; McPeek and Strahs 1999; Reich and Goldstein 1999; Liu et al. 2001; Stumpf and Goldstein 2001, for example). Additionally, star-shaped genealogies are implicit in methods of analysis that treat “unrelated” individuals as independent trials from a population with specified allele frequencies or other parameters.
Here we determine the degree to which the assumption of a star-shaped genealogy is appropriate for a sample taken from an exponentially growing population. Importantly, the error introduced by the assumption depends on the nature of the eventual calculation that will be performed conditional on the star-shaped genealogy. Slatkin (1996) defined a “stellate index” to quantify the degree to which a given genealogy resembles a star-shaped genealogy and considered properties of this index under various population models. Our goal is different, in that we aim to determine biases of estimators that result from assuming that genealogies of samples taken from exponentially growing populations are star shaped. For this purpose, we find that other quantities are more natural than the index of Slatkin (1996). We focus on estimation of the time to the MRCA of a set of sampled lineages.
Consider the genealogy of a set of n sampled lineages at a nonrecombining locus. Properties of this genealogy include the time to the MRCA of the sample (T_{n} or T_{MRCA}), the total length of all branches of the genealogy (L_{n}), and the average coalescence time of a pair of sampled lineages (P_{n}). Ratios of these quantities can be used to explore shapes of genealogies under various demographic models (Slatkin 1996; Uyenoyama 1997; Schierup and Hein 2000).
Suppose that an estimate for T_{n} is desired. Under the assumption of a star-shaped genealogy, P_{n} and T_{n} are equivalent. Because unbiased estimates of the coalescence time of a pair of lineages can frequently be obtained (Tajima 1983, for example), P_{n} is estimated as the average of estimated pairwise coalescence times, over all pairs of lineages. This idea underlies methods summarized by Stumpf and Goldstein (2001), in which P_{n} is estimated under a stepwise mutation model (either by comparing pairs of lineages or by comparing each lineage to a putative ancestral type), and \hat{P}_{n} is used as the estimator of T_{n}. This “pairwise comparison estimator” of T_{n} is unbiased only if the sample has a star-shaped genealogy (or if n = 2).
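As a minimal sketch of the pairwise comparison idea, the following Python function averages per-pair time estimates over all pairs of lineages. It assumes a symmetric single-step stepwise mutation model at a known rate `mu` per generation, under which a pair with coalescence time t satisfies E[(a_i - a_j)^2] = 2·mu·t. The function name, its arguments, and this particular per-pair estimator are illustrative assumptions, not the exact procedure of any cited method.

```python
from itertools import combinations


def pairwise_comparison_estimate(repeat_counts, mu):
    """Average of per-pair coalescence time estimates, in generations.

    Assumption (hypothetical setup): symmetric single-step stepwise
    mutation at rate mu per generation, so (a_i - a_j)^2 / (2 * mu)
    is an unbiased estimate of the coalescence time of pair (i, j).
    """
    if mu <= 0:
        raise ValueError("mu must be positive")
    pairs = list(combinations(repeat_counts, 2))
    if not pairs:
        raise ValueError("need at least two lineages")
    # Sum of squared allele-size differences over all pairs of lineages.
    total_sq_diff = sum((a - b) ** 2 for a, b in pairs)
    # Mean squared difference divided by 2 * mu gives the estimate of P_n.
    return total_sq_diff / (2.0 * mu * len(pairs))
```

Used as an estimator of T_{n}, the returned value inherits whatever gap exists between P_{n} and T_{n} in the true genealogy.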
In general, bias of this estimator of T_{n} is downward, because the average pairwise coalescence time is always less than or equal to the overall coalescence time. In a constant-sized population of haploid size N, the bias of the pairwise estimator can be considerable. If we treat P_{n} and T_{n} as functions of population size N and growth rate r and measure them in units of N generations, then instead of P_{n}/T_{n} = 1 as in a star-shaped genealogy, the constant model yields
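The size of this departure can be sketched from standard constant-size coalescent expectations. The derivation below writes W_{k} for the time during which the sample has k ancestral lineages, in units of N generations, and reports a ratio of expectations; that this is the intended form of the comparison is an assumption.

```latex
\begin{align*}
E[W_k] &= \frac{2}{k(k-1)}, \qquad k = 2, \ldots, n, \\
E[T_n] &= \sum_{k=2}^{n} E[W_k] = 2\left(1 - \frac{1}{n}\right), \\
E[P_n] &= E[T_2] = 1, \\
\frac{E[P_n]}{E[T_n]} &= \frac{n}{2(n-1)}.
\end{align*}
```

The ratio equals 1 for n = 2 and falls toward 1/2 as n grows, so in this sketch the pairwise average captures only about half of the expected time to the MRCA in large samples.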
Figure 2 demonstrates that the bias of the pairwise comparison estimator increases with sample size, decreases with growth rate, and decreases with population size. To explain the perhaps surprising dependence on sample size, note that
The dependence on r and N can be understood as follows. In the constant population size model, denote the random time to the coalescence of k to k - 1 lineages by W_{k}, measured in units of N generations. For k = 1, 2,..., n - 1, let X_{k} (also in units of N generations) denote the total time elapsed in the coalescence of n lineages to k lineages, so that X_{k} = W_{n} + W_{n-1} +... + W_{k+1}. The corresponding time that it takes for n lineages to coalesce to k lineages in the exponential growth model is obtained from g^{-1}(X_{k}), where
As in the constant population size case,
For rapidly growing large populations, the bias of the pairwise comparison estimator may be 10% or less (Figure 2). Estimated growth rates for periods of exponential growth of various human populations range from 0.001 to 0.02 per generation (Pritchard et al. 1999; Thomson et al. 2000). Thus, for groups with sufficiently large N, the star-shaped genealogy might lead to nearly unbiased estimation of T_{n}. However, under the exponential growth model, N equals the current census population size only if the variance of reproductive success equals 1 (other properties of populations are also incorporated into the parameter N; see Nordborg 2001 and Nordborg and Krone 2002). Estimates of N for human populations under exponential growth models are considerably smaller than census sizes (Pritchard et al. 1999; Thomson et al. 2000, for example). For human groups, it is questionable whether N is large enough for star-shaped genealogies to be applied to estimation of T_{n}.
For small populations the bias of the pairwise comparison estimator is particularly large. For Nr < 100 the bias for a sample of reasonable size will be considerable, >20%. Thus, in small populations, even if they have expanded exponentially, the star-shaped genealogy assumption cannot substitute for genealogical modeling of the data; schemes that explicitly account for uncertainty in the genealogy (for reviews, see Rosenberg and Feldman 2002; Tang et al. 2002) are likely more appropriate. For estimating T_{MRCA}, population sizes and growth rates of relatively small groups such as Jewish priests (Thomas et al. 1998) may be too small to produce approximate star-shaped genealogies. In these groups it is probable that the pairwise comparison estimator underestimates coalescence times, and use of the estimator should be accompanied by quantification of its bias.
A further problem with this estimation procedure is that on the assumption of a star-shaped genealogy, the variance of the pairwise comparison estimator is typically underestimated, as its calculation ignores uncertainty associated with not knowing the genealogy. Under exponential growth, P_{n}/T_{n} can be quite variable (Figure 3), compared to its constant value of 1 in the star genealogy model. By assuming that P_{n} and T_{n} are equal, the pairwise comparison estimator ignores inherent variation in the relationship between these two quantities, which exists even if N and r are known exactly. Of course, all model-based procedures experience problems similar to this limitation of the star genealogy model. The variance of estimators is typically evaluated conditional on a model, such as the star genealogy model or the constant population size model; uncertainty associated with not knowing the model is difficult to incorporate into the calculation of confidence intervals.
The variance of P_{n}/T_{n} is larger for smaller values of Nr. Thus, as Nr decreases, not only does P_{n} move farther away from T_{n}, but also T_{n} becomes harder to predict from P_{n} (Figure 3). Only for Nr > ∼10^{5} is it nearly certain that T_{n} < 1.25P_{n} (Figure 3C).
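The behavior of P_{n}/T_{n} described above can be explored with a small simulation. The Python sketch below draws constant-size coalescence times W_{k}, accumulates them, and maps the cumulative times to the exponential growth model. It assumes the standard time change g^{-1}(x) = ln(1 + Nr·x)/(Nr) in units of N generations; that form, and all function names, are assumptions made for illustration rather than a reproduction of the computations behind the figures.

```python
import math
import random


def to_growth_time(x, Nr):
    """Map cumulative constant-size time x to exponential-growth time.

    Assumed time change: g^{-1}(x) = ln(1 + Nr * x) / Nr, with time in
    units of N generations. Nr = 0 gives the constant-size model.
    """
    if Nr == 0:
        return x
    return math.log1p(Nr * x) / Nr


def simulate_genealogy(n, Nr, rng):
    """Simulate one genealogy; return (P_n, T_n, L_n) in units of N generations."""
    sizes = [1] * n    # sampled tips descending from each ancestral lineage
    x = 0.0            # elapsed time on the constant-size scale
    t_prev = 0.0       # elapsed time on the growth scale
    pair_sum = 0.0     # sum of coalescence times over all pairs of tips
    length = 0.0       # total branch length of the genealogy
    while len(sizes) > 1:
        k = len(sizes)
        x += rng.expovariate(k * (k - 1) / 2.0)   # W_k has rate k(k-1)/2
        t = to_growth_time(x, Nr)
        length += k * (t - t_prev)                # k branches span this interval
        i, j = sorted(rng.sample(range(k), 2))    # lineages that coalesce
        pair_sum += sizes[i] * sizes[j] * t       # tip pairs joined at time t
        sizes[i] += sizes[j]
        sizes.pop(j)
        t_prev = t
    return pair_sum / (n * (n - 1) / 2.0), t_prev, length


def summarize(n, Nr, replicates=2000, seed=1):
    """Return (mean P_n / mean T_n, fraction of runs with T_n < 1.25 * P_n)."""
    rng = random.Random(seed)
    sum_p = sum_t = 0.0
    close = 0
    for _ in range(replicates):
        p, t, _length = simulate_genealogy(n, Nr, rng)
        sum_p += p
        sum_t += t
        close += t < 1.25 * p
    return sum_p / sum_t, close / replicates


if __name__ == "__main__":
    for Nr in (0, 1e2, 1e5):
        print(Nr, summarize(n=10, Nr=Nr))
```

Tracking the number of sampled tips under each ancestral lineage lets the average pairwise coalescence time be accumulated without building an explicit tree: when lineages with a and b descendant tips merge at time t, exactly a·b pairs of tips coalesce at t.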
Other estimators based on the assumption of star-shaped genealogies may suffer from more severe bias than the pairwise comparison estimator of T_{MRCA}, because other genealogical ratios decline more rapidly with sample size than does P_{n}/T_{n}. Under the assumption of a star-shaped genealogy, an alternate estimator of T_{MRCA} is the “tree-length estimator” or the estimated total branch length of the genealogy divided by n (Karn et al. 2002, for example). Unbiased estimators of the total branch length L_{n} can be obtained, for example, under the infinite-sites model, from the number of polymorphic sites observed in a data set divided by the mutation rate. To evaluate the bias of the tree-length estimator, we must consider the ratio of L_{n}/n to T_{n}, a ratio that equals 1 for a star-shaped genealogy.
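The tree-length estimator as described can be written in a few lines. In this sketch, `mu` is assumed to be the mutation rate per generation for the whole locus under the infinite-sites model, and the function and parameter names are hypothetical.

```python
def tree_length_estimate(num_polymorphic_sites, mu, n):
    """Estimate T_MRCA in generations as (S / mu) / n.

    S / mu estimates the total branch length L_n under the infinite-sites
    model; dividing by n assumes a star-shaped genealogy, in which each
    of the n branches has length T_MRCA.
    """
    if n < 2:
        raise ValueError("need at least two lineages")
    if mu <= 0:
        raise ValueError("mu must be positive")
    estimated_total_length = num_polymorphic_sites / mu
    return estimated_total_length / n
```

Because L_{n} can never exceed n·T_{n}, this quantity cannot overestimate T_{n} when the estimate of L_{n} is exact; any departure from a star shape pulls it downward.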
Under the constant-sized population model, we have (using Equation 2 of Tavaré et al. 1997)
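For orientation, the corresponding ratio of expectations can be sketched from standard constant-size coalescent results, with W_{k} the time during which the sample has k ancestral lineages in units of N generations. This sketch is not necessarily the form of the cited equation.

```latex
\begin{align*}
E[L_n] &= \sum_{k=2}^{n} k\,E[W_k] = 2\sum_{j=1}^{n-1} \frac{1}{j}, \\
E[T_n] &= 2\left(1 - \frac{1}{n}\right), \\
\frac{E[L_n/n]}{E[T_n]} &= \frac{1}{n-1}\sum_{j=1}^{n-1} \frac{1}{j}.
\end{align*}
```

The ratio equals 1 for n = 2 and decays roughly as (ln n)/n, whereas the analogous ratio of expectations for P_{n} approaches 1/2; this is consistent with the tree-length estimator being the more biased of the two.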
The potential error of the star genealogy assumption is perhaps greatest when properties of the genealogy itself, such as T_{n}, are of interest. If the goal of analysis is to compare genealogies for different loci or populations relative to each other, bias may affect estimates similarly and may have a reduced impact, although differences in sample size and population size should be taken into consideration. Also, if the genealogy is treated as a nuisance parameter, such as in fine mapping of disease susceptibility loci, the assumption might not have severe consequences. The success of methods based on the star genealogy assumption in pinpointing previously identified susceptibility genes (Liu et al. 2001, for example) suggests that human genealogies may be sufficiently star-like for mapping of some disorders, although modeling of the dependence among lineages can lead to more accurate positional inference (Morris et al. 2002).
We have seen here that the approximate star-shaped features of genealogies in an exponentially growing population may be insufficient to guarantee low bias in analyses based on the star genealogy assumption, unless the population has grown very rapidly to a very large size. For estimation of T_{n}, the numerical results and approximate expressions shown can guide the use of the assumption. Future uses of star-shaped genealogies in population genetic analysis will benefit from demonstration that the assumption is appropriate in the relevant contexts.
Acknowledgments
We thank H. Innan, P. Marjoram, and M. Nordborg for helpful comments. D. Zulman suggested the term “maypole genealogy.” This research was supported by a National Science Foundation Postdoctoral Fellowship in Bioinformatics to N.A.R. and by National Institutes of Health grant GM28016 to M. W. Feldman.
Footnotes

Communicating editor: M. K. Uyenoyama
 Received August 15, 2002.
 Accepted April 7, 2003.
 Copyright © 2003 by the Genetics Society of America