Genetics, Vol. 164, 1677-1682, August 2003, Copyright © 2003

On the Use of Star-Shaped Genealogies in Inference of Coalescence Times

Noah A. Rosenberga and Aaron E. Hirshb
a Program in Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089
b Department of Biological Sciences, Stanford University, Stanford, California 94305

Corresponding author: Noah A. Rosenberg, 1042 W. 36th Pl., DRB 289, University of Southern California, Los Angeles, CA 90089., noahr{at}usc.edu (E-mail)

Communicating editor: M. K. UYENOYAMA


*  ABSTRACT
*TOP
*ABSTRACT
*LITERATURE CITED

Genealogies from rapidly growing populations have approximate "star" shapes. We study the degree to which this approximation holds in the context of estimating the time to the most recent common ancestor (TMRCA) of a set of lineages. In an exponential growth scenario, we find that unless the product of population size (N) and growth rate (r) is at least ~105, the "pairwise comparison estimator" of TMRCA that derives from the star genealogy assumption has bias of 10–50%. Thus, the estimator is appropriate only for large populations that have grown very rapidly. The "tree-length estimator" of TMRCA is more biased than the pairwise comparison estimator, having low bias only for extremely large values of Nr.


A fundamental development in population genetics has been the recognition that the pattern of genetic variation in a set of sampled sequences is heavily affected by the particular genealogy of the lineages. In general, however, this underlying genealogy is unknown. To account for the effect of the genealogy in analyses of population genetic data, it is useful to consider "random genealogies" that are consistent with the data and to average over many such genealogies. The coalescent framework provides a natural way to construct these random genealogies under various assumptions about the demography of populations (DONNELLY 1996 Down; NORDBORG 2001 Down).

Use of the coalescent to model unknown genealogies sometimes leads to intensive computations in statistical inference from data (STEPHENS 2001 Down). Thus, for ease of computation, analyses can be conditioned on assumed genealogical shapes that are specified prior to analysis or that are inferred from genetic data. Such methods often ignore uncertainty about the exact shape of the genealogy, producing potentially biased estimates with misleadingly small confidence intervals (SLATKIN and RANNALA 2000 Down; RANNALA and BERTORELLE 2001 Down; ROSENBERG and NORDBORG 2002 Down). When methods based on assumed genealogies are applied, it is important to quantify associated limitations.

The "star-shaped" or "maypole" genealogy (Fig 1A) is perhaps the simplest type of genealogy and the easiest to analyze, as it has the only shape for which all lineages experience independent evolution. In a star-shaped genealogy, sampled lineages provide independent replicates of the evolutionary process since the time of their most recent common ancestor (MRCA). SLATKIN and HUDSON 1991 Down found that genealogies of samples taken from populations growing exponentially in size tend to be "star-like," much more so than genealogies from constant-sized populations (Fig 1; see also DONNELLY 1996 Down; NORDBORG 2001 Down). Because many human populations have experienced rapid population growth, star-shaped genealogies have been explicitly assumed in diverse analyses of human genetic data (RISCH et al. 1995 Down; THOMAS et al. 1998 Down; MCPEEK and STRAHS 1999 Down; REICH and GOLDSTEIN 1999 Down; LIU et al. 2001 Down; STUMPF and GOLDSTEIN 2001 Down, for example). Additionally, star-shaped genealogies are implicit in methods of analysis that treat "unrelated" individuals as independent trials from a population with specified allele frequencies or other parameters.



View larger version (8K):
In this window
In a new window
Download PPT slide
 
Figure 1. Shapes of genealogies. (A) Perfect star-shaped genealogy or "maypole genealogy." (B) Example of genealogy simulated using the coalescent process for the exponential growth model with population size 100,000 and growth rate 0.01. (C) Example of genealogy simulated using the coalescent process for the constant population size model. Genealogies are scaled to have the same TMRCA. In B and C, labels are assigned randomly to lineages.

Here we determine the degree to which the assumption of a star-shaped genealogy is appropriate for a sample taken from an exponentially growing population. Importantly, the error introduced by the assumption depends on the nature of the eventual calculation that will be performed conditional on the star-shaped genealogy. SLATKIN 1996 Down defined a "stellate index" to quantify the degree to which a given genealogy resembles a star-shaped genealogy and considered properties of this index under various population models. Our goal is different, in that we aim to determine biases of estimators that result from assuming that genealogies of samples taken from exponentially growing populations are star shaped. For this purpose, we find that other quantities are more natural than the index of SLATKIN 1996 Down. We focus on estimation of the time to the MRCA of a set of sampled lineages.

Consider the genealogy of a set of n sampled lineages at a nonrecombining locus. Properties of this genealogy include the time to the MRCA of the sample (Tn or TMRCA), the total length of all branches of the genealogy (Ln), and the average coalescence time of a pair of sampled lineages (Pn). Ratios of these quantities can be used to explore shapes of genealogies under various demographic models (SLATKIN 1996 Down; UYENOYAMA 1997 Down; SCHIERUP and HEIN 2000 Down).

Suppose that an estimate for Tn is desired. Under the assumption of a star-shaped genealogy, Pn and Tn are equivalent. Because unbiased estimates of the coalescence time of a pair of lineages can frequently be obtained (TAJIMA 1983 Down, for example), Pn is estimated as the average of estimated pairwise coalescence times, over all pairs of lineages. This idea underlies methods summarized by STUMPF and GOLDSTEIN 2001 Down, in which Pn is estimated under a stepwise mutation model (either by comparing pairs of lineages or by comparing each lineage to a putative ancestral type), and n is used as the estimator of Tn. This "pairwise comparison estimator" of Tn is unbiased only if the sample has a star-shaped genealogy (or if n = 2).

In general, bias of this estimator of Tn is downward, because the average pairwise coalescence time is always less than or equal to the overall coalescence time. In a constant-sized population of haploid size N, the bias of the pairwise estimator can be considerable. If we treat Pn and Tn as functions of population size N and growth rate r and measure them in units of N generations, then instead of Pn/Tn = 1 as in a star-shaped genealogy, the constant model yields

(1)

(using Equation 1 of UYENOYAMA 1997 Down). For a large sample from a constant-sized population, Pn underestimates Tn by 50%. Of course, for exponentially growing populations, this bias decreases as the rate of growth increases. To determine the nature of the bias, we simulated the coalescent with exponential growth (SLATKIN and HUDSON 1991 Down; DONNELLY and TAVARE 1995 Down; NORDBORG 2001 Down), with current population size N, sample size n, and exponential growth rate r (so that t generations in the past, population size was N exp[-rt]). For each set of parameter values, we computed the mean value of Pn/Tn over 10,000 simulations.

Fig 2 demonstrates that the bias of the pairwise comparison estimator increases with sample size, decreases with growth rate, and decreases with population size. To explain the perhaps surprising dependence on sample size, note that E[Pn] is constant as a function of n (TAJIMA 1983 Down), whereas E[Tn] increases slowly with n (UYENOYAMA 1997 Down, for example). Thus, the increase in bias with sample size is slow, yet noticeable in the difference between Fig 2A (smaller sample size) and Fig 2B (larger sample size).



View larger version (47K):
In this window
In a new window
Download PPT slide
 
Figure 2. Expected values of the ratio of the average pairwise coalescence time to the time to the most recent common ancestor, or E[Pn/Tn]. (A) Simulations with samples of size 5. (B) Simulations with samples of size 25. (C) Comparison of simulations with closed-form approximation (4), n = 5. (D) Comparison of simulations with closed-form approximation (4), n = 25. Each point in A and B is based on 10,000 realizations of the coalescent process with exponential growth; the simulation results in C and D derive from averaging Pn/Tn across all appropriate simulations among those that are shown in A and B.

The dependence on r and N can be understood as follows. In the constant population size model, denote the random time to the coalescence of k to k - 1 lineages by Wk, measured in units of N generations. For k = 1, 2, ... , n - 1, let Xk (also in units of N generations) denote the total time elapsed in the coalescence of n lineages to k lineages, so that Xk = Wn + Wn-1 + ... + Wk+1. The corresponding time that it takes for n lineages to coalesce to k lineages in the exponential growth model is obtained from g-1(Xk), where

(2)

(NORDBORG 2001 Down, Equation 8). For any genealogy, Pn(N, r) is a linear combination of the coalescence times for that genealogy. Thus, for the exponential growth model, the numerator of Pn(N, r)/Tn(N, r) is a linear combination of g-1(Xn-1), g-1(Xn-2), ... , g-1(X1), and the denominator is g-1(X1). All terms in both numerator and denominator have an r-1 coefficient, which can therefore be removed. Thus, we have

(3)

for some collection of nonnegative constants ck (with ck > 0 for at least one value of k). By demonstrating that its derivative with respect to Nr is positive, it is easily shown that Pn(N, r)/Tn(N, r) is an increasing function of Nr. Consequently, E[Pn(N, r)/Tn(N, r)] is increasing in Nr, and increases in either N or r decrease the bias of the pairwise comparison estimator. Equation 3 implies that the parameters N and r are coupled in the compound parameter Nr: for example, the parameter sets (N, r) = (104, 0.001), (105, 0.0001), (106, 0.00001), all of which have Nr = 10, produce the same value of E[Pn(N, r)/Tn(N, r)] (Fig 2).

As in the constant population size case, E[Pn(N, r)/Tn(N, r)] can be approximated under exponential growth:

(4)

Note that (1) can be obtained from (4) by taking the limit as r -> 0. The closed-form expression in (4) gives a reasonable approximation to E[Pn/Tn] (Fig 2C and Fig D).

For rapidly growing large populations, the bias of the pairwise comparison estimator may be 10% or less (Fig 2). Estimated growth rates for periods of exponential growth of various human populations range from 0.001 to 0.02 per generation (PRITCHARD et al. 1999 Down; THOMSON et al. 2000 Down). Thus, for groups with sufficiently large N, the star-shaped genealogy might lead to nearly unbiased estimation of Tn. However, under the exponential growth model, N equals the current census population size only if the variance of reproductive success equals 1 (other properties of populations are also incorporated into the parameter N—see NORDBORG 2001 Down; NORDBORG and KRONE 2002 Down). Estimates of N for human populations under exponential growth models are considerably smaller than census sizes (PRITCHARD et al. 1999 Down; THOMSON et al. 2000 Down, for example). For human groups, it is questionable whether N is large enough for star-shaped genealogies to be applied to estimation of Tn.

For small populations the bias of the pairwise comparison estimator is particularly large. For Nr < 100 the bias for a sample of reasonable size will be considerable, >20%. Thus, in small populations, even if they have expanded exponentially, the star-shaped genealogy assumption cannot substitute for genealogical modeling of the data; schemes that explicitly account for uncertainty in the genealogy (for reviews, see ROSENBERG and FELDMAN 2002 Down; TANG et al. 2002 Down) are likely more appropriate. For estimating TMRCA, population sizes and growth rates of relatively small groups such as Jewish priests (THOMAS et al. 1998 Down) may be too small to produce approximate star-shaped genealogies. In these groups it is probable that the pairwise comparison estimator underestimates coalescence times, and use of the estimator should be accompanied by quantification of its bias.

A further problem with this estimation procedure is that on the assumption of a star-shaped genealogy, the variance of the pairwise comparison estimator is typically underestimated, as its calculation ignores uncertainty associated with not knowing the genealogy. Under exponential growth, Pn/Tn can be quite variable (Fig 3), compared to its constant value of 1 in the star genealogy model. By assuming that Pn and Tn are equal, the pairwise comparison estimator ignores inherent variation in the relationship between these two quantities, which exists even if N and r are known exactly. Of course, all model-based procedures experience problems similar to this limitation of the star genealogy model. The variance of estimators is typically evaluated conditional on a model, such as the star genealogy model or the constant population size model; uncertainty associated with not knowing the model is difficult to incorporate into the calculation of confidence intervals.



View larger version (18K):
In this window
In a new window
Download PPT slide
 
Figure 3. Distribution of the ratio of the average pairwise coalescence time to the time to the most recent common ancestor (Pn/Tn). Each distribution is based on 10,000 realizations of the coalescent process with exponential growth, using a sample of size 25.

The variance of Pn/Tn is larger for smaller values of Nr. Thus, as Nr decreases, not only does Pn move farther away from Tn, but also Tn becomes harder to predict from Pn (Fig 3). Only for Nr > ~105 is it nearly certain that Tn < 1.25Pn (Fig 3C).

Other estimators based on the assumption of star-shaped genealogies may suffer from more severe bias than the pairwise comparison estimator of TMRCA, because other genealogical ratios decline more rapidly with sample size than does Pn/Tn. Under the assumption of a star-shaped genealogy, an alternate estimator of TMRCA is the "tree-length estimator" or the estimated total branch length of the genealogy divided by n (KARN et al. 2002 Down, for example). Unbiased estimators of the total branch length Ln can be obtained, for example, under the infinite-sites model, from the number of polymorphic sites observed in a data set divided by the mutation rate. To evaluate the bias of the tree-length estimator, we must consider the ratio of Ln/n to Tn, a ratio that equals 1 for a star-shaped genealogy.

Under the constant-sized population model, we have (using Equation 2 of TAVARE et al. 1997 Down)

(5)

where {gamma} {cong} 0.5772 is Euler's constant. In the exponential growth model the analogous argument to (4) suggests

(6)

The closed-form expression in (6) approximates E[Ln/(nTn)] less accurately than (4) approximates E[Pn/Tn] (Fig 4). However, Ln/(nTn), obtained from the same simulated genealogies that underlie Fig 2, is similar to Pn/Tn, in the ways that it depends on the parameters. The major difference is that the large-sample limit of Ln/(nTn) is zero under constant population size, whereas Pn/Tn has a large-sample limit of 1/2. Thus, as sample size increases, Ln/(nTn) decreases much faster than Pn/Tn, and the tree-length estimator has bias considerably larger than that of the pairwise comparison estimator.



View larger version (48K):
In this window
In a new window
Download PPT slide
 
Figure 4. Expected values of the ratio of the total length of the genealogy to the sample size times the time to the most recent common ancestor, or E[Ln/(nTn)]. (A) Simulations with samples of size 5. (B) Simulations with samples of size 25. (C) Comparison of simulations with closed-form approximation (6), n = 5. (D) Comparison of simulations with closed-form approximation (6), n = 25. Each point in A and B is based on 10,000 realizations of the coalescent process with exponential growth; the simulation results in C and D derive from averaging Ln/(nTn) across all appropriate simulations among those that are shown in A and B.

The potential error of the star genealogy assumption is perhaps greatest when properties of the genealogy itself, such as Tn, are of interest. If the goal of analysis is to compare genealogies for different loci or populations relative to each other, bias may affect estimates similarly and may have a reduced impact, although differences in sample size and population size should be taken into consideration. Also, if the genealogy is treated as a nuisance parameter, such as in fine mapping of disease susceptibility loci, the assumption might not have severe consequences. The success of methods based on the star genealogy assumption in pinpointing previously identified susceptibility genes (LIU et al. 2001 Down, for example) suggests that human genealogies may be sufficiently star-like for mapping of some disorders, although modeling of the dependence among lineages can lead to more accurate positional inference (MORRIS et al. 2002 Down).

We have seen here that the approximate star-shaped features of genealogies in an exponentially growing population may be insufficient to guarantee low bias in analyses based on the star genealogy assumption, unless the population has grown very rapidly to a very large size. For estimation of Tn, the numerical results and approximate expressions shown can guide the use of the assumption. Future uses of star-shaped genealogies in population genetic analysis will benefit from demonstration that the assumption is appropriate in the relevant contexts.


*  ACKNOWLEDGMENTS

We thank H. Innan, P. Marjoram, and M. Nordborg for helpful comments. D. Zulman suggested the term "maypole genealogy." This research was supported by a National Science Foundation Postdoctoral Fellowship in Bioinformatics to N.A.R. and by National Institutes of Health grant GM28016 to M. W. Feldman.

Manuscript received August 15, 2002; Accepted for publication April 7, 2003.


*  LITERATURE CITED
*TOP
*ABSTRACT
*LITERATURE CITED

DONNELLY, P., 1996 Interpreting genetic variability: the effects of shared evolutionary history, pp. 25–40 in Variation in the Human Genome. Wiley, Chichester, UK.

DONNELLY, P. and S. TAVARÉ, 1995  Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29:401-421.[Medline]

KARN, R. C., A. ORTH, F. BONHOMME, and P. BOURSOT, 2002  The complex history of a gene proposed to participate in a sexual isolation mechanism in house mice. Mol. Biol. Evol. 19:462-471.[Abstract/Free Full Text]

LIU, J. S., C. SABATTI, J. TENG, B. J. B. KEATS, and N. RISCH, 2001  Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 11:1716-1724.[Abstract/Free Full Text]

MCPEEK, M. S. and A. STRAHS, 1999  Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am. J. Hum. Genet. 65:858-875.[Medline]

MORRIS, A. P., J. C. WHITTAKER, and D. J. BALDING, 2002  Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am. J. Hum. Genet. 70:686-707.[Medline]

NORDBORG, M., 2001 Coalescent theory, pp. 179–212 in Handbook of Statistical Genetics, edited by D. J. BALDING, C. CANNINGS and M. BISHOP. Wiley, Chichester, UK.

NORDBORG, M., and S. M. KRONE, 2002 Separation of time scales and convergence to the coalescent in structured populations, pp. 194–232 in Modern Developments in Theoretical Population Genetics, edited by M. SLATKIN and M. VEUILLE. Oxford University Press, Oxford.

PRITCHARD, J. K., M. T. SEIELSTAD, A. PÉREZ-LEZAUN, and M. W. FELDMAN, 1999  Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16:1791-1798.[Abstract]

RANNALA, B. and G. BERTORELLE, 2001  Using linked markers to infer the age of a mutation. Hum. Mutat. 18:87-100.[Medline]

REICH, D. E., and D. B. GOLDSTEIN, 1999 Estimating the age of mutations using variation at linked markers, pp. 129–138 in Microsatellites: Evolution and Applications, edited by D. B. GOLDSTEIN and C. SCHLÖTTERER. Oxford University Press, Oxford.

RISCH, N., D. DE LEON, L. OZELIUS, P. KRAMER, and L. ALMASY et al., 1995  Genetic analysis of idiopathic torsion dystonia in Ashkenazi Jews and their recent descent from a small founder population. Nat. Genet. 9:152-159.[Medline]

ROSENBERG, N. A., and M. W. FELDMAN, 2002 The relationship between coalescence times and population divergence times, pp. 130–164 in Modern Developments in Theoretical Population Genetics, edited by M. SLATKIN and M. VEUILLE. Oxford University Press, Oxford.

ROSENBERG, N. A. and M. NORDBORG, 2002  Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat. Rev. Genet. 3:380-390.[Medline]

SCHIERUP, M. H. and J. HEIN, 2000  Consequences of recombination on traditional phylogenetic analysis. Genetics 156:879-891.[Abstract/Free Full Text]

SLATKIN, M., 1996  Gene genealogies within mutant allelic classes. Genetics 143:579-587.[Abstract]

SLATKIN, M. and R. R. HUDSON, 1991  Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555-562.[Abstract]

SLATKIN, M. and B. RANNALA, 2000  Estimating allele age. Annu. Rev. Genomics Hum. Genet. 1:225-249.[Medline]

STEPHENS, M., 2001 Inference under the coalescent, pp. 213–238 in Handbook of Statistical Genetics, edited by D. J. BALDING, C. CANNINGS and M. BISHOP. Wiley, Chichester, UK.

STUMPF, M. P. H. and D. B. GOLDSTEIN, 2001  Genealogical and evolutionary inference with the human Y chromosome. Science 291:1738-1742.[Abstract/Free Full Text]

TAJIMA, F., 1983  Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.[Abstract/Free Full Text]

TANG, H., D. O. SIEGMUND, P. SHEN, P. J. OEFNER, and M. W. FELDMAN, 2002  Frequentist estimation of coalescence times from nucleotide sequence data using a tree-based partition. Genetics 161:447-459.[Abstract/Free Full Text]

TAVARÉ, S., D. J. BALDING, R. C. GRIFFITHS, and P. DONNELLY, 1997  Inferring coalescence times from DNA sequence data. Genetics 145:505-518.[Abstract]

THOMAS, M. G., K. SKORECKI, H. BEN-AMI, T. PARFITT, and N. BRADMAN et al., 1998  Origins of Old Testament priests. Nature 394:138-140.[Medline]

THOMSON, R., J. K. PRITCHARD, P. SHEN, P. J. OEFNER, and M. W. FELDMAN, 2000  Recent common ancestry of human Y chromosomes: evidence from DNA sequence data. Proc. Natl. Acad. Sci. USA 97:7360-7365.[Abstract/Free Full Text]

UYENOYAMA, M. K., 1997  Genealogical structure among alleles regulating self-incompatibility in natural populations of flowering plants. Genetics 147:1389-1400.[Abstract]




This article has been cited by other articles:


Home page
GeneticsHome page
A. Sano and H. Tachida
Gene Genealogy and Properties of Test Statistics of Neutrality Under Population Growth
Genetics, March 1, 2005; 169(3): 1687 - 1697.
[Abstract] [Full Text] [PDF]