- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Rosenberg, N. A.
- Articles by Hirsh, A. E.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Rosenberg, N. A.
- Articles by Hirsh, A. E.
On the Use of Star-Shaped Genealogies in Inference of Coalescence Times
Noah A. Rosenberga and Aaron E. Hirshba Program in Molecular and Computational Biology, University of Southern California, Los Angeles, California 90089
b Department of Biological Sciences, Stanford University, Stanford, California 94305
Corresponding author: Noah A. Rosenberg, 1042 W. 36th Pl., DRB 289, University of Southern California, Los Angeles, CA 90089., noahr{at}usc.edu (E-mail)
Communicating editor: M. K. UYENOYAMA
| ABSTRACT |
|---|
Genealogies from rapidly growing populations have approximate "star" shapes. We study the degree to which this approximation holds in the context of estimating the time to the most recent common ancestor (TMRCA) of a set of lineages. In an exponential growth scenario, we find that unless the product of population size (N) and growth rate (r) is at least
105, the "pairwise comparison estimator" of TMRCA that derives from the star genealogy assumption has bias of 1050%. Thus, the estimator is appropriate only for large populations that have grown very rapidly. The "tree-length estimator" of TMRCA is more biased than the pairwise comparison estimator, having low bias only for extremely large values of Nr.
A fundamental development in population genetics has been the recognition that the pattern of genetic variation in a set of sampled sequences is heavily affected by the particular genealogy of the lineages. In general, however, this underlying genealogy is unknown. To account for the effect of the genealogy in analyses of population genetic data, it is useful to consider "random genealogies" that are consistent with the data and to average over many such genealogies. The coalescent framework provides a natural way to construct these random genealogies under various assumptions about the demography of populations (![]()
![]()
Use of the coalescent to model unknown genealogies sometimes leads to intensive computations in statistical inference from data (![]()
![]()
![]()
![]()
The "star-shaped" or "maypole" genealogy (Fig 1A) is perhaps the simplest type of genealogy and the easiest to analyze, as it has the only shape for which all lineages experience independent evolution. In a star-shaped genealogy, sampled lineages provide independent replicates of the evolutionary process since the time of their most recent common ancestor (MRCA). ![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
![]()
|
Here we determine the degree to which the assumption of a star-shaped genealogy is appropriate for a sample taken from an exponentially growing population. Importantly, the error introduced by the assumption depends on the nature of the eventual calculation that will be performed conditional on the star-shaped genealogy. ![]()
![]()
Consider the genealogy of a set of n sampled lineages at a nonrecombining locus. Properties of this genealogy include the time to the MRCA of the sample (Tn or TMRCA), the total length of all branches of the genealogy (Ln), and the average coalescence time of a pair of sampled lineages (Pn). Ratios of these quantities can be used to explore shapes of genealogies under various demographic models (![]()
![]()
![]()
Suppose that an estimate for Tn is desired. Under the assumption of a star-shaped genealogy, Pn and Tn are equivalent. Because unbiased estimates of the coalescence time of a pair of lineages can frequently be obtained (![]()
![]()
n is used as the estimator of Tn. This "pairwise comparison estimator" of Tn is unbiased only if the sample has a star-shaped genealogy (or if n = 2).
In general, bias of this estimator of Tn is downward, because the average pairwise coalescence time is always less than or equal to the overall coalescence time. In a constant-sized population of haploid size N, the bias of the pairwise estimator can be considerable. If we treat Pn and Tn as functions of population size N and growth rate r and measure them in units of N generations, then instead of Pn/Tn = 1 as in a star-shaped genealogy, the constant model yields
![]() |
(1) |
(using Equation 1 of ![]()
![]()
![]()
![]()
Fig 2 demonstrates that the bias of the pairwise comparison estimator increases with sample size, decreases with growth rate, and decreases with population size. To explain the perhaps surprising dependence on sample size, note that E[Pn] is constant as a function of n (![]()
![]()
|
The dependence on r and N can be understood as follows. In the constant population size model, denote the random time to the coalescence of k to k - 1 lineages by Wk, measured in units of N generations. For k = 1, 2, ... , n - 1, let Xk (also in units of N generations) denote the total time elapsed in the coalescence of n lineages to k lineages, so that Xk = Wn + Wn-1 + ... + Wk+1. The corresponding time that it takes for n lineages to coalesce to k lineages in the exponential growth model is obtained from g-1(Xk), where
![]() |
(2) |
(![]()
![]() |
(3) |
for some collection of nonnegative constants ck (with ck > 0 for at least one value of k). By demonstrating that its derivative with respect to Nr is positive, it is easily shown that Pn(N, r)/Tn(N, r) is an increasing function of Nr. Consequently, E[Pn(N, r)/Tn(N, r)] is increasing in Nr, and increases in either N or r decrease the bias of the pairwise comparison estimator. Equation 3 implies that the parameters N and r are coupled in the compound parameter Nr: for example, the parameter sets (N, r) = (104, 0.001), (105, 0.0001), (106, 0.00001), all of which have Nr = 10, produce the same value of E[Pn(N, r)/Tn(N, r)] (Fig 2).
As in the constant population size case, E[Pn(N, r)/Tn(N, r)] can be approximated under exponential growth:
![]() |
(4) |
Note that (1) can be obtained from (4) by taking the limit as r
0. The closed-form expression in (4) gives a reasonable approximation to E[Pn/Tn] (Fig 2C and Fig D).
For rapidly growing large populations, the bias of the pairwise comparison estimator may be 10% or less (Fig 2). Estimated growth rates for periods of exponential growth of various human populations range from 0.001 to 0.02 per generation (![]()
![]()
![]()
![]()
![]()
![]()
For small populations the bias of the pairwise comparison estimator is particularly large. For Nr < 100 the bias for a sample of reasonable size will be considerable, >20%. Thus, in small populations, even if they have expanded exponentially, the star-shaped genealogy assumption cannot substitute for genealogical modeling of the data; schemes that explicitly account for uncertainty in the genealogy (for reviews, see ![]()
![]()
![]()
A further problem with this estimation procedure is that on the assumption of a star-shaped genealogy, the variance of the pairwise comparison estimator is typically underestimated, as its calculation ignores uncertainty associated with not knowing the genealogy. Under exponential growth, Pn/Tn can be quite variable (Fig 3), compared to its constant value of 1 in the star genealogy model. By assuming that Pn and Tn are equal, the pairwise comparison estimator ignores inherent variation in the relationship between these two quantities, which exists even if N and r are known exactly. Of course, all model-based procedures experience problems similar to this limitation of the star genealogy model. The variance of estimators is typically evaluated conditional on a model, such as the star genealogy model or the constant population size model; uncertainty associated with not knowing the model is difficult to incorporate into the calculation of confidence intervals.
|
The variance of Pn/Tn is larger for smaller values of Nr. Thus, as Nr decreases, not only does Pn move farther away from Tn, but also Tn becomes harder to predict from Pn (Fig 3). Only for Nr >
105 is it nearly certain that Tn < 1.25Pn (Fig 3C).
Other estimators based on the assumption of star-shaped genealogies may suffer from more severe bias than the pairwise comparison estimator of TMRCA, because other genealogical ratios decline more rapidly with sample size than does Pn/Tn. Under the assumption of a star-shaped genealogy, an alternate estimator of TMRCA is the "tree-length estimator" or the estimated total branch length of the genealogy divided by n (![]()
Under the constant-sized population model, we have (using Equation 2 of ![]()
![]() |
(5) |
where
0.5772 is Euler's constant. In the exponential growth model the analogous argument to (4) suggests
![]() |
(6) |
The closed-form expression in (6) approximates E[Ln/(nTn)] less accurately than (4) approximates E[Pn/Tn] (Fig 4). However, Ln/(nTn), obtained from the same simulated genealogies that underlie Fig 2, is similar to Pn/Tn, in the ways that it depends on the parameters. The major difference is that the large-sample limit of Ln/(nTn) is zero under constant population size, whereas Pn/Tn has a large-sample limit of 1/2. Thus, as sample size increases, Ln/(nTn) decreases much faster than Pn/Tn, and the tree-length estimator has bias considerably larger than that of the pairwise comparison estimator.
|
The potential error of the star genealogy assumption is perhaps greatest when properties of the genealogy itself, such as Tn, are of interest. If the goal of analysis is to compare genealogies for different loci or populations relative to each other, bias may affect estimates similarly and may have a reduced impact, although differences in sample size and population size should be taken into consideration. Also, if the genealogy is treated as a nuisance parameter, such as in fine mapping of disease susceptibility loci, the assumption might not have severe consequences. The success of methods based on the star genealogy assumption in pinpointing previously identified susceptibility genes (![]()
![]()
We have seen here that the approximate star-shaped features of genealogies in an exponentially growing population may be insufficient to guarantee low bias in analyses based on the star genealogy assumption, unless the population has grown very rapidly to a very large size. For estimation of Tn, the numerical results and approximate expressions shown can guide the use of the assumption. Future uses of star-shaped genealogies in population genetic analysis will benefit from demonstration that the assumption is appropriate in the relevant contexts.
| ACKNOWLEDGMENTS |
|---|
We thank H. Innan, P. Marjoram, and M. Nordborg for helpful comments. D. Zulman suggested the term "maypole genealogy." This research was supported by a National Science Foundation Postdoctoral Fellowship in Bioinformatics to N.A.R. and by National Institutes of Health grant GM28016 to M. W. Feldman.
Manuscript received August 15, 2002; Accepted for publication April 7, 2003.
| LITERATURE CITED |
|---|
DONNELLY, P., 1996 Interpreting genetic variability: the effects of shared evolutionary history, pp. 2540 in Variation in the Human Genome. Wiley, Chichester, UK.
DONNELLY, P. and S. TAVARÉ, 1995 Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29:401-421.[Medline]
KARN, R. C., A. ORTH, F. BONHOMME, and P. BOURSOT, 2002 The complex history of a gene proposed to participate in a sexual isolation mechanism in house mice. Mol. Biol. Evol. 19:462-471.
LIU, J. S., C. SABATTI, J. TENG, B. J. B. KEATS, and N. RISCH, 2001 Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 11:1716-1724.
MCPEEK, M. S. and A. STRAHS, 1999 Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am. J. Hum. Genet. 65:858-875.[Medline]
MORRIS, A. P., J. C. WHITTAKER, and D. J. BALDING, 2002 Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am. J. Hum. Genet. 70:686-707.[Medline]
NORDBORG, M., 2001 Coalescent theory, pp. 179212 in Handbook of Statistical Genetics, edited by D. J. BALDING, C. CANNINGS and M. BISHOP. Wiley, Chichester, UK.
NORDBORG, M., and S. M. KRONE, 2002 Separation of time scales and convergence to the coalescent in structured populations, pp. 194232 in Modern Developments in Theoretical Population Genetics, edited by M. SLATKIN and M. VEUILLE. Oxford University Press, Oxford.
PRITCHARD, J. K., M. T. SEIELSTAD, A. PÉREZ-LEZAUN, and M. W. FELDMAN, 1999 Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Mol. Biol. Evol. 16:1791-1798.[Abstract]
RANNALA, B. and G. BERTORELLE, 2001 Using linked markers to infer the age of a mutation. Hum. Mutat. 18:87-100.[Medline]
REICH, D. E., and D. B. GOLDSTEIN, 1999 Estimating the age of mutations using variation at linked markers, pp. 129138 in Microsatellites: Evolution and Applications, edited by D. B. GOLDSTEIN and C. SCHLÖTTERER. Oxford University Press, Oxford.
RISCH, N., D. DE LEON, L. OZELIUS, P. KRAMER, and L. ALMASY et al., 1995 Genetic analysis of idiopathic torsion dystonia in Ashkenazi Jews and their recent descent from a small founder population. Nat. Genet. 9:152-159.[Medline]
ROSENBERG, N. A., and M. W. FELDMAN, 2002 The relationship between coalescence times and population divergence times, pp. 130164 in Modern Developments in Theoretical Population Genetics, edited by M. SLATKIN and M. VEUILLE. Oxford University Press, Oxford.
ROSENBERG, N. A. and M. NORDBORG, 2002 Genealogical trees, coalescent theory and the analysis of genetic polymorphisms. Nat. Rev. Genet. 3:380-390.[Medline]
SCHIERUP, M. H. and J. HEIN, 2000 Consequences of recombination on traditional phylogenetic analysis. Genetics 156:879-891.
SLATKIN, M., 1996 Gene genealogies within mutant allelic classes. Genetics 143:579-587.[Abstract]
SLATKIN, M. and R. R. HUDSON, 1991 Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129:555-562.[Abstract]
SLATKIN, M. and B. RANNALA, 2000 Estimating allele age. Annu. Rev. Genomics Hum. Genet. 1:225-249.[Medline]
STEPHENS, M., 2001 Inference under the coalescent, pp. 213238 in Handbook of Statistical Genetics, edited by D. J. BALDING, C. CANNINGS and M. BISHOP. Wiley, Chichester, UK.
STUMPF, M. P. H. and D. B. GOLDSTEIN, 2001 Genealogical and evolutionary inference with the human Y chromosome. Science 291:1738-1742.
TAJIMA, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105:437-460.
TANG, H., D. O. SIEGMUND, P. SHEN, P. J. OEFNER, and M. W. FELDMAN, 2002 Frequentist estimation of coalescence times from nucleotide sequence data using a tree-based partition. Genetics 161:447-459.
TAVARÉ, S., D. J. BALDING, R. C. GRIFFITHS, and P. DONNELLY, 1997 Inferring coalescence times from DNA sequence data. Genetics 145:505-518.[Abstract]
THOMAS, M. G., K. SKORECKI, H. BEN-AMI, T. PARFITT, and N. BRADMAN et al., 1998 Origins of Old Testament priests. Nature 394:138-140.[Medline]
THOMSON, R., J. K. PRITCHARD, P. SHEN, P. J. OEFNER, and M. W. FELDMAN, 2000 Recent common ancestry of human Y chromosomes: evidence from DNA sequence data. Proc. Natl. Acad. Sci. USA 97:7360-7365.
UYENOYAMA, M. K., 1997 Genealogical structure among alleles regulating self-incompatibility in natural populations of flowering plants. Genetics 147:1389-1400.[Abstract]
This article has been cited by other articles:
![]() |
A. Sano and H. Tachida Gene Genealogy and Properties of Test Statistics of Neutrality Under Population Growth Genetics, March 1, 2005; 169(3): 1687 - 1697. [Abstract] [Full Text] [PDF] |
||||
- THIS ARTICLE
-
Abstract
- Full Text (PDF)
- Alert me when this article is cited
- Alert me if a correction is posted
- SERVICES
- Similar articles in this journal
- Similar articles in PubMed
- Alert me to new issues of the journal
- Download to citation manager
- Reprints & Permissions
- CITING ARTICLES
- Citing Articles via HighWire
- Citing Articles via Google Scholar
- GOOGLE SCHOLAR
- Articles by Rosenberg, N. A.
- Articles by Hirsh, A. E.
- Search for Related Content
- PUBMED
- PubMed Citation
- Articles by Rosenberg, N. A.
- Articles by Hirsh, A. E.









