Skip to main content
  • Facebook
  • Twitter
  • YouTube
  • LinkedIn
  • Google Plus
  • Other GSA Resources
    • Genetics Society of America
    • G3: Genes | Genomes | Genetics
    • Genes to Genomes: The GSA Blog
    • GSA Conferences
    • GeneticsCareers.org

Institution: Massachusetts Inst of Technol MIT Libs

  • Log in
Genetics

Main menu

  • HOME
  • ISSUES
    • Current Issue
    • Early Online
    • Archive
  • ABOUT
    • About the journal
    • Why publish with us?
    • Editorial board
    • Early Career Reviewers
    • Contact us
  • SERIES
    • All Series
    • Genomic Prediction
    • Multiparental Populations
    • FlyBook
    • WormBook
    • YeastBook
  • ARTICLE TYPES
    • About Article Types
    • Commentaries
    • Editorials
    • GSA Honors and Awards
    • Methods, Technology & Resources
    • Perspectives
    • Primers
    • Reviews
    • Toolbox Reviews
  • PUBLISH & REVIEW
    • Scope & publication policies
    • Submission & review process
    • Article types
    • Prepare your manuscript
    • Submit your manuscript
    • After acceptance
    • Guidelines for reviewers
  • SUBSCRIBE
    • Why subscribe?
    • For institutions
    • For individuals
    • Email alerts
    • RSS feeds
  • Other GSA Resources
    • Genetics Society of America
    • G3: Genes | Genomes | Genetics
    • Genes to Genomes: The GSA Blog
    • GSA Conferences
    • GeneticsCareers.org

User menu

  • Log out

Search

  • Advanced search
Genetics

Advanced Search

  • HOME
  • ISSUES
    • Current Issue
    • Early Online
    • Archive
  • ABOUT
    • About the journal
    • Why publish with us?
    • Editorial board
    • Early Career Reviewers
    • Contact us
  • SERIES
    • All Series
    • Genomic Prediction
    • Multiparental Populations
    • FlyBook
    • WormBook
    • YeastBook
  • ARTICLE TYPES
    • About Article Types
    • Commentaries
    • Editorials
    • GSA Honors and Awards
    • Methods, Technology & Resources
    • Perspectives
    • Primers
    • Reviews
    • Toolbox Reviews
  • PUBLISH & REVIEW
    • Scope & publication policies
    • Submission & review process
    • Article types
    • Prepare your manuscript
    • Submit your manuscript
    • After acceptance
    • Guidelines for reviewers
  • SUBSCRIBE
    • Why subscribe?
    • For institutions
    • For individuals
    • Email alerts
    • RSS feeds
Previous ArticleNext Article

The Effects of Rate Variation on Ancestral Inference in the Coalescent

Lada Markovtsova, Paul Marjoram and Simon Tavaré
Genetics November 1, 2000 vol. 156 no. 3 1427-1436
Lada Markovtsova
* Department of Mathematics, University of Southern California, Los Angeles, California 90089-1113
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Paul Marjoram
† Biostatistics Division, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Simon Tavaré
* Department of Mathematics, University of Southern California, Los Angeles, California 90089-1113† Biostatistics Division, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033‡ Program in Molecular Biology, Department of Biological Sciences, University of Southern California, Los Angeles, California 90089-1340
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
  • Article
  • Figures & Data
  • Info & Metrics
Loading

Abstract

We describe a Markov chain Monte Carlo approach for assessing the role of site-to-site rate variation in the analysis of within-population samples of DNA sequences using the coalescent. Our framework is a Bayesian one. We discuss methods for assessing the goodness-of-fit of these models, as well as problems concerning the separate estimation of effective population size and mutation rate. Using a mitochondrial data set for illustration, we show that ancestral inference concerning coalescence times can be dramatically affected if rate variation is ignored.

IT is widely recognized in the phylogenetic community that variation in mutation rates across sites can have significant impact on phylogenetic analyses based on sequence data. The review article of Yang (1996a) provides an historical overview with particular reference to gamma models for rate variation. A number of authors have studied the effects of rate variation for within-species samples of sequences (Aris-Brosou and Excoffier 1996; Deng and Fu 1996; Tajima 1996; Yang 1996b; Misawa and Tajima 1997). This work has concentrated largely on the number of segregating sites observed in the sample. The focus on this simple summary statistic was predicated in part on the lack of theoretical and computational approaches for studying complete sequence data. Studies of rate variation in hypervariable region I of human mtDNA (cf. Lundstromet al. 1992b; Wakeley 1993; Excoffier and Yang 1999; Meyeret al. 1999) have emphasized the likely drawbacks of using simple mutation processes such as the infinitely many sites model to analyze data where site-to-site variation is known to occur. Other studies, such as Sigurdardóttir et al. (2000), have attempted to take advantage of large, detailed pedigree information to infer mutation rates from observed mutation events.

In this article we develop a Markov chain Monte Carlo (MCMC) approach for studying rate variation for within-population samples of DNA sequences evolving according to a coalescent model (Kingman 1982). Our method generates observations from the posterior distribution of the coalescent tree topology, the coalescence times, and the mutation parameters, conditional on the sequence data observed in the sample. MCMC methods for maximum-likelihood estimation in the coalescent were introduced by Kuhner et al. (1995, 1998), and further developed by Wilson and Balding (1998) in the Bayesian setting. Bayesian MCMC approaches are now available in the phylogenetic setting as well (cf. Yang and Rannala 1997; Larget and Simon 1999; Mauet al. 1999). The Bayesian approach provides a useful computational device for maximum-likelihood estimation, as we illustrate in results. We also suggest a method for assessing the adequacy of the fit of such models to the data.

We assume we have a random sample of n chromosomes, for each of which we have the DNA sequence of a region of interest. We denote the collection of sequence data by D. We illustrate our approach with a sample of mitochondrial sequences from the Nuu Chah Nulth obtained by Ward et al. (1991). The data D are 360-bp sequences from region I of the control region obtained from a sample of n = 63 individuals. The observed base frequencies are (πA, πG, πC, πT) = (0.3297, 0.1120, 0.3371, 0.2212). The data have 26 segregating sites, a mean heterozygosity of 0.0145 per site and 28 distinct haplotypes with a haplotype homozygosity of 0.0562. We show that allowance for rate variation provides a better fit to these data. Furthermore, rate variation is seen to give much smaller values for the conditional coalescence times than are predicted using an infinitely many sites model. This is true in general, since the infinitely many sites model fails to allow for recurrent mutations and thus tends to underestimate the mutation rate, with a consequent overestimation of coalescence times. For the data here, the expected time to the most recent common ancestor of the sample is reduced by a factor of about two. These results suggest that allowance should be made for rate variation when using coalescent-based models for ancestral inference. Our approach provides a way to do this. We conclude by discussing whether it is reasonable to use such methods to make inferences about population size and mutation rate separately, rather than, as is more common, estimating a compound parameter that is proportional to their product.

The coalescent: The simplest version of the coalescent assumes that the population is random mating and of a large, constant, effective size N. The coalescent describes the ancestry of a random sample of n chromosomes taken from the present-day population. For convenience we refer to the time at which the sample was taken as time 0 and let time increase as we look back into the past. For a wide range of models, the probability that two individuals share a common ancestor in the previous generation is σ2/(N – 1), where σ2 is the variance of the distribution of the number of offspring produced by a parent in a single generation (Cannings 1974). For convenience we scale time in units of N/σ2 generations so that in a large population a given pair of individuals coalesces at rate 1. As is common, we shall assume σ2 = 1 so that a coalescent time of 1 corresponds to N generations. Note that this assumption affects the interpretation of estimates of ages and mutation rates on the basis of the data. If the true σ2 is >1 then estimates for times to the MRCA and ages of mutations will most likely be overestimates, with a consequent underestimation of mutation rates. Let Tj denote the time period during which the sample has j distinct ancestors. The Tj have independent exponential distributions with parameter j(j – 1)/2. When a coalescence event occurs, two lines of ancestry are picked, uniformly at random, and are coalesced to form one resulting line. The process of coalescences terminates when a single line of ancestry remains. The genealogy can be viewed as consisting of two components: the topology of the tree structure and the times between coalescent events. We use Λ to denote the topology of the tree and T = {Tn, Tn–1,..., T2} to denote the set of coalescence times in the sample. Accessible reviews of the coalescent are given by Hudson (1991) and Donnelly and Tavaré (1995), for example.

MATERIALS AND METHODS

Mutation model for sequences: We use a variety of finite-sites models, in which mutations are assumed to occur according to a model of Felsenstein, described in detail in Thorne et al. (1992). Each model has a transition-transversion parameter κ, assumed to be the same at each site, and a rate parameter gi at the ith of the L sites in the sequence. Among the parameters of interest is the effective per-site average substitution rate θ defined in Equation 4; it may be calculated from the values of gi, κ, and the base frequencies in the sequences. Further details of the mutation model are deferred to the appendix.

The parameters in the model are denoted by the vector M. In all cases, we treat the unknown rate parameters M as having a prior distribution, and we simulate observations from the posterior distribution of M given D. In the most general setting, there are L + 1 rates, resulting in a highly overparameterized model. We therefore examine a number of special cases of this general model, as described below.

Model 1: All sites mutate at the same rate, so that gi ≡ g for all sites i. Here M = (g, κ).

Model 2: The special case of model 1 in which κ is assumed known, so that M = (g). This model serves as a simple description of mutation in hypervariable region I of mtDNA. It was used by Kuhner et al. (1995) in their analysis of the same data set.

In the remaining models, we assume κ is known. There are L rate parameters, and M = (g1,..., gL). To reduce the number of rate parameters we assign each site to one of a number of rate classes. If there are c classes, then the model has c rates and M = (h1,..., hc), where hl now refers to the common rate for the lth class. We treat two subcases of this model, one based on a statistical description of rate variation, the second on a biological one.

Model 3: Meyer et al. (1999) developed a method for studying rate variation in the hypervariable region I of mtDNA using the Tamura-Nei mutation model (Tamura and Nei 1993) with gamma rate variation. They used a worldwide sample of sequences to estimate the relative rate at each site in the region. We used their results to classify the sites in hypervariable region I into five different rate classes, assigning all sites that they classified as unvarying to the class with the lowest rate. This classification resulted in 194 sites in the class with the smallest rate, 103 in the next class, 23 in the third and fourth classes, and 17 sites in the most rapidly evolving class. The classification of each site may be obtained from http://hto-e.usc.edu/datasets/mtrates.txt. The mutation rate for each of the five classes is allowed to vary. Thus gi = hl if site i is assigned to class l. The mutation parameters are M = (h1, h2, h3, h4, h5).

Model 4: Here we use just two rate classes, one for purines and one for pyrimidines (cf. Lundstromet al. 1992b). In this case M = (h1, h2). It is reasonable to do this since we observe no transversions in the data.

Computational approach: Here we describe our approach for generating observations from the posterior distribution of the mutation parameters M and features of the tree topology Λ and times T, given the observed sequence data. We define the effective mutation parameter θ/2 to be the expected number of substitutions per site per unit time that result in a change of base. Details of the derivation of θ are given in Equations 4 and 6 in the appendix. Among the issues we address are the sensitivity of the effective mutation parameter θ to different mutation models, the effect these differences might have on ancestral inference concerning (for example) the time to the MRCA of the sample, and the adequacy of the fit of the models to the data.

We assume a prior distribution for M and develop an MCMC method for generating observations from the conditional density f(G|D) of G = (Λ, T, M) given D. Since the topology of the coalescent is independent of the times of coalescent events, and both are independent of the mutation process, we can write f(G∣D)=P(D∣G)p1(Λ)p2(T)p3(M)∕f(D). (1) The first term on the right can be computed using a peeling algorithm (cf. Felsenstein 1981) and the mutation model we have described. The term p1(Λ) on the right of (1) is the coalescent tree topology distribution, p2(T) is the density of the coalescence times T, and p3(M) is the prior distribution for the mutation rates M. The normalizing constant f(D) is unknown and hard to compute. We therefore use a version of the Metropolis-Hastings method (Metropoliset al. 1953; Hastings 1970) to simulate from the required conditional distribution.

Markov chain Monte Carlo method: The algorithm produces correlated samples from a distribution π of interest, in our case π(G) ≡ f(G|D). It starts with an arbitrary choice of Λ, T, and M. New realizations of G are then proposed and accepted or rejected, according to the basic Metropolis-Hastings method:

  1. Denote the current state by G = (Λ, T, M).

  2. Output the current value of G.

  3. Propose G′ = (Λ′, T′, M′) according to a kernel Q(G → G′

  4. Compute the acceptance probability h=min{1,π(G′)Q(G′→G)π(G)Q(G→G′)}. (2)

  5. Accept the new state G′ with probability h, otherwise stay at G.

  6. Return to step 1.

Let X(t) denote the state of this chain after t iterations. Once X(t) has reached stationarity its values represent samples from the distribution π(G) = π(Λ, T, M). In many cases it is desirable to simulate approximately independent samples from the posterior distribution of interest, in which case we use output from every mth iteration for a suitable choice of m.

There are many possible choices for the updating mechanism Q(·, ·). The particular Q(·, ·) we use will impact the efficiency of the scheme, both in terms of speed of computation and rate of approach to stationarity. Furthermore the chain X(t) must be irreducible and positive recurrent to ensure that the limiting distribution is indeed π(Λ, T, M). Informally, the chain X(t) must be able to reach any feasible state, from any feasible starting point, in a finite period of time. Any of the other updating schemes referenced in the Introduction might be adapted to the problems explored here, but we have chosen to use a version of the scheme given in Markovtsova et al. (2000). For full details we refer the reader to that article, but we now give a brief informal description. The algorithm makes local changes to the tree by picking a random coalescence event and considering this event and the next coalescence event. So if the first event involves the transition from k ancestors to k – 1, then we also consider the coalescence involving the transition from k – 1 to k – 2 ancestors. We then make a local rearrangement of the lines involved in these two coalescences. We simultaneously propose new times for the periods during which there are k and k – 1 ancestors to the sample. These new times may be generated according to the predata coalescent distribution, or according to Normal distributions with mean given by the current values of the times. We update the mutation parameters M every 10 iterations in the analyses presented in the next section. New values are proposed according to a Normal distribution centered around the currently accepted value.

RESULTS

We ran the updating scheme described in the previous section on the Nuu Chah Nulth data using mutation models 1–4. The output typically appeared to be nonstationary for up to 200,000 iterations of the algorithm. We sampled every 10,000th iteration to approximate a random sample from the stationary distribution. In a bid to be very conservative, and since the algorithms run rapidly, we generally discarded the first 2500 samples and based our analysis on the next 5000 samples. The acceptance rate was typically ∼80%. For runs in which we needed to tune a variance parameter the burn-in length varied, but the estimated parameter values were unchanged for the different variances we tried.

One should begin to sample from the process X(·) once it has “reached stationarity.” There are many heuristic tests for this, none of which is infallible. For a critique see Gilks et al. (1996). Some simple diagnostics are functions of the statistics of interest such as autocorrelations and moving averages. It is also valuable to run the chain from several different, widely spaced, starting points and compare the long-term behavior. We also used the tests contained in the software package CODA (Bestet al. 1995). All tests were passed with the exception of the test that indicates the presence of correlation in the log-likelihood series; this could be removed if necessary by sampling less frequently.

Some time might be saved by starting the process from a genealogy (Λ, T) for which P(Λ,T∣D) is relatively high. The rationale for this is that it is sensible to start from a region of the state-space that is well supported by the data. As an example of this one might use the UPGMA tree for the data set, as described in Kuhner et al. (1995). However, we usually started from random tree topologies since convergence from different starting points is potentially a useful diagnostic for stationarity.

The transition-transversion parameter κ: The analysis of model 1 produced a median value of κ of 65.1, with lower and upper quartiles of 32.7 and 162.7, respectively. Note that since the data are consistent with no transversions having occurred during the evolution of the sample, the posterior distribution for κ has a very long right tail and statistics for the mean, which are strongly influenced by outliers, are potentially misleading and are therefore not presented. The median value of g was 6.87 × 10–4 and the median value for w was 4.47 × 10–2. These results show that the data are consistent with a value of κ = 100, as has previously been used by Kuhner et al. (1995) in their analysis of the same data. For the analysis of subsequent models we treated κ = 100 as fixed.

The mutation parameter θ: Posterior statistics for the per-site average mutation rate θ for each of the models are presented in Table 1, and Figure 1 shows the estimated posterior density for each model. We note that the posterior distribution supports higher values for θ in model 3. This is largely due to the high mutation rate that is assigned to class 5, which contains 17 sites believed to mutate at the highest rate. The posterior distributions of the per-site mutation rate for each of the five rate classes in model 3 are shown in Figure 2.

Figure 3 shows the posterior distribution of θ for the purine and pyrimidine sites in model 4, under which both mutation rates had uniform prior distributions. The mean rate for purine sites is 0.015 compared to a mean rate of 0.062 for pyrimidine sites. We note the considerable range of plausible values for the pyrimidine rate. We may use the Bayesian approach to estimate the maximum-likelihood estimate (MLE) of the two rates. We obtained estimates of 0.012 for purines and 0.059 for pyrimidines (data not shown). These estimates can be compared to those of Lundstrom et al. (1992a) and Kuhner et al. (1995), who analyze purine and pyrimidine sites separately. The first authors obtained rates of 0.008 for purines and 0.024 for pyrimidines; the second, 0.005 and 0.052, respectively. The joint analysis of the two types of site tells a somewhat different story; the purine rate is much higher than the separate analyses might have predicted. Presumably this arises because, for this particular data set, including information about pyrimidines when analyzing purines happens to encourage trees to be shorter than they would be if purines were analyzed alone.

View this table:
  • View inline
  • View popup
TABLE 1

Summary statistics for θ

We conclude by returning to the results from the simplest mutation mechanism, given by model 2. We have taken κ = 100 and a uniform prior on (0, 100) for θ. Since the posterior density of θ is proportional to the likelihood in this case, we may use an estimate of the posterior density to find the maximum-likelihood estimate of θ. From the density shown in Figure 1, we obtained an MLE of θ^=0.038 . Kuhner et al. (1995) obtained the value θ^=0.040 for these data, using the same value of κ. The difference in estimates arises from a combination of the parameters chosen for the density estimation, the different approaches to the optimization, and noise. From an estimate of the curvature of the log density we get an estimate of the standard error of θ^ of 0.010, resulting in an ∼95% confidence interval of (0.018, 0.058). The Watterson (1975) estimator of θ, based on 26 segregating sites in the data, is 0.015 with an estimated standard error of 0.005; the 95% confidence interval for θ is then (0.005, 0.025). The lower value obtained using the Watterson estimator is expected, because multiple mutations at the same site are ignored.

Figure 1.
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 1.

—Posterior density of per site mutation rate θ for each model.

Goodness-of-fit: The adequacy of the fit of these models can be assessed using a variant of the parametric bootstrap. We simulated observations from the posterior distribution of (Λ, T, M), and for each of the trees (Λ, T) we then simulated the mutation process with parameters specified by M. The distribution of certain summary statistics observed in the simulated data is found, and the values of the statistics actually observed in the data are compared to these distributions. We chose to use the number of haplotypes, the maximal haplotype frequency, the haplotype homozygosity, the number of segregating sites, and a measure of nucleotide diversity. In practice, we use the output from the MCMC runs to generate the observations on (Λ, T, M). In Table 2 we give the results of this comparison for models 2 and 3 using 4000 values from each posterior distribution. In principle, for a perfect fit, we might expect to see ∼50% of the simulations with values less than or equal to the observed value in the data. The criteria for accepting our model would be that the observed values fall with the central 95% of the distribution generated by the simulations. There is some evidence that the constant rate model fits less well than the variable rate case, particularly regarding the haplotype distribution. The number of segregating sites seen in each of the classes in model 3 also seems to fit reasonably well. The total number of segregating sites observed in the bootstrap samples gives some evidence of lack-of-fit; both models predict more segregating sites than are seen in the data. One explanation for this apparent discrepancy might be that the models are not allowing for a high enough level of rate heterogeneity and therefore do not typically produce enough recurrent mutations. The mutations that do occur will then tend to be spread over a greater number of sites. A model that allows for more than five different mutation rate classes may well produce a better fit.

View this table:
  • View inline
  • View popup
TABLE 2

Assessing goodness-of-fit of the models

Figure 2.
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 2.

—Posterior density of per site θ's for model 3.

Figure 3.
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 3.

—Posterior density of per site θ for purines and pyrimidines under model 4.

The distribution of time to the most recent common ancestor: Posterior statistics for the time to the most recent common ancestor (TMRCA) are given in Table 3, and the corresponding posterior densities appear in Figure 4. The prior distribution of the time to MRCA of a sample of n = 63 has a mean of 2(1 – 1/63) = 1.97. With an effective size of N = 600, a 20-year generation time, and a value of σ2 = 1 for the variance of the offspring distribution, this is ∼23,600 years. The posterior distribution for each of the models presented here suggests that times considerably less than this are most likely. Note that the mean TMRCA for model 3 is slightly lower than that for the other models. This is a reflection of the higher estimate for θ under this model. Griffiths and Tavaré (1994) used the infinitely many sites model to obtain a mean posterior TMRCA of 14,400 years on the basis of an estimated per site θ of 0.014. This is substantially larger than the posterior mean of 7100 obtained from model 3, reflecting in part the smaller value of θ chosen. The joint posterior density of T and θ for model 3 is given in Figure 5.

View this table:
  • View inline
  • View popup
TABLE 3

Summary statistics for time to MRCA

Simultaneous estimation of N and u: Several authors (e.g., Tavaréet al. 1997; Wilson and Balding 1998) have developed Bayesian methods that make inferences about the mutation rate u and the population size N separately. In the absence of other information, the statistical features of the sequence data are a consequence of the value of the compound mutation parameter ν = 2Nu. Here we examine the influence of the assumed prior on the joint posterior distribution of N and u.

We illustrate this by considering the Nuu Chah Nulth data set once more. For simplicity we consider a case for which the prior distribution for N is uniform over the positive real line; this can be thought of as illustrating a case in which we wish to estimate an unknown effective population size. Using this framework the mode of the posterior distribution of N is the maximum-likelihood estimate of the population size. We use several different Normal prior distributions for u to indicate the influence the choice of prior distribution plays on the results. The mean of the Normal prior distribution for u, 7.6 × 10–5, was chosen to be consistent with the results for θ given in this article. We vary σ, the standard deviation of the Normal prior, to reflect differing levels of uncertainty about u. We show results for three cases: case 1 is for σ = 10–6, case 2 is for σ = 2 × 10–5, and case 3 is for σ = 10–4. In all three cases the posterior distributions for the compound parameter ν = 2Nu are practically identical, regardless of the choice of prior for u (see Figure 6, in which the posterior densities of the per-site effective mutation rate θ are plotted).

However, the similarity of the posterior distributions for θ conceals a more complicated story. In Figure 6 we show the posterior distributions of N and u for each of cases 1, 2, and 3. In each case, the central plot is a scatter plot showing sampled values of N and u. Along the edges of each scatter plot are the corresponding marginal distributions of N and u.

As we can see, N is highly correlated with u. While the marginal distributions look well behaved, they hide the nonidentifiability that is revealed in the scatter plot. The correlation is not apparent in case 1, because the choice of prior for u has restricted the parameters to a small region. The data, in isolation, can be used only for inference about ν (or θ, a multiple of ν). The data support a particular posterior distribution for ν, the single evolutionary parameter. The choice of priors for N and u dictates the way in which the posterior distribution for ν is decomposed into separate distributions. For example, if one uses a tight prior for u, i.e., a prior with low variance as in case 1, one attains a relatively low variance for the posterior distribution of N. However, if one assigns a prior with high variance to u, as in case 3, the posterior for N changes dramatically. In the limit, when we fix u, for example, the posterior distribution for N has a small variance. However, in this situation, the posterior for N is a rescaling of the posterior for ν. In a Bayesian framework it is natural for the choice of prior to influence the shape of the posterior distribution, but it is important to be aware of the extent of this influence when interpreting the results. Consequently, we feel there is merit to such an analysis only if one has strong beliefs about one of the two parameters. The same observations apply to analyses using related approaches, as in Tavaré et al. (1997), and Wilson and Balding (1998).

We have deliberately chosen widely differing priors for u to illustrate the points made here. However, the priors are not unreasonable and the results presented stress the vital role played by the choice of prior distributions. In the absence of any prior information the parameters u and N are confounded in the following sense: any combination of u, N and u′, N′ such that uN = u′N′ will have the same posterior likelihood, hence the behavior seen in Figure 6. The presence of a prior distribution for N and/or u lessens the degree of confounding, although, as can be seen, priors offering reasonable support to a wide range of parameter values will still leave a large degree of nonidentifiability.

Figure 4.
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 4.

—Posterior density of time to MRCA for each model.

DISCUSSION

We have described an MCMC approach that allows for the estimation of mutation parameters and statistics, such as the time to the MRCA, in a Bayesian setting. Such a setting allows reasonable prior information to be included in the analysis, possibly avoiding some of the confounding issues that might otherwise be present.

This approach also provides an alternative method for finding more traditional maximum-likelihood estimates of mutation rates by using noninformative priors.

We noted that the adequacy of the model may be assessed using a parametric bootstrap based on simulating the mutation process using trees and mutation parameters simulated from the posterior distribution. We saw that the distribution of quantities such as the TMRCA of the sample is highly dependent on the mutation model used. For the variable rate model with five rate classes, which provides an adequate fit to the data, the estimated mean TMRCA is about one-half that estimated using an infinitely many sites model. This example emphasizes the need for caution in practical applications of theoretical approaches to ancestral inference.

It is straightforward to adapt the algorithm given in this article to allow for other demographic scenarios such as deterministic population size fluctuations. It is also straightforward to adapt the algorithm to include different mutation models. For an example concerning the age of a unique event polymorphism, see Markovtsova et al. (2000). We are currently developing an approach to rate variation that allows sites to change the rate class to which they belong, in contrast to the approach described in this article in which the class is fixed. Code that implements the methods described in this article is available in the form of C++ source code and executables from the authors and at http://hto-e.usc.edu.

Figure 5.
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 5.

—Joint posterior density of TMRCA and θ under model 3.

Figure 6.
  • Download figure
  • Open in new tab
  • Download powerpoint
Figure 6.

—Posterior distributions for N and u. Top left, case 1. Top right, case 2. Bottom left, case 3. Bottom right, posterior density of θ.

Acknowledgments

We thank the referees for helpful comments on an earlier version of this article. The authors were supported in part by National Science Foundation grant BIR 95-04393 and National Institutes of Health grant GM 58897.

APPENDIX

Mutation model: We model only the effects of base substitutions, which we assume to occur according to a finitely many sites model as follows. In copying a sequence of L sites from one generation to the next, a potential substitution occurs with probability u. This potential substitution occurs at site i with probability pi, where pi ≥ 0 and p1 + · · · + pL = 1. Values of pi = 0 correspond to invariable sites, and differences in the pi reflect heterogeneities in the substitution rates at each site. A potential substitution at site i changes a base from j to k with probability pjk[i] . Here and in what follows we number the bases A, G, C, T as 1, 2, 3, and 4, respectively. The mutation matrices P[i] can in principle be different, but here we assume they are identical, P[i]:=P,i=1,2,…,L, so that we can focus on the effects of rate variation alone. We assume that P has stationary distribution π = (πA, πG, πC, πT), so that at stationarity a randomly chosen sequence looks as though its bases are independent copies of π. In common with most studies of molecular variation we assume that π is known, usually from estimates of the proportion of each base in the set of sequences being analyzed.

In the coalescent setting, we define ν=2Nu, so that the locations of potential substitutions at each site on the branches of the coalescent tree evolve as independent Poisson processes, the process for site i having rate νi/2, where νi = νpi. At any substitution point, the current base at site i is changed according to the transition matrix P. We note that not all potential substitutions have to result in changes to the existing base at a site, as pjj > 0 is allowed. The effective substitution rate θi/2 at site i is the expected number of substitutions per unit time that results in a change in base at site i: θi2=νi2∑j=14πj(1−pjj). (3) We write ϴ = θ1 + · · · θL for the overall effective mutation rate and θ=ϴ∕L (4) for the average rate per site.

It remains to specify the form of P. Since our applications focus on mitochondrial DNA, in which transitions occur with much higher frequency than transversions, we use a model of Felsenstein. When a potential substitution occurs, it may be one of two types: general, in which case an existing base j is substituted by a base of type k with probability πk,1 ≤ j, k ≤ 4; or within-group, in which case a pyrimidine is replaced by C or T with probability proportional to πC and πT, respectively, and a purine is replaced by A or G with probability proportional to πA and πG, respectively. The conditional probability of a general mutation is defined to be 1/(1 + κ), while the conditional probability of a within-group mutation is defined to be κ/(1 + κ), where κ ≥ 0 is the transition-transversion parameter. If we write gi=νi2(1+κ),wi=κgi, (5) then κ is the ratio of the within-class to general substitution rates. From (3), the effective mutation rate at site i is given by θi2=gi(1−∑j=14πj2)+2wi(πAπGπA+πG+πCπTπC+πT). (6) We denote by rjk(t; g, w) the probability that a base of type j has changed to a base of type k a time t later, when the rate parameters in (5) are given by g and w, respectively. Thorne et al. (1992) show that rjk(t;g,w)={e−(g+w)t+e−gt(1−e−wt)πkπH(k)+(1−e−gt)πk,j=ke−gt(1−e−wt)πkπH(k)+(1−e−gt)πk,j≠k,H(j)=H(k)(1−e−gt)πk,H(j)≠H(k),} (7) where πR = πA + πG, πY = πC + πT, and H(i) denotes whether base i is a purine or a pyrimidine, so that H(A) = H(G) = R and H(C) = H(T) = Y.

The Hastings ratio: Writing G = (Λ, T, M), the kernel Q can be expressed as the product of three terms: Q(G→G′)=Q1(Λ→Λ′)Q2(T→T′∣Λ→Λ′)Q3(M→M′). Consequently the Hastings ratio appearing in (2) can be written in the form P(D∣G′)P(D∣G)p1(Λ′)p1(Λ)p2(T′)p2(T)p3(M′)p3(M)Q1(Λ′→Λ)Q1(Λ→Λ′)Q2(T′→T∣Λ′→Λ)Q2(T→T′∣Λ→Λ′)Q3(M′→M)Q3(M→M′), (8) the unknown term f(D) canceling. We can further simplify (8) by noting that, since pairs of lines are chosen uniformly to coalesce, all topologies are, a priori, equally likely. Hence p1(Λ′) = p1(Λ). Furthermore, our transition kernel changes only two of the times on the tree, Tl and Tl–1, say. Finally, it is easy to show that Q1(Λ → Λ′) = Q1 (Λ′ → Λ), reducing (8) to P(D∣G′)P(D∣G)p2(T′)p3(M′)p2(T)p3(M)fl(tl)fl−1(tl−1)fl(tl′)fl−1(tl−1′)Q3(M′→M)Q3(M→M′), (9) where fl(·) and fl–1(·) are the densities of the time updating mechanism at levels l and l – 1. If one uses a transition kernel that proposes new times that are exponential with parameter l(l – 1)/2 at level l (i.e., the unconditional coalescent distribution for times), then further cross-cancellation reduces (9) to P(D∣G′)P(D∣G)p3(M′)p3(M)Q3(M′→M)Q3(M→M′). (10) A similar simplification also follows if one proposes new mutation rates independently of the currently accepted rate.

Footnotes

  • Communicating editor: G. A. Churchill

  • Received February 22, 2000.
  • Accepted July 17, 2000.
  • Copyright © 2000 by the Genetics Society of America

LITERATURE CITED

  1. ↵
    1. Aris-Brosou S.,
    2. Excoffier L.
    , 1996 The impact of population expansion and mutation rate heterogeneity on DNA sequence polymorphism. Mol. Biol. Evol. 13: 494–504.
    OpenUrlAbstract
  2. ↵
    1. Best N. G.,
    2. Cowles M. K.,
    3. Vines S. K.
    , 1995 CODA Manual Version 0.30. MRC Biostatistics Unit, Cambridge, UK.
  3. ↵
    1. Cannings C.
    , 1974 The latent roots of certain Markov chains arising in genetics: a new approach. I. Haploid models. Adv. Appl. Prob. 6: 260–290.
    OpenUrlCrossRef
  4. ↵
    1. Deng W-H.,
    2. Fu Y-X.
    , 1996 The effects of variable mutation rates across sites on the phylogenetic estimation of effective population size of mutation rate of DNA sequences. Genetics 144: 1271–1281.
    OpenUrlAbstract/FREE Full Text
  5. ↵
    1. Donnelly P.,
    2. Tavaré S.
    , 1995 Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29: 401–421.
    OpenUrlCrossRefPubMedWeb of Science
  6. ↵
    1. Excoffier L.,
    2. Yang Z.
    , 1999 Substitution rate variation among sites in mitochondrial hypervariable region I of humans and chimpanzees. Mol. Biol. Evol. 16: 1357–1368.
    OpenUrlAbstract
  7. ↵
    1. Felsenstein J.
    , 1981 Evolutionary trees from DNA sequence data: a maximum likelihood approach. J. Mol. Evol. 17: 368–376.
    OpenUrlCrossRefPubMedWeb of Science
  8. ↵
    1. Gilks W. R.,
    2. Richardson S.,
    3. Spiegelhalter D. J.
    (Editors), 1996 Markov Chain Monte Carlo in Practice. Chapman and Hall, London/New York.
  9. ↵
    1. Griffiths R. C.,
    2. Tavaré S.
    , 1994 Ancestral inference in population genetics. Stat. Sci. 9: 307–319.
    OpenUrlCrossRefWeb of Science
  10. ↵
    1. Hastings W. K.
    , 1970 Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57: 97–109.
    OpenUrlAbstract/FREE Full Text
  11. ↵
    1. Hudson R. R.
    , 1991 Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, Vol. 7, edited by Futuyma D., Antonovics J.. Oxford University Press, Oxford/New York/London.
  12. ↵
    1. Kingman J. F. C.
    , 1982 On the genealogy of large populations. J. Appl. Prob. 19A: 27–43.
  13. ↵
    1. Kuhner M. K.,
    2. Yamato J.,
    3. Felsenstein J.
    , 1995 Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140: 1421–1430.
    OpenUrlAbstract/FREE Full Text
  14. ↵
    1. Kuhner M. K.,
    2. Yamato J.,
    3. Felsenstein J.
    , 1998 Maximum likelihood estimation of population growth rates based on the coalescent. Genetics 149: 429–434.
    OpenUrlAbstract/FREE Full Text
  15. ↵
    1. Larget B.,
    2. Simon D. L.
    , 1999 Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol. Biol. Evol. 16: 750–759.
    OpenUrlCrossRefWeb of Science
  16. ↵
    1. Lundstrom R. S.,
    2. Tavaré S.,
    3. Ward R. H.
    , 1992a Estimating substitution rates from molecular data using the coalescent. Proc. Natl. Acad. Sci. USA 89: 5961–5965.
    OpenUrlAbstract/FREE Full Text
  17. ↵
    1. Lundstrom R. S.,
    2. Tavaré S.,
    3. Ward R. H.
    , 1992b Modeling the evolution of the human mitochondrial genome. Math. Biosci. 112: 319–336.
    OpenUrlCrossRefPubMedWeb of Science
  18. ↵
    1. Markovtsova L.,
    2. Marjoram P.,
    3. Tavaré S.
    , 2000 The age of a unique event polymorphism. Genetics 156: 401–409.
    OpenUrlAbstract/FREE Full Text
  19. ↵
    1. Mau R.,
    2. Newton M. A.,
    3. Larget B.
    , 1999 Bayesian phylogenetic inference via Markov chain Monte Carlo methods. Biometrics 55: 1–12.
    OpenUrlCrossRefPubMedWeb of Science
  20. ↵
    1. Metropolis N.,
    2. Rosenbluth A. W.,
    3. Rosenbluth M. N.,
    4. Teller A. H.,
    5. Teller E.
    , 1953 Equations of state calculations by fast computing machine. J. Chem. Phys. 21: 1087–1091.
    OpenUrlCrossRefWeb of Science
  21. ↵
    1. Meyer S.,
    2. Weiss G.,
    3. von Haeseler A.
    , 1999 Pattern of nucleotide substitution and rate heterogeneity in the hypervariable regions I and II of human mtDNA. Genetics 152: 1103–1110.
    OpenUrlAbstract/FREE Full Text
  22. ↵
    1. Misawa K.,
    2. Tajima F.
    , 1997 Estimation of the amount of DNA polymorphism when the neutral mutation rate varies among sites. Genetics 147: 1959–1964.
    OpenUrlAbstract/FREE Full Text
  23. ↵
    1. Sigurdardóttir S.,
    2. Helgason A.,
    3. Gulcher J. R.,
    4. Stefansson K.,
    5. Donnelly P.
    , 2000 The mutation rate in the human mtDNA control region. Am. J. Hum. Genet. 66: 1599–1609.
    OpenUrlCrossRefPubMedWeb of Science
  24. ↵
    1. Tajima F.
    , 1996 The amount of DNA polymorphism maintained in a finite population when the neutral mutation rate varies among sites. Genetics 143: 1457–1465.
    OpenUrlAbstract/FREE Full Text
  25. ↵
    1. Tamura K.,
    2. Nei M.
    , 1993 Estimation of the number of nucleotide substitutions in the control region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10: 512–526.
    OpenUrlAbstract
  26. ↵
    1. Tavaré S.,
    2. Balding D.,
    3. Griffiths R. C.,
    4. Donnelly P.
    , 1997 Inferring coalescence times from DNA sequence data. Genetics 145: 505–518.
    OpenUrlAbstract/FREE Full Text
  27. ↵
    1. Thorne J. L.,
    2. Kishino H.,
    3. Felsenstein J.
    , 1992 Inching towards reality: an improved likelihood model of sequence evolution. J. Mol. Evol. 34: 3–16.
    OpenUrlCrossRefPubMedWeb of Science
  28. ↵
    1. Wakeley J.
    , 1993 Substitution rate variation among sites in hypervariable region 1 of human mitochondrial DNA. J. Mol. Evol. 37: 613–623.
    OpenUrlPubMedWeb of Science
  29. ↵
    1. Ward R. H.,
    2. Frazier B. L.,
    3. Dew K.,
    4. Pääbo S.
    , 1991 Extensive mitochondrial diversity within a single Amerindian tribe. Proc. Natl. Acad. Sci. USA 88: 8720–8724.
    OpenUrlAbstract/FREE Full Text
  30. ↵
    1. Watterson G. A.
    , 1975 On the number of segregating sites in genetical models without recombination. Theor. Popul. Biol. 7: 256–276.
    OpenUrlCrossRefPubMedWeb of Science
  31. ↵
    1. Wilson I. J.,
    2. Balding D. J.
    , 1998 Genealogical inference from microsatellite data. Genetics 150: 499–510.
    OpenUrlAbstract/FREE Full Text
  32. ↵
    1. Yang Z.
    , 1996a Among-site rate variation and its impact on phylogenetic analyses. TREE 9: 367–372.
    OpenUrl
  33. ↵
    1. Yang Z.
    , 1996b Statistical properties of a DNA sample under the finite-sites model. Genetics 144: 1941–1950.
    OpenUrlAbstract/FREE Full Text
  34. ↵
    1. Yang Z.,
    2. Rannala B.
    , 1997 Bayesian phylogenetic inference using DNA sequences: a Markov chain Monte Carlo method. Mol. Biol. Evol. 14: 717–724.
    OpenUrlAbstract
View Abstract
Previous ArticleNext Article
Back to top

PUBLICATION INFORMATION

Volume 156 Issue 3, November 2000

Genetics: 156 (3)

ARTICLE CLASSIFICATION

INVESTIGATIONS
View this article with LENS
Email

Thank you for sharing this Genetics article.

NOTE: We request your email address only to inform the recipient that it was you who recommended this article, and that it is not junk mail. We do not retain these email addresses.

Enter multiple addresses on separate lines or separate them with commas.
The Effects of Rate Variation on Ancestral Inference in the Coalescent
(Your Name) has forwarded a page to you from Genetics
(Your Name) thought you would be interested in this article in Genetics.
Print
Alerts
Enter your email below to set up alert notifications for new article, or to manage your existing alerts.
SIGN UP OR SIGN IN WITH YOUR EMAIL
View PDF
Share

The Effects of Rate Variation on Ancestral Inference in the Coalescent

Lada Markovtsova, Paul Marjoram and Simon Tavaré
Genetics November 1, 2000 vol. 156 no. 3 1427-1436
Lada Markovtsova
* Department of Mathematics, University of Southern California, Los Angeles, California 90089-1113
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Paul Marjoram
† Biostatistics Division, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Simon Tavaré
* Department of Mathematics, University of Southern California, Los Angeles, California 90089-1113† Biostatistics Division, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033‡ Program in Molecular Biology, Department of Biological Sciences, University of Southern California, Los Angeles, California 90089-1340
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
del.icio.us logo Digg logo Reddit logo Twitter logo CiteULike logo Facebook logo Google logo Mendeley logo
Citation

The Effects of Rate Variation on Ancestral Inference in the Coalescent

Lada Markovtsova, Paul Marjoram and Simon Tavaré
Genetics November 1, 2000 vol. 156 no. 3 1427-1436
Lada Markovtsova
* Department of Mathematics, University of Southern California, Los Angeles, California 90089-1113
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Paul Marjoram
† Biostatistics Division, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site
Simon Tavaré
* Department of Mathematics, University of Southern California, Los Angeles, California 90089-1113† Biostatistics Division, Department of Preventive Medicine, University of Southern California, Los Angeles, California 90033‡ Program in Molecular Biology, Department of Biological Sciences, University of Southern California, Los Angeles, California 90089-1340
  • Find this author on Google Scholar
  • Find this author on PubMed
  • Search for this author on this site

Citation Manager Formats

  • BibTeX
  • Bookends
  • EasyBib
  • EndNote (tagged)
  • EndNote 8 (xml)
  • Medlars
  • Mendeley
  • Papers
  • RefWorks Tagged
  • Ref Manager
  • RIS
  • Zotero

Related Articles

Cited By

More in this TOC Section

  • The Fate of Deleterious Variants in a Barley Genomic Prediction Population
  • Comparative Genomics and Transcriptomics To Analyze Fruiting Body Development in Filamentous Ascomycetes
  • The Genomic Basis for Short-Term Evolution of Environmental Adaptation in Maize
Show more Investigations
  • Top
  • Article
    • Abstract
    • MATERIALS AND METHODS
    • RESULTS
    • DISCUSSION
    • Acknowledgments
    • APPENDIX
    • Footnotes
    • LITERATURE CITED
  • Figures & Data
  • Info & Metrics

GSA

The Genetics Society of America (GSA), founded in 1931, is the professional membership organization for scientific researchers and educators in the field of genetics. Our members work to advance knowledge in the basic mechanisms of inheritance, from the molecular to the population level.

Online ISSN: 1943-2631

  • For Authors
  • For Reviewers
  • For Subscribers
  • Submit a Manuscript
  • Editorial Board
  • Press Releases

SPPA Logo

GET CONNECTED

RSS  Subscribe with RSS.

email  Subscribe via email. Sign up to receive alert notifications of new articles.

  • Facebook
  • Twitter
  • YouTube
  • LinkedIn
  • Google Plus

Copyright © 2019 by the Genetics Society of America

  • About GENETICS
  • Terms of use
  • Advertising
  • Permissions
  • Contact us
  • International access