Abstract

Ease and accuracy of typing, together with high levels of polymorphism and widespread distribution in the genome, make microsatellite (or short tandem repeat) loci an attractive potential source of information about both population histories and evolutionary processes. However, microsatellite data are difficult to interpret, in particular because of the frequency of back-mutations. Stochastic models for the underlying genetic processes can be specified, but in the past they have been too complicated for direct analysis. Recent developments in stochastic simulation methodology now allow direct inference about both historical events, such as genealogical coalescence times, and evolutionary parameters, such as mutation rates. A feature of the Markov chain Monte Carlo (MCMC) algorithm that we propose here is that the likelihood computations are simplified by treating the (unknown) ancestral allelic states as auxiliary parameters. We illustrate the algorithm by analyzing microsatellite samples simulated under the model. Our results suggest that a single microsatellite usually does not provide enough information for useful inferences, but that several completely linked microsatellites can be informative about some aspects of genealogical history and evolutionary processes. We also reanalyze data from a previously published human Y chromosome microsatellite study, finding evidence for an effective population size for human Y chromosomes in the low thousands and a recent time since their most recent common ancestor: the 95% interval runs from ~15,000 to 130,000 years, with most likely values around 30,000 years.

PATTERNS of heritable genetic variation in contemporary samples can provide information both about parameters of evolutionary processes and about details of the genealogical history of the sample. Data from the male-specific part of the human Y chromosome, for example, can provide evidence both about mutation rates and about the number and reproductive behavior of human males. When combined with information from mitochondrial, autosomal, and X chromosome loci, additional insights about recent human evolutionary history may be obtained.

Extracting historical and evolutionary information from the genetic data is, however, difficult, due to the complex interaction of the underlying genetic processes. Traditionally, the interpretation of genetic samples has been based on summary statistics, such as heterozygosity or pairwise measures of identity (Nei 1987; Slatkin 1995). Such an approach can waste much of the information contained in the data (Felsenstein 1992). Intuitively, this is because pairwise measures of identity do not explicitly take account of the ancestral relationships underlying the data (Donnelly 1996). For microsatellite data, a network can be constructed (Cooperet al. 1996; Zerjalet al. 1997) that displays some of the structure of the data and suggests historical relationships, but does not provide a basis for assessing the uncertainty associated with inferences.

Recently, inferential methods that use the coalescent (Kingman 1982; Hudson 1991) to model explicitly the genealogical relationships underlying a genetic sample have become available (Griffiths and Tavaré 1994; Kuhneret al. 1995). Tavaré et al. (1997) present computational methods for genealogical inference under the assumptions of the coalescent model with infinitesites mutation, so that back-mutation is assumed to not occur. Microsatellite loci present a particular challenge to genealogical inference because these loci form an important source of highly polymorphic molecular genetic data (Jarne and Lagoda 1996), but the mutation process is such that back-mutations cannot reasonably be ignored. Nielsen (1997) developed Griffiths and Tavaré's (1994) algorithm to obtain maximum likelihood estimates of the scaled mutation parameter θ at microsatellite loci. The method was found to be computationally costly, even for a single locus, making accurate estimation difficult.

Here, we present a computationally tractable method for drawing inferences from microsatellite data, not only about θ but also about population histories. Very briefly, the method is based on the coalescent model of genealogy together with a ladder (or stepwise) model of microsatellite mutation and is implemented via a Markov chain Monte Carlo (MCMC) simulation algorithm.

In the following section we start by outlining the coalescent-with-ladder-mutation modeling framework and the process of drawing inferences from microsatellite data under this model. Following Tavaré et al. (1997), we adopt a fully probabilistic approach in which the uncertainty about an unknown parameter is expressed in terms of its probability distribution, given the data and the model. As well as making efficient use of all the available information, another important advantage of this approach is interpretability. For scalar parameters, either singly or in combination, inferences are naturally presented visually, in terms of probability density curves or surfaces. Even very complex unknowns, for example, the entire genealogy of the sample, can be described either in terms of probability density curves for important features, such as height or total branch length, or in terms of pictures of a sample of realizations from the probability distribution. A further advantage of a fully probabilistic analysis is flexibility. For example, inferences about the genealogical tree or about the effective population size, or both, can be obtained, according to the goals of the investigator.

In the recent past, fully probabilistic analyses of complex genetic processes were not computationally feasible. While computational cost remains an issue, advances in stochastic simulation methodology, such as MCMC algorithms, now allow problems of substantial size and complexity to be tackled. One important feature of the MCMC algorithm that we propose here is that the allelic type of the ancestral gene at each coalescence is assigned and successively updated according to its conditional probabilities. This simplifies the likelihood computations, which in turn allow flexibility in the choice of algorithms for stepping through the space of candidate trees.

The quality of genealogical inference that can be achieved under ideal circumstances is investigated using data simulated from the model. The method is then illustrated by reanalyzing the data of Cooper et al. (1996) for five microsatellite loci on the human Y chromosome. Because of the complexities of the genetic phenomena under study, we find that data from a single microsatellite locus do not suffice for accurate inferences, even when the modeling assumptions hold exactly. However, if data from a number of completely linked loci are available and the mutation process can be assumed to be the same at each locus, then much more precise inferences can be made.

THE MODEL

Genealogies and the coalescent: Interpreting genetic data requires an understanding of the patterns of shared ancestry among the genes in the sample. Currently, the most successful mathematical description of the genealogical processes underlying these patterns is provided by the so-called coalescent model.

The simplest version of the coalescent, the standard coalescent, assumes neutrality, random mating, and a constant, large population size. These assumptions can each be weakened to some extent and at some computational cost. However, the novelty of this article is the introduction of a model for microsatellite mutation, and to simplify the presentation of this development we work primarily with the standard coalescent.

Time in the coalescent model is measured in units of N generations, where N is the (fixed and large) population size. Tracing backward in time the lineages of each gene in a sample of size n, the time t1 until the first “coalescence” of two lineages at a common ancestor has the exponential distribution with mean 2/n(n − 1). Continuing backward in time, the time t2 between the first and the second coalescences has the exponential distribution with mean 2/(n − 1)(n − 2), and so forth, until the time tn − 1 between the final two coalescences (i.e., the time during which the sample has exactly two ancestors) has the exponential distribution with mean 1. Crucially, each of these times is independent of the other times. Hence the joint probability density of the coalescence times t1, …, tn − 1 is p(t1,,tn1)=i=1n1(n+1i2)exp((n+1i2)ti). (1) Because all pairs of lineages remaining at any time are equally likely to coalesce, p(t1, …, tn − 1) is proportional to the probability density, under the coalescent model, of any (labeled) genealogy with the coalescence times t1, …, tn − 1.

Equation 1 pertains to the predata coalescent, in which the sample size n is fixed but the allelic types are not yet observed. Once the allelic types are known, the coalescent probabilities are altered: evaluating the updated probabilities after observing a sample of microsatellite data is the primary goal of this article. A particular feature of the predata coalescent is that most of the lineages coalesce relatively quickly. (In other words, most branches are short.) On the other hand, the time period during which the tree has just two lineages is on average 1 coalescent unit, more than half the mean height of the coalescent tree. Another notable feature of the predata coalescent is the high variability in tree height: its standard deviation is about 60% of the mean height for typical values of n. See Donnelly and Tavaré (1995) and references therein for further details of the coalescent model.

Microsatellite mutations: Given the genealogy, mutations in the standard coalescent are assumed to occur independently and at constant rate θ/2, where θ=2Nμ and μ denotes the mutation rate per gene per generation. This means that the number of mutations in any section of the tree with total branch length t has the Poisson distribution with parameter tθ/2.

Although additional variation can be distinguished in some cases, microsatellite alleles are usually characterized by the copy number of the repeat motif. For the data of Cooper et al. (1996) discussed below, the repeat motif is the four-base sequence GATA.

Mutations of microsatellite alleles are thought to be due predominantly to polymerase slippage (Levinson and Gutman 1987; Dover 1996), which produces mutant alleles close in length to the original; the mutant alleles differ by whole copies of the repeat motif. Direct studies of mutations using a large number of parent-offspring triplets (Weber and Wong 1993) for autosomal microsatellites, and using pedigrees over larger numbers of generations for Y chromosome microsatellites (Heyeret al. 1997), show only single gains or losses of the GATA motif for 11 observed mutations. The mechanisms for gains of repeats through slippage may well differ from those for losses. There may also be evidence of between-species differences (Rubinszteinet al. 1996).

For autosomal DNA, rare, large mutational steps are thought to occur (Di Rienzo et al. 1994), and there is evidence from somatic mutations in cancer patients of heterogeneity between loci (Di Rienzo et al. 1998). These may be due to unequal crossing over, and so it remains uncertain whether or not they occur on the nonrecombining portion of the Y chromosome.

Perhaps the simplest plausible model for the changes in repeat number at each mutation event is the stepwise, or ladder, model (Ohta and Kimura 1973), under which the repeat number behaves like a simple random walk; i.e., it is equally likely to increase or decrease by 1 unit at each mutation, and changes of more than 1 unit do not occur. Although the ladder model may not describe fully the complexities of the microsatellite mutation process, it does incorporate “local” changes in allele length, while remaining tractable (Shriveret al. 1993; Valdeset al. 1993; Goldsteinet al. 1996). More detailed models of microsatellite mutation, such as the extended models of Di Rienzo et al. (1994) and Slatkin (1995), can readily be incorporated into the inferential framework described here.

STATISTICAL INFERENCE

The direct probability paradigm: We have a sample, D, of genes at a particular microsatellite locus, and a collection of unknown parameters, N, μ, and the tree parameters—the coalescence time and the two descendant nodes of each internal node—which we collectively denote Ƴ. We want to make valid and useful statements about N, μ, and Ƴ, given D and the modeling assumptions. In the direct probability, or Bayesian, paradigm of statistical inference, such statements are based on the probability distribution of N, μ, and Ƴ, conditional on D and the model. The required probability distribution is usually obtained by first specifying a distribution to model the background information before D is observed, then updating this prior distribution, via Bayes' rule, to incorporate the information conveyed by D.

The coalescent model specifies a probability distribution for Ƴ. This distribution can be thought of as a prior distribution for the genealogical tree, which should be updated in the light of the data D. Information about μ obtained from pedigree studies, such as those described above, can be summarized by a probability density curve that would usually be smooth and unimodal. Information about N is more difficult to specify, because N should be interpreted as an effective, rather than actual, population size. However, previous genetic studies, together with archaeological evidence, do give some idea of the effective population sizes for recent human evolution (Fullertonet al. 1994; Hammer 1995; Hardinget al. 1997). Corresponding probability distributions would normally be very diffuse, reflecting the imprecise background information, but would again be smooth and unimodal.

Although the probability distributions chosen to represent knowledge about N and μ are not unique, in many cases the postdata inferences will be insensitive to reasonable specifications. If this is not the case, investigation of the sensitivity will indicate the information needed to produce more reliable inferences. An alternative approach sometimes adopted is to undertake analyses conditional on particular values for N and μ. As noted by Brookfield (1997) and Tavaré et al. (1997), this approach can be seriously misleading, because information in the data that is informative about N or μ may be misinterpreted as informative about Ƴ. Repeating the analysis for various values of N and μ cannot overcome the problem; the only satisfactory solution is to let the data speak simultaneously for all the parameters, N, μ, and Ƴ.

MCMC methods: MCMC algorithms generate approximate random samples from a probability distribution Π by constructing a Markov chain whose equilibrium distribution is Π. Consecutive states of a Markov chain are usually correlated, but if the chain is run for a suitably long “burn-in” period, and then every ith state is recorded for some sufficiently large i, the resulting values will form an approximate random sample from Π. Features of Π can then be investigated by examining corresponding properties of this sample. For example the probability assigned by Π to any region of the parameter space can be approximated by the proportion of the sample values that lie in this region. For a further discussion see Besag et al. (1995) and Brooks (1998).

It is not usually possible to prove that a Markov chain has converged to its equilibrium distribution. However, a number of diagnostic checks that allow many cases of nonconvergence to be detected have been proposed. The chains implemented below have been checked using the suite of diagnostic tools contained in the software package CODA (Bestet al. 1995). In each case, several chains were started at widely spaced, “overdispersed” starting points, and no convergence problems were indicated.

The Metropolis-Hastings algorithm: One general method for producing a Markov chain with the required equilibrium distribution is the Metropolis-Hastings algorithm (Metropoliset al. 1953; Hastings 1970). Given a current location Θ in parameter space, where Θ stands for the parameter vector (N, μ, Ƴ), a new candidate location Θ′ is chosen from a proposal distribution q(Θ′|Θ). The new location is acccepted according to the value of u=q(ϴϴ)q(ϴϴ)p(Dϴ)p(Dϴ)π(ϴ)π(ϴ), (2) where p(D|Θ) denotes the likelihood, the probability of the data given the parameter vector Θ, and π(Θ) denotes the prior probability density of Θ.

If u > 1, the proposal Θ′ is accepted; otherwise it is accepted with probability u. If Θ′ is not accepted, the chain remains in its current state, Θ. The Markov chain constructed in this way converges to p(Θ|D), the probability distribution of the unknown parameters given the data, provided that q is such that the chain is aperiodic and irreducible, which means that it should be possible to get from any point in the state space to any other—given enough steps.

Although q is to a large extent arbitrary, in practice it must be chosen carefully to ensure that the chain has good mixing properties: i.e., from an arbitrary initial state, the chain reaches its equilibrium distribution reasonably quickly. The most important aspect of q is the choice of a candidate tree. The steps in tree space must usually be “local”—i.e., the candidate tree Ƴ′ must be similar to the current tree Ƴ—to ensure that a reasonable proportion of candidates are accepted. However, this requirement can conflict with the need for good mixing properties. Computational factors may also be important in the specification of q: it may be necessary to restrict q to a narrow class such that p(D|Θ′) can be calculated easily from p(D|Θ).

We overcome these potential problems with two innovations, discussed further below. First, we use an augmented parameter space, in which the allelic states at the internal nodes of the coalescent are regarded as unknown parameters. The resulting increase in the dimension of the parameter space is more than compensated by the simplification of the likelihood computations. Second, we implement a mechanism for generating candidate trees that allows “large” moves in tree space while retaining reasonable acceptance probabilities.

Computing the likelihood using data augmentation: One way to calculate the likelihood is via “pruning” (Felsenstein 1981). This algorithm proceeds recursively, starting at the terminal nodes, to evaluate conditional probabilities for the data given the allelic state at the root. The likelihood is then the sum of these conditional probabilities, weighted by the prior probability of each allele (a uniform prior is often chosen, in which case the weighting is invisible).

Although calculation of the likelihood via pruning is feasible for problems of moderate size, the fully probabilistic approach to inference adopted here permits much faster likelihood computations. The key idea is that the likelihood would be relatively easy to compute if the allelic states at the internal nodes of the genealogical tree were known. Then, the likelihood would be simply a product of terms, one for each branch of the tree. The term corresponding to a branch of length t, linking nodes whose allelic states differ by d ≥ 0, is vd(t,θ)=etθ2k=0(tθ4)2k+dk!(k+d)!=etθ2Id(tθ2), (3) in which Id denotes the dth-order modified Bessel function of the first kind (Gradshteyn and Ryzhik 1980). Although vd involves an infinite sum, in practice only the first few terms are required for an accurate approximation. This is because the value of k corresponds to the number of pairs of mutations in opposite directions, which is usually very small. Fast algorithms for computing Id(x) are widely available; see, for example, Press et al. (1992).

Equation 3 specifies the likelihood that would apply if the internal allelic states were known. Unfortunately, they are unknown. However, the simple likelihood formula based on (3) can nevertheless be exploited under the direct probability paradigm, because the internal allelic states can be regarded as additional parameters. The parameter space is therefore augmented: in addition to N, μ, and Ƴ, there is an allelic state for each internal node.

Increasing the dimension of the parameter space in this way is impractical in traditional statistical approaches. With direct probability inference based on an MCMC algorithm, however, there is no substantial difficulty. If the parameter space becomes very large, then convergence of the algorithm can become slow and/or difficult to assess, but this did not arise for the examples discussed below.

The augmented parameter space allows great flexibility in the choice of proposal distributions q. We use a very simple method for generating candidate trees. Basically, the method involves removing a branch from the tree at random and adding it anywhere on the tree, but locations close to “similar” allelic types are preferentially chosen. In this way large jumps in tree space are possible, while acceptance rates remain sufficiently high. Before describing the branch-swapping algorithm in more detail, we introduce some notation: for a node x, we write t(x) for its coalescence time [t(x) = 0 if x is a terminal (data) node], while α(x) denotes the allelic state at node x.

The branch-swapping algorithm: Choose an internal node x at random, except that the root may not be chosen. We then attempt to move the parent of x to a new location in the tree. To this end, we choose a node y above which to attach the parent of x. For this to be possible, either y is the root or t(z) > t(x), where z denotes the parent of y. Choosing y at random among nodes satisfying this condition is likely to be unsatisfactory: if α(y) is very different from α(x), the candidate tree will almost certainly be rejected. To avoid an excessive rejection rate, the probability of a node being selected is set to be a decreasing function of |α(y) − α(x)|. Specifically, we assign P(yx)11+α(x)α(y). (4)For example, nodes whose allelic state differs from that of x by one are half as likely to be chosen as nodes with the same allelic state. To simplify the computation, we set P(y|x) = 0 when y is the parent of x. The distribution specified by (4) is somewhat arbitrary: there exist many other suitable distributions, but this choice seems to work well in practice.

Once y has been chosen, if it is not the root then the parent of x is inserted at a point chosen uniformly between max{t(y), t(x)} and t(z). If y is the root, the parent of x is located at a time chosen from the standard exponential distribution above the root (and thus becomes the new root). Finally, a new allelic state for the parent of x is chosen according to a discretized normal distribution, with mean (α(x) + α(y))/2 and standard deviation (|α(x) − α(y)| + 1)/4. Again, this choice is somewhat arbitrary but seems to lead to both reasonable acceptance rates and good mixing. The chain produced using this proposal distribution is clearly aperiodic and is irreducible because we can recreate any tree in, at most, n − 1 steps (where n is the sample size) by successively moving terminal nodes, one at a time, to their position on the new tree, simultaneously changing the coalescence time and allelic state of the branch point.

Other updating algorithms: Although the branch swapping algorithm described above leads to acceptable convergence properties, we found that convergence rates could be improved by including between each branch-swapping step another updating algorithm that attempted to alter branch lengths only, not the tree topology. The two scalar parameters, N and μ, are updated using a uniform probability density on a logarithmic scale, centered on the current value, and with length tailored to optimize convergence.

RESULTS

Data simulated under the model: Shown at the top of Figure 1 is a genealogical tree, labeled “true,” with a microsatellite copy number indicated at each of the n = 10 terminal nodes. This tree was simulated from the coalescent-with-ladder-mutation model, with θ = 5. The height of the tree, T, is 1.25 coalescent units, which is less than 1.54, the median height of the predata coalescent when n = 10, but is very close to the modal height. The value of L, the total branch length, is 4.82, which again is less than the median of 5.21 for the pre-data coalescent when n = 10, but very close to the modal value.

Figure 1.

The top tree (“true”) is simulated from the coalescent-with-ladder-mutation model with θ = 5. The other four trees are simulated from the postdata distribution given the allelic data of the true tree. These trees are samples numbered 2000, 4000, 6000, and 8000 from the MCMC run corresponding to row 1 of Table 1.

Note that, of the four genes with allelic type 6 in the true tree of Figure 1, only one pair has very recent shared ancestry. In fact, one of the other 6-alleles has no ancestry in common with this pair beyond the root of the tree, whereas its nearest relative in the sample is a 3-allele. Clearly, accurate reconstruction of the true genealogical tree from only the allelic-type data is unachievable here, although some information about key parameters, such as θ, T, and L is available.

Application of MCMC algorithm: What can be inferred from the allelic types shown on the true tree is suggested by the four other trees shown in Figure 1, which were simulated from the postdata coalescent. A uniform prior was assumed for θ, and the value realized in each simulation is shown against each tree. Not surprisingly, the simulated trees bear little resemblance to the true tree: there is not enough information at a single microsatellite locus to reconstruct the tree with any accuracy when θ is unknown.

View this table:
TABLE 1

Inferences for θ, T, and L from a single tree

More detailed information about the inferences for θ, T, and L that can be drawn from the data is provided by the first row of Table 1, which gives the postdata median and 95% probability intervals for these parameters. The accuracy of inferences about θ is very poor, with a 95% interval of (2.9, 95), compared with a true value of 5. At first sight, the situation looks better for T: the median height of postdata trees is 1.32, close to the correct value of 1.25. However, the 95% interval is wide: (0.42, 4.0). Moreover, the 95% interval for the height of the predata coalescent with n = 10 is (0.50, 4.5), so that the postdata 95% interval for T is not much narrower than the corresponding predata interval. Similarly for L, the postdata 95% interval is (1.8, 10), compared with a predata interval of (2.2, 12).

The effect of sample size: The true tree of Figure 1 is a sub-tree of a tree with n = 40 terminal nodes (full tree not shown). The height T of the full tree is 1.25, the same as that for the n = 10 sub-tree, but L is now increased to 7.15. The second row of Table 1 summarizes the quality of inference attainable from the larger sample size. For θ, the width of the 95% interval has decreased substantially from 92 to 34. However, there has been only slight improvement in inference about T and L. This may be because the additional data convey information primarily about the part of the tree near the terminal nodes, rather than near the root.

The effect of additional, linked loci: We have seen that there is only limited information about θ, T, and L at a single microsatellite, even when the modeling assumptions hold exactly. But is it perhaps possible to obtain good inferences from several completely linked loci? Such data arise, for example, from the nonrecombining part of the Y chromosome. The true tree of Figure 2 is the same as that of Figure 1 (N.B. different time scale), but in addition to the allelic data of Figure 1, a further four independent simulations of the ladder mutation process are given, each with θ = 5. This simulation mimics data from five completely linked microsatellite loci with a common value of θ. Once again, four trees are shown simulated from the coalescent based on the five-locus data, with a completely flat predata distribution for θ.

As expected, the trees simulated from the postdata coalescent are, with information from l = 5 loci, more similar to the original tree than in the one-locus case. Nevertheless, none of the simulations comes close to reconstructing the original tree.

Summary statistics for the n = 10, l = 5, case are given in row 3 of Table 1. Even with five loci, the post data uncertainty about T and L remain large, although inference about θ is now much improved. Row 4 of Table 1 quantifies a further improvement when n is increased to 40 (allelic data not shown).

Average performance over many trees: Each row of Table 1 corresponds to only one realization of a genealogical tree and allelic data. To obtain a better overall appreciation of the quality of inference achievable from microsatellite data, it is useful to assess average performance over many tree and mutation simulations. Care is needed to effectively summarize such a large quantity of simulation results, in part because the uncertainty in inference about θ, T, and L tends to increase with the magnitude of the true value.

For each of θ, T, and L, Figure 3 shows both the mean absolute deviation (MAD) of the MCMC output values from the true value, and the length of the 95% probability interval (PIL) calculated from the MCMC run. For each combination of θ, n and l, the height of the bar gives an average of results from 140 datasets simulated from the coalescent-with-ladder-mutation model and expressed as a proportion of the true value. In ~5% of MCMC runs, the value of θ lay outside the 95% probability interval, and similarly for T and L, suggesting that the MCMC runs had adequately converged.

Figure 2.

The true tree (top) is the same as that of Figure 1, but the results of four additional, independent simulations of the mutation process are also shown, mimicking data from five completely linked loci, each having the same mutation mechanism and with θ = 5. The other four trees are simulated from the postdata distribution given all five data sets. These trees are samples numbered 2000, 4000, 6000, and 8000 from the MCMC run corresponding to row 3 of Table 1.

The poor quality of inferences about θ when l = 1, noted for the particular tree of Figure 1, remains evident on averaging over many trees, especially for n = 10. In the latter case the MAD of θ is 3 to 5 times the true value and the PIL as much as 20 times the true value. Inferences become somewhat more precise as θ increases and markedly better as n and l increase.

Increases in n and l are less effective in improving the precision of T, with the improvement from worst to best cases only ~20% for both MAD and PIL when θ = 1, rising to 30% for larger values of θ. The same patterns are shown as for θ, with precision increasing with θ, n, and l. Inferences about L are harder to interpret because the true value increases with n. In the predata coalescent, the standard deviation of L decreases relative to its mean as n increases. Hence measures of uncertainty, expressed as a proportion of the true length, tend to decrease with increasing n.

A limited number of simulations were performed with n = 200, l = 5, and θ = 5. Confidence in θ increased slightly with average values of MAD and PIL decreasing to 0.18 and 0.60, respectively. Only slight improvements to inferences on T were observed, but the precision of L increased further with n = 200, giving a MAD of 0.19 and a PIL of 0.70.

Human Y chromosome microsatellite data: Human mitochondrial DNA sequences have been interpreted as supporting the theory—dubbed “Out of Africa”—that modern humans are descendants of a small group that lived in Africa perhaps about 200,000 years ago and subsequently spread throughout the world, eliminating most or all other extant human lineages. However, inferences about the time since the most recent common ancestor (TMRCA) of the sample generally underestimate the amount of variability (Tavaréet al. 1997), and geographical location of the MRCA is problematic and contentious (Templeton 1993).

Patterns observed from autosomal DNA seem somewhat different. For example, β-globin data suggest a much longer TMRCA (Hardinget al. 1997). These differing interpretations are not necessarily in conflict because autosomal and mitochondrial DNA reflect different aspects of human history, and the results may be affected by selection effects. Recombination of autosomal DNA sequences may also lead to some problems for inference.

A third potential source of evidence, reflecting a further aspect of human prehistory, comes from genetic variation on the human Y chromosome. Recently, a number of polymorphic microsatellites have become available for population surveys (Cooperet al. 1996; Dekaet al. 1996; Ruiz Linareset al. 1996; Hammeret al. 1997; Zerjalet al. 1997).

A large effort has been concentrated on estimating the TMRCA of a sample of genes drawn from a locus—in this case the entire nonrecombining portion of the human Y chromosome. While the TMRCA may not be the most important time of human history (Brookfield 1997), it is central to interpreting genetic samples and has been investigated by several authors (Goldsteinet al. 1996; Tavaréet al. 1997). Furthermore, the method proposed here allows simultaneous inferences about the TMRCA (the height of the tree) and, for example, the (effective) population size, N.

Data: We consider the data of Cooper et al. (1996), which consist of the genotypes of 212 individuals at five Y chromosome microsatellite loci from East Anglia (UK), Sardinia, and Nigeria, together with a linked Alu insert. Since we are concerned here with inference from microsatellite haplotypes, we did not include the Alu insert in our analyses, although it could readily have been incorporated by means of a further augmentation of the parameter space.

Figure 3.

Average mean absolute deviation (MAD), left, and probability interval length (PIL), right, for θ (top), T (middle), and L (bottom), each scaled by their respective true values. All values are averages over MCMC-generated samples of size 1000 (i.e., 1.6 × 106 branch-swapping steps) from each of 140 datasets simulated under the coalescent-with-ladder-mutation model. Bars correspond to single locus with sample size of 10 (white) and 40 (light gray), and five linked loci with a sample size of 10 (dark gray) and 40 (black).

Two datasets were used: the complete set of Nigerian and Sardinian haplotypes, together with the initial sample of 22 East Anglians (dataset NSE), and all 174 East Anglian haplotypes (dataset EA). The first of these sets gives approximately equal weighting to the three regions; the second provides a larger sample from a single location. Although the coalescent-with-ladder-mutation model is unlikely to be exactly appropriate for these datasets, inferences based on this model can nevertheless be informative. It is of particular interest to see what aspects of the postdata distributions differ substantially from the corresponding predata distributions.

Priors: Under the standard coalescent, no information about the values N and μ can be obtained from the allelic data except through their product, θ = 2Nμ. Postdata inferences about θ are therefore more robust than inferences about either N or μ separately. It is useful to distinguish the two because information about them can be obtained from other sources, particularly in the case of μ. Heyer et al. (1997) used three observed mutations in 1491 meioses to obtain a point estimate of mutation rate of 0.2% per meiosis. Assuming a Poisson distribution for the number of mutations, and using a standard exponential pre-prior, the distribution of μ based on these data, which we implemented as the prior distribution for our analyses, is gamma with mode 3/1492 and mean 4/1492. Inferences about the TMRCA are insensitive to this assumption: a uniform prior for μ leads to very similar conclusions (results not shown).

Figure 4.

Posterior density curves for NSE data, together with corresponding prior density curves. See Table 2 legend for details of data and prior distributions. The prior for μ is shown as the dotted line in the top left. Elsewhere, the dotted line and the dotted and dashed line correspond to the low- and high-variance priors for N, respectively. Solid and dashed lines show the postdata probability density assuming the low- and high-variance priors, respectively. All postdata densities are based on 20,000 output values.

Tavaré et al. (1997) used two prior distributions for N: a gamma with mean 5000 and shape parameter 5, and a lognormal with parameters 9 and 1. Both these distributions are centered at roughly 5000 individuals, but the gamma is concentrated between ~1000 and 10,000, whereas the lognormal is more diffuse and positively skew, giving some support to values in excess of 20,000. We also adopt these predata distributions for N, referring to them (as well as the implied priors for θ and the TMRCA) as the “low-variance” and “high-variance” priors, respectively.

Implementation of MCMC algorithm: Forty iterations of the branch-swapping algorithm were effected between every attempt to update N and μ, and there were 100 such attempts between samples. After discarding the first 2000 samples (the burn-in), 10,000 samples were retained. Two such sets of samples were taken, with different starting trees, for each prior and dataset combination. The posterior distributions for μ, N, T, and L approximated from the two MCMC runs were checked and in each case found to be effectively indistinguishable. They were then combined to give a total of 20,000 samples.

Results are given in Figure 4 (probability density curves for dataset NSE; those for dataset EA are very similar and are not shown) and Table 2 (summary statistics for both datasets). For dataset NSE, a number of individual trees sampled from the MCMC output were examined in detail. Although there was some relation between geographic location of haplotype and tree structure, this was restricted to recent nodes. Clades of more than six haplotypes all from a single location were rare, and haplotypes from all locations were typically represented on both sides of the root node.

Inferences about θ: Figure 4 (top right) shows, for dataset NSE, the two postdata probability density curves for θ = 2Nμ, as well as the corresponding predata curves. The postdata curves are very similar, despite the differences in the two priors. For example, the postdata medians are both around 11, compared with prior medians of around 22 and 39, respectively, for the low- and high-variance priors (Table 2). Moreover, the two postdata 95% probability intervals are practically indistinguishable: (7.7, 17.0) and (7.6, 16.4). For dataset EA, the postdata medians and upper 95% interval limit are both a little lower (Table 2).

As expected, the postdata distributions for the two components of θ, the mutation rate, μ, and the population size, N, are negatively correlated, and each is more strongly affected by the prior than is the postdata distribution of θ. Figure 4 (top left) shows the two post-NSE-data density curves for μ, together with the predata curves. Both posterior curves are somewhat sharper than the prior, with diminished support for high values of μ. The postdata density curves for N (Figure 4, bottom left) are very similar, despite the substantial difference in the prior curves. The post-EA-data distributions are very similar to those for NSE. In all cases they reflect diminished support for high values of N. The postdata medians are ~3000, with most likely values between ~1500 and 8000 for both datasets. Although the limitations of the modeling assumptions require that caution be attached to the interpretation of a particular analysis, the similarity of the postdata distributions provides some confidence for the conclusion that the Y chromosome effective population size during recent human history is a few thousands, consistent with the results of previous analyses.

Inferences about the TMRCA: An estimate for the number of generations since the MRCA of the sample can be made by multiplying together the postdata values for N and T. Further multiplication by the generation time G gives a posterior density curve for the number of years since the MRCA. Figure 4 (bottom right) shows both the pre- and post-NSE-data density curves, assuming G = 20. This value allows comparison with the results of Tavaré et al. (1997), but may be too low: alternative values can be implemented simply by proportional adjustment.

View this table:
TABLE 2

Summary of human Y chromosome analyses

The two postdata curves are very similar and reflect a very marked shift of support toward smaller values compared with the predata distributions. For example, the postdata distributions are sharply peaked at values of ~30 kyr, a value that has little a priori support. Most likely postdata values are between ~10 and 100 kyr, while values >150 kyr have probabilities of ~1.5 and 2% for the low- and high-variance priors, respectively. For the much larger EA sample, drawn from a single geographic location, postdata distributions are shifted slightly downward compared with the post-NSE-data distributions (Table 2).

The posterior distributions for (scaled) tree height, T, have medians of ~0.7 in all cases compared with prior medians of ~1.7. The scaled lengths, L, are not reduced to the same extent. This may be evidence for “radial”-type trees, suggesting some recent population growth. Nevertheless, the posterior values are also consistent with the standard coalescent model.

DISCUSSION

We have developed a methodology for carrying out fully probabilistic analyses of microsatellite samples, which opens up possibilities for inferences much more detailed than those previously possible. For example, the implications of the data for the scaled mutation parameter, θ, and the height and shape of the genealogical tree can be assessed simultaneously. One key feature of our direct probability analysis is that likelihood calculations are greatly simplified by augmenting the parameter space to include the internal allelic states. This innovation permits great flexibility in algorithms for exploring the space of possible trees, as well as in the range of modeling assumptions that become practicable. Here, we have focused on perhaps the simplest, plausible modeling framework: the coalescent-with-ladder-mutation.

Results from simulation studies, in which the modeling assumptions are known to hold exactly, indicate that accurate inference about θ requires sampling several, tightly linked loci: a single locus provides little information, even when the sample size is large. With five loci, good quality inferences about θ are achievable, but those for other aspects of the tree, such as T and L, remain far from precise.

Turning to analyses of published data, although our modeling assumptions are, inevitably, not fully realistic, our results provide support both for an effective population size of human Y chromosomes in the low thousands and for relatively short times (point estimates around 30 kyr) since the most recent common ancestor. These conclusions in turn support the theory that extant human males have spread relatively recently from a small group. In addition, the relatively small value for effective population size may reflect high between-male variance in reproductive success. The range of supported values for θ is ~8 to 16. Improved predata estimates for the mutation rate μ would enable more accurate inference about the population size N and the TMRCA. Inferences from the two datasets were very similar, despite the fact that one was geographically dispersed and the other geographically homogeneous and much larger. Additionally, there is little evidence of “clumping” of haplotypes from the same region, except in the very recent past from posterior trees.

Values of the TMRCA supported by our analyses are low compared both with times suggested by nongenetic evidence and with published studies based on autosomal DNA and mitochondria (Templeton 1993; Hardinget al. 1997). They are, however, broadly consistent with the analysis of Tavaré et al. (1997), based on Y chromosome sequence data and the coalescent-with-infinite-sites model. [Our 95% intervals are narrower than those of Tavaré et al. (1997), reflecting more information from five microsatellites than from >15 kb of sequence, despite the limitations imposed by recurrent mutations.] Wide variation between Y chromosome, mtDNA, and autosomal TMRCAs are plausible for purely stochastic reasons. Additional factors not accounted for in the model may also explain the difference: male generation time may be greater than female, and selective sweeps may play a large part in Y chromosome evolution.

Our analyses were based on males from three locations and may not represent all human Y chromosome history. Cooper et al. (1996) estimated the timing of population splits using a maximum divergence approach. This gives an estimate of μT, where T is the TMRCA in generations. Their estimates of μT were 11.4 for the whole data set and 7.75 for EA. These give point estimates for the TMRCA of 110 kyr for the whole dataset and 77 kyr for the EA dataset. Estimates of uncertainty are not available with this method. These values are toward the upper tails of our corresponding posterior distributions. Further, under our analyses the data suggest values for the TMRCA for the EA sample only slightly lower than those for the NSE sample. They also suggest little increase in inferential precision with increasing sample size, in contrast to the conclusions of the original authors.

Producing the first row of Table 1 required about about 50 min on a desktop workstation—equivalent to 320,000 attempted tree rearrangements and 16,000 attempted changes to θ per minute. Increasing the sample size and number of loci increases the time required. To perform the same number of steps on a tree with five loci and a sample size of 200 takes ~400 min. Computational resources should not provide a barrier to extending our analyses to incorporate more sophisticated modeling assumptions. These might include more detailed models for population growth and structure and for microsatellite mutation. Although such developments are well worth pursuing, it may turn out that the primary insights attainable from the data are apparent from the simpler analyses presented here.

Acknowledgments

We thank Mark Beaumont, Richard Nichols, and Bill Amos for helpful discussions and comments, and the latter also for drawing our attention to the dataset. This work was supported in part by the Stochastic Modeling in Science and Technology initiative of the United Kingdom Engineering and Physical Sciences Research Council (Grant no. K72599).

Footnotes

  • Communicating editor: R. R. Hudson

  • Received November 21, 1997.
  • Accepted June 3, 1998.

LITERATURE CITED

View Abstract