Abstract
Bayesian posterior distributions are obtained for the time to the most recent common ancestor (MRCA) for a nonrecombining segment of DNA (such as the nonpseudoautosomal arm of the Y chromosome or the mitochondrial genome) for two individuals given that they match at k out of n scored markers. We argue that the distribution of the time t to the MRCA is the most natural measure of relatedness for such nonrecombining regions. Both an infinitealleles (no recurring mutants) and stepwise mutation model are examined, and these agree well when n is moderate to large and k/n is close to one. As expected, the infinite alleles model underestimates t relative to the stepwise model. Using a modest number (20) of microsatellite markers is sufficient to obtain reasonably precise estimates of t for individuals separated by 200 or less generations. Hence, the multilocus haplotypes of two individuals can be used not only to date very deep ancestry but also rather recent ancestry as well. Finally, our results have forensic implications in that a complete match at all markers between a suspect and a sample excludes only a modest subset of the population unless a very large number of markers (>500 microsatellites) are used.
MOLECULAR marker information has proven an invaluable tool for assessing the degree of relatedness between individuals. To date, most uses of marker information have been concerned with zero, one or two, or deepgeneration relatives. By zerogeneration, we mean matching/rejecting a forensic sample and a suspect. One or twogeneration assessment includes paternity testing and assessing the degree of relatedness between individuals in natural populations. Typically, these tests have very low power for detecting relatives more distant than sibs and first cousins (Queller and Goodnight 1989; Lynch and Ritland 1999). Finally, human geneticists have been very successful in using marker information to assess very deep relationships, on the order of hundreds (or more typically thousands) of generations. Many of these deep relationship studies have used the haplotypes of nonrecombining chromosomes, such as the nonpseudoautosomal arm of the Y (e.g., Hammer 1995; Dekaet al. 1996; Skoreckiet al. 1997; Bianchiet al. 1998; Kittleset al. 1998; Wilson and Balding 1998; Thomaset al. 2000) and mitochrondial DNA (e.g., Torroniet al. 1994; Merriwetheret al. 1995; Forsteret al. 1996; Brownet al. 1998; Stone and Stoneking 1998; Torroniet al. 1998). Here we show how haplotype information can also be used to estimate the age of the common ancestor for individuals sharing an intermediate ancestry (tens to hundreds of generations) with reasonable precision. Since our comparison is restricted to the time to the common ancestor for a particular pair of individuals of interest, the fine details of the population history and structure do not enter into the analysis, other than very weakly through the mean of the assumed prior (as discussed below).
For unlinked markers, the product rule (multiplying singlelocus genotype probabilities together to obtain a multilocus genotype probability) applied to highly polymorphic loci allows just a few (510) unlinked markers to be quite sufficient for identifying individuals that share a common relative in the last generation (such as parentoffspring or sibs). However, because of recombination, unlinked markers have very weak power for distinguishing individuals sharing deeper common ancestry. While there is a growing body of literature on estimating relatedness of two individuals given autosomal marker information (Thompson 1975; Queller and Goodnight 1989; Blouinet al. 1996; Ritland 1996; Marshallet al. 1998; Lynch and Ritland 1999), it is less clear how to proceed when using markers from nonrecombining DNAs. The product rule does not hold for such regions, as the markers are inherited as a single block.
The key to assessing the amount of relatedness using markers on a nonrecombining chromosome is that any two individuals (indeed, the entire population) will have a most recent common ancestor for that region. This follows from coalescence theory (Hudson 1991; Donnelly and Tavaré 1995), which shows that in a demographically stable population the expected time back to this most recent common ancestor (MRCA) follows a geometric distribution with parameter λ = 1/N_{e}, the inverse of the appropriate measure of effective population size [the effective numbers N_{m} and N_{f} of males and females for the Y and mitochondrial (mt)DNAs, respectively]. Using marker information, we can estimate the time t to the MRCA, and it is the distribution of t that provides a natural metric for describing the relatedness (at least for that region) between two individuals. It is important to stress this difference between the MRCA of a region of DNA and the MRCA for two individuals. Two individuals can easily share an ancestor that is more recent than their MRCA for a particular DNA region. Hence, estimates of the time to the MRCA using a particular DNA region provide an upper bound on the time back to the most recent common ancestor shared by two individuals.
Here we develop a Bayesian estimator for t, obtaining the complete posterior distribution for the time t to the MRCA. We assume that n markers are scored on the nonrecombining chromosome of interest (either the Y or mtDNA) and that we observe matches in allelic state at k of these. We start by assuming the infinite alleles model, where each mutation is assumed to be unique. We then modify our results by assuming a (symmetric singlestep) stepwise mutational model, which is a more appropriate descriptor for microsatellite markers. As we show, when k/n is close to one and n moderate to large, the two different mutational models give essentially the same distribution for t.
THE INFINITE ALLELES MODEL
Suppose we score the allelic states at n defined markers on a nonrecombining segment of DNA for two individuals and we wish to estimate the time back to their MRCA (for this segment). We first assume for each marker that (at most) only a single mutation has occurred over both lineages leading from the MRCA to the two individuals being considered. This is not an unreasonable assumption if the individuals match at most markers. We refer to this model as the infinite alleles model, as our development is also exact for the situation where each mutation is unique so that matches occur only when the marker locus in both lines has not mutated. Finally, we assume that the markers are exchangeable in the sense that the pergeneration mutation rate μ is the same for each locus. We relax these assumptions below.
Let q(t) be the probability of a match (at any given single marker) between two individuals with a most recent common ancestor t generations ago. The number of matches (k) out of n loci follows a binomial distribution, with
From Equations 1 and 2a, the resulting likelihood for the time t back to the MRCA given that we observe k out of n matches is
BAYESIAN POSTERIOR DISTRIBUTIONS FOR TIME TO MRCA
While the MLE describes one feature of the distribution of t (the mode), the most complete picture is given by the full posterior distribution of t, which can be obtained by a Bayesian analysis (e.g., Lee 1997). Such an analysis proceeds from Bayes’ theorem, with the posterior distribution p(tk) being proportional to the product of a prior distribution p(t) for t and a likelihood L(tn, k) given the data (k out of n matches),
The main objection to a Bayesian analysis raised by nonBayesians is that the choice of a prior is often very subjective and, as such, this can greatly bias the posterior. For the time back to the MRCA, an appropriate prior naturally follows from standard coalescence theory, as the expected time back to a MRCA under pure drift in an effective population size of N_{e} follows the geometric distribution with success parameter
Treating time as continuous, the geometric prior is equivalent to using an exponential distribution with hyperparameter
Recalling Equation 3, the resulting posterior distribution becomes
Returning to the posterior distribution, the normalizing constant is easily computed by expanding the (1  e^{2μ}^{t})^{n}^{}^{k} term, noting that we can express the function being integrated as
With a flat prior (λ = 0) the normalizing term further simplifies to
For zero marker mismatches (k = n), the posterior is simply an exponential distribution with parameter λ + 2nμ,
Likewise, the cumulative probability distribution for the time back to the MRCA is just
The posterior distributions with one or more mismatches also follow from Equation 12. For example, for one (k = n  1) and two (k = n  2) mismatches, the posteriors are
Figures 1, 2, 3 plot the posterior distributions corresponding to different numbers of matches (k) for n = 5, 10, 20, 50, and 100 markers under the assumption that μ = 1/500 and there is a flat prior (λ = 0). Provided that N_{e} ≫ 250, the results are the same for any other prior using λ = 1/N_{e}.
Table 1 summarizes the mean, standard deviation (SD), and mode (the MLE) for the posterior distributions as well as the 2.5, 50, 90, and 97.5% cutoff values. Note that a 95% credible region for t is given by the 2.5 and 97.5% values. As expected, the distribution of t is highly skewed, as mode < median < mean. The distribution becomes increasingly skewed to the right as the number of mismatches increases, which is reflected in not only an increase in the mean but also in the variance. However, note that the mean/SD ratio declines with increasing numbers of mismatches, so that the coefficient of variation declines as k decreases.
Finally, note that the resolution for t offered by using n = 5 markers is very poor, but rather fine precision is offered by using 100 markers. While scoring the latter number of markers may be unrealistic, using 20 markers is both feasible as well as offering reasonable precision.
DIFFERENTIAL MUTATION RATES
As our knowledge of the parameters associated with the mutational process continues to improve, it is likely that we may find significant differences in the mutation rates at different markers. Fortunately, it is straightforward to modify the likelihood function (Equation 3) to take this into account. Suppose n markers are examined, generating the matching data x_{1},..., x_{n}, where the x_{i} are coded as
For the simplest case of no mismatches (k = n), the posterior for the time to the MRCA becomes
CORRECTING FOR MULTIPLE HITS: THE STEPWISE MUTATION MODEL
Microsatellites, given their higher mutation rates compared to single nucleotide polymorphisms (SNPs), are clearly the marker of choice for estimating the time to MRCA when the individuals are assumed to be at least modestly related. As microsatellite “alleles” correspond to different lengths of the repeat unit in the microsatellite array, the infinite alleles model assumed previously is not appropriate if multiple mutations are expected in any given marker. Since mutations change the number of repeats (and hence the size) of an array, two (or more) mutations can recover the initial state found in the MRCA. Likewise, parallel mutations in both lineages leading from the MRCA can also lead to the two individuals sharing the same allelic state, even though mutations have occurred. Thus, using an infinite alleles model for microsatellites will tend to underestimate the time to the MRCA. To examine the severity of this underestimation, we consider the divergence, assuming a stepwise mutational model (SMM) assuming equal probabilities of an up (increase array size by one) or down (decrease array size by one) move (the symmetric singlestep SMM). The roots of the SMM trace from Ohta and Kimura’s (1973) model of charge differences in electrophoretically scored proteins. A number of workers have shown this model to be a good fit for microsatellites in both indirect studies examining the distribution of array sizes in natural populations (Edwardset al. 1992; Shriveret al. 1993; Valdeset al. 1993; Di Rienzoet al. 1994) and direct studies looking at actual mutations arising in pedigrees (Brinkmannet al. 1998; Kayseret al. 2000). In the latter two studies, the vast majority (35 out of 37) of new mutations were single step, while the remaining (2 out of 37) were two step.
Denote the allelic state (array size) in the MRCA as state 0, and let X_{t} denote the array size at time t. As before we assume a pergeneration mutation rate of μ. Under this model, the transition probabilities between states become
To apply this model, we need to compute q(t), the probability of a match between two lineages sharing a MRCA t generations ago. A little thought shows that for the onestep model allelic states in two lineages can only match if an even number (2M) of mutations have occurred. The appendix shows that
Since the probability of 2M total mutations along both lineages assuming t generations back to the MRCA is given by a Poisson distribution with parameter 2tμ, we have the probability of a match conditioned on t as
Figure 4 compares the probability of a match as a function of τ for the infinite allele and onestep models. The match probabilities under both models are rather similar for τ < 0.5 (125 generations with μ = 1/500), but they diverge rather quickly after that. Note that even after 20τ generations (5000 generations with μ = 1/500), the match probability under the stepwise model is still nontrivial (0.09), reflecting the very slow decrease in the probability of a match as the total number of mutations increases. By contrast, the corresponding match probability is essentially zero (2.06 × 10^{9}) under the infinite alleles model, as a single mutation causes the allelic states to diverge and subsequent back (and/or parallel) mutations are not allowed.
As shown in Table 2, both the mean and variance (measured by the standard deviation) of the distribution for time to the MRCA are larger under the stepwise model than the infinite alleles model. This is certainly expected, as the effect of the stepwise mutational model is to allow for a match following two (or more) mutations, while a match is assumed to never recover following a mutation under the infinite alleles model. Table 2 gives the ratios of the means and standard deviations for the distribution of t under the stepwise model compared to the same statistic under the infinite alleles model. When the number of markers is large (20 or greater) and the number of mismatches is small, these ratios are close to one. The ratio of standard deviations is always larger than the means ratio, reflecting the longer tail (relative to that for the infinite alleles model) generated under the stepwise model. As the number of mismatches increases, the mean and SD ratios increase, reflecting increasingly larger probabilities for t under the stepwise relative to the infinite alleles model.
Also note from Table 2 that for the same fraction of observed matches (say 4/5, 8/10, 16/20, 40/50, and 80/100), the ratios of the means and standard errors under the stepwise vs. the infinite alleles model decrease toward 1 as the number n of markers scored increases. For example, for 80% observed matches, assuming a flat prior (λ = 0), the mean ratios are 4.8, 1.34, 1.2, 1.1, and 1.1 (for n = 5, 10, 20, 50, and 100 markers). Likewise, the ratios of standard deviations are 34.2, 2.0, 1.3, 1.2, and 1.2. The reason for this can be seen by considering the case where we observe what appears to be a complete match at all n markers. In this case, there is some small chance that two (or more) mutations have occurred in one (or more) markers. However, if indeed a total of two mutations have occurred across both lineages, the probability that both occurred in the same marker is just 1/n (assuming the same mutation rate across markers). Hence, as the number of markers increases, the probability that a multiple mutation is masked decreases due to scoring changes over more loci.
The final observation from Table 2 is that, unlike for the infinite alleles model, the choice of the hyperparameter λ for the prior can significantly affect the posterior distribution. This is true when the number of markers is very small and/or the fraction of mismatches is nontrivial. The most extreme difference between the two assumed mutational models is seen under a flat prior (λ = 0). As the value of λ increases (corresponding to a decrease in the assumed effective population size as λ = 1/N_{e}), the ratios of the means and standard deviations under the two models become increasingly similar. Almost all of this difference is due to significant decreases in the mean and variance under the distribution of t under the stepwise model as λ increases. This arises because the effect of increasing λ is to shorten the distributions of times to MRCA in the prior (decreasing N_{e} decreases the coalescent times), which in turn downweights matches under the stepwise model from individuals with assumed very long times to the MRCA.
Our analysis under the SMM model considers markers only as showing a match or mismatch, ignoring any additional information on the differences between marker alleles when a mismatch has occurred. Over short time scales (on the order of 1/2μ < 1) we are likely not losing much information, as most markers will likely have at most one mutation and hence scoring a match vs. no match is sufficient. Over longer time scales, we are clearly losing information. In such cases, a logical extension of our model would be replacing the probability of a match with the probability that the two microsatellite alleles in the individuals being compared differ by r repeats. This is accomplished as follows. Consider the probability of an even number (2k) of differences between microsatellite array sizes. Again, a little thought shows that this can occur only with an even number of total mutations (2M). With a total of 2M mutations, the probability that the array sizes differ by 2k is the probability that the number of up (+) mutations is either M + k or M  k (by symmetry, these two probabilities are the same under the SMM). Thus, the probability of a difference (in absolute value) of Δ = 2k given 2M total mutations is
Using the same logic, for a difference of 2k + 1 an odd number (2M + 1) of total mutations are required, and following the same steps leading to Equation 27a gives
The likelihood function follows from the multinomial distribution. If a total of n markers are scored, and n_{i} denotes the number of markers differing in size by i (with the largest difference being k), then
As an example of applying Equation 31, consider haplotypes 1 and 3 from Thomas et al.’s (2000) study on the Lemba and the Cohen (Y chromosome) modal haplotype. Six microsatellite markers were scored and, of these, both alleles match at four markers, while one marker differs by one repeat and another by two repeats. In this case, Equation 31 becomes
While Equation 31 provides the foundation for a full Bayesian analysis, we caution that its usefulness depends on accurately capturing the mutation model for the markers in question. While the stepwise mutation model seems a reasonable fit, the fine details are still unclear. For example, there are suggestions that mutation rate may increase with array size (Brinkmannet al. 1998; Fu and Chakraborty 1998) but also observations that suggest this is not the case (Valdeset al. 1993). There are also observations suggesting a bias toward increased array size (Kayseret al. 2000; Fu and Chakraborty 1998), and although most mutational steps change array size by one, mutations of two or more steps are seen (Brinkmannet al. 1998; Kayseret al. 2000). Finally, there may be multiple molecular processes operating at microsatellites, such as a majority of small changes against a background of rare major changes (Di Rienzoet al. 1994) or independent deletions (Walsh 1987). Until these details are sorted out, the risk run by using a model to correct for multiple mutations is that at least as much bias could be introduced by assuming an incorrect mutation model as is removed by accounting for multiple hits. One approach would be to use Fu and Chakraborty’s (1998) minimum chisquare approach for estimating the generalized stepwise mutation model and compare the minimal (best) fitting parameters with those for the symmetric singlestep model that we have assumed.
DISCUSSION
Both the Y chromosome and mtDNA have been successfully used for assessing deep ancestry, while unlinked autosomal markers have proven much more valuable for determining very recent ancestry. Here we examine the effectiveness of using multilocus haplotypes from a region of nonrecombining DNA (such as the majority of Y chromosome or mtDNA) to estimate the time t to the MRCA of two individuals for that region. We argue that, to assess the relatedness of individuals on the basis of their haplotypes in a region of nonrecombining DNA, the time to the MRCA is the natural replacement for probability calculations based on the product rule used for unlinked markers.
We assume n prechosen markers are examined and are (initially) coded as either matching (agreeing in allelic state) or not matching. Using a Bayesian approach, the resulting posterior distribution for t is a function of n, the number of matches k, the permarker mutation rate μ, and the hyperparameter λ = 1/N_{e} (the reciprocal of the effective population size) of the assumed prior distribution of t (the assumed distribution for the time to the MRCA for two random individuals in the absence of any marker information). We examined two rather different mutation models—infinite alleles (each mutation is unique, with a match implying no mutation at that marker since the MRCA) and the SMM (which allows for matches created by several parallel mutations).
Our results show that it is possible to use Y or mtDNA marker information to provide reasonable estimates for t when individuals share a MRCA of intermediate age (tens to hundreds of generations when μ = 10^{3}), provided a sufficient number of markers are scored (Table 1, Figures 1, 2, 3). Estimates using only 5 markers have distributions that are highly skewed for large t values, especially when the possibility of back mutations regenerating a match is taken into account (i.e., the SMM model). As the number of markers increases, the width of the credible intervals around an estimate of t decreases. While 10 markers give a reasonable interval, 20 markers seems a more workable tradeoff between cost and precision. Our results also suggest that the forensic use of either the Y or mtDNA is rather limited. They can be used for exclusion, but make only weak probability statements about a sample and a suspect when there is a complete match. Using Equation 14c, a complete match at 10 out of 10 markers between a sample and a suspect implies only that the individual generating the sample and the suspect have a 90% chance of sharing a MRCA no more than 58 generations ago (assuming a typical microsatellite mutation rate of μ = 1/500). With a complete match, the number of markers that need to be scored to have a 90% probability that the sample and suspect have a MRCA no more than 1 generation ago is ∼580, an order of magnitude more than the number of currently known Ylinked markers (M. Hammer, personal communication). A 50% probability that the MRCA does not exceed 1 generation requires roughly 340 microsatellite markers.
We obtained the posterior distributions for t given the marker data by using a Bayesian analysis, which requires a prior distribution for t. From standard populationgenetics theory, t follows a geometric distribution with success parameter 1/N_{e} (as the Y and mtDNA behave as haploids). Hence, the functional form of the prior is well motivated, while its exact shape is determined by the hyperparameter λ = 1/N_{e} used for any particular prior. For the infinite alleles model, the actual value of λ had essentially no effect unless we used a very small value for N_{e} (<200) and/or there were a very significant number of mismatches (k/n ≪ 1). Thus, we used a flat prior (λ = 0, corresponding to an infinite effective population size) for the general results tabulated, although the equations given cover arbitrary values of λ ≥ 0. Conversely, under the stepwise model, the value of λ often had a significant effect on the posterior, especially when the number of markers n is small and the number of matches is modest or poor (k/n ≪ 1). In such cases, the use of a flat prior gave the largest difference in the posteriors for the two mutational models. As the value of λ is increased (corresponding to decreasing N_{e}), the posteriors under the two different models become increasingly similar (Table 2). This occurs because, as we decrease N_{e}, we make the initial assumption that individuals with long times to the MRCA become increasingly unlikely. It is these individuals that still have a modest probability of showing a match under the stepwise model, greatly inflating the estimated times to the MRCA relative to the infinite alleles estimate. In cases where there was a recent population expansion (or contraction) or a selective sweep, the distribution of times to MRCA may deviate from a geometric distribution. However, even in such cases, we expect there to be little dependence on the prior except in cases where changes in λ under a geometric prior have a significant effect on the posterior.
One potential issue of concern is whether our results are somehow biased by scoring only markers known to be polymorphic in the human population as a whole (but not necessarily in the particular pair of individuals being contrasted). When the goal is to estimate the coalescent time for an entire population, using markers known (in advance) to be polymorphic in the population of interest creates an ascertainment bias that needs to be corrected (e.g., Nielsen 2000). In our analysis, μ is the mutation rate for microsatellites conditional on their being polymorphic. However, since the estimates of microsatellite mutation rates used are also ascertained by scoring microsatellites known to be polymorphic, we have corrected for this effect. When considering random microsatellites, the mutation rate μ in our analysis is replaced by cμ, where c > 1 is an ascertainment correction that can be specified only by knowing how the polymorphic markers were ascertained.
While we have implicitly framed much of our discussion in terms of microsatellites [simple tandem repeats (STRs)], SNP data can be included by using the infinite alleles model and the appropriate mutation rates. Equation 18a allows the method to handle mixed (STRs and SNPs) marker data by using the appropriate expression for q_{k}(t), the probability of a match for marker k at time t. For SNPs, the infinite alleles model is used, q_{k}(t) = exp(2μ_{k}t), while microsatellites use Equation 25, q_{k}(t) = exp(2μ_{k}t)I_{0}(2μ_{k}t), which corrects for multiple hits under the stepwise mutation model (I_{0} denoting the zeroorder modified type I Bessel function; Olver 1964). In either case, μ_{k} is the mutation rate for marker locus k. More generally, if microsatellite comparisons are coded by differences in size (as opposed to match/no match), Equation 28 gives the probability that two individuals differ by j steps at the kth microsatellite as
Our approach is easily modified to estimate the time between an individual and a particular (known or inferred) ancestral haplotype. When comparing an individual’s haplotype against a fixed standard, we are following only one branch from the MRCA, so that q(t) = (1  μ)^{t} ∼ exp(μt), as opposed to q(t) ∼ exp(2μt) when looking at mutations over both branches. Hence, we simply replace the mutation rate μ by μ/2, and all of our previous results apply.
Estimation of the time to the MRCA for a sample of individuals (as opposed to our simpler setting of just two individuals) has been examined by Fu (1996) and Tavaré et al. (1997), under the assumption of an infinite sites model (Watterson 1975). These analyses assume a Poisson likelihood for the number of segregating sites, while we assumed a binomial likelihood for number of segregating sites under the assumption that the n sites to be scored in the two individuals were fixed in advance. Given that the rough figure is one common polymorphism every 10 kb for the human Y (M. Hammer, personal communication), focusing on specific sites known to be polymorphic in the population as a whole (as opposed to sequencing large regions) is not an unreasonable approach. When n is large and μ small, the two different likelihoods for estimating t should give very similar results.
As we tried to stress in several places, the major caveat to this (or any other) approach for estimating the time t to MRCA is our uncertainty about both the mutational process and rates. The Bayesian framework allows us to incorporate these uncertainties; for example, if p(μ) is some assumed prior for the mutation rates, then the marginal posterior (after integrating out μ) can be used to estimate t,
APPENDIX
Suppose a total of S mutations have occurred over the two lineages, with n mutations in lineage 1 and S  n in lineage 2, where n is a random variable. Let X_{1} and X_{2} denote the changes in array size from the common ancestor. What is the distribution for d = X_{1}  X_{2} (the difference in array size between the two lineages) conditioned on the total number of mutations S in both lineages? Jay Taylor (Department of Ecology and Evolutionary Biology, University of Arizona) suggested the following approach, which is considerably more elegant than a more brute force method I initially used. Define the random variable Z_{i} as the change (±1) in array size generated by the ith mutation. Under the symmetric singlestep model, Pr(Z_{i} = +1) = Pr(Z_{i} = 1) = ½. Further,
Using Taylor’s result, the probability that two arrays are the same size given that a total of 2M mutations have occurred between them equals the probability of M up moves and M down moves, or
Acknowledgments
Thanks go to Mike Lynch, Mike Hammer, Monty Slatkin, Francois Rousset, and YunXin Fu for useful comments and advice and to Bennet Greenspan and Max Rothschild for dragging me into this problem in the first place. Special thanks go to Jay Taylor for very useful discussions on the probability of a match in two independent Markov chains. Curses go to Danny Gianola for planting the seed of Bayesian statistics and then adding plenty of fertilizer and to Bruce Southey, Sandra RodriguezZas, and Dan Sorensen for useful tutoring. Two anonymous reviewers provided very useful comments and criticisms.
Footnotes

Communicating editor: Y.X. Fu
 Received July 22, 2000.
 Accepted March 22, 2001.
 Copyright © 2001 by the Genetics Society of America